Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li

Main category: cs.CV

TL;DR: AdaptMMBench: A comprehensive benchmark for evaluating adaptive multimodal reasoning in VLMs across five domains with dynamic difficulty assessment and multi-dimensional process evaluation.

Details

Motivation: Existing evaluations for adaptive multimodal reasoning rely on static difficulty labels and simplistic metrics that fail to capture dynamic difficulty relative to model capacities, obscuring the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses.

Method: Proposes AdaptMMBench benchmark across five domains (real-world, OCR, GUI, knowledge, math) with Matthews Correlation Coefficient metric to evaluate selection rationality of reasoning modes, dynamically identifying task difficulties based on models’ capability boundaries, and facilitating multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency.

Result: Evaluation reveals that adaptive mode selection scales with model capacity but decouples from final accuracy, while key step coverage aligns with performance, and tool effectiveness remains highly inconsistent across model architectures.

Conclusion: AdaptMMBench provides a comprehensive framework for evaluating adaptive multimodal reasoning that isolates meta-cognition abilities and enables fine-grained analysis of reasoning processes beyond simple accuracy metrics.

Abstract: Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models’ capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.

Relevance: 9/10

[2] GRAM: Spatial general-purpose audio representations for real-world environments

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

TL;DR: GRAM is a general-purpose real-world audio model using multi-channel masked autoencoders to learn spatial audio representations, addressing limitations of current audio foundation models in reverberant, noisy environments with spatial sound localization.

Details

Motivation: Current audio foundation models perform well on clean, single-channel audio but struggle in real-world acoustic environments with reverberation and noise, and ignore spatial dimensions needed for sound localization tasks.

Method: Proposes GRAM using multi-channel masked autoencoder to learn spatial audio representations, evaluated on two benchmark suites: NatHEAR (simulated naturalistic spatial environments) and RealSELD (real-world recordings).

Result: GRAM outperforms state-of-the-art self-supervised audio foundation models on NatHEAR and clean single-channel HEAR benchmarks using less training data, and shows SOTA localization performance in simulated environments with efficient generalization to real-world recordings.

Conclusion: GRAM represents a significant advance toward robust spatial audio foundation models for real-world environments, addressing key limitations of current models.

Abstract: Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and the clean, single-channel version HEAR, while using only a fraction of the training data. GRAM also shows state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM presents a significant advance toward robust spatial audio foundation models for real-world environments.

Relevance: 9/10

[3] SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, Xu Yang

Main category: cs.CV

TL;DR: SpecFLASH: A speculative decoding framework for multimodal models that accelerates inference by compressing visual tokens and using semi-autoregressive decoding to predict multiple tokens at once.

Details

Motivation: Multimodal models suffer from slow decoding due to long visual token sequences with low information density. Existing speculative decoding approaches ignore visual structure and use text-only draft models, limiting their effectiveness for multimodal tasks.

Method: 1) Latent-guided token compression module reduces redundancy in visual sequences while preserving semantics. 2) Semi-autoregressive decoding scheme leverages co-occurrence and local correlations of visual entities to predict multiple tokens in one forward pass.

Result: Achieves up to 2.68× speed-up on video captioning and 2.55× on visual instruction tuning compared to original LMMs, consistently outperforming prior speculative decoding baselines.

Conclusion: SpecFLASH effectively accelerates multimodal model inference by exploiting visual structure characteristics, making it a practical solution for efficient multimodal generation tasks.

Abstract: Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe when handling visual inputs, whose sequences typically contain many more tokens with lower information density than text. Speculative decoding accelerates LLM inference by letting a compact draft model propose candidate tokens that are selectively accepted by a larger target model, achieving speed-up without degrading quality. However, existing multimodal speculative decoding approaches largely ignore the structural characteristics of visual representations and usually rely on text-only draft models. In this paper, we introduce SpecFLASH, a speculative decoding framework tailored to LMMs that explicitly exploits multimodal structure when designing the draft model. We first mitigate redundancy in visual token sequences with a lightweight, latent-guided token compression module that compacts visual features while preserving semantics, and then leverage the co-occurrence and local correlations of visual entities via a semi-autoregressive decoding scheme that predicts multiple tokens in a single forward pass. Extensive experiments demonstrate that SpecFLASH consistently surpasses prior speculative decoding baselines, achieving up to $2.68\times$ speed-up on video captioning and $2.55\times$ on visual instruction tuning, relative to the original LMM. Our code is available here: https://github.com/ZihuaEvan/FlashSD/.

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 132]
cs.CV [Total: 174]
cs.AI [Total: 111]
cs.SD [Total: 13]
cs.LG [Total: 363]
cs.MA [Total: 8]
cs.MM [Total: 1]
eess.AS [Total: 10]
eess.IV [Total: 14]

cs.CL

[1] The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

Shikhar Shiromani, Archie Chaudhury, Sri Pranav Kunda

Main category: cs.CL

TL;DR: A method using Sparse Autoencoders to detect when LLMs produce hypocritical outputs that contradict their internal reasoning, measured via a “Hypocrisy Gap” metric.

Details

Motivation: Large Language Models often exhibit unfaithful behavior where their final answers differ from their internal chain-of-thought reasoning to appease users, creating a need for better detection methods.

Method: Introduces the Hypocrisy Gap metric using Sparse Autoencoders (SAEs) to quantify divergence between a model’s internal reasoning and final generation. Uses sparse linear probes to derive internal truth beliefs and compares them to final generated trajectories in latent space.

Result: Experiments on Gemma, Llama, and Qwen models using Anthropic’s Sycophancy benchmark show AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases, outperforming log-probability baselines (0.41-0.50 AUROC).

Conclusion: The Hypocrisy Gap provides an effective mechanistic approach for detecting unfaithful behavior in LLMs by quantifying the divergence between internal reasoning and external outputs.

Abstract: Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model’s internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model’s tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic’s Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally “knows” the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).

[2] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Xuzhao Li, Xuchen Li, Jian Zhao, Shiyu Hu

Main category: cs.CL

TL;DR: STEMVerse: A diagnostic framework for analyzing LLM STEM reasoning capabilities across academic disciplines and cognitive complexity dimensions

Details

Motivation: Current LLM evaluation paradigms treat STEM benchmarks as isolated silos with monolithic aggregate scores, failing to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, limiting diagnostic value.

Method: Proposes STEMVerse framework that characterizes model performance across academic specialization and cognitive complexity. Re-aggregates over 20,000 STEM problems into a unified “Discipline × Cognition” capability space with dual-axis labels for each instance.

Result: Systematic evaluation of representative LLM families reveals structural failure patterns in STEM reasoning. The framework provides clear perspective on scientific reasoning characteristics of LLMs.

Conclusion: STEMVerse offers a diagnostic framework that integrates multi-disciplinary coverage and fine-grained cognitive stratification to better understand LLM capabilities in STEM reasoning.

Abstract: As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated “silos,” offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified “Discipline $\times$ Cognition” capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

[3] Test-Time Detoxification without Training or Learning Anything

Baturay Saglam, Dionysis Kalogerias

Main category: cs.CL

TL;DR: A black-box test-time optimization method that reduces toxicity in LLM outputs by performing gradient descent on input embeddings using only forward passes and a toxicity scorer.

Details

Motivation: Large language models often produce toxic content even for benign inputs, creating deployment risks. Existing detoxification methods require model retraining, gradients, or auxiliary components, which are costly and don't transfer well across models or to black-box settings.

Method: Proposes a test-time procedure that approximates the gradient of completion toxicity with respect to input embeddings using zeroth-order optimization. Requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Uses a small number of descent steps to steer generation toward less toxic continuations.

Result: The approach delivers robust toxicity reductions across models and prompts, achieving the best overall toxicity-quality trade-off in most settings. Works without requiring any training or access to intermediate computations.

Conclusion: Word embeddings can serve as effective control variables for guiding autoregressive language models toward safer text generation. Encourages wider use of black-box optimization for scalable safety improvements without model modifications.

Abstract: Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model’s generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.

[4] ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

Yunao Zheng, Xiaojie Wang, Lei Ren, Wei Chen

Main category: cs.CL

TL;DR: ROSA-Tuning enhances long-context modeling in pretrained LLMs using a CPU-based retrieval module (ROSA) that finds relevant historical positions and injects them into model state via trainable weighted fusion, maintaining computational efficiency similar to windowed attention.

Details

Motivation: Addresses the challenge of balancing long-context capability with computational efficiency in large language models. Existing efficient attention methods reduce complexity but suffer from limited coverage of model state, creating a need for approaches that can efficiently handle long contexts while maintaining performance.

Method: Introduces ROSA-Tuning with a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module that efficiently locates historically relevant positions in long contexts. Uses binary discretization strategy and counterfactual gradient algorithm for end-to-end training, with asynchronous CPU-GPU pipeline optimization. Combines standard attention with retrieval-and-recall mechanism.

Result: On Qwen3-Base-1.7B, ROSA-Tuning substantially restores long-context modeling ability of windowed-attention models, achieving performance close to global attention on LongBench benchmarks while maintaining computational efficiency and GPU memory usage comparable to windowed-attention methods.

Conclusion: ROSA-Tuning offers a new technical path for efficient long-context processing by combining retrieval mechanisms with attention, achieving near-global attention performance with windowed-attention efficiency through CPU-based retrieval and trainable information injection.

Abstract: Long-context capability and computational efficiency are among the central challenges facing today’s large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA-Tuning introduces in parallel a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range-restricted attention. To enable end-to-end training, we design a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. The example code can be found at https://github.com/zyaaa-ux/ROSA-Tuning.

[5] Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

Siyu Li, Chenwei Song, Qi Zhou, Wan Zhou, Xinyi Liu

Main category: cs.CL

TL;DR: Graph-augmented reasoning framework for tobacco pest/disease management that integrates domain knowledge graphs with LLMs to provide evidence-aware recommendations.

Details

Motivation: To improve pest and disease management recommendations by integrating structured domain knowledge into LLMs, enabling evidence-aware retrieval beyond surface-level text similarity and mitigating hallucinations in treatment suggestions.

Method: Builds domain-specific knowledge graph, retrieves query-relevant subgraphs, uses ChatGLM as Transformer backbone with LoRA fine-tuning, employs graph neural network to learn node representations capturing symptom-disease-treatment dependencies, and incorporates graph evidence into LLM input.

Result: Shows consistent improvements over text-only baselines, with largest gains on multi-hop and comparative reasoning questions requiring chaining multiple relations.

Conclusion: Graph-augmented reasoning effectively integrates structured domain knowledge with LLMs for improved pest/disease management recommendations, particularly for complex reasoning tasks.

Abstract: This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.

[6] WideSeek: Advancing Wide Research via Multi-Agent Scaling

Ziyang Huang, Haolin Ren, Xiaowei Yuan, Jiawei Wang, Zhongtao Jiang, Kun Xu, Shizhu He, Jun Zhao, Kang Liu

Main category: cs.CL

TL;DR: WideSeek introduces a multi-agent system for broad information seeking (Wide Research) with a new benchmark and hierarchical agent architecture optimized via RL.

Details

Motivation: Current search intelligence focuses on deep research but lacks support for wide research - retrieving comprehensive information under complex constraints in parallel. Progress is hindered by missing benchmarks and optimization methods for search breadth.

Method: Two-pronged approach: 1) Created WideSeekBench, a General Broad Information Seeking benchmark with diverse information volume, constraints, and domains; 2) Developed WideSeek, a dynamic hierarchical multi-agent architecture that forks parallel sub-agents based on task requirements, trained with end-to-end RL on linearized multi-agent trajectories.

Result: Experimental results show effectiveness of WideSeek and multi-agent RL, demonstrating that scaling agent number is promising for advancing Wide Research paradigm.

Conclusion: The work addresses critical gaps in wide research through dedicated benchmark and multi-agent optimization, showing multi-agent scaling as a viable direction for broad information seeking systems.

Abstract: Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

[7] Monotonicity as an Architectural Bias for Robust Language Models

Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez

Main category: cs.CL

TL;DR: Monotonicity as architectural inductive bias improves Transformer robustness by enforcing order-preserving behavior in feed-forward layers while leaving attention mechanisms unconstrained, reducing adversarial attack success from 69% to 19% with minimal performance degradation.

Details

Motivation: LLMs exhibit brittle behavior under adversarial prompts and jailbreak attacks despite extensive alignment. Small perturbations in high-dimensional input spaces cause unpredictable changes in internal representations and outputs. The paper explores monotonicity as an architectural bias to improve robustness while maintaining expressivity.

Method: Enforce monotonicity selectively in feed-forward sublayers of sequence-to-sequence Transformers while leaving attention mechanisms unconstrained. This architectural separation allows attention to handle negation, contradiction, and contextual interactions, while feed-forward layers ensure order-preserving semantic refinement.

Result: Adversarial attack success rates drop from approximately 69% to 19% while standard summarization performance degrades only marginally. Monotone language models preserve the performance of their pretrained counterparts while substantially improving robustness.

Conclusion: Monotonicity is a viable architectural inductive bias for improving Transformer robustness without sacrificing expressivity. The trade-off between monotonicity and expressivity is not inherent when properly separated in architecture, with attention handling complex interactions and feed-forward layers ensuring order-preserving behavior.

Abstract: Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers – while leaving attention mechanisms unconstrained – we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.

[8] InfMem: Learning System-2 Memory Control for Long-Context Agent

Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, Yufei Cui

Main category: cs.CL

TL;DR: InfMem is a control-centric agent for reasoning over ultra-long documents that uses active memory management with a PreThink-Retrieve-Write protocol and evidence-aware joint compression.

Details

Motivation: Current streaming agents for ultra-long document processing have passive memory update strategies that fail to preserve low-salience bridging evidence needed for multi-hop reasoning, especially under strict memory constraints.

Method: InfMem implements System-2-style control via a PreThink-Retrieve-Write protocol that actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update bounded memory. It uses a practical SFT-to-RL training recipe to align retrieval, writing, and stopping decisions with end-task correctness.

Result: InfMem consistently outperforms MemAgent across backbones on ultra-long QA benchmarks from 32k to 1M tokens, improving average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B respectively, while reducing inference time by 3.9× on average (up to 5.1×) via adaptive early stopping.

Conclusion: The proposed control-centric agent with active memory management and evidence-aware compression effectively addresses the challenge of preserving bridging evidence for multi-hop reasoning in ultra-long documents while improving both accuracy and efficiency.

Abstract: Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by $3.9\times$ on average (up to $5.1\times$) via adaptive early stopping.

Rohan Pandey, Haijuan Yan, Hong Yu, Jack Tsai

Main category: cs.CL

TL;DR: This study uses EHR data from 4.2M veterans to predict first-episode homelessness using various ML approaches, finding that incorporating social/behavioral factors improves prediction and LLMs show smaller racial disparities despite lower discrimination performance.

Details

Motivation: Homelessness among US veterans is a critical public health challenge, and risk prediction offers a pathway for proactive intervention to enable targeted prevention strategies.

Method: Analyzed EHR data from 4,276,403 VA patients using static and time-varying representations with clinician-informed logic to model persistence of clinical conditions and social risks. Compared classical ML, transformer-based masked language models, and fine-tuned LLMs for predicting homelessness 3-12 months in advance.

Result: Incorporating social/behavioral factors improved PR-AUC by 15-30%. In top 1% risk tier, models achieved PPVs ranging from 3.93-13.80% across time horizons. LLMs underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups.

Conclusion: Longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted prevention strategies for at-risk veterans, with LLMs offering potential benefits for fairness despite lower discrimination performance.

Abstract: Homelessness among US veterans remains a critical public health challenge, yet risk prediction offers a pathway for proactive intervention. In this retrospective prognostic study, we analyzed electronic health record (EHR) data from 4,276,403 Veterans Affairs patients during a 2016 observation period to predict first-episode homelessness occurring 3-12 months later in 2017 (prevalence: 0.32-1.19%). We constructed static and time-varying EHR representations, utilizing clinician-informed logic to model the persistence of clinical conditions and social risks over time. We then compared the performance of classical machine learning, transformer-based masked language models, and fine-tuned large language models (LLMs). We demonstrate that incorporating social and behavioral factors into longitudinal models improved precision-recall area under the curve (PR-AUC) by 15-30%. In the top 1% risk tier, models yielded positive predictive values ranging from 3.93-4.72% at 3 months, 7.39-8.30% at 6 months, 9.84-11.41% at 9 months, and 11.65-13.80% at 12 months across model architectures. Large language models underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups. These results demonstrate that longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted and data-informed prevention strategies for at-risk veterans.

[10] Time-Critical Multimodal Medical Transportation: Organs, Patients, and Medical Supplies

Elaheh Sabziyan Varnousfaderani, Syed A. M. Shihab, Mohammad Taghizadeh

Main category: cs.CL

TL;DR: A greedy heuristic algorithm for multimodal medical transportation using integrated ground and air vehicles to optimize delivery efficiency while considering traffic, weather, and costs.

Details

Motivation: Critical medical transportation (organs, patients, supplies) faces delays from traffic congestion (ground vehicles) and high costs/weather limitations (air vehicles). A multimodal system integrating both could leverage their respective strengths for better efficiency.

Method: Developed a constructive greedy heuristic algorithm for vehicle dispatching that tests four fleet configurations: ambulances only, ambulances+UAVs, ambulances+eVTOLs, and fully integrated ambulances+UAVs+eVTOLs. Algorithm includes payload consolidation, accounts for traffic congestion (ground) and weather conditions (air), and enables rapid dispatching compared to optimization models.

Result: Evaluation of all four fleet types under common conditions to identify most effective configurations for fulfilling medical transportation needs while minimizing operating costs, recharging/fuel costs, and total transportation time.

Conclusion: Multimodal integration of ground and air vehicles offers potential efficiency improvements for medical transportation, with specific fleet configurations providing optimal balance of speed, cost, and reliability.

Abstract: Timely transportation of organs, patients, and medical supplies is critical to modern healthcare, particularly in emergencies and transplant scenarios where even short delays can severely impact outcomes. Traditional ground-based vehicles such as ambulances are often hindered by traffic congestion; while air vehicles such as helicopters are faster but costly. Emerging air vehicles – Unmanned Aerial Vehicles and electric vertical take-off and landing aircraft – have lower operating costs, but remain limited by range and susceptibility to weather conditions. A multimodal transportation system that integrates both air and ground vehicles can leverage the strengths of each to enhance overall transportation efficiency. This study introduces a constructive greedy heuristic algorithm for multimodal vehicle dispatching for medical transportation. Four different fleet configurations were tested: (i) ambulances only, (ii) ambulances with Unmanned Aerial Vehicles, (iii) ambulances with electric vertical take-off and landing aircraft, and (iv) a fully integrated fleet of ambulances, Unmanned Aerial Vehicles, and electric vertical take-off and landing aircraft. The algorithm incorporates payload consolidation across compatible routes, accounts for traffic congestion in ground operations and weather conditions in aerial operations, while enabling rapid vehicle dispatching compared to computationally intensive optimization models. Using a common set of conditions, we evaluate all four fleet types to identify the most effective configurations for fulfilling medical transportation needs while minimizing operating costs, recharging/fuel costs, and total transportation time.

[11] From Task Solving to Robust Real-World Adaptation in LLM Agents

Pouya Pezeshkpour, Estevam Hruschka

Main category: cs.CL

TL;DR: Paper introduces a robustness evaluation framework for LLM agents in deployment-like conditions with partial observability, dynamic environments, noisy signals, and dynamic agent state, revealing significant gaps between nominal task-solving and real-world readiness.

Details

Motivation: Existing evaluations of LLM agents assume "clean interfaces" with stable dynamics and reliable tools, overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit multi-stakeholder goals.

Method: Stress-test deployment-relevant robustness using a grid-based game with simple goals but long-horizon execution. Episodes violate clean-interface assumptions while remaining solvable, forcing agents to infer rules, pay for information, adapt to environmental/internal shifts, and act cautiously under noise.

Result: Large gaps found between nominal task-solving and deployment-like robustness across five state-of-the-art LLM agents. Performance degrades with grid size and horizon, but rankings are unstable - weaker models can beat stronger ones when strategy matches uncertainty regime. Agents show partial objective inference without explicit instruction.

Conclusion: Highlights need for work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity. Current agent evaluations overestimate real-world readiness, requiring more realistic testing frameworks.

Abstract: Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a “clean interface” where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective-often overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi-stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress-test deployment-relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution. Episodes violate clean-interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state-of-the-art LLM agents, we find large gaps between nominal task-solving and deployment-like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime. Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model-specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity.

[12] AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic

Israel Abebe Azime, Abenezer Kebede Angamo, Hana Mekonen Tamiru, Dagnachew Mekonnen Marilign, Philipp Slusallek, Seid Muhie Yimam, Dietrich Klakow

Main category: cs.CL

TL;DR: AmharicStoryQA benchmark reveals cultural variation within Amharic language affects LLM performance, showing regional differences in narrative understanding despite shared language.

Details

Motivation: Current multilingual benchmarks treat language and culture as synonymous, overlooking cultural variation within single languages, particularly in low-resource contexts like Amharic in Ethiopia.

Method: Created AmharicStoryQA benchmark with culturally diverse narratives from different Ethiopian regions, evaluated existing LLMs on long-sequence story question answering, and conducted supervised fine-tuning experiments.

Result: Found significant narrative understanding gap in LLMs, pronounced regional performance differences despite shared language, and uneven improvements from fine-tuning across regions.

Conclusion: Need culturally grounded benchmarks beyond language-level evaluation to accurately assess and improve narrative understanding in low-resource languages.

Abstract: With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a models understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce \textbf{\textit{AmharicStoryQA}}, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.

[13] When Efficient Communication Explains Convexity

Ashvin Ranjan, Shane Steinert-Threlkeld

Main category: cs.CL

TL;DR: The paper investigates why efficient communication explains semantic typology patterns, using Information Bottleneck to show convexity of communicative need distributions drives optimality in language evolution.

Details

Motivation: Recent work suggests language variation results from efficient communication balancing simplicity and informativeness. This paper aims to identify which specific factors make efficient communication explanations successful for semantic typology patterns.

Method: Uses Information Bottleneck framework to formalize trade-off between simplicity and informativeness. First demonstrates correlation between IB optimality and novel generalization of convexity. Second experiment manipulates modeling parameters in IB framework to determine which factors drive this correlation.

Result: Found that convexity of the communicative need distribution plays especially important role in explaining correlation between convexity and optimality. Moves beyond showing efficient communication explains semantic typology to identifying underlying responsible factors.

Conclusion: The research identifies specific factors (particularly convexity of communicative need distributions) that make efficient communication explanations successful for semantic typology patterns, advancing understanding of why language evolution follows certain patterns.

Abstract: Much recent work has argued that the variation in the languages of the world can be explained from the perspective of efficient communication; in particular, languages can be seen as optimally balancing competing pressures to be simple and to be informative. Focusing on the expression of meaning – semantic typology – the present paper asks what factors are responsible for successful explanations in terms of efficient communication. Using the Information Bottleneck (IB) approach to formalizing this trade-off, we first demonstrate and analyze a correlation between optimality in the IB sense and a novel generalization of convexity to this setting. In a second experiment, we manipulate various modeling parameters in the IB framework to determine which factors drive the correlation between convexity and optimality. We find that the convexity of the communicative need distribution plays an especially important role. These results move beyond showing that efficient communication can explain aspects of semantic typology into explanations for why that is the case by identifying which underlying factors are responsible.

[14] R2-Router: A New Paradigm for LLM Routing with Reasoning

Jiaqi Xue, Qian Lou, Jiarong Xing, Heng Huang

Main category: cs.CL

TL;DR: R2-Router introduces a novel LLM routing approach that jointly selects both the best LLM and optimal output length budget, enabling cost-effective use of powerful LLMs with constrained outputs.

Details

Motivation: Existing LLM routers assume fixed quality and cost per LLM for each query, ignoring that LLM quality varies with output length. This causes routers to exclude powerful LLMs when estimated cost exceeds budget, missing opportunities where these LLMs could deliver high quality at reduced cost with shorter outputs.

Method: R2-Router treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. The approach includes constructing R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets.

Result: Experiments show R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. The router discovers that powerful LLMs with constrained outputs can outperform weaker LLMs at comparable cost-efficient configurations.

Conclusion: This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget, enabling more sophisticated and cost-effective LLM deployment strategies.

Abstract: As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM’s quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM’s quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget.

[15] LatentMem: Customizing Latent Memory for Multi-Agent Systems

Muxin Fu, Guibin Zhang, Xiangyuan Xue, Yafu Li, Zefeng He, Siyuan Huang, Xiaoye Qu, Yu Cheng, Yang Yang

Main category: cs.CL

TL;DR: LatentMem: A learnable multi-agent memory framework with role-aware customization and compact latent memory representations to address memory homogenization and information overload in LLM-powered multi-agent systems.

Details

Motivation: Existing multi-agent memory designs suffer from two bottlenecks: (1) memory homogenization due to lack of role-aware customization, and (2) information overload from excessively fine-grained memory entries. These limitations hinder the collective intelligence of LLM-powered multi-agent systems.

Method: Proposes LatentMem framework with two components: an experience bank storing raw interaction trajectories in lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Also introduces Latent Memory Policy Optimization (LMPO) to propagate task-level optimization signals through latent memories to the composer.

Result: Extensive experiments across diverse benchmarks and mainstream MAS frameworks show LatentMem achieves up to 19.36% performance gain over vanilla settings and consistently outperforms existing memory architectures without requiring modifications to underlying frameworks.

Conclusion: LatentMem effectively addresses memory homogenization and information overload in multi-agent systems through learnable, role-aware memory customization, enabling more efficient collective intelligence in LLM-powered multi-agent systems.

Abstract: Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to $19.36$% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.

[16] CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

Zhengbang Yang, Yisheng Zhong, Junyuan Hong, Zhuangdi Zhu

Main category: cs.CL

TL;DR: CATNIP is a calibrated token-level negative preference alignment method for LLM unlearning that rescales unlearning effects based on model confidence to selectively remove undesirable knowledge while preserving general capabilities.

Details

Motivation: Existing LLM unlearning methods have limitations: Gradient Ascent approaches degrade general knowledge and require impractical retention data, while Negative Preference Alignment methods are constrained by reference model choice and struggle with realistic data settings. The paper aims to develop a more precise unlearning method that quantifies model confidence in undesirable knowledge and works robustly with data scarcity.

Method: CATNIP uses calibrated token-level negative preference alignment that rescales unlearning effects proportionally to the model’s token-level confidence. This provides fine-grained control over forgetting by precisely targeting undesirable knowledge based on the model’s own confidence scores, eliminating the need for retention data or contrastive response pairs.

Result: Extensive evaluations on MUSE and WMDP benchmarks show CATNIP achieves effective unlearning without requiring retention data or contrastive pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.

Conclusion: CATNIP provides a principled approach to LLM unlearning that addresses key limitations of existing methods by using token-level confidence calibration, enabling precise removal of undesirable knowledge while preserving general capabilities without requiring additional data resources.

Abstract: Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model’s token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.

[17] AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation

Sara Papi, Marco Turchi, Matteo Negri

Main category: cs.CL

TL;DR: AlignAtt is a novel policy for simultaneous speech translation that uses attention information to generate source-target alignments, improving both translation quality and latency compared to previous methods.

Details

Motivation: Attention mechanisms have proven useful for word alignment in speech translation tasks. The authors aim to leverage attention information to develop better simultaneous speech translation policies that can guide models during inference.

Method: AlignAtt exploits attention information from the model to generate source-target alignments. These alignments are then used as a policy to guide the simultaneous speech translation model during inference, determining when to read more input versus when to produce output.

Result: Experiments on 8 language pairs from MuST-C v1.0 show AlignAtt outperforms previous state-of-the-art SimulST policies, achieving BLEU score improvements of 2 points and latency reductions of 0.5s to 0.8s across all languages.

Conclusion: Attention-based alignment information can be effectively used to develop superior simultaneous speech translation policies that balance translation quality and latency better than previous approaches.

Abstract: Attention is the core mechanism of today’s most used architectures for natural language processing and has been analyzed from many perspectives, including its effectiveness for machine translation-related tasks. Among these studies, attention resulted to be a useful source of information to get insights about word alignment also when the input text is substituted with audio segments, as in the case of the speech translation (ST) task. In this paper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that exploits the attention information to generate source-target alignments that guide the model during inference. Through experiments on the 8 language pairs of MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art SimulST policies applied to offline-trained models with gains in terms of BLEU of 2 points and latency reductions ranging from 0.5s to 0.8s across the 8 languages.

[18] Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

Polina Tsvilodub, Karl Mulligan, Todd Snider, Robert D. Hawkins, Michael Franke

Main category: cs.CL

TL;DR: Humans ask clarification questions based on expected regret - balancing uncertainty against the cost of incorrect actions, showing rational decision-making in communication under uncertainty.

Details

Motivation: To understand how humans decide when to ask clarification questions versus acting under uncertainty, and whether this decision follows rational principles based on expected regret.

Method: Developed a computational model based on expected regret theory, then tested predictions through two experiments: one examining linguistic responses to questions, and another extending to choices between clarification and non-linguistic actions.

Result: Results show humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty, supporting the expected regret model’s predictions about the interaction between uncertainty and action costs.

Conclusion: Human decision-making about asking clarification questions follows rational principles based on expected regret, balancing uncertainty reduction against the potential costs of incorrect actions.

Abstract: When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty.In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.

[19] Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs

Junyi Jessy Li, Yang Janet Liu, Kanishka Misra, Valentina Pyatkin, William Sheffield

Main category: cs.CL

TL;DR: A paper describing the design and implementation of an undergraduate course “Computational Discourse and Natural Language Generation” that integrates discourse processing theory with NLP education.

Details

Motivation: To address the challenge of designing NLP courses that bridge sub-disciplines in the rapidly evolving field, particularly connecting discourse processing (with its rich linguistic insights) with natural language generation, which is under-explored in undergraduate curricula.

Method: Collaborative course design by a team with complementary expertise, offered as an upper-level undergraduate course cross-listed between Linguistics and Computer Science, with deep integration of theoretical and empirical aspects and exploratory assignments.

Result: Successful implementation of the course in Fall 2025, with detailed course description and takeaways from an independent survey showing positive outcomes in bridging discourse theory with NLP practice.

Conclusion: The course successfully addresses the gap in NLP education by integrating discourse processing with language generation, providing a model for interdisciplinary course design in evolving technical fields, with plans for future improvements.

Abstract: The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, “Computational Discourse and Natural Language Generation”. The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.

[20] Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Main category: cs.CL

TL;DR: A computational framework models sarcasm as integration of semantic interpretation and prosodic realization, using LLaMA 3 for semantic cues and speech database for prosodic exemplars, showing both independently enhance sarcasm perception with strongest effects when combined.

Details

Motivation: Sarcasm involves complex interaction between semantic content and prosodic expression, but how these cues jointly contribute to sarcasm recognition remains poorly understood. The paper aims to develop a computational framework to model this integration and understand the mechanisms underlying sarcastic communication.

Method: Proposes a computational framework that models sarcasm as integration of semantic interpretation and prosodic realization. Semantic cues are derived from LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent. Prosodic cues are extracted through semantically aligned utterances from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. Uses speech synthesis testbed for perceptual evaluations.

Result: Perceptual evaluations demonstrate that both semantic and prosodic cues independently enhance listeners’ perception of sarcasm, with the strongest effects emerging when the two are combined. Findings highlight complementary roles of semantics and prosody in pragmatic interpretation.

Conclusion: The study illustrates how computational modeling can shed light on mechanisms underlying sarcastic communication, showing that both semantic and prosodic cues play important and complementary roles in sarcasm recognition, with optimal perception occurring when both are present.

Abstract: Sarcasm is a pragmatic phenomenon in which speakers convey meanings that diverge from literal content, relying on an interaction between semantics and prosodic expression. However, how these cues jointly contribute to the recognition of sarcasm remains poorly understood. We propose a computational framework that models sarcasm as the integration of semantic interpretation and prosodic realization. Semantic cues are derived from an LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent, while prosodic cues are extracted through semantically aligned utterances drawn from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. Using a speech synthesis testbed, perceptual evaluations demonstrate that both semantic and prosodic cues independently enhance listeners’ perception of sarcasm, with the strongest effects emerging when the two are combined. These findings highlight the complementary roles of semantics and prosody in pragmatic interpretation and illustrate how modeling can shed light on the mechanisms underlying sarcastic communication.

[21] HALT: Hallucination Assessment via Log-probs as Time series

Ahmad Shapiro, Karan Taneja, Ashok Goel

Main category: cs.CL

TL;DR: HALT is a lightweight hallucination detector using only top-20 token log-probabilities as time series data, achieving 60x speedup over existing methods while maintaining strong performance across diverse LLM capabilities.

Details

Motivation: Hallucinations remain a major obstacle for LLMs, especially in safety-critical domains. Existing approaches either require access to internal model states (white-box) or rely on surface-form text (black-box), lacking efficient, generalizable solutions.

Method: HALT uses top-20 token log-probabilities as time series data, processed through a gated recurrent unit (GRU) model with entropy-based features to learn model calibration bias. It requires only output log-probs, not hidden states or attention maps.

Result: HALT outperforms Lettuce (fine-tuned BERT-base encoder) while being 30x smaller and achieving 60x speedup on the HUB benchmark, which covers ten diverse LLM capabilities including reasoning and general-purpose tasks.

Conclusion: HALT provides an efficient, lightweight hallucination detection framework that generalizes well across domains and is compatible with proprietary LLMs without requiring internal model access.

Abstract: Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned modernBERT-base encoder, achieving a 60x speedup gain on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.

[22] Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

Alireza Amiri-Margavi, Arshia Gharagozlou, Amin Gholami Davodi, Seyed Pouyan Mousavi Davoudi, Hamidreza Hasani Balyani

Main category: cs.CL

TL;DR: LLMs show fairness disparities in interaction quality (tone, uncertainty, framing) across demographic identities despite equal access, with GPT-4 hedging more toward younger males and LLaMA showing sentiment variation.

Details

Motivation: Prior fairness work focused on access-level behaviors like refusals, but equitable access doesn't ensure equitable interaction quality. Need to examine how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted.

Method: Controlled fairness audit using counterfactual prompt design evaluating GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes (age, gender, nationality). Assessed access fairness through refusal analysis and measured interaction quality using automated linguistic metrics (sentiment, politeness, hedging). Used paired statistical tests to evaluate identity-conditioned differences.

Result: Both models exhibited zero refusal rates across all identities (uniform access). However, systematic model-specific disparities in interaction quality: GPT-4 expressed significantly higher hedging toward younger male users, while LLaMA exhibited broader sentiment variation across identity groups.

Conclusion: Fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits to include interaction quality metrics.

Abstract: Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.

[23] Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

Mitchell Abrams, Kaveh Eskandari Miandoab, Felix Gervits, Vasanth Sarathy, Matthias Scheutz

Main category: cs.CL

TL;DR: LLMs struggle with norm-based reference resolution in embodied settings, revealing a key limitation for socially situated AI systems.

Details

Motivation: Embodied agents need to understand social norms for effective communication in situated environments, but it's unclear if LLMs can support norm-based reference resolution.

Method: Created SNIC (Situated Norms in Context), a human-validated diagnostic testbed to probe LLMs’ ability to extract and utilize normative principles for reference resolution in everyday tasks.

Result: Even state-of-the-art LLMs struggle to consistently identify and apply social norms, especially when norms are implicit, underspecified, or conflicting.

Conclusion: Current LLMs have a blind spot in social norm reasoning, highlighting a key challenge for deploying language-based systems in socially situated, embodied settings.

Abstract: Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.

[24] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Ran Li, Zeyuan Liu, Yinghao chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: CPMöbius introduces a collaborative Coach-Player paradigm for data-free reinforcement learning to enhance mathematical reasoning in LLMs without external training data.

Details

Motivation: Current LLM training for reasoning relies heavily on massive human-curated data, making supervision-heavy paradigms unsustainable with diminishing scalability. There's a need for data-free approaches to overcome this limitation.

Method: A collaborative Coach-Player paradigm where Coach proposes instructions tailored to Player’s capability and receives rewards based on Player’s performance improvement, while Player is rewarded for solving increasingly instructive tasks. This creates a cooperative optimization loop without external data.

Result: CPMöbius achieves substantial improvements without external training data, outperforming existing unsupervised approaches. On Qwen2.5-Math-7B-Instruct, it improves accuracy by +4.9 overall average and +5.4 out-of-distribution average, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.

Conclusion: The collaborative Coach-Player paradigm enables effective data-free reinforcement learning for mathematical reasoning, offering a sustainable alternative to supervision-heavy training methods.

Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player’s capability and receives rewards based on changes in the Player’s performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player’s mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.

[25] SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

Xing Hu, Dawei Yang, Yuan Cheng, Zhixuan Chen, Zukang Xu

Main category: cs.CL

TL;DR: SAES-SVD: A low-rank compression framework for LLMs that jointly optimizes intra-layer reconstruction and inter-layer error compensation to prevent error accumulation through the network.

Details

Motivation: Existing LLM compression methods compress each layer independently, minimizing per-layer reconstruction error but ignoring that errors propagate and accumulate through the network, leading to amplified global deviations from the full-precision baseline.

Method: Proposes SAES-SVD with two components: 1) Cumulative Error-Aware Layer Compression (CEALC) formulates compression as local reconstruction plus weighted cumulative error compensation, deriving closed-form low-rank solution using second-order activation statistics; 2) Adaptive Collaborative Error Suppression (ACES) automatically adjusts weighting coefficients to enhance low-rank structure and maximize rank budget utilization.

Result: Extensive experiments across multiple LLM architectures and tasks show SAES-SVD consistently improves post-compression performance without fine-tuning or mixed-rank strategies.

Conclusion: SAES-SVD effectively addresses error accumulation in LLM compression by jointly optimizing intra-layer reconstruction and inter-layer error compensation, providing a more effective compression framework.

Abstract: The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), a LLMs compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution relied on second-order activation statistics, which explicitly aligns each layer’s output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer’s output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively. Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or mixed-rank strategies, SAES-SVD consistently improves post-compression performance.

[26] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, Weinan Zhang

Main category: cs.CL

TL;DR: ReMiT introduces a bidirectional training approach where RL-tuned models retroactively improve pre-trained foundations by dynamically reweighting tokens during mid-training, creating a self-reinforcing flywheel for LLM evolution.

Details

Motivation: Current LLM training follows a unidirectional pipeline (pre-training → post-training), but the potential for bidirectional improvement where post-training insights enhance the base model remains unexplored. The authors aim to establish a self-reinforcing cycle where RL-tuned models strengthen the foundation model.

Method: ReMiT (Reinforcement Learning-Guided Mid-Training) leverages reasoning priors from RL-tuned models to dynamically reweight tokens during the critical mid-training (annealing) phase. This phase occurs at the end of pre-training with high-quality corpora and decaying learning rates. The method prioritizes tokens pivotal for reasoning without requiring specially trained teacher or reference models.

Result: ReMiT achieves average 3% improvement on 10 pre-training benchmarks across math, code, and general reasoning domains. These gains are sustained by over 2% throughout the post-training pipeline, validating the iterative feedback loop concept.

Conclusion: The paper demonstrates a successful bidirectional training approach that enables continuous, self-reinforcing evolution of LLMs through reinforcement learning-guided mid-training, establishing a flywheel effect where RL-tuned models improve base models which then enhance subsequent post-training performance.

Abstract: Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process–where insights from post-training retroactively improve the pre-trained foundation–remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.

[27] AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

Zhitao Gao, Jie Ma, Xuhong Li, Pengyu Li, Ning Qu, Yaqiang Wu, Hui Liu, Jun Liu

Main category: cs.CL

TL;DR: AERO is an unsupervised framework for autonomous reasoning evolution in LLMs that uses dual-loop self-questioning/answering/criticism, entropy-based positioning inspired by ZPD theory, and staggered training to improve reasoning without expert data.

Details

Motivation: LLMs face bottlenecks in complex reasoning due to reliance on expert-annotated data and external verifiers. Existing self-evolution methods often fail to identify optimal learning zones and risk reinforcing hallucinations and incorrect priors through flawed internal feedback.

Method: AERO uses a synergistic dual-loop system with internalized self-questioning, answering, and criticism. It employs entropy-based positioning inspired by Zone of Proximal Development theory to target the “solvability gap,” Independent Counterfactual Correction for robust verification, and a Staggered Training Strategy to synchronize capability growth.

Result: Extensive evaluations across nine benchmarks spanning three domains show average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines.

Conclusion: AERO successfully achieves autonomous reasoning evolution without expert supervision, addressing key limitations in existing self-evolution paradigms through its innovative dual-loop architecture and training strategies.

Abstract: Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose \underline{A}utonomous \underline{E}volutionary \underline{R}easoning \underline{O}ptimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the \textit{Zone of Proximal Development (ZPD)} theory, AERO utilizes entropy-based positioning to target the ``solvability gap’’ and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at https://github.com/mira-ai-lab/AERO.

[28] Test-time Recursive Thinking: Self-Improvement without External Feedback

Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, Weizhu Chen

Main category: cs.CL

TL;DR: LLMs can self-improve reasoning without additional training using Test-time Recursive Thinking (TRT), an iterative framework that generates diverse solutions and selects correct answers without ground-truth supervision.

Details

Motivation: To explore whether LLMs can self-improve reasoning capabilities without additional training, addressing challenges of generating diverse high-quality solutions and reliably selecting correct answers without ground-truth supervision.

Method: Proposes Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals.

Result: Open-source models reach 100% accuracy on AIME-25/24, and closed-source models improve by 10.4-14.8 percentage points on LiveCodeBench’s most difficult problems without external feedback.

Conclusion: LLMs can effectively self-improve reasoning through iterative test-time frameworks like TRT without requiring additional training data or external supervision.

Abstract: Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench’s most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.

[29] Task–Specificity Score: Measuring How Much Instructions Really Matter for Supervision

Pritam Kadasi, Abhishek Upperwal, Mayank Singh

Main category: cs.CL

TL;DR: Proposes Task-Specificity Score (TSS) to measure how much an instruction uniquely determines its output by comparing against plausible alternative instructions for the same input.

Details

Motivation: Many instruction-input-output pairs in instruction tuning are weakly specified - the same output can be plausible under multiple alternative instructions. This raises the question of whether the instruction uniquely determines the target output, which is important for effective instruction tuning.

Method: Introduces Task-Specificity Score (TSS) that quantifies how much an instruction matters for predicting its output by contrasting the true instruction against plausible alternatives for the same input. Also proposes TSS++ with hard alternatives and a small quality term to mitigate easy-negative effects.

Result: Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters like perplexity and IFD.

Conclusion: Task specificity is an important dimension for instruction tuning that complements existing quality metrics, and selecting task-specific examples can improve model performance especially under constrained training budgets.

Abstract: Instruction tuning is now the default way to train and adapt large language models, but many instruction–input–output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: \emph{does the instruction uniquely determine the target output?} We propose the \textbf{Task–Specificity Score (TSS)} to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce \textbf{TSS++}, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (\textsc{Alpaca}, \textsc{Dolly-15k}, \textsc{NI-20}) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.

[30] The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

Yitong Zhang, Yuhan Xiang, Mingxuan Liu

Main category: cs.CL

TL;DR: Systematic evaluation of LLMs’ ability to recognize politeness phenomena in Chinese using pragmatic frameworks, testing six models under four prompting strategies.

Details

Motivation: Address gaps in pragmatic comprehension of LLMs, particularly for Chinese politeness phenomena, and explore how technology and humanities can coexist through interdisciplinary research.

Method: Constructed three-category dataset (politeness, impoliteness, mock politeness) using Rapport Management Theory and Model of Mock Politeness; tested six LLMs (including GPT-5.1 and DeepSeek) under zero-shot, few-shot, knowledge-enhanced, and hybrid prompting conditions.

Result: Performance differences among LLMs in recognizing Chinese politeness phenomena were systematically evaluated, though specific quantitative results are not provided in the abstract.

Conclusion: This study represents a meaningful attempt in “Great Linguistics” paradigm, offering novel approach to applying pragmatic theory in technological transformation and bridging linguistic technology with humanistic reflection.

Abstract: From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of ``Great Linguistics,’’ offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.

[31] ChemPro: A Progressive Chemistry Benchmark for Large Language Models

Aaditya Baranwal, Shruti Vyas

Main category: cs.CL

TL;DR: ChemPro is a progressive chemistry benchmark with 4100 questions across 4 difficulty levels to evaluate LLMs’ proficiency in general chemistry topics, revealing limitations in complex scientific reasoning.

Details

Motivation: To assess the proficiency of Large Language Models in general chemistry topics through a carefully designed benchmark that mimics academic evaluation, identifying limitations in scientific reasoning as question difficulty increases.

Method: Created ChemPro benchmark with 4100 natural language question-answer pairs across 4 difficulty sections covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. Includes Multiple Choice and Numerical Questions with balanced ratio of question types. Evaluated 45+7 state-of-the-art LLMs (open-source and proprietary).

Result: LLMs perform well on basic chemistry questions but accuracy declines with different types and levels of complexity. Findings highlight critical limitations in general scientific reasoning and understanding, pointing toward understudied dimensions of difficulty.

Conclusion: The benchmark reveals significant gaps in LLMs’ ability to handle complex scientific reasoning, emphasizing the need for more robust methodologies to improve LLMs’ scientific understanding capabilities.

Abstract: We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student’s academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.

Bowen Jiang, Taiwei Shi, Ryo Kamoi, Yuan Yuan, Camillo J. Taylor, Longqi Yang, Pei Zhou, Sihao Chen

Main category: cs.CL

TL;DR: OMAR is a reinforcement learning framework that enables a single AI model to develop social intelligence by role-playing all participants in multi-turn conversations through self-play, learning complex social behaviors like empathy and persuasion without human supervision.

Details

Motivation: Traditional AI training uses static, single-turn optimizations which fail to capture the dynamic, multi-turn nature of real social interactions. There's a need for AI to develop genuine social intelligence through learning from complex conversational dynamics and long-term goals in multi-agent settings.

Method: OMAR uses a reinforcement learning framework where a single model role-plays all participants in conversations simultaneously through self-play. To handle long dialogues, it implements hierarchical advantage estimation with turn-level and token-level advantages. Training occurs in social environments like SOTOPIA and Werewolf strategy games.

Result: The trained models develop emergent social intelligence including empathy, persuasion, and compromise seeking. They demonstrate effective collaboration even in competitive scenarios. While some practical challenges like reward hacking were identified, the results show rich social intelligence can emerge without human supervision.

Conclusion: OMAR demonstrates that AI can develop sophisticated social intelligence through multi-agent conversational self-play without human supervision. The framework successfully enables learning of complex social norms and long-term goals in dynamic interactions, paving the way for more socially intelligent AI systems.

Abstract: This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.

[33] Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu

Main category: cs.CL

TL;DR: CoSMo is a framework that optimizes reasoning efficiency in Large Reasoning Models by eliminating structural redundancy through a split-merge algorithm and structure-aligned reinforcement learning.

Details

Motivation: Large Reasoning Models generate verbose reasoning chains that cause significant latency and computational overhead. The paper aims to address efficiency issues by eliminating structural redundancy rather than simply restricting token volume.

Method: CoSMo uses a split-merge algorithm to dynamically refine reasoning chains by merging redundant segments and splitting logical gaps. It employs structure-aligned reinforcement learning with a novel segment-level budget to supervise efficient reasoning structures during training.

Result: CoSMo achieves superior performance across multiple benchmarks and backbones, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.

Conclusion: CoSMo effectively optimizes reasoning efficiency in Large Reasoning Models by addressing structural redundancy, demonstrating significant improvements in both accuracy and computational efficiency.

Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose \textbf{CoSMo} (\textbf{Co}nsistency-Guided \textbf{S}plit-\textbf{M}erge \textbf{O}ptimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by \textbf{3.3} points while reducing segment usage by \textbf{28.7%} on average compared to reasoning efficiency baselines.

[34] FASA: Frequency-aware Sparse Attention

Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley

Main category: cs.CL

TL;DR: FASA is a query-aware token eviction framework that reduces KV cache memory footprint in LLMs by leveraging functional sparsity in RoPE to identify and retain only critical tokens.

Details

Motivation: LLMs face memory bottlenecks with long inputs due to large KV cache footprints. Existing token pruning methods are inadequate - static approaches risk information loss, while dynamic heuristics fail to capture query-dependent token importance.

Method: FASA discovers functional sparsity at the frequency-chunk level in RoPE, identifying a small subset of “dominant” FCs that correlate with full attention. It uses these as a free computational proxy to predict token importance, then performs attention only on the pruned subset of critical tokens.

Result: FASA outperforms all token-eviction baselines across long-context tasks, achieving near-oracle accuracy. On LongBench-V1, it reaches nearly 100% of full-KV performance with only 256 tokens and achieves 2.56× speedup using 18.9% of cache on AIME24.

Conclusion: FASA provides an effective query-aware token eviction framework that significantly reduces KV cache memory requirements while maintaining performance, addressing a critical bottleneck in LLM deployment for long inputs.

Abstract: The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of “dominant” FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. %making them a powerful and efficient proxy for token importance. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. % Since accessing only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constraint budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves 2.56$\times$ speedup using just 18.9% of the cache on AIME24.

[35] Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch

Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi

Main category: cs.CL

TL;DR: Privasis: First million-scale fully synthetic dataset for privacy-sensitive research, containing 1.4M records with diverse private information types, used to train compact text sanitization models that outperform large LLMs.

Details

Motivation: Privacy-sensitive research faces data scarcity due to privacy constraints, while modern AI agents increasingly access sensitive personal information, creating urgent need for synthetic privacy data to enable research without compromising real user privacy.

Method: Created Privasis dataset from scratch with 1.4M records across diverse document types (medical, legal, financial, calendars, messages) with 55.1M annotated attributes. Used this to build parallel corpus for text sanitization via pipeline that decomposes texts and applies targeted sanitization.

Result: Compact sanitization models (≤4B parameters) trained on Privasis outperform state-of-the-art large language models like GPT-5 and Qwen-3 235B. Dataset offers orders-of-magnitude larger scale with quality and greater diversity than existing datasets.

Conclusion: Privasis enables privacy research at scale without compromising real user data, with compact models outperforming large LLMs, accelerating research in privacy-sensitive domains and AI agents handling personal information.

Abstract: Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents–such as OpenClaw and Gemini Agent–are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch–an expansive reservoir of texts with rich and diverse private information–designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.

[36] ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao

Main category: cs.CL

TL;DR: ForesightKV is a training-based KV cache eviction framework that uses supervised learning with Pairwise Ranking Loss and reinforcement learning (GRPO) to predict optimal KV pairs to evict during long-text generation, reducing memory/computation costs while maintaining performance.

Details

Motivation: Large language models suffer from linearly growing KV cache memory/computation costs during long reasoning traces. Existing eviction methods fail to capture complex KV dependencies, causing performance degradation. Need better balance between efficiency and performance.

Method: 1) Golden Eviction algorithm identifies optimal eviction KV pairs using future attention scores; 2) Supervised training with Pairwise Ranking Loss distills these traces; 3) Formulate cache eviction as Markov Decision Process and apply GRPO algorithm to mitigate language modeling loss on low-entropy tokens.

Result: Experiments on AIME2024 and AIME2025 benchmarks with three reasoning models show ForesightKV consistently outperforms prior methods under only half the cache budget, benefiting from both supervised and reinforcement learning approaches.

Conclusion: ForesightKV effectively balances efficiency and performance in KV cache management for long-text generation, demonstrating superior performance with reduced cache requirements through combined supervised and reinforcement learning techniques.

Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.

[37] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

Main category: cs.CL

TL;DR: Token Sparse Attention: A dynamic token-level sparsification mechanism that compresses Q,K,V to reduced token sets during attention and decompresses outputs, enabling scalable long-context inference with minimal accuracy loss.

Details

Motivation: The quadratic complexity of attention in Transformers is the main bottleneck for long-context inference. Existing methods either use structured sparse patterns or permanently remove tokens, which can retain irrelevant tokens or make irreversible early decisions without considering layer-/head-wise dynamics of token importance.

Method: Proposes Token Sparse Attention, a lightweight dynamic token-level sparsification mechanism that compresses per-head Q, K, V matrices to a reduced token set during attention computation, then decompresses the output back to the original sequence. This allows token information to be reconsidered in subsequent layers. The method is compatible with dense attention implementations like Flash Attention and can be composed with existing sparse attention kernels.

Result: Achieves up to 3.23× attention speedup at 128K context length with less than 1% accuracy degradation. Consistently improves accuracy-latency trade-off compared to existing methods.

Conclusion: Dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference, providing a new design point at the intersection of token selection and sparse attention.

Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy-latency trade-off, achieving up to $\times$3.23 attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.

[38] ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs

Xuancheng Li, Haitao Li, Yujia Zhou, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: ATACompressor: Adaptive Task-Aware Compressor for LLMs that dynamically compresses long contexts by selectively preserving task-relevant information to solve the “lost in the middle” problem.

Details

Motivation: Long-context inputs in LLMs suffer from the "lost in the middle" problem where critical information gets diluted. Existing context compression methods struggle to balance information preservation with compression efficiency.

Method: ATACompressor uses a selective encoder to compress only task-relevant portions of long contexts, with an adaptive allocation controller that perceives relevant content length and adjusts compression rate accordingly.

Result: Outperforms existing methods on three QA datasets (HotpotQA, MSMARCO, SQUAD) in both compression efficiency and task performance, with ablation studies providing insights into key components.

Conclusion: Provides a scalable solution for long-context processing in LLMs by adaptively compressing based on task requirements while preserving essential information.

Abstract: Long-context inputs in large language models (LLMs) often suffer from the “lost in the middle” problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets: HotpotQA, MSMARCO, and SQUAD-showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor.

[39] POP: Prefill-Only Pruning for Efficient Large Model Inference

Junhui He, Zhihui Fu, Jun Wang, Qingan Li

Main category: cs.CL

TL;DR: POP: Stage-aware pruning for LLMs/VLMs that removes deep layers during prefill stage only, maintaining full model for decode stage to balance efficiency and accuracy.

Details

Motivation: Existing structured pruning methods for LLMs/VLMs cause significant accuracy degradation due to stage-agnostic approaches that don't consider the asymmetric roles of prefill (context encoding) and decode (next-token prediction) stages in inference.

Method: Introduces Prefill-Only Pruning (POP) with: 1) Virtual gate mechanism to analyze layer importance, revealing deep layers are critical for decode but redundant for prefill; 2) Independent Key-Value projections to maintain cache integrity during stage transitions; 3) Boundary handling strategy for first token accuracy.

Result: Achieves up to 1.37× speedup in prefill latency with minimal performance loss on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities, overcoming accuracy-efficiency trade-off of existing pruning methods.

Conclusion: POP demonstrates that stage-aware pruning can significantly improve inference efficiency for LLMs/VLMs while maintaining accuracy, by leveraging the asymmetric computational requirements of prefill vs decode stages.

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.

[40] MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

Yifan Shi, Jialong Shi, Jiayi Wang, Ye Fan, Jianyong Sun

Main category: cs.CL

TL;DR: MIRROR is a fine-tuning-free multi-agent framework that translates natural language optimization problems into mathematical models and solver code using execution-driven iterative revision and hierarchical retrieval from curated exemplars.

Details

Motivation: Traditional Operations Research modeling is expert-driven, slow, and fragile for novel scenarios. While LLMs can translate natural language to optimization models, existing approaches lack reliable error correction and task-specific retrieval, leading to incorrect outputs.

Method: MIRROR uses a fine-tuning-free, end-to-end multi-agent framework with two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a curated library.

Result: MIRROR outperforms existing methods on standard OR benchmarks and achieves notable results on complex industrial datasets like IndustryOR and Mamo-ComplexLP.

Conclusion: By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming limitations of general-purpose LLMs in expert optimization tasks.

Abstract: Operations Research (OR) relies on expert-driven modeling-a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.

[41] Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh

Main category: cs.CL

TL;DR: LLM critic interventions can cause severe performance degradation despite strong offline accuracy, requiring pre-deployment testing to determine when intervention is safe.

Details

Motivation: Proactive interventions by LLM critic models are assumed to improve reliability, but their effects at deployment time are poorly understood, and accuracy alone is insufficient to determine intervention safety.

Method: Identifies a disruption-recovery tradeoff and proposes a pre-deployment test using a small pilot of 50 tasks to estimate whether intervention will help or harm without requiring full deployment.

Result: Intervention degrades performance on high-success tasks (0 to -26 pp) while yielding modest improvement on high-failure benchmarks (+2.8 pp, p=0.014). The test correctly anticipates outcomes across benchmarks.

Conclusion: The primary value is identifying when not to intervene, preventing severe regressions before deployment, as critic accuracy alone is insufficient for safe intervention decisions.

Abstract: Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.

[42] PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang

Main category: cs.CL

TL;DR: PEGRL is a two-stage RL framework for LLM-based machine translation that uses post-editing as an auxiliary task to stabilize training and guide optimization, achieving better performance than RL baselines.

Details

Motivation: Current RL methods for LLM-based machine translation face challenges with noisy learning signals from Monte Carlo return estimation and large trajectory spaces that favor global exploration over fine-grained local optimization.

Method: A two-stage RL framework where translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on current translation behavior. A task-specific weighting scheme balances translation and post-editing objectives.

Result: Experiments on English→Finnish, English→Turkish, and English↔Chinese show consistent gains over RL baselines. For English→Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems like DeepSeek-V3.2.

Conclusion: PEGRL effectively addresses RL challenges in machine translation by using post-editing as an auxiliary task, providing more stable training and better optimization through a two-stage approach with balanced objectives.

Abstract: Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce \textbf{PEGRL}, a \textit{two-stage} RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English$\to$Finnish, English$\to$Turkish, and English$\leftrightarrow$Chinese show consistent gains over RL baselines, and for English$\to$Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).

[43] Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain

Wei Zhu

Main category: cs.CL

TL;DR: Analysis of RAG system components and best practices for industrial applications, with focus on medical domain

Details

Motivation: No consensus on best practices for building RAG systems in industrial applications, especially in medical domain where accuracy and reliability are critical

Method: 1) Analyze each RAG component and propose practical alternatives, 2) Conduct systematic evaluations on three types of tasks to reveal best practices and trade-offs

Result: Reveals best practices for improving RAG systems and shows how LLM-based RAG systems make trade-offs between performance and efficiency

Conclusion: Provides practical guidance for building effective RAG systems in industrial applications, particularly in medical domain where precision matters

Abstract: While retrieval augmented generation (RAG) has been swiftly adopted in industrial applications based on large language models (LLMs), there is no consensus on what are the best practices for building a RAG system in terms of what are the components, how to organize these components and how to implement each component for the industrial applications, especially in the medical domain. In this work, we first carefully analyze each component of the RAG system and propose practical alternatives for each component. Then, we conduct systematic evaluations on three types of tasks, revealing the best practices for improving the RAG system and how LLM-based RAG systems make trade-offs between performance and efficiency.

[44] Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Hao Fang, Tianyi Zhang, Tianqu Zhuang, Jiawei Kong, Kuofeng Gao, Bin Chen, Leqi Liang, Shu-Tao Xia, Ke Xu

Main category: cs.CL

TL;DR: A defense method against logit-based distillation attacks on proprietary LLMs using information-theoretic purification of teacher outputs.

Details

Motivation: Proprietary LLMs have economic value but are vulnerable to knowledge extraction via distillation attacks. Existing defenses only address text-based distillation, leaving logit-based distillation undefended.

Method: Uses conditional mutual information (CMI) between teacher logits and input queries conditioned on labels to characterize distillation-relevant information. Proposes learning a transformation matrix to purify outputs, with a CMI-inspired anti-distillation objective to remove distillation information while preserving utility.

Result: Extensive experiments across multiple LLMs and distillation algorithms show the method significantly degrades distillation performance while preserving task accuracy, effectively protecting model IP.

Conclusion: The proposed information-theoretic approach provides effective defense against logit-based distillation attacks on proprietary LLMs, addressing a previously unexplored vulnerability.

Abstract: Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models’ intellectual property.

[45] Verified Critical Step Optimization for LLM Agents

Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, Dong Yu

Main category: cs.CL

TL;DR: CSO (Critical Step Optimization) is a post-training method for LLM agents that focuses preference learning on verified critical steps where alternate actions flip task outcomes from failure to success, using failed trajectories and selective verification to provide fine-grained supervision.

Details

Motivation: Existing post-training methods face challenges: outcome-only rewards fail to attribute credit to intermediate steps, step-level rewards introduce noise, and Monte Carlo sampling is computationally expensive. The paper aims to provide precise, verifiable supervision at critical decision points.

Method: CSO starts from failed policy trajectories, uses a process reward model (PRM) to identify candidate critical steps, leverages expert models to propose alternative actions, continues execution with the policy model, and only uses successfully corrected trajectories as DPO training data. This focuses learning on verified critical steps.

Result: Experiments on GAIA-Text-103 and XBench-DeepSearch show CSO achieves 37% and 26% relative improvement over SFT baseline, substantially outperforms other post-training methods, and requires supervision at only 16% of trajectory steps.

Conclusion: CSO demonstrates the effectiveness of selective verification-based learning for agent post-training, providing fine-grained supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise.

Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model’s weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.

[46] FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

Yingli Shen, Wen Lai, Jie Zhou, Xueren Zhang, Yudong Wang, Kangyang Luo, Shuo Wang, Ge Gao, Alexander Fraser, Maosong Sun

Main category: cs.CL

TL;DR: FactNet is a massive open-source resource that unifies 1.7B atomic assertions with 3.01B auditable evidence pointers from 316 Wikipedia editions, providing deterministic grounding with 92.1% precision for training trustworthy multilingual systems.

Details

Motivation: LLMs suffer from factual hallucinations and lack traceable provenance. Existing grounding resources offer either structured knowledge without textual context (knowledge bases) or grounded text with limited scale and linguistic coverage, creating a gap that needs bridging.

Method: FactNet employs a strictly deterministic construction pipeline that extracts 1.7B atomic assertions with 3.01B evidence pointers from 316 Wikipedia editions, ensuring every evidence unit is recoverable with byte-level precision (unlike synthetic approaches).

Result: Extensive auditing confirms high grounding precision of 92.1%, even in long-tail languages. The resource includes FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking.

Conclusion: FactNet provides a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems, bridging the gap between structured knowledge and grounded text at massive scale.

Abstract: While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. Existing resources for grounding mitigate this but typically enforce a dichotomy: they offer either structured knowledge without textual context (e.g., knowledge bases) or grounded text with limited scale and linguistic coverage. To bridge this gap, we introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions. Unlike recent synthetic approaches, FactNet employs a strictly deterministic construction pipeline, ensuring that every evidence unit is recoverable with byte-level precision. Extensive auditing confirms a high grounding precision of 92.1%, even in long-tail languages. Furthermore, we establish FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking. FactNet provides the community with a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems.

[47] A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao

Main category: cs.CL

TL;DR: A-RAG is an Agentic RAG framework that gives language models direct control over hierarchical retrieval tools (keyword search, semantic search, chunk read) to adaptively search information across granularities, outperforming existing RAG approaches.

Details

Motivation: Existing RAG systems fail to leverage frontier language models' reasoning and tool-use capabilities. Current paradigms either use single-shot retrieval algorithms or predefined workflows, preventing models from participating in retrieval decisions and limiting scalability with model improvements.

Method: A-RAG exposes hierarchical retrieval interfaces directly to the model, providing three retrieval tools: keyword search, semantic search, and chunk read. This allows the agent to adaptively search and retrieve information across multiple granularities based on the task needs.

Result: Experiments on multiple open-domain QA benchmarks show A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens. The framework effectively leverages model capabilities and dynamically adapts to different RAG tasks.

Conclusion: A-RAG demonstrates that giving language models direct control over retrieval decisions enables more efficient information gathering and better performance. The paper also systematically studies how A-RAG scales with model size and test-time compute.

Abstract: Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model’s input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at https://github.com/Ayanami0730/arag.

[48] Preferences for Idiomatic Language are Acquired Slowly – and Forgotten Quickly: A Case Study on Swedish

Jenny Kunz

Main category: cs.CL

TL;DR: Language models develop idiomatic competence slower than other linguistic abilities, with instruction tuning on machine-translated data causing rapid loss of idiomatic language preference.

Details

Motivation: To understand how language models develop preferences for idiomatic vs. linguistically acceptable Swedish during pretraining and adaptation from English, and to assess the impact of instruction tuning on machine-translated data.

Method: Train models on Swedish from scratch and by fine-tuning English-pretrained models, probing preferences at checkpoints using minimal pairs. Adapt existing benchmarks for linguistic acceptability and introduce two new datasets for idiomaticity: conventionalized idioms vs. variants, and idiomatic Swedish vs. Translationese.

Result: Idiomatic competence emerges more slowly than other linguistic abilities. Longer training yields diminishing returns for most tasks but idiom performance continues improving, especially in largest models (8B). Instruction tuning on machine-translated English data causes rapid loss of idiomatic language preference.

Conclusion: Models develop idiomatic understanding gradually, but standard instruction tuning approaches using machine-translated data can degrade idiomatic competence, highlighting challenges for multilingual model development.

Abstract: In this study, we investigate how language models develop preferences for \textit{idiomatic} as compared to \textit{linguistically acceptable} Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English – the common approach for languages with little or no native instruction data – causes models to rapidly lose their preference for idiomatic language.

[49] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

Quanyu Long, Kai Jie Jiang, Jianda Chen, Xu Guo, Leilei Gan, Wenya Wang

Main category: cs.CL

TL;DR: LRMs overuse self-verification steps that rarely correct errors; proposed experience-driven framework suppresses unnecessary rechecks to reduce token usage while maintaining accuracy.

Details

Motivation: Large Reasoning Models frequently use self-verification (recheck) steps that are mostly confirmatory rather than corrective, creating inefficiency. The paper aims to reduce this overused verification behavior.

Method: Experience-driven test-time framework that detects recheck activation, consults offline experience pool of past verification outcomes, estimates if recheck is likely unnecessary via efficient retrieval, and suppresses unnecessary rechecks.

Result: Reduces token usage up to 20.3% while maintaining accuracy across multiple models and benchmarks; some datasets even show accuracy improvements.

Conclusion: Self-verification in LRMs is often overused; the proposed experience-driven suppression framework effectively reduces unnecessary verification steps while preserving reasoning quality.

Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consist of self-verification (recheck) that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors and altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces the overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests unnecessary, a suppression signal redirects the model to proceed. Across multiple model and benchmarks, our approach reduces token usage up to 20.3% while maintaining the accuracy, and in some datasets even yields accuracy improvements.

[50] Learning to Reason Faithfully through Step-Level Faithfulness Maximization

Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng

Main category: cs.CL

TL;DR: FaithRL is a reinforcement learning framework that optimizes reasoning faithfulness in LLMs by introducing geometric reward design and faithfulness-aware advantage modulation to reduce hallucinations while maintaining answer correctness.

Details

Motivation: Current RLVR pipelines for LLMs rely on sparse outcome-based rewards, which provide little supervision over intermediate reasoning steps. This encourages over-confidence and spurious reasoning, leading to increased hallucinations in multi-step reasoning tasks.

Method: Proposes FaithRL framework with: 1) Formalized faithfulness-maximization objective, 2) Geometric reward design, 3) Faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations.

Result: Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms increased step-wise reasoning faithfulness and robust generalization.

Conclusion: FaithRL provides an effective reinforcement learning framework that directly optimizes reasoning faithfulness, addressing key limitations of sparse reward RLVR pipelines and reducing hallucinations in LLM reasoning tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.

[51] Can Large Language Models Generalize Procedures Across Representations?

Fangru Lin, Valentin Hofmann, Xingchen Wan, Weixing Wang, Zifeng Ding, Anthony G. Cohn, Janet B. Pierrehumbert

Main category: cs.CL

TL;DR: A two-stage curriculum training method improves LLM generalization across symbolic (code/graphs) and natural language representations for procedural tasks like planning.

Details

Motivation: LLMs are extensively trained on symbolic representations like code and graphs, but real-world user tasks are often specified in natural language. The paper investigates whether LLMs can generalize across these different representations for isomorphic tasks involving procedures.

Method: Proposes a two-stage data curriculum: first train on symbolic representations (code or graphs), then train on natural language data. This approach helps models learn to generalize across different representations for procedural tasks like scheduling and planning.

Result: The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained with this method can closely match zero-shot GPT-4o in naturalistic planning tasks.

Conclusion: Successful cross-representation generalization can be interpreted as a form of generative analogy, which the proposed curriculum effectively encourages. The method bridges the gap between symbolic and natural language representations for procedural tasks.

Abstract: Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.

[52] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue

Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo, Jinpeng Wang, Yujie Wang, Ruiyuan Wu, Chaozheng Wang

Main category: cs.CL

TL;DR: SEAD is a self-evolving agent framework for service dialogues that improves performance by generating diverse user states and realistic role-playing without large-scale human annotations.

Details

Motivation: Current LLMs perform poorly in service dialogues due to reliance on noisy, low-quality human conversation data, stemming from data scarcity and difficulty simulating authentic goal-oriented user behaviors.

Method: SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing, ensuring adaptive training scenarios rather than adversarial environments.

Result: SEAD significantly outperforms both open-source foundation models and closed-source commercial models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%.

Conclusion: The SEAD framework enables agents to learn effective service dialogue strategies without large-scale human annotations, addressing data scarcity and simulation challenges through self-evolving mechanisms.

Abstract: Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: https://github.com/Da1yuqin/SEAD.

[53] Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

Vitalii Hirak, Jaap Jumelet, Arianna Bisazza

Main category: cs.CL

TL;DR: Analysis of how target language typology affects translation quality in large multilingual models, finding typology drives quality disparities even after controlling for data/resources.

Details

Motivation: To understand why quality disparities persist across languages in multilingual models, investigating whether typological properties (beyond just training data/resources) determine intrinsic modeling difficulty.

Method: Analyze two large pre-trained multilingual translation models (NLLB-200 and Tower+), examine translation quality across broad language set, control for factors like data resourcedness and writing script, study how typology affects decoding strategies.

Result: Target language typology drives translation quality in both encoder-decoder and decoder-only models; certain typological properties benefit more from wider output space search, suggesting alternative decoding strategies could help.

Conclusion: Typological properties are significant factors in multilingual model performance, and language-specific decoding strategies could improve translation for languages with certain typological features.

Abstract: Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.

Yizhao Gao, Jianyu Wei, Qihao Zhang, Yu Cheng, Shimao Chen, Zhengju Tang, Zihan Jiang, Yifan Song, Hailin Zhang, Liang Zhao, Bo Yang, Gang Wang, Shijie Cao, Fuli Luo

Main category: cs.CL

TL;DR: HySparse is a novel attention architecture that interleaves full attention layers with sparse attention layers, using full attention as an oracle to identify important tokens and enabling KV cache reuse to reduce both computation and memory.

Details

Motivation: Current sparse attention methods have two key limitations: they rely on additional proxies to predict token importance (adding complexity and potential suboptimal performance), and they reduce computation without saving KV cache memory. HySparse aims to address both issues by strategically using full attention layers to guide sparse attention.

Method: HySparse interleaves each full attention layer with several sparse attention layers. It strategically derives each sparse layer’s token selection and KV caches directly from the preceding full attention layer, using the full attention as a precise oracle to identify important tokens. This enables sparse attention layers to reuse the full attention KV cache.

Result: HySparse consistently outperforms both full attention and hybrid SWA baselines across 7B dense and 80B MoE models. In the 80B MoE model with 49 total layers, only 5 layers use full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.

Conclusion: HySparse provides an effective architecture that overcomes fundamental limitations of prior sparse attention methods by using full attention as an oracle for token selection and enabling KV cache reuse, achieving better performance with significantly reduced memory requirements.

Abstract: This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer’s token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.

[55] ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning

Wei Zhu

Main category: cs.CL

TL;DR: Supervised contrastive learning framework (ACL) that aligns label embeddings with sample representations and resolves conflicts with cross-entropy loss through gradient-based optimization.

Details

Motivation: Contrastive learning shows success in self-supervised settings but faces challenges in supervised settings due to conflicts between contrastive learning objectives and cross-entropy loss, hindering its application in supervised learning scenarios.

Method: Proposes Aligned Contrastive Learning (ACL) with three components: 1) ACL-Embed treats label embeddings as augmented samples and aligns them with sample representations; 2) ACL-Grad selectively discards ACL-Embed terms when objectives conflict; 3) ACL-CL uses teacher exits to guide shallow student exits in multi-exit BERT models.

Result: ACL-BERT outperforms or matches CE and CE+SCL on GLUE tasks, and ACL-CL significantly improves multi-exit BERT fine-tuning, providing better quality-speed tradeoffs for low-latency applications.

Conclusion: The proposed ACL framework successfully resolves conflicts between contrastive learning and cross-entropy loss in supervised settings, enabling effective application of contrastive learning to improve model performance and efficiency in NLP tasks.

Abstract: Despite its success in self-supervised learning, contrastive learning is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy loss objective (CE) and the contrastive learning objective often conflict with each other, thus hindering the applications of CL in supervised settings. To resolve this problem, we introduce a novel \underline{A}ligned \underline{C}ontrastive \underline{L}earning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with different labels and employs contrastive learning to align the label embeddings with its samples’ representations. Second, to facilitate the optimization of ACL-Embed objective combined with the CE loss, we propose ACL-Grad, which will discard the ACL-Embed term if the two objectives are in conflict. To further enhance the performances of intermediate exits of multi-exit BERT, we further propose cross-layer ACL (ACL-CL), which is to ask the teacher exit to guide the optimization of student shallow exits. Extensive experiments on the GLUE benchmark results in the following takeaways: (a) ACL-BRT outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially CL-ACL, significantly surpasses the baseline methods on the fine-tuning of multi-exit BERT, thus providing better quality-speed tradeoffs for low-latency applications.

[56] Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs

Su Dong, Qinggang Zhang, Yilin Xiao, Shengyuan Chen, Chuang Zhou, Xiao Huang

Main category: cs.CL

TL;DR: EA-GraphRAG: An adaptive framework that dynamically routes queries between vanilla RAG and GraphRAG based on syntactic complexity analysis to optimize performance for both simple and complex queries.

Details

Motivation: GraphRAG underperforms vanilla RAG in real-world scenarios despite theoretical advantages, with significant accuracy drops and prohibitive latency. The rigid application of GraphRAG to all queries regardless of complexity is identified as the root cause.

Method: Proposes EA-GraphRAG with three components: 1) syntactic feature constructor that parses queries and extracts structural features, 2) lightweight complexity scorer that maps features to continuous complexity scores, 3) score-driven routing policy that selects dense RAG for low-score queries, graph-based retrieval for high-score queries, and complexity-aware reciprocal rank fusion for borderline cases.

Result: Extensive experiments on comprehensive benchmark (two single-hop and two multi-hop QA benchmarks) demonstrate significant improvements in accuracy, reduced latency, and state-of-the-art performance in handling mixed scenarios of simple and complex queries.

Conclusion: EA-GraphRAG effectively addresses the limitations of rigid GraphRAG application by dynamically adapting retrieval strategies based on query complexity, achieving optimal performance across diverse query types.

Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to hallucinations and outdated parametric knowledge. While Retrieval-Augmented Generation (RAG) addresses this by integrating external corpora, its effectiveness is limited by fragmented information in unstructured domain documents. Graph-augmented RAG (GraphRAG) emerged to enhance contextual reasoning through structured knowledge graphs, yet paradoxically underperforms vanilla RAG in real-world scenarios, exhibiting significant accuracy drops and prohibitive latency despite gains on complex queries. We identify the rigid application of GraphRAG to all queries, regardless of complexity, as the root cause. To resolve this, we propose an efficient and adaptive GraphRAG framework called EA-GraphRAG that dynamically integrates RAG and GraphRAG paradigms through syntax-aware complexity analysis. Our approach introduces: (i) a syntactic feature constructor that parses each query and extracts a set of structural features; (ii) a lightweight complexity scorer that maps these features to a continuous complexity score; and (iii) a score-driven routing policy that selects dense RAG for low-score queries, invokes graph-based retrieval for high-score queries, and applies complexity-aware reciprocal rank fusion to handle borderline cases. Extensive experiments on a comprehensive benchmark, consisting of two single-hop and two multi-hop QA benchmarks, demonstrate that our EA-GraphRAG significantly improves accuracy, reduces latency, and achieves state-of-the-art performance in handling mixed scenarios involving both simple and complex queries.

[57] $V_0$: A Generalist Value Model for Any Policy at State Zero

Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

Main category: cs.CL

TL;DR: V₀ is a generalist value model that estimates LLM performance on unseen prompts without training, using policy capability as context input to enable efficient sampling allocation in GRPO training and cost-effective model routing.

Details

Motivation: Traditional Actor-Critic methods require expensive synchronous training of value models to track evolving policy capabilities, while group-based methods like GRPO need extensive sampling for stable baselines. There's a need for efficient value estimation without parameter updates.

Method: V₀ treats policy capability as explicit context input using history of instruction-performance pairs. It focuses on value estimation at State Zero (initial prompt) and serves as a resource scheduler for GRPO training and model routing during deployment.

Result: V₀ significantly outperforms heuristic budget allocation and achieves Pareto-optimal trade-off between performance and cost in LLM routing tasks, demonstrating effective value estimation without parameter updates.

Conclusion: The V₀ model provides an efficient alternative to traditional value estimation methods by using policy capability as context, enabling better resource allocation in RL training and cost-effective model routing in deployment.

Abstract: Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy’s dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.

[58] CL-bench: A Benchmark for Context Learning

Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao

Main category: cs.CL

TL;DR: CL-bench is a real-world benchmark for evaluating language models’ ability to learn from complex context containing new domain knowledge, rules, and procedures not seen during pre-training.

Details

Motivation: Current LMs excel at reasoning with pre-trained knowledge but struggle with real-world tasks requiring learning from task-specific context and new knowledge beyond pre-training, a capability termed "context learning" that humans naturally possess but has been overlooked in AI research.

Method: Introduces CL-bench with 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics crafted by domain experts. Each task requires learning new content from context including domain-specific knowledge, rule systems, complex procedures, and laws derived from empirical data.

Result: Evaluation of ten frontier LMs shows they solve only 17.2% of tasks on average, with the best model (GPT-5.1) solving only 23.7%, revealing LMs have not achieved effective context learning.

Conclusion: Context learning is a critical bottleneck for real-world applications, and CL-bench represents a step toward building LMs with this fundamental capability to make them more intelligent and deployable in real-world scenarios.

Abstract: Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.

[59] Efficient Algorithms for Partial Constraint Satisfaction Problems over Control-flow Graphs

Xuran Cai, Amir Goharshady

Main category: cs.CL

TL;DR: A linear-time algorithm for Partial Constraint Satisfaction Problems (PCSPs) over control-flow graphs using Series-Parallel-Loop decompositions, with applications to compiler optimization tasks like register allocation and bank selection.

Details

Motivation: Compiler optimization tasks like register allocation, speculative partial redundancy elimination, and bank selection can be framed as Partial Constraint Satisfaction Problems (PCSPs) over control-flow graphs. Control-flow graphs of structured programs have sparse, decomposable structures that can be exploited for efficient algorithms.

Method: The paper presents a general algorithm for PCSPs over Series-Parallel-Loop (SPL) decompositions of control-flow graphs. The algorithm has time complexity O(|G|·|D|⁶), where |G| is graph size and |D| is domain size, yielding linear-time solutions for fixed domains. This unifies previous SPL-based approaches for specific compiler optimization problems.

Result: The algorithm achieves runtimes four times better than previous state-of-the-art for the Optimal Bank Selection problem. For fixed domains, it provides linear-time solutions to PCSPs over control-flow graphs.

Conclusion: The work provides a unified, efficient algorithm for solving PCSPs over control-flow graphs using SPL decompositions, generalizing previous specialized approaches and demonstrating practical improvements for compiler optimization tasks.

Abstract: In this work, we focus on the Partial Constraint Satisfaction Problem (PCSP) over control-flow graphs (CFGs) of programs. PCSP serves as a generalization of the well-known Constraint Satisfaction Problem (CSP). In the CSP framework, we define a set of variables, a set of constraints, and a finite domain $D$ that encompasses all possible values for each variable. The objective is to assign a value to each variable in such a way that all constraints are satisfied. In the graph variant of CSP, an underlying graph is considered and we have one variable corresponding to each vertex of the graph and one or several constraints corresponding to each edge. In PCSPs, we allow for certain constraints to be violated at a specified cost, aiming to find a solution that minimizes the total cost. Numerous classical compiler optimization tasks can be framed as PCSPs over control-flow graphs. Examples include Register Allocation, Lifetime-optimal Speculative Partial Redundancy Elimination (LOSPRE), and Optimal Placement of Bank Selection Instructions. On the other hand, it is well-known that control-flow graphs of structured programs are sparse and decomposable in a variety of ways. In this work, we rely on the Series-Parallel-Loop (SPL) decompositions as introduced by~\cite{RegisterAllocation}. Our main contribution is a general algorithm for PCSPs over SPL graphs with a time complexity of (O(|G| \cdot |D|^6)), where (|G|) represents the size of the control-flow graph. Note that for any fixed domain $D,$ this yields a linear-time solution. Our algorithm can be seen as a generalization and unification of previous SPL-based approaches for register allocation and LOSPRE. In addition, we provide experimental results over another classical PCSP task, i.e. Optimal Bank Selection, achieving runtimes four times better than the previous state of the art.

[60] Controlling Output Rankings in Generative Engines for LLM-based Search

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Yifeng Luo, Huimin Zeng, Man Luo, Haohan Wang

Main category: cs.CL

TL;DR: CORE is a method to control output rankings in LLM-based search engines by optimizing retrieved content with strategically designed optimization content to promote specific products.

Details

Motivation: LLM-based search engines provide direct product recommendations, but these are biased by initial retrieval order, disadvantaging small businesses and independent creators who get limited visibility.

Method: CORE optimizes retrieved content by appending three types of optimization content: string-based, reasoning-based, and review-based. It targets search engine content since LLM interactions are black-box. Evaluated on ProductBench benchmark with 15 categories and 200 products each.

Result: CORE achieves average Promotion Success Rate of 91.4% @Top-5, 86.6% @Top-3, and 80.3% @Top-1 across 15 product categories on four LLMs (GPT-4o, Gemini-2.5, Claude-4, Grok-3), outperforming existing ranking manipulation methods while preserving content fluency.

Conclusion: CORE effectively controls output rankings in LLM-based search engines through content optimization, addressing fairness concerns in product recommendations while maintaining recommendation quality.

Abstract: The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that \textbf{C}ontrols \textbf{O}utput \textbf{R}ankings in g\textbf{E}nerative Engines for LLM-based search. Since the LLM’s interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon’s search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of \textbf{91.4% @Top-5}, \textbf{86.6% @Top-3}, and \textbf{80.3% @Top-1}, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content.

[61] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou

Main category: cs.CL

TL;DR: A pipeline for training query-specific rubric generators for DeepResearch report evaluation using human preference alignment and reinforcement learning, integrated with a multi-agent workflow for improved report generation.

Details

Motivation: Current evaluation of DeepResearch-generated reports lacks verifiable reward signals, relying on either coarse pre-defined rubrics or costly manually constructed query-specific rubrics that don't scale well.

Method: 1) Construct dataset of DeepResearch queries with human preference annotations over report pairs; 2) Train rubric generators via reinforcement learning with hybrid reward combining human preference supervision and LLM-based rubric evaluation; 3) Introduce Multi-agent Markov-state workflow for better long-horizon reasoning in report generation.

Result: The rubric generators deliver more discriminative and better human-aligned supervision than existing strategies. When integrated into the MaMs framework, DeepResearch systems outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to leading closed-source models.

Conclusion: The proposed pipeline successfully addresses the rubric generation challenge for DeepResearch evaluation, providing scalable, query-specific rubrics that align with human preferences and improve system performance.

Abstract: Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.

[62] BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish

Burak Aktaş, Mehmet Can Baytekin, Süha Kağan Köse, Ömer İlbilgi, Elif Özge Yılmaz, Çağrı Toraman, Bilge Kaan Görür

Main category: cs.CL

TL;DR: BIRDTurk: First Turkish adaptation of BIRD benchmark for Text-to-SQL evaluation, showing performance degradation in Turkish due to linguistic divergence and LLM underrepresentation, with agentic reasoning showing better cross-lingual robustness.

Details

Motivation: Text-to-SQL systems perform well on English benchmarks but their behavior in morphologically rich, low-resource languages like Turkish remains unexplored. Need for controlled evaluation framework to understand cross-lingual Text-to-SQL performance.

Method: Created BIRDTurk through controlled translation pipeline adapting schema identifiers to Turkish while preserving SQL logical structure and execution semantics. Translation quality validated using Central Limit Theorem sampling for 95% confidence. Evaluated three approaches: inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning.

Result: Turkish introduces consistent performance degradation compared to English, driven by structural linguistic divergence and underrepresentation in LLM pretraining. Agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models.

Conclusion: BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. Turkish performance degradation highlights challenges in low-resource languages, with agentic reasoning showing promise for cross-lingual robustness.

Abstract: Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.

[63] TRE: Encouraging Exploration in the Trust Region

Chao Huang, Yujing Lu, Quangang Li, Shenghe Wang, Yan Wang, Yueyang Zhang, Long Xia, Jiashu Zhao, Zhiyuan Sun, Daiting Shi, Tingwen Liu

Main category: cs.CL

TL;DR: TRE (Trust Region Entropy) is a new RL exploration method for LLMs that restricts entropy regularization to the model’s trust region to avoid diluting probability mass into invalid tokens, improving performance on reasoning and alignment tasks.

Details

Motivation: Standard entropy regularization in RL fails for LLMs because with massive vocabularies and long generation horizons, it indiscriminately spreads probability mass into invalid tokens rather than focusing on plausible candidates, disrupting coherent reasoning.

Method: Proposes Trust Region Entropy (TRE) that encourages exploration strictly within the model’s trust region, preventing probability dilution into the vast tail of invalid tokens while maintaining exploration of plausible candidates.

Result: Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks show TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines.

Conclusion: TRE effectively addresses the cumulative tail risk problem in LLM RL by restricting exploration to the trust region, leading to better performance on diverse reasoning and alignment tasks compared to standard approaches.

Abstract: Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model’s trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.

[64] RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

Süha Kağan Köse, Mehmet Can Baytekin, Burak Aktaş, Bilge Kaan Görür, Evren Ayberk Munis, Deniz Yılmaz, Muhammed Yusuf Kartal, Çağrı Toraman

Main category: cs.CL

TL;DR: This paper presents a comprehensive Turkish RAG dataset and benchmarks seven stages of the RAG pipeline for morphologically rich languages, showing complex methods like HyDE achieve 85% accuracy but simpler configurations can achieve comparable performance with lower cost.

Details

Motivation: Current RAG design guidance is English-centric, limiting insights for morphologically rich languages like Turkish. There's a need for comprehensive evaluation and optimization of RAG pipelines specifically for such languages.

Method: Constructed a Turkish RAG dataset from Turkish Wikipedia and CulturaX, then benchmarked seven stages of the RAG pipeline (query transformation, reranking, answer refinement) without task-specific fine-tuning.

Result: HyDE achieved highest accuracy (85%) vs baseline (78.70%). Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieved comparable performance (84.60%) with lower cost. Over-stacking generative modules degraded performance by distorting morphological cues.

Conclusion: Simple query clarification with robust reranking is effective for Turkish RAG. Complex methods can maximize accuracy but simpler configurations offer better cost-performance tradeoffs. Morphological richness requires careful pipeline design.

Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%) that is considerably higher than the baseline (78.70%). Also a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) with much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.

[65] Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

Yu Zhang, Mufan Xu, Xuefeng Bai, Kehai chen, Pengfei Zhang, Yang Xiang, Min Zhang

Main category: cs.CL

TL;DR: The paper investigates how multimodal large language models decide when to use visual/audio information vs. text-only reasoning, revealing that instruction tokens act as anchors for modality arbitration through specific attention heads in deep layers.

Details

Motivation: Understanding how MLLMs decide to use multimodal contexts based on user instructions is crucial for safety and reliability, but the underlying mechanisms remain poorly understood.

Method: Analyzes modality following through an information flow lens, examining attention layers and MLP layers, identifying specialized attention heads that drive modality arbitration, and performing causal interventions.

Result: Found that instruction tokens serve as structural anchors; shallow layers route multimodal cues non-selectively, deep layers resolve modality competition, MLP layers act adversarially; identified sparse specialized attention heads; manipulating 5% of critical heads can change modality-following ratio by ±60%.

Conclusion: Provides insights into MLLM decision-making for multimodal context usage, advancing model transparency and offering a framework for orchestrating multimodal information.

Abstract: Modality following serves as the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: Shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; Modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere $5%$ of these critical heads can decrease the modality-following ratio by $60%$ through blocking, or increase it by $60%$ through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.

[66] Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

Difan Deng, Andreas Bentzen Winje, Lukas Fehring, Marius Lindauer

Main category: cs.CL

TL;DR: NAtS-L is a hybrid attention framework that dynamically applies linear or softmax attention to different tokens based on their impact, achieving efficient long-context modeling.

Details

Motivation: The quadratic complexity of softmax attention limits transformer efficiency in long-context scenarios, while linear attention models have limited expressivity due to fixed hidden state sizes. Existing hybrid approaches still suffer from softmax bottlenecks.

Method: Proposes Neural Attention Search Linear (NAtS-L) that applies both linear attention (Gated DeltaNet) and softmax attention operations within the same layer on different tokens. Automatically determines token-level attention type based on whether tokens have short-term impact (linear attention) or require long-term retrieval (softmax attention).

Result: NAtS-L provides a strong yet efficient token-level hybrid architecture that reduces computational complexity while maintaining expressivity for long-context modeling.

Conclusion: The framework enables efficient long-context processing by intelligently allocating computational resources at the token level, balancing efficiency and expressivity in attention mechanisms.

Abstract: The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.

[67] Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

Jiashuo Sun, Pengcheng Jiang, Saizhuo Wang, Jiajun Fan, Heng Wang, Siru Ouyang, Ming Zhong, Yizhu Jiao, Chengsong Huang, Xueqiang Xu, Pengrui Han, Peiran Li, Jiaxin Huang, Ge Liu, Heng Ji, Jiawei Han

Main category: cs.CL

TL;DR: BAR-RAG improves RAG robustness by training a boundary-aware evidence selector that targets the generator’s “Goldilocks Zone” - evidence that is challenging yet sufficient for the generator to learn from.

Details

Motivation: Current RAG systems are brittle under retrieval noise because retrievers and rerankers optimize only for relevance, often selecting either trivial answer-revealing passages or insufficient evidence without considering what evidence is suitable for the generator.

Method: Proposes BAR-RAG with a boundary-aware evidence selector trained with reinforcement learning using generator feedback, and a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate training-inference mismatch.

Result: Experiments on knowledge-intensive QA benchmarks show BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving average 10.3% gain over strong RAG and reranking baselines while substantially improving robustness.

Conclusion: BAR-RAG effectively addresses retrieval noise in RAG systems by selecting evidence in the generator’s Goldilocks Zone, leading to significant performance improvements and enhanced robustness.

Abstract: Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator’s Goldilocks Zone – evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly avaliable at https://github.com/GasolSun36/BAR-RAG.

[68] OCRTurk: A Comprehensive OCR Benchmark for Turkish

Deniz Yılmaz, Evren Ayberk Munis, Çağrı Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, Bilge Kaan Görür

Main category: cs.CL

TL;DR: OCRTurk: A Turkish document parsing benchmark with 180 documents across academic/non-academic categories at three difficulty levels, evaluating 7 OCR models with PaddleOCR performing best overall.

Details

Motivation: Existing document parsing benchmarks focus on high-resource languages with limited coverage for low-resource settings like Turkish. Turkish document parsing lacks standardized benchmarks reflecting real-world scenarios and document diversity.

Method: Created OCRTurk benchmark with 180 Turkish documents from academic articles, theses, slide decks, and non-academic articles at three difficulty levels. Evaluated seven OCR models using element-wise metrics including Normalized Edit Distance.

Result: PaddleOCR achieved strongest overall results across difficulty levels, leading most element-wise metrics except figures. Performance varied by document type: models performed well on non-academic documents while slideshows were most challenging.

Conclusion: OCRTurk addresses the gap in Turkish document parsing benchmarks, providing a standardized evaluation framework that reflects real-world document diversity and difficulty levels for assessing OCR model performance.

Abstract: Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.

[69] Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

Yu Tian, Linh Huynh, Katerina Christhilf, Shubham Chakraborty, Micah Watanabe, Tracy Arner, Danielle McNamara

Main category: cs.CL

TL;DR: ReQUESTA is a hybrid multi-agent framework that generates cognitively diverse multiple-choice questions by decomposing MCQ authoring into specialized subtasks, using LLM-powered agents with rule-based components for planning, controlled generation, evaluation, and post-processing.

Details

Motivation: While LLMs enable automated MCQ generation, reliably producing items with controlled cognitive demands remains challenging. The paper aims to address this gap by creating a framework that systematically targets different comprehension types (text-based, inferential, main idea) with better reliability and controllability.

Method: ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components. It uses a hybrid approach with multi-agent orchestration for planning, controlled generation, iterative evaluation, and post-processing. The framework was evaluated against a single-pass GPT-5 zero-shot baseline in a large-scale reading comprehension study using academic expository texts.

Result: ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations showed stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions.

Conclusion: Hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.

Abstract: Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.

[70] OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo

Main category: cs.CL

TL;DR: OmniRAG-Agent: An agentic omnimodal QA method for budgeted long audio-video reasoning with retrieval-augmented generation and policy optimization.

Details

Motivation: Address challenges in low-resource long audio-video QA including costly dense encoding, weak fine-grained retrieval, limited proactive planning, and lack of end-to-end optimization.

Method: Builds image-audio retrieval-augmented generation module for fetching relevant frames/audio snippets, uses agent loop for planning/tool calling across turns, and applies group relative policy optimization for joint improvement.

Result: Outperforms prior methods on OmniVideoBench, WorldSense, and Daily-Omni under low-resource settings, with ablations validating each component.

Conclusion: OmniRAG-Agent effectively addresses key challenges in long-horizon omnimodal QA through retrieval-augmented generation, agentic planning, and policy optimization.

Abstract: Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization.To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.

[71] Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Main category: cs.CL

TL;DR: SemanticSpec accelerates LLM inference by using semantic-aware speculative decoding that verifies entire semantic sequences instead of tokens, achieving up to 2.7x speedup.

Details

Motivation: LLMs suffer from high inference latency due to autoregressive decoding, especially for Large Reasoning Models that generate lengthy chains of thought. Existing speculative decoding methods operate at token level and ignore semantic equivalence, leading to inefficient rejections.

Method: Proposes SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. Introduces semantic probability estimation mechanism that probes model’s internal hidden states to assess likelihood of generating sequences with specific meanings.

Result: Achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness across four benchmarks.

Conclusion: SemanticSpec provides an effective approach to accelerate LLM inference by leveraging semantic equivalence in speculative decoding, addressing limitations of token-level methods.

Abstract: Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model’s internal hidden states to assess the likelihood of generating sequences with specific meanings.Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.

[72] No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras

Main category: cs.CL

TL;DR: ID-MoCQA is a large-scale multi-hop QA dataset for assessing cultural understanding of LLMs, focused on Indonesian traditions, with systematic transformation of single-hop questions into multi-hop reasoning chains.

Details

Motivation: Current culturally focused QA benchmarks rely on single-hop questions that allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. There's a need for multi-hop reasoning benchmarks that require understanding context, tradition, and implicit social knowledge.

Method: Created ID-MoCQA dataset by systematically transforming single-hop cultural questions into multi-hop reasoning chains spanning six clue types (commonsense, temporal, geographical, etc.). Used multi-stage validation pipeline combining expert review and LLM-as-a-judge filtering to ensure high-quality question-answer pairs.

Result: Evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. The dataset provides a challenging benchmark for assessing cultural competency.

Conclusion: ID-MoCQA is an essential benchmark for advancing the cultural competency of LLMs, addressing limitations of existing single-hop cultural QA datasets and enabling better assessment of genuine cultural reasoning capabilities.

Abstract: Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.

[73] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin

Main category: cs.CL

TL;DR: BranPO is a value-free reinforcement learning method that improves long-horizon planning in language models by providing step-level contrastive supervision through trajectory truncation and alternative continuation resampling.

Details

Motivation: Agentic reinforcement learning enables LLMs to perform complex multi-turn planning, but suffers from sparse rewards in long-horizon settings. Prior tree-based methods have high variance and computational inefficiency, with performance diverging mainly due to decisions near the tail of trajectories.

Method: BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, providing step-level contrastive supervision without dense rewards. It introduces difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions.

Result: Extensive experiments on various question answering benchmarks show BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget.

Conclusion: BranPO effectively addresses credit assignment in long-horizon RL for language models through contrastive suffix construction and adaptive sampling techniques, demonstrating superior performance on complex planning tasks.

Abstract: Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, We identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at \href{https://github.com/YubaoZhao/BranPO}{code}.

[74] Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C. Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li, Yuxiao Li, Zhaowei Li, Zheming Li, Weilong Liao, Jiawei Lin, Xiaohan Lin, Zhishan Lin, Zichao Lin, Cheng Liu, Chenyu Liu, Hongzhang Liu, Liang Liu, Shaowei Liu, Shudong Liu, Shuran Liu, Tianwei Liu, Tianyu Liu, Weizhou Liu, Xiangyan Liu, Yangyang Liu, Yanming Liu, Yibo Liu, Yuanxin Liu, Yue Liu, Zhengying Liu, Zhongnuo Liu, Enzhe Lu, Haoyu Lu, Zhiyuan Lu, Junyu Luo, Tongxu Luo, Yashuo Luo, Long Ma, Yingwei Ma, Shaoguang Mao, Yuan Mei, Xin Men, Fanqing Meng, Zhiyong Meng, Yibo Miao, Minqing Ni, Kun Ouyang, Siyuan Pan, Bo Pang, Yuchao Qian, Ruoyu Qin, Zeyu Qin, Jiezhong Qiu, Bowen Qu, Zeyu Shang, Youbo Shao, Tianxiao Shen, Zhennan Shen, Juanfeng Shi, Lidong Shi, Shengyuan Shi, Feifan Song, Pengwei Song, Tianhui Song, Xiaoxi Song, Hongjin Su, Jianlin Su, Zhaochen Su, Lin Sui, Jinsong Sun, Junyao Sun, Tongyu Sun, Flood Sung, Yunpeng Tai, Chuning Tang, Heyi Tang, Xiaojuan Tang, Zhengyang Tang, Jiawen Tao, Shiyuan Teng, Chaoran Tian, Pengfei Tian, Ao Wang, Bowen Wang, Chensi Wang, Chuang Wang, Congcong Wang, Dingkun Wang, Dinglu Wang, Dongliang Wang, Feng Wang, Hailong Wang, Haiming Wang, Hengzhi Wang, Huaqing Wang, Hui Wang, Jiahao Wang, Jinhong Wang, Jiuzheng Wang, Kaixin Wang, Linian Wang, Qibin Wang, Shengjie Wang, Shuyi Wang, Si Wang, Wei Wang, Xiaochen Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yipu Wang, Yiqin Wang, Yucheng Wang, Yuzhi Wang, Zhaoji Wang, Zhaowei Wang, Zhengtao Wang, Zhexu Wang, Zihan Wang, Zizhe Wang, Chu Wei, Ming Wei, Chuan Wen, Zichen Wen, Chengjie Wu, Haoning Wu, Junyan Wu, Rucong Wu, Wenhao Wu, Yuefeng Wu, Yuhao Wu, Yuxin Wu, Zijian Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Yuchong Xie, Yifei Xin, Bowei Xing, Boyu Xu, Jianfan Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinbo Xu, Xinran Xu, Yangchuan Xu, Yichang Xu, Yuemeng Xu, Zelai Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Guangyao Yang, Hao Yang, Junwei Yang, Kai Yang, Ningyuan Yang, Ruihan Yang, Xiaofei Yang, Xinlong Yang, Ying Yang, Yi Yang, Yi Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Dan Ye, Wenjie Ye, Zhuorui Ye, Bohong Yin, Chengzhen Yu, Longhui Yu, Tao Yu, Tianxiang Yu, Enming Yuan, Mengjie Yuan, Xiaokun Yuan, Yang Yue, Weihao Zeng, Dunyuan Zha, Haobing Zhan, Dehao Zhang, Hao Zhang, Jin Zhang, Puqi Zhang, Qiao Zhang, Rui Zhang, Xiaobin Zhang, Y. Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yushun Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Chenguang Zhao, Feifan Zhao, Jinxiang Zhao, Shuai Zhao, Xiangyu Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Ruihan Zheng, Shaojie Zheng, Tengyang Zheng, Junfeng Zhong, Longguang Zhong, Weiming Zhong, M. Zhou, Runjie Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Liya Zhu, Xinhao Zhu, Yuxuan Zhu, Zhen Zhu, Jingze Zhuang, Weiyu Zhuang, Ying Zou, Xinxing Zu

Main category: cs.CL

TL;DR: Kimi K2.5 is an open-source multimodal agentic model with joint text-vision optimization and Agent Swarm framework for parallel task decomposition and execution.

Details

Motivation: To advance general agentic intelligence by creating a multimodal model that jointly optimizes text and vision modalities to enhance each other, and to develop an efficient parallel agent orchestration framework for complex task execution.

Method: Joint text-vision pre-training, zero-vision SFT (Supervised Fine-Tuning), and joint text-vision reinforcement learning for multimodal optimization. Agent Swarm framework for self-directed parallel agent orchestration that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently.

Result: Achieves state-of-the-art results across coding, vision, reasoning, and agentic tasks. Agent Swarm reduces latency by up to 4.5× compared to single-agent baselines.

Conclusion: Kimi K2.5 successfully demonstrates effective multimodal agentic intelligence with joint text-vision optimization and efficient parallel task execution through Agent Swarm, advancing the field of general agentic intelligence.

Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

[75] CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora, 16 GB RAM, Single-Device Deployment

Paolo Astrino

Main category: cs.CL

TL;DR: CUBO is a local RAG platform for consumer laptops with 16GB RAM that achieves competitive retrieval performance while maintaining GDPR compliance through local-only processing.

Details

Motivation: Organizations need to balance AI capabilities with data privacy regulations like GDPR. Cloud-based AI risks GDPR violations, while existing local systems require too much memory (18-32GB RAM), making them unsuitable for consumer laptops with only 16GB RAM.

Method: CUBO integrates streaming ingestion with O(1) buffer overhead, tiered hybrid retrieval, and hardware-aware orchestration to operate within a strict 15.5GB RAM ceiling. It uses local-only processing to maintain GDPR compliance.

Result: Achieves competitive Recall@10 scores (0.48-0.97 across BEIR domains) with retrieval latencies of 185ms (p50) on consumer laptops, all within the 15.5GB RAM constraint. The 37,000-line codebase validates practical deployability for small-to-medium archives.

Conclusion: CUBO demonstrates that efficient local RAG systems can run on consumer hardware while maintaining GDPR compliance, making AI-powered document analysis accessible to organizations with privacy concerns.

Abstract: Organizations handling sensitive documents face a tension: cloud-based AI risks GDPR violations, while local systems typically require 18-32 GB RAM. This paper presents CUBO, a systems-oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO’s novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration that enables competitive Recall@10 (0.48-0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000-line codebase achieves retrieval latencies of 185 ms (p50) on C1,300 laptops while maintaining data minimization through local-only processing aligned with GDPR Art. 5(1)(c). Evaluation on BEIR benchmarks validates practical deployability for small-to-medium professional archives. The codebase is publicly available at https://github.com/PaoloAstrino/CUBO.

[76] Context Compression via Explicit Information Transmission

Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

Main category: cs.CL

TL;DR: ComprExIT: A lightweight framework for soft context compression in LLMs that uses explicit information transmission over frozen hidden states to overcome limitations of existing self-attention-based compression methods.

Details

Motivation: Long-context inference with LLMs is costly due to quadratic attention and growing key-value caches, motivating context compression. Existing methods that repurpose LLMs as trainable compressors suffer from progressive representation overwriting across layers and uncoordinated allocation of compression capacity across tokens.

Method: ComprExIT formulates soft compression as explicit information transmission over frozen LLM hidden states, decoupling compression from self-attention dynamics. It performs depth-wise transmission to selectively transmit multi-layer information into token anchors, and width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan.

Result: Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters.

Conclusion: Explicit and coordinated information transmission enables more effective and robust long-context compression compared to existing self-attention-based approaches.

Abstract: Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model’s internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.

[77] They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References

Sahil Tripathi, Gautam Siddharth Kashyap, Mehwish Nasim, Jian Yang, Jiechao Gao, Usman Naseem

Main category: cs.CL

TL;DR: CROSS-ALIGN+ is a three-stage framework for meme-based social abuse detection that addresses cultural blindness, boundary ambiguity, and interpretability limitations in existing multimodal approaches.

Details

Motivation: Current meme abuse detection methods fail to capture implicit cultural symbolism, confuse satire vs. abuse, and lack interpretability, requiring a more sophisticated multimodal approach.

Method: Three-stage framework: (1) Enrich multimodal representations with structured knowledge from ConceptNet, Wikidata, Hatebase; (2) Use LoRA adapters to sharpen decision boundaries; (3) Generate cascaded explanations for interpretability.

Result: Outperforms state-of-the-art methods on five benchmarks with eight LVLMs, achieving up to 17% relative F1 improvement while providing interpretable justifications.

Conclusion: CROSS-ALIGN+ effectively addresses key limitations in meme abuse detection through knowledge enrichment, boundary refinement, and explanation generation.

Abstract: Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.

[78] Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Vahab Mirrokni

Main category: cs.CL

TL;DR: Researchers demonstrate how advanced AI models (Gemini variants) can collaborate with humans to solve open problems, refute conjectures, and generate proofs across theoretical computer science and other fields, developing effective human-AI collaboration techniques.

Details

Motivation: To explore whether advanced AI models can contribute to novel, expert-level mathematical discovery beyond routine tasks, and to understand effective human-AI collaboration patterns in theoretical research.

Method: Collection of case studies using Google’s Gemini-based models (Gemini Deep Think and variants) with interactive, conversational methodology, plus specific techniques like using AI as adversarial reviewer and embedding in neuro-symbolic loops for code verification.

Result: Successful collaboration on solving open problems, refuting conjectures, and generating new proofs across theoretical computer science, economics, optimization, and physics, with extracted effective collaboration techniques.

Conclusion: AI can serve as a versatile partner in scientific discovery, not just for automation, with potential demonstrated through interactive collaboration and specialized techniques like adversarial review and neuro-symbolic verification.

Abstract: Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google’s Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a “neuro-symbolic” loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.

[79] Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing

Tong Zheng, Chengsong Huang, Runpeng Dai, Yun He, Rui Liu, Xin Ni, Huiwen Bao, Kaishen Wang, Hongtu Zhu, Jiaxin Huang, Furong Huang, Heng Huang

Main category: cs.CL

TL;DR: Parallel-Probe: A training-free controller that optimizes parallel thinking efficiency through 2D probing, consensus-based early stopping, and deviation-based branch pruning to reduce computational costs while maintaining accuracy.

Details

Motivation: Parallel thinking shows promise for reasoning tasks but imposes significant computational burdens. Existing efficiency methods lack principled mechanisms to exploit global dynamics across parallel branches and rely only on local, per-trajectory signals.

Method: Introduces 2D probing to expose width-depth dynamics by periodically eliciting intermediate answers from all branches. Based on insights from this analysis, develops Parallel-Probe with consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width.

Result: Extensive experiments across three benchmarks and multiple models show Parallel-Probe establishes superior Pareto frontier for test-time scaling. Reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy compared to standard majority voting.

Conclusion: Parallel-Probe effectively optimizes parallel thinking efficiency through principled exploitation of global dynamics, offering significant computational savings without sacrificing reasoning quality.

Abstract: Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce $\textbf{Parallel-Probe}$, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to $\textbf{35.8}$% and total token cost by over $\textbf{25.8}$% while maintaining competitive accuracy.

[80] Merged ChemProt-DrugProt for Relation Extraction from Biomedical Literature

Mai H. Nguyen, Shibani Likhite, Jiawei Tang, Darshini Mahendran, Bridget T. McInnes

Main category: cs.CL

TL;DR: Merging ChemProt and DrugProt datasets improves chemical-gene relation extraction performance, with BioBERT+GCN outperforming BioBERT alone by capturing both local and global context.

Details

Motivation: Chemical-gene relation extraction is crucial for drug discovery and biomedical research, but existing datasets are limited. The paper aims to improve model accuracy by merging datasets and enhancing context understanding.

Method: Created merged dataset from ChemProt and DrugProt, evaluated using BioBERT (for local context) and BioBERT+GCN (for global+local context) for chemical-gene relation extraction.

Result: Merged dataset significantly improved model performance, especially in shared CPR groups. BioBERT+GCN achieved higher precision and recall than BioBERT alone by incorporating global context.

Conclusion: Dataset merging and combining local (BioBERT) with global (GCN) context modeling improves chemical-gene relation extraction, benefiting biomedical applications.

Abstract: The extraction of chemical-gene relations plays a pivotal role in understanding the intricate interactions between chemical compounds and genes, with significant implications for drug discovery, disease understanding, and biomedical research. This paper presents a data set created by merging the ChemProt and DrugProt datasets to augment sample counts and improve model accuracy. We evaluate the merged dataset using two state of the art relationship extraction algorithms: Bidirectional Encoder Representations from Transformers (BERT) specifically BioBERT, and Graph Convolutional Networks (GCNs) combined with BioBERT. While BioBERT excels at capturing local contexts, it may benefit from incorporating global information essential for understanding chemical-gene interactions. This can be achieved by integrating GCNs with BioBERT to harness both global and local context. Our results show that by integrating the ChemProt and DrugProt datasets, we demonstrated significant improvements in model performance, particularly in CPR groups shared between the datasets. Incorporating the global context using GCN can help increase the overall precision and recall in some of the CPR groups over using just BioBERT.

[81] A Syntax-Injected Approach for Faster and More Accurate Sentiment Analysis

Muhammad Imran, Olga Kellert, Carlos Gómez-Rodríguez

Main category: cs.CL

TL;DR: A sequence labeling syntactic parser (SELSP) is proposed to speed up syntax-based sentiment analysis by reformulating dependency parsing as a sequence labeling task, achieving better speed and accuracy than conventional parsers and transformers.

Details

Motivation: Syntactic parsing improves sentiment analysis accuracy and explainability but creates computational bottlenecks due to slow parsing algorithms. The paper aims to address this efficiency problem in syntax-based sentiment analysis.

Method: Proposes SELSP (Sequence Labeling Syntactic Parser) that reformulates dependency parsing as a sequence labeling task, integrating syntactic information into sentiment analysis via a rule-based pipeline. Compares with conventional parsers (Stanza), heuristic approaches (VADER), and Transformer models.

Result: SELSP demonstrates greater speed and accuracy than conventional parsers like Stanza and heuristic approaches like VADER. It also outperforms Transformer-based models in speed for polarity prediction. Dictionaries accounting for polarity judgment variation yield better performance.

Conclusion: SELSP provides an efficient solution to the computational bottleneck in syntax-based sentiment analysis, offering both speed and accuracy advantages over existing methods, making it attractive for academic and industrial applications.

Abstract: Sentiment Analysis (SA) is a crucial aspect of Natural Language Processing (NLP), focusing on identifying and interpreting subjective assessments in textual content. Syntactic parsing is useful in SA as it improves accuracy and provides explainability; however, it often becomes a computational bottleneck due to slow parsing algorithms. This article proposes a solution to this bottleneck by using a Sequence Labeling Syntactic Parser (SELSP) to integrate syntactic information into SA via a rule-based sentiment analysis pipeline. By reformulating dependency parsing as a sequence labeling task, we significantly improve the efficiency of syntax-based SA. SELSP is trained and evaluated on a ternary polarity classification task, demonstrating greater speed and accuracy compared to conventional parsers like Stanza and heuristic approaches such as Valence Aware Dictionary and sEntiment Reasoner (VADER). The combination of speed and accuracy makes SELSP especially attractive for sentiment analysis applications in both academic and industrial contexts. Moreover, we compare SELSP with Transformer-based models trained on a 5-label classification task. In addition, we evaluate multiple sentiment dictionaries with SELSP to determine which yields the best performance in polarity prediction. The results show that dictionaries accounting for polarity judgment variation outperform those that ignore it. Furthermore, we show that SELSP outperforms Transformer-based models in terms of speed for polarity prediction.

[82] Inferring Scientific Cross-Document Coreference and Hierarchy with Definition-Augmented Relational Reasoning

Lior Forer, Tom Hope

Main category: cs.CL

TL;DR: A novel method for cross-document coreference and hierarchy inference in scientific texts using context-dependent definitions and relational definitions to enhance LLM performance on technical concepts with nuanced variations.

Details

Motivation: Large Language Models struggle with long-tail technical concepts having nuanced variations in scientific texts, which hinders cross-document coreference and hierarchy inference needed for knowledge graph construction, search, recommendation and discovery.

Method: Generates context-dependent definitions of concept mentions by retrieving full-text literature, uses these definitions to enhance cross-document relation detection, generates relational definitions describing how two concepts are related/different, and employs an efficient re-ranking approach to handle combinatorial explosion in cross-paper link inference.

Result: Achieves large performance gains in both fine-tuning and in-context learning settings, particularly on data subsets with high amounts of different surface forms and ambiguity that are challenging for models.

Conclusion: The approach effectively addresses LLM limitations on fine-grained scientific concepts and provides insights into LLMs’ relational reasoning abilities through analysis of generated definitions.

Abstract: We address the fundamental task of inferring cross-document coreference and hierarchy in scientific texts, which has important applications in knowledge graph construction, search, recommendation and discovery. Large Language Models (LLMs) can struggle when faced with many long-tail technical concepts with nuanced variations. We present a novel method which generates context-dependent definitions of concept mentions by retrieving full-text literature, and uses the definitions to enhance detection of cross-document relations. We further generate relational definitions, which describe how two concept mentions are related or different, and design an efficient re-ranking approach to address the combinatorial explosion involved in inferring links across papers. In both fine-tuning and in-context learning settings, we achieve large gains in performance on data subsets with high amount of different surfaces forms and ambiguity, that are challenging for models. We provide analysis of generated definitions, shedding light on the relational reasoning ability of LLMs over fine-grained scientific concepts.

[83] MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

Main category: cs.CL

TL;DR: MemoryFormer reduces transformer computational complexity by replacing linear projections with in-memory lookup tables and hash-based retrieval instead of matrix multiplication.

Details

Motivation: To address the growing computational complexity of large language models as they scale up for higher performance, by finding more efficient alternatives to traditional transformer operations.

Method: Replaces linear projection layers with in-memory lookup tables storing discrete vectors, uses hash algorithms to dynamically retrieve relevant vectors based on input embeddings, and combines retrieved vectors to approximate matrix multiplication results.

Result: MemoryFormer significantly reduces computational complexity (FLOPs) while maintaining performance across various benchmarks when trained from scratch.

Conclusion: MemoryFormer offers a novel, computationally efficient transformer architecture that reduces FLOPs by replacing expensive matrix multiplications with cheaper memory retrieval operations.

Abstract: In order to reduce the computational complexity of large language models, great efforts have been made to to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large amount of discrete vectors to replace the weight matrix used in linear projection. We then use a hash algorithm to retrieve a correlated subset of vectors dynamically based on the input embedding. The retrieved vectors combined together will form the output embedding, which provides an estimation of the result of matrix multiplication operation in a fully-connected layer. Compared to conducting matrix multiplication, retrieving data blocks from memory is a much cheaper operation which requires little computations. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.

[84] Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Lujain Ibrahim, Canfer Akbulut, Rasmi Elasmar, Charvi Rastogi, Minsuk Kahng, Meredith Ringel Morris, Kevin R. McKee, Verena Rieser, Murray Shanahan, Laura Weidinger

Main category: cs.CL

TL;DR: Novel method for evaluating anthropomorphic behaviors in LLMs using multi-turn interactions, automated simulations, and large-scale human validation study.

Details

Motivation: Growing interest in how users anthropomorphize LLMs among AI developers, researchers, and policymakers, requiring better empirical evaluation methods beyond single-turn static benchmarks.

Method: Three methodological advances: 1) Multi-turn evaluation of 14 anthropomorphic behaviors, 2) Scalable automated approach using user interaction simulations, 3) Large-scale human subject study (N=1101) to validate that measured behaviors predict real users’ anthropomorphic perceptions.

Result: All SOTA LLMs exhibit similar anthropomorphic behaviors characterized by relationship-building (empathy, validation) and first-person pronoun use, with most behaviors only emerging after multiple turns.

Conclusion: Establishes empirical foundation for investigating how design choices influence anthropomorphic model behaviors and advances ethical debate about desirability of such behaviors, demonstrating necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

Abstract: The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users’ anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.

[85] Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs

Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst

Main category: cs.CL

TL;DR: LLMs can match specialized models for layout-aware information extraction when properly configured with optimized pipeline adjustments.

Details

Motivation: To explore how large language models can be effectively used for information extraction from layout-rich documents, addressing challenges of data structuring, model engagement, and output refinement.

Method: Defines design space for layout-aware IE with LLMs, investigates sub-problems like input representation and prompting, develops LayIE-LLM test suite, uses one-factor-at-a-time method to find optimal configurations.

Result: Optimized LLM configurations achieve 13.3-37.5 F1 points improvement over baseline; OFAT method finds near-optimal configurations with only 2.8% of computation cost; well-configured LLMs match specialized model performance.

Conclusion: General-purpose LLMs can provide cost-effective, finetuning-free alternatives to specialized IE models when properly configured for layout-aware document understanding.

Abstract: This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3–37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find a well-working configuration, we develop a one-factor-at-a-time (OFAT) method that achieves near-optimal results. Our method is only 0.8–1.8 points lower than the best full factorial exploration with a fraction (2.8%) of the required computation. Overall, we demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, finetuning-free alternative. Our test-suite is available at https://github.com/gayecolakoglu/LayIE-LLM.

[86] Align to Structure: Aligning Large Language Models with Structural Information

Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, Dongyeop Kang

Main category: cs.CL

TL;DR: Structural Alignment method aligns LLMs with human discourse structures using reinforcement learning with dense token-level rewards to improve long-form text coherence and organization.

Details

Motivation: Large language models struggle with generating long, coherent text due to lack of hierarchical planning and structured organization in discourse generation, necessitating methods that incorporate human-like discourse structures.

Method: Integrates linguistically grounded discourse frameworks into reinforcement learning using Proximal Policy Optimization with dense token-level rewards. Uses two complementary reward models: one for surface-level textual features (readability) and another for global discourse patterns (coherence and rhetorical sophistication).

Result: Outperforms both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization by producing more coherent and well-organized outputs.

Conclusion: Structural Alignment effectively enhances long-form text generation by aligning LLMs with human discourse structures through reinforcement learning with fine-grained rewards.

Abstract: Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at https://github.com/minnesotanlp/struct_align.

[87] Don’t Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz

Main category: cs.CL

TL;DR: Short reasoning chains in LLMs often outperform long chains, leading to a new inference method (short-m@k) that uses parallel generation and early stopping for faster, more accurate reasoning.

Details

Motivation: Current reasoning LLMs rely on extensive "thinking" chains that incur high computational costs and inference time. The authors challenge the assumption that longer chains lead to better reasoning, noting the computational inefficiency of current approaches.

Method: Proposes short-m@k inference method: execute k independent generations in parallel, halt when first m thinking processes complete, then use majority voting among these m chains. Also finetunes LLMs using short, long, and randomly selected reasoning chains to compare training effectiveness.

Result: Shorter reasoning chains are up to 34.5% more accurate than longest chains. short-1@k matches or exceeds standard majority voting with 40% fewer thinking tokens. short-3@k consistently surpasses majority voting across all compute budgets with up to 33% wall time reduction. Training on shorter chains yields better performance.

Conclusion: Longer “thinking” in reasoning LLMs doesn’t necessarily improve performance and can degrade results. The short-m@k method offers more efficient inference by leveraging shorter, more accurate reasoning chains, challenging current test-time compute paradigms.

Abstract: Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive “thinking” chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that long thinking chains results in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen using majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall time reduction). To further validate our findings, we finetune LLMs using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer “thinking” does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.

[88] v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu

Main category: cs.CL

TL;DR: v1 is a multimodal language model extension that enables active visual referencing through point-and-copy mechanisms, allowing models to revisit image patches during reasoning rather than encoding images once.

Details

Motivation: Current multimodal language models encode images once into key-value cache and then reason purely in text, making it hard to re-ground intermediate reasoning steps. As reasoning chains lengthen, models progressively lose focus on relevant visual regions, unlike humans who revisit visual evidence while reasoning.

Method: Introduces v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Uses semantic representations as keys for patch retrieval to ensure perceptual evidence remains aligned with reasoning space. Trained on v1 dataset of 300K multimodal reasoning traces with interleaved grounding annotations.

Result: Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines. The model demonstrates improved ability to maintain focus on relevant visual regions during extended reasoning chains.

Conclusion: Active visual referencing through point-and-copy mechanisms significantly improves multimodal reasoning by allowing models to revisit visual evidence during reasoning, addressing the limitation of single-encoding approaches. The method shows promise for enhancing visual grounding in multimodal language models.

Abstract: When thinking with images, humans rarely rely on a single glance: they revisit visual evidence while reasoning. In contrast, most Multimodal Language Models encode an image once to key-value cache and then reason purely in text, making it hard to re-ground intermediate steps. We empirically confirm this: as reasoning chains lengthen, models progressively lose focus on relevant regions. We introduce v1, a lightweight extension for active visual referencing via point-and-copy: the model selects relevant image patches and copies their embeddings back into the reasoning stream. Crucially, our point-and-copy mechanism retrieves patches using their semantic representations as keys, ensuring perceptual evidence remains aligned with the reasoning space. To train this behavior, we build v1, a dataset of 300K multimodal reasoning traces with interleaved grounding annotations. Across multimodal mathematical reasoning benchmarks, v1 consistently outperforms comparable baselines. We plan to release the model checkpoint and data.

[89] Reward Model Interpretability via Optimal and Pessimal Tokens

Brian Christian, Hannah Rose Kirk, Jessica A. F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

Main category: cs.CL

TL;DR: Reward models show significant heterogeneity, biases, and sensitivity to prompt framing when analyzed across vocabulary space, challenging their reliability as proxies for human values.

Details

Motivation: Reward models are crucial for aligning LLMs with human values but remain understudied. The paper aims to understand their internal workings, biases, and reliability through comprehensive analysis.

Method: Exhaustive analysis of reward models’ responses across entire vocabulary space, examining single-token responses to value-laden prompts across ten open-source reward models with varying architectures.

Result: Found substantial heterogeneity between models, systematic asymmetries in encoding high/low-scoring tokens, sensitivity to prompt framing mirroring human biases, overvaluation of frequent tokens, and concerning biases toward identity groups.

Conclusion: Reward models are not interchangeable and may encode problematic biases that propagate through downstream LLMs, challenging their suitability as proxies for complex human values.

Abstract: Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves – which directly encode human value judgments by turning prompt-response pairs into scalar rewards – remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training – distortions that risk propagating through the downstream large language models now deployed to millions.

[90] Surprisal from Larger Transformer-based Language Models Predicts fMRI Data More Poorly

Yi-Chien Lin, William Schuler

Main category: cs.CL

TL;DR: Inverse scaling relationship between Transformer LM surprisal and predictive power for human language processing holds for fMRI brain imaging data, not just latency measures.

Details

Motivation: Previous studies found inverse scaling where larger LMs are less predictive of human reading times, but this was only shown for latency data. The study aims to test if this inverse scaling phenomenon extends to brain imaging (fMRI) data.

Method: Comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two fMRI datasets, testing the relationship between models’ per-word estimated probability and model fit on brain imaging data.

Result: Results show the inverse scaling relationship between models’ per-word estimated probability and model fit on both fMRI datasets, confirming the trend extends beyond latency-based measures.

Conclusion: The inverse scaling phenomenon is not specific to latency data and holds for brain imaging measures, resolving previous inconclusive results about LM surprisal’s predictive power for human language processing.

Abstract: There has been considerable interest in using surprisal from Transformer-based language models (LMs) as predictors of human sentence processing difficulty. Recent work has observed an inverse scaling relationship between Transformers’ per-word estimated probability and the predictive power of their surprisal estimates on reading times, showing that LMs with more parameters and trained on more data are less predictive of human reading times. However, these studies focused on predicting latency-based measures. Tests on brain imaging data have not shown a trend in any direction when using a relatively small set of LMs, leaving open the possibility that the inverse scaling phenomenon is constrained to latency data. This study therefore conducted a more comprehensive evaluation using surprisal estimates from 17 pre-trained LMs across three different LM families on two functional magnetic resonance imaging (fMRI) datasets. Results show that the inverse scaling relationship between models’ per-word estimated probability and model fit on both datasets still obtains, resolving the inconclusive results of previous work and indicating that this trend is not specific to latency-based measures.

[91] Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

Main category: cs.CL

TL;DR: Modern language models show surprising robustness to non-canonical tokenizations unseen during training, retaining most performance and even improving on certain tasks with alternative tokenization schemes.

Details

Motivation: To investigate whether language models are overly dependent on their training tokenizer and to explore if alternative tokenization schemes at inference time could improve performance on specific tasks.

Method: Systematically evaluate instruction-tuned models across 20 benchmarks using non-canonical tokenizations (random tokenization, character-level tokenization), analyze performance patterns, and investigate the source of robustness through comparisons between base and post-trained models.

Result: Models retain up to 93.4% performance with random tokenization and 90.8% with character-level tokenization. Character-level segmentation improves string manipulation and code tasks by +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Robustness emerges during instruction-tuning phase.

Conclusion: Language models are less tied to their tokenizer than previously believed, and inference-time tokenization interventions can boost performance on specific tasks, with robustness arising from instruction-tuning rather than pre-training.

Abstract: Modern tokenizers employ deterministic algorithms to map text into a single “canonical” token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the robustness of LMs to text encoded with non-canonical tokenizations entirely unseen during training. Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. We show that while both base and post-trained models grasp the semantics of non-canonical tokenizations (perceiving them as containing misspellings), base models try to mimic the imagined mistakes and degenerate into nonsensical output, while post-trained models are committed to fluent responses. Overall, our findings suggest that models are less tied to their tokenizer than previously believed, and demonstrate the promise of intervening on tokenization at inference time to boost performance.

[92] Understanding Verbatim Memorization in LLMs Through Circuit Discovery

Ilya Lasy, Peter Knees, Stefan Woltran

Main category: cs.CL

TL;DR: The paper investigates memorization mechanisms in LLMs using transformer circuits, identifying separate circuits for initiating vs maintaining memorization, with prevention mechanisms being more transferable across domains than induction mechanisms.

Details

Motivation: The paper aims to understand the underlying mechanisms of memorization in LLMs - specifically what parts of the network decide to retrieve memorized content and how model behavior differs when producing memorized vs non-memorized text.

Method: Uses mechanistic interpretability with transformer circuits (minimal computational subgraphs) and carefully constructed contrastive datasets to identify where model generation diverges from memorized content and isolate specific circuits for different aspects of memorization.

Result: Found that circuits initiating memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger initiation. Memorization prevention mechanisms transfer robustly across text domains, while memorization induction appears more context-dependent.

Conclusion: Provides insights into the mechanistic differences between memorization initiation and maintenance in LLMs, revealing domain transfer properties that could inform better memorization control strategies.

Abstract: Underlying mechanisms of memorization in LLMs – the verbatim reproduction of training data – remain poorly understood. What exact part of the network decides to retrieve a token that we would consider as start of memorization sequence? How exactly is the models’ behaviour different when producing memorized sentence vs non-memorized? In this work we approach these questions from mechanistic interpretability standpoint by utilizing transformer circuits – the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot trigger its initiation. Intriguingly, memorization prevention mechanisms transfer robustly across different text domains, while memorization induction appears more context-dependent.

[93] Evaluating Scoring Bias in LLM-as-a-Judge

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

Main category: cs.CL

TL;DR: First dedicated examination of scoring bias in LLM-as-a-Judge paradigm, identifying three novel bias types in scoring prompts and proposing framework to quantify them.

Details

Motivation: Existing research on LLM-as-a-Judge biases has focused on comparative evaluations, while scoring-based evaluations (assigning absolute scores) remain under-investigated despite being more practical in industrial applications. There's a need to examine biases originating from scoring prompts themselves rather than evaluation targets.

Method: Formally defined scoring bias and identified three novel types: rubric order bias, score ID bias, and reference answer score bias. Proposed comprehensive framework with multi-faceted metrics and automatic data synthesis pipeline to create tailored evaluation corpus for quantifying these biases.

Result: Experiments empirically demonstrate that even the most advanced LLMs suffer from substantial scoring biases. The analysis provides actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

Conclusion: Scoring bias in LLM judges is a significant, previously under-investigated problem that requires attention, especially for industrial applications. The proposed framework and identified bias types provide foundation for developing more reliable automated evaluation systems.

Abstract: The “LLM-as-a-Judge” paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations-which assign an absolute score and are often more practical in industrial applications-remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

[94] The Generalization Ridge: Information Flow in Natural Language Generation

Ruidi Chang, Chunyuan Deng, Hanjie Chen

Main category: cs.CL

TL;DR: InfoRidge: An information-theoretic framework analyzing how predictive information (mutual information between hidden representations and target outputs) varies across transformer layers during training, revealing a non-monotonic “generalization ridge” in intermediate layers.

Details

Motivation: Transformer language models achieve state-of-the-art NLG performance, but their internal mechanisms for synthesizing task-relevant information remain poorly understood. While prior work suggests intermediate layers have more generalizable representations, how this generalization emerges and propagates during training is unclear.

Method: Proposed InfoRidge framework uses information theory to measure predictive information (mutual information between hidden representations and target outputs) across layers during training. Conducted experiments across various models and datasets with complementary analyses using residual scaling, attention patterns, and controlled model capacity.

Result: Consistent non-monotonic trend: predictive information peaks in intermediate layers (forming a “generalization ridge”) before declining in final layers, reflecting transition between generalization and memorization. This ridge phenomenon persists across decoding steps in multiple-token generation experiments.

Conclusion: Findings offer new insights into transformer internal mechanisms and underscore critical role of intermediate layers in supporting generalization. The generalization ridge phenomenon reveals how transformers balance generalization and memorization across layers.

Abstract: Transformer-based language models have achieved state-of-the-art performance in natural language generation (NLG), yet their internal mechanisms for synthesizing task-relevant information remain insufficiently understood. While prior studies suggest that intermediate layers often yield more generalizable representations than final layers, how this generalization ability emerges and propagates across layers during training remains unclear. To address this gap, we propose InfoRidge, an information-theoretic framework, to characterize how predictive information-the mutual information between hidden representations and target outputs-varies across depth during training. Our experiments across various models and datasets reveal a consistent non-monotonic trend: predictive information peaks in intermediate layers-forming a generalization ridge-before declining in final layers, reflecting a transition between generalization and memorization. To further investigate this phenomenon, we conduct a set of complementary analyses that leverage residual scaling, attention pattern, and controlled model capacity to characterize layer-wise functional specialization. We further validate our findings with multiple-token generation experiments, verifying that the observed ridge phenomenon persists across decoding steps. Together, these findings offer new insights into the internal mechanisms of transformers and underscore the critical role of intermediate layers in supporting generalization.

[95] GeoResponder: Towards Building Geospatial LLMs for Time-Critical Disaster Response

Ahmed El Fekih Zguir, Ferda Ofli, Muhammad Imran

Main category: cs.CL

TL;DR: GeoResponder is a framework that enhances LLMs’ geospatial reasoning capabilities through scaffolded instruction-tuning, enabling better disaster response by understanding road networks, coordinates, and infrastructure locations.

Details

Motivation: Current LLMs lack geospatial capabilities needed for time-critical disaster response, where reasoning about road networks, continuous coordinates, and access to essential infrastructure (hospitals, shelters, pharmacies) is vital.

Method: Introduces GeoResponder framework with scaffolded instruction-tuning curriculum that stratifies geospatial learning into different cognitive layers, anchoring semantic knowledge to continuous coordinate manifold and enforcing internalization of spatial axioms.

Result: Extensive evaluations across four topologically distinct cities and diverse tasks show GeoResponder significantly outperforms both state-of-the-art foundation models and domain-specific baselines.

Conclusion: LLMs can begin to internalize and generalize geospatial structures, pointing toward future development of language models capable of supporting disaster response needs.

Abstract: Large Language Models excel at linguistic tasks but lack the inner geospatial capabilities needed for time-critical disaster response, where reasoning about road networks, continuous coordinates, and access to essential infrastructure such as hospitals, shelters, and pharmacies is vital. We introduce GeoResponder, a framework that instills robust spatial reasoning through a scaffolded instruction-tuning curriculum. By stratifying geospatial learning into different cognitive layers, we effectively anchor semantic knowledge to the continuous coordinate manifold and enforce the internalization of spatial axioms. Extensive evaluations across four topologically distinct cities and diverse tasks demonstrate that GeoResponder significantly outperforms both state-of-the-art foundation models and domain-specific baselines. These results suggest that LLMs can begin to internalize and generalize geospatial structures, pointing toward the future development of language models capable of supporting disaster response needs.

[96] LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals

Samuel Yeh, Sharon Li, Tanwi Mallick

Main category: cs.CL

TL;DR: LUMINA is a framework that detects hallucinations in Retrieval-Augmented Generation (RAG) systems by quantifying context-knowledge signals through distributional distance for external context utilization and tracking token evolution across transformer layers for internal knowledge utilization.

Details

Motivation: RAG-based LLMs still hallucinate even with correct context, due to imbalance between external context and internal knowledge usage. Existing methods require extensive hyperparameter tuning, limiting generalizability.

Method: Quantifies external context utilization via distributional distance, measures internal knowledge utilization by tracking how predicted tokens evolve across transformer layers, and introduces statistical validation framework.

Result: Achieves consistently high AUROC and AUPRC scores, outperforming prior methods by up to +13% AUROC on HalluRAG benchmark, robust under relaxed assumptions about retrieval quality and model matching.

Conclusion: LUMINA provides effective and practical hallucination detection for RAG systems without extensive hyperparameter tuning, offering statistical validation for context-knowledge signals.

Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context–knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality. LUMINA: https://github.com/deeplearning-wisc/LUMINA

[97] A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Wonje Jeung, Sangyeon Yoon, Yoonjun Cho, Dongjae Jeon, Sangwoo Shin, Hyesoo Hong, Albert No

Main category: cs.CL

TL;DR: A2D is a token-level alignment method for diffusion large language models that enables any-order, any-step defense against harmful content by training models to emit [EOS] refusal signals when detecting unsafe content during generation.

Details

Motivation: Diffusion LLMs' any-order generation flexibility creates security vulnerabilities where harmful content can appear at arbitrary positions, and template-based prefilling attacks like DIJA can bypass response-level refusal mechanisms.

Method: A2D aligns dLLMs at token-level using randomized masking to train models to emit [EOS] refusal signals whenever harmful content arises, enabling real-time monitoring and automatic termination of unsafe continuations.

Result: A2D reduces DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B) and enables up to 19.3x faster safe termination through thresholded [EOS] probability monitoring.

Conclusion: Token-level alignment with randomized masking provides robust defense against any-order and any-step attacks in diffusion LLMs, enabling both prevention of harmful outputs and real-time safety monitoring.

Abstract: Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80% to near-zero (1.3% on LLaDA-8B-Instruct, 0.0% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3x faster safe termination.

[98] Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning

Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Xingzhang Ren, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang

Main category: cs.CL

TL;DR: Q-Tuning: A unified framework for efficient LLM fine-tuning that jointly optimizes sample and token pruning using an Error-Uncertainty Plane diagnostic framework.

Details

Motivation: As supervised fine-tuning becomes increasingly compute-intensive, data efficiency is critical for aligning LLMs under tight budgets. Existing data pruning methods operate in isolation at either sample or token level, failing to jointly optimize both dimensions, leading to inefficiencies.

Method: Introduces Error-Uncertainty (EU) Plane diagnostic framework to characterize data utility across samples and tokens. Q-Tuning uses two-stage strategy: 1) sample-level triage to retain examples with informative misconceptions or calibration signals, 2) asymmetric token-pruning policy with context-aware scoring to trim less salient tokens from misconception samples while preserving calibration samples entirely.

Result: Sets new SOTA across five diverse benchmarks. On SmolLM2-1.7B, achieves +38% average improvement over full-data SFT baseline using only 12.5% of original training data. First dynamic pruning approach to consistently outperform full-data training.

Conclusion: Q-Tuning provides practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT by jointly optimizing sample and token pruning through unified framework.

Abstract: As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies–high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.

[99] What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

Main category: cs.CL

TL;DR: MathLens benchmark decomposes multimodal geometry problem-solving into Perception, Reasoning, and Integration components, revealing that different training strategies develop distinct capability profiles invisible in aggregate accuracy scores.

Details

Motivation: Current multimodal reasoning evaluation reduces complex capabilities to single accuracy scores, treating reasoning as unitary. The authors aim to expose this oversimplification by decomposing performance into specific components to better understand how different training approaches affect various subskills.

Method: Develop MathLens benchmark with textbook-style geometry problems derived from symbolic specifications, accompanied by visual diagrams, text-only variants, multimodal questions, and targeted perceptual probes. This enables controlled measurement of Perception (visual understanding), Reasoning (logical processing), and Integration (combining modalities).

Result: Different training strategies yield systematically different capability profiles: reinforcement learning primarily improves perceptual grounding and robustness to diagram variation, while textual SFT yields gains through reflective reasoning. As perception and reasoning improve, integration errors become the dominant failure mode.

Conclusion: Progress in multimodal reasoning reflects shifting balances among subskills rather than uniform advancement, motivating evaluation beyond scalar accuracy. The decomposition approach reveals hidden capability profiles and failure modes invisible in aggregate metrics.

Abstract: Evaluation of multimodal reasoning models is typically reduced to a single accuracy score, implicitly treating reasoning as a unitary capability. We introduce MathLens, a benchmark of textbook-style geometry problems that exposes this assumption by operationally decomposing performance into Perception, Reasoning, and Integration. Each problem is derived from a symbolic specification and accompanied by visual diagrams, text-only variants, multimodal questions, and targeted perceptual probes, enabling controlled measurement of each component. Using this decomposition, we show that common training strategies induce systematically different capability profiles that are invisible under aggregate accuracy. Reinforcement learning primarily improves perceptual grounding and robustness to diagram variation, while textual SFT yields gains through reflective reasoning. In contrast, as perception and reasoning improve, a growing fraction of remaining errors fall outside these components and are categorized as integration. These results suggest that apparent progress in multimodal reasoning reflects shifting balances among subskills rather than uniform advancement, motivating evaluation beyond scalar accuracy.

[100] Reusing Overtrained Language Models Saturates Scaling

Seng Pei Liew, Takuya Kato

Main category: cs.CL

TL;DR: Scaling efficiency of model reuse diminishes predictably: scaling exponent decreases logarithmically with base model pretraining tokens, revealing saturation effect in multi-stage pretraining.

Details

Motivation: To understand the effectiveness of reusing pretrained base models for further pretraining (continual pretraining or model growth), especially when applied to overtrained base models, and to study the scaling properties of model reuse.

Method: Empirical study of scaling properties of model reuse, analyzing how scaling exponent with respect to second-stage training tokens decreases logarithmically with the number of tokens used to pretrain the base model.

Result: Found that scaling efficiency diminishes in a predictable manner, with joint dependence on first- and second-stage tokens accurately modeled by a simple scaling law, revealing saturation effect in multi-stage pretraining.

Conclusion: There’s a fundamental trade-off in multi-stage pretraining: the more extensively a base model is pretrained, the less benefit additional pretraining provides, offering practical insights for efficient language model training and raising considerations for reuse of overtrained models.

Abstract: Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, the effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling properties of model reuse and find that the scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage training tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a base model is pretrained, the less benefit additional pretraining provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.

[101] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin

Main category: cs.CL

TL;DR: A training-free fingerprinting method using weight matrices and Linear Assignment Problem with unbiased Centered Kernel Alignment similarity to verify LLM lineage despite intensive post-training modifications.

Details

Motivation: Protecting intellectual property of LLMs is crucial due to substantial training resources. Need reliable methods to determine if suspect LLMs are trained from scratch or derived from existing models, despite challenges from intensive post-training processes like fine-tuning, continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling.

Method: Proposes training-free fingerprinting based on weight matrices. Uses Linear Assignment Problem (LAP) and unbiased Centered Kernel Alignment (CKA) similarity to neutralize effects of parameter manipulations. Creates robust similarity metric for model lineage verification.

Result: Tested on 60 positive and 90 negative model pairs. Demonstrates exceptional robustness against all six post-training categories with near-zero false positive risk. Achieves perfect scores on all classification metrics. Computation completes within 30s on NVIDIA 3090 GPU.

Conclusion: Method establishes strong basis for reliable model lineage verification. Provides effective solution for intellectual property protection of LLMs despite intensive post-training modifications.

Abstract: Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo-such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling-pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at https://github.com/LUMIA-Group/AWM.

[102] Causality Guided Representation Learning for Cross-Style Hate Speech Detection

Chengshuai Zhao, Shu Wan, Paras Sheth, Karan Patwa, K. Selçuk Candan, Huan Liu

Main category: cs.CL

TL;DR: CADET: A causal representation learning framework for hate speech detection that disentangles latent factors (context, motivation, target, style) to isolate genuine hate intent from superficial linguistic cues, enabling counterfactual reasoning for robust detection across diverse styles.

Details

Motivation: Implicit hate speech (using sarcasm, irony, stereotypes, coded language) is hard to detect compared to explicit hate. Existing models rely on surface-level linguistic cues and fail to generalize across diverse stylistic variations. Different platforms have distinct hate speech styles, creating spurious correlations between style and labels.

Method: Models hate speech generation as a causal graph involving contextual environment, creator motivation, target, and style. Uses causal representation learning to disentangle these interpretable latent factors and control confounders. Allows counterfactual reasoning by intervening on style within the latent space to identify hate speech in varying forms.

Result: CADET demonstrates superior performance in comprehensive experiments, showing the potential of causal priors in advancing generalizable hate speech detection.

Conclusion: Causal representation learning with disentangled latent factors and counterfactual reasoning enables robust hate speech detection across diverse stylistic variations, addressing limitations of surface-level approaches.

Abstract: The proliferation of online hate speech poses a significant threat to the harmony of the web. While explicit hate is easily recognized through overt slurs, implicit hate speech is often conveyed through sarcasm, irony, stereotypes, or coded language – making it harder to detect. Existing hate speech detection models, which predominantly rely on surface-level linguistic cues, fail to generalize effectively across diverse stylistic variations. Moreover, hate speech spread on different platforms often targets distinct groups and adopts unique styles, potentially inducing spurious correlations between them and labels, further challenging current detection approaches. Motivated by these observations, we hypothesize that the generation of hate speech can be modeled as a causal graph involving key factors: contextual environment, creator motivation, target, and style. Guided by this graph, we propose CADET, a causal representation learning framework that disentangles hate speech into interpretable latent factors and then controls confounders, thereby isolating genuine hate intent from superficial linguistic cues. Furthermore, CADET allows counterfactual reasoning by intervening on style within the latent space, naturally guiding the model to robustly identify hate speech in varying forms. CADET demonstrates superior performance in comprehensive experiments, highlighting the potential of causal priors in advancing generalizable hate speech detection.

[103] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang

Main category: cs.CL

TL;DR: OpenRubrics introduces a large-scale collection of prompt-rubric pairs and Contrastive Rubric Generation (CRG) to create structured evaluation criteria for reward modeling, improving over traditional scalar/pairwise judgments.

Details

Motivation: Traditional reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Rubrics-as-rewards (RaR) can capture multiple dimensions but face challenges in reliability and scalability.

Method: 1) OpenRubrics: large-scale collection of (prompt, rubric) pairs; 2) Contrastive Rubric Generation (CRG): derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses; 3) Noise removal via preference-label consistency preservation.

Result: Rubric-RM surpasses strong size-matched baselines by 8.4% across multiple reward-modeling benchmarks. Gains transfer to policy models on instruction-following and biomedical benchmarks.

Conclusion: Structured rubric-based reward modeling using OpenRubrics and CRG provides more comprehensive evaluation signals than traditional approaches, leading to improved performance in RLHF applications.

Abstract: Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further remove noisy rubrics via preserving preference-label consistency. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 8.4%. These gains transfer to policy models on instruction-following and biomedical benchmarks.

[104] On the Interplay between Human Label Variation and Model Fairness

Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau

Main category: cs.CL

TL;DR: HLV training methods can improve model fairness without explicit debiasing under certain conditions

Details

Motivation: To explore the unexplored topic of how human label variation (HLV) affects model fairness, examining the interplay between HLV methods and fairness outcomes

Method: Compare training on majority-vote labels with various HLV methods, conducting experiments to assess fairness impacts

Result: HLV training methods have a positive impact on fairness under certain configurations without requiring explicit debiasing techniques

Conclusion: Human label variation methods can contribute to improved model fairness, offering an alternative or complementary approach to explicit debiasing methods

Abstract: The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness under certain configurations.

[105] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Main category: cs.CL

TL;DR: DynaSpec introduces dynamic shortlisting for speculative decoding in large-vocabulary LLMs, using lightweight meta-classifiers to route contexts to token clusters, improving acceptance rates and throughput compared to static approaches.

Details

Motivation: As LLM vocabularies scale past 100k tokens, speculative decoding faces bottlenecks: the drafter's O(|V|d) output projection becomes expensive. Static shortlisting approaches (like FR-Spec, VocabTrim) use fixed frequency-ranked shortlists but suppress rare/domain-specific tokens, reducing acceptance rates and limiting speedups.

Method: DynaSpec uses lightweight meta-classifiers to route each context to a small set of coarse token clusters. The union of top-selected clusters defines the drafter’s dynamic shortlist, while the target model still verifies over the full vocabulary (preserving exactness). System-wise, routing is overlapped with draft computation via parallel execution streams to reduce overhead.

Result: Across standard benchmarks, DynaSpec recovers 98.4% of full-vocabulary performance for Llama-3-8B (vs 93.6% for fixed-shortlist baselines) and achieves up to 2.23x throughput gain (vs 1.91x for static approaches) on datasets with rare tokens.

Conclusion: DynaSpec enables efficient speculative decoding for large-vocabulary LLMs by dynamically adapting shortlists to context, significantly improving acceptance rates and throughput while preserving exact verification.

Abstract: Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens,verification cost in the target model is largely unchanged, but the drafter can become bottlenecked by its O(|V|d) output projection. Recent approaches (e.g., FR-Spec, VocabTrim) mitigate this by restricting drafting to a fixed, frequency-ranked shortlist; however, such static truncation is corpus-dependent and suppresses rare or domain-specific tokens, reducing acceptance and limiting speedups. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism for large-vocabulary speculative decoding. DynaSpec trains lightweight meta-classifiers that route each context to a small set of coarse token clusters; the union of the top-selected clusters defines the drafter’s shortlist, while the target model still verifies over the full vocabulary, preserving exactness. Systems-wise, routing is overlapped with draft computation via parallel execution streams, reducing end-to-end overhead. Across standard speculative decoding benchmarks, DynaSpec consistently improves mean accepted length-recovering 98.4% of full-vocabulary performance for Llama-3-8B versus 93.6% for fixed-shortlist baselines-and achieves up to a 2.23x throughput gain compared to 1.91x for static approaches on the dataset with rare tokens.

[106] POPI: Personalizing LLMs via Optimized Preference Inference

Yizhuo Chen, Xin Liu, Ruijie Wang, Zheng Li, Pei Chen, Changlong Yu, Priyanka Nigam, Meng Jiang, Bing Yin

Main category: cs.CL

TL;DR: POPI: A modular personalization framework that decouples preference inference and conditioned generation for LLMs, enabling transferable natural-language preference summaries

Details

Motivation: Current LLMs are aligned with population-level preferences, ignoring individual user variations. Existing personalization methods lack explicit structure and modularity, making them inefficient and non-transferable across different LLMs.

Method: Proposes POPI framework with two modular components: preference inference (extracts user preferences) and conditioned generation (generates responses based on preferences). Uses natural language as interface between components. Jointly optimizes both components using reinforcement learning with unified preference optimization objective.

Result: POPI improves personalization performance across multiple benchmarks while reducing context overhead. Learned natural-language preference summaries transfer effectively to frozen, off-the-shelf LLMs including black-box APIs, demonstrating modularity and generator-transferability.

Conclusion: Modular personalization framework with explicit decomposition into preference inference and conditioned generation enables effective, transferable personalization. Natural language serves as effective interface between components, allowing preference summaries to work across different LLMs.

Abstract: Large language models (LLMs) are typically aligned with population-level preferences, despite substantial variation across individual users. While many LLM personalization methods exist, the underlying structure of user-level personalization is often left implicit. We formalize user-level, prompt-independent personalization as a decomposition into two components: preference inference and conditioned generation. We advocate for a modular design that decouples these components; identify natural language as a generator-agnostic interface between them; and characterize generator-transferability as a key implication of modular personalization. Guided by this abstraction, we introduce POPI, a novel instantiation of modular personalization that parameterizes both preference inference and conditioned generation as shared LLMs. POPI jointly optimizes the two components under a unified preference optimization objective, using reinforcement learning as an optimization tool. Across multiple benchmarks, POPI consistently improves personalization performance while reducing context overhead. We further demonstrate that the learned natural-language preference summaries transfer effectively to frozen, off-the-shelf LLMs, including black-box APIs, providing empirical evidence of modularity and generator-transferability.

[107] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee

Main category: cs.CL

TL;DR: Fine-tuning an 8B story generator for authorial style transfer using authorship-verification calibrated rewards and Group Relative Policy Optimization

Details

Motivation: Current approaches to evaluating and optimizing authorial style in long-form story generation are challenging because style assessment often relies on ad hoc prompting and gets conflated with overall writing quality

Method: Two-stage pipeline: 1) Train a style-similarity judge by fine-tuning a sentence-transformer with authorship-verification supervision and calibrate outputs to [0,1] reward, 2) Use this judge as primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing

Result: GRPO-trained 8B model achieves higher style scores than open-weight baselines across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), with average style score of 0.893

Conclusion: AV-calibrated reward modelling provides a practical mechanism for controllable style transfer in long-form generation under moderate model size and training budget

Abstract: Evaluating and optimising authorial style in long-form story generation remains challenging because style is often assessed with ad hoc prompting and is frequently conflated with overall writing quality. We propose a two-stage pipeline. First, we train a dedicated style-similarity judge by fine-tuning a sentence-transformer with authorship-verification supervision, and calibrate its similarity outputs into a bounded $[0,1]$ reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, with an average style score of 0.893 across authors. These results suggest that AV-calibrated reward modelling provides a practical mechanism for controllable style transfer in long-form generation under a moderate model size and training budget.

[108] DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song, Stanley Jungkyu Choi, Moontae Lee, Honglak Lee

Main category: cs.CL

TL;DR: DEER is a benchmark for evaluating expert-level deep research reports generated by LLMs, featuring a comprehensive taxonomy, rubric-based assessment, and claim verification architecture.

Details

Motivation: Current evaluation of LLM-generated expert reports is challenging due to multifaceted quality criteria, LLM judges missing domain-specific errors, and the need for claim verification across retrieved evidence.

Method: DEER systematizes evaluation with an expert-developed taxonomy (7 dimensions, 25 subdimensions, 101 rubric items), provides Expert Evaluation Guidance for LLM-based judging, and includes a claim verification architecture that verifies both cited and uncited claims while quantifying evidence quality.

Result: Experiments show current deep research systems can produce structurally plausible reports with citations, but need improvement in fulfilling expert-level requests and achieving logical completeness. DEER enables interpretable system analysis and diagnostic improvement signals.

Conclusion: DEER provides a comprehensive benchmark for evaluating expert-level deep research reports, making system strengths/limitations interpretable and offering diagnostic signals for improvement beyond simple performance comparisons.

Abstract: Recent advances in large language models have enabled deep research systems that generate expert-level reports through multi-step reasoning and evidence-based synthesis. However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and because deep research relies on retrieved evidence, report-wide claim verification is also necessary. To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports. DEER systematizes evaluation criteria with an expert-developed taxonomy (7 dimensions, 25 subdimensions) operationalized as 101 fine-grained rubric items. We also provide task-specific Expert Evaluation Guidance to support LLM-based judging. Alongside rubric-based assessment, we propose a claim verification architecture that verifies both cited and uncited claims and quantifies evidence quality. Experiments show that while current deep research systems can produce structurally plausible reports that cite external evidence, there is room for improvement in fulfilling expert-level user requests and achieving logical completeness. Beyond simple performance comparisons, DEER makes system strengths and limitations interpretable and provides diagnostic signals for improvement.

[109] A Unified Definition of Hallucination: It’s The World Model, Stupid!

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

Main category: cs.CL

TL;DR: The paper proposes a unified definition of hallucinations in LLMs as inaccurate internal world modeling observable to users, and outlines plans for synthetic benchmarks to test world modeling components.

Details

Motivation: Despite ongoing efforts, hallucinations remain a persistent problem in frontier LLMs. The authors aim to unify disparate definitions of hallucination to provide a common framework for evaluation, comparison, and mitigation strategies.

Method: The paper reviews existing definitions and folds them into a single unified definition where hallucinations are defined as inaccurate internal world modeling observable to users. The framework varies reference world models and conflict policies to subsume prior definitions.

Result: The unified framework distinguishes true hallucinations from planning or reward errors, provides common language for comparison across benchmarks, and enables clearer evaluation by forcing clarification of assumed reference “worlds.”

Conclusion: The unified view of hallucinations as inaccurate world modeling is useful for evaluation and mitigation. The authors plan to develop synthetic benchmarks with fully specified reference world models to stress-test and improve world modeling components in LLMs.

Abstract: Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today’s frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference “world”, distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we outline plans for a family of benchmarks using synthetic, fully specified reference world models to stress-test and improve world modeling components.

[110] Self-attention vector output similarities reveal how machines pay attention

Tal Halevi, Yarden Tzach, Ronit D. Gross, Shalom Rosner, Ido Kanter

Main category: cs.CL

TL;DR: Analysis of self-attention mechanisms in BERT reveals how attention heads specialize in different linguistic features and develop from long-range to short-range semantic similarities across layers.

Details

Motivation: While self-attention has advanced NLP, the precise mechanisms underlying its learning process and quantitative characterization remain poorly understood. This study aims to quantify information processing within self-attention mechanisms.

Method: The study analyzes BERT-12 architecture by examining attention maps and deriving context similarity matrices from vector spaces emerging from self-attention heads. It measures scalar products between token vectors to reveal similarities between token pairs within each head and layer.

Result: Final layers focus on sentence separator tokens, suggesting text segmentation based on semantic features. Different attention heads specialize in different linguistic characteristics (token repetitions, common tokens with context). Initial layers show long-range similarities, while deeper layers develop short-range similarities with preference for intra-sentence attention. Each head focuses on unique tokens and builds similarity pairs around them.

Conclusion: Self-attention mechanisms exhibit systematic specialization patterns where different heads capture distinct linguistic features, and attention evolves from long-range to short-range semantic relationships across layers, providing insights into how transformers process language.

Abstract: The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of self-attention underlying its advanced learning and the quantitative characterization of this learning process remains an open research question. This study introduces a new approach for quantifying information processing within the self-attention mechanism. The analysis conducted on the BERT-12 architecture reveals that, in the final layers, the attention map focuses on sentence separator tokens, suggesting a practical approach to text segmentation based on semantic features. Based on the vector space emerging from the self-attention heads, a context similarity matrix, measuring the scalar product between two token vectors was derived, revealing distinct similarities between different token vector pairs within each head and layer. The findings demonstrated that different attention heads within an attention block focused on different linguistic characteristics, such as identifying token repetitions in a given text or recognizing a token of common appearance in the text and its surrounding context. This specialization is also reflected in the distribution of distances between token vectors with high similarity as the architecture progresses. The initial attention layers exhibit substantially long-range similarities; however, as the layers progress, a more short-range similarity develops, culminating in a preference for attention heads to create strong similarities within the same sentence. Finally, the behavior of individual heads was analyzed by examining the uniqueness of their most common tokens in their high similarity elements. Each head tends to focus on a unique token from the text and builds similarity pairs centered around it.

[111] TurkBench: A Benchmark for Evaluating Turkish Large Language Models

Çağrı Toraman, Ahmet Kaan Sever, Ayse Aysu Cengiz, Elif Ecem Arslan, Görkem Sevinç, Mete Mert Birdal, Yusuf Faruk Güldemir, Ali Buğra Kanburoğlu, Sezen Felekoğlu, Osman Gürlek, Sarp Kantar, Birsen Şahin Kütük, Büşra Tufan, Elif Genç, Serkan Coşkun, Gupse Ekin Demir, Muhammed Emin Arayıcı, Olgun Dursun, Onur Gungor, Susan Üsküdarlı, Abdullah Topraksoy, Esra Darıcı

Main category: cs.CL

TL;DR: TurkBench: A comprehensive Turkish-language benchmark for evaluating generative LLMs across 21 subtasks in 6 categories

Details

Motivation: There's a critical need for language-specific evaluation benchmarks beyond English, particularly for languages with unique linguistic characteristics like Turkish, where existing benchmarks remain underdeveloped.

Method: Developed TurkBench with 8,151 data samples across 21 distinct subtasks organized under six main categories: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following.

Result: Created a comprehensive Turkish-language benchmark that provides culturally relevant data for evaluating generative LLMs, published for online submissions via Hugging Face.

Conclusion: TurkBench fills an important gap in multilingual LLM evaluation by providing a valuable tool for researchers and developers to assess Turkish-language models and identify improvement areas.

Abstract: With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench

[112] When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation

Jing Ren, Bowen Li, Ziqi Xu, Xikun Zhang, Haytham Fayek, Xiaodong Li

Main category: cs.CL

TL;DR: Ca2KG is a causality-aware calibration framework for KG-RAG that addresses overconfidence issues by integrating counterfactual prompting and panel-based re-scoring to improve uncertainty estimation while maintaining accuracy.

Details

Motivation: Existing KG-RAG models are severely overconfident, producing high-confidence predictions even when retrieved knowledge is incomplete or unreliable, which is problematic for high-stakes applications.

Method: Proposes Ca2KG framework with two components: 1) counterfactual prompting to expose retrieval-dependent uncertainties in knowledge quality and reasoning reliability, and 2) panel-based re-scoring mechanism that stabilizes predictions across interventions.

Result: Extensive experiments on two complex QA datasets show Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.

Conclusion: Ca2KG effectively addresses the overconfidence problem in KG-RAG through causality-aware calibration, making it more reliable for deployment in high-stakes domains.

Abstract: Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.

[113] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

Main category: cs.CL

TL;DR: BHyT is a drop-in replacement for Pre-LN normalization that uses bounded hyperbolic tangent with explicit input bounding to improve stability and efficiency in deep language models.

Details

Motivation: Pre-Layer Normalization (Pre-LN) is standard in LLMs but suffers from inefficiency due to repeated statistical calculations and instability as depth increases (curse of depth). Existing efficiency methods like Dynamic Tanh remain fragile at depth.

Method: Proposes Bounded Hyperbolic Tanh (BHyT) which couples tanh nonlinearity with explicit, data-driven input bounding to keep activations in non-saturating range. Computes exact statistics once per block and replaces second normalization with lightweight variance approximation for efficiency.

Result: BHyT achieves 15.8% faster training and 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing inference performance and robustness across language understanding and reasoning benchmarks.

Conclusion: BHyT jointly addresses stability and efficiency issues in deep LLMs, providing a practical drop-in replacement for Pre-LN with theoretical stability guarantees and empirical performance improvements.

Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation, enhancing efficiency. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm., while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT

[114] When Domain Pretraining Interferes with Instruction Alignment: An Empirical Study of Adapter Merging in Medical LLMs

Junyi Zou

Main category: cs.CL

TL;DR: Medical LLMs show adapter interference when combining domain adaptation and instruction alignment, affecting safety-critical deployment.

Details

Motivation: To understand how domain-oriented pre-training and supervised fine-tuning interact in medical LLMs, particularly how adapter merging affects model behavior and safety in critical settings.

Method: Two-stage LoRA pipeline with separate domain-oriented pre-training and supervised fine-tuning, followed by weighted adapter merging. Includes merge-verification routine to ensure correctness.

Result: PT signal systematically alters model behavior, producing reasoning-style outputs even when suppressed. BLEU/ROUGE scores drop while multiple-choice accuracy improves. Small pipeline mistakes can misattribute behavior.

Conclusion: Adapter interference between knowledge injection and instruction alignment has important implications for safety-critical model deployment, requiring careful verification.

Abstract: Large language models can exhibit surprising adapter interference when combining domain adaptation and instruction alignment in safety-critical settings. We study a two-stage LoRA pipeline for medical LLMs, where domain-oriented pre-training (PT) and supervised fine-tuning (SFT) are trained separately and later merged through weighted adapter merging. We observe that introducing PT signal can systematically alter model behavior and produce reasoning-style outputs, even when evaluation templates explicitly attempt to suppress such behavior. This interference leads to a divergence between surface metrics and reasoning or alignment behavior: BLEU/ROUGE scores drop significantly, while multiple-choice accuracy improves. We further show that small pipeline mistakes can easily misattribute SFT-only behavior to merged models, and provide a lightweight merge-verification routine to ensure correctness and reproducibility. Our findings highlight an interaction between knowledge injection and instruction alignment in adapter-based fine-tuning, with important implications for safety-critical model deployment.

[115] NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference

Kei Saito

Main category: cs.CL

TL;DR: A framework for text-to-state mapping that preserves multiple interpretations of ambiguous language instead of forcing early semantic commitment, enabling non-collapsing reasoning in LLMs.

Details

Motivation: Large language models tend to commit to single interpretations of ambiguous input too early, collapsing multiple valid meanings. This paper aims to preserve interpretive multiplicity by creating a formal mapping from text to a non-collapsing state space.

Method: A three-stage text-to-state mapping framework: 1) conflict detection using rule-based segmentation for explicit markers and LLM-based enumeration for implicit ambiguity, 2) interpretation extraction, and 3) state construction. The approach combines rule-based methods for explicit conflict markers (adversative conjunctions, hedging expressions) with LLM-based techniques for implicit ambiguity types (epistemic, lexical, structural).

Result: On 68 ambiguous sentences, the framework preserved interpretive multiplicity with mean state entropy H = 1.087 bits across ambiguity categories, compared to H = 0 for collapse-based baselines. Demonstrated cross-lingual portability with Japanese markers.

Conclusion: The framework provides an algorithmic bridge between text and Non-Resolution Reasoning state space, enabling architectural collapse deferment in LLM inference by preserving multiple interpretations rather than forcing early semantic commitment.

Abstract: Large language models exhibit a systematic tendency toward early semantic commitment: given ambiguous input, they collapse multiple valid interpretations into a single response before sufficient context is available. We present a formal framework for text-to-state mapping ($φ: \mathcal{T} \to \mathcal{S}$) that transforms natural language into a non-collapsing state space where multiple interpretations coexist. The mapping decomposes into three stages: conflict detection, interpretation extraction, and state construction. We instantiate $φ$ with a hybrid extraction pipeline combining rule-based segmentation for explicit conflict markers (adversative conjunctions, hedging expressions) with LLM-based enumeration of implicit ambiguity (epistemic, lexical, structural). On a test set of 68 ambiguous sentences, the resulting states preserve interpretive multiplicity: mean state entropy $H = 1.087$ bits across ambiguity categories, compared to $H = 0$ for collapse-based baselines. We additionally instantiate the rule-based conflict detector for Japanese markers to illustrate cross-lingual portability. This framework extends Non-Resolution Reasoning (NRR) by providing the missing algorithmic bridge between text and the NRR state space, enabling architectural collapse deferment in LLM inference.

[116] SERA: Soft-Verified Efficient Repository Agents

Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers

Main category: cs.CL

TL;DR: SERA enables efficient training of coding agents specialized to private codebases using supervised finetuning, achieving state-of-the-art open-source performance at dramatically lower cost.

Details

Motivation: Open-weight coding agents should theoretically have an advantage over closed-source systems by being able to specialize to private codebases, but training costs and complexity have prevented this advantage from being realized.

Method: Soft-Verified Efficient Repository Agents (SERA) uses Soft Verified Generation (SVG) to generate thousands of trajectories from a single code repository, combined with supervised finetuning for efficient training.

Result: SERA achieves state-of-the-art results among fully open-source models while matching frontier open-weight models, with 26x cheaper training than reinforcement learning and 57x cheaper than previous synthetic data methods.

Conclusion: The work makes repository specialization practical for open coding agents, accelerating research in this area and demonstrating the advantage of open-source models that can adapt to private codebases.

Abstract: Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2’s Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.

[117] Linear representations in language models can change dramatically over a conversation

Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan

Main category: cs.CL

TL;DR: Language model representations evolve dynamically during conversations, with linear directions corresponding to concepts changing based on conversational context and content.

Details

Motivation: To understand how language model representations evolve during conversations, particularly how linear directions corresponding to high-level concepts change dynamically based on conversational context.

Method: Studied representation dynamics by analyzing how linear directions change during simulated conversations, examining changes in factuality representations, robustness across model families and layers, and effects of different conversation types (on-policy vs scripted).

Result: Representations change dramatically during conversations - factual information can become non-factual and vice versa. Changes are content-dependent, robust across models, and occur even with scripted conversations. Steering along representational directions has different effects at different conversation points.

Conclusion: Language model representations evolve dynamically in response to conversational context, challenging static interpretations of features and suggesting models adapt to play specific roles cued by conversations.

Abstract: Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even replaying a conversation script written by an entirely different model can produce similar changes. However, adaptation is much weaker from simply having a sci-fi story in context that is framed more explicitly as such. We also show that steering along a representational direction can have dramatically different effects at different points in a conversation. These results are consistent with the idea that representations may evolve in response to the model playing a particular role that is cued by a conversation. Our findings may pose challenges for interpretability and steering – in particular, they imply that it may be misleading to use static interpretations of features or directions, or probes that assume a particular range of features consistently corresponds to a particular ground-truth value. However, these types of representational dynamics also point to exciting new research directions for understanding how models adapt to context.

[118] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Xiaomin Lin

Main category: cs.CL

TL;DR: MilSCORE is a military planning benchmark for evaluating LLMs on long-context, multi-modal reasoning with maps, orders, and intelligence reports.

Details

Motivation: There's a need for realistic long-context benchmarks requiring selective reading and integration of heterogeneous, multi-modal information, especially for geospatial planning problems like military operations that demand reasoning over maps, orders, and intelligence reports.

Method: Created MilSCORE, the first scenario-level dataset of expert-authored, multi-hop questions grounded in complex simulated military planning scenarios. Includes diverse question types across seven categories targeting factual recall and multi-step reasoning about constraints, strategy, and spatial analysis.

Result: Baseline results for contemporary vision-language models show substantial headroom, indicating current systems struggle with realistic, scenario-level long-context planning.

Conclusion: MilSCORE serves as a challenging testbed for future work on long-context, multi-modal reasoning in complex planning scenarios.

Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

[119] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Fariba Afrin Irany

Main category: cs.CL

TL;DR: Selective fine-tuning of GPT-2 for clinical text classification using frozen backbone and trainable final layers, achieving efficient performance on radiology reports with reduced computational cost.

Details

Motivation: Clinical narratives in EHRs offer opportunities for automated disease characterization but face challenges including limited labeled data, class imbalance, and high computational costs of adapting large language models to domain-specific clinical text.

Method: Proposes a GPT-based architecture with selective fine-tuning strategy: freezes majority of GPT-2 backbone, trains only final Transformer block, final layer normalization, and lightweight classification head to reduce trainable parameters while preserving representational capacity.

Result: Evaluated on MIMIC-IV-Note radiology reports with CheXpert-style labels; shows stable convergence and strong classification performance across multiple problem formulations, particularly effective for non-mention and negated findings.

Conclusion: Selective fine-tuning of pretrained generative language models provides efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data with significantly reduced computational complexity.

Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.

Elif Sayar, Tolgahan Türker, Anna Golynskaia Knezhevich, Bihter Dereli, Ayşe Demirhas, Lionel Nicolas, Gülşen Eryiğit

Main category: cs.CL

TL;DR: A semi-automated annotation methodology for learner corpora using a faceted taxonomy to enable fine-grained, multi-dimensional error analysis beyond traditional flat annotations.

Details

Motivation: Most learner corpora use holistic flat label inventories that don't separate linguistic dimensions, making deep annotation difficult and complicating fine-grained error analysis. There's a need for standardized, interpretable enrichment beyond flat annotations.

Method: Developed a semi-automated annotation methodology based on a faceted taxonomy, implemented through an annotation extension framework. Created an annotation extension tool for Turkish that automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets.

Result: Achieved 95.86% facet-level accuracy. Produced the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus with enhanced querying capabilities and detailed exploratory analysis support.

Conclusion: The work introduces a novel approach to learner corpus annotation that enables complex linguistic and pedagogical analysis. It provides the first corpus designed according to the new taxonomy and paves the way for enriching existing error-annotated learner corpora.

Abstract: In terms of annotation structure, most learner corpora rely on holistic flat label inventories which, even when extensive, do not explicitly separate multiple linguistic dimensions. This makes linguistically deep annotation difficult and complicates fine-grained analyses aimed at understanding why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy, and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, implemented based on the proposed extension framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy to provide richer learner-specific context. It was systematically evaluated and yielded promising performance results, achieving a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns through complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. As the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment efforts of existing error-annotated learner corpora.

[121] Are you going to finish that? A Practical Study of the Partial Token Problem

Hao Xu, Alisa Liu, Jonathan Hayase, Yejin Choi, Noah A. Smith

Main category: cs.CL

TL;DR: The paper investigates the partial token problem in language models where user prompts ending mid-token cause distorted predictions, particularly affecting languages without whitespace, compounding languages, and code.

Details

Motivation: There's a mismatch between how LMs process token sequences and how users interact via text, causing the partial token problem when prompts end mid-token. This issue is understudied for realistic prompts respecting word boundaries, especially in languages where token and word boundaries don't align.

Method: Systematically constructed semantically natural prompts ending with partial tokens across three domains: languages without whitespace (Chinese), highly compounding languages, and code. Evaluated frontier LMs’ probability distributions on correct continuations versus token-aligned “backed-off” prompts.

Result: In Chinese, up to 25% of word boundaries don’t align with token boundaries. LMs consistently place three orders of magnitude less probability on correct continuations for partial token prompts. Degradation doesn’t diminish with scale and often worsens for larger models. Evaluated inference-time mitigations and validated recent exact solutions.

Conclusion: The partial token problem causes severe probability distortion in realistic use cases, particularly for languages with misaligned token/word boundaries. The issue persists across model scales. Practical recommendations are provided for model inference providers, and recent exact solutions show effectiveness.

Abstract: Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next-token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remains underexplored. In this work, we identify three domains where token and “word” boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with a partial tokens; in experiments, we find that they comprise a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is “backed-off” to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommentions for model inference providers.

[122] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

Haitao Li, Yifan Chen, Shuo Miao, Qian Dong, Jia Chen, Yiran Hu, Junjie Chen, Minghao Qin, Yueyue Wu, Yujia Zhou, Qingyao Ai, Yiqun Liu, Cheng Luo, Quan Zhou, Ya Zhang, Jikun Hu

Main category: cs.CL

TL;DR: LegalOne is a specialized foundation model family for Chinese legal domain that addresses LLMs’ limitations in legal reasoning through a three-phase pipeline: domain adaptation via Plasticity-Adjusted Sampling, reasoning distillation via Legal Agentic CoT Distillation, and progressive reinforcement learning.

Details

Motivation: General LLMs lack precise domain knowledge and struggle with complex multi-step judicial reasoning required in legal applications, creating a need for specialized legal AI models.

Method: Three-phase pipeline: 1) Mid-training with Plasticity-Adjusted Sampling for domain adaptation, 2) Supervised fine-tuning with Legal Agentic CoT Distillation to distill reasoning from legal texts, 3) Curriculum Reinforcement Learning with progressive stages (memorization, understanding, reasoning).

Result: LegalOne achieves state-of-the-art performance across various legal tasks, surpassing larger general-purpose LLMs through enhanced knowledge density and efficiency. The model weights and LegalKit evaluation framework are publicly released.

Conclusion: LegalOne demonstrates that specialized legal foundation models can overcome limitations of general LLMs in legal reasoning, paving the way for trustworthy and interpretable AI in high-stakes judicial applications.

Abstract: While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.

[123] Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin

Main category: cs.CL

TL;DR: TerminalTraj is a scalable pipeline for generating high-quality terminal trajectories for training agentic models, addressing executability and verifiability challenges through Dockerized environments and validation code.

Details

Motivation: Training agentic models for terminal-based tasks requires high-quality terminal trajectories with realistic long-horizon interactions, but scaling such data collection is challenging due to executability (needing diverse Docker environments) and verifiability (heterogeneous task outputs).

Method: Proposes TerminalTraj pipeline that: (1) filters high-quality repositories to construct Dockerized execution environments, (2) generates Docker-aligned task instances, and (3) synthesizes agent trajectories with executable validation code for verification.

Result: Curated 32K Docker images and generated 50,733 verified terminal trajectories across eight domains. Models trained on this data achieved up to 20% improvement on TerminalBench 1.0 and 10% on TerminalBench 2.0, with TerminalTraj-32B reaching 35.30% on TB 1.0 and 22.00% on TB 2.0.

Conclusion: TerminalTraj provides a scalable solution for generating high-quality terminal trajectory data, enabling improved training of agentic models for terminal-based tasks with better executability and verifiability.

Abstract: Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \textbf{\emph{Executability}}, since each instance requires a suitable and often distinct Docker environment; and \textbf{\emph{Verifiability}}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose \textbf{TerminalTraj}, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB~~1.0 and 10% on TB~~2.0 over their respective backbones. Notably, \textbf{TerminalTraj-32B} achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB~~1.0 and 22.00% on TB~~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.

[124] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Yi Bai, Dannong Xu, Tianwei Lin, Xinda Zhao, Xiaohong Li, Yunyun Han, Jian Pei, Yafeng Deng

Main category: cs.CL

TL;DR: EverMemBench: A benchmark for evaluating long-term conversational memory in LLMs, featuring multi-party, multi-group conversations with temporal evolution and role-specific personas.

Details

Motivation: Existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity of conversational memory. There's a need for more realistic evaluation of memory systems in LLM-based assistants.

Method: Introduces EverMemBench with multi-party, multi-group conversations spanning over 1 million tokens, featuring temporally evolving information, cross-topic interleaving, and role-specific personas. Evaluates through 1,000+ QA pairs across three dimensions: fine-grained recall, memory awareness, and user profile understanding.

Result: Reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings (oracle models achieve only 26%), (2) temporal reasoning remains unsolved requiring version semantics, (3) memory awareness bottlenecked by retrieval where similarity-based methods fail to bridge semantic gaps.

Conclusion: EverMemBench provides a challenging testbed for developing next-generation memory architectures for conversational AI systems.

Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.

[125] From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis

Niansong Zhang, Sunwoo Kim, Shreesha Srinath, Zhiru Zhang

Main category: cs.CL

TL;DR: HLS remains essential in AI-driven hardware design era as a practical abstraction layer for agentic optimization, offering faster iteration, portability, and design permutability that AI agents can leverage.

Details

Motivation: With the rise of large language models and AI-driven hardware design, there's a question about whether high-level synthesis (HLS) still matters. The paper argues that HLS remains essential as a practical abstraction layer for agentic optimization in hardware design.

Method: This is a position paper that makes three main contributions: 1) explains why HLS serves as a practical abstraction layer and golden reference for agentic hardware design, 2) identifies key limitations of current HLS tools that AI agents can address, and 3) proposes a taxonomy for symbiotic evolution of agentic HLS showing responsibility shift from humans to AI agents.

Result: The paper establishes HLS as a critical layer for AI-driven hardware optimization, highlighting its advantages (faster iteration cycles, portability, design permutability) and identifying specific limitations that AI agents can overcome (performance feedback, interface rigidity, debuggability).

Conclusion: HLS remains essential in the agentic era as it provides the necessary abstraction layer for AI agents to optimize hardware design, with a symbiotic evolution path from human-driven to AI-autonomous design systems.

Abstract: The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization. This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.

[126] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang

Main category: cs.CL

TL;DR: Wiki Live Challenge (WLC) is a live benchmark using Wikipedia Good Articles as expert references to evaluate Deep Research Agents, with a comprehensive evaluation framework including 39 criteria for writing quality and factual verifiability metrics.

Details

Motivation: Current evaluation frameworks for Deep Research Agents rely on LLM-generated references which lack reliability of expert-verified content and struggle to provide objective, fine-grained assessments. There's a need for benchmarks with expert-level references to properly evaluate DRA capabilities.

Method: Introduces Wiki Live Challenge (WLC) using newest Wikipedia Good Articles as expert-level references. Curates 100 recent Good Articles and proposes Wiki Eval framework with fine-grained evaluation method (39 criteria for writing quality) and rigorous metrics for factual verifiability.

Result: Experiments on various DRA systems show significant gap between current DRAs and human expert-level Wikipedia articles, validating WLC’s effectiveness in advancing agent research.

Conclusion: WLC provides a reliable benchmark with expert-verified references for evaluating Deep Research Agents, addressing limitations of current LLM-based evaluation approaches and enabling more objective assessment of agent capabilities.

Abstract: Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia’s strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge

[127] ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Main category: cs.CL

TL;DR: ARTIS introduces agentic risk-aware test-time scaling via iterative simulation to improve LLM agent reliability by exploring actions in simulated environments before real-world execution, with risk-aware simulation focusing on failure modes.

Details

Motivation: Current test-time scaling techniques for LLMs are insufficient for agentic settings where actions interact with external environments and can have irreversible, costly consequences. There's a need for methods that improve action-level reliability without incurring environmental risk.

Method: ARTIS decouples exploration from commitment by enabling test-time exploration through simulated interactions before real-world execution. It uses iterative simulation to extend inference-time computation. The paper also introduces a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions through targeted data generation and rebalanced training.

Result: Experiments on multi-turn and multi-step agentic benchmarks show that iterative simulation substantially improves agent reliability, and risk-aware simulation is essential for consistently realizing these gains across models and tasks.

Conclusion: ARTIS provides a framework for improving LLM agent reliability through safe test-time exploration in simulations, with risk-aware simulation being crucial for capturing rare but high-impact failure modes in agentic decision making.

Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.

[128] Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings

Doohyun Kim, Donghwa Kang, Kyungjae Lee, Hyeongboo Baek, Brent Byunghoon Kang

Main category: cs.CL

TL;DR: Zero2Text is a training-free framework for embedding inversion attacks on vector databases that uses recursive online alignment to recover text from embeddings without requiring training data or excessive queries.

Details

Motivation: Vector databases in RAG systems pose privacy risks through embedding inversion attacks, but existing methods have limitations: optimization-based approaches require too many queries, while alignment-based methods need accessible in-domain training data, making them ineffective in black-box and cross-domain settings.

Method: Zero2Text uses recursive online alignment that synergizes LLM priors with dynamic ridge regression to iteratively align text generation to target embeddings on-the-fly, without requiring training data or static datasets.

Result: Zero2Text achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines on MS MARCO against OpenAI models, recovering sentences from unknown domains without any leaked data pairs, and shows that standard defenses like differential privacy fail against this adaptive threat.

Conclusion: Zero2Text demonstrates a significant advancement in embedding inversion attacks by overcoming limitations of existing methods, highlighting serious privacy vulnerabilities in vector databases that current defenses cannot adequately address.

Abstract: The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.

[129] Sentence Curve Language Models

DongNyeong Heo, Heeyoul Choi

Main category: cs.CL

TL;DR: SCLM proposes sentence curve representations for diffusion language models to improve global structure modeling beyond static word embeddings.

Details

Motivation: Current language models use static word embeddings for target sentences, which are insensitive to neighboring words and encourage locally accurate word prediction while neglecting global sentence structure.

Method: Proposes continuous sentence representation called sentence curve (spline curve whose control points affect multiple words), extends diffusion language models to predict sentence curves instead of static word embeddings.

Result: Achieves SOTA performance among diffusion language models on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, demonstrates promising potential on LM1B compared to discrete diffusion language models.

Conclusion: Sentence curve representation effectively addresses limitations of static word embeddings by promoting global structure modeling in language models, offering improved performance and training stability.

Abstract: Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.

[130] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora

Pengyu Wang, Benfeng Xu, Licheng Zhang, Shaohan Wang, Mingxuan Du, Chiwei Zhu, Zhendong Mao

Main category: cs.CL

TL;DR: WildGraphBench is a new benchmark for evaluating GraphRAG systems using Wikipedia’s long, heterogeneous reference documents as realistic retrieval corpus, with 1,100 questions across three complexity levels.

Details

Motivation: Existing GraphRAG benchmarks use short, curated passages which don't reflect real-world scenarios with long contexts and large-scale heterogeneous documents. There's a need for evaluation in more realistic settings.

Method: Leverage Wikipedia’s structure where cohesive narratives are grounded in long external reference documents. Sample articles across 12 topics, use external references as retrieval corpus and citation-linked statements as ground truth. Create 1,100 questions across three complexity levels: single-fact QA, multi-fact QA, and section-level summarization.

Result: Experiments show current GraphRAG pipelines help with multi-fact aggregation when evidence comes from moderate number of sources, but the aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks.

Conclusion: WildGraphBench provides a more realistic evaluation framework for GraphRAG systems, revealing limitations in current approaches for handling fine-grained details in summarization tasks.

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia’s unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-word scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page:https://github.com/BstWPY/WildGraphBench.

[131] Closing the Loop: Universal Repository Representation with RPG-Encoder

Jane Luo, Chengyu Yin, Xin Zhang, Qingtao Li, Steven Liu, Yiming Huang, Jie Wu, Hao Liu, Yangyu Huang, Yu Kang, Fangkai Yang, Ying Xin, Scarlett Li

Main category: cs.CL

TL;DR: RPG-Encoder is a framework that creates unified repository representations by encoding code into Repository Planning Graphs, combining semantic features with dependencies for improved code understanding and generation.

Details

Motivation: Current repository agents suffer from fragmented representations using isolated API documentation or dependency graphs lacking semantic depth, creating a reasoning disconnect between comprehension and generation.

Method: Proposes RPG-Encoder that: (1) Encodes raw code into Repository Planning Graphs combining semantic features with dependencies, (2) Evolves topology incrementally to reduce maintenance costs by 95.7%, and (3) Operates as unified interface for structure-aware navigation.

Result: Achieves state-of-the-art localization performance: 93.7% Acc@5 on SWE-bench Verified, exceeds best baseline by over 10% on SWE-bench Live Lite, and achieves 98.5% reconstruction coverage on RepoCraft.

Conclusion: RPG-Encoder successfully closes the reasoning loop between intent and implementation by creating high-fidelity repository representations that unify comprehension and generation processes.

Abstract: Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art localization performance on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% in localization accuracy on SWE-bench Live Lite. These results highlight our superior fine-grained precision in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG’s high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.

[132] AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?

Liang Lin, Feng Xiong, Zengbin Wang, Kun Wang, Junhao Dong, Xuecai Hu, Yong Wang, Xiangxiang Chu

Main category: cs.CL

TL;DR: AR-MAP transfers preference alignment knowledge from autoregressive LLMs to diffusion LLMs through weight scaling, avoiding high-variance direct alignment methods.

Details

Motivation: Diffusion LLMs face challenges in preference alignment due to high variance in ELBO-based likelihood estimation, while autoregressive LLMs already have effective alignment methods.

Method: Proposes AR-MAP framework that uses preference-aligned autoregressive LLMs as implicit teachers for DLLM alignment through simple weight scaling, exploiting shared architectural structure.

Result: Achieves 69.08% average score across diverse preference alignment tasks, competitive or superior to existing DLLM-specific alignment methods.

Conclusion: AR-MAP enables effective preference alignment for diffusion LLMs by transferring knowledge from autoregressive models, circumventing computational challenges of direct alignment.

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving 69.08% average score across all tasks and models. Our Code is available at https://github.com/AMAP-ML/AR-MAP.

cs.CV

[133] WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models

Runjie Zhou, Youbo Shao, Haoyu Lu, Bowei Xing, Tongtong Bai, Yujie Chen, Jie Zhao, Lin Sui, Haotian Yao, Zijia Zhao, Hao Yang, Haoning Wu, Zaida Zhou, Jinguo Zhu, Zhiqi Huang, Yiping Bao, Yangyang Liu, Y. Charles, Xinyu Zhou

Main category: cs.CV

TL;DR: WorldVQA is a benchmark for evaluating MLLMs’ atomic visual world knowledge by decoupling knowledge retrieval from reasoning, focusing on what models memorize about visual entities across a stratified taxonomy.

Details

Motivation: Current MLLM evaluations often conflate visual knowledge retrieval with reasoning, making it difficult to assess what models actually memorize about the visual world. There's a need for a benchmark that strictly measures atomic visual knowledge to evaluate visual factuality and hallucination rates.

Method: WorldVQA decouples knowledge retrieval from reasoning by designing questions that test the atomic capability of grounding and naming visual entities. It uses a stratified taxonomy spanning from common head-class objects to long-tail rarities to comprehensively assess visual knowledge breadth.

Result: The benchmark provides a rigorous test for visual factuality in MLLMs, establishing a standard for assessing encyclopedic breadth and hallucination rates across different model generations.

Conclusion: WorldVQA serves as an important evaluation tool for measuring what MLLMs actually memorize about the visual world, helping to establish standards for assessing visual knowledge and reducing hallucinations in multimodal AI systems.

Abstract: We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure “what the model memorizes.” The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.

[134] One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

Shuo Lu, Haohan Wang, Wei Feng, Weizhen Wang, Shen Zhang, Yaoyu Li, Ao Ma, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Bing Zhan, Yuan Xu, Huizai Yao, Yongcan Yu, Chenyang Si, Jian Liang

Main category: cs.CV

TL;DR: OSMF is a unified framework for advertising image generation that aligns diverse group-wise click preferences using product-aware adaptive grouping and a Group-aware Multimodal Large Language Model (G-MLLM) fine-tuned with Group-DPO for preference alignment.

Details

Motivation: Existing advertising image generation approaches optimize for overall CTR but neglect preference diversity among user groups, leading to suboptimal performance for specific groups and limiting targeted marketing effectiveness.

Method: 1) Product-aware adaptive grouping to dynamically organize users based on attributes and product characteristics; 2) Preference-conditioned image generation using a Group-aware Multimodal Large Language Model (G-MLLM) pre-trained to comprehend group features and generate images; 3) Fine-tuning G-MLLM with Group-DPO for group-wise preference alignment.

Result: The framework achieves state-of-the-art performance in both offline and online settings, and introduces GAIP - the first large-scale public dataset of group-wise image preferences with around 600K groups from 40M users.

Conclusion: OSMF successfully bridges the gap in targeted advertising by aligning diverse group-wise click preferences through adaptive grouping and multimodal LLM-based image generation, significantly improving CTR for specific user groups.

Abstract: Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a ``one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present \textit{One Size, Many Fits} (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group’s CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.

[135] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li

Main category: cs.CV

TL;DR: AdaptMMBench: A comprehensive benchmark for evaluating adaptive multimodal reasoning in VLMs across five domains with dynamic difficulty assessment and multi-dimensional process evaluation.

Details

[136] End-to-end reconstruction of OCT optical properties and speckle-reduced structural intensity via physics-based learning

Jinglun Yu, Yaning Wang, Wenhan Guo, Yuan Gao, Yu Sun, Jin U. Kang

Main category: cs.CV

TL;DR: Physics-informed deep learning framework for OCT inverse scattering that jointly recovers tissue optical properties and speckle-reduced structural images using Monte Carlo simulations and physics-based forward modeling.

Details

Motivation: Inverse scattering in OCT is challenging due to attenuation, speckle noise, and parameter coupling, making it difficult to recover both structural images and intrinsic tissue optical properties like refractive index, scattering coefficient, and anisotropy.

Method: Regularized end-to-end deep learning framework trained with Monte Carlo-simulated ground truth, incorporating a physics-based OCT forward model that generates predicted signals from estimated parameters for physics-consistent supervision.

Result: Experiments on synthetic corneal OCT dataset demonstrate robust optical map recovery under noise, improved resolution, and enhanced structural fidelity compared to conventional methods.

Conclusion: The approach enables quantitative multi-parameter tissue characterization and shows the benefit of combining physics-informed modeling with deep learning for computational OCT applications.

Abstract: Inverse scattering in optical coherence tomography (OCT) seeks to recover both structural images and intrinsic tissue optical properties, including refractive index, scattering coefficient, and anisotropy. This inverse problem is challenging due to attenuation, speckle noise, and strong coupling among parameters. We propose a regularized end-to-end deep learning framework that jointly reconstructs optical parameter maps and speckle-reduced OCT structural intensity for layer visualization. Trained with Monte Carlo-simulated ground truth, our network incorporates a physics-based OCT forward model that generates predicted signals from the estimated parameters, providing physics-consistent supervision for parameter recovery and artifact suppression. Experiments on the synthetic corneal OCT dataset demonstrate robust optical map recovery under noise, improved resolution, and enhanced structural fidelity. This approach enables quantitative multi-parameter tissue characterization and highlights the benefit of combining physics-informed modeling with deep learning for computational OCT.

[137] SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

Haruhiko Murata, Kazuhiro Hotta

Main category: cs.CV

TL;DR: SVD-ViT uses singular value decomposition to help Vision Transformers focus on foreground features and suppress background noise, improving classification accuracy.

Details

Motivation: Vision Transformers lack explicit mechanisms to distinguish foreground from background, causing them to learn unnecessary background features and artifacts that degrade classification performance.

Method: Proposes SVD-ViT with three components: SPC module, SSVA, and ID-RSVD, which use singular value decomposition to extract and aggregate singular vectors capturing object foreground information while suppressing background noise.

Result: Experimental results show improved classification accuracy and effective learning of informative foreground representations while reducing background noise impact.

Conclusion: SVD-ViT successfully addresses ViT’s limitation in foreground-background distinction through SVD-based feature prioritization, enhancing classification performance.

Abstract: Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components-\textbf{SPC module}, \textbf{SSVA}, and \textbf{ID-RSVD}-and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.

[138] LmPT: Conditional Point Transformer for Anatomical Landmark Detection on 3D Point Clouds

Matteo Bastico, Pierre Onghena, David Ryckelynck, Beatriz Marcotegui, Santiago Velasco-Forero, Laurent Corté, Caroline Robine–Decourcelle, Etienne Decencière

Main category: cs.CV

TL;DR: LmPT is a transformer-based method for automatic anatomical landmark detection on point clouds that enables cross-species learning between human and dog femurs.

Details

Motivation: Manual anatomical landmarking is time-consuming and variable, while rule-based methods are limited to specific geometries or landmarks. There's a need for automated methods that can generalize across species for translational research.

Method: Proposes Landmark Point Transformer (LmPT) that represents anatomical surfaces as point clouds and uses a transformer architecture with a conditioning mechanism to adapt to different input types, enabling cross-species learning between human and dog femurs.

Result: Demonstrates effective generalization across species (human and dog femurs) with publicly available code and dog femur dataset.

Conclusion: LmPT provides an effective automated solution for anatomical landmark detection that can leverage cross-species data for translational research.

Abstract: Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time-consuming and prone to inter-observer variability, while rule-based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross-species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: https://github.com/Pierreoo/LandmarkPointTransformer.

[139] ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

Xinyue Li, Zhiming Xu, Zhichao Zhang, Zhaolin Cai, Sijing Wu, Xiongkuo Min, Yitong Chen, Guangtao Zhai

Main category: cs.CV

TL;DR: ELIQ is a label-free framework for quality assessment of AI-generated images that automatically constructs training pairs and adapts pre-trained multimodal models to evaluate visual quality and prompt-image alignment without human annotations.

Details

Motivation: Generative text-to-image models are advancing rapidly, making previously collected quality labels unreliable for newer generations. There's a need for scalable, label-free quality assessment methods that can adapt to evolving AI-generated content without requiring expensive human annotations.

Method: ELIQ automatically constructs positive and aspect-specific negative pairs covering both conventional distortions and AIGC-specific distortion modes. It then adapts a pre-trained multimodal model via instruction tuning and uses lightweight gated fusion with a Quality Query Transformer to predict two-dimensional quality (visual quality and prompt-image alignment).

Result: ELIQ consistently outperforms existing label-free methods across multiple benchmarks, generalizes from AI-generated content to user-generated content scenarios without modification, and demonstrates transferable supervision without human annotations.

Conclusion: ELIQ provides a scalable, label-free framework for quality assessment that can adapt to continuously evolving generative models, paving the way for more efficient evaluation of AI-generated content quality.

Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.

[140] Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room

Keqi Chen, Vinkle Srivastav, Armine Vardazaryan, Cindy Rolland, Didier Mutter, Nicolas Padoy

Main category: cs.CV

TL;DR: Self-supervised multi-view video anonymization framework for operating rooms using whole-body person detection and pose estimation without manual annotation or camera calibration.

Details

Motivation: Privacy preservation is essential for using video data in OR research, but existing anonymization methods face scalability issues requiring manual annotations for new sites and camera calibration for multi-view setups.

Method: Uses off-the-shelf whole-body person detector with low threshold, retrieves false negatives via temporal tracking and self-supervised multi-view association, uses recovered detections as pseudo labels to iteratively fine-tune detector, then applies whole-body pose estimation and fine-tunes pose model with its own high-score predictions.

Result: Achieves over 97% recall on 4D-OR dataset and real surgery dataset, trains real-time whole-body detector using pseudo labels with comparable performance.

Conclusion: Proposed framework effectively anonymizes OR videos without manual annotation or camera calibration, enabling scalable privacy preservation for medical video research.

Abstract: Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations of each new clinical site for high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by “retrieving” false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low-score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method’s practical applicability. Code is available at https://github.com/CAMMA-public/OR_anonymization.

[141] SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, Xu Yang

Main category: cs.CV

Details

Result: Achieves up to 2.68× speed-up on video captioning and 2.55× on visual instruction tuning compared to original LMMs, consistently outperforming prior speculative decoding baselines.

Conclusion: SpecFLASH effectively accelerates multimodal model inference by exploiting visual structure characteristics, making it a practical solution for efficient multimodal generation tasks.

[142] ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying

Weihang You, Qingchan Zhu, David Liu, Yi Pan, Geng Yuan, Hanqi Jiang

Main category: cs.CV

TL;DR: ViThinker is a framework that enables vision-language models to actively generate decision tokens that trigger expert-aligned visual feature synthesis on demand, moving beyond passive processing to active perception for improved reasoning.

Details

Motivation: Current Chain-of-Thought reasoning in vision-language models suffers from premature visual-to-text conversion that discards continuous spatial and geometric information. Existing methods are passive, processing pre-computed inputs rather than actively seeking task-relevant details, unlike human active perception.

Method: ViThinker enables models to autonomously generate decision (query) tokens that trigger synthesis of expert-aligned visual features on demand. It uses a two-stage curriculum: first distilling frozen expert capabilities into model parameters, then learning task-driven querying via sparsity penalties to discover minimal sufficient perception for each reasoning step.

Result: Evaluations across vision-centric benchmarks demonstrate consistent improvements, showing that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.

Conclusion: ViThinker successfully enables active perception in vision-language models, allowing them to generate queries for on-demand visual feature synthesis, leading to better reasoning performance compared to passive approaches.

Abstract: Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum: first distilling frozen experts into model parameters, then learning task-driven querying via sparsity penalties, i.e., ViThinker discovers minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.

[143] TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

Xin Jin, Yichuan Zhong, Yapeng Tian

Main category: cs.CV

TL;DR: TP-Blend is a training-free diffusion editing framework that handles simultaneous object replacement and style transfer using two separate textual prompts and complementary attention processors.

Details

Motivation: Current text-conditioned diffusion editors struggle when both a new object and a new style must be introduced simultaneously, as they typically handle only one type of edit at a time.

Method: TP-Blend uses two attention processors: Cross-Attention Object Fusion (CAOF) for object blending via optimal transport on attention maps, and Self-Attention Style Fusion (SASF) for style injection using Detail-Sensitive Instance Normalization and Key/Value matrix swapping.

Result: The method produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.

Conclusion: TP-Blend enables simultaneous object replacement and style transfer in diffusion models without additional training, providing a lightweight framework for complex multimodal editing tasks.

Abstract: Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.

[144] DoubleTake: Contrastive Reasoning for Faithful Decision-Making in Medical Imaging

Daivik Patel, Shrenik Patel

Main category: cs.CV

TL;DR: A contrastive document-aware reference selection framework for medical image reasoning that constructs compact evidence sets optimized for discrimination rather than similarity, improving accuracy on confusable medical conditions.

Details

Motivation: Existing medical imaging decision-making approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces single hypotheses, failing to address subtle visual differences between confusable conditions.

Method: Introduces contrastive document-aware reference selection using ROCO embeddings and metadata to balance visual relevance, embedding diversity, and source-level provenance. Proposes Counterfactual-Contrastive Inference with structured pairwise visual comparisons and margin-based decision rules with faithful abstention.

Result: Achieves state-of-the-art performance on MediConfusion benchmark, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.

Conclusion: The contrastive evidence selection framework enables more accurate discrimination between confusable medical conditions by optimizing for discrimination rather than similarity, addressing limitations of traditional retrieval methods.

Abstract: Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.

[145] FaceLinkGen: Rethinking Identity Leakage in Privacy-Preserving Face Recognition with Identity Extraction

Wenqi Guo, Shan Du

Main category: cs.CV

TL;DR: FaceLinkGen attack exposes privacy flaws in face recognition systems by extracting identity information from protected templates without pixel reconstruction, showing that current privacy metrics (PSNR/SSIM) fail to measure real privacy risks.

Details

Motivation: Current privacy-preserving face recognition (PPFR) systems are evaluated using pixel-level reconstruction metrics (PSNR/SSIM), but these fail to capture real privacy risks. The authors aim to demonstrate that visual obfuscation leaves identity information exposed to attackers.

Method: Developed FaceLinkGen attack that performs identity linkage/matching and face regeneration directly from protected templates without recovering original pixels. Tested on three recent PPFR systems in both standard and near zero knowledge settings.

Result: FaceLinkGen achieved over 98.5% matching accuracy and above 96% regeneration success on three PPFR systems. Even in near zero knowledge settings, it exceeded 92% matching and 94% regeneration, exposing structural gaps in current privacy evaluation metrics.

Conclusion: Pixel distortion metrics (PSNR/SSIM) widely used in PPFR evaluation fail to measure real privacy. Visual obfuscation leaves identity information broadly exposed to both external intruders and untrusted service providers, requiring new privacy evaluation frameworks.

Abstract: Transformation-based privacy-preserving face recognition (PPFR) aims to verify identities while hiding facial data from attackers and malicious service providers. Existing evaluations mostly treat privacy as resistance to pixel-level reconstruction, measured by PSNR and SSIM. We show that this reconstruction-centric view fails. We present FaceLinkGen, an identity extraction attack that performs linkage/matching and face regeneration directly from protected templates without recovering original pixels. On three recent PPFR systems, FaceLinkGen reaches over 98.5% matching accuracy and above 96% regeneration success, and still exceeds 92% matching and 94% regeneration in a near zero knowledge setting. These results expose a structural gap between pixel distortion metrics, which are widely used in PPFR evaluation, and real privacy. We show that visual obfuscation leaves identity information broadly exposed to both external intruders and untrusted service providers.

[146] HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis

Francesco Di Salvo, Sebastian Doerrich, Jonas Alle, Christian Ledig

Main category: cs.CV

TL;DR: Hyperbolic representation learning improves medical image analysis by modeling complex hierarchical structures, achieving better domain generalization across diverse clinical datasets compared to Euclidean methods.

Details

Motivation: Deep neural networks struggle with robust generalization in medical imaging due to data scarcity, covariate shifts from different hardware/protocols, and heterogeneous patient populations. Euclidean manifolds fail to capture complex hierarchical structures in clinical data, hindering reliable performance and clinical adoption.

Method: Proposes hyperbolic representation learning for medical image analysis, exploiting hyperbolic manifolds to model complex data characteristics. Introduces an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint to promote domain-invariant features.

Result: Demonstrates statistically significant gains across 11 in-distribution datasets and 3 ViT models. Outperforms state-of-the-art Euclidean methods by average +2.1% AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and cross-dataset retinal imaging setup.

Conclusion: Hyperbolic representation learning effectively captures complex hierarchical structures in medical images, enabling better domain generalization across different imaging modalities, data sizes, and label granularities, confirming robust generalization capabilities across substantially different conditions.

Abstract: Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at https://github.com/francescodisalvo05/hyperbolic-cross-branch-consistency .

[147] A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis

Jagan Mohan Reddy Dwarampudi, Joshua Wong, Hien Van Nguyen, Tania Banerjee

Main category: cs.CV

TL;DR: MARBLE is a Mamba-based multi-scale MIL framework for whole-slide image analysis that processes multiple magnification levels in parallel using linear-time state-space models, achieving significant performance improvements over existing methods.

Details

Motivation: Whole-slide image analysis faces challenges due to gigapixel resolutions and hierarchical magnifications. Existing multiple instance learning methods typically operate at single scales, while transformer-based approaches suffer from quadratic attention costs, making them inefficient for large-scale WSI analysis.

Method: MARBLE uses a purely Mamba-based architecture with parallel multi-scale processing and linear-time state-space models. It integrates coarse-to-fine reasoning by processing multiple magnification levels simultaneously, capturing cross-scale dependencies efficiently with minimal parameter overhead.

Result: Experiments on five public datasets show improvements of up to 6.9% in AUC, 20.3% in accuracy, and 2.3% in C-index compared to existing methods, establishing MARBLE as an efficient and generalizable framework for multi-scale WSI analysis.

Conclusion: MARBLE provides a scalable and modular alternative to attention-based architectures for whole-slide image analysis, demonstrating that Mamba-based state-space models can effectively handle multi-scale visual reasoning with linear-time complexity.

Abstract: We introduce Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE), the first \textit{purely Mamba-based} multi-state multiple instance learning (MIL) framework for whole-slide image (WSI) analysis. MARBLE processes multiple magnification levels in parallel and integrates coarse-to-fine reasoning within a linear-time state-space model, efficiently capturing cross-scale dependencies with minimal parameter overhead. WSI analysis remains challenging due to gigapixel resolutions and hierarchical magnifications, while existing MIL methods typically operate at a single scale and transformer-based approaches suffer from quadratic attention costs. By coupling parallel multi-scale processing with linear-time sequence modeling, MARBLE provides a scalable and modular alternative to attention-based architectures. Experiments on five public datasets show improvements of up to \textbf{6.9%} in AUC, \textbf{20.3%} in accuracy, and \textbf{2.3%} in C-index, establishing MARBLE as an efficient and generalizable framework for multi-scale WSI analysis.

[148] SRA-Seg: Synthetic to Real Alignment for Semi-Supervised Medical Image Segmentation

OFM Riaz Rahman Aranya, Kevin Desai

Main category: cs.CV

TL;DR: SRA-Seg bridges the domain gap between synthetic and real medical images for segmentation by aligning feature distributions using DINOv2 embeddings and soft edge blending, achieving strong performance with minimal labeled real data.

Details

Motivation: Synthetic medical images fail to improve segmentation despite visual realism due to domain gaps between synthetic and real feature spaces that current semi-supervised methods cannot bridge.

Method: Proposes SRA-Seg with similarity-alignment loss using frozen DINOv2 embeddings to pull synthetic representations toward nearest real counterparts, soft edge blending for smooth anatomical transitions, and pseudo-labeling via EMA teacher model with soft-segmentation losses.

Result: Using only 10% labeled real data and 90% synthetic unlabeled data, achieves 89.34% Dice on ACDC and 84.42% on FIVES, outperforming existing semi-supervised methods and matching performance of methods using real unlabeled data.

Conclusion: SRA-Seg effectively bridges the synthetic-real domain gap for medical image segmentation, enabling effective use of synthetic data with minimal real annotations.

Abstract: Synthetic data, an appealing alternative to extensive expert-annotated data for medical image segmentation, consistently fails to improve segmentation performance despite its visual realism. The reason being that synthetic and real medical images exist in different semantic feature spaces, creating a domain gap that current semi-supervised learning methods cannot bridge. We propose SRA-Seg, a framework explicitly designed to align synthetic and real feature distributions for medical image segmentation. SRA-Seg introduces a similarity-alignment (SA) loss using frozen DINOv2 embeddings to pull synthetic representations toward their nearest real counterparts in semantic space. We employ soft edge blending to create smooth anatomical transitions and continuous labels, eliminating the hard boundaries from traditional copy-paste augmentation. The framework generates pseudo-labels for synthetic images via an EMA teacher model and applies soft-segmentation losses that respect uncertainty in mixed regions. Our experiments demonstrate strong results: using only 10% labeled real data and 90% synthetic unlabeled data, SRA-Seg achieves 89.34% Dice on ACDC and 84.42% on FIVES, significantly outperforming existing semi-supervised methods and matching the performance of methods using real unlabeled data.

[149] LEVIO: Lightweight Embedded Visual Inertial Odometry for Resource-Constrained Devices

Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini

Main category: cs.CV

TL;DR: LEVIO: A lightweight visual-inertial odometry system optimized for ultra-low-power hardware, achieving real-time 6-DoF tracking at 20 FPS with <100 mW power consumption.

Details

Motivation: Existing VIO systems are too computationally demanding for resource-constrained platforms like micro-drones and smart glasses, creating a need for efficient motion tracking solutions that can run on low-power hardware.

Method: Hardware-software co-optimized VIO pipeline using ORB feature tracking and bundle adjustment, with emphasis on computational efficiency through parallelization, low memory usage, and optimization for embedded microcontrollers and low-power SoCs.

Result: Achieves 20 FPS real-time performance while consuming less than 100 mW on a parallel-processing ultra-low-power RISC-V SoC, with competitive accuracy on public VIO datasets.

Conclusion: LEVIO demonstrates that efficient VIO is feasible on ultra-low-power hardware, enabling real-time motion tracking for resource-constrained applications like micro-drones and AR glasses.

Abstract: Accurate, infrastructure-less sensor systems for motion tracking are essential for mobile robotics and augmented reality (AR) applications. The most popular state-of-the-art visual-inertial odometry (VIO) systems, however, are too computationally demanding for resource-constrained hardware, such as micro-drones and smart glasses. This work presents LEVIO, a fully featured VIO pipeline optimized for ultra-low-power compute platforms, allowing six-degrees-of-freedom (DoF) real-time sensing. LEVIO incorporates established VIO components such as Oriented FAST and Rotated BRIEF (ORB) feature tracking and bundle adjustment, while emphasizing a computationally efficient architecture with parallelization and low memory usage to suit embedded microcontrollers and low-power systems-on-chip (SoCs). The paper proposes and details the algorithmic design choices and the hardware-software co-optimization approach, and presents real-time performance on resource-constrained hardware. LEVIO is validated on a parallel-processing ultra-low-power RISC-V SoC, achieving 20 FPS while consuming less than 100 mW, and benchmarked against public VIO datasets, offering a compelling balance between efficiency and accuracy. To facilitate reproducibility and adoption, the complete implementation is released as open-source.

[150] Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

Yihong Huang, Fei Ma, Yihua Shao, Jingcai Guo, Zitong Yu, Laizhong Cui, Qi Tian

Main category: cs.CV

TL;DR: Nüwa is a two-stage token pruning framework for Vision Language Models that maintains spatial integrity for better performance on visual grounding tasks while accelerating inference.

Details

Motivation: Existing vision token pruning methods work well for VQA but degrade significantly on visual grounding tasks because they lose global spatial reference frames derived from token positional information interactions.

Method: Two-stage framework: 1) After vision encoder, apply separation, alignment, and aggregation operations inspired by swarm intelligence to retain information-rich global spatial anchors; 2) Within LLM, perform text-guided pruning to retain task-relevant visual tokens.

Result: Achieves SOTA performance on multiple VQA benchmarks (94% to 95%) and substantial improvements on visual grounding tasks (7% to 47%).

Conclusion: Nüwa enables efficient feature aggregation while maintaining spatial integrity, addressing limitations of existing pruning methods for multimodal tasks requiring spatial understanding.

Abstract: Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) and suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM’s processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens’ positional information. Motivated by these findings, we propose $\text{Nüwa}$, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that $\text{Nüwa}$ achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).

[151] Efficient Sequential Neural Network with Spatial-Temporal Attention and Linear LSTM for Robust Lane Detection Using Multi-Frame Images

Sandeep Patil, Yongqi Dong, Haneen Farah, Hans Hellendoorn

Main category: cs.CV

TL;DR: A novel sequential neural network with spatial-temporal attention mechanism for lane detection that focuses on key lane features and exploits temporal correlations across frames, achieving state-of-the-art performance with computational efficiency.

Details

Motivation: Current lane detection methods lack versatility in delivering accurate, robust, and real-time performance, especially vision-based methods that neglect critical image regions and spatial-temporal salience, leading to poor performance in challenging conditions like occlusion and lighting variations.

Method: Proposes a sequential neural network model with spatial-temporal attention mechanism built on encoder-decoder structure with common neural network backbones. The attention mechanism focuses on key lane features and exploits salient spatial-temporal correlations among continuous image frames.

Result: Outperforms state-of-the-art methods on three large-scale datasets, demonstrates robustness in various testing scenarios, and achieves computational efficiency with fewer parameters and reduced MACs compared to baseline sequential models.

Conclusion: The proposed spatial-temporal attention mechanism effectively enhances lane detection performance by focusing on critical features and exploiting temporal correlations, offering both accuracy and computational efficiency for real-time applications.

Abstract: Lane detection is a crucial perception task for all levels of automated vehicles (AVs) and Advanced Driver Assistance Systems, particularly in mixed-traffic environments where AVs must interact with human-driven vehicles (HDVs) and challenging traffic scenarios. Current methods lack versatility in delivering accurate, robust, and real-time compatible lane detection, especially vision-based methods often neglect critical regions of the image and their spatial-temporal (ST) salience, leading to poor performance in difficult circumstances such as serious occlusion and dazzle lighting. This study introduces a novel sequential neural network model with a spatial-temporal attention mechanism to focus on key features of lane lines and exploit salient ST correlations among continuous image frames. The proposed model, built on a standard encoder-decoder structure and common neural network backbones, is trained and evaluated on three large-scale open-source datasets. Extensive experiments demonstrate the strength and robustness of the proposed model, outperforming state-of-the-art methods in various testing scenarios. Furthermore, with the ST attention mechanism, the developed sequential neural network models exhibit fewer parameters and reduced Multiply-Accumulate Operations (MACs) compared to baseline sequential models, highlighting their computational efficiency. Relevant data, code, and models are released at https://doi.org/10.4121/4619cab6-ae4a-40d5-af77-582a77f3d821.

[152] TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation

OFM Riaz Rahman Aranya, Kevin Desai

Main category: cs.CV

TL;DR: TRACE is a vision-language model for temporal comparison of chest X-rays that jointly performs change detection, classification, and spatial localization of anatomical changes.

Details

Motivation: Temporal comparison of chest X-rays is crucial in clinical radiology for tracking disease progression and treatment response, but existing vision-language models lack combined capabilities for temporal change detection with spatial grounding.

Method: Introduces TRACE, a model that takes prior and current chest X-rays as input and jointly learns temporal comparison, change classification (worsened/improved/stable), and spatial localization via bounding box coordinates for each finding.

Result: TRACE achieves over 90% grounding accuracy and demonstrates that change detection emerges only when temporal comparison and spatial grounding are jointly learned, suggesting grounding provides essential spatial attention for temporal reasoning.

Conclusion: TRACE establishes a foundation for temporal radiology analysis by combining change detection with spatial grounding, revealing that joint learning of these capabilities is essential for meaningful temporal reasoning in medical imaging.

Abstract: Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.

[153] Dynamic High-frequency Convolution for Infrared Small Target Detection

Ruojing Li, Chao Xiao, Qian Yin, Wei An, Nuo Chen, Xinyi Ying, Miao Li, Yingqian Wang

Main category: cs.CV

TL;DR: Dynamic High-Frequency Convolution (DHiF) for infrared small target detection, generating dynamic local filter banks to discriminatively model high-frequency components for better target-clutter separation.

Details

Motivation: Current learning-based methods for single-frame infrared small target (SIRST) detection neglect explicit modeling and discriminative representation learning of various high-frequency components (HFCs), which is crucial for distinguishing targets from other HFCs like bright corners and broken clouds.

Method: Proposes DHiF that translates discriminative modeling into generation of dynamic local filter banks. DHiF is sensitive to HFCs due to dynamic parameters adjusted within zero-centered range according to Fourier transformation properties. Works as drop-in replacement for standard convolution in any SIRST detection network.

Result: Extensive experiments on real-scene datasets show DHiF exhibits superior detection performance compared to other state-of-the-art convolution operations with promising improvement.

Conclusion: DHiF effectively addresses the challenge of distinguishing infrared small targets from other high-frequency components through dynamic filter generation and discriminative representation learning.

Abstract: Infrared small targets are typically tiny and locally salient, which belong to high-frequency components (HFCs) in images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutters. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. Especially, DHiF is sensitive to HFCs, owing to the dynamic parameters of its generated filters being symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combining with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement. Codes are available at https://github.com/TinaLRJ/DHiF.

[154] Fisheye Stereo Vision: Depth and Range Error

Leaf Jiang, Matthew Holzel, Bernhard Kaplan, Hsiou-Yuan Liu, Sabyasachi Paul, Karen Rankin, Piotr Swierczynski

Main category: cs.CV

TL;DR: Analytical expressions derived for depth and range error in fisheye stereo vision systems, focusing on accuracy at large angles.

Details

Motivation: Fisheye stereo vision systems are used for wide field-of-view applications, but existing error analysis often doesn't properly account for accuracy degradation at large angles. Understanding these error characteristics is crucial for applications requiring precise depth estimation across the entire field of view.

Method: The study derives analytical mathematical expressions for depth and range error in fisheye stereo vision systems as a function of object distance, specifically modeling the error behavior at large incidence angles where traditional pinhole camera models break down.

Result: The derived expressions provide a theoretical framework for understanding how depth estimation accuracy degrades with increasing object distance and incidence angle in fisheye stereo systems, enabling better error prediction and system design.

Conclusion: The analytical error expressions for fisheye stereo vision provide valuable tools for system design and performance evaluation, particularly for applications requiring accurate depth estimation across wide fields of view.

Abstract: This study derives analytical expressions for the depth and range error of fisheye stereo vision systems as a function of object distance, specifically accounting for accuracy at large angles.

[155] SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences

Seok-Young Kim, Dooyoung Kim, Woojin Cho, Hail Song, Suji Kang, Woontack Woo

Main category: cs.CV

TL;DR: SceneLinker generates compositional 3D scenes from RGB sequences using semantic scene graphs for Mixed Reality applications, outperforming state-of-the-art methods in complex indoor environments.

Details

Motivation: To enable adaptive Mixed Reality content based on users' physical spaces, there's a need to generate 3D scenes that reflect real-world layouts by capturing semantic cues. Previous methods struggled with capturing contextual relationships between objects or focused mainly on shape synthesis, making it difficult to generate 3D scenes aligned with object arrangements.

Method: Proposes a graph network with cross-check feature attention for scene graph prediction, and constructs a graph-variational autoencoder (graph-VAE) with a joint shape and layout block for 3D scene generation from semantic scene graphs.

Result: Outperforms state-of-the-art methods on 3RScan/3DSSG and SG-FRONT datasets in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints.

Conclusion: SceneLinker enables users to generate consistent 3D spaces from physical environments via scene graphs, facilitating the creation of spatial Mixed Reality content that adapts to individual user spaces.

Abstract: We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user’s space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works struggled to fully capture the contextual relationship between objects or mainly focused on synthesizing diverse shapes, making it challenging to generate 3D scenes aligned with object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and constructing a graph-variational autoencoder (graph-VAE), which consists of a joint shape and layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. Project page is https://scenelinker2026.github.io.

[156] Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding

Byeongju Woo, Zilin Wang, Byeonghyun Pak, Sangwoo Mo, Stella X. Yu

Main category: cs.CV

TL;DR: CAFT is a hierarchical vision-language representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision, achieving SOTA on long-text retrieval benchmarks.

Details

Motivation: Current vision-language models like CLIP struggle with long captions because they treat images and texts as undifferentiated wholes, lacking fine-grained hierarchical alignment between visual and textual domains.

Method: Proposes CAFT with a fine-to-coarse visual encoder and hierarchical text transformer, using hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences.

Result: Achieves state-of-the-art performance on six long-text retrieval benchmarks, exhibits strong scaling behavior, and enables fine-grained visually grounded representations to emerge without explicit region-level supervision.

Conclusion: Hierarchical cross-domain alignment enables effective fine-grained vision-language understanding for long captions without requiring pixel-level supervision.

Abstract: Large vision-language models such as CLIP struggle with long captions because they align images and texts as undifferentiated wholes. Fine-grained vision-language understanding requires hierarchical semantics capturing both global context and localized details across visual and textual domains. Yet linguistic hierarchies from syntax or semantics rarely match visual organization, and purely visual hierarchies tend to fragment scenes into appearance-driven parts without semantic focus. We propose CAFT (Cross-domain Alignment of Forests and Trees), a hierarchical image-text representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision. Coupling a fine-to-coarse visual encoder with a hierarchical text transformer, it uses a hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences, so that coarse semantics are built from fine-grained evidence rather than from aggregation untethered to part-level grounding. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that hierarchical cross-domain alignment enables fine-grained, visually grounded image-text representations to emerge without explicit region-level supervision.

[157] SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation

Zhanfeng Liao, Jiajun Zhang, Hanzhang Tu, Zhixi Wang, Yunqi Gao, Hongwen Zhang, Yebin Liu

Main category: cs.CV

TL;DR: SharpTimeGS introduces a lifespan-aware 4D Gaussian framework for dynamic scene novel view synthesis, using learnable lifespan parameters to balance static and dynamic region modeling while enabling real-time 4K rendering.

Details

Motivation: Existing Gaussian-based methods struggle to balance long-term static and short-term dynamic regions in both representation and optimization for dynamic scene novel view synthesis.

Method: Proposes a lifespan-aware 4D Gaussian framework with learnable lifespan parameters that reformulate temporal visibility from Gaussian decay to flat-top profiles, modulate motion based on lifespan, and use lifespan-velocity-aware densification for balanced optimization.

Result: Achieves state-of-the-art performance on multiple benchmarks while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.

Conclusion: SharpTimeGS effectively balances static and dynamic region modeling through lifespan-aware optimization, enabling high-quality real-time 4D reconstruction of dynamic scenes.

Abstract: Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitives’ motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.

[158] Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, Jian Luan

Main category: cs.CV

TL;DR: Video-OPD: An efficient on-policy distillation framework for Temporal Video Grounding that replaces sparse-reward RL with dense token-level supervision from a frontier teacher, achieving better performance with faster convergence.

Details

Motivation: Existing RL methods for Temporal Video Grounding suffer from sparse reward signals and high computational costs. The authors aim to develop a more efficient post-training framework that maintains on-policy optimization while providing denser supervision.

Method: Proposes Video-OPD framework that optimizes trajectories sampled from current policy using reverse KL divergence with dense token-level supervision from a frontier teacher. Also introduces Teacher-Validated Disagreement Focusing (TVDF) curriculum that prioritizes teacher-reliable and informative trajectories.

Result: Video-OPD consistently outperforms GRPO-based RL methods while achieving substantially faster convergence and lower computational cost across empirical evaluations.

Conclusion: On-policy distillation is an effective alternative to conventional reinforcement learning for Temporal Video Grounding, offering better performance with improved efficiency through dense supervision and curriculum learning.

Abstract: Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.

[159] VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering

Rahul Atul Bhope, K. R. Jayaram, Vinod Muthusamy, Ritesh Kumar, Vatche Isahagian, Nalini Venkatasubramanian

Main category: cs.CV

TL;DR: VOILA is a framework for adaptive fidelity selection in Visual Question Answering that optimizes what visual information to retrieve before model execution, achieving 50-60% cost reductions while retaining 90-95% of full-resolution accuracy.

Details

Motivation: Most multimodal vision-language systems operate at fixed fidelity levels despite significant costs from retrieving and processing high-fidelity visual inputs. There's a need to optimize what information to retrieve before model execution to reduce computational costs while maintaining accuracy.

Method: Two-stage pipeline: 1) gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, 2) isotonic calibrator refines probabilities for reliable decision-making. System selects minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs.

Result: Evaluated across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six VLMs with 7B-235B parameters. Consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures.

Conclusion: Pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints. VOILA demonstrates that adaptive fidelity selection can significantly reduce costs while maintaining high accuracy in vision-language tasks.

Abstract: Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.

[160] Thinking inside the Convolution for Image Inpainting: Reconstructing Texture via Structure under Global and Local Side

Haipeng Liu, Yang Wang, Biao Qian, Yong Rui, Meng Wang

Main category: cs.CV

TL;DR: A novel image inpainting method that addresses information loss during convolutional downsampling by using statistical normalization/denormalization of structure and texture feature maps to guide reconstruction.

Details

Motivation: Current image inpainting methods using encoder-decoder pipelines with convolutional downsampling suffer from information loss in both structure and texture feature maps, leading to suboptimal upsampling outputs. The paper aims to systematically address whether and how structure and texture features can mutually help alleviate this information loss.

Method: Proposes using statistical normalization and denormalization strategy for reconstruction guidance during convolutional downsampling. Given structure and texture feature maps, the method leverages their mutual relationships to preserve information through the downsampling process.

Result: Extensive experiments show advantages over state-of-the-art methods on images from low-to-high resolutions (256×256 and 512×512). The method is particularly effective when substituting all encoders with the proposed approach.

Conclusion: The proposed statistical normalization/denormalization strategy effectively addresses information loss in convolutional downsampling for image inpainting, improving reconstruction quality across various resolutions.

Abstract: Image inpainting has earned substantial progress, owing to the encoder-and-decoder pipeline, which is benefited from the Convolutional Neural Networks (CNNs) with convolutional downsampling to inpaint the masked regions semantically from the known regions within the encoder, coupled with an upsampling process from the decoder for final inpainting output. Recent studies intuitively identify the high-frequency structure and low-frequency texture to be extracted by CNNs from the encoder, and subsequently for a desirable upsampling recovery. However, the existing arts inevitably overlook the information loss for both structure and texture feature maps during the convolutional downsampling process, hence suffer from a non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature map can mutually help to alleviate the information loss during the convolutional downsampling. Given the structure and texture feature maps, we adopt the statistical normalization and denormalization strategy for the reconstruction guidance during the convolutional downsampling process. The extensive experimental results validate its advantages to the state-of-the-arts over the images from low-to-high resolutions including 256256 and 512512, especially holds by substituting all the encoders by ours. Our code is available at https://github.com/htyjers/ConvInpaint-TSGL

[161] L2M-Reg: Building-level Uncertainty-aware Registration of Outdoor LiDAR Point Clouds and Semantic 3D City Models

Ziyang Xu, Benedikt Schwab, Yihui Yang, Thomas H. Kolbe, Christoph Holst

Main category: cs.CV

TL;DR: L2M-Reg: A plane-based fine registration method for LiDAR point clouds to semantic 3D city models that explicitly handles model uncertainty at building level.

Details

Motivation: Accurate registration between LiDAR point clouds and semantic 3D city models is crucial for urban digital twinning and downstream tasks, but existing methods struggle with generalization uncertainty in LoD2 models at individual building level.

Method: Three-step approach: 1) Establish reliable plane correspondence, 2) Build pseudo-plane-constrained Gauss-Helmert model, 3) Adaptively estimate vertical translation to handle model uncertainty.

Result: Extensive experiments on five real-world datasets show L2M-Reg is more accurate and computationally efficient than current leading ICP-based and plane-based methods.

Conclusion: L2M-Reg provides a novel building-level solution for LiDAR-to-Model registration when model uncertainty is present, with open-sourced datasets and code.

Abstract: Accurate registration between LiDAR (Light Detection and Ranging) point clouds and semantic 3D city models is a fundamental topic in urban digital twinning and a prerequisite for downstream tasks, such as digital construction, change detection, and model refinement. However, achieving accurate LiDAR-to-Model registration at the individual building level remains challenging, particularly due to the generalization uncertainty in semantic 3D city models at the Level of Detail 2 (LoD2). This paper addresses this gap by proposing L2M-Reg, a plane-based fine registration method that explicitly accounts for model uncertainty. L2M-Reg consists of three key steps: establishing reliable plane correspondence, building a pseudo-plane-constrained Gauss-Helmert model, and adaptively estimating vertical translation. Overall, extensive experiments on five real-world datasets demonstrate that L2M-Reg is both more accurate and computationally efficient than current leading ICP-based and plane-based methods. Therefore, L2M-Reg provides a novel building-level solution regarding LiDAR-to-Model registration when model uncertainty is present. The datasets and code for L2M-Reg can be found: https://github.com/Ziyang-Geodesy/L2M-Reg.

[162] A Vision-Based Analysis of Congestion Pricing in New York City

Mehmet Kerem Turkcan, Jhonatan Tavori, Javad Ghaderi, Gil Zussman, Zoran Kostic, Andrew Smyth

Main category: cs.CV

TL;DR: Computer vision analysis of NYC traffic camera data shows systematic changes in vehicle density following congestion pricing implementation from Jan 2025 to Jan 2026.

Details

Motivation: To objectively measure the impact of NYC's congestion pricing program using automated analysis of traffic camera data, establishing baseline patterns and identifying systematic changes in vehicle density.

Method: Developed a computer vision pipeline to process footage from over 900 traffic cameras throughout Manhattan and NYC, comparing traffic patterns from November 2024 (pre-implementation baseline) through the program’s implementation in January 2025 until January 2026.

Result: Identified systematic changes in vehicle density across the monitored region following congestion pricing implementation, establishing clear before-and-after patterns through automated analysis.

Conclusion: Automated computer vision analysis of traffic camera data provides an effective method for measuring the real-world impact of urban transportation policies like congestion pricing.

Abstract: We examine the impact of New York City’s congestion pricing program through automated analysis of traffic camera data. Our computer vision pipeline processes footage from over 900 cameras distributed throughout Manhattan and New York, comparing traffic patterns from November 2024 through the program’s implementation in January 2025 until January 2026. We establish baseline traffic patterns and identify systematic changes in vehicle density across the monitored region.

[163] MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration

Wenzhang Sun, Zhenyu Wang, Zhangchi Hu, Chunfeng Wang, Hao Li, Wei Chen

Main category: cs.CV

TL;DR: MUSE is a multi-agent framework for generating coherent long-form audio-visual stories from short prompts using an iterative plan-execute-verify-revise loop with explicit controls over identity, composition, and temporal continuity.

Details

Motivation: Existing approaches for generating long-form audio-visual stories suffer from semantic drift and identity inconsistency due to feed-forward pipelines or prompt-only refinement, creating an intent-execution gap where narrative intent must be preserved across coherent shot-level multimodal generation over long horizons.

Method: MUSE formulates storytelling as a closed-loop constraint enforcement problem using a multi-agent framework with iterative plan-execute-verify-revise loop. It translates narrative intent into explicit machine-executable controls over identity, spatial composition, and temporal continuity, applying targeted multimodal feedback to correct violations during generation.

Result: Experiments show MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared to representative baselines. The paper also introduces MUSEBench, a reference-free evaluation protocol validated by human judgments.

Conclusion: MUSE effectively addresses the intent-execution gap in long-form audio-visual storytelling through closed-loop constraint enforcement and multimodal feedback, demonstrating superior coherence and consistency over existing approaches.

Abstract: Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.

[164] Bongards at the Boundary of Perception and Reasoning: Programs or Language?

Cassidy Langenfeld, Claas Beger, Gloria Geng, Wasu Top Piriyakulkij, Keya Hu, Yewen Pu, Kevin Ellis

Main category: cs.CV

TL;DR: A neurosymbolic approach using LLMs and Bayesian optimization to solve Bongard problems, which test visual reasoning in novel situations

Details

Motivation: While VLMs excel at everyday visual tasks, they struggle with the novel visual reasoning challenges posed by Bongard problems, which require reasoning in radically new situations that humans can handle

Method: Neurosymbolic approach: use LLMs to generate parameterized programmatic representations for hypothesized solution rules, then perform parameter fitting using Bayesian optimization

Result: Method evaluated on classifying Bongard problem images given ground truth rules, as well as solving problems from scratch

Conclusion: The approach demonstrates a way to tackle challenging visual reasoning problems that require generalization to novel situations beyond typical VLM capabilities

Abstract: Vision-Language Models (VLMs) have made great strides in everyday visual tasks, such as captioning a natural image, or answering commonsense questions about such images. But humans possess the puzzling ability to deploy their visual reasoning abilities in radically new situations, a skill rigorously tested by the classic set of visual reasoning challenges known as the Bongard problems. We present a neurosymbolic approach to solving these problems: given a hypothesized solution rule for a Bongard problem, we leverage LLMs to generate parameterized programmatic representations for the rule and perform parameter fitting using Bayesian optimization. We evaluate our method on classifying Bongard problem images given the ground truth rule, as well as on solving the problems from scratch.

[165] HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency

Geonhui Son, Jeong Ryong Lee, Dosik Hwang

Main category: cs.CV

TL;DR: HP-GAN improves image generation quality and diversity by leveraging pretrained networks with self-supervised learning (FakeTwins) and enforcing consistency between CNN and ViT-based discriminators.

Details

Motivation: Current GAN methods use pretrained networks for perceptual losses, but don't fully exploit their potential. The authors aim to better utilize neural network priors through self-supervised learning and discriminator consistency to improve image synthesis quality and diversity.

Method: Two main strategies: 1) FakeTwins uses pretrained networks as encoders to compute self-supervised loss applied through generated images to train the generator. 2) Discriminator consistency enforces alignment between discriminators that evaluate feature maps from CNN and ViT networks to promote coherent learning and training robustness.

Result: Extensive evaluation across 17 datasets (large, small, limited data, various domains) shows HP-GAN consistently outperforms state-of-the-art methods in FID scores, achieving significant improvements in image diversity and quality.

Conclusion: HP-GAN effectively exploits neural network priors through self-supervised learning and discriminator consistency, leading to superior image generation performance across diverse datasets and conditions.

Abstract: Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets-including scenarios with large, small, and limited data, and covering a variety of image domains-demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: https://github.com/higun2/HP-GAN.

[166] IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning

Zhichao Sun, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu

Main category: cs.CV

TL;DR: IVC-Prune: A training-free pruning method for LVLMs that retains both implicit visual coordinate tokens (identified via RoPE analysis) and semantically relevant foreground tokens, reducing visual tokens by ~50% while maintaining ≥99% of original performance.

Details

Motivation: LVLMs have prohibitive inference costs with high-resolution visual inputs. Existing pruning methods focus on semantic relevance but discard tokens crucial for spatial reasoning, creating a need for methods that preserve both semantic and spatial information.

Method: IVC-Prune identifies implicit visual coordinate (IVC) tokens by analyzing RoPE’s mathematical properties (positions where rotation matrices approximate identity or 90° rotation). Foreground tokens are identified via two-stage process: semantic seed discovery followed by contextual refinement using value-vector similarity. The method is training-free and prompt-aware.

Result: Extensive evaluations across 4 LVLMs and 20 benchmarks show IVC-Prune reduces visual tokens by ~50% while maintaining ≥99% of original performance, with improvements on several benchmarks.

Conclusion: The paper introduces a novel insight about LVLMs’ implicit visual coordinate systems through RoPE and proposes an effective pruning strategy that preserves both spatial reasoning capabilities and semantic relevance, significantly reducing computational costs.

Abstract: Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into \emph{how LVLMs process spatial reasoning}. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as \textbf{implicit visual coordinates} (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose \textbf{IVC-Prune}, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50% while maintaining $\geq$ 99% of the original performance and even achieving improvements on several benchmarks. Source codes are available at https://github.com/FireRedTeam/IVC-Prune.

[167] JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics

Sandika Biswas, Kian Izadpanah, Hamid Rezatofighi

Main category: cs.CV

TL;DR: JRDB-Pose3D: A large-scale 3D human pose estimation dataset captured from mobile robots in crowded indoor/outdoor environments with SMPL-based annotations, track IDs, and rich contextual information.

Details

Motivation: Real-world scenes are crowded with multiple humans, but existing 3D pose datasets focus on single-person or controlled lab environments, limiting their applicability to real-world scenarios like autonomous driving and human-robot interaction.

Method: Introduces JRDB-Pose3D dataset captured from mobile robotic platforms in complex indoor/outdoor environments, providing SMPL-based 3D pose annotations with consistent body-shape parameters, track IDs, and inheriting all JRDB dataset annotations including 2D poses, social grouping, activities, and demographic information.

Result: Dataset contains 5-10 human poses per frame on average, with some scenes featuring up to 35 individuals simultaneously, presenting challenges like occlusions, truncated bodies, and out-of-frame body parts that reflect real-world complexity.

Conclusion: JRDB-Pose3D bridges the gap between controlled lab datasets and real-world applications by providing comprehensive 3D human pose data in crowded environments, enabling research in multi-human perception and human-centric understanding tasks.

Abstract: Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments. Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.

[168] Finding Optimal Video Moment without Training: Gaussian Boundary Optimization for Weakly Supervised Video Grounding

Sunoh Kim, Kimin Yun, Daeho Um

Main category: cs.CV

TL;DR: GBO is a novel inference framework for weakly supervised temporal video grounding that optimizes segment boundaries by solving a principled optimization problem balancing proposal coverage and compactness, achieving state-of-the-art results without training.

Details

Motivation: Existing weakly supervised temporal video grounding methods use Gaussian-based proposals but rely on heuristic mappings from Gaussian parameters to segment boundaries, leading to suboptimal localization performance. There's a need for a more principled inference approach.

Method: Proposes Gaussian Boundary Optimization (GBO), a training-free inference framework that predicts segment boundaries by solving an optimization problem that balances proposal coverage and segment compactness. Derives closed-form solution and analyzes optimality conditions under varying penalty regimes.

Result: GBO significantly improves localization performance, achieving state-of-the-art results across standard benchmarks. It’s efficient, generalizable across various proposal schemes, and compatible with both single-Gaussian and mixture-based architectures.

Conclusion: GBO provides a principled optimization-based inference framework for weakly supervised temporal video grounding that overcomes limitations of heuristic boundary mapping, offering both theoretical foundations and practical advantages for improved video segment localization.

Abstract: Weakly supervised temporal video grounding aims to localize query-relevant segments in untrimmed videos using only video-sentence pairs, without requiring ground-truth segment annotations that specify exact temporal boundaries. Recent approaches tackle this task by utilizing Gaussian-based temporal proposals to represent query-relevant segments. However, their inference strategies rely on heuristic mappings from Gaussian parameters to segment boundaries, resulting in suboptimal localization performance. To address this issue, we propose Gaussian Boundary Optimization (GBO), a novel inference framework that predicts segment boundaries by solving a principled optimization problem that balances proposal coverage and segment compactness. We derive a closed-form solution for this problem and rigorously analyze the optimality conditions under varying penalty regimes. Beyond its theoretical foundations, GBO offers several practical advantages: it is training-free and compatible with both single-Gaussian and mixture-based proposal architectures. Our experiments show that GBO significantly improves localization, achieving state-of-the-art results across standard benchmarks. Extensive experiments demonstrate the efficiency and generalizability of GBO across various proposal schemes. The code is available at \href{https://github.com/sunoh-kim/gbo}{https://github.com/sunoh-kim/gbo}.

[169] A generalizable large-scale foundation model for musculoskeletal radiographs

Shinn Kim, Soobin Lee, Kyoungseob Shin, Han-Soo Kim, Yongsung Kim, Minsu Kim, Juhong Nam, Somang Ko, Daeheon Kwon, Wook Huh, Ilkyu Han, Sunghoon Kwon

Main category: cs.CV

TL;DR: SKELEX is a large-scale foundation model for musculoskeletal radiographs trained on 1.2M images using self-supervised learning, showing strong performance on 12 diagnostic tasks and zero-shot abnormality localization.

Details

Motivation: Existing AI models for musculoskeletal disease detection are task-specific, annotation-dependent, and lack generalizability across diseases and anatomical regions. There's a clinical need for a generalizable foundation model, but public datasets are limited in size and diversity.

Method: Developed SKELEX using self-supervised learning on 1.2 million diverse musculoskeletal radiographs. The model was evaluated on 12 downstream diagnostic tasks including fracture detection, osteoarthritis grading, and bone tumor classification. Also demonstrated zero-shot abnormality localization and developed an interpretable, region-guided model for bone tumor prediction.

Result: SKELEX generally outperformed baselines on diagnostic tasks and demonstrated zero-shot abnormality localization without task-specific training. The bone tumor model maintained robust performance on independent external datasets and was deployed as a publicly accessible web application.

Conclusion: SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for clinical translation and data-efficient research in musculoskeletal radiology.

Abstract: Artificial intelligence (AI) has shown promise in detecting and characterizing musculoskeletal diseases from radiographs. However, most existing models remain task-specific, annotation-dependent, and limited in generalizability across diseases and anatomical regions. Although a generalizable foundation model trained on large-scale musculoskeletal radiographs is clinically needed, publicly available datasets remain limited in size and lack sufficient diversity to enable training across a wide range of musculoskeletal conditions and anatomical sites. Here, we present SKELEX, a large-scale foundation model for musculoskeletal radiographs, trained using self-supervised learning on 1.2 million diverse, condition-rich images. The model was evaluated on 12 downstream diagnostic tasks and generally outperformed baselines in fracture detection, osteoarthritis grading, and bone tumor classification. Furthermore, SKELEX demonstrated zero-shot abnormality localization, producing error maps that identified pathologic regions without task-specific training. Building on this capability, we developed an interpretable, region-guided model for predicting bone tumors, which maintained robust performance on independent external datasets and was deployed as a publicly accessible web application. Overall, SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for both clinical translation and data-efficient research in musculoskeletal radiology.

[170] Gromov Wasserstein Optimal Transport for Semantic Correspondences

Francis Snelgar, Stephen Gould, Ming Xu, Liang Zheng, Akshay Asthana

Main category: cs.CV

TL;DR: Replacing Stable Diffusion features with optimal transport matching improves semantic correspondence performance while being 5-10x more efficient than ensemble methods.

Details

Motivation: Current state-of-the-art semantic correspondence methods combine DINOv2 and Stable Diffusion features, but this ensemble approach is computationally expensive. The authors seek a more efficient alternative that maintains spatial consistency without using Stable Diffusion.

Method: Replace Stable Diffusion features with an optimal transport algorithm that includes a Gromov-Wasserstein spatial smoothness prior, combined with DINOv2 features for semantic matching.

Result: Significantly boosts DINOv2 baseline performance, achieves competitive or superior results compared to Stable Diffusion ensemble methods, while being 5-10x more computationally efficient.

Conclusion: Optimal transport with spatial priors can effectively replace Stable Diffusion features for semantic correspondence, providing better efficiency without sacrificing performance.

Abstract: Establishing correspondences between image pairs is a long studied problem in computer vision. With recent large-scale foundation models showing strong zero-shot performance on downstream tasks including classification and segmentation, there has been interest in using the internal feature maps of these models for the semantic correspondence task. Recent works observe that features from DINOv2 and Stable Diffusion (SD) are complementary, the former producing accurate but sparse correspondences, while the latter produces spatially consistent correspondences. As a result, current state-of-the-art methods for semantic correspondence involve combining features from both models in an ensemble. While the performance of these methods is impressive, they are computationally expensive, requiring evaluating feature maps from large-scale foundation models. In this work we take a different approach, instead replacing SD features with a superior matching algorithm which is imbued with the desirable spatial consistency property. Specifically, we replace the standard nearest neighbours matching with an optimal transport algorithm that includes a Gromov Wasserstein spatial smoothness prior. We show that we can significantly boost the performance of the DINOv2 baseline, and be competitive and sometimes surpassing state-of-the-art methods using Stable Diffusion features, while being 5–10x more efficient. We make code available at https://github.com/fsnelgar/semantic_matching_gwot .

[171] Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models

Judah Goldfeder, Shreyes Kaliyur, Vaibhav Sourirajan, Patrick Minwan Puma, Philippe Martin Wyder, Yuhang Hu, Jiong Lin, Hod Lipson

Main category: cs.CV

TL;DR: EvoAug is an automated augmentation learning pipeline that uses generative models and evolutionary algorithms to learn optimal task-specific image augmentations through hierarchical stochastic augmentation trees.

Details

Motivation: Traditional data augmentation methods like cropping/rotation are limited, while modern generative models (diffusion, NeRFs) can create more diverse/realistic augmentations but risk performance degradation if poorly matched to tasks. Need automated way to find optimal generative augmentations for specific tasks.

Method: EvoAug combines generative models (conditional diffusion, few-shot NeRFs) with efficient evolutionary algorithm to learn stochastic augmentation trees that hierarchically compose augmentations, enabling structured and adaptive transformations.

Result: Strong performance across fine-grained classification and few-shot learning tasks. Pipeline discovers augmentations that align with domain knowledge even in low-data settings.

Conclusion: Learned generative augmentations have significant potential for robust model training, unlocking new possibilities beyond traditional augmentation methods.

Abstract: Data augmentation has long been a cornerstone for reducing overfitting in vision models, with methods like AutoAugment automating the design of task-specific augmentations. Recent advances in generative models, such as conditional diffusion and few-shot NeRFs, offer a new paradigm for data augmentation by synthesizing data with significantly greater diversity and realism. However, unlike traditional augmentations like cropping or rotation, these methods introduce substantial changes that enhance robustness but also risk degrading performance if the augmentations are poorly matched to the task. In this work, we present EvoAug, an automated augmentation learning pipeline, which leverages these generative models alongside an efficient evolutionary algorithm to learn optimal task-specific augmentations. Our pipeline introduces a novel approach to image augmentation that learns stochastic augmentation trees that hierarchically compose augmentations, enabling more structured and adaptive transformations. We demonstrate strong performance across fine-grained classification and few-shot learning tasks. Notably, our pipeline discovers augmentations that align with domain knowledge, even in low-data settings. These results highlight the potential of learned generative augmentations, unlocking new possibilities for robust model training.

[172] Feature, Alignment, and Supervision in Category Learning: A Comparative Approach with Children and Neural Networks

Fanxiao Wani Qiu, Oscar Leong

Main category: cs.CV

TL;DR: Children and CNNs show different learning patterns in few-shot semi-supervised category learning, with children generalizing rapidly from minimal labels but showing feature biases, while CNNs benefit more from added supervision and are moderated by alignment and feature structure.

Details

Motivation: To understand how humans and machines learn from sparse data by comparing children and convolutional neural networks (CNNs) in few-shot semi-supervised category learning under identical conditions, examining interactions among supervision, feature structure, and perceptual alignment.

Method: Species-fair design comparing children and CNNs on few-shot semi-supervised category learning task with novel object categories. Both exposed to mixtures of labeled and unlabeled exemplars while varying supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low).

Result: Children generalize rapidly from minimal labels but show strong feature-specific biases and sensitivity to alignment. CNNs show different interaction profile: added supervision improves performance, but both alignment and feature structure moderate the impact additional supervision has on learning.

Conclusion: Human-model comparisons must be drawn under the right conditions, emphasizing interactions among supervision, feature structure, and alignment rather than overall accuracy. The study reveals fundamental differences in how biological and artificial systems learn from sparse data.

Abstract: Understanding how humans and machines learn from sparse data is central to cognitive science and machine learning. Using a species-fair design, we compare children and convolutional neural networks (CNNs) in a few-shot semi-supervised category learning task. Both learners are exposed to novel object categories under identical conditions. Learners receive mixtures of labeled and unlabeled exemplars while we vary supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low). We find that children generalize rapidly from minimal labels but show strong feature-specific biases and sensitivity to alignment. CNNs show a different interaction profile: added supervision improves performance, but both alignment and feature structure moderate the impact additional supervision has on learning. These results show that human-model comparisons must be drawn under the right conditions, emphasizing interactions among supervision, feature structure, and alignment rather than overall accuracy.

[173] Flexible Geometric Guidance for Probabilistic Human Pose Estimation with Diffusion Models

Francis Snelgar, Ming Xu, Stephen Gould, Liang Zheng, Akshay Asthana

Main category: cs.CV

TL;DR: Diffusion-based framework for 3D human pose estimation that generates multiple plausible poses from 2D images without requiring paired 2D-3D training data

Details

Motivation: Traditional 3D pose estimation methods assume deterministic mapping and require large paired datasets, but the problem is inherently underdetermined with multiple plausible solutions. Need for probabilistic approaches that don't rely on paired data.

Method: Uses diffusion models with guidance framework: unconditional diffusion model trained only on 3D pose data, guided by gradients from 2D keypoint detector heatmaps to generate poses consistent with 2D images

Result: State-of-the-art on Human 3.6M under best-of-m evaluation for methods without paired data; competitive on MPI-INF-3DHP and 3DPW; demonstrates flexibility for pose generation and completion tasks

Conclusion: Diffusion models provide effective probabilistic framework for 3D pose estimation without paired data, handling ambiguity and enabling novel applications like pose generation/completion

Abstract: 3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. Because of these challenges the task is underdetermined, where there exists multiple – possibly infinite – poses that are plausible given the image. Despite this, many prior works assume the existence of a deterministic mapping and estimate a single pose given an image. Furthermore, methods based on machine learning require a large amount of paired 2D-3D data to train and suffer from generalization issues to unseen scenarios. To address both of these issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses which are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation, and guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human 3.6M dataset under best-of-$m$ multiple hypothesis evaluation, showing state-of-the-art performance among methods which do not require paired 2D-3D data for training. We additionally evaluate the generalization ability using the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by using it for novel tasks including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at https://github.com/fsnelgar/diffusion_pose .

[174] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Chenxi Zhang, Ziliang Gan, Liyun Zhu, Youwei Pang, Qing Zhang, Rongjunchen Zhang

Main category: cs.CV

TL;DR: FinMTM is a multi-turn multimodal benchmark for evaluating vision-language models on financial data with diverse chart types and complex reasoning tasks.

Details

Motivation: Existing financial benchmarks for VLMs are limited to single-turn interactions and narrow question formats, failing to capture realistic application scenarios that require complex, multi-turn reasoning with specialized financial visuals.

Method: Created FinMTM benchmark with 11,133 bilingual QA pairs grounded in financial visuals (candlestick charts, statistical plots, report figures). Covers multiple task types: single/multiple-choice questions, multi-turn dialogues, and agent-based tasks. Developed specialized evaluation protocols for each task type.

Result: Evaluation of 22 VLMs revealed significant limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows, demonstrating the need for improved multimodal reasoning capabilities in financial domains.

Conclusion: FinMTM provides a comprehensive benchmark for evaluating VLMs on realistic financial tasks, highlighting current model limitations and establishing a foundation for future improvements in multimodal financial reasoning.

Abstract: The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11{,}133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveal their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.

[175] SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

Chen Qian, Xinran Yu, Danyang Li, Guoxuan Chi, Zheng Yang, Qiang Ma, Xin Miao

Main category: cs.CV

TL;DR: SwiftVLM introduces a bypass pruning paradigm that preserves unselected visual tokens for re-evaluation in later layers, addressing premature pruning issues in vision-language models to maintain fine-grained visual details.

Details

Motivation: Existing visual token pruning methods for VLMs rely on early pruning decisions that work well for coarse tasks but degrade performance on fine-grained visual reasoning tasks. Premature pruning causes irreversible loss of important visual information that becomes relevant in later layers for text-conditioned reasoning.

Method: SwiftVLM introduces a bypass pruning paradigm where unselected visual tokens are preserved and forwarded to subsequent pruning stages for re-evaluation. It performs pruning at model-specific layers with strong visual token selection capability, enabling independent pruning decisions across layers without requiring training.

Result: Experiments across multiple VLMs and benchmarks show SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior compared to methods that make irreversible early pruning decisions.

Conclusion: The bypass paradigm effectively addresses the limitations of early pruning by preserving potentially important visual tokens for later re-evaluation, enabling VLMs to maintain fine-grained visual understanding while reducing computational costs.

Abstract: Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.

[176] FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion

Chen-Bin Feng, Youyang Sha, Longfei Liu, Yongjun Yu, Chi Man Vong, Xuanlong Yu, Xi Shen

Main category: cs.CV

TL;DR: FSOD-VFM is a training-free framework for few-shot object detection that combines universal proposal networks, SAM2 for masks, and DINOv2 features with graph-based confidence reweighting to address overfragmentation issues in foundation model proposals.

Details

Motivation: Vision foundation models have strong generalization but generate fragmented bounding boxes in few-shot object detection, covering partial objects and producing many false positives rather than complete detections.

Method: Integrates universal proposal network (UPN) for category-agnostic boxes, SAM2 for mask extraction, and DINOv2 features. Introduces graph-based confidence reweighting where bounding boxes are nodes in a directed graph, using graph diffusion to propagate confidence scores and refine proposals.

Result: Substantially outperforms existing approaches on Pascal-5^i, COCO-20^i, and CD-FSOD datasets. On challenging CD-FSOD dataset, achieves 31.6 AP in 10-shot setting vs. 21.4 AP for previous training-free methods.

Conclusion: FSOD-VFM effectively addresses fragmentation issues in foundation model proposals through graph-based confidence reweighting, achieving state-of-the-art few-shot object detection without additional training.

Abstract: In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training. Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM.

[177] Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis

Tianhe Wu, Ruibin Li, Lei Zhang, Kede Ma

Main category: cs.CV

TL;DR: DP-DMD: A role-separated distillation framework that preserves diversity in diffusion model distillation by dedicating the first step to diversity preservation and subsequent steps to quality refinement.

Details

Motivation: Distribution matching distillation (DMD) suffers from mode collapse due to its reverse-KL formulation's mode-seeking behavior. Existing remedies use perceptual or adversarial regularization, which incur computational overhead and training instability.

Method: Proposes Diversity-Preserved DMD (DP-DMD) with role-separated distillation: first step uses target-prediction objective (e.g., v-prediction) to preserve diversity, subsequent steps use standard DMD loss for quality refinement, with gradients from DMD blocked at first step.

Result: Preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments, without requiring perceptual backbones, discriminators, auxiliary networks, or additional ground-truth images.

Conclusion: DP-DMD provides a simple yet effective solution to mode collapse in diffusion model distillation, achieving diversity preservation and quality maintenance without complex regularization schemes.

Abstract: Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity – no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images – preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.

[178] Fully Kolmogorov-Arnold Deep Model in Medical Image Segmentation

Xingyu Qiu, Xinghua Ma, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li

Main category: cs.CV

TL;DR: This paper introduces ALL U-KAN, the first fully Kolmogorov-Arnold (KA) based deep model that replaces traditional FC and Conv layers with KA-based layers, overcoming training difficulties and memory limitations of deeply stacked KANs through Share-activation KAN and Grad-Free Spline innovations.

Details

Motivation: Existing KANs (Kolmogorov-Arnold Networks) face severe limitations: deeply stacked KANs are practically impossible due to high training difficulties and substantial memory requirements, forcing researchers to use only few KAN layers and preventing comprehensive exploration of KAN architectures.

Method: Three key innovations: (1) Share-activation KAN (SaKAN) reformulates Sprecher’s variant of Kolmogorov-Arnold representation theorem for better optimization; (2) Grad-Free Spline eliminates spline gradients that consume huge GPU memory while contributing negligibly to training; (3) ALL U-KAN implements the first fully KA-based deep model with KA and KAonv layers replacing FC and Conv layers.

Result: Extensive evaluations on three medical image segmentation tasks show superiority over partial KA-based and traditional architectures. Compared to directly deeply stacked KAN, ALL U-KAN achieves 10× parameter reduction and >20× memory consumption reduction while maintaining higher segmentation accuracy.

Conclusion: The paper successfully demonstrates that KA-based layers can entirely replace traditional architectures in deep learning, achieving superior learning capacity while overcoming previous limitations, unlocking new explorations into deep KAN architectures.

Abstract: Deeply stacked KANs are practically impossible due to high training difficulties and substantial memory requirements. Consequently, existing studies can only incorporate few KAN layers, hindering the comprehensive exploration of KANs. This study overcomes these limitations and introduces the first fully KA-based deep model, demonstrating that KA-based layers can entirely replace traditional architectures in deep learning and achieve superior learning capacity. Specifically, (1) the proposed Share-activation KAN (SaKAN) reformulates Sprecher’s variant of Kolmogorov-Arnold representation theorem, which achieves better optimization due to its simplified parameterization and denser training samples, to ease training difficulty, (2) this paper indicates that spline gradients contribute negligibly to training while consuming huge GPU memory, thus proposes the Grad-Free Spline to significantly reduce memory usage and computational overhead. (3) Building on these two innovations, our ALL U-KAN is the first representative implementation of fully KA-based deep model, where the proposed KA and KAonv layers completely replace FC and Conv layers. Extensive evaluations on three medical image segmentation tasks confirm the superiority of the full KA-based architecture compared to partial KA-based and traditional architectures, achieving all higher segmentation accuracy. Compared to directly deeply stacked KAN, ALL U-KAN achieves 10 times reduction in parameter count and reduces memory consumption by more than 20 times, unlocking the new explorations into deep KAN architectures.

[179] Human-in-the-loop Adaptation in Group Activity Feature Learning for Team Sports Video Retrieval

Chihiro Nakatani, Hiroaki Kawashima, Norimichi Ukita

Main category: cs.CV

TL;DR: Human-in-the-loop adaptation for group activity feature learning without group activity annotations, improving video retrieval performance through interactive fine-tuning with contrastive learning.

Details

Motivation: Existing group activity recognition methods require supervised learning with pre-defined activity classes, which is limited and inflexible. The paper aims to develop a more flexible retrieval system that can adapt to user preferences without needing group activity annotations.

Method: 1) Self-supervised pre-training of Group Activity Feature (GAF) space based on activity similarity. 2) Human-in-the-loop interactive fine-tuning where users label selected videos as positive/negative. 3) Data-efficient video selection process to choose informative videos for labeling. 4) Contrastive learning to update GAF space by moving positive videos closer and negative videos farther from queries.

Result: Significant improvement in retrieval performance on two team sports datasets. Ablation studies show that various components of the human-in-the-loop adaptation contribute to performance gains.

Conclusion: The proposed human-in-the-loop adaptation enables effective group activity video retrieval without requiring group activity annotations, offering a flexible and adaptive approach that can be tailored to user preferences.

Abstract: This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.

[180] BinaryDemoire: Moiré-Aware Binarization for Image Demoiréing

Zheng Chen, Zhi Yang, Xiaoyang Liu, Weihang Zhang, Mengfan Wang, Yifan Fu, Linghe Kong, Yulun Zhang

Main category: cs.CV

TL;DR: BinaryDemoire: A binarized image demoiréing framework using moiré-aware binary gates and shuffle-grouped residual adapters for efficient moiré artifact removal.

Details

Motivation: Image demoiréing requires removing structured moiré artifacts that are frequency-dependent and vary across scales/directions. While deep networks achieve good restoration, they're computationally expensive for deployment. Binarization offers extreme compression but performs poorly when naively applied to demoiréing tasks.

Method: Proposes BinaryDemoire with two key components: 1) Moiré-aware binary gate (MABG) extracts lightweight frequency descriptors and activation statistics to predict channel-wise gating coefficients for binary convolution responses. 2) Shuffle-grouped residual adapter (SGRA) performs structured sparse shortcut alignment with interleaved mixing for cross-channel information exchange.

Result: Extensive experiments on four benchmarks demonstrate BinaryDemoire surpasses current binarization methods for demoiréing tasks.

Conclusion: BinaryDemoire effectively addresses the challenge of binarized demoiréing by explicitly accommodating the frequency structure of moiré degradations, offering an efficient solution for deployment.

Abstract: Image demoiréing aims to remove structured moiré artifacts in recaptured imagery, where degradations are highly frequency-dependent and vary across scales and directions. While recent deep networks achieve high-quality restoration, their full-precision designs remain costly for deployment. Binarization offers an extreme compression regime by quantizing both activations and weights to 1-bit. Yet, it has been rarely studied for demoiréing and performs poorly when naively applied. In this work, we propose BinaryDemoire, a binarized demoiréing framework that explicitly accommodates the frequency structure of moiré degradations. First, we introduce a moiré-aware binary gate (MABG) that extracts lightweight frequency descriptors together with activation statistics. It predicts channel-wise gating coefficients to condition the aggregation of binary convolution responses. Second, we design a shuffle-grouped residual adapter (SGRA) that performs structured sparse shortcut alignment. It further integrates interleaved mixing to promote information exchange across different channel partitions. Extensive experiments on four benchmarks demonstrate that the proposed BinaryDemoire surpasses current binarization methods. Code: https://github.com/zhengchen1999/BinaryDemoire.

[181] LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution

Tianxing Wu, Zheng Chen, Cirou Xu, Bowen Chai, Yong Guo, Yutong Liu, Linghe Kong, Yulun Zhang

Main category: cs.CV

TL;DR: LSGQuant: Layer-sensitivity guided quantization for one-step diffusion-based real-world video super-resolution, addressing challenges of high dynamic range and diverse layer behaviors through adaptive quantization and optimization techniques.

Details

Motivation: One-step diffusion models show promise for video super-resolution but have large model sizes and high computational costs. While quantization can help compress models, existing methods struggle with the high dynamic range of input latents and diverse layer behaviors in diffusion transformers.

Method: Proposes LSGQuant with three key components: 1) Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations, 2) Variance-Oriented Layer Training Strategy (VOLTS) based on layer sensitivity analysis, and 3) Quantization-Aware Optimization (QAO) to jointly refine quantized and high-precision branches.

Result: Extensive experiments show the method achieves nearly the same performance as the original full-precision model and significantly outperforms existing quantization techniques for video super-resolution.

Conclusion: LSGQuant effectively compresses one-step diffusion models for video super-resolution while maintaining performance, addressing key challenges in quantizing diffusion transformers through adaptive quantization and layer-aware optimization.

Abstract: One-Step Diffusion Models have demonstrated promising capability and fast inference in video super-resolution (VSR) for real-world. Nevertheless, the substantial model size and high computational cost of Diffusion Transformers (DiTs) limit downstream applications. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latent and diverse layer behaviors. To deal with these challenges, we introduce LSGQuant, a layer-sensitivity guided quantizing approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity and implement a Variance-Oriented Layer Training Strategy (VOLTS) by analyzing layer-wise statistics in calibration. We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our method has nearly performance to origin model with full-precision and significantly exceeds existing quantization techniques. Code is available at: https://github.com/zhengchen1999/LSGQuant.

[182] From Single Scan to Sequential Consistency: A New Paradigm for LIDAR Relocalization

Minghang Zhu, Zhijing Wang, Yuxin Guo, Wen Li, Sheng Ao, Cheng Wang

Main category: cs.CV

TL;DR: TempLoc: A LiDAR relocalization framework that uses temporal consistency and uncertainty-guided fusion to improve 6-DoF pose estimation by modeling sequential scans rather than single frames.

Details

Motivation: Existing LiDAR relocalization methods are prone to errors in dynamic or ambiguous scenarios because they either rely on single-frame inference or ignore spatio-temporal consistency across sequential scans.

Method: Three-module approach: 1) Global Coordinate Estimation predicts point-wise global coordinates with uncertainties for each scan, 2) Prior Coordinate Generation estimates inter-frame point correspondences using attention mechanism, 3) Uncertainty-Guided Coordinate Fusion integrates both predictions end-to-end for temporally consistent 6-DoF pose estimation.

Result: TempLoc outperforms state-of-the-art methods by a large margin on NCLT and Oxford Robot-Car benchmarks, demonstrating effectiveness of temporal-aware correspondence modeling.

Conclusion: Modeling sequential consistency through attention-based correspondence estimation and uncertainty-guided fusion significantly improves LiDAR relocalization robustness in challenging scenarios.

Abstract: LiDAR relocalization aims to estimate the global 6-DoF pose of a sensor in the environment. However, existing regression-based approaches are prone to dynamic or ambiguous scenarios, as they either solely rely on single-frame inference or neglect the spatio-temporal consistency across scans. In this paper, we propose TempLoc, a new LiDAR relocalization framework that enhances the robustness of localization by effectively modeling sequential consistency. Specifically, a Global Coordinate Estimation module is first introduced to predict point-wise global coordinates and associated uncertainties for each LiDAR scan. A Prior Coordinate Generation module is then presented to estimate inter-frame point correspondences by the attention mechanism. Lastly, an Uncertainty-Guided Coordinate Fusion module is deployed to integrate both predictions of point correspondence in an end-to-end fashion, yielding a more temporally consistent and accurate global 6-DoF pose. Experimental results on the NCLT and Oxford Robot-Car benchmarks show that our TempLoc outperforms stateof-the-art methods by a large margin, demonstrating the effectiveness of temporal-aware correspondence modeling in LiDAR relocalization. Our code will be released soon.

[183] Hand3R: Online 4D Hand-Scene Reconstruction in the Wild

Wendi Hu, Haonan Zhou, Wenhao Hu, Gaoang Wang

Main category: cs.CV

TL;DR: Hand3R: First online framework for joint 4D hand-scene reconstruction from monocular video, combining hand priors with scene memory for simultaneous reconstruction of hands and dense scene geometry.

Details

Motivation: Existing methods for embodied AI reconstruct isolated hands in local coordinates, overlooking the surrounding 3D environment, which is crucial for understanding physical interaction in real-world scenes.

Method: Synergizes pre-trained hand expert with 4D scene foundation model via scene-aware visual prompting mechanism; injects high-fidelity hand priors into persistent scene memory for simultaneous reconstruction in single forward pass.

Result: Bypasses reliance on offline optimization, delivers competitive performance in both local hand reconstruction and global positioning, enabling accurate hand meshes and dense metric-scale scene geometry.

Conclusion: Hand3R enables joint 4D hand-scene reconstruction from monocular video, addressing the gap between isolated hand reconstruction and full scene understanding for embodied AI applications.

Abstract: For Embodied AI, jointly reconstructing dynamic hands and the dense scene context is crucial for understanding physical interaction. However, most existing methods recover isolated hands in local coordinates, overlooking the surrounding 3D environment. To address this, we present Hand3R, the first online framework for joint 4D hand-scene reconstruction from monocular video. Hand3R synergizes a pre-trained hand expert with a 4D scene foundation model via a scene-aware visual prompting mechanism. By injecting high-fidelity hand priors into a persistent scene memory, our approach enables simultaneous reconstruction of accurate hand meshes and dense metric-scale scene geometry in a single forward pass. Experiments demonstrate that Hand3R bypasses the reliance on offline optimization and delivers competitive performance in both local hand reconstruction and global positioning.

[184] VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Zhiwen Li, Zhongjie Duan, Jinyan Ye, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

Main category: cs.CV

TL;DR: VIRAL is a framework that enables in-context learning for diverse visual tasks by formulating them as conditional generation via visual analogies using a frozen Diffusion Transformer with role-aware conditioning and Mixture-of-Experts LoRA.

Details

Motivation: In-context learning (ICL) has been successful in NLP but remains challenging in computer vision due to task heterogeneity. Current approaches struggle with diverse visual tasks like perception, restoration, and editing within a unified framework.

Method: Formulates ICL as conditional generation via visual analogy (x_s:x_t::x_q:y_q). Uses a frozen Diffusion Transformer (DiT) with role-aware multi-image conditioning and introduces Mixture-of-Experts LoRA to mitigate gradient interference across tasks. Also curates a large-scale dataset spanning perception, restoration, and editing tasks.

Result: VIRAL outperforms existing methods and demonstrates that a unified visual ICL paradigm can handle the majority of visual tasks, including open-domain editing.

Conclusion: The proposed VIRAL framework successfully enables in-context learning for diverse visual tasks through visual analogies, showing that a unified approach can effectively handle perception, restoration, and editing tasks.

Abstract: Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose \textbf{VIRAL}, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A

[185] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Zhuoran Yang, Yanyong Zhang

Main category: cs.CV

TL;DR: ConsisDrive is an identity-preserving driving world model that addresses identity drift in generated driving videos by enforcing instance-level temporal consistency through instance-masked attention and loss mechanisms.

Details

Motivation: Current world models for autonomous driving suffer from identity drift - where objects change appearance or category across frames due to lack of instance-level temporal constraints, compromising the quality and usefulness of generated driving data.

Method: Two key components: (1) Instance-Masked Attention applies instance identity masks and trajectory masks in attention blocks to ensure visual tokens interact only with corresponding instance features across space and time; (2) Instance-Masked Loss adaptively emphasizes foreground regions with probabilistic instance masking to reduce background noise while maintaining scene fidelity.

Result: Achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.

Conclusion: ConsisDrive effectively addresses identity drift in driving world models through instance-level temporal consistency mechanisms, improving both generation quality and downstream task performance.

Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.

[186] FARTrack: Fast Autoregressive Visual Tracking with High Performance

Guijie Wang, Tong Lin, Yifan Bai, Anjia Cao, Shiyi Liang, Wangbo Zhao, Xing Wei

Main category: cs.CV

TL;DR: FARTrack is a fast autoregressive tracking framework that achieves high-speed visual tracking through task-specific self-distillation and inter-frame autoregressive sparsification, enabling real-time performance on both GPU and CPU.

Details

Motivation: The paper addresses the trade-off between tracking performance and inference speed in visual tracking, where high-performance trackers are often too slow for practical deployment on resource-constrained devices. The authors aim to develop a framework that maintains competitive tracking accuracy while achieving real-time speeds.

Method: FARTrack introduces two key techniques: 1) Task-Specific Self-Distillation - compresses the model by distilling task-specific tokens layer by layer, avoiding suboptimal manual teacher-student layer assignments and enhancing inference speed; 2) Inter-frame Autoregressive Sparsification - sequentially condenses multiple templates to learn a temporally-global optimal sparsification strategy without additional runtime overhead.

Result: FARTrack achieves an AO of 70.6% on GOT-10k benchmark in real-time. The fastest model reaches 343 FPS on GPU and 121 FPS on CPU, demonstrating outstanding speed while maintaining competitive tracking performance.

Conclusion: The proposed FARTrack framework successfully addresses the speed-performance trade-off in visual tracking through autoregressive design and novel optimization techniques, making high-performance tracking practical for deployment on resource-constrained devices.

Abstract: Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model’s inference speed while avoiding suboptimal manual teacher-student layer pairs assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6% on GOT-10k in real-time. Beyond, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.

[187] PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation

Jingbang Tang

Main category: cs.CV

TL;DR: PokeFusion Attention is a lightweight decoder-level cross-attention mechanism for reference-free style-conditioned character generation in text-to-image diffusion models, improving style fidelity and consistency without modifying the pretrained backbone.

Details

Motivation: Existing text-to-image diffusion models struggle with style-conditioned character generation due to text-only prompting being under-specified for visual style (causing style drift and geometric inconsistency) or requiring reference-based adapters that increase complexity and limit deployment flexibility.

Method: Proposes PokeFusion Attention, a decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. It decouples text and style conditioning at the attention level, keeping the pretrained diffusion backbone frozen while training only decoder cross-attention layers and a compact style projection module.

Result: Experiments on a Pokemon-style character generation benchmark show consistent improvements in style fidelity, semantic alignment, and character shape consistency compared to adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.

Conclusion: PokeFusion Attention provides a parameter-efficient, plug-and-play control component for reference-free stylized generation that can be easily integrated into existing diffusion pipelines and transferred across different backbones.

Abstract: This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment flexibility.We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully frozen.PokeFusion Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different backbones.Experiments on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.

[188] Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane

Haoyu Liu, Sucheng Ren, Tingyu Zhu, Peng Wang, Cihang Xie, Alan Yuille, Zeyu Zheng, Feng Wang

Main category: cs.CV

TL;DR: Spiral RoPE extends standard axial 2D Rotary Position Embedding to enable multi-directional positional encoding in vision transformers, overcoming limitations of axis-aligned direction constraints.

Details

Motivation: Standard axial 2D RoPE in vision transformers decomposes spatial positions into horizontal/vertical components, restricting positional encoding to axis-aligned directions and hindering modeling of oblique spatial relationships that naturally exist in images.

Method: Proposes Spiral RoPE which partitions embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of patch position onto its corresponding direction, enabling spatial relationships beyond horizontal/vertical axes.

Result: Consistent performance improvements across vision tasks including classification, segmentation, and generation. Qualitative analysis shows more concentrated attention activations on semantically relevant objects and better respect for local object boundaries.

Conclusion: Spiral RoPE demonstrates the importance of multi-directional positional encoding in vision transformers, overcoming fundamental limitations of standard axial 2D RoPE for better modeling of spatial relationships in images.

Abstract: Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analysis of attention maps further show that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.

[189] EventFlash: Towards Efficient MLLMs for Event-Based Vision

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Wen Jiang, Ming Li, Xiangyang Ji

Main category: cs.CV

TL;DR: EventFlash is an efficient multimodal large language model for event-based vision that reduces computational costs through spatiotemporal token sparsification while maintaining performance.

Details

Motivation: Current event-based MLLMs use dense image-like processing that ignores the inherent spatiotemporal sparsity of event streams, leading to high computational costs. There's a need for more efficient processing that leverages the sparse nature of event data.

Method: 1) Created EventMind dataset with 500k+ instruction sets for curriculum training; 2) Adaptive temporal window aggregation module to compress temporal tokens while preserving key cues; 3) Sparse density-guided attention module to select informative spatial regions and suppress sparse areas.

Result: Achieves 12.4× throughput improvement over baseline (EventFlash-Zero) while maintaining comparable performance. Supports long-range event stream processing with up to 1,000 bins, significantly outperforming EventGPT’s 5-bin limit.

Conclusion: EventFlash serves as an efficient foundation model for event-based vision by effectively leveraging spatiotemporal sparsity to reduce redundancy and accelerate inference while maintaining robust perception capabilities.

Abstract: Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.

[190] InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

Zhuoran Yang, Xi Guo, Chenjing Ding, Chiyu Wang, Wei Wu, Yanyong Zhang

Main category: cs.CV

TL;DR: InstaDrive: A framework for generating realistic driving videos with instance-level temporal consistency and spatial geometric fidelity using instance flow guidance and spatial alignment.

Details

Motivation: Current world models for autonomous driving video generation struggle with maintaining instance-level temporal consistency (preserving object identity over time) and spatial geometric fidelity (accurate positioning and occlusion handling), which are crucial for realistic training data and safety evaluation.

Method: Two key components: (1) Instance Flow Guider extracts and propagates instance features across frames to enforce temporal consistency, and (2) Spatial Geometric Aligner improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. Uses CARLA’s autopilot for simulating rare safety-critical scenarios.

Result: Achieves state-of-the-art video generation quality on nuScenes dataset and enhances downstream autonomous driving tasks. Enables rigorous safety evaluation through procedural simulation of rare driving scenarios.

Conclusion: InstaDrive addresses key limitations in driving video generation by introducing instance-aware mechanisms that improve both temporal consistency and spatial geometric fidelity, leading to more realistic training data and better safety evaluation capabilities.

Abstract: Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA’s autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.

[191] LaVPR: Benchmarking Language and Vision for Place Recognition

Ofer Idan, Dan Badur, Yosi Keller, Yoli Shavit

Main category: cs.CV

TL;DR: LaVPR introduces a large-scale benchmark with 650K+ natural language descriptions for visual place recognition, enabling both multi-modal fusion for robustness and cross-modal retrieval for language-based localization.

Details

Motivation: Standard VPR systems fail under extreme environmental changes and perceptual aliasing, and cannot perform "blind" localization from verbal descriptions alone, which is needed for applications like emergency response.

Method: Extends existing VPR datasets with natural language descriptions, investigates multi-modal fusion for robustness and cross-modal retrieval using Low-Rank Adaptation (LoRA) and Multi-Similarity loss.

Result: Language descriptions yield consistent gains in visually degraded conditions, with most significant impact on smaller backbones. Compact models with language can rival larger vision-only architectures. Cross-modal retrieval baseline substantially outperforms standard contrastive methods.

Conclusion: LaVPR enables a new class of localization systems that are resilient to real-world stochasticity and practical for resource-constrained deployment.

Abstract: Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform “blind” localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.

[192] Global Geometry Is Not Enough for Vision Representations

Jiwan Chung, Seon Joo Kim

Main category: cs.CV

TL;DR: Standard global geometry metrics fail to predict compositional binding in vision encoders, while functional sensitivity (input-output Jacobian) reliably tracks this capability, revealing limitations of current evaluation protocols.

Details

Motivation: The paper challenges the common assumption that globally well-distributed embeddings are sufficient for robust representations, noting that while global geometry encodes what elements are present, it often fails to capture how they are composed together.

Method: Tested 21 vision encoders to compare geometric metrics against compositional binding capabilities. Used standard geometry-based statistics and functional sensitivity measured by input-output Jacobian. Provided analytic analysis showing how objective designs constrain embedding geometry but leave local input-output mapping unconstrained.

Result: Found near-zero correlation between standard geometry-based statistics and compositional binding. In contrast, functional sensitivity reliably tracked compositional binding capability. Analytic analysis confirmed this disparity stems from how existing losses explicitly constrain embedding geometry while leaving functional sensitivity unconstrained.

Conclusion: Global embedding geometry captures only a partial view of representational competence. Functional sensitivity serves as a critical complementary axis for modeling composite structure, suggesting the need to move beyond geometry-focused evaluation protocols.

Abstract: A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across 21 vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

[193] A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation

Jianghao Wu, Xiangde Luo, Yubo Zhou, Lianming Wu, Guotai Wang, Shaoting Zhang

Main category: cs.CV

TL;DR: A3-TTA: Anchor-guided pseudo-labeling framework for test-time adaptation in image segmentation that uses confident predictions as anchors to stabilize adaptation and prevent catastrophic forgetting.

Details

Motivation: Existing pseudo-label-based TTA methods for image segmentation rely on heuristic perturbation ensembles that lack distributional grounding, causing unstable training signals, error accumulation, and catastrophic forgetting during adaptation.

Method: Proposes A3-TTA framework that identifies well-predicted target domain images using class compact density metric as anchors, uses them to guide pseudo-label generation, and regularizes via semantic consistency and boundary-aware entropy minimization with self-adaptive exponential moving average for noise mitigation.

Result: Significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to source model, outperforms state-of-the-art TTA methods across medical and natural images, and excels in continual TTA with strong anti-forgetting ability.

Conclusion: A3-TTA provides an effective anchor-guided pseudo-labeling approach for test-time adaptation in image segmentation that stabilizes adaptation, prevents catastrophic forgetting, and works well across different segmentation architectures and domains.

Abstract: Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose \textbf{A3-TTA}, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA.

[194] Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases

Jinze Zhang, Jian Zhong, Li Lin, Jiaxiong Li, Ke Ma, Naiyang Li, Meng Li, Yuan Pan, Zeyu Meng, Mengyun Zhou, Shang Huang, Shilong Yu, Zhengyu Duan, Sutong Li, Honghui Xia, Juping Liu, Dan Liang, Yantao Wei, Xiaoying Tang, Jin Yuan, Peng Xiao

Main category: cs.CV

TL;DR: FOCUS is a foundation model-driven framework for end-to-end automation of 3D OCT retinal disease diagnosis, integrating image quality assessment, abnormality detection, and multi-disease classification with adaptive 2D-to-3D aggregation.

Details

Motivation: Current OCT diagnostic automation is limited by multi-stage workflows and single-slice single-task AI models, preventing full clinical automation despite OCT's high-resolution 3D imaging capabilities.

Method: Uses EfficientNetV2-S for image quality assessment, fine-tuned Vision Foundation Model for abnormality detection and multi-disease classification, with unified adaptive aggregation to integrate 2D slice predictions into 3D patient-level diagnosis.

Result: Achieved high F1 scores: 99.01% for quality assessment, 97.46% for abnormality detection, 94.39% for patient-level diagnosis; matched expert performance in human-machine comparisons with better efficiency.

Conclusion: FOCUS automates the image-to-diagnosis pipeline, advancing unmanned ophthalmology with a validated blueprint for autonomous screening to enhance retinal care accessibility and efficiency at population scale.

Abstract: Optical coherence tomography (OCT) has revolutionized retinal disease diagnosis with its high-resolution and three-dimensional imaging nature, yet its full diagnostic automation in clinical practices remains constrained by multi-stage workflows and conventional single-slice single-task AI models. We present Full-process OCT-based Clinical Utility System (FOCUS), a foundation model-driven framework enabling end-to-end automation of 3D OCT retinal disease diagnosis. FOCUS sequentially performs image quality assessment with EfficientNetV2-S, followed by abnormality detection and multi-disease classification using a fine-tuned Vision Foundation Model. Crucially, FOCUS leverages a unified adaptive aggregation method to intelligently integrate 2D slices-level predictions into comprehensive 3D patient-level diagnosis. Trained and tested on 3,300 patients (40,672 slices), and externally validated on 1,345 patients (18,498 slices) across four different-tier centers and diverse OCT devices, FOCUS achieved high F1 scores for quality assessment (99.01%), abnormally detection (97.46%), and patient-level diagnosis (94.39%). Real-world validation across centers also showed stable performance (F1: 90.22%-95.24%). In human-machine comparisons, FOCUS matched expert performance in abnormality detection (F1: 95.47% vs 90.91%) and multi-disease diagnosis (F1: 93.49% vs 91.35%), while demonstrating better efficiency. FOCUS automates the image-to-diagnosis pipeline, representing a critical advance towards unmanned ophthalmology with a validated blueprint for autonomous screening to enhance population scale retinal care accessibility and efficiency.

[195] PQTNet: Pixel-wise Quantitative Thermography Neural Network for Estimating Defect Depth in Polylactic Acid Parts by Additive Manufacturing

Lei Deng, Wenhao Huang, Chao Yang, Haoyuan Zheng, Yinbin Tian, Yue Ma

Main category: cs.CV

TL;DR: PQT-Net uses thermal imaging and deep learning for precise defect depth quantification in 3D-printed PLA parts, achieving sub-0.01mm accuracy.

Details

Motivation: Defect depth measurement in additively manufactured components is challenging for non-destructive testing. Current methods lack precision for quantitative defect characterization in 3D-printed parts.

Method: Proposes Pixel-wise Quantitative Thermography Neural Network (PQT-Net) with novel data augmentation converting thermal sequences to 2D stripe images preserving temporal heat diffusion. Uses pre-trained EfficientNetV2-S backbone with custom Residual Regression Head for output refinement.

Result: Achieves minimum Mean Absolute Error of 0.0094 mm and coefficient of determination exceeding 99%, outperforming other deep learning models for defect depth quantification.

Conclusion: PQT-Net demonstrates high precision for robust quantitative defect characterization in additive manufacturing, showing potential for industrial non-destructive testing applications.

Abstract: Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination (R) exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.

[196] Invisible Clean-Label Backdoor Attacks for Generative Data Augmentation

Ting Xiang, Jinhui Zhao, Changjian Chen, Zhuo Tang

Main category: cs.CV

TL;DR: InvLBA: An invisible clean-label backdoor attack method for generative data augmentation using latent perturbation instead of pixel-level triggers, achieving high attack success rates while maintaining clean accuracy.

Details

Motivation: Existing pixel-level clean-label backdoor attacks (like COMBAT) perform poorly on generated images from generative data augmentation. The authors observe low attack success rates when applying these methods to generated images, motivating a shift from pixel-level to latent feature-level attacks.

Method: Proposes InvLBA (Invisible Latent Backdoor Attack) that operates at the latent feature level rather than pixel level. The method uses latent perturbation for clean-label backdoor attacks in generative data augmentation settings, with theoretical guarantees on generalization of clean accuracy and attack success rates.

Result: Experiments on multiple datasets show InvLBA improves attack success rate by 46.43% on average compared to existing methods, with almost no reduction in clean accuracy and high robustness against state-of-the-art defense methods.

Conclusion: Latent-level attacks are more effective than pixel-level attacks for backdoor attacks in generative data augmentation scenarios, and InvLBA provides a theoretically-grounded, practical approach with strong performance and robustness.

Abstract: With the rapid advancement of image generative models, generative data augmentation has become an effective way to enrich training images, especially when only small-scale datasets are available. At the same time, in practical applications, generative data augmentation can be vulnerable to clean-label backdoor attacks, which aim to bypass human inspection. However, based on theoretical analysis and preliminary experiments, we observe that directly applying existing pixel-level clean-label backdoor attack methods (e.g., COMBAT) to generated images results in low attack success rates. This motivates us to move beyond pixel-level triggers and focus instead on the latent feature level. To this end, we propose InvLBA, an invisible clean-label backdoor attack method for generative data augmentation by latent perturbation. We theoretically prove that the generalization of the clean accuracy and attack success rates of InvLBA can be guaranteed. Experiments on multiple datasets show that our method improves the attack success rate by 46.43% on average, with almost no reduction in clean accuracy and high robustness against SOTA defense methods.

[197] MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning

Shengyuan Liu, Liuxin Bao, Qi Yang, Wanting Geng, Boyun Zheng, Chenxin Li, Wenting Chen, Houwen Peng, Yixuan Yuan

Main category: cs.CV

TL;DR: MedSAM-Agent: A multi-step autonomous decision-making framework for medical image segmentation using MLLMs as agents with hybrid prompting and two-stage training for efficient interactive segmentation.

Details

Motivation: Current MLLM-based medical segmentation approaches use single-turn, rigid interactions and lack process-level supervision, limiting their ability to fully exploit interactive tools and leading to redundant actions.

Method: Proposes MedSAM-Agent framework with: 1) Hybrid prompting strategy for expert-curated trajectory generation to internalize human-like decision heuristics, and 2) Two-stage training pipeline integrating multi-turn end-to-end outcome verification with clinical-fidelity process reward design.

Result: Extensive experiments across 6 medical modalities and 21 datasets demonstrate state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization.

Conclusion: MedSAM-Agent successfully bridges the gap in interactive medical segmentation by enabling multi-step autonomous decision-making with efficient interaction strategies and process-level supervision.

Abstract: Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available \href{https://github.com/CUHK-AIM-Group/MedSAM-Agent}{here}.

[198] PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets

Haoran Li, Renyang Liu, Hongjia Liu, Chen Wang, Long Yin, Jian Xu

Main category: cs.CV

TL;DR: PWAVEP: A plug-and-play spectral domain defense for 3D point clouds that purifies adversarial attacks by removing salient adversarial outliers and filtering high-frequency noise using graph wavelet transforms.

Details

Motivation: Current defenses against adversarial attacks on 3D point clouds are cumbersome, requiring invasive model modifications, expensive training, or auxiliary data. There's a need for a plug-and-play, non-invasive defense mechanism that can effectively purify adversarial perturbations without modifying the underlying models.

Method: Proposes PWAVEP purification framework that: 1) Computes spectral graph wavelet domain saliency and local sparsity scores for each point; 2) Hierarchically eliminates the most salient points (hard-to-recover adversarial outliers); 3) Applies spectral filtering to moderately salient points using graph wavelet transform to attenuate high-frequency coefficients associated with adversarial noise.

Result: Extensive evaluations show PWAVEP achieves superior accuracy and robustness compared to existing approaches, advancing state-of-the-art in 3D point cloud purification. The method effectively suppresses adversarial noise while maintaining point cloud integrity.

Conclusion: PWAVEP provides an effective plug-and-play defense against adversarial attacks on 3D point clouds by leveraging spectral domain analysis and graph wavelet transforms, offering a non-invasive solution that doesn’t require model modifications or expensive training procedures.

Abstract: Recent progress in adversarial attacks on 3D point clouds, particularly in achieving spatial imperceptibility and high attack performance, presents significant challenges for defenders. Current defensive approaches remain cumbersome, often requiring invasive model modifications, expensive training procedures or auxiliary data access. To address these threats, in this paper, we propose a plug-and-play and non-invasive defense mechanism in the spectral domain, grounded in a theoretical and empirical analysis of the relationship between imperceptible perturbations and high-frequency spectral components. Building upon these insights, we introduce a novel purification framework, termed PWAVEP, which begins by computing a spectral graph wavelet domain saliency score and local sparsity score for each point. Guided by these values, PWAVEP adopts a hierarchical strategy, it eliminates the most salient points, which are identified as hardly recoverable adversarial outliers. Simultaneously, it applies a spectral filtering process to a broader set of moderately salient points. This process leverages a graph wavelet transform to attenuate high-frequency coefficients associated with the targeted points, thereby effectively suppressing adversarial noise. Extensive evaluations demonstrate that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches, advancing the state-of-the-art in 3D point cloud purification. Code and datasets are available at https://github.com/a772316182/pwavep

[199] Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability

Bingchen Zhao, Qiushan Guo, Ye Wang, Yixuan Huang, Zhonghua Zhai, Yu Tian

Main category: cs.CV

TL;DR: CompTok is a training framework for learning compositional visual tokenizers using token-conditioned diffusion with InfoGAN-style objectives and token swapping for enhanced compositionality.

Details

Motivation: Current visual tokenizers often lack compositional properties, making it difficult to achieve fine-grained semantic control over image generation. The authors aim to create tokenizers that enable more compositional control over image generation through token manipulation.

Method: Uses token-conditioned diffusion decoder with InfoGAN-style objective (training recognition model to predict tokens from decoded images). Introduces token swapping between images during training to promote compositionality. Applies adversarial flow regularizer to keep unpaired swap generations on natural-image distribution. Proposes two metrics to measure token space compositionality and learnability.

Result: Achieves state-of-the-art performance on image class-conditioned generation. Enables high-level semantic editing through token swapping between images. Improves on proposed compositionality metrics and supports state-of-the-art generators.

Conclusion: CompTok successfully creates compositional visual tokenizers that enable better control over image generation through token manipulation, with applications in semantic editing and improved generation quality.

Abstract: We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, where we train a recognition model to predict the tokens used to condition the diffusion decoder using the decoded images, we enforce the decoder to not ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, enabling more compositional control of the token over the decoder. As the swapped tokens between images do not have ground truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on image class-conditioned generation, but also demonstrates properties such as swapping tokens between images to achieve high level semantic editing of an image. Additionally, we propose two metrics that measures the landscape of the token space that can be useful to describe not only the compositionality of the tokens, but also how easy to learn the landscape is for a generator to be trained on this space. We show in experiments that CompTok can improve on both of the metrics as well as supporting state-of-the-art generators for class conditioned generation.

[200] Tiled Prompts: Overcoming Prompt Underspecification in Image and Video Super-Resolution

Bryan Sangwoo Kim, Jonghyun Park, Jong Chul Ye

Main category: cs.CV

TL;DR: Tiled Prompts framework addresses prompt underspecification in diffusion-based super-resolution by generating tile-specific prompts for each latent tile, improving local guidance and reducing artifacts.

Details

Motivation: Modern super-resolution pipelines use latent tiling for high resolutions, but global captions cause prompt underspecification - missing localized details (prompt sparsity) and providing irrelevant local guidance (prompt misguidance) that's amplified by classifier-free guidance.

Method: Proposes Tiled Prompts framework that generates tile-specific prompts for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance with minimal overhead.

Result: Experiments on high-resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts compared to global-prompt baselines.

Conclusion: Tiled Prompts effectively resolves prompt underspecification in diffusion-based super-resolution by providing localized text guidance, improving both image and video super-resolution quality.

Abstract: Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.

[201] Z3D: Zero-Shot 3D Visual Grounding from Images

Nikita Drozdov, Andrey Lemeshko, Nikita Gavrilov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi

Main category: cs.CV

TL;DR: Z3D is a zero-shot 3D visual grounding method that localizes objects in 3D scenes from multi-view images without geometric supervision, using advanced instance segmentation and VLM reasoning.

Details

Motivation: To enable 3D visual grounding without requiring expensive 3D annotations or object priors, making it more accessible and flexible for real-world applications.

Method: Uses multi-view images with optional camera poses/depth, employs state-of-the-art zero-shot 3D instance segmentation for bounding box proposals, and leverages prompt-based segmentation with modern VLMs for reasoning.

Result: Achieves state-of-the-art performance among zero-shot methods on ScanRefer and Nr3D benchmarks.

Conclusion: Demonstrates effective zero-shot 3D visual grounding using only multi-view images, advancing the field toward more practical 3D scene understanding.

Abstract: 3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods causing significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at https://github.com/col14m/z3d .

[202] Symbol-Aware Reasoning with Masked Discrete Diffusion for Handwritten Mathematical Expression Recognition

Takaya Kawakatsu, Ryo Ishiyama

Main category: cs.CV

TL;DR: Discrete diffusion framework for Handwritten Mathematical Expression Recognition that reformulates the task as iterative symbolic refinement rather than sequential generation, achieving state-of-the-art performance.

Details

Motivation: Autoregressive models for HMER struggle with exposure bias and syntactic inconsistency when dealing with diverse symbols and 2D structural layouts. There's a need for a better approach that can handle the structural complexity of mathematical expressions without causal dependencies.

Method: Proposes a discrete diffusion framework that treats HMER as iterative symbolic refinement through multi-step remasking. Uses symbol-aware tokenization and Random-Masking Mutual Learning to enhance syntactic alignment and robustness to handwriting diversity.

Result: Achieves 5.56% CER and 60.42% EM on MathWriting benchmark, outperforming strong Transformer and commercial baselines. Shows consistent gains on CROHME 2014-2023 datasets.

Conclusion: Discrete diffusion provides a new paradigm for structure-aware visual recognition beyond generative modeling, offering better handling of 2D structural layouts in mathematical expressions without autoregressive limitations.

Abstract: Handwritten Mathematical Expression Recognition (HMER) requires reasoning over diverse symbols and 2D structural layouts, yet autoregressive models struggle with exposure bias and syntactic inconsistency. We present a discrete diffusion framework that reformulates HMER as iterative symbolic refinement instead of sequential generation. Through multi-step remasking, the proposal progressively refines both symbols and structural relations, removing causal dependencies and improving structural consistency. A symbol-aware tokenization and Random-Masking Mutual Learning further enhance syntactic alignment and robustness to handwriting diversity. On the MathWriting benchmark, the proposal achieves 5.56% CER and 60.42% EM, outperforming strong Transformer and commercial baselines. Consistent gains on CROHME 2014–2023 demonstrate that discrete diffusion provides a new paradigm for structure-aware visual recognition beyond generative modeling.

[203] Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion

Zhiwen Yang, Yuxin Peng

Main category: cs.CV

TL;DR: MRA approach for camera-based 3D semantic scene completion that uses multi-resolution alignment to address voxel sparsity through scene and instance-level supervision.

Details

Motivation: Camera-based 3D semantic scene completion faces optimization challenges due to voxel sparsity (many empty voxels in autonomous driving scenarios), limiting both optimization efficiency and model performance.

Method: Proposes Multi-Resolution Alignment (MRA) with three modules: 1) Multi-resolution View Transformer projects 2D image features to multi-resolution 3D features with scene-level alignment, 2) Cubic Semantic Anisotropy identifies instance-level semantic significance of voxels, and 3) Critical Distribution Alignment selects critical voxels as anchors and applies circulated loss for feature distribution consistency across resolutions.

Result: The approach mitigates voxel sparsity in camera-based 3D semantic scene completion and improves model performance through auxiliary supervision from multi-resolution alignment.

Conclusion: MRA effectively addresses voxel sparsity in camera-based 3D SSC by exploiting multi-resolution alignment at both scene and instance levels, providing better optimization and performance for autonomous driving perception systems.

Abstract: Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for the perception-prediction-planning autonomous driving systems. Although significant progress has been made in existing methods, their optimization rely solely on the supervision from voxel labels and face the challenge of voxel sparsity as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a \textit{Multi-Resolution Alignment (MRA)} approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits the scene and instance level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.

[204] SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI

Mario Pascual-González, Ariadna Jiménez-Partinen, R. M. Luque-Baena, Fátima Nagib-Raya, Ezequiel López-Rubio

Main category: cs.CV

TL;DR: SLIM-Diff is a compact joint diffusion model for generating FLAIR MRI images with focal cortical dysplasia lesions, using a shared-bottleneck U-Net and tunable Lp loss for better stability and fidelity.

Details

Motivation: Focal cortical dysplasia lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image-mask generative modeling prone to instability and memorization issues.

Method: Proposes SLIM-Diff with: (1) single shared-bottleneck U-Net enforcing tight coupling between anatomy and lesion geometry using 2-channel image+mask representation, (2) loss-geometry tuning via tunable Lp objective, comparing x0-prediction vs ε-prediction with different Lp norms.

Result: x0-prediction is consistently strongest for joint synthesis; fractional sub-quadratic penalties (L1.5) improve image fidelity while L2 better preserves lesion mask morphology.

Conclusion: SLIM-Diff provides stable joint synthesis of medical images with subtle lesions, with optimal configurations identified for balancing image fidelity and lesion morphology preservation.

Abstract: Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image–mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable $L_p$ objective. As an internal baseline, we include the canonical DDPM-style objective ($ε$-prediction with $L_2$ loss) and isolate the effect of prediction parameterization and $L_p$ geometry under a matched setup. Experiments show that $x_0$-prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ($L_{1.5}$) improve image fidelity while $L_2$ better preserves lesion mask morphology. Our code and model weights are available in https://github.com/MarioPasc/slim-diff

[205] Unifying Watermarking via Dimension-Aware Mapping

Jiale Meng, Runyi Hu, Jie Zhang, Zheming Lu, Ivor Tsang, Tianwei Zhang

Main category: cs.CV

TL;DR: DiM proposes a unified framework for watermarking by modeling watermark information as payloads of different dimensionalities (1D binary messages, 2D spatial masks, 3D spatiotemporal structures) and shows that dimensional configuration determines watermarking behavior.

Details

Motivation: Existing deep watermarking methods share similar encoder-decoder architectures but differ substantially in functional behaviors. The authors aim to unify these methods at the functional level by proposing a dimension-aware mapping framework.

Method: DiM formulates watermarking as a dimension-aware mapping problem where watermark information is modeled as payloads of different dimensionalities. The framework analyzes how dimensional configuration of embedding and extraction determines watermarking behavior, with same-dimensional mappings preserving payload structure and cross-dimensional mappings enabling spatial/spatiotemporal localization.

Result: Experiments in the video domain demonstrate that varying only embedding and extraction dimensions (without architectural changes) leads to different watermarking capabilities including spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.

Conclusion: DiM provides a unified framework for understanding and designing watermarking methods through dimension-aware mapping, revealing that dimensional configuration is a key determinant of watermarking behavior and capabilities.

Abstract: Deep watermarking methods often share similar encoder-decoder architectures, yet differ substantially in their functional behaviors. We propose DiM, a new multi-dimensional watermarking framework that formulates watermarking as a dimension-aware mapping problem, thereby unifying existing watermarking methods at the functional level. Under DiM, watermark information is modeled as payloads of different dimensionalities, including one-dimensional binary messages, two-dimensional spatial masks, and three-dimensional spatiotemporal structures. We find that the dimensional configuration of embedding and extraction largely determines the resulting watermarking behavior. Same-dimensional mappings preserve payload structure and support fine-grained control, while cross-dimensional mappings enable spatial or spatiotemporal localization. We instantiate DiM in the video domain, where spatiotemporal representations enable a broader set of dimension mappings. Experiments demonstrate that varying only the embedding and extraction dimensions, without architectural changes, leads to different watermarking capabilities, including spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.

[206] Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization

Hao Fang, Jinyu Li, Jiawei Kong, Tianqu Zhuang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang

Main category: cs.CV

TL;DR: C3PO: A training framework to reduce hallucinations in multimodal reasoning models through chain-of-thought compression and contrastive preference optimization.

Details

Motivation: Multimodal reasoning models suffer from hallucinations, and current solutions are insufficient. The paper identifies that reasoning mechanisms exacerbate reliance on language priors while overlooking visual inputs, leading to CoTs with reduced visual cues and redundant text tokens.

Method: Proposes C3PO with two components: 1) Chain-of-Thought Compression - selectively filters redundant thinking tokens for compact, signal-efficient CoT representations; 2) Contrastive Preference Optimization - uses reasoning-enhanced preference tuning with AI feedback and a hallucination-inducing mechanism to create informative negative signals for contrastive correction.

Result: Demonstrates consistent hallucination reduction across diverse multimodal reasoning models and benchmarks, with theoretical justification for effectiveness.

Conclusion: C3PO effectively addresses hallucination issues in multimodal reasoning models by improving reasoning trace quality and reducing reliance on language priors through compression and contrastive optimization techniques.

Abstract: While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the hallucination cause and propose C3PO, a training-based mitigation framework comprising \textbf{C}hain-of-Thought \textbf{C}ompression and \textbf{C}ontrastive \textbf{P}reference \textbf{O}ptimization. Firstly, we identify that introducing reasoning mechanisms exacerbates models’ reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models’ inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for the effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.

[207] From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning

Hyun Seok Seong, WonJun Moon, Jae-Pil Heo

Main category: cs.CV

TL;DR: SRL introduces synergistic representation learning to break the vicious cycle between encoder’s sharp attention and decoder’s blurry reconstruction in unsupervised object-centric models by enabling mutual refinement between encoder and decoder.

Details

Motivation: Slot-based object-centric learning models suffer from a fundamental conflict: encoder produces sharp, high-frequency attention maps while decoder generates spatially consistent but blurry reconstructions, creating a vicious cycle where noisy encoder features force decoder to average possibilities, and blurry reconstructions lack high-frequency details to supervise encoder.

Method: Synergistic Representation Learning (SRL) establishes a virtuous cycle where encoder and decoder mutually refine each other - encoder’s sharpness deblurs semantic boundaries in decoder output, while decoder’s spatial consistency denoises encoder’s features. Includes warm-up phase with slot regularization to initially allocate distinct entities per slot.

Result: SRL achieves state-of-the-art results on video object-centric learning benchmarks by bridging the representational gap between encoder and decoder.

Conclusion: The proposed synergistic learning approach successfully breaks the vicious cycle in unsupervised object-centric models, enabling mutual refinement between encoder and decoder representations for improved scene decomposition.

Abstract: Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL) that establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder’s sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder’s spatial consistency to denoise the encoder’s features. This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Codes are available at https://github.com/hynnsk/SRL.

[208] UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning

Piotr Wójcik, Maksym Petrenko, Wojciech Gromski, Przemysław Spurek, Maciej Zieba

Main category: cs.CV

TL;DR: UnHype is a framework that uses hypernetworks to enhance LoRA-based machine unlearning in diffusion models, enabling more adaptive and scalable removal of specific concepts while maintaining overall generative capabilities.

Details

Motivation: Address limitations of current LoRA-based unlearning methods which struggle with concept semantics adaptation, balancing removal of related concepts while maintaining generalization, and scalability for multi-concept erasure in diffusion models.

Method: Incorporates hypernetworks into single- and multi-concept LoRA training; hypernetwork dynamically generates adaptive LoRA weights based on CLIP embeddings during inference for context-aware unlearning; compatible with Stable Diffusion and flow-based text-to-image models.

Result: Demonstrates stable training behavior and effective concept control across challenging tasks including object erasure, celebrity erasure, and explicit content removal; shows effectiveness and versatility in concept removal.

Conclusion: UnHype provides a scalable and adaptive framework for machine unlearning in diffusion models, addressing key limitations of existing LoRA-based approaches through hypernetwork integration.

Abstract: Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. Repository: https://github.com/gmum/UnHype.

[209] Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Main category: cs.CV

TL;DR: Socratic-Geo: Autonomous framework for geometric reasoning in MLLMs using multi-agent interaction to generate high-quality image-text pairs and improve both reasoning and generation capabilities.

Details

Motivation: MLLMs struggle with geometric reasoning due to scarcity of high-quality image-text pairs. Human annotation is expensive, automated methods lack fidelity, and existing approaches are inefficient or passive.

Method: Multi-agent framework with Teacher agent generating parameterized Python scripts with reflective feedback, Solver agent optimizing reasoning through preference learning, and Generator learning image generation from accumulated “image-code-instruction” triplets.

Result: Socratic-Solver achieves 49.11 on six benchmarks using only 1/4 of baseline data, surpassing baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, new SOTA for open-source models.

Conclusion: Socratic-Geo demonstrates effective autonomous data synthesis for geometric reasoning in MLLMs, significantly improving both reasoning and generation capabilities with minimal seed data.

Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding Teacher’s targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated “image-code-instruction” triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing new state-of-the-art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).

[210] ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning

Xiaofeng Tan, Jun Liu, Yuanting Fan, Bin-Bin Gao, Xi Jiang, Xiaochen Chen, Jinlong Peng, Chengjie Wang, Hongsong Wang, Feng Zheng

Main category: cs.CV

TL;DR: ConsistentRFT addresses visual hallucinations in Reinforcement Fine-Tuning of flow-based models by balancing exploration-exploitation tradeoffs and preserving model consistency.

Details

Motivation: RFT on flow-based models often introduces visual hallucinations like over-optimized details and semantic misalignment, which degrade generation quality. The paper aims to understand why these hallucinations occur and develop methods to reduce them.

Method: Proposes ConsistentRFT framework with two key components: 1) Dynamic Granularity Rollout (DGR) that balances exploration between global semantics and local details by dynamically scheduling noise sources, and 2) Consistent Policy Gradient Optimization (CPGO) that preserves model consistency by aligning current policy with a stable prior.

Result: Significantly reduces visual hallucinations: 49% reduction for low-level and 38% for high-level perceptual hallucinations. Outperforms other RFT methods on out-of-domain metrics, showing 5.1% improvement over FLUX1.dev (vs baseline’s -0.4% decrease).

Conclusion: ConsistentRFT effectively mitigates visual hallucinations in flow-based models by addressing exploration-exploitation tradeoffs and preserving model consistency, leading to better alignment and generation quality.

Abstract: Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, they often introduce visual hallucinations like over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective, and reveal the core problems stemming from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, leading to an over-emphasis on local details at the expense of global semantics, and (2) trajectory imitation process inherent in policy gradient methods, distorting the model’s foundational vector field and its cross-step consistency. Building on this, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism to balance exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce a Consistent Policy Gradient Optimization (CPGO) that preserves the model’s consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49% for low-level and 38% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, showing an improvement of 5.1% (v.s. the baseline’s decrease of -0.4%) over FLUX1.dev. This is \href{https://xiaofeng-tan.github.io/projects/ConsistentRFT}{Project Page}.

[211] Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

Yijia Xu, Zihao Wang, Jinshi Cui

Main category: cs.CV

TL;DR: CAG framework improves multi-subject image generation by providing explicit hierarchical guidance from concepts to appearances using VAE dropout and correspondence-aware attention.

Details

Motivation: Existing multi-subject image generation methods suffer from identity inconsistency and limited compositional control because they rely on diffusion models to implicitly associate text prompts with reference images.

Method: Hierarchical Concept-to-Appearance Guidance (CAG) with two key components: 1) VAE dropout training that randomly omits reference VAE features to encourage reliance on robust semantic signals from a Visual Language Model (VLM), and 2) correspondence-aware masked attention module in Diffusion Transformer that restricts each text token to attend only to matched reference regions.

Result: Extensive experiments demonstrate state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.

Conclusion: The proposed CAG framework provides explicit, structured supervision from high-level concepts to fine-grained appearances, achieving better identity preservation and compositional control in multi-subject image generation.

Abstract: Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the multi-subject image generation, substantially improving prompt following and subject consistency.

[212] Contextualized Visual Personalization in Vision-Language Models

Yeongtak Oh, Sangwon Yu, Junsung Park, Han Cheol Moon, Jisoo Mok, Sungroh Yoon

Main category: cs.CV

TL;DR: CoViP: A framework for contextualized visual personalization that enables VLMs to generate personalized responses by associating new images with users’ accumulated visual-textual experiences through reinforcement learning and caption-augmented generation.

Details

Motivation: Current vision-language models fail to generate personalized responses based on users' specific experiences because they lack the ability to associate new visual inputs with accumulated visual-textual context. The paper formalizes this as "contextualized visual personalization" - requiring VLMs to recognize and retrieve personalized visual experiences when interpreting new images.

Method: Proposes CoViP framework that treats personalized image captioning as a core task. Uses reinforcement-learning-based post-training and caption-augmented generation to improve contextualized visual personalization. Introduces diagnostic evaluations to ensure models truly leverage visual context rather than textual shortcuts.

Result: Extensive experiments show existing open-source and proprietary VLMs have substantial limitations in contextualized visual personalization. CoViP improves personalized image captioning and yields holistic gains across downstream personalization tasks.

Conclusion: CoViP represents a crucial stage for enabling robust and generalizable contextualized visual personalization in vision-language models, addressing the gap in generating truly personalized responses based on users’ visual experiences.

Abstract: Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user’s specific experiences, as they lack the ability to associate visual inputs with a user’s accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.

[213] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li

Main category: cs.CV

TL;DR: InteractAvatar: A dual-stream framework for generating talking avatars performing grounded human-object interactions with environmental perception and audio-visual coordination

Details

Motivation: Existing talking avatar generation methods focus on simple human motion but cannot handle grounded human-object interactions (GHOI) which require environmental perception and text-aligned interactions with surrounding objects. The control-quality dilemma in GHOI generation needs to be addressed.

Method: Proposes InteractAvatar with dual-stream framework: 1) Perception and Interaction Module (PIM) for text-aligned interaction motion generation using detection for environmental perception, 2) Audio-Interaction Aware Generation Module (AIM) for synthesizing vivid talking avatars performing object interactions, and 3) motion-to-video aligner enabling parallel co-generation of motions and videos.

Result: Establishes GroundedInter benchmark for evaluating GHOI video generation. Extensive experiments demonstrate effectiveness in generating grounded human-object interactions for talking avatars, addressing the control-quality dilemma.

Conclusion: InteractAvatar successfully generates talking avatars performing grounded human-object interactions by decoupling perception/planning from video synthesis, enabling environmental perception and coordinated audio-visual generation.

Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io

[214] Inlier-Centric Post-Training Quantization for Object Detection Models

Minsu Kim, Dongyeun Lee, Jaemyung Yu, Jiwan Hur, Giseop Kim, Junmo Kim

Main category: cs.CV

TL;DR: InlierQ is a post-training quantization method for object detection that separates task-irrelevant anomalies from informative inliers using gradient-aware volume saliency scores and EM algorithm, reducing quantization error with minimal calibration data.

Details

Motivation: Object detection has high computational demands making deployment slow and power-hungry. Task-irrelevant morphologies like background clutter and sensor noise create redundant activations that expand activation ranges and skew distributions, complicating bit allocation and weakening preservation of informative features during quantization.

Method: InlierQ computes gradient-aware volume saliency scores for each activation volume, classifies volumes as inliers or anomalies, and fits a posterior distribution over these scores using Expectation-Maximization (EM) algorithm. This suppresses anomalies while preserving informative features. The method is label-free, drop-in, and requires only 64 calibration samples.

Result: Experiments on COCO and nuScenes benchmarks show consistent reductions in quantization error for camera-based (2D and 3D) and LiDAR-based (3D) object detection across different quantization settings.

Conclusion: InlierQ effectively addresses the challenge of task-irrelevant anomalies in object detection quantization by separating anomalies from informative inliers, enabling more efficient bit allocation and better preservation of critical features for deployment on resource-constrained devices.

Abstract: Object detection is pivotal in computer vision, yet its immense computational demands make deployment slow and power-hungry, motivating quantization. However, task-irrelevant morphologies such as background clutter and sensor noise induce redundant activations (or anomalies). These anomalies expand activation ranges and skew activation distributions toward task-irrelevant responses, complicating bit allocation and weakening the preservation of informative features. Without a clear criterion to distinguish anomalies, suppressing them can inadvertently discard useful information. To address this, we present InlierQ, an inlier-centric post-training quantization approach that separates anomalies from informative inliers. InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or anomaly, and fits a posterior distribution over these scores using the Expectation-Maximization (EM) algorithm. This design suppresses anomalies while preserving informative features. InlierQ is label-free, drop-in, and requires only 64 calibration samples. Experiments on the COCO and nuScenes benchmarks show consistent reductions in quantization error for camera-based (2D and 3D) and LiDAR-based (3D) object detection.

[215] Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang

Main category: cs.CV

TL;DR: DiSCo framework disentangles structure-content alignment for table images, enabling LVLMs to understand complex table layouts without expensive training or external tools.

Details

Motivation: Table image reasoning is challenging for LVLMs due to complex layouts and coupled structure-content information. Existing solutions require expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability.

Method: Two-stage framework: 1) DiSCo disentangles structural abstraction from semantic grounding during multimodal alignment; 2) Table-GLS performs global-to-local structure-guided reasoning via structured exploration and evidence-grounded inference.

Result: Extensive experiments across diverse benchmarks show the framework efficiently enhances LVLM’s table understanding and reasoning capabilities, with strong generalization to unseen table structures.

Conclusion: The proposed framework enables LVLMs to adapt to table reasoning with minimal annotation and no external tools, addressing key challenges in table image understanding.

Abstract: Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to tables structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLM’s table understanding and reasoning capabilities, particularly generalizing to unseen table structures.

[216] Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Bozhou Li, Yushuo Guan, Haolin Li, Bohan Zeng, Yiyan Ji, Yue Ding, Pengfei Wan, Kun Gai, Yuanxing Zhang, Wentao Zhang

Main category: cs.CV

TL;DR: Dynamic LLM feature fusion for DiT-based text-to-image models using depth-wise semantic routing improves text-image alignment and compositional generation.

Details

Motivation: Current DiT-based text-to-image models use static text conditioning from LLMs, ignoring the semantic hierarchy across LLM layers and non-stationary denoising dynamics over diffusion time and network depth. This mismatch limits generative capability.

Method: Introduces a unified normalized convex fusion framework with lightweight gates to systematically organize multi-layer LLM hidden states via three strategies: time-wise fusion (conditioning varies with diffusion timestep), depth-wise fusion (conditioning varies with DiT network depth), and joint fusion (both).

Result: Depth-wise semantic routing emerges as superior, consistently improving text-image alignment and compositional generation (e.g., +9.97 on GenAI-Bench Counting task). Time-wise fusion degrades visual fidelity due to train-inference trajectory mismatch under classifier-free guidance.

Conclusion: Depth-wise routing is an effective baseline for dynamic text conditioning in DiT models. Time-dependent conditioning requires trajectory-aware signals to avoid semantic mistiming during inference.

Abstract: Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model’s generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.

[217] Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning

Xufei Zhang, Xinjiao Zhou, Ziling Deng, Dongdong Geng, Jianxiong Wang

Main category: cs.CV

TL;DR: LogiCls: A vision-language framework for logical anomaly classification that detects and classifies violations of predefined constraints in industrial images, providing interpretable evidence trails.

Details

Motivation: Current anomaly detection methods treat anomalies as binary decisions without indicating which logical rules are broken, limiting their value for quality assurance in industrial settings where understanding specific constraint violations is crucial.

Method: Proposes LogiCls framework that decomposes complex logical constraints into verifiable subqueries, uses data-centric instruction synthesis with chain-of-thought supervision, couples precise grounding annotations with image-text augmentations, and employs difficulty-aware resampling to emphasize challenging subqueries.

Result: Extensive experiments show LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both predicted violation categories and evidence trails.

Conclusion: The paper introduces Logical Anomaly Classification (LAC) as a unified task for anomaly detection and violation classification, with LogiCls demonstrating effective vision-language reasoning for industrial quality assurance.

Abstract: Logical anomalies are violations of predefined constraints on object quantity, spatial layout, and compositional relationships in industrial images. While prior work largely treats anomaly detection as a binary decision, such formulations cannot indicate which logical rule is broken and therefore offer limited value for quality assurance. We introduce Logical Anomaly Classification (LAC), a task that unifies anomaly detection and fine-grained violation classification in a single inference step. To tackle LAC, we propose LogiCls, a vision-language framework that decomposes complex logical constraints into a sequence of verifiable subqueries. We further present a data-centric instruction synthesis pipeline that generates chain-of-thought (CoT) supervision for these subqueries, coupling precise grounding annotations with diverse image-text augmentations to adapt vision language models (VLMs) to logic-sensitive reasoning. Training is stabilized by a difficulty-aware resampling strategy that emphasizes challenging subqueries and long tail constraint types. Extensive experiments demonstrate that LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both the predicted violation categories and their evidence trails.

[218] PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation

Yongwei Chen, Tianyi Wei, Yushi Lan, Zhaoyang Lyu, Shangchen Zhou, Xudong Xu, Xingang Pan

Main category: cs.CV

TL;DR: Unified AR+Diffusion framework for 3D understanding and generation that bridges LLMs with 3D diffusion models via lightweight transformer, achieving SOTA across diverse 3D tasks.

Details

Motivation: Existing unified frameworks for 3D tasks under autoregressive paradigms suffer from performance degradation due to forced signal quantization and high training costs. The authors aim to create a unified 3D framework that effectively combines understanding and generation while leveraging pretrained models.

Method: Combines autoregressive next-token prediction for 3D understanding with continuous diffusion for 3D generation. Uses lightweight transformer to bridge feature space of LLMs with conditional space of 3D diffusion models, enabling cross-modal information exchange while preserving standalone model priors.

Result: Achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, and excels in 3D editing tasks. Demonstrates effective information interaction between generation and understanding with minimal compromise to inherent capabilities.

Conclusion: Unified AR+diffusion models represent a promising direction for building general-purpose 3D intelligence, showing that effective information interaction between understanding and generation can be achieved without forcing a single paradigm.

Abstract: The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.

[219] Constrained Dynamic Gaussian Splatting

Zihan Zheng, Zhenglong Wu, Xuanxuan Wang, Houqiang Zhong, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai, Wenjun Zhang

Main category: cs.CV

TL;DR: CDGS is a budget-constrained optimization framework for dynamic Gaussian splatting that enforces strict Gaussian budgets during training through differentiable budget control and adaptive allocation between static/dynamic elements.

Details

Motivation: Dynamic Gaussian Splatting faces a deployment dilemma: unconstrained densification causes excessive memory consumption incompatible with edge devices, while heuristic pruning fails to achieve optimal rendering quality under preset Gaussian budgets.

Method: Proposes Constrained Dynamic Gaussian Splatting (CDGS) with: 1) Differentiable budget controller using multi-modal unified importance score (geometric, motion, perceptual cues), 2) Decoupled optimization of static/dynamic elements with adaptive capacity allocation, 3) Three-phase training strategy, 4) Dual-mode hybrid compression scheme.

Result: CDGS strictly adheres to hardware constraints (error < 2%), pushes Pareto frontier of rate-distortion performance, achieves over 3x compression compared to state-of-the-art methods, and delivers optimal rendering quality under varying capacity limits.

Conclusion: CDGS successfully addresses the deployment dilemma of dynamic Gaussian splatting by formulating it as a budget-constrained optimization problem, enabling high-quality 4D reconstruction within strict hardware constraints.

Abstract: While Dynamic Gaussian Splatting enables high-fidelity 4D reconstruction, its deployment is severely hindered by a fundamental dilemma: unconstrained densification leads to excessive memory consumption incompatible with edge devices, whereas heuristic pruning fails to achieve optimal rendering quality under preset Gaussian budgets. In this work, we propose Constrained Dynamic Gaussian Splatting (CDGS), a novel framework that formulates dynamic scene reconstruction as a budget-constrained optimization problem to enforce a strict, user-defined Gaussian budget during training. Our key insight is to introduce a differentiable budget controller as the core optimization driver. Guided by a multi-modal unified importance score, this controller fuses geometric, motion, and perceptual cues for precise capacity regulation. To maximize the utility of this fixed budget, we further decouple the optimization of static and dynamic elements, employing an adaptive allocation mechanism that dynamically distributes capacity based on motion complexity. Furthermore, we implement a three-phase training strategy to seamlessly integrate these constraints, ensuring precise adherence to the target count. Coupled with a dual-mode hybrid compression scheme, CDGS not only strictly adheres to hardware constraints (error < 2%}) but also pushes the Pareto frontier of rate-distortion performance. Extensive experiments demonstrate that CDGS delivers optimal rendering quality under varying capacity limits, achieving over 3x compression compared to state-of-the-art methods.

[220] Cut to the Mix: Simple Data Augmentation Outperforms Elaborate Ones in Limited Organ Segmentation Datasets

Chang Liu, Fuxin Fan, Annette Schwarz, Andreas Maier

Main category: cs.CV

TL;DR: Investigates four inter-image data augmentation strategies (CutMix, CarveMix, ObjectAug, AnatoMix) for multi-organ segmentation in medical imaging, showing performance improvements over baseline nnUNet.

Details

Motivation: Deep learning segmentation models require large annotated datasets, but medical imaging often has limited data due to clinical constraints. Traditional data augmentation uses basic intra-image operations, while inter-image strategies that combine content from different individuals are under-explored for multi-organ segmentation.

Method: Evaluated four inter-image DA strategies on two organ segmentation datasets: CutMix (randomly cutting and pasting patches), CarveMix (carving and pasting irregular regions), ObjectAug (object-level augmentation), and AnatoMix (anatomy-aware mixing). Compared performance against state-of-the-art nnUNet baseline without DA.

Result: CutMix improved average dice score by 4.9, CarveMix by 2.0, and AnatoMix by 1.9 compared to nnUNet without DA. Performance further improved when combined with traditional DA strategies. CutMix proved robust and effective despite producing intuitively ‘wrong’ images.

Conclusion: Inter-image data augmentation strategies, particularly CutMix, significantly improve multi-organ segmentation performance with limited data. These methods create more diverse training samples by combining content from different individuals, addressing data scarcity in medical imaging.

Abstract: Multi-organ segmentation is a widely applied clinical routine and automated organ segmentation tools dramatically improve the pipeline of the radiologists. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish such a task. However, the training of the segmentation networks requires large amount of data with manual annotations, which is a major concern due to the data scarcity from clinic. Working with limited data is still common for researches on novel imaging modalities. To enhance the effectiveness of DL models trained with limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, i.e. generating images with different orientations and intensity distributions. In contrast, the interimage and object-level DA operations are able to create new images from separate individuals. However, such DA strategies are not well explored on the task of multi-organ segmentation. In this paper, we investigated four possible inter-image DA strategies: CutMix, CarveMix, ObjectAug and AnatoMix, on two organ segmentation datasets. The result shows that CutMix, CarveMix and AnatoMix can improve the average dice score by 4.9, 2.0 and 1.9, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. It is revealed in our experiments that Cut-Mix is a robust but simple DA strategy to drive up the segmentation performance for multi-organ segmentation, even when CutMix produces intuitively ‘wrong’ images. Our implementation is publicly available for future benchmarks.

[221] SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Ming Nie, Dan Ding, Chunwei Wang, Yuanfan Guo, Jianhua Han, Hang Xu, Li Zhang

Main category: cs.CV

TL;DR: SlowFocus mechanism enhances video LLMs by identifying query-relevant temporal segments for dense sampling, combining local high-frequency features with global low-frequency contexts for improved fine-grained video understanding.

Details

Motivation: Current Video LLMs struggle to simultaneously maintain high-quality frame-level semantics (enough tokens per frame) and comprehensive video-level temporal information (enough sampled frames per video), limiting fine-grained video understanding capabilities.

Method: SlowFocus mechanism: 1) Identifies query-related temporal segments, 2) Performs dense sampling on these segments for local high-frequency features, 3) Uses multi-frequency mixing attention to aggregate local high-frequency details with global low-frequency contexts, 4) Includes training strategies to enhance temporal grounding and detailed reasoning.

Result: Superior performance on existing public video understanding benchmarks and the proposed FineAction-CGR benchmark, demonstrating enhanced fine-grained temporal understanding capabilities.

Conclusion: SlowFocus effectively addresses the trade-off between frame-level quality and temporal coverage in Video LLMs, enabling better fine-grained video understanding through adaptive temporal focusing and multi-frequency feature aggregation.

Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks. Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.

[222] High-Resolution Underwater Camouflaged Object Detection: GBU-UCOD Dataset and Topology-Aware and Frequency-Decoupled Networks

Wenji Wu, Shuo Ye, Yiyu Liu, Jiguang He, Zhuo Wang, Zitong Yu

Main category: cs.CV

TL;DR: DeepTopo-Net is a novel framework for underwater camouflaged object detection that integrates topology-aware modeling with frequency-decoupled perception to address challenges in detecting slender and transparent marine organisms across varying depths.

Details

Motivation: Underwater Camouflaged Object Detection (UCOD) is challenging due to extreme visual similarity between targets and backgrounds across marine depths. Existing methods struggle with topological fragmentation of slender creatures and subtle feature extraction of transparent organisms.

Method: Proposes DeepTopo-Net with Water-Conditioned Adaptive Perceptor (WCAP) using Riemannian metric tensors to dynamically deform convolutional sampling fields, and Abyssal-Topology Refinement Module (ATRM) to maintain structural connectivity through skeletal priors. Also introduces GBU-UCOD, the first high-resolution (2K) benchmark for marine vertical zonation.

Result: Extensive experiments on MAS3K, RMAS, and the proposed GBU-UCOD datasets demonstrate state-of-the-art performance, particularly in preserving morphological integrity of complex underwater patterns.

Conclusion: DeepTopo-Net effectively addresses challenges in underwater camouflaged object detection through topology-aware modeling and frequency-decoupled perception, with the new GBU-UCOD benchmark filling data gaps for hadal and abyssal zones.

Abstract: Underwater Camouflaged Object Detection (UCOD) is a challenging task due to the extreme visual similarity between targets and backgrounds across varying marine depths. Existing methods often struggle with topological fragmentation of slender creatures in the deep sea and the subtle feature extraction of transparent organisms. In this paper, we propose DeepTopo-Net, a novel framework that integrates topology-aware modeling with frequency-decoupled perception. To address physical degradation, we design the Water-Conditioned Adaptive Perceptor (WCAP), which employs Riemannian metric tensors to dynamically deform convolutional sampling fields. Furthermore, the Abyssal-Topology Refinement Module (ATRM) is developed to maintain the structural connectivity of spindly targets through skeletal priors. Specifically, we first introduce GBU-UCOD, the first high-resolution (2K) benchmark tailored for marine vertical zonation, filling the data gap for hadal and abyssal zones. Extensive experiments on MAS3K, RMAS, and our proposed GBU-UCOD datasets demonstrate that DeepTopo-Net achieves state-of-the-art performance, particularly in preserving the morphological integrity of complex underwater patterns. The datasets and codes will be released at https://github.com/Wuwenji18/GBU-UCOD.

[223] TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection

Alireza Salehi, Ehsan Karami, Sepehr Noey, Sahand Noey, Makoto Yamada, Reshad Hosseini, Mohammad Sabokrou

Main category: cs.CV

TL;DR: TIPS-based zero-shot anomaly detection pipeline improves both image-level detection and pixel-level localization by addressing CLIP’s spatial misalignment and weak sensitivity to fine-grained anomalies through decoupled prompts and local evidence injection.

Details

Motivation: Zero-shot anomaly detection (ZSAD) using vision-language models faces limitations with CLIP's coarse image-text alignment, specifically spatial misalignment and weak sensitivity to fine-grained anomalies. Prior work adds complex auxiliary modules but overlooks backbone choice.

Method: Uses TIPS VLM (trained with spatially aware objectives) as backbone, addresses distributional gap between global and local features with decoupled prompts (fixed for image-level detection, learnable for pixel-level localization), and injects local evidence into global score.

Result: Improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with lean architecture without CLIP-specific tricks.

Conclusion: TIPS-based pipeline effectively addresses CLIP’s limitations for zero-shot anomaly detection, achieving significant improvements in both detection and localization with simpler architecture.

Abstract: Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP’s coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS-a VLM trained with spatially aware objectives. While TIPS alleviates CLIP’s issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts-fixed for image-level detection and learnable for pixel-level localization-and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.

[224] Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Haichao Jiang, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

Main category: cs.CV

TL;DR: Refer-Agent: A collaborative multi-agent system with alternating reasoning-reflection mechanisms for zero-shot Referring Video Object Segmentation, outperforming both supervised fine-tuning and existing zero-shot methods.

Details

Motivation: Current RVOS methods rely heavily on supervised fine-tuning of MLLMs, which suffers from data dependence and poor scalability with rapidly evolving MLLMs. Zero-shot approaches offer flexibility but lag significantly in performance due to simplistic workflow designs.

Method: Proposes Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. Features: 1) Coarse-to-Fine frame selection for diversity and textual relevance, 2) Dynamic Focus Layout for adaptive visual attention, 3) Chain-of-Reflection mechanism with Questioner-Responder pair for self-verification and feedback generation.

Result: Extensive experiments on five challenging benchmarks show Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. The system is flexible and enables fast integration of new MLLMs without additional fine-tuning.

Conclusion: Refer-Agent provides an effective zero-shot solution for RVOS that overcomes limitations of supervised fine-tuning while achieving superior performance through collaborative multi-agent reasoning and reflection mechanisms.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure the frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent’s visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generates feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.

[225] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar

Main category: cs.CV

TL;DR: EB-JEPA is an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs) that predict in representation space rather than pixel space, with applications from images to video and action-conditioned world models.

Details

Motivation: To provide accessible, modular implementations of JEPA-based representation learning that can scale from images to video and action-conditioned world models, making energy-based self-supervised learning practical for research and education on single GPUs.

Method: Developed an open-source library with modular JEPA implementations that learn representations by predicting in embedding space rather than pixel space. Applied to CIFAR-10 for image representation learning, Moving MNIST for video temporal modeling, and Two Rooms navigation task for action-conditioned world models.

Result: Achieved 91% accuracy on CIFAR-10 representation probing, demonstrated multi-step prediction on Moving MNIST, and achieved 97% planning success rate on Two Rooms navigation task. Comprehensive ablations showed critical importance of regularization components for preventing representation collapse.

Conclusion: EB-JEPA successfully demonstrates that JEPA-based representation learning scales effectively from images to video and action-conditioned world models, providing accessible implementations for research and education while achieving strong performance on benchmark tasks.

Abstract: We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.

[226] KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs

Baiyang Song, Jun Peng, Yuxin Zhang, Guangyao Chen, Feidiao Yang, Jianyuan Guo

Main category: cs.CV

TL;DR: KTV: A two-stage training-free video understanding framework that reduces visual redundancy through question-agnostic keyframe selection and key visual token pruning for efficient video comprehension.

Details

Motivation: Training-free video understanding suffers from severe visual redundancy and high computational overhead when processing long videos. Existing keyframe selection strategies based on CLIP similarity are prone to biases and may overlook critical frames, resulting in suboptimal video comprehension.

Method: Two-stage framework: 1) Question-agnostic keyframe selection by clustering frame-level visual features to get compact, diverse, representative subset; 2) Key visual token selection by pruning redundant or less informative tokens from each keyframe based on token importance and redundancy.

Result: Outperforms state-of-the-art training-free baselines on Multiple-Choice VideoQA while using significantly fewer visual tokens (e.g., only 504 tokens for 60-min video with 10800 frames). Achieves 44.8% accuracy on MLVU-Test benchmark and exceeds several training-based approaches on certain benchmarks.

Conclusion: KTV provides an efficient and effective training-free video understanding framework that addresses visual redundancy and computational overhead issues while maintaining strong video comprehension performance.

Abstract: Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose \textbf{KTV}, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, \emph{e.g.}, only 504 visual tokens for a 60-min video with 10800 frames, achieving $44.8%$ accuracy on the MLVU-Test benchmark. In particular, KTV also exceeds several training-based approaches on certain benchmarks.

[227] Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis

Lu Zhang, Huizhen Yu, Zuowei Wang, Fu Gui, Yatu Guo, Wei Zhang, Mengyu Jia

Main category: cs.CV

TL;DR: A unified multimodal framework for retinal disease diagnosis that synthesizes and fuses FFA, MSI, and saliency maps, using parallel models for modality-specific representation learning with adaptive feature calibration.

Details

Motivation: Retinal disease diagnosis benefits from multimodal data but faces challenges like data heterogeneity, potential invasiveness, and registration complexity. Current approaches lack unified frameworks for both multimodal data synthesis and fusion.

Method: Proposes a unified framework that: 1) synthesizes multimodal data including FFA, MSI, and saliency maps highlighting lesions and optic regions; 2) trains parallel models to learn modality-specific representations capturing cross-pathophysiological signatures; 3) adaptively calibrates features within and across modalities for information pruning and flexible integration based on downstream tasks.

Result: Achieves superior performance on two public datasets: multi-label classification (F1-score: 0.683, AUC: 0.953) and diabetic retinopathy grading (Accuracy: 0.842, Kappa: 0.861). The system is thoroughly interpreted through visualizations in image and feature spaces.

Conclusion: The framework enhances retinal disease screening accuracy and efficiency while providing a scalable approach for data augmentation across medical imaging modalities. It addresses key challenges in multimodal medical diagnosis through integrated synthesis and fusion.

Abstract: Retinal diseases spanning a broad spectrum can be effectively identified and diagnosed using complementary signals from multimodal data. However, multimodal diagnosis in ophthalmic practice is typically challenged in terms of data heterogeneity, potential invasiveness, registration complexity, and so on. As such, a unified framework that integrates multimodal data synthesis and fusion is proposed for retinal disease classification and grading. Specifically, the synthesized multimodal data incorporates fundus fluorescein angiography (FFA), multispectral imaging (MSI), and saliency maps that emphasize latent lesions as well as optic disc/cup regions. Parallel models are independently trained to learn modality-specific representations that capture cross-pathophysiological signatures. These features are then adaptively calibrated within and across modalities to perform information pruning and flexible integration according to downstream tasks. The proposed learning system is thoroughly interpreted through visualizations in both image and feature spaces. Extensive experiments on two public datasets demonstrated the superiority of our approach over state-of-the-art ones in the tasks of multi-label classification (F1-score: 0.683, AUC: 0.953) and diabetic retinopathy grading (Accuracy:0.842, Kappa: 0.861). This work not only enhances the accuracy and efficiency of retinal disease screening but also offers a scalable framework for data augmentation across various medical imaging modalities.

[228] Multi-Objective Optimization for Synthetic-to-Real Style Transfer

Estelle Chigot, Thomas Oberlin, Manon Huguenin, Dennis Wilson

Main category: cs.CV

TL;DR: Evolutionary optimization of style transfer pipelines for synthetic-to-real domain adaptation in semantic segmentation, using multi-objective genetic algorithms to balance structural coherence and style similarity.

Details

Motivation: Semantic segmentation requires pixel-level annotations that are expensive to obtain. Synthetic data from game engines can provide annotations but suffers from domain gap. Style transfer can bridge this gap, but designing effective transformation pipelines is challenging due to large combinatorial search space.

Method: Use multi-objective genetic algorithms to optimize style transfer pipelines, balancing structural coherence and style similarity. Employ paired-image metrics on individual samples during evolution for rapid evaluation, then validate with distributional metrics and segmentation performance on target domains.

Result: Evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. Applied to GTA5→Cityscapes/ACDC domain adaptation, the approach demonstrates effective pipeline optimization for synthetic-to-real adaptation.

Conclusion: Style transfer can be formulated as a sequencing problem suitable for evolutionary optimization. Efficient metrics enable feasible search in this space, providing diverse augmentation pipelines for domain adaptation in semantic segmentation.

Abstract: Semantic segmentation networks require large amounts of pixel-level annotated data, which are costly to obtain for real-world images. Computer graphics engines can generate synthetic images alongside their ground-truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi-objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired-image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic-to-real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives. The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: https://github.com/echigot/MOOSS.

[229] SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection

Wei Zhang, Xiang Liu, Ningjing Liu, Mingxin Liu, Wei Liao, Chunyan Xu, Xue Yang

Main category: cs.CV

TL;DR: SPWOOD: A sparse partial weakly-supervised framework for oriented object detection in remote sensing that uses few sparse weak labels and unlabeled data to reduce annotation costs.

Details

Motivation: Remote sensing object detection faces high annotation costs due to dense object distribution and wide category variety. Existing methods require extensive labeling, so the paper aims to develop a more cost-effective solution using minimal weak supervision.

Method: Three key innovations: 1) SOS-Student model separates unlabeled objects from background and learns orientation/scale from weak annotations; 2) Multi-level Pseudo-label Filtering uses model prediction distributions; 3) Sparse partitioning ensures equal category treatment.

Result: Extensive experiments on DOTA and DIOR datasets show significant performance gains over traditional oriented object detection methods, offering a highly cost-effective solution.

Conclusion: The SPWOOD framework successfully addresses large-scale labeling challenges in remote sensing by efficiently leveraging sparse weak labels and unlabeled data, achieving strong performance with reduced annotation costs.

Abstract: A consistent trend throughout the research of oriented object detection has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high costs. Based on the supervision level, existing oriented object detection algorithms can be broadly grouped into fully supervised, semi-supervised, and weakly supervised methods. Within the scope of this work, we further categorize them to include sparsely supervised and partially weakly-supervised methods. To address the challenges of large-scale labeling, we introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection framework, designed to efficiently leverage only a few sparse weakly-labeled data and plenty of unlabeled data. Our framework incorporates three key innovations: (1) We design a Sparse-annotation-Orientation-and-Scale-aware Student (SOS-Student) model to separate unlabeled objects from the background in a sparsely-labeled setting, and learn orientation and scale information from orientation-agnostic or scale-agnostic weak annotations. (2) We construct a novel Multi-level Pseudo-label Filtering strategy that leverages the distribution of model predictions, which is informed by the model’s multi-layer predictions. (3) We propose a unique sparse partitioning approach, ensuring equal treatment for each category. Extensive experiments on the DOTA and DIOR datasets show that our framework achieves a significant performance gain over traditional oriented object detection methods mentioned above, offering a highly cost-effective solution. Our code is publicly available at https://github.com/VisionXLab/SPWOOD.

[230] MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment

Eunkyu Park, Wesley Hanwen Deng, Cheyon Jin, Matheus Kunzler Maldaner, Jordan Wheeler, Jason I. Hong, Hong Shen, Adam Perer, Ken Holstein, Motahhare Eslami, Gunhee Kim

Main category: cs.CV

TL;DR: MM-SCALE dataset enables multimodal moral reasoning alignment in VLMs using 5-point scalar ratings and explicit modality grounding for richer supervision signals.

Details

Motivation: Current VLMs struggle with morally salient judgments in multimodal contexts, and existing binary/pairwise supervision fails to capture the continuous, pluralistic nature of human moral reasoning.

Method: Created MM-SCALE dataset with image-scenario pairs annotated with 5-point moral acceptability scores and grounded reasoning labels, enabling listwise preference optimization over ranked scenario sets.

Result: VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration compared to those trained with binary signals.

Conclusion: Scalar supervision provides richer alignment signals and finer calibration for multimodal moral reasoning in VLMs, addressing limitations of discrete supervision methods.

Abstract: Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fail to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.

[231] Referring Industrial Anomaly Segmentation

Pengfei Yue, Xiaokang Jiang, Yilin Lu, Jianghang Lin, Shengchuan Zhang, Liujuan Cao

Main category: cs.CV

TL;DR: RIAS introduces a language-guided industrial anomaly detection paradigm that uses text descriptions to generate precise anomaly masks without manual thresholds, enabling single-model detection of diverse anomalies through universal prompts.

Details

Motivation: Traditional industrial anomaly detection methods face limitations: unsupervised approaches require manual thresholds for rough localizations, supervised methods overfit due to scarce/imbalanced data, and both suffer from "One Anomaly Class, One Model" constraint. There's a need for more flexible, open-set capable detection.

Method: Proposes Referring Industrial Anomaly Segmentation (RIAS) paradigm using language to guide detection. Introduces MVTec-Ref dataset with diverse referring expressions focusing on anomaly patterns (95% small anomalies). Presents DQFormer benchmark with Dual Query Token (Anomaly/Background) and Mask Group Transformer, enhanced by Language-Gated Multi-Level Aggregation for multi-scale segmentation.

Result: RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. Experiments demonstrate effectiveness in advancing IAD toward open-set capabilities.

Conclusion: RIAS represents a significant advancement in industrial anomaly detection by leveraging language guidance to overcome traditional limitations, enabling more flexible and efficient detection with a single model for diverse anomaly types.

Abstract: Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the “One Anomaly Class, One Model” limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only “Anomaly” and “Background” tokens for efficient visual-textual integration. Experiments demonstrate RIAS’s effectiveness in advancing IAD toward open-set capabilities. Code: https://github.com/swagger-coder/RIAS-MVTec-Ref.

[232] RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Wenfang Sun, Hao Chen, Yingjun Du, Yefeng Zheng, Cees G. M. Snoek

Main category: cs.CV

TL;DR: RegionReasoner: RL framework for multi-round visual reasoning with explicit region grounding and global-local consistency rewards

Details

Motivation: Existing vision-language models rely on single-step or text-only reasoning, limiting iterative refinement across multiple visual contexts. Need for systematic evaluation and improved multi-round visual reasoning.

Method: Propose RegionReasoner: reinforcement learning framework requiring reasoning traces to explicitly cite reference bounding boxes. Uses global-local consistency reward extracting key objects/nouns from scene and region captions, aligning with reasoning trace. Optimized with structured rewards for grounding fidelity and semantic alignment.

Result: RegionReasoner-7B with new benchmark RegionDial-Bench improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency on detection and segmentation tasks.

Conclusion: Establishes strong baseline for multi-round visual reasoning with explicit grounding and consistency mechanisms, advancing beyond single-step reasoning approaches.

Abstract: Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.

[233] Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment

Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi

Main category: cs.CV

TL;DR: Lightweight two-stage pipeline for autonomous underground infrastructure inspection combining efficient defect segmentation (RAPID-SCAN) with fine-tuned vision-language model (Phi-3.5) for natural language summarization, deployed on edge devices.

Details

Motivation: Automated generation of human-readable summaries from robotic inspection data is challenging on resource-constrained edge devices, creating a gap between defect detection and actionable maintenance insights.

Method: Two-stage pipeline: 1) RAPID-SCAN segmentation model (0.64M parameters) for defect detection, 2) Fine-tuned Phi-3.5 VLM for natural language summarization. Uses post-training quantization and hardware optimization for edge deployment.

Result: Achieves 0.834 F1-score for segmentation, generates domain-specific summaries, reduces model size and latency via quantization, and demonstrates effectiveness on mobile robotic platform in real-world scenarios.

Conclusion: Edge-deployable integrated AI systems can bridge automated defect detection and actionable maintenance insights, enabling scalable autonomous inspection solutions for underground infrastructure.

Abstract: Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.

[234] LIVE: Long-horizon Interactive Video World Modeling

Junchao Huang, Ziyang Ye, Xinting Hu, Tianyu He, Guiyu Zhang, Shaoshuai Shi, Jiang Bian, Li Jiang

Main category: cs.CV

TL;DR: LIVE is a long-horizon video world model that uses cycle consistency to bound error accumulation in autoregressive video prediction, eliminating need for teacher models and enabling stable long-term generation.

Details

Motivation: Autoregressive video world models suffer from error accumulation over long horizons, and existing methods using teacher models are computationally expensive and fail to prevent error propagation beyond training horizons.

Method: LIVE enforces bounded error accumulation via a cycle-consistency objective: forward rollout from ground-truth frames followed by reverse generation to reconstruct initial state, with diffusion loss computed on reconstructed terminal state. Also includes progressive training curriculum.

Result: Achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.

Conclusion: LIVE provides an effective approach for long-horizon video generation with bounded error accumulation, outperforming previous methods without requiring teacher models.

Abstract: Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide an unified view that encompasses different approaches and introduce progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.

[235] See-through: Single-image Layer Decomposition for Anime Characters

Jian Lin, Chengze Li, Haoyun Qin, Kwun Wang Chan, Yanghua Jin, Hanyuan Liu, Stephen Chun Wang Choy, Xueting Liu

Main category: cs.CV

TL;DR: Automated framework transforms static anime illustrations into manipulatable 2.5D models using semantic layer decomposition and diffusion-based consistency modules.

Details

Motivation: Current professional workflows for animating static anime illustrations require tedious manual segmentation and artistic hallucination of occluded regions, which is time-consuming and labor-intensive.

Method: Decomposes single images into fully inpainted, semantically distinct layers with inferred drawing orders. Uses a scalable engine that bootstraps supervision from commercial Live2D models. Combines diffusion-based Body Part Consistency Module for global geometric coherence with pixel-level pseudo-depth inference for intricate stratification resolution.

Result: Approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications, effectively resolving complex layer structures like interleaving hair strands.

Conclusion: The framework successfully automates the transformation of static anime illustrations into dynamic 2.5D models, overcoming data scarcity through bootstrapped supervision and achieving professional-quality results.

Abstract: We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Current professional workflows require tedious manual segmentation and the artistic ``hallucination’’ of occluded regions to enable motion. Our approach overcomes this by decomposing a single image into fully inpainted, semantically distinct layers with inferred drawing orders. To address the scarcity of training data, we introduce a scalable engine that bootstraps high-quality supervision from commercial Live2D models, capturing pixel-perfect semantics and hidden geometry. Our methodology couples a diffusion-based Body Part Consistency Module, which enforces global geometric coherence, with a pixel-level pseudo-depth inference mechanism. This combination resolves the intricate stratification of anime characters, e.g., interleaving hair strands, allowing for dynamic layer reconstruction. We demonstrate that our approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications.

[236] Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives

Owen Dong, Lily Gao, Manish Kota, Bennett A. Landmana, Jelena Bekvalac, Gaynor Western, Katherine D. Van Schaik

Main category: cs.CV

TL;DR: Using large vision-language models for zero-shot classification of heterogeneous paleoradiology images to identify bones, projection views, and laterality for efficient dataset navigation.

Details

Motivation: Paleoradiology images are highly heterogeneous with disarticulated bones, ad hoc positioning, and missing laterality markers, making content navigation and triaging time-consuming for expert analysis. There's a need for automated methods to accelerate code word development for large datasets.

Method: Zero-shot prompting strategy using a state-of-the-art Large Vision Language Model (LVLM). Pipeline converts raw DICOM files to bone-windowed PNGs, submits them to LVLM with carefully engineered prompts, and receives structured JSON outputs that are extracted and formatted for validation.

Result: On 100 expert-reviewed images: 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with confidence flags for ambiguous cases. Demonstrates LVLMs can substantially accelerate code word development for paleoradiology datasets.

Conclusion: LVLMs can effectively automate identification tasks in heterogeneous medical imaging datasets, enabling efficient content navigation and accelerating anthropological workflows despite challenging conditions like disarticulated bones and variable positioning.

Abstract: Paleoradiology, the use of modern imaging technologies to study archaeological and anthropological remains, offers new windows on millennial scale patterns of human health. Unfortunately, the radiographs collected during field campaigns are heterogeneous: bones are disarticulated, positioning is ad hoc, and laterality markers are often absent. Additionally, factors such as age at death, age of bone, sex, and imaging equipment introduce high variability. Thus, content navigation, such as identifying a subset of images with a specific projection view, can be time consuming and difficult, making efficient triaging a bottleneck for expert analysis. We report a zero shot prompting strategy that leverages a state of the art Large Vision Language Model (LVLM) to automatically identify the main bone, projection view, and laterality in such images. Our pipeline converts raw DICOM files to bone windowed PNGs, submits them to the LVLM with a carefully engineered prompt, and receives structured JSON outputs, which are extracted and formatted onto a spreadsheet in preparation for validation. On a random sample of 100 images reviewed by an expert board certified paleoradiologist, the system achieved 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with low or medium confidence flags for ambiguous cases. These results suggest that LVLMs can substantially accelerate code word development for large paleoradiology datasets, allowing for efficient content navigation in future anthropology workflows.

[237] Test-Time Conditioning with Representation-Aligned Visual Features

Nicolas Sereyjol-Garros, Ellington Kirby, Victor Letzelter, Victor Besnier, Nermin Samet

Main category: cs.CV

TL;DR: REPA-G is a framework that uses aligned representations from self-supervised models to guide diffusion model inference, enabling precise multi-scale conditioning without retraining.

Details

Motivation: While representation alignment improves diffusion model training, its potential for inference-time conditioning remains unexplored. Current methods rely on ambiguous text prompts or coarse class labels, lacking precise control over generation.

Method: REPA-G leverages aligned representations from pre-trained feature extractors to guide denoising. It optimizes a similarity objective (potential) at inference time to steer generation toward conditioned representations, enabling multi-scale control from fine textures to global semantics.

Result: Quantitative results on ImageNet and COCO show high-quality, diverse generations. The method enables versatile control including texture matching, semantic guidance, and multi-concept composition.

Conclusion: REPA-G provides flexible, precise inference-time conditioning for diffusion models using aligned representations, offering an alternative to text prompts and class labels with theoretical justification.

Abstract: While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.

[238] RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images

Mishal Fatima, Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Michael Moeller, Margret Keuper

Main category: cs.CV

TL;DR: RAWDet-7: A large-scale RAW image dataset with object detection annotations and descriptions for studying machine vision beyond human-optimized ISP processing.

Details

Motivation: Most vision models use RGB images processed through ISP pipelines optimized for human perception, which discards sensor-level information that could be useful for machine reasoning. RAW images preserve unprocessed scene data that could enable better object detection and description.

Method: Introduces RAWDet-7, a dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments. Images are densely annotated for seven object categories following MS-COCO and LVIS conventions, with object-level descriptions derived from corresponding high-resolution sRGB images.

Result: Provides a benchmark dataset for evaluating object detection and description under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints. Enables study of detection performance, description quality & detail, and generalization in low-bit RAW image processing.

Conclusion: RAWDet-7 supports research on leveraging RAW image data for machine vision tasks, addressing limitations of human-optimized ISP processing and enabling exploration of sensor-level information preservation under quantization constraints.

Abstract: Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality & detail, and generalization in low-bit RAW image processing. Dataset & code upon acceptance.

[239] FOVI: A biologically-inspired foveated interface for deep vision models

Nicholas M. Blauch, George A. Alvarez, Talia Konkle

Main category: cs.CV

TL;DR: Foveated vision interface (FOVI) mimics human vision’s variable resolution to process high-resolution images efficiently using kNN-convolution and foveated adaptations of vision models.

Details

Motivation: Human vision uses foveated sensing with variable resolution for efficient active sensing, while computer vision typically uses uniform resolution, creating computational challenges for high-resolution images.

Method: Proposes FOVI that reformats retina-like variable-resolution sensor array into uniform V1-like manifold, uses k-nearest-neighborhood receptive fields with novel kernel mapping for kNN-convolution, and adapts DINOv3 ViT with LoRA.

Result: Models achieve competitive performance at fraction of computational cost compared to non-foveated baselines, enabling efficient high-resolution egocentric vision.

Conclusion: FOVI opens pathways for efficient and scalable active sensing for high-resolution vision tasks by mimicking biological foveation principles.

Abstract: Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at https://github.com/nblauch/fovi and https://huggingface.co/fovi-pytorch.

[240] QVLA: Not All Channels Are Equal in Vision-Language-Action Model’s Quantization

Yuhao Xu, Yantai Yang, Zhenyang Fan, Yufan Liu, Yuming Li, Bing Li, Zhipeng Zhang

Main category: cs.CV

TL;DR: QVLA is an action-centric quantization framework for Vision-Language-Action models that uses channel-wise bit allocation based on action-space sensitivity, achieving significant compression while maintaining performance for embodied control tasks.

Details

Motivation: VLA models have huge computational demands that hinder deployment on resource-constrained robotic platforms. Existing quantization methods from LLMs prioritize data fidelity but ignore how minor action deviations can cause catastrophic task failures in robotics.

Method: QVLA introduces a granular channel-wise bit allocation strategy that measures action-space sensitivity when quantizing each channel to various bit-widths. This creates per-channel importance metrics guiding global optimization that unifies quantization and pruning into a single framework.

Result: On LIBERO benchmark, QVLA reduces OpenVLA-OFT to 29.2% of original VRAM while maintaining 98.9% performance and achieving 1.49x speedup, outperforming LLM-derived SmoothQuant by 22.6% performance improvement.

Conclusion: QVLA establishes a new principled foundation for compressing VLA models in robotics, enabling deployment of powerful large-scale models on real-world hardware through action-centric quantization.

Abstract: The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model’s quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. In the LIBERO, the quantization version of OpenVLA-OFT with our method requires only 29.2% of the original model’s VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.

[241] From Pre- to Intra-operative MRI: Predicting Brain Shift in Temporal Lobe Resection for Epilepsy Surgery

Jingjing Peng, Giorgio Fiore, Yang Liu, Ksenia Ellum, Debayan Daspupta, Keyoumars Ashkan, Andrew McEvoy, Anna Miserocchi, Sebastien Ourselin, John Duncan, Alejandro Granados

Main category: cs.CV

TL;DR: NeuralShift: A U-Net-based model that predicts brain shift from preoperative MRI for temporal lobe resection surgery, achieving accurate deformation prediction without intraoperative imaging.

Details

Motivation: Brain shift during neurosurgery invalidates preoperative MRI guidance, creating a critical need for accurate intraoperative brain deformation prediction to enhance surgical precision and patient outcomes.

Method: U-Net architecture trained to predict brain shift entirely from preoperative MRI, evaluated using Target Registration Errors (TREs) on anatomical landmarks and DICE scores comparing predicted vs. actual intraoperative masks.

Result: Achieved DICE score of 0.97 for global brain deformation prediction and landmark TRE as low as 1.12 mm, effectively compensating for large brain shifts during temporal lobe removal.

Conclusion: The model successfully predicts brain deformation during temporal lobe resection using only preoperative images, potentially increasing surgical safety and efficiency while improving patient outcomes.

Abstract: Introduction: In neurosurgery, image-guided Neurosurgery Systems (IGNS) highly rely on preoperative brain magnetic resonance images (MRI) to assist surgeons in locating surgical targets and determining surgical paths. However, brain shift invalidates the preoperative MRI after dural opening. Updated intraoperative brain MRI with brain shift compensation is crucial for enhancing the precision of neuronavigation systems and ensuring the optimal outcome of surgical interventions. Methodology: We propose NeuralShift, a U-Net-based model that predicts brain shift entirely from pre-operative MRI for patients undergoing temporal lobe resection. We evaluated our results using Target Registration Errors (TREs) computed on anatomical landmarks located on the resection side and along the midline, and DICE scores comparing predicted intraoperative masks with masks derived from intraoperative MRI. Results: Our experimental results show that our model can predict the global deformation of the brain (DICE of 0.97) with accurate local displacements (achieve landmark TRE as low as 1.12 mm), compensating for large brain shifts during temporal lobe removal neurosurgery. Conclusion: Our proposed model is capable of predicting the global deformation of the brain during temporal lobe resection using only preoperative images, providing potential opportunities to the surgical team to increase safety and efficiency of neurosurgery and better outcomes to patients. Our contributions will be publicly available after acceptance in https://github.com/SurgicalDataScienceKCL/NeuralShift.

[242] 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Zhixue Fang, Xu He, Songlin Tang, Haoxian Zhang, Qingfeng Li, Xiaoqiang Liu, Pengfei Wan, Kun Gai

Main category: cs.CV

TL;DR: 3DiMo: A novel 3D-aware motion control method for video generation that uses implicit, view-agnostic motion tokens instead of explicit 3D models, enabling flexible camera control while maintaining motion fidelity.

Details

Motivation: Existing motion control methods either use 2D poses (which bind motion to specific viewpoints) or explicit 3D models (which have inherent inaccuracies and override the generator's intrinsic 3D awareness). The authors propose a more natural approach that aligns with the generator's spatial priors.

Method: Jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens injected via cross-attention. Uses view-rich supervision (single-view, multi-view, moving-camera videos) for 3D awareness, and auxiliary geometric supervision with SMPL that anneals to zero.

Result: 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.

Conclusion: Implicit, view-agnostic motion representation better aligns with video generators’ spatial priors than explicit 3D constraints, enabling superior motion control with flexible viewpoint manipulation.

Abstract: Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator’s spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator’s priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.

[243] Progressive Checkerboards for Autoregressive Multiscale Image Generation

David Eigen

Main category: cs.CV

TL;DR: Progressive checkerboard ordering enables parallel sampling in multiscale autoregressive image generation with balanced conditioning between and within scales.

Details

Motivation: Address the challenge of efficiently sampling independent locations in parallel while modeling mutual dependencies with serial conditioning in autoregressive image generation.

Method: Uses a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. The ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision.

Result: Achieves competitive performance compared to recent state-of-the-art autoregressive systems with similar model capacity on class-conditional ImageNet, using fewer sampling steps.

Conclusion: Progressive checkerboard ordering enables effective conditioning both between and within scales, and a wide range of scale-up factors lead to similar results when total serial steps are constant.

Abstract: A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.

[244] Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning

Dingkun Zhang, Shuhan Qi, Yulin Wu, Xinyu Xiao, Xuan Wang, Long Chen

Main category: cs.CV

TL;DR: DualSpeed: A fast-slow training framework for MLLMs that uses visual token pruning for efficiency while maintaining performance through dual-mode training and self-distillation.

Details

Motivation: MLLMs suffer from severe training inefficiency due to massive model sizes and visual token numbers. Existing work focuses on reducing model sizes, but visual token pruning (VTP) creates training-inference mismatch when applied during training.

Method: Proposes DualSpeed with two modes: fast-mode (primary) uses VTP to reduce visual tokens with mode isolator; slow-mode (auxiliary) trains on full visual sequences for consistency, enhanced by self-distillation from fast-mode.

Result: Accelerates training of LLaVA-1.5 by 2.1× and LLaVA-NeXT by 4.0× while retaining over 99% performance, achieving both training efficiency and non-degraded performance.

Conclusion: DualSpeed effectively addresses training inefficiency in MLLMs through a dual-mode framework that combines visual token pruning with training-inference consistency, offering significant speedups without performance degradation.

Abstract: Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency issue, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we are exploring another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model’s behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, it further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed can achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$, retaining over 99% performance. Code: https://github.com/dingkun-zhang/DualSpeed

[245] Continuous Control of Editing Models via Adaptive-Origin Guidance

Alon Wolf, Chen Katzir, Kfir Aberman, Or Patashnik

Main category: cs.CV

TL;DR: AdaOr introduces adaptive-origin guidance for smooth text-guided image/video editing by adjusting the guidance origin with identity-conditioned predictions to enable continuous control over edit intensity.

Details

Motivation: Existing diffusion-based editing models lack smooth intensity control for text-guided edits. While CFG affects prompt adherence, scaling CFG doesn't produce smooth transitions between input and edited results due to unconditional prediction dominating at low scales.

Method: Proposes Adaptive-Origin Guidance (AdaOr) that adjusts the standard guidance origin with an identity-conditioned adaptive origin using identity instructions. Interpolates identity prediction with unconditional prediction based on edit strength to ensure continuous transitions.

Result: Method evaluated on image and video editing tasks, demonstrating smoother and more consistent control compared to current slider-based approaches. Enables fine-grained control at inference without per-edit procedures or specialized datasets.

Conclusion: AdaOr provides effective continuous control for text-guided image and video editing by addressing the limitations of CFG scaling through adaptive guidance origins, enabling smooth intensity adjustments.

Abstract: Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.

[246] EventNeuS: 3D Mesh Reconstruction from a Single Event Camera

Shreyas Sachan, Viktor Rudnev, Mohamed Elgharib, Christian Theobalt, Vladislav Golyanik

Main category: cs.CV

TL;DR: EventNeuS: Self-supervised neural model for 3D mesh reconstruction from monocular color event streams using signed distance functions and density fields with event-based supervision.

Details

Motivation: Event cameras offer advantages over RGB cameras in many scenarios, but existing event-based 3D reconstruction techniques have limited accuracy. There's a need for better methods to learn 3D representations from monocular color event streams.

Method: Combines 3D signed distance function (SDF) and density field learning with event-based supervision. Introduces spherical harmonics encodings to handle view-dependent effects. Uses self-supervised neural model for learning 3D representations from monocular color event streams.

Result: Outperforms existing approaches by significant margin: 34% lower Chamfer distance and 31% lower mean absolute error on average compared to best previous method.

Conclusion: EventNeuS successfully addresses limitations in event-based 3D reconstruction, achieving state-of-the-art performance through novel combination of SDF/density field learning with event-based supervision and spherical harmonics encodings.

Abstract: Event cameras offer a considerable alternative to RGB cameras in many scenarios. While there are recent works on event-based novel-view synthesis, dense 3D mesh reconstruction remains scarcely explored and existing event-based techniques are severely limited in their 3D reconstruction accuracy. To address this limitation, we present EventNeuS, a self-supervised neural model for learning 3D representations from monocular colour event streams. Our approach, for the first time, combines 3D signed distance function and density field learning with event-based supervision. Furthermore, we introduce spherical harmonics encodings into our model for enhanced handling of view-dependent effects. EventNeuS outperforms existing approaches by a significant margin, achieving 34% lower Chamfer distance and 31% lower mean absolute error on average compared to the best previous method.

[247] Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models

Mohamad Al Mdfaa, Raghad Salameh, Geesara Kulathunga, Sergey Zagoruyko, Gonzalo Ferrer

Main category: cs.CV

TL;DR: UPPM introduces a panoptic Dynamic Descriptor system that reconciles open-vocabulary labels with unified category structure and geometric priors for persistent, promptable panoptic mapping without additional training.

Details

Motivation: Open-vocabulary models produce closely related labels that split panoptic entities and degrade volumetric consistency in scene understanding, creating a need for better integration of semantic and geometric information.

Method: Leverages foundation models to create panoptic Dynamic Descriptors that unify open-vocabulary labels with category structure and geometric size priors, using language-guided open-vocabulary panoptic segmentation and semantic retrieval within a multi-resolution multi-TSDF map.

Result: UPPM achieves best overall performance in map reconstruction accuracy and panoptic segmentation quality, preserving open-vocabulary interpretability while delivering strong geometric and panoptic accuracy.

Conclusion: UPPM advances open-world scene understanding by creating persistent, promptable panoptic maps that maintain open-vocabulary interpretability while improving geometric consistency and segmentation quality without requiring additional model training.

Abstract: Panoptic maps enable robots to reason about both geometry and semantics. However, open-vocabulary models repeatedly produce closely related labels that split panoptic entities and degrade volumetric consistency. The proposed UPPM advances open-world scene understanding by leveraging foundation models to introduce a panoptic Dynamic Descriptor that reconciles open-vocabulary labels with unified category structure and geometric size priors. The fusion for such dynamic descriptors is performed within a multi-resolution multi-TSDF map using language-guided open-vocabulary panoptic segmentation and semantic retrieval, resulting in a persistent and promptable panoptic map without additional model training. Based on our evaluation experiments, UPPM shows the best overall performance in terms of the map reconstruction accuracy and the panoptic segmentation quality. The ablation study investigates the contribution for each component of UPPM (custom NMS, blurry-frame filtering, and unified semantics) to the overall system performance. Consequently, UPPM preserves open-vocabulary interpretability while delivering strong geometric and panoptic accuracy.

[248] HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition

Honghui Chen, Yuhang Qiu, Jiabao Wang, Pingping Chen, Nam Ling

Main category: cs.CV

TL;DR: HAAP improves scene text recognition by using adaptive attention masks and hierarchical attention to better integrate visual and contextual information without iterative refinement.

Details

Motivation: Current STR methods struggle with unreadable text, and existing permutation language modeling approaches suffer from training instability due to random permutations and computational overhead from iterative refinement.

Method: Proposes HAAP with two key components: 1) Implicit Permutation Neurons (IPN) that generate adaptive attention masks to dynamically capture token dependencies, and 2) Cross-modal Hierarchical Attention (CHA) that captures dependencies among position queries, contextual semantics, and visual information.

Result: HAAP achieves state-of-the-art performance in accuracy, complexity, and latency on several benchmark datasets for scene text recognition.

Conclusion: The proposed HAAP framework effectively addresses limitations of existing PLM approaches by enabling adaptive attention and hierarchical cross-modal interaction, leading to improved STR performance without iterative refinement overhead.

Abstract: Scene Text Recognition (STR) is challenging in extracting effective character representations from visual data when text is unreadable. Permutation language modeling (PLM) is introduced to refine character predictions by jointly capturing contextual and visual information. However, in PLM, the use of random permutations causes training fit oscillation, and the iterative refinement (IR) operation also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance position-context-image interaction capability, improving autoregressive LM generalization. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies, enhancing the correlation between visual information and context. Adaptive correlation representation helps the model avoid training fit oscillation. Second, the Cross-modal Hierarchical Attention mechanism (CHA) is introduced to capture the dependencies among position queries, contextual semantics and visual information. CHA enables position tokens to aggregate global semantic information, avoiding the need for IR. Extensive experimental results show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.

[249] Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Aleksandr Gordeev, Vladimir Dokholyan, Irina Tolstykh, Maksim Kuprashevich

Main category: cs.CV

TL;DR: A novel architecture for video moment retrieval and highlight detection using foundational video models with Saliency-Guided Cross Attention and hybrid DETR, achieving SOTA results on multiple benchmarks.

Details

Motivation: Existing approaches for video moment retrieval and highlight detection suffer from inefficient text-video feature alignment, leading to poor performance and limited practical usage.

Method: Proposes an architecture leveraging foundational video models for text-video alignment, combined with Saliency-Guided Cross Attention mechanism and hybrid DETR architecture. Also introduces InterVid-MR, a large-scale high-quality dataset for pretraining.

Result: Achieves state-of-the-art results on QVHighlights, Charades-STA and TACoS benchmarks. Provides efficient and scalable solution for both zero-shot and fine-tuning scenarios.

Conclusion: The proposed approach significantly enhances performance in video moment retrieval and highlight detection tasks through better text-video alignment and large-scale pretraining.

Abstract: Existing approaches for video moment retrieval and highlight detection are not able to align text and video features efficiently, resulting in unsatisfying performance and limited production usage. To address this, we propose a novel architecture that utilizes recent foundational video models designed for such alignment. Combined with the introduced Saliency-Guided Cross Attention mechanism and a hybrid DETR architecture, our approach significantly enhances performance in both moment retrieval and highlight detection tasks. For even better improvement, we developed InterVid-MR, a large-scale and high-quality dataset for pretraining. Using it, our architecture achieves state-of-the-art results on the QVHighlights, Charades-STA and TACoS benchmarks. The proposed approach provides an efficient and scalable solution for both zero-shot and fine-tuning scenarios in video-language tasks.

[250] Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models

Yi Ding, Lijun Li, Bing Cao, Jing Shao

Main category: cs.CV

TL;DR: Proposes Multi-Image Safety (MIS) dataset with safety Chain-of-Thought labels to enhance visual reasoning in safety-critical contexts for Vision-Language Models, improving both safety performance and general capabilities.

Details

Motivation: Existing safety fine-tuning methods for VLMs lack safety visual reasoning ability, creating a safety reasoning gap that limits deployment in safety-critical domains. Current approaches either focus on textual/multimodal content but fail on challenging cases or disrupt the balance between helpfulness and harmlessness.

Method: Introduces Multi-Image Safety (MIS) dataset with multi-image inputs and safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic. Fine-tunes InternVL2.5-8B model using this dataset to enhance both visual perception and reasoning in safety-critical contexts.

Result: Fine-tuning with MIS significantly outperforms both open-source and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. Increases average accuracy by 0.83% across five general benchmarks and reduces Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.

Conclusion: The proposed approach effectively addresses the safety reasoning gap in VLMs by providing fine-grained reasoning logic through multi-image safety scenarios, delivering exceptional safety performance while preserving general capabilities without trade-offs.

Abstract: Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image safety scenarios, consisting of training and test splits. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks requiring safety-related visual reasoning. This approach not only delivers exceptional safety performance but also preserves general capabilities without any trade-offs. Specifically, fine-tuning with MIS increases average accuracy by 0.83% across five general benchmarks and reduces the Attack Success Rate (ASR) on multiple safety benchmarks by a large margin.

[251] OptiPMB: Enhancing 3D Multi-Object Tracking with Optimized Poisson Multi-Bernoulli Filtering

Guanhua Ding, Yuxuan Xia, Runwei Guan, Qinchen Wu, Tao Huang, Weiping Ding, Jinping Sun, Guoqiang Mao

Main category: cs.CV

TL;DR: OptiPMB: A novel random finite set-based 3D multi-object tracking method using optimized Poisson multi-Bernoulli filter with adaptive birth models and detection probabilities for autonomous driving.

Details

Motivation: While deep learning-based 3D MOT solutions perform well, model-based approaches remain valuable for their simplicity, interpretability, and data efficiency. Conventional model-based trackers using random vector-based Bayesian filters have limitations due to heuristic data association and track management schemes. Random finite set (RFS)-based Bayesian filtering offers theoretically sound handling of object birth, survival, and death, facilitating interpretability and parameter tuning.

Method: OptiPMB employs an optimized Poisson multi-Bernoulli (PMB) filter within the tracking-by-detection framework. Key innovations include: 1) measurement-driven hybrid adaptive birth model for improved track initialization, 2) adaptive detection probability parameters to maintain tracks for occluded objects, and 3) optimized density pruning and track extraction modules to enhance overall tracking performance.

Result: Extensive evaluations on nuScenes and KITTI datasets show that OptiPMB achieves superior tracking accuracy compared with state-of-the-art methods, establishing a new benchmark for model-based 3D MOT.

Conclusion: OptiPMB demonstrates the effectiveness of RFS-based approaches for 3D MOT in autonomous driving, offering valuable insights for future research on RFS-based trackers and providing a strong model-based alternative to deep learning methods.

Abstract: Accurate 3D multi-object tracking (MOT) is crucial for autonomous driving, as it enables robust perception, navigation, and planning in complex environments. While deep learning-based solutions have demonstrated impressive 3D MOT performance, model-based approaches remain appealing for their simplicity, interpretability, and data efficiency. Conventional model-based trackers typically rely on random vector-based Bayesian filters within the tracking-by-detection (TBD) framework but face limitations due to heuristic data association and track management schemes. In contrast, random finite set (RFS)-based Bayesian filtering handles object birth, survival, and death in a theoretically sound manner, facilitating interpretability and parameter tuning. In this paper, we present OptiPMB, a novel RFS-based 3D MOT method that employs an optimized Poisson multi-Bernoulli (PMB) filter while incorporating several key innovative designs within the TBD framework. Specifically, we propose a measurement-driven hybrid adaptive birth model for improved track initialization, employ adaptive detection probability parameters to effectively maintain tracks for occluded objects, and optimize density pruning and track extraction modules to further enhance overall tracking performance. Extensive evaluations on nuScenes and KITTI datasets show that OptiPMB achieves superior tracking accuracy compared with state-of-the-art methods, thereby establishing a new benchmark for model-based 3D MOT and offering valuable insights for future research on RFS-based trackers in autonomous driving.

[252] FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour

Main category: cs.CV

TL;DR: FedVSR is the first federated learning framework specifically designed for video super-resolution, addressing privacy concerns while improving perceptual quality through DWT-based loss functions and loss-aware aggregation.

Details

Motivation: Deep learning for video super-resolution typically requires centralized data, raising privacy concerns. Federated learning offers privacy-friendly solutions but struggles with low-level vision tasks, producing blurry outputs. There's a need for FL frameworks specifically designed for VSR tasks.

Method: FedVSR is a model-agnostic, stateless FL framework that introduces: 1) A lightweight loss function based on Discrete Wavelet Transform (DWT) to preserve high-frequency details during local training, and 2) A loss-aware aggregation strategy combining DWT-based and task-specific losses to guide global updates effectively.

Result: Extensive experiments across multiple VSR models and datasets show FedVSR improves perceptual video quality (up to +0.89 dB PSNR, +0.0370 SSIM, -0.0347 LPIPS and 4.98 VMAF) with close to zero computation and communication overhead compared to rivals.

Conclusion: FedVSR bridges the gap between privacy, efficiency, and perceptual quality, setting a new benchmark for federated learning in low-level vision tasks.

Abstract: Video super-resolution (VSR) aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning (FL) offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the Discrete Wavelet Transform (DWT) to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets show that FedVSR not only improves perceptual video quality (up to +0.89 dB PSNR, +0.0370 SSIM, -0.0347 LPIPS and 4.98 VMAF) but also achieves these gains with close to zero computation and communication overhead compared to its rivals. These results demonstrate FedVSR’s potential to bridge the gap between privacy, efficiency, and perceptual quality, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR

[253] V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction

Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Xikun Bao, Lin Chen, Zehui Chen, Qing Miao, Chenxi Liu, Jie Zhao, Feng Zhao

Main category: cs.CV

TL;DR: V2P-Bench: A benchmark for evaluating Large Vision-Language Models’ ability to understand video visual prompts in human-model interaction scenarios, featuring 980 videos and 1172 QA pairs with visual prompts.

Details

Motivation: Existing video benchmarks rely heavily on text prompts, which require complex referential language and reduce accuracy/efficiency in human-model interaction. There's a need for benchmarks that evaluate models' understanding of visual prompts in interactive scenarios.

Method: Created V2P-Bench with 980 videos and 1172 high-quality QA pairs, each with manually annotated visual prompt frames. Covers three main tasks and twelve categories for fine-grained evaluation. Analyzed current LVLMs on this benchmark.

Result: Key findings: 1) Visual prompts are more model/user-friendly than text prompts, improving performance and user experience; 2) Models have reasonable zero-shot visual prompt understanding but struggle with spatiotemporal reasoning (o1 achieves 71.8% vs human 88.3%); 3) LVLMs exhibit “Hack Phenomena” in video QA that inflates scores as video length increases and frame sampling decreases.

Conclusion: V2P-Bench addresses limitations of text-prompt-based evaluation and provides a foundation for advancing human-model interaction and improving video understanding evaluation in LVLMs.

Abstract: Large Vision-Language Models (LVLMs) have made significant strides in the field of video understanding in recent times. Nevertheless, existing video benchmarks predominantly rely on text prompts for evaluation, which often require complex referential language and diminish both the accuracy and efficiency of human model interaction in turn. To address this limitation, we propose V2P-Bench, a robust and comprehensive benchmark for evaluating the ability of LVLMs to understand Video Visual Prompts in human model interaction scenarios. V2P-Bench consists of 980 videos and 1172 well-structured high-quality QA pairs, each paired with manually annotated visual prompt frames. The benchmark spans three main tasks and twelve categories, thereby enabling fine-grained, instance-level evaluation. Through an in-depth analysis of current LVLMs, we identify several key findings: 1) Visual prompts are both more model-friendly and user-friendly in interactive scenarios than text prompts, leading to significantly improved model performance and enhanced user experience. 2) Models are reasonably capable of zero-shot understanding of visual prompts, but struggle with spatiotemporal understanding. Even o1 achieves only 71.8%, far below the human expert score of 88.3%, while most open-source models perform below 60%. 3) LVLMs exhibit pervasive Hack Phenomena in video question answering tasks, which become more pronounced as video length increases and frame sampling density decreases, thereby inflating performance scores artificially. We anticipate that V2P-Bench will not only shed light on these challenges but also serve as a foundational tool for advancing human model interaction and improving the evaluation of video understanding.

[254] Patronus: Interpretable Diffusion Models with Prototypes

Nina Weng, Aasa Feragen, Siavash Bigdeli

Main category: cs.CV

TL;DR: Patronus is an interpretable diffusion model that uses prototypical networks to reveal what visual patterns are learned, where they emerge, and when they appear during the denoising process, enabling detection of shortcut learning and semantic emergence analysis.

Details

Motivation: Diffusion models have become widely used but remain largely black boxes, making it difficult to understand their internal generative processes. There's an urgent need to uncover their opacity to enable better understanding, debugging, and steering of these models.

Method: Patronus incorporates a prototypical network into the diffusion framework to encode semantics in visual patches. This allows the model to reveal what visual patterns are being modeled, where they emerge spatially in images, and when they appear across different timesteps of the denoising process.

Result: The model was evaluated on four natural image datasets and one medical imaging dataset, demonstrating both faithful interpretability and strong generative performance. It successfully enabled detection of shortcut learning via unwanted correlations and tracing of semantic emergence across timesteps.

Conclusion: Patronus opens new avenues for understanding and steering diffusion models through prototype-based interpretability, providing deeper insights into the generative mechanism of diffusion models.

Abstract: Uncovering the opacity of diffusion-based generative models is urgently needed, as their applications continue to expand while their underlying procedures largely remain a black box. With a critical question – how can the diffusion generation process be interpreted and understood? – we proposed Patronus, an interpretable diffusion model that incorporates a prototypical network to encode semantics in visual patches, revealing what visual patterns are modeled and where and when they emerge throughout denoising. This interpretability of Patronus provides deeper insights into the generative mechanism, enabling the detection of shortcut learning via unwanted correlations and the tracing of semantic emergence across timesteps. We evaluate Patronus on four natural image datasets and one medical imaging dataset, demonstrating both faithful interpretability and strong generative performance. With this work, we open new avenues for understanding and steering diffusion models through prototype-based interpretability.\ Our code is available at https://github.com/nina-weng/patronus}{https://github.com/nina-weng/patronus.

[255] MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning

Suhao Yu, Haojin Wang, Juncheng Wu, Luyang Luo, Jingshen Wang, Cihang Xie, Pranav Rajpurkar, Carl Yang, Yang Yang, Kang Wang, Yannan Yu, Yuyin Zhou

Main category: cs.CV

TL;DR: MedFrameQA is a new benchmark for multi-image medical visual question answering that tests models’ ability to reason across sequences of medical images, revealing current MLLMs’ severe limitations in comparative medical image analysis.

Details

Motivation: Current medical VQA benchmarks are limited to single-image interpretation, but real clinical practice requires multi-image comparative reasoning across diagnostic sequences. There's a need for benchmarks that test models' ability to handle complex, temporally grounded medical narratives.

Method: Developed a scalable pipeline using narrative transcripts from medical education videos to align visual frames with textual concepts. Automatically produced 2,851 high-quality multi-image VQA pairs with explicit, transcript-grounded reasoning chains.

Result: Evaluation of 11 advanced MLLMs (including reasoning models) showed severe deficiencies: accuracies mostly below 50%, instability across varying image counts. Models treat images as isolated instances, failing to track pathological progression or cross-reference anatomical shifts.

Conclusion: MedFrameQA provides a rigorous standard for evaluating next-generation MLLMs in handling complex medical narratives. Current models lack essential multi-image synthesis capabilities needed for real clinical practice.

Abstract: Real-world clinical practice demands multi-image comparative reasoning, yet current medical benchmarks remain limited to single-frame interpretation. We present MedFrameQA, the first benchmark explicitly designed to test multi-image medical VQA through educationally-validated diagnostic sequences. To construct this dataset, we develop a scalable pipeline that leverages narrative transcripts from medical education videos to align visual frames with textual concepts, automatically producing 2,851 high-quality multi-image VQA pairs with explicit, transcript-grounded reasoning chains. Our evaluation of 11 advanced MLLMs (including reasoning models) exposes severe deficiencies in multi-image synthesis, where accuracies mostly fall below 50% and exhibit instability across varying image counts. Error analysis demonstrates that models often treat images as isolated instances, failing to track pathological progression or cross-reference anatomical shifts. MedFrameQA provides a rigorous standard for evaluating the next generation of MLLMs in handling complex, temporally grounded medical narratives.

[256] Seeing through Satellite Images at Street Views

Ming Qian, Bin Tan, Qiuyu Wang, Xianwei Zheng, Hanjiang Xiong, Gui-Song Xia, Yujun Shen, Nan Xue

Main category: cs.CV

TL;DR: Sat2Density++: A neural radiance field approach for synthesizing photorealistic street-view panoramas from satellite images and camera trajectories, addressing sparse-view challenges and large viewpoint changes.

Details

Motivation: The paper addresses the challenging task of synthesizing photorealistic street-view panorama images and videos from satellite imagery and specified camera positions/trajectories. This is difficult due to sparse-view nature and extreme viewpoint changes between satellite and street-view perspectives.

Method: Proposes Sat2Density++, which learns neural radiance fields from paired satellite and street-view images. Key innovation: modeling street-view specific elements (sky, illumination effects) that are only visible in street-view panoramas but not in satellite images, enabling more realistic rendering.

Result: Tested on both urban and suburban scene datasets, demonstrating capability to render photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

Conclusion: Sat2Density++ successfully addresses the challenging SatStreet-view synthesis problem by incorporating street-view specific elements into neural radiance field modeling, enabling realistic panorama rendering from satellite imagery.

Abstract: This paper studies the task of SatStreet-view synthesis, which aims to render photorealistic street-view panorama images and videos given any satellite image and specified camera positions or trajectories. We formulate to learn neural radiance field from paired images captured from satellite and street viewpoints, which comes to be a challenging learning problem due to the sparse-view natural and the extremely-large viewpoint changes between satellite and street-view images. We tackle the challenges based on a task-specific observation that street-view specific elements, including the sky and illumination effects are only visible in street-view panoramas, and present a novel approach Sat2Density++ to accomplish the goal of photo-realistic street-view panoramas rendering by modeling these street-view specific in neural networks. In the experiments, our method is testified on both urban and suburban scene datasets, demonstrating that Sat2Density++ is capable of rendering photorealistic street-view panoramas that are consistent across multiple views and faithful to the satellite image.

Nikolas Papadopoulos, Nikolaos Ioannis Bountos, Maria Sdraka, Andreas Karavias, Gustau Camps-Valls, Ioannis Papoutsis

Main category: cs.CV

TL;DR: Thalia is an enhanced global dataset for volcanic deformation monitoring using InSAR data, featuring higher-resolution, multi-source, multi-temporal data with expert annotations and benchmark models.

Details

Motivation: To address the scarcity of well-curated datasets for deep learning applications in volcanic monitoring using InSAR data, which is crucial for global-scale deformation monitoring but challenging to interpret with traditional methods.

Method: Builds on existing Hephaestus dataset by creating Thalia - a global collection of 38 spatiotemporal datacubes covering 7 years with InSAR products, topographic data, atmospheric variables, and expert annotations including deformation type, intensity, and extent with descriptive text.

Result: Created a comprehensive benchmark dataset with state-of-the-art models for classification and segmentation, enabling fair evaluation and fostering collaboration between machine learning and Earth science.

Conclusion: Thalia advances volcanic monitoring capabilities and promotes data-driven approaches in geoscience by providing a rich, annotated dataset that addresses previous limitations in InSAR data interpretation for deep learning applications.

Abstract: Monitoring volcanic activity is of paramount importance to safeguarding lives, infrastructure, and ecosystems. However, only a small fraction of known volcanoes are continuously monitored. Satellite-based Interferometric Synthetic Aperture Radar (InSAR) enables systematic, global-scale deformation monitoring. However, its complex data challenge traditional remote sensing methods. Deep learning offers a powerful means to automate and enhance InSAR interpretation, advancing volcanology and geohazard assessment. Despite its promise, progress has been limited by the scarcity of well-curated datasets. In this work, we build on the existing Hephaestus dataset and introduce Thalia, addressing crucial limitations and enriching its scope with higher-resolution, multi-source, and multi-temporal data. Thalia is a global collection of 38 spatiotemporal datacubes covering 7 years and integrating InSAR products, topographic data, as well as atmospheric variables, known to introduce signal delays that can mimic ground deformation in InSAR imagery. Each sample includes expert annotations detailing the type, intensity, and extent of deformation, ac- companied by descriptive text. To enable fair and consistent evaluation, we provide a comprehensive benchmark using state-of-the-art models for classification and segmentation. This work fosters collaboration between machine learning and Earth science, advancing volcanic monitoring and promoting data-driven approaches in geoscience. The code and latest version of the dataset are available through the github repository: https://github.com/Orion-AI-Lab/Thalia

[258] CAD-SLAM: Consistency-Aware Dynamic SLAM with Dynamic-Static Decoupled Mapping

Wenhua Wu, Chenpeng Su, Siting Zhu, Tianchen Deng, Jianhao Jiao, Guangming Wang, Dimitrios Kanoulas, Zhe Liu, Hesheng Wang

Main category: cs.CV

TL;DR: CAD-SLAM: A dynamic SLAM framework that detects moving objects by analyzing geometric/texture inconsistencies and uses temporal Gaussian models for online dynamic object modeling.

Details

Motivation: Existing NeRF and 3D Gaussian SLAM methods struggle in dynamic environments where moving objects violate static assumptions, degrading camera tracking and map reconstruction. Need robust dynamic object identification and online modeling.

Method: Consistency-aware dynamic detection analyzes geometric/texture discrepancies between historical map renderings and real observations. Uses bidirectional dynamic object tracking (backward/forward in time) for complete sequence recognition. Dynamic-static decoupled mapping with temporal Gaussian model for online incremental dynamic modeling.

Result: Experiments on multiple dynamic datasets show flexible and accurate dynamic segmentation capabilities with state-of-the-art performance in both localization and mapping.

Conclusion: CAD-SLAM effectively addresses dynamic environment challenges in SLAM through consistency-aware dynamic detection and decoupled mapping, enabling robust performance in scenes with moving objects.

Abstract: Recent advances in neural radiation fields (NeRF) and 3D Gaussian-based SLAM have achieved impressive localization accuracy and high-quality dense mapping in static scenes. However, these methods remain challenged in dynamic environments, where moving objects violate the static-world assumption and introduce inconsistent observations that degrade both camera tracking and map reconstruction. This motivates two fundamental problems: robustly identifying dynamic objects and modeling them online. To address these limitations, we propose CAD-SLAM, a Consistency-Aware Dynamic SLAM framework with dynamic-static decoupled mapping. Our key insight is that dynamic objects inherently violate cross-view and cross-time scene consistency. We detect object motion by analyzing geometric and texture discrepancies between historical map renderings and real-world observations. Once a moving object is identified, we perform bidirectional dynamic object tracking (both backward and forward in time) to achieve complete sequence-wise dynamic recognition. Our consistency-aware dynamic detection model achieves category-agnostic, instantaneous dynamic identification, which effectively mitigates motion-induced interference during localization and mapping. In addition, we introduce a dynamic-static decoupled mapping strategy that employs a temporal Gaussian model for online incremental dynamic modeling. Experiments conducted on multiple dynamic datasets demonstrate the flexible and accurate dynamic segmentation capabilities of our method, along with the state-of-the-art performance in both localization and mapping.

[259] Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang

Main category: cs.CV

TL;DR: Ground-R1 addresses scale bias in vision-language models by introducing Scale Relative Policy Optimization to ensure balanced learning from visual evidence of all sizes.

Details

Motivation: Current vision-language models suffer from unreliable predictions due to insufficient grounding in visual evidence. Existing thinking-with-images methods have systematic scale-driven bias where training rewards are dominated by large visual regions, suppressing learning from small but critical evidence.

Method: Proposes Ground-R1 framework with Scale Relative Policy Optimization (SRPO) that replaces standard GRPO. SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons for balanced credit assignment.

Result: Experimental results on general LVLM, high-resolution, and visual grounding benchmarks show Ground-R1’s effectiveness. SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.

Conclusion: Ground-R1 with SRPO addresses scale bias in vision-language models, improving reliability and interpretability through better visual grounding across all region sizes.

Abstract: Large Vision-Language Models (LVLMs) have become powerful general-purpose assistants, yet their predictions often lack reliability and interpretability due to insufficient grounding in visual evidence. The emerging thinking-with-images paradigm seeks to address this issue by explicitly anchoring reasoning to image regions. However, we empirically find that most existing methods suffer from a systematic scale-driven bias in optimization, where training rewards are dominated by large visual regions, suppressing learning from small but semantically critical evidence and leading to spurious grounding at inference time. To address this limitation, we propose Ground-R1, a de-biased thinking-with-images framework trained via a novel Scale Relative Policy Optimization (SRPO) objective that replaces standard GRPO. Specifically, our SRPO recalibrates reward learning across evidence regions of different sizes through scale-aware binning and intra-/inter-bin comparisons, enabling balanced credit assignment during training. Experimental results on general LVLM, high-resolution, and visual grounding benchmarks validate the effectiveness of Ground-R1 and show that SRPO yields consistent gains over standard GRPO in both response accuracy and evidence grounding.

[260] SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

Main category: cs.CV

TL;DR: SurgVidLM is a video language model for surgical scene understanding that addresses both full and fine-grained video comprehension in robot-assisted surgery, featuring a two-stage StageFocus mechanism and Multi-frequency Fusion Attention.

Details

Motivation: Existing MLLMs for surgical scene understanding focus on image-based analysis or global video understanding, missing fine-grained video reasoning needed for analyzing specific surgical processes and detailed task execution. There's a gap in models that can handle both holistic understanding and detailed analysis of surgical procedures.

Method: Proposes SurgVidLM, a video language model trained on SVU-31K dataset (31K video-instruction pairs). Uses two-stage StageFocus mechanism: first stage extracts global procedural context, second stage performs high-frequency local analysis guided by temporal cues. Implements Multi-frequency Fusion Attention to integrate low- and high-frequency visual tokens while preserving task-specific details.

Result: SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, demonstrating superior capability in capturing context of complex robot-assisted surgeries.

Conclusion: SurgVidLM successfully bridges the gap in fine-grained surgical video comprehension, offering a comprehensive solution for surgical scene understanding that combines global context with detailed local analysis, with potential applications in surgical training and robotic decision-making.

Abstract: Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons to understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K that is a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.

[261] Lightweight RGB-T Tracking with Mobile Vision Transformers

Mahdi Falaki, Maria A. Amer

Main category: cs.CV

TL;DR: Lightweight RGB-T tracker using MobileViT with progressive fusion and separable mixed attention for real-time multimodal tracking on mobile platforms.

Details

Motivation: Single-modality RGB tracking struggles in challenging conditions like low illumination, weather, and occlusion. Existing Vision Transformer-based trackers are accurate but too large for real-time applications, especially on resource-constrained devices.

Method: Proposes a lightweight RGB-T (RGB-Thermal) tracker built on MobileViT architecture with progressive fusion framework. Uses separable mixed attention to model intra- and inter-modal interactions, creating compact and effective features for accurate localization.

Result: Achieves under 4M parameters with real-time performance of 25.7 FPS on CPU and 122 FPS on GPU. First MobileViT-based multimodal tracker, supporting embedded and mobile platforms.

Conclusion: The proposed tracker successfully addresses the need for lightweight, real-time multimodal tracking by combining MobileViT efficiency with progressive fusion and attention mechanisms for RGB-T data.

Abstract: Single-modality tracking (RGB-only) struggles under low illumination, weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on the CPU and 122 FPS on the GPU, supporting embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.

[262] Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang

Main category: cs.CV

TL;DR: Proteus-ID is a diffusion-based framework for video identity customization that maintains identity consistency while generating realistic motion from a single reference image and text prompt.

Details

Motivation: Video identity customization faces challenges in maintaining identity consistency while aligning with text descriptions and generating natural, fluid motion without unrealistic stiffness.

Method: Three key components: 1) Multimodal Identity Fusion (MIF) module using Q-Former to unify visual and textual cues, 2) Time-Aware Identity Injection (TAII) for dynamic conditioning across denoising steps, and 3) Adaptive Motion Learning (AML) using optical-flow-derived motion heatmaps to enhance motion realism.

Result: Outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Also introduces Proteus-Bench dataset with 200K training clips and 150 diverse individuals for evaluation.

Conclusion: Proteus-ID effectively addresses identity consistency and motion coherence challenges in video customization through multimodal fusion, adaptive conditioning, and motion-aware training.

Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at https://grenoble-zhang.github.io/Proteus-ID/.

[263] Geometry-aware 4D Video Generation for Robot Manipulation

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

Main category: cs.CV

TL;DR: A 4D video generation model that ensures multi-view 3D consistency for robotic applications by using cross-view pointmap alignment supervision during training.

Details

Motivation: Robots need to understand and predict physical world dynamics for effective planning and interaction. While video generation models show promise, generating videos that are both temporally coherent and geometrically consistent across different camera views remains challenging.

Method: Proposes a 4D video generation model that enforces multi-view 3D consistency by supervising the model with cross-view pointmap alignment during training. This geometric supervision helps the model learn a shared 3D scene representation, enabling generation of spatio-temporally aligned future video sequences from novel viewpoints given only single RGB-D images per view, without requiring camera poses as input.

Result: The method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets compared to existing baselines. The predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

Conclusion: The proposed 4D video generation model with geometric supervision enables generation of consistent multi-view videos for robotic applications, improving visual stability and spatial alignment while supporting downstream tasks like trajectory recovery and manipulation policy generalization.

Abstract: Understanding and predicting dynamics of the physical world can enhance a robot’s ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf

Main category: cs.CV

TL;DR: End-to-end RL training for image goal navigation can develop relative pose estimation capabilities, though simulator shortcuts affect results; architectural choices impact emergent navigation skills.

Details

Motivation: To investigate whether image goal navigation can be efficiently solved with end-to-end RL training alone, without dedicated image-matching or pre-trained vision modules, and whether relative pose estimation can emerge from navigation reward.

Method: Large experimental study examining architectural choices (late fusion, channel stacking, space-to-depth projections, cross-attention) and their impact on emergent relative pose estimation from navigation training. Analyzes simulator settings and their influence on shortcuts.

Result: Success of recent methods is influenced by simulator settings leading to shortcuts, but capabilities can be transferred to more realistic settings to some extent. Found correlations between navigation performance and emergent relative pose estimation performance.

Conclusion: End-to-end RL training can develop relative pose estimation from navigation alone, though simulator artifacts affect results; architectural choices play important role in emergent navigation skills.

Abstract: Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In this large experimental study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced up to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic setting, up to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub skill.

[265] No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

Main category: cs.CV

TL;DR: Training-free few-shot object segmentation using foundation model correspondences between reference and target images

Details

Motivation: While SAM reduces annotation costs, it still requires manual prompts or complex prompt-generation rules. This work aims to further reduce this burden by enabling object segmentation with only a small set of reference images instead of manual prompts.

Method: Multi-stage training-free approach: (1) memory bank construction from reference images, (2) representation aggregation, and (3) semantic-aware feature matching using foundation model correspondences between reference and target images.

Result: State-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50), and outperforms existing training-free approaches on Cross-Domain FSOD benchmark (22.4% nAP).

Conclusion: Foundation model semantic priors enable effective few-shot segmentation without training, reducing the prompt burden of SAM while maintaining strong performance across benchmarks.

Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).

[266] Affine-Equivariant Kernel Space Encoding for NeRF Editing

Mikołaj Zieliński, Krzysztof Byrski, Tomasz Szczepanik, Dominik Belter, Przemysław Spurek

Main category: cs.CV

TL;DR: EKS introduces affine-equivariant kernel space encoding for neural radiance fields, enabling localized, deformation-aware scene editing via anisotropic Gaussian kernels while maintaining high reconstruction quality.

Details

Motivation: Current neural scene representations have implicit, globally entangled latent spaces that make localized editing and physically grounded manipulation difficult. Existing approaches with explicit control structures or point-based representations often suffer from limited locality, sensitivity to deformations, or visual artifacts.

Method: Proposes Affine-Equivariant Kernel Space Encoding (EKS) that aggregates features through a field of anisotropic Gaussian kernels defining localized regions of influence. Includes feature distillation mechanism transferring information from multi-resolution hash grid encodings into the kernel field for compact, grid-free representation.

Result: Enables intuitive, localized scene editing directly via Gaussian kernels without retraining while maintaining high-quality rendering. Provides stable feature interpolation under spatial transformations while preserving continuity.

Conclusion: EKS offers a novel spatial encoding for neural radiance fields that bridges the gap between high-fidelity reconstruction and flexible, localized editability through kernel-based representations.

Abstract: Neural scene representations achieve high-fidelity rendering by encoding 3D scenes as continuous functions, but their latent spaces are typically implicit and globally entangled, making localized editing and physically grounded manipulation difficult. While several works introduce explicit control structures or point-based latent representations to improve editability, these approaches often suffer from limited locality, sensitivity to deformations, or visual artifacts. In this paper, we introduce Affine-Equivariant Kernel Space Encoding (EKS), a spatial encoding for neural radiance fields that provides localized, deformation-aware feature representations. Instead of querying latent features directly at discrete points or grid vertices, our encoding aggregates features through a field of anisotropic Gaussian kernels, each defining a localized region of influence. This kernel-based formulation enables stable feature interpolation under spatial transformations while preserving continuity and high reconstruction quality. To preserve detail without sacrificing editability, we further propose a training-time feature distillation mechanism that transfers information from multi-resolution hash grid encodings into the kernel field, yielding a compact and fully grid-free representation at inference. This enables intuitive, localized scene editing directly via Gaussian kernels without retraining, while maintaining high-quality rendering. The code can be found under (https://github.com/MikolajZielinski/eks)

[267] DSKC: Domain Style Modeling with Adaptive Knowledge Consolidation for Exemplar-free Lifelong Person Re-Identification

Shiben Liu, Mingyue Xu, Huijie Fan, Qiang Wang, Liangqiong Qu, Zhi Han

Main category: cs.CV

TL;DR: A rehearsal-free and distillation-free framework for Lifelong Person Re-identification that uses domain-style encoding and unified knowledge consolidation to mitigate forgetting when adapting to new domains.

Details

Motivation: Existing LReID methods lack domain-specific style awareness and unified knowledge consolidation, which are crucial for mitigating catastrophic forgetting when adapting to new information streams.

Method: Proposes DSKC framework with: 1) Domain-Style Encoder (DSE) to dynamically model domain-specific styles, and 2) Unified Knowledge Consolidation (UKC) mechanism to integrate instance-level representations with domain-specific styles into cross-domain unified representations.

Result: Outperforms state-of-the-art methods in two training orders and enhances model’s strong performance, demonstrating effectiveness in mitigating forgetting and improving generalization.

Conclusion: DSKC provides an effective rehearsal-free and distillation-free solution for lifelong person re-identification by explicitly modeling inter-domain associations at both instance and domain levels.

Abstract: Lifelong Person Re-identification (LReID) aims to continuously match individuals across camera views from sequential data streams. Existing LReID methods often ignore domain-specific style awareness and unified knowledge consolidation, which are crucial for mitigating forgetting when adapting to new information. We propose DSKC, a novel rehearsal-free and distillation-free framework for LReID. DSKC designs a domain-style encoder (DSE) to dynamically model domain-specific styles, and a unified knowledge consolidation (UKC) mechanism to adaptively integrate instance-level representations with domain-specific style into a cross-domain unified representation. By leveraging unified representation as a bridge, DSKC explicitly models inter-domain associations at both instance and domain levels to enhance anti-forgetting and generalization. Experimental results demonstrate that our DSKC outperforms state-of-the-art methods in two training orders and enhances the model’s strong performance. Our code is available at https://github.com/LiuShiBen/DKUA.

[268] UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval

Hongyu Guo, Xiangzhao Hao, Jiarui Guo, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua

Main category: cs.CV

TL;DR: UniFGVC is a training-free framework for few-shot fine-grained visual classification that reformulates the task as multimodal retrieval using MLLMs to generate structured text descriptions and off-the-shelf encoders for joint space retrieval.

Details

Motivation: Existing few-shot FGVC methods using pre-trained vision-language models suffer from overfitting and weak generalization when fine-tuned on limited data. The authors aim to develop a training-free approach that leverages multimodal large language models' open-world knowledge without requiring fine-tuning.

Method: 1) Proposes Category-Discriminative Visual Captioner (CDV-Captioner) using MLLMs with chain-of-thought prompting and visually similar reference images to generate structured text descriptions capturing fine-grained attribute features. 2) Converts images to image-description pairs for comprehensive representation. 3) Constructs multimodal category templates from few-shot samples. 4) Uses off-the-shelf vision and text encoders to embed queries and templates. 5) Performs FGVC by retrieving nearest template in joint multimodal space.

Result: Extensive experiments on 12 FGVC benchmarks show consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

Conclusion: UniFGVC provides a universal training-free framework that leverages MLLMs’ open-world knowledge for few-shot FGVC, offering broad compatibility, reliable generalization, and adaptability across diverse scenarios without requiring fine-tuning.

Abstract: Few-shot fine-grained visual classification (FGVC) aims to leverage limited data to enable models to discriminate subtly distinct categories. Recent works mostly finetuned the pre-trained visual language models to achieve performance gain, yet suffering from overfitting and weak generalization. To deal with this, we introduce UniFGVC, a universal training-free framework that reformulates few-shot FGVC as multimodal retrieval. First, we propose the Category-Discriminative Visual Captioner (CDV-Captioner) to exploit the open-world knowledge of multimodal large language models (MLLMs) to generate a structured text description that captures the fine-grained attribute features distinguishing closely related classes. CDV-Captioner uses chain-of-thought prompting and visually similar reference images to reduce hallucination and enhance discrimination of generated captions. Using it we can convert each image into an image-description pair, enabling more comprehensive feature representation, and construct the multimodal category templates using few-shot samples for the subsequent retrieval pipeline. Then, off-the-shelf vision and text encoders embed query and template pairs, and FGVC is accomplished by retrieving the nearest template in the joint space. UniFGVC ensures broad compatibility with diverse MLLMs and encoders, offering reliable generalization and adaptability across few-shot FGVC scenarios. Extensive experiments on 12 FGVC benchmarks demonstrate its consistent superiority over prior few-shot CLIP-based methods and even several fully-supervised MLLMs-based approaches.

[269] Object Fidelity Diffusion for Remote Sensing Image Generation

Ziqi Ye, Shuran Ma, Jie Yang, Xiaoyi Yang, Yi Yang, Ziyang Gong, Xue Yang, Haipeng Wang

Main category: cs.CV

TL;DR: OF-Diff is a novel diffusion model for high-fidelity remote sensing image generation that uses object shape priors from layouts and dual-branch architecture with diffusion consistency loss.

Details

Motivation: Existing diffusion models for remote sensing often produce low-fidelity images with poor morphological details, which affects object detection model robustness and reliability. There's a need for more precise controllable generation in this domain.

Method: Proposes Object Fidelity Diffusion (OF-Diff) with three key components: 1) First extraction of object shape priors from layouts for remote sensing diffusion models, 2) Dual-branch diffusion model with diffusion consistency loss that generates high-fidelity images without real images during sampling, 3) DDPO fine-tuning to enhance diversity and semantic consistency.

Result: OF-Diff outperforms state-of-the-art methods across key quality metrics. Significant improvements for polymorphic and small object classes: mAP increases by 8.3% for airplanes, 7.7% for ships, and 4.0% for vehicles.

Conclusion: OF-Diff effectively improves object fidelity in remote sensing image generation through shape priors, dual-branch architecture, and reinforcement learning fine-tuning, demonstrating superior performance for challenging object classes.

Abstract: High-precision controllable remote sensing image generation is both meaningful and challenging. Existing diffusion models often produce low-fidelity images due to their inability to adequately capture morphological details, which may affect the robustness and reliability of object detection models. To enhance the accuracy and fidelity of generated objects in remote sensing, this paper proposes Object Fidelity Diffusion (OF-Diff), which effectively improves the fidelity of generated objects. Specifically, we are the first to extract the prior shapes of objects based on the layout for diffusion models in remote sensing. Then, we introduce a dual-branch diffusion model with diffusion consistency loss, which can generate high-fidelity remote sensing images without providing real images during the sampling phase. Furthermore, we introduce DDPO to fine-tune the diffusion process, making the generated remote sensing images more diverse and semantically consistent. Comprehensive experiments demonstrate that OF-Diff outperforms state-of-the-art methods in the remote sensing across key quality metrics. Notably, the performance of several polymorphic and small object classes shows significant improvement. For instance, the mAP increases by 8.3%, 7.7%, and 4.0% for airplanes, ships, and vehicles, respectively.

Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum

Main category: cs.CV

TL;DR: LazyDrag introduces a drag-based image editing method for Multi-Modal Diffusion Transformers that eliminates implicit point matching, using explicit correspondence maps for stable inversion without test-time optimization.

Details

Motivation: Current drag-based editing methods rely on implicit point matching via attention, which compromises inversion strength and requires costly test-time optimization, limiting diffusion models' generative capabilities for high-fidelity inpainting and text-guided creation.

Method: Generates explicit correspondence maps from user drag inputs as reliable references to boost attention control, enabling stable full-strength inversion without test-time optimization, and supports multi-round workflows with simultaneous move and scale operations.

Result: Outperforms baselines on DragBench in drag accuracy and perceptual quality (validated by VIEScore and human evaluation), enables complex edits like opening a dog’s mouth with interior inpainting, generating new objects, and context-aware changes.

Conclusion: LazyDrag establishes new state-of-the-art performance in drag-based editing, eliminates the need for test-time optimization, unifies geometric control with text guidance, and paves a new way for editing paradigms in multimodal diffusion models.

Abstract: The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a ``tennis ball’’, or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.

[271] DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising

Li Gao, Hongyang Sun, Liu Liu, Yunhao Li, Yang Cai

Main category: cs.CV

TL;DR: DiffVL reformulates visual localization as GPS denoising using diffusion models, achieving sub-meter accuracy without HD maps by jointly modeling GPS, SD maps, and visual signals.

Details

Motivation: HD maps provide precise localization but are costly to build/maintain, while SD maps are more scalable but current SD-map-based approaches overlook noisy GPS signals. GPS is ubiquitous but suffers from multipath errors in urban environments, creating a need for methods that can effectively leverage all available signals.

Method: DiffVL treats visual localization as a GPS denoising task using diffusion models. The framework learns to reverse GPS noise perturbations by jointly modeling GPS, SD maps, and visual BEV features. Unlike traditional BEV-matching methods, it uses diffusion refinement to recover true pose distribution from noisy GPS trajectories conditioned on visual and map signals.

Result: Achieves state-of-the-art accuracy compared to BEV-matching baselines, with sub-meter accuracy without relying on HD maps. Demonstrates effectiveness on multiple datasets, proving diffusion models can enable scalable localization.

Conclusion: Diffusion models can enable scalable visual localization by treating noisy GPS as a generative prior, representing a paradigm shift from traditional matching-based methods. The approach successfully leverages all available signals (GPS, SD maps, visual features) to achieve high-precision localization without HD maps.

Abstract: Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: While high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird’s-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal-noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. DiffVL, unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior-making a paradigm shift from traditional matching-based methods.

[272] Decipher-MR: A Vision-Language Foundation Model for 3D MRI Representations

Zhijian Yang, Noel DSouza, Istvan Megyeri, Xiaojian Xu, Amin Honarmandi Shandiz, Farzin Haddadpour, Krisztian Koos, Laszlo Rusko, Emanuele Valeriano, Bharadwaj Swaninathan, Lei Wu, Parminder Bhatia, Taha Kass-Hout, Erhan Bas

Main category: cs.CV

TL;DR: Decipher-MR is a 3D MRI-specific vision-language foundation model trained on 200,000 MRI series from diverse anatomical regions, sequences, and pathologies, integrating self-supervised vision learning with report-guided text supervision for robust medical imaging applications.

Details

Motivation: MRI is complex and heterogeneous, making scalable machine learning challenging. While foundation models have transformed language and vision tasks, their application to MRI is limited by data scarcity and narrow anatomical focus.

Method: Developed a 3D MRI-specific vision-language foundation model using 200,000 MRI series from over 22,000 studies. Integrated self-supervised vision learning with report-guided text supervision. Features modular design with frozen pretrained encoder and lightweight task-specific decoders.

Result: Demonstrated consistent improvements over existing foundation models and task-specific approaches across disease classification, demographic prediction, anatomical localization, and cross-modal retrieval tasks.

Conclusion: Decipher-MR serves as a versatile foundation for MRI-based AI in clinical and research settings, addressing the limitations of current approaches for medical imaging.

Abstract: Magnetic Resonance Imaging is a critical imaging modality in clinical diagnosis and research, yet its complexity and heterogeneity hinder scalable, generalizable machine learning. Although foundation models have revolutionized language and vision tasks, their application to MRI remains constrained by data scarcity and narrow anatomical focus. We present Decipher-MR, a 3D MRI-specific vision-language foundation model trained on 200,000 MRI series from over 22,000 studies spanning diverse anatomical regions, sequences, and pathologies. Decipher-MR integrates self-supervised vision learning with report-guided text supervision to build robust representations for broad applications. To enable efficient use, Decipher-MR supports a modular design that enables tuning of lightweight, task-specific decoders attached to a frozen pretrained encoder. Following this setting, we evaluate Decipher-MR across disease classification, demographic prediction, anatomical localization, and cross-modal retrieval, demonstrating consistent improvements over existing foundation models and task-specific approaches. These results position Decipher-MR as a versatile foundation for MRI-based AI in clinical and research settings.

[273] Beyond the Vision Encoder: Identifying and Mitigating Spatial Bias in Large Vision-Language Models

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Yongshuai Hou, Weili Guan, Jun Yu, Min Zhang

Main category: cs.CV

TL;DR: LVLMs exhibit spatial bias where identical visual information placed at different image locations yields inconsistent outputs, caused by attention mismatch between vision encoder and LLM; proposed AGCI mechanism injects global visual context to mitigate bias without architectural changes.

Details

Motivation: While LVLMs excel at multimodal tasks, their robustness to spatial variations is poorly understood. The authors investigate how models respond when identical visual information appears at different image locations, revealing spatial bias that compromises consistency.

Method: Conducted systematic study of spatial bias through controlled probing experiments. Proposed Adaptive Global Context Injection (AGCI) - a lightweight mechanism that dynamically injects shared global visual context into each image token without architectural modifications, enhancing semantic accessibility while preserving model capabilities.

Result: Found current LVLMs produce inconsistent outputs under spatial shifts, revealing clear spatial bias. Analysis showed bias stems from attention mismatch between vision encoder and LLM, not from vision encoder itself. AGCI effectively mitigates spatial bias while improving performance on downstream tasks and hallucination benchmarks.

Conclusion: LVLMs have significant spatial bias affecting consistency, which can be addressed through global context injection. AGCI provides a lightweight solution that enhances spatial robustness without compromising model performance, offering insights for more robust multimodal understanding.

Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of the spatial bias of LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder, but rather from a mismatch in attention mechanisms between the vision encoder and the large language model, which disrupts the global information flow. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects shared global visual context into each image token. AGCI works without architectural modifications, mitigating spatial bias by enhancing the semantic accessibility of image tokens while preserving the model’s intrinsic capabilities. Extensive experiments demonstrate that AGCI not only enhances the spatial robustness of LVLMs, but also achieves strong performance on various downstream tasks and hallucination benchmarks.

[274] EVODiff: Entropy-aware Variance Optimized Diffusion Inference

Shigui Li, Wei Chen, Delu Zeng

Main category: cs.CV

TL;DR: EVODiff introduces an information-theoretic approach to diffusion model inference, optimizing conditional entropy during denoising to improve efficiency and reduce artifacts.

Details

Motivation: Diffusion models have slow inference and training-inference discrepancies; existing gradient-based solvers lack theoretical foundations in information transmission efficiency.

Method: Proposes an entropy-aware variance optimized method (EVODiff) that systematically reduces uncertainty by optimizing conditional entropy during denoising, based on information-theoretic analysis of reverse transitions.

Result: EVODiff outperforms SOTA gradient-based solvers: reduces reconstruction error by 45.5% on CIFAR-10 at 10 NFE, cuts NFE cost by 25% on ImageNet-256, and improves text-to-image generation with fewer artifacts.

Conclusion: Information-theoretic perspective provides fundamental insights for diffusion model inference; optimizing conditional entropy leads to significant improvements in efficiency and quality.

Abstract: Diffusion models (DMs) excel in image generation but suffer from slow inference and training-inference discrepancies. Although gradient-based solvers for DMs accelerate denoising inference, they often lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.

[275] Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation

Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin

Main category: cs.CV

TL;DR: AT-BPTT: Automatic Truncated Backpropagation Through Time for dataset distillation, dynamically adapting truncation positions and window sizes based on gradient behavior to improve efficiency and performance.

Details

Motivation: Existing dataset distillation methods use random truncation strategies that lack flexibility and yield suboptimal results, failing to account for distinct learning dynamics across different training stages.

Method: Proposes AT-BPTT with three components: 1) probabilistic stage-aware timestep selection, 2) adaptive window sizing based on gradient variation, and 3) low-rank Hessian approximation to reduce computational overhead.

Result: Achieves state-of-the-art performance on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K, improving accuracy by average 6.16% over baselines, accelerating inner-loop optimization by 3.9x while saving 63% memory cost.

Conclusion: AT-BPTT effectively addresses limitations of random truncation in dataset distillation by dynamically adapting to neural network learning dynamics, achieving superior performance and efficiency.

Abstract: The growing demand for efficient deep learning has positioned dataset distillation as a pivotal technique for compressing training dataset while preserving model performance. However, existing inner-loop optimization methods for dataset distillation typically rely on random truncation strategies, which lack flexibility and often yield suboptimal results. In this work, we observe that neural networks exhibit distinct learning dynamics across different training stages-early, middle, and late-making random truncation ineffective. To address this limitation, we propose Automatic Truncated Backpropagation Through Time (AT-BPTT), a novel framework that dynamically adapts both truncation positions and window sizes according to intrinsic gradient behavior. AT-BPTT introduces three key components: (1) a probabilistic mechanism for stage-aware timestep selection, (2) an adaptive window sizing strategy based on gradient variation, and (3) a low-rank Hessian approximation to reduce computational overhead. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art performance, improving accuracy by an average of 6.16% over baseline methods. Moreover, our approach accelerates inner-loop optimization by 3.9x while saving 63% memory cost.

[276] video-SALMONN S: Memory-Enhanced Streaming Audio-Visual LLM

Guangzhi Sun, Yixuan Li, Xiaodong Wu, Yudong Yang, Wei Li, Zejun Ma, Chao Zhang

Main category: cs.CV

TL;DR: video-SALMONN S is a memory-enhanced streaming audio-visual LLM that processes 3+ hour videos using test-time training as a streaming memory mechanism, outperforming non-streaming models on long video understanding tasks.

Details

Motivation: Long-duration streaming video understanding is limited by ineffective long-term memory in current AI systems, which is fundamental for future AI agents that need to process extended video content.

Method: Uses test-time training (TTT) as a streaming memory mechanism to transform short-term multimodal representations into long-term memory embedded in model parameters. Includes TTT_MEM layer with long-span prediction objective, two-stage training scheme, and modality-aware memory reader. Processes over 3-hour videos at 1 FPS and 360p resolution.

Result: Outperforms both streaming and non-streaming baselines by 3-7% on long video benchmarks. Achieves 15% absolute accuracy improvement over strong non-streaming models on the new ELViM benchmark, demonstrating strong learning abilities from video memory.

Conclusion: video-SALMONN S represents an advance in long-duration streaming video understanding with effective memory mechanisms, showing promise for AI agents that need to learn from extended video observations.

Abstract: Long-duration streaming video understanding is fundamental for future AI agents, yet remains limited by ineffective long-term memory. We introduce video-SALMONN S, a memory-enhanced streaming audio-visual large language model that processes over 3-hour videos at 1 FPS and 360p resolution, outperforming strong non-streaming models under the same memory budget. In addition to token merging or downsampling, video-SALMONN S is the first to employ test-time training (TTT) as a streaming memory mechanism for video understanding. TTT continuously transforms short-term multimodal representations into long-term memory embedded in model parameters. To improve long-range dependency modeling and memory capacity, we propose (i) a TTT_MEM layer with an additional long-span prediction objective, (ii) a two-stage training scheme, and (iii) a modality-aware memory reader. We further introduce the Episodic Learning from Video Memory (ELViM) benchmark, simulating agent-like scenarios where models must learn from videos observed hours earlier. video-SALMONN S consistently outperforms both streaming and non-streaming baselines by 3-7% on long video benchmarks. Notably, video-SALMONN S achieves a 15% absolute accuracy improvement over strong non-streaming models on ELViM, demonstrating strong learning abilities from video memory.

[277] SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning

Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng

Main category: cs.CV

TL;DR: SAIL-RL is a reinforcement learning framework that improves multimodal LLM reasoning by teaching models when and how to think through dual rewards for reasoning quality and adaptive thinking strategies.

Details

Motivation: Existing MLLM approaches have two key limitations: 1) outcome-only supervision that rewards correct answers without ensuring sound reasoning, and 2) uniform thinking strategies that cause overthinking on simple tasks and underthinking on complex ones.

Method: SAIL-RL uses reinforcement learning post-training with a dual reward system: 1) Thinking Reward evaluates reasoning quality through factual grounding, logical coherence, and answer consistency; 2) Judging Reward adaptively determines whether deep reasoning or direct answering is appropriate for each task.

Result: Experiments on SAIL-VL2 show SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieves competitive performance against GPT-4o, and substantially reduces hallucinations.

Conclusion: SAIL-RL establishes a principled framework for building more reliable and adaptive multimodal large language models by teaching them when and how to think through reinforcement learning.

Abstract: We introduce SAIL-RL, a reinforcement learning (RL) post-training framework that enhances the reasoning capabilities of multimodal large language models (MLLMs) by teaching them when and how to think. Existing approaches are limited by outcome-only supervision, which rewards correct answers without ensuring sound reasoning, and by uniform thinking strategies, which often lead to overthinking on simple tasks and underthinking on complex ones. SAIL-RL addresses these challenges with a dual reward system: the Thinking Reward, which evaluates reasoning quality through factual grounding, logical coherence, and answer consistency, and the Judging Reward, which adaptively determines whether deep reasoning or direct answering is appropriate. Experiments on the state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal understanding benchmarks at both 4B and 8B scales, achieving competitive performance against commercial closed-source models such as GPT-4o, and substantially reduces hallucinations, establishing it as a principled framework for building more reliable and adaptive MLLMs. The code will be available at https://github.com/BytedanceDouyinContent/SAIL-RL.

[278] UniADC: A Unified Framework for Anomaly Detection and Classification

Ximiao Zhang, Min Xu, Zheng Zhang, Junlin Hu, Xiuzhuang Zhou

Main category: cs.CV

TL;DR: UniADC is a unified model for simultaneous anomaly detection and classification in images, using a controllable inpainting network and implicit-normal discriminator to work with few or no anomaly samples.

Details

Motivation: Existing methods treat anomaly detection and classification as separate tasks, neglecting their inherent correlations and limiting information sharing, resulting in suboptimal performance. The authors aim to unify these tasks for better performance with limited anomaly data.

Method: Proposes UniADC with two key components: 1) Training-free Controllable Inpainting Network that synthesizes anomaly images by repainting normal regions guided by anomaly priors, and augments few-shot anomaly samples; 2) Implicit-Normal Discriminator that addresses class imbalance by implicitly modeling normal state and aligning fine-grained image features with anomaly-category embeddings.

Result: Extensive experiments on MVTec-FS, MTD, and WFDD datasets show UniADC consistently outperforms existing methods in anomaly detection, localization, and classification tasks.

Conclusion: UniADC effectively unifies anomaly detection and classification, achieving superior performance with few or no anomaly images by leveraging the proposed inpainting network and implicit-normal discriminator components.

Abstract: In this paper, we introduce a novel task termed unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlations and limiting information sharing, which results in suboptimal performance. To address this, we propose UniADC, a model designed to effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free Controllable Inpainting Network and an Implicit-Normal Discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The implicit-normal discriminator addresses the severe challenge of the imbalance between normal and anomalous pixel distributions by implicitly modeling the normal state, achieving precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on three anomaly detection and classification datasets, including MVTec-FS, MTD, and WFDD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.

[279] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification

Qinghao Gao, Jiahui Qu, Wenqian Dong

Main category: cs.CV

TL;DR: A parameter-efficient Mixture-of-Experts framework called MaMOL that handles missing modalities in multimodal remote sensing classification through dual-routing of shared and dynamic experts.

Details

Motivation: Multimodal remote sensing classification suffers from performance degradation due to missing modalities caused by sensor failures and environmental interference. The paper investigates whether MoE models can inherently adapt to diverse modality-missing scenarios.

Method: Proposes Missing-aware Mixture-of-LoRAs (MaMOL), a parameter-efficient MoE framework with dual-routing mechanism that decouples modality-invariant shared experts and modality-aware dynamic experts, enabling automatic expert activation conditioned on available modalities.

Result: Extensive experiments on multiple remote sensing benchmarks show MaMOL significantly improves robustness and generalization under diverse missing-modality scenarios with minimal computational overhead. Transfer experiments on natural image datasets validate scalability and cross-domain applicability.

Conclusion: MaMOL effectively addresses missing-modality challenges in multimodal classification through a unified parameter-efficient framework that maintains performance while being computationally efficient and transferable across domains.

Abstract: Multimodal remote sensing classification often suffers from missing modalities caused by sensor failures and environmental interference, leading to severe performance degradation. In this work, we rethink missing-modality learning from a conditional computation perspective and investigate whether Mixture-of-Experts (MoE) models can inherently adapt to diverse modality-missing scenarios. We first conduct a systematic study of representative MoE paradigms under various missing-modality settings, revealing both their potential and limitations. Building on these insights, we propose a Missing-aware Mixture-of-LoRAs (MaMOL), a parameter-efficient MoE framework that unifies multiple modality-missing cases within a single model. MaMOL introduces a dual-routing mechanism to decouple modality-invariant shared experts and modality-aware dynamic experts, enabling automatic expert activation conditioned on available modalities. Extensive experiments on multiple remote sensing benchmarks demonstrate that MaMOL significantly improves robustness and generalization under diverse missing-modality scenarios with minimal computational overhead. Transfer experiments on natural image datasets further validate its scalability and cross-domain applicability.

[280] Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

Andy Huynh, João Malheiro Silva, Holger Caesar, Tong Duy Son

Main category: cs.CV

TL;DR: Camera-only 3D reconstruction pipeline for Digital Twins using 3D Gaussian Splatting from multi-view images, with semantic material extraction and physics-based material assignment for sensor simulation.

Details

Motivation: LiDAR-based 3D reconstruction for Digital Twins provides accurate geometry but lacks semantics and textures. Traditional LiDAR-camera fusion requires complex calibration and struggles with materials like glass. There's a need for camera-only methods that combine photorealistic reconstruction with physics-based material properties for accurate sensor simulation.

Method: 1) Reconstruct scenes using 3D Gaussian Splatting from multi-view images; 2) Extract semantic material masks via vision models; 3) Convert Gaussian representations to mesh surfaces with projected material labels; 4) Assign physics-based material properties for accurate sensor simulation in modern graphics engines and simulators.

Result: The camera-only approach achieves sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. Validated using internal dataset from instrumented test vehicle with LiDAR as ground truth for reflectivity validation alongside image similarity metrics.

Conclusion: Camera-only pipeline successfully combines photorealistic reconstruction with physics-based material assignment, providing a practical alternative to LiDAR-camera fusion for Digital Twin applications with reduced hardware complexity.

Abstract: 3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.

[281] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking

Chandler Timm C. Doloriel, Habib Ullah, Kristian Hovde Liland, Fadi Al Machot, Ngai-Man Cheung

Main category: cs.CV

TL;DR: Frequency-domain masking strategy for universal deepfake detection that enhances generalization across unseen generative models while being computationally efficient.

Details

Motivation: Need for universal deepfake detection that generalizes to unseen generative models while minimizing computational overhead for large-scale screening in the Green AI era.

Method: Introduces frequency-domain masking as a training strategy, using random masking and geometric transformations with focus on frequency masking for superior generalization properties.

Result: Achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets, maintains performance under significant model pruning, offers scalable resource-conscious solution.

Conclusion: Frequency-based masking is a practical step toward sustainable and generalizable deepfake detection with strong generalization capabilities and computational efficiency.

Abstract: Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at https://github.com/chandlerbing65nm/FakeImageDetection.

[282] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images

Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Sohom Dey, Christian Matek, Lukas Wolfseher, Rainer Spang, Ralf Huss, Johannes Raffler, Sarah Reinke, Ario Sadafi, Wolfram Klapper, Katja Steiger, Kristina Schwamborn, Carsten Marr

Main category: cs.CV

TL;DR: Benchmark study of pathology foundation models for lymphoma subtyping from HE-stained slides, showing good in-distribution performance but poor generalization to out-of-distribution data.

Details

Motivation: Lymphoma diagnosis requires multiple expensive tests causing treatment delays. Deep learning could assist pathologists using routinely available HE-stained slides, but comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking.

Method: Created first multicenter lymphoma benchmarking dataset covering four common subtypes and healthy tissue. Evaluated five pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x).

Result: On in-distribution test sets: models achieved >80% multiclass balanced accuracy across all magnifications, with all foundation models performing similarly and both aggregation methods comparable. 40x resolution sufficient with no gains from higher resolutions or cross-magnification aggregation. On out-of-distribution test sets: performance dropped to ~60%, highlighting significant generalization challenges.

Conclusion: Larger multicenter studies covering additional rare lymphoma subtypes are needed. Provided automated benchmarking pipeline to facilitate future research. Shows promise for AI-assisted lymphoma diagnosis but generalization remains a major challenge.

Abstract: Timely and accurate lymphoma diagnosis is essential for guiding cancer treatment. Standard diagnostic practice combines hematoxylin and eosin (HE)-stained whole slide images with immunohistochemistry, flow cytometry, and molecular genetic tests to determine lymphoma subtypes, a process requiring costly equipment, skilled personnel, and causing treatment delays. Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking. In this work, we present the first multicenter lymphoma benchmarking dataset covering four common lymphoma subtypes and healthy control tissue. We systematically evaluate five publicly available pathology foundation models (H-optimus-1, H0-mini, Virchow2, UNI2, Titan) combined with attention-based (AB-MIL) and transformer-based (TransMIL) multiple instance learning aggregators across three magnifications (10x, 20x, 40x). On in-distribution test sets, models achieve multiclass balanced accuracies exceeding 80% across all magnifications, with all foundation models performing similarly and both aggregation methods showing comparable results. The magnification study reveals that 40x resolution is sufficient, with no performance gains from higher resolutions or cross-magnification aggregation. However, on out-of-distribution test sets, performance drops substantially to around 60%, highlighting significant generalization challenges. To advance the field, larger multicenter studies covering additional rare lymphoma subtypes are needed. We provide an automated benchmarking pipeline to facilitate such future research.

[283] CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

Main category: cs.CV

TL;DR: CountZES: A zero-shot object counting method that uses three synergistic stages (DAE, DGE, FCE) to discover diverse exemplars from text descriptions, addressing limitations of existing approaches that rely on noisy OVD detections or random patch sampling.

Details

Motivation: Zero-shot object counting in complex scenes is challenging due to limitations of existing methods: open-vocabulary detectors suffer from semantic noise and multi-instance proposals in dense scenes, while random patch sampling fails to accurately delineate object instances.

Method: CountZES uses three inference-only stages: 1) Detection-Anchored Exemplar (DAE) refines OVD detections to isolate precise single-instance exemplars; 2) Density-Guided Exemplar (DGE) employs density-driven self-supervised paradigm to identify statistically consistent exemplars; 3) Feature-Consensus Exemplar (FCE) reinforces visual coherence through feature-space clustering.

Result: Experiments on diverse datasets demonstrate CountZES’s superior performance among zero-shot object counting methods while generalizing effectively across domains.

Conclusion: CountZES provides a robust approach for zero-shot object counting that balances textual grounding, count consistency, and feature representativeness through complementary exemplar discovery stages.

Abstract: Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and frequent multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across domains.

[284] Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Linwei Qiu, Gongzhe Li, Xiaozhe Zhang, Qilin Sun, Fengying Xie

Main category: cs.CV

TL;DR: UniRect is a unified framework for image correction and rectangling tasks that handles diverse distortions through a task-agnostic approach with deformation and restoration modules.

Details

Motivation: Existing image correction methods rely on task-specific architectures that limit generalization across different tasks. The authors aim to create a unified framework that can handle various practical photography tasks from a consistent distortion rectification perspective.

Method: UniRect incorporates various task-specific inverse problems into a general distortion model by simulating different lens types. It uses a dual-component structure: 1) Deformation Module with Residual Progressive Thin-Plate Spline (RP-TPS) for geometric deformations, and 2) Restoration Module with Residual Mamba Blocks (RMBs) to counteract degradation. A Sparse Mixture-of-Experts (SMoEs) structure handles multi-task learning with varying distortions.

Result: Extensive experiments show state-of-the-art performance compared with other up-to-date methods across various image correction tasks.

Conclusion: UniRect provides a comprehensive unified framework for image correction tasks that overcomes limitations of task-specific architectures and demonstrates strong generalization capabilities.

Abstract: Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a {Deformation Module}, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.

[285] Driving on Registers

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Éloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, Anh-Quan Cao, Nermin Samet, Tuan-Hung VU, Matthieu Cord

Main category: cs.CV

TL;DR: DrivoR is a transformer-based architecture for end-to-end autonomous driving that uses camera-aware register tokens to compress multi-camera features and lightweight decoders for trajectory generation and scoring.

Details

Motivation: The paper aims to develop an efficient yet accurate end-to-end autonomous driving system that can process multi-camera inputs effectively while reducing computational overhead and providing interpretable behavior conditioning.

Method: Uses pretrained Vision Transformers with camera-aware register tokens to compress multi-camera features into compact scene representations. Two lightweight transformer decoders then generate candidate trajectories and score them using interpretable sub-scores (safety, comfort, efficiency) learned by mimicking an oracle.

Result: Outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and HUGSIM benchmarks. Shows that pure-transformer architecture with token compression enables accurate, efficient, and adaptive driving.

Conclusion: A pure-transformer architecture with targeted token compression is sufficient for accurate, efficient, and adaptive end-to-end autonomous driving, demonstrating strong performance across multiple benchmarks.

Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.

[286] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery

Constantin Selzer, Fabian B. Flohr

Main category: cs.CV

TL;DR: DeepUrban is a new drone dataset for autonomous driving focusing on dense urban traffic scenarios, created in collaboration with DeepScenario to enhance trajectory prediction and planning benchmarks.

Details

Motivation: Current autonomous driving benchmarks lack dense traffic scenarios needed for robust prediction and planning. There's a scarcity of datasets capturing complex interactions among road users in dense urban settings.

Method: Collaborated with DeepScenario to create DeepUrban dataset using drones at ~100m altitude over urban intersections. Collected high-resolution images to extract 3D traffic objects, enriched with comprehensive map and scene information.

Result: Adding DeepUrban to nuScenes dataset boosts vehicle prediction and planning accuracy by up to 44.1% on ADE and 44.3% on FDE metrics. The dataset enables better evaluation of generalization capabilities.

Conclusion: DeepUrban addresses the gap in dense traffic scenarios for autonomous driving benchmarks and demonstrates significant improvements in prediction and planning performance when combined with existing datasets.

Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban-a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods, and conducted experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements up to 44.1 % / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban

[287] Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Main category: cs.CV

TL;DR: MOD-DiT: A sampling-free dynamic attention framework for efficient video generation using mixture-of-distribution modeling to predict evolving attention patterns without repetitive sampling operations.

Details

Motivation: Current Diffusion Transformers (DiTs) for video generation suffer from quadratic complexity of self-attention, making practical deployment difficult. Existing sparse attention methods either use oversimplified static patterns or require computationally expensive sampling for dynamic sparsity, leading to inaccurate predictions and degraded quality.

Method: Proposes MOD-DiT with two-stage process: 1) Uses prior information from early denoising steps with distributed mixing approach to model linear approximation for predicting mask patterns for specific denoising intervals, 2) Implements online block masking strategy to dynamically apply predicted masks while maintaining historical sparsity information, eliminating repetitive sampling operations.

Result: Extensive evaluations show consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating effectiveness for efficient, high-quality video generation while overcoming computational limitations of traditional sparse attention approaches.

Conclusion: MOD-DiT provides an effective sampling-free dynamic attention framework that addresses computational bottlenecks in video generation DiTs, enabling practical deployment through accurate modeling of evolving attention patterns without sacrificing quality.

Abstract: While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a \underline{\textbf{M}}ixture-\underline{\textbf{O}}f-\underline{\textbf{D}}istribution \textbf{DiT} (\textbf{MOD-DiT}), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a {distributed mixing approach} to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT’s effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

[288] TFFM: Topology-Aware Feature Fusion Module via Latent Graph Reasoning for Retinal Vessel Segmentation

Iftekhar Ahmed, Shakib Absar, Aftar Ahmad Sami, Shadman Sakib, Debojyoti Biswas, Seraj Al Mahmud Mostafa

Main category: cs.CV

TL;DR: A topology-aware framework for retinal vessel segmentation that maintains vascular connectivity using graph attention networks and hybrid loss functions to reduce fragmentation.

Details

Motivation: Standard convolutional architectures produce disjointed segmentations with gaps and discontinuities that prevent reliable graph-based clinical analysis for cardiovascular diagnosis, despite high pixel-level accuracy.

Method: Introduces a Topological Feature Fusion Module (TFFM) that maps local features into latent graph space using Graph Attention Networks to capture global structural dependencies. Uses hybrid objective combining Tversky loss for class imbalance and soft clDice loss to penalize topological disconnects.

Result: Achieves state-of-the-art performance on Fundus-AVSeg dataset: 90.97% combined Dice score, 95% Hausdorff Distance of 3.50 pixels. Reduces vessel fragmentation by ~38% relative to baselines, yielding topologically coherent vascular trees for automated biomarker quantification.

Conclusion: The proposed topology-aware framework successfully maintains vascular connectivity, addressing a critical limitation of standard segmentation methods and enabling reliable clinical analysis of retinal vasculature for cardiovascular diagnosis.

Abstract: Precise segmentation of retinal arteries and veins carries the diagnosis of systemic cardiovascular conditions. However, standard convolutional architectures often yield topologically disjointed segmentations, characterized by gaps and discontinuities that render reliable graph-based clinical analysis impossible despite high pixel-level accuracy. To address this, we introduce a topology-aware framework engineered to maintain vascular connectivity. Our architecture fuses a Topological Feature Fusion Module (TFFM) that maps local feature representations into a latent graph space, deploying Graph Attention Networks to capture global structural dependencies often missed by fixed receptive fields. Furthermore, we drive the learning process with a hybrid objective function, coupling Tversky loss for class imbalance with soft clDice loss to explicitly penalize topological disconnects. Evaluation on the Fundus-AVSeg dataset reveals state-of-the-art performance, achieving a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels. Notably, our method decreases vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees viable for automated biomarker quantification. We open-source our code at https://tffm-module.github.io/.

[289] Creative Image Generation with Diffusion Models

Kunpeng Song, Ahmed Elgammal

Main category: cs.CV

TL;DR: A novel framework for creative image generation using diffusion models that drives image generation toward low-probability regions in CLIP embedding space to produce rare and imaginative outputs.

Details

Motivation: To develop a principled approach for creative image generation that goes beyond manual concept blending, enabling the production of novel, imaginative images that expand the boundaries of visual content synthesis.

Method: Uses diffusion models with creativity defined as inverse probability in CLIP embedding space. Calculates probability distribution of generated images and drives generation toward low-probability regions. Introduces pullback mechanisms to maintain visual fidelity while achieving high creativity.

Result: Extensive experiments on text-to-image diffusion models demonstrate effectiveness and efficiency in producing unique, novel, and thought-provoking images with high creativity without sacrificing visual quality.

Conclusion: Provides a new perspective on creativity in generative models with a principled method for fostering innovation in visual content synthesis through probability-based creative generation.

Abstract: Creative image generation has emerged as a compelling area of research, driven by the need to produce novel and high-quality images that expand the boundaries of imagination. In this work, we propose a novel framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image’s existence in the CLIP embedding space. Unlike prior approaches that rely on a manual blending of concepts or exclusion of subcategories, our method calculates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pullback mechanisms, achieving high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.

[290] Can 3D point cloud data improve automated body condition score prediction in dairy cattle?

Zhou Tang, Jin Wang, Angelo De Castro, Yuxi Zhang, Victoria Bastos Primo, Ana Beatriz Montevecchio Bernardino, Gota Morota, Xu Wang, Ricardo C Chebel, Haipeng Yu

Main category: cs.CV

TL;DR: Depth images outperform 3D point clouds for body condition score prediction in dairy cattle across multiple data settings, with point clouds being more sensitive to noise and model architecture.

Details

Motivation: Body condition score (BCS) is crucial for dairy cattle health and productivity, but visual scoring is subjective and labor-intensive. Computer vision approaches using depth images have been used, and 3D point clouds offer richer geometric information, but direct comparisons between these methods are limited.

Method: Compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Used data from 1,020 dairy cows with cow-level cross-validation to prevent data leakage.

Result: Depth image-based models consistently achieved higher accuracy than point cloud-based models with unsegmented raw data and segmented full-body data. Comparable performance was observed with segmented hindquarter data. Both approaches showed reduced accuracy with handcrafted features. Point cloud predictions were more sensitive to noise and model architecture.

Conclusion: 3D point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions, with depth images being more robust and accurate in most settings.

Abstract: Body condition score (BCS) is a widely used indicator of body energy status and is closely associated with metabolic status, reproductive performance, and health in dairy cattle; however, conventional visual scoring is subjective and labor-intensive. Computer vision approaches have been applied to BCS prediction, with depth images widely used because they capture geometric information independent of coat color and texture. More recently, three-dimensional point cloud data have attracted increasing interest due to their ability to represent richer geometric characteristics of animal morphology, but direct head-to-head comparisons with depth image-based approaches remain limited. In this study, we compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Prediction models were evaluated using data from 1,020 dairy cows collected on a commercial farm, with cow-level cross-validation to prevent data leakage. Depth image-based models consistently achieved higher accuracy than point cloud-based models when unsegmented raw data and segmented full-body data were used, whereas comparable performance was observed when segmented hindquarter data were used. Both depth image and point cloud approaches showed reduced accuracy when handcrafted feature data were employed compared with the other settings. Overall, point cloud-based predictions were more sensitive to noise and model architecture than depth image-based predictions. Taken together, these results indicate that three-dimensional point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions.

[291] ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Tao Yu, Haopeng Jin, Hao Wang, Shenghua Chai, Yujia Yang, Junhao Gong, Jiaming Guo, Minghui Zhang, Xinlong Chen, Zhenghao Zhang, Yuxuan Zhou, Yufei Xiong, Shanbin Zhang, Jiabing Yang, Hongzhu Yi, Xinming Wang, Cheng Zhong, Xiao Ma, Zhang Zhang, Yan Huang, Liang Wang

Main category: cs.CV

TL;DR: ShotFinder benchmark for open-domain video shot retrieval with temporal, visual, and audio constraints, revealing significant gaps in multimodal LLM capabilities.

Details

Motivation: Existing LLM research focuses on text or static multimodal settings, but open-domain video shot retrieval with temporal structure and complex semantics lacks systematic benchmarks and analysis.

Method: Introduced ShotFinder benchmark with 1,210 samples across 20 categories, formalizing editing requirements as keyframe-oriented shot descriptions with five controllable constraints (Temporal, Color, Visual style, Audio, Resolution). Proposed three-stage retrieval pipeline: query expansion via video imagination, candidate retrieval with search engine, and description-guided temporal localization.

Result: Experiments show significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges.

Conclusion: Open-domain video shot retrieval is a critical capability that multimodal large models have yet to overcome, revealing limitations in handling complex temporal, visual, and audio constraints.

Abstract: In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.

[292] Model Optimization for Multi-Camera 3D Detection and Tracking

Ethan Anderson, Justin Silva, Kyle Zheng, Sameer Pusegaonkar, Yizhou Wang, Zheng Tang, Sujit Biswas

Main category: cs.CV

TL;DR: Sparse4D multi-camera 3D tracking framework evaluated under reduced FPS, quantization, transfer learning, and mixed-precision training, with focus on identity stability metrics.

Details

Motivation: Multi-camera perception in indoor environments faces challenges with occlusion and heterogeneous viewpoints, requiring robust multi-target tracking that maintains identity persistence under various deployment constraints.

Method: Evaluates Sparse4D (query-based spatiotemporal 3D detection/tracking) under: reduced input frame rates, post-training quantization (INT8/FP8), transfer to WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. Introduces Average Track Duration metric for identity stability.

Result: Sparse4D stable under moderate FPS reductions but identity association collapses below 2 FPS; selective quantization of backbone/neck offers best speed-accuracy trade-off; low-FPS pretraining yields large zero-shot gains on WILDTRACK; mixed precision reduces latency but can destabilize identity propagation.

Conclusion: Multi-camera tracking systems need stability-aware validation, selective quantization strategies, and careful consideration of frame rate thresholds for identity persistence, with mixed-precision training offering scalability benefits but requiring stability monitoring.

Abstract: Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.

[293] DuoGen: Towards General Purpose Interleaved Multimodal Generation

Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Shubham Pachori, Zhaoshuo Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, Humphrey Shi

Main category: cs.CV

TL;DR: DuoGen is a general-purpose interleaved multimodal generation framework that combines visual understanding from MLLMs with visual generation from DiTs, achieving SOTA performance on text-to-image and image editing tasks.

Details

Motivation: Existing interleaved generation models have limited quality due to insufficient training data and base model capacity, despite the potential of interleaved multimodal generation for applications like instructional guides and visual planning.

Method: Systematic framework with three components: (1) Large-scale instruction-tuning dataset combining rewritten multimodal conversations and synthetic examples, (2) Architecture leveraging pretrained MLLM for visual understanding and DiT for visual generation, (3) Two-stage decoupled training: first instruction-tune MLLM, then align DiT with curated interleaved sequences.

Result: Outperforms prior open-source models in text quality, image fidelity, and image-context alignment; achieves state-of-the-art performance on text-to-image and image editing among unified generation models across public and new benchmarks.

Conclusion: DuoGen demonstrates that systematic data curation and architectural design can enable high-quality interleaved multimodal generation without costly unimodal pretraining, with flexible base model selection.

Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at https://research.nvidia.com/labs/dir/duogen/.

[294] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

Tong Wang, Yunhan Zhao, Shu Kong

Main category: cs.CV

TL;DR: Paracosm: A training-free zero-shot CIR method that uses LMMs to generate “mental images” from multimodal queries and synthetic counterparts for database images to bridge domain gaps.

Details

Motivation: Current zero-shot CIR methods use LMMs to generate textual descriptions for multimodal queries, then use VLMs for textual-visual matching. This indirect approach may not capture the intended "mental image" accurately.

Method: Prompt an LMM to directly generate the “mental image” from the multimodal query (reference image + modification text). Generate synthetic counterparts for real database images to bridge domain gaps. Match the generated “mental image” with synthetic database images in a constructed “paracosm”.

Result: Significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.

Conclusion: Directly generating “mental images” for multimodal queries is more effective than generating textual descriptions, and bridging synthetic-to-real domain gaps improves matching accuracy in zero-shot CIR.

Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this mental image’’ is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the mental image'' for more accurate matching. Particularly, we prompt an LMM to generate a mental image’’ for a given multimodal query and propose to use this mental image'' to search for the target image. As the mental image’’ has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a ``paracosm’’, where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.

[295] Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images

Xiang Zhang, Boxuan Zhang, Alireza Naghizadeh, Mohab Mohamed, Dongfang Liu, Ruixiang Tang, Dimitris Metaxas, Dongfang Liu

Main category: cs.CV

TL;DR: The paper presents two complementary data augmentation frameworks (IAAA and SAAA) to generate synthetic CAR-T/NK immunological synapse images for improving detection and segmentation performance, addressing limited annotated microscopy datasets.

Details

Motivation: Limited size of annotated microscopy datasets restricts the ability of artificial neural networks to generalize for CAR-T/NK immunological synapse detection and segmentation, which is important for predicting therapeutic efficacy in cancer immunotherapy.

Method: Two data augmentation frameworks: 1) Instance Aware Automatic Augmentation (IAAA) - automated, instance-preserving augmentation method applying optimized policies to original data; 2) Semantic-Aware AI Augmentation (SAAA) - combines diffusion-based mask generator with Pix2Pix conditional image synthesizer to create diverse, anatomically realistic segmentation masks and corresponding high-fidelity images.

Result: The augmentation strategies generate synthetic images with visual and structural properties closely matching real IS data, significantly improving CAR-T/NK IS detection and segmentation performance.

Conclusion: The work enhances robustness and accuracy of IS quantification, supporting development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.

Abstract: Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.

[296] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie

Main category: cs.CV

TL;DR: PISA introduces a training-free piecewise sparse attention method that approximates non-critical attention blocks via Taylor expansion instead of discarding them, achieving significant speedups while maintaining quality in diffusion transformers for video and image generation.

Details

Motivation: Current sparse attention methods for diffusion transformers suffer from quality degradation at high sparsity levels because they discard non-critical attention blocks entirely, losing important contextual information.

Method: PISA uses an exact-or-approximate strategy: critical attention blocks are computed exactly, while non-critical blocks are efficiently approximated using block-wise Taylor expansion based on the observation that their attention scores exhibit distributional stability.

Result: PISA achieves 1.91× speedup on Wan2.1-14B and 2.57× on Hunyuan-Video for video generation, and 1.2× speedup on FLUX for image generation, while maintaining the highest quality among sparse attention methods.

Conclusion: PISA effectively bridges the speed-quality gap in diffusion transformers by approximating rather than discarding non-critical attention blocks, making it a practical solution for efficient video and image generation.

Abstract: Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by discarding context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essentially important for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drop the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.

[297] From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen, Yang Long

Main category: cs.CV

TL;DR: A unified ViT-based model for temporally consistent human-centric dense prediction across videos, trained with synthetic data pipeline providing both frame-level and sequence-level supervision.

Details

Motivation: Existing human-centric dense prediction models suffer from temporal inconsistency (flickering) under motion, occlusion, and lighting changes, and lack paired video supervision for multiple dense tasks.

Method: 1) Scalable synthetic data pipeline generating photorealistic human frames with pixel-accurate depth, normals, and masks; 2) Unified ViT-based dense predictor with explicit human geometric prior via CSE embeddings and lightweight channel reweighting module; 3) Two-stage training combining static pretraining with dynamic sequence supervision.

Result: Achieves state-of-the-art performance on THuman2.1 and Hi4D datasets and generalizes effectively to in-the-wild videos.

Conclusion: The proposed approach successfully addresses temporal consistency in human-centric dense prediction through synthetic data generation and architectural innovations, enabling robust spatial and temporal learning.

Abstract: In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static data synthetic pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.

[298] Moonworks Lunara Aesthetic II: An Image Variation Dataset

Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan

Main category: cs.CV

TL;DR: Lunara Aesthetic II is a publicly released image dataset for evaluating contextual consistency in image generation/editing systems, featuring 2,854 anchor-linked variation pairs with identity-preserving contextual transformations.

Details

Motivation: To address the need for controlled evaluation and learning of contextual consistency in modern image generation and editing systems, providing interpretable supervision signals for identity preservation during contextual transformations.

Method: Created a dataset of 2,854 anchor-linked variation pairs from original art and photographs, applying contextual transformations (illumination, weather, viewpoint, scene composition, color tone, mood) while preserving underlying identity.

Result: Dataset shows high identity stability, strong target attribute realization, and robust aesthetic profile exceeding large-scale web datasets, with high aesthetic scores maintained.

Conclusion: Lunara Aesthetic II provides a valuable resource for benchmarking, fine-tuning, and analyzing contextual generalization, identity preservation, and edit robustness in image generation systems with interpretable supervision.

Abstract: We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood; while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara’s signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.

Jiaming Cui, Wenqiang Li, Shuai Zhou, Ruifeng Qin, Feng Shen

Main category: cs.CV

TL;DR: CMAFNet: Cross-modal RGB-depth fusion network for transmission line defect detection that purifies features via learned codebook before fusion, achieving state-of-the-art performance on small-scale defects.

Details

Motivation: Transmission line defect detection is challenging due to small-scale defects, complex backgrounds, and illumination variations. RGB-based detectors struggle with geometrically subtle defects that have limited chromatic contrast with background structures.

Method: CMAFNet integrates RGB appearance and depth geometry through a purify-then-fuse paradigm. It uses a Semantic Recomposition Module with dictionary-based feature purification via learned codebook to suppress noise while preserving defect information, and a Contextual Semantic Integration Framework with partial-channel attention for global spatial dependencies. Position-wise normalization enforces cross-modal alignment.

Result: Achieves 32.2% mAP@50 and 12.5% APs on TLRGBD benchmark (94.5% small objects), outperforming strongest baseline by 9.8 and 4.0 percentage points. Lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing YOLO-based detectors while matching transformer methods at lower cost.

Conclusion: CMAFNet effectively addresses small-scale defect detection challenges through principled cross-modal fusion of RGB and depth information, demonstrating superior performance and efficiency for transmission line inspection.

Abstract: Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.

[300] ObjEmbed: Towards Universal Multimodal Object Embeddings

Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

Main category: cs.CV

TL;DR: ObjEmbed is a multimodal embedding model that decomposes images into object-level embeddings for fine-grained vision-language alignment, supporting both region-level and image-level tasks with efficient single-pass encoding.

Details

Motivation: Existing multimodal embedding models excel at global image-text alignment but struggle with fine-grained alignment between specific image regions and textual phrases, creating a need for object-level alignment capabilities.

Method: ObjEmbed decomposes input images into multiple regional embeddings (one per object) plus global embeddings. Each region gets two complementary embeddings: object embedding for semantic matching and IoU embedding for localization quality prediction. Final matching combines semantic similarity with predicted IoU.

Result: Superior performance on 18 diverse benchmarks demonstrates strong semantic discrimination capabilities for both region-level and image-level visual understanding tasks.

Conclusion: ObjEmbed provides an effective solution for fine-grained vision-language alignment with object-oriented representations, supporting versatile visual understanding tasks through efficient single-pass encoding.

Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.

[301] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models

Shicheng Yin, Kaixuan Yin, Weixing Chen, Yang Liu, Guanbin Li, Liang Lin

Main category: cs.CV

TL;DR: DDP-WM is an efficient world model using disentangled dynamics prediction to separate primary physical interactions from background updates, achieving 9x speedup on robotic tasks.

Details

Motivation: Existing dense Transformer-based world models have substantial computational overhead that hinders real-time deployment for autonomous robotic planning. There's a need for more efficient models that maintain high performance.

Method: Proposes Disentangled Dynamics Prediction (DDP) that decomposes latent state evolution into sparse primary dynamics (physical interactions) and secondary context-driven background updates. Uses efficient historical processing with dynamic localization to isolate primary dynamics and cross-attention for background updates.

Result: Achieves significant efficiency and performance across diverse robotic tasks including navigation, tabletop manipulation, and complex deformable/multi-body interactions. On Push-T task: ~9x inference speedup and improves MPC success rate from 90% to 98% compared to SOTA dense models.

Conclusion: DDP-WM establishes a promising path for developing efficient, high-fidelity world models for real-time robotic planning applications.

Abstract: World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformerbased models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a crossattention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Codes will be available at https://github.com/HCPLab-SYSU/DDP-WM.

[302] SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors

Bing He, Jingnan Gao, Yunuo Chen, Ning Cao, Gang Chen, Zhengxue Cheng, Li Song, Wenjun Zhang

Main category: cs.CV

TL;DR: SurfSplat: A feedforward framework using 2D Gaussian Splatting primitives for high-fidelity 3D reconstruction from sparse images, addressing surface continuity issues in previous 3DGS-based methods.

Details

Motivation: Current 3D reconstruction methods using 3D Gaussian Splatting often produce discrete, color-biased point clouds with surface discontinuities and artifacts under close-up views, failing to reconstruct accurate geometry and texture from sparse images.

Method: Proposes SurfSplat using 2D Gaussian Splatting primitives with stronger anisotropy and higher geometric precision. Incorporates surface continuity prior and forced alpha blending strategy for coherent geometry and faithful textures. Introduces High-Resolution Rendering Consistency (HRRC) evaluation metric.

Result: Outperforms prior methods on RealEstate10K, DL3DV, and ScanNet datasets on both standard metrics and the new HRRC metric, demonstrating robust high-fidelity 3D reconstruction from sparse inputs.

Conclusion: SurfSplat provides a robust solution for high-fidelity 3D reconstruction from sparse images by addressing surface continuity issues through 2D Gaussian Splatting primitives and novel blending strategies.

Abstract: Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitive. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitive, which provides stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs. Project page: https://hebing-sjtu.github.io/SurfSplat-website/

[303] Reg4Pru: Regularisation Through Random Token Routing for Token Pruning

Julian Wyatt, Ronald Clark, Irina Voiculescu

Main category: cs.CV

TL;DR: Reg4Pru is a training regularization technique that improves token pruning performance for segmentation tasks by mitigating the performance loss from pruning while maintaining computational efficiency.

Details

Motivation: Transformers in vision models suffer from quadratic computational scaling with token count. Token pruning methods improve efficiency but degrade performance in deeper layers due to instability from preserved representations, particularly affecting dense prediction tasks like segmentation.

Method: Reg4Pru is a training regularization technique designed specifically for token pruning strategies. It helps maintain performance when pruning tokens by stabilizing the preserved representations, addressing the performance degradation that typically occurs in deeper layers of pruned models.

Result: On the FIVES blood vessel segmentation dataset, Reg4Pru improved average precision by 46% absolute compared to the same model trained without routing, while achieving 29% relative speedup in wall-clock time compared to the non-pruned baseline.

Conclusion: Reg4Pru is an effective regularizer for token reduction strategies in vision transformers, enabling significant computational efficiency gains while maintaining or improving segmentation performance.

Abstract: Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically to the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with reactivating tokens from preserved representations, but the increased computational efficiency of this method results in decreased stability from the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.

[304] CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

Xinquan Yu, Wei Lu, Xiangyang Luo, Rui Yang

Main category: cs.CV

TL;DR: CIEC is a weakly-supervised framework for multimodal manipulation localization in image-text pairs using only coarse-grained annotations, achieving comparable performance to fully supervised methods.

Details

Motivation: Current multimodal manipulation localization methods require expensive fine-grained annotations (patch/token-level), which are costly and time-consuming to obtain. The authors aim to develop a weakly-supervised approach that only needs coarse-grained image/sentence-level annotations.

Method: CIEC uses two branches: 1) Image-based localization with Textual-guidance Refine Patch Selection (TRPS) that integrates visual and textual forgery cues using spatial priors, plus background silencing and spatial contrast constraints; 2) Text-based localization with Visual-deviation Calibrated Token Grounding (VCTG) that focuses on content words using visual bias, plus asymmetric sparse and semantic consistency constraints.

Result: Extensive experiments show CIEC achieves results comparable to fully supervised methods on several evaluation metrics for multimodal manipulation localization.

Conclusion: CIEC demonstrates that effective multimodal manipulation localization can be achieved with only coarse-grained annotations, reducing annotation costs while maintaining performance comparable to fully supervised approaches.

Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Consider that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. Followed by the background silencing and spatial contrast constraints to suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. Followed by the asymmetric sparse and semantic consistency constraints to mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.

[305] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng

Main category: cs.CV

TL;DR: Infinite-World is a robust interactive world model that maintains coherent visual memory over 1000+ frames in complex real-world environments without relying on explicit geometric priors.

Details

Motivation: Existing world models work well on synthetic data with perfect ground-truth but lack effective training paradigms for real-world videos due to noisy pose estimations and scarcity of viewpoint revisits.

Method: 1) Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into fixed-budget representation; 2) Uncertainty-aware Action Labeling that discretizes continuous motion into tri-state logic; 3) Revisit-Dense Finetuning Strategy using compact dataset to activate long-range loop-closure capabilities.

Result: Extensive experiments show Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency compared to existing methods.

Conclusion: The proposed approach enables robust interactive world modeling in real-world environments without explicit geometric priors, maintaining coherent visual memory over extended sequences.

Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

[306] ReasonEdit: Editing Vision-Language Models using Human Reasoning

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

Main category: cs.CV

TL;DR: ReasonEdit: First VLM editor that incorporates human reasoning explanations during editing for vision-language models, achieving SOTA performance on rationale-based VQA tasks.

Details

Motivation: Existing vision-language model editors don't handle reasoning-heavy tasks that require humans and models to reason about images. There's a need for editors that can incorporate human reasoning explanations during the editing process.

Method: Proposes ReasonEdit that continuously stores human reasoning in a codebook and retrieves relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science.

Result: Achieves state-of-the-art editing performance across four VLMs on multiple rationale-based visual question answering datasets, showing that using human reasoning during editing greatly improves edit generalization.

Conclusion: ReasonEdit successfully demonstrates that incorporating human reasoning explanations during editing significantly enhances vision-language model editing capabilities for reasoning-heavy tasks.

Abstract: Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

cs.AI

[307] CreditAudit: 2D Auditing for LLM Evaluation and Selection

Yiliang Song, Hongjun An, Jiangong Xiao, Haofei Zhao, Jiawei Shao, Xuelong Li

Main category: cs.AI

TL;DR: CreditAudit is a deployment-oriented evaluation framework that assesses language models not just on mean performance but also on stability across different system prompts, providing credit grades (AAA-BBB) to guide real-world deployment decisions.

Details

Motivation: Current benchmark scores show marginal differences between frontier models but fail to capture real-world deployment stability, where small prompt variations can cause disproportionate failures in agentic pipelines, leaving practitioners uncertain about model selection.

Method: Proposes CreditAudit framework that evaluates models under semantically aligned, non-adversarial system prompt templates across benchmarks, measuring mean ability (average performance) and scenario-induced fluctuation sigma (stability risk), then maps volatility into interpretable credit grades (AAA to BBB) using cross-model quantiles with diagnostics to mitigate template difficulty drift.

Result: Experiments on GPQA, TruthfulQA, and MMLU Pro show models with similar mean ability can have substantially different fluctuation patterns, and stability risk can overturn prioritization decisions in agentic or high failure cost regimes.

Conclusion: CreditAudit provides a 2D and grade-based language for regime-specific model selection, supporting tiered deployment and more disciplined allocation of testing/monitoring effort for more objective and trustworthy real-world model evaluation.

Abstract: Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users’ day to day experience, because system prompts, output protocols, and interaction modes evolve under routine iteration, and in agentic multi step pipelines small protocol shifts can trigger disproportionate failures, leaving practitioners uncertain about which model to deploy. We propose CreditAudit, a deployment oriented credit audit framework that evaluates models under a family of semantically aligned and non adversarial system prompt templates across multiple benchmarks, reporting mean ability as average performance across scenarios and scenario induced fluctuation sigma as a stability risk signal, and further mapping volatility into interpretable credit grades from AAA to BBB via cross model quantiles with diagnostics that mitigate template difficulty drift. Controlled experiments on GPQA, TruthfulQA, and MMLU Pro show that models with similar mean ability can exhibit substantially different fluctuation, and stability risk can overturn prioritization decisions in agentic or high failure cost regimes. By providing a 2D and grade based language for regime specific selection, CreditAudit supports tiered deployment and more disciplined allocation of testing and monitoring effort, enabling more objective and trustworthy model evaluation for real world use.

[308] Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers

Pengyu Dai, Weihao Xuan, Junjue Wang, Hongruixuan Chen, Jian Song, Yafei Ou, Naoto Yokoya

Main category: cs.AI

TL;DR: GeoEvolver: A self-evolving multi-agent system that enables LLM agents to acquire Earth Observation expertise through structured interaction without parameter updates, improving task success in complex EO workflows.

Details

Motivation: LLM agents struggle in specialized, tool-intensive domains like Earth Observation that require long-horizon execution, tight coordination across modalities, and adherence to implicit tool constraints. Existing agents lack mechanisms to learn fine-grained, tool-level expertise from interaction.

Method: GeoEvolver uses a retrieval-augmented multi-agent orchestrator to decompose queries into sub-goals, explores diverse tool-parameter configurations at sub-goal level, and distills successful patterns and failure root causes into an evolving memory bank for future in-context demonstrations.

Result: Experiments on three tool-integrated EO benchmarks show GeoEvolver consistently improves end-to-end task success with average gain of 12% across multiple LLM backbones, demonstrating progressive emergence of EO expertise.

Conclusion: EO expertise can emerge progressively from efficient, fine-grained interactions with the environment through structured multi-agent systems without parameter updates, enabling LLM agents to handle complex, multimodal EO workflows.

Abstract: Recent advances have enabled large language model (LLM) agents to solve complex tasks by orchestrating external tools. However, these agents often struggle in specialized, tool-intensive domains that demand long-horizon execution, tight coordination across modalities, and strict adherence to implicit tool constraints. Earth Observation (EO) tasks exemplify this challenge due to the multi-modal and multi-temporal data inputs, as well as the requirements of geo-knowledge constraints (spectrum library, spatial reasoning, etc): many high-level plans can be derailed by subtle execution errors that propagate through a pipeline and invalidate final results. A core difficulty is that existing agents lack a mechanism to learn fine-grained, tool-level expertise from interaction. Without such expertise, they cannot reliably configure tool parameters or recover from mid-execution failures, limiting their effectiveness in complex EO workflows. To address this, we introduce \textbf{GeoEvolver}, a self-evolving multi-agent system~(MAS) that enables LLM agents to acquire EO expertise through structured interaction without any parameter updates. GeoEvolver decomposes each query into independent sub-goals via a retrieval-augmented multi-agent orchestrator, then explores diverse tool-parameter configurations at the sub-goal level. Successful patterns and root-cause attribution from failures are then distilled in an evolving memory bank that provides in-context demonstrations for future queries. Experiments on three tool-integrated EO benchmarks show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple LLM backbones, demonstrating that EO expertise can emerge progressively from efficient, fine-grained interactions with the environment.

[309] Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems

Chandan Kumar Sah, Xiaoli Lian, Li Zhang, Tony Xu, Syed Shazaib Shah

Main category: cs.AI

TL;DR: This paper studies uncertainty and fairness in LLM-based recommendations, introducing evaluation metrics, datasets, and case studies showing systematic unfairness in Gemini 1.5 Flash that persists across prompt variations.

Details

Motivation: LLMs enable powerful zero-shot recommendations but face reliability and fairness challenges due to predictive uncertainty and embedded biases, threatening trustworthy deployment.

Method: Introduces benchmark metrics and dataset with 8 demographic attributes across movies/music domains. Uses entropy for uncertainty quantification, measures fairness gaps (SNSR/SNSV), tests prompt perturbations, and integrates personality-aware fairness into RecLLM evaluation pipeline.

Result: Gemini 1.5 Flash exhibits systematic unfairness (SNSR: 0.1363, SNSV: 0.0507) that persists under typographical errors and multilingual inputs. Reveals personality-linked bias patterns and trade-offs between personalization and group fairness.

Conclusion: Proposes uncertainty-aware evaluation methodology and personality profile-informed fairness benchmark to advance explainability and equity in LLM recommendations, establishing foundation for safer, more interpretable RecLLMs.

Abstract: Large language models (LLMs) enable powerful zero-shot recommendations by leveraging broad contextual knowledge, yet predictive uncertainty and embedded biases threaten reliability and fairness. This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of LLM-generated recommendations. We introduce a benchmark of curated metrics and a dataset annotated for eight demographic attributes (31 categorical values) across two domains: movies and music. Through in-depth case studies, we quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind’s Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes; measured similarity-based gaps are SNSR at 0.1363 and SNSV at 0.0507. These disparities persist under prompt perturbations such as typographical errors and multilingual inputs. We further integrate personality-aware fairness into the RecLLM evaluation pipeline to reveal personality-linked bias patterns and expose trade-offs between personalization and group fairness. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark that advances explainability and equity in LLM recommendations. Together, these contributions establish a foundation for safer, more interpretable RecLLMs and motivate future work on multi-model benchmarks and adaptive calibration for trustworthy deployment.

[310] PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review

Yanki Margalit, Erni Avram, Ran Taig, Oded Margalit, Nurit Cohen-Inger

Main category: cs.AI

TL;DR: PeerRank is an autonomous evaluation framework where LLMs generate tasks, answer with web grounding, judge peers, and aggregate assessments without human supervision, enabling scalable open-world assessment.

Details

Motivation: Traditional LLM evaluation relies on human-authored benchmarks and judgments that scale poorly, become outdated quickly, and mismatch real-world deployments that use web retrieval and synthesis.

Method: PeerRank treats evaluation as a multi-agent process where models symmetrically act as task designers, respondents (with category-scoped live web grounding), and evaluators, removing biased judgments through peer assessment aggregation.

Result: In a study with 12 commercial models and 420 autonomously generated questions, PeerRank produced stable, discriminative rankings, revealed identity and presentation biases, showed robustness, and mean peer scores agreed with Elo ratings.

Conclusion: Bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static, human-curated benchmarks, as validated on TruthfulQA and GSM8K where peer scores correlate with objective accuracy.

Abstract: Evaluating large language models typically relies on human-authored benchmarks, reference answers, and human or single-model judgments, approaches that scale poorly, become quickly outdated, and mismatch open-world deployments that depend on web retrieval and synthesis. We introduce PeerRank, a fully autonomous end-to-end evaluation framework in which models generate evaluation tasks, answer them with category-scoped live web grounding, judge peer responses and aggregate dense peer assessments into relative performance estimates, without human supervision or gold references. PeerRank treats evaluation as a multi-agent process where each model participates symmetrically as task designer, respondent, and evaluator, while removing biased judgments. In a large-scale study over 12 commercially available models and 420 autonomously generated questions, PeerRank produces stable, discriminative rankings and reveals measurable identity and presentation biases. Rankings are robust, and mean peer scores agree with Elo. We further validate PeerRank on TruthfulQA and GSM8K, where peer scores correlate with objective accuracy. Together, these results suggest that bias-aware peer evaluation with selective web-grounded answering can scale open-world LLM assessment beyond static and human curated benchmarks.

[311] A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior

Harry Mayne, Justin Singh Kang, Dewi Gould, Kannan Ramchandran, Adam Mahdi, Noah Y. Siegel

Main category: cs.AI

TL;DR: LLM self-explanations improve prediction of model behavior by 11-37% (NSG metric), showing they encode useful information about decision-making, though 5-15% are misleading.

Details

Motivation: Current faithfulness metrics for LLM self-explanations have critical limitations, typically relying on adversarial prompting or detecting reasoning errors, overlooking the predictive value of explanations for understanding model behavior.

Method: Introduced Normalized Simulatability Gain (NSG) metric based on the idea that faithful explanations should allow observers to learn model’s decision-making criteria and better predict behavior on related inputs. Evaluated 18 frontier models on 7,000 counterfactuals from datasets covering health, business, and ethics.

Result: Self-explanations substantially improve prediction of model behavior (11-37% NSG). They provide more predictive information than external model explanations, even when external models are stronger. Across models, 5-15% of self-explanations are egregiously misleading.

Conclusion: Self-explanations encode information that helps predict model behavior, showing a positive case for their use despite imperfections. There’s an advantage from self-knowledge that external explanation methods cannot replicate.

Abstract: LLM self-explanations are often presented as a promising tool for AI oversight, yet their faithfulness to the model’s true reasoning process is poorly understood. Existing faithfulness metrics have critical limitations, typically relying on identifying unfaithfulness via adversarial prompting or detecting reasoning errors. These methods overlook the predictive value of explanations. We introduce Normalized Simulatability Gain (NSG), a general and scalable metric based on the idea that a faithful explanation should allow an observer to learn a model’s decision-making criteria, and thus better predict its behavior on related inputs. We evaluate 18 frontier proprietary and open-weight models, e.g., Gemini 3, GPT-5.2, and Claude 4.5, on 7,000 counterfactuals from popular datasets covering health, business, and ethics. We find self-explanations substantially improve prediction of model behavior (11-37% NSG). Self-explanations also provide more predictive information than explanations generated by external models, even when those models are stronger. This implies an advantage from self-knowledge that external explanation methods cannot replicate. Our approach also reveals that, across models, 5-15% of self-explanations are egregiously misleading. Despite their imperfections, we show a positive case for self-explanations: they encode information that helps predict model behavior.

[312] MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

Main category: cs.AI

TL;DR: MARS is a modular AI research agent framework that uses budget-aware MCTS planning, modular construction, and comparative reflective memory to autonomously conduct AI research while managing computational costs and performance attribution.

Details

Motivation: AI research automation differs from general software engineering due to computationally expensive evaluations (like model training) and opaque performance attribution. Current LLM-based agents struggle with monolithic scripts that ignore execution costs and causal factors.

Method: Three pillars: (1) Budget-Aware Planning using cost-constrained Monte Carlo Tree Search (MCTS) to balance performance with execution expense; (2) Modular Construction with “Design-Decompose-Implement” pipeline for complex research repositories; (3) Comparative Reflective Memory to address credit assignment by analyzing solution differences for insights.

Result: Achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, competitive with global leaderboard’s top methods. Shows qualitative “Aha!” moments with 63% of utilized lessons originating from cross-branch transfer, demonstrating effective generalization across search paths.

Conclusion: MARS provides an effective framework for autonomous AI research that explicitly addresses computational costs and performance attribution challenges, enabling agents to generalize insights across different research paths.

Abstract: Automating AI research differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a “Design-Decompose-Implement” pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard’s top methods. Furthermore, the system exhibits qualitative “Aha!” moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.

[313] ATLAS : Adaptive Self-Evolutionary Research Agent with Task-Distributed Multi-LLM Supporters

Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

Main category: cs.AI

TL;DR: ATLAS is a task-distributed framework for multi-agent LLM systems that adaptively evolves a lightweight research agent while delegating specialized roles to supporter agents, using EvoDPO algorithm for adaptive preference optimization with theoretical guarantees.

Details

Motivation: Current multi-LLM agent systems either keep solvers frozen after fine-tuning or use static preference-optimization loops, which become intractable for long-horizon tasks. There's a need for adaptive systems that can evolve and handle non-stationary environments.

Method: Proposes ATLAS framework with task distribution: lightweight research agent evolves iteratively while delegating complementary roles to specialized supporter agents (exploration, hyperparameter tuning, reference policy management). Core algorithm is Evolving Direct Preference Optimization (EvoDPO) that adaptively updates phase-indexed reference policy. Includes theoretical regret analysis for preference-based contextual bandit under concept drift.

Result: Experiments on non-stationary linear contextual bandits and scientific machine learning (SciML) loss reweighting for 1D Burgers’ equation show ATLAS improves stability and performance over static single-agent baseline.

Conclusion: ATLAS provides an effective task-distributed framework for adaptive agent evolution in non-stationary environments, with theoretical guarantees and empirical validation showing improved stability and performance over static approaches.

Abstract: Recent multi-LLM agent systems perform well in prompt optimization and automated problem-solving, but many either keep the solver frozen after fine-tuning or rely on a static preference-optimization loop, which becomes intractable for long-horizon tasks. We propose ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a task-distributed framework that iteratively develops a lightweight research agent while delegating complementary roles to specialized supporter agents for exploration, hyperparameter tuning, and reference policy management. Our core algorithm, Evolving Direct Preference Optimization (EvoDPO), adaptively updates the phase-indexed reference policy. We provide a theoretical regret analysis for a preference-based contextual bandit under concept drift. In addition, experiments were conducted on non-stationary linear contextual bandits and scientific machine learning (SciML) loss reweighting for the 1D Burgers’ equation. Both results show that ATLAS improves stability and performance over a static single-agent baseline.

[314] Dynamic Mix Precision Routing for Efficient Multi-step LLM Interaction

Yuanzhe Li, Jianing Deng, Jingtong Hu, Tianlong Chen, Song Wang, Huanrui Yang

Main category: cs.AI

TL;DR: Dynamic mix-precision routing framework for LLMs in long-horizon decision-making that adaptively selects between high-precision and low-precision models at each step to reduce inference cost while maintaining performance.

Details

Motivation: While larger LLMs achieve better performance in long-horizon decision-making tasks, multi-step interaction with large models incurs prohibitive inference costs. There's a need to balance performance with computational efficiency.

Method: Proposes a dynamic mix-precision routing framework that adaptively selects between high-precision and low-precision LLMs at each decision step based on step sensitivity. Uses a two-stage training pipeline: 1) KL-divergence-based supervised learning to identify precision-sensitive steps, 2) Group-Relative Policy Optimization (GRPO) to improve task success rates.

Result: Experiments on ALFWorld demonstrate significant improvement in accuracy-cost trade-off over single-precision baselines and heuristic routing methods.

Conclusion: Dynamic precision selection can effectively reduce LLM inference costs in long-horizon decision-making while maintaining strong task performance, offering a practical solution for resource-constrained applications.

Abstract: Large language models (LLM) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practitioners commonly believe a higher task success rate necessitates the use of a larger and stronger LLM model, multi-step interaction with a large LLM incurs prohibitive inference cost. To address this problem, we explore the use of low-precision quantized LLM in the long-horizon decision-making process. Based on the observation of diverse sensitivities among interaction steps, we propose a dynamic mix-precision routing framework that adaptively selects between high-precision and low-precision LLMs at each decision step. The router is trained via a two-stage pipeline, consisting of KL-divergence-based supervised learning that identifies precision-sensitive steps, followed by Group-Relative Policy Optimization (GRPO) to further improve task success rates. Experiments on ALFWorld demonstrate that our approach achieves a great improvement on accuracy-cost trade-off over single-precision baselines and heuristic routing methods.

[315] Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu

Main category: cs.AI

TL;DR: Cuttlefish: A unified all-atom LLM that grounds language reasoning in geometric cues while adaptively scaling modality tokens with structural complexity to improve biomolecular structure understanding.

Details

Motivation: Existing methods for reasoning over biomolecular structures are modality-specific and compress structural inputs through sequence-based tokenization or fixed-length query connectors, which either omit geometric groundings needed to mitigate structural hallucinations or impose inflexible modality fusion bottlenecks that over-compress and suboptimally allocate structural tokens.

Method: 1) Scaling-Aware Patching uses instruction-conditioned gating to generate variable-size patches over structural graphs, adaptively scaling query token budget with structural complexity. 2) Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects resulting modality tokens into the LLM to expose explicit geometric cues.

Result: Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning.

Conclusion: Cuttlefish provides a unified approach that grounds language reasoning in geometric cues while adaptively scaling modality tokens with structural complexity, addressing limitations of existing methods for biomolecular structure understanding.

Abstract: Large language models (LLMs) are enabling reasoning over biomolecular structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric groundings requisite for mitigating structural hallucinations or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code is available at the project repository.

[316] Visual Reasoning over Time Series via Multi-Agent System

Weilin Ruan, Yuxuan Liang

Main category: cs.AI

TL;DR: MAS4TS: A tool-driven multi-agent system for time series analysis using visual reasoning and latent reconstruction with Analyzer-Reasoner-Executor paradigm

Details

Motivation: Existing time series methods lack intuitive visual reasoning and cross-task generalization with adaptive tool usage; need for unified framework integrating agent communication, visual reasoning, and latent reconstruction

Method: Three-agent system (Analyzer-Reasoner-Executor) with shared memory and gated communication; uses Vision-Language Model for visual reasoning over time series plots to extract temporal structures, then reconstructs predictive trajectories in latent space; router selects task-specific tool chains

Result: Achieves state-of-the-art performance across multiple time series benchmarks; demonstrates strong generalization and efficient inference

Conclusion: MAS4TS provides effective framework for general time series tasks through multi-agent coordination, visual reasoning, and latent reconstruction

Abstract: Time series analysis underpins many real-world applications, yet existing time-series-specific methods and pretrained large-model-based approaches remain limited in integrating intuitive visual reasoning and generalizing across tasks with adaptive tool usage. To address these limitations, we propose MAS4TS, a tool-driven multi-agent system for general time series tasks, built upon an Analyzer-Reasoner-Executor paradigm that integrates agent communication, visual reasoning, and latent reconstruction within a unified framework. MAS4TS first performs visual reasoning over time series plots with structured priors using a Vision-Language Model to extract temporal structures, and subsequently reconstructs predictive trajectories in latent space. Three specialized agents coordinate via shared memory and gated communication, while a router selects task-specific tool chains for execution. Extensive experiments on multiple benchmarks demonstrate that MAS4TS achieves state-of-the-art performance across a wide range of time series tasks, while exhibiting strong generalization and efficient inference.

[317] Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing

Saeid Sheikhi

Main category: cs.AI

TL;DR: Chain of Simulation (CoS) is a dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies (computational, symbolic, hybrid) in LLMs, achieving significant accuracy improvements on reasoning benchmarks with lower computational cost.

Details

Motivation: Existing LLM prompting approaches use uniform strategies for all problems, but different reasoning tasks (mathematical, spatial, multi-hop inference) require specialized approaches. The authors aim to develop a framework that can dynamically select appropriate reasoning modes for different problem types.

Method: CoS employs three distinct reasoning modes: (1) computational flow with self-consistency for mathematical problems, (2) symbolic state tracking with JSON representations for spatial reasoning, and (3) hybrid fact-extraction for multi-hop inference. The framework includes algorithms for mode selection, state tracking, and answer extraction, and dynamically routes problems to the appropriate specialized reasoning strategy.

Result: CoS achieves 71.5% accuracy on GSM8K (1.0% absolute improvement), 90.0% on StrategyQA (2.5% improvement), and 19.0% on bAbI (65.2% relative improvement) compared to strongest baselines. Computational mode achieves 81.2% accuracy when correctly applied to mathematical problems. The framework provides comparable performance to Self-Consistency at 54% lower computational cost.

Conclusion: Problem-specific mode selection is crucial for LLM reasoning, and CoS establishes an effective approach for improving reasoning without additional training. The framework demonstrates superior trade-offs between accuracy and efficiency compared to uniform prompting methods.

Abstract: We present Chain of Simulation (CoS), a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies in Large Language Models (LLMs). Unlike existing uniform prompting approaches, CoS employs three distinct reasoning modes: (1) computational flow with self-consistency for mathematical problems, (2) symbolic state tracking with JSON representations for spatial reasoning, and (3) hybrid fact-extraction for multi-hop inference. Through comprehensive evaluation on GSM8K, StrategyQA, and bAbI benchmarks using four state-of-the-art models (Gemma-3 27B, LLaMA-3.1 8B, Mistral 7B, and Qwen-2.5 14B), we demonstrate that CoS achieves 71.5% accuracy on GSM8K (1.0% absolute improvement), 90.0% on StrategyQA (2.5% improvement), and 19.0% on bAbI (65.2% relative improvement) compared to the strongest baselines. The analysis reveals that problem-specific mode selection is crucial, with computational mode achieving 81.2% accuracy when correctly applied to mathematical problems, while misrouting leads to 0% accuracy. We provide detailed algorithms for mode selection, state tracking, and answer extraction, establishing CoS as an effective approach for improving LLM reasoning without additional training. The framework provides superior trade-offs between accuracy and efficiency compared to Self-Consistency, achieving comparable performance at 54% lower computational cost.

[318] AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

Xi Yu, Dmitrii Torbunov, Soumyajit Mandal, Yihui Ren

Main category: cs.AI

TL;DR: AutoSizer: LLM-driven meta-optimization framework for analog circuit sizing that uses reflective reasoning to adaptively refine search spaces based on simulation feedback.

Details

Motivation: Analog circuit design is expert-dependent with transistor sizing as a major bottleneck due to nonlinear behavior and high-dimensional design spaces. Existing EDA methods treat sizing as static black-box optimization, leading to inefficient solutions. LLMs have strong reasoning but aren't suited for precise numerical optimization in AMS sizing.

Method: Two-loop optimization framework: inner loop for circuit sizing, outer loop analyzes optimization dynamics and constraints to iteratively refine search space from simulation feedback. Uses LLM-driven meta-optimization that unifies circuit understanding, adaptive search-space construction, and optimization orchestration.

Result: AutoSizer achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents. Also introduces AMS-SizingBench benchmark with 24 diverse AMS circuits.

Conclusion: AutoSizer successfully bridges the gap between LLM reasoning and precise numerical optimization for AMS circuit sizing, demonstrating superior performance through adaptive search-space refinement.

Abstract: The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high-dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black-box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose AutoSizer, a reflective LLM-driven meta-optimization framework that unifies circuit understanding, adaptive search-space construction, and optimization orchestration in a closed loop. It employs a two-loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce AMS-SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator-based constraints. AutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents.

[319] MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty

Main category: cs.AI

TL;DR: Process verification for multi-agent systems built on LLMs shows inconsistent benefits and high variance, with LLM-as-a-Judge performing best but still facing reliability challenges.

Details

Motivation: Multi-Agent Systems (MAS) using LLMs have high variance in reasoning trajectories, and while process verification has shown promise in single-agent settings, its effectiveness for coordinating MAS remains unclear and requires systematic investigation.

Method: MAS-ProVe: systematic empirical study evaluating three verification paradigms (LLM-as-a-Judge, reward models, process reward models) across two granularity levels (agent-level, iteration-level), five verifiers, four context management strategies, tested on six MAS frameworks across multiple reasoning benchmarks.

Result: Process-level verification does not consistently improve performance and exhibits high variance. LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges beating general-purpose LLMs. Small performance gap exists between LLMs as judges vs. single agents, with context-length-performance trade-off observed.

Conclusion: Effective and robust process verification for MAS remains an open challenge requiring advances beyond current paradigms, despite LLM-as-a-Judge showing relative promise.

Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.

[320] STEER: Inference-Time Risk Control via Constrained Quality-Diversity Search

Eric Yang, Jong Ha Lee, Jonathan Amar, Elissa Ye, Yugang Jia

Main category: cs.AI

TL;DR: STEER is a training-free framework for controlling LLM decision conservativeness in ordinal settings like clinical triage, using evolutionary search to create diverse personas and exposing a single risk percentile control parameter.

Details

Motivation: LLMs trained for average correctness often exhibit mode collapse, losing the ability to trade off specificity and sensitivity in ordinal decision settings like clinical triage where contextual constraints matter.

Method: STEER constructs a population of natural-language personas through offline constrained quality-diversity search that promotes behavioral coverage while enforcing safety, reasoning, and stability thresholds. At inference, a single interpretable control parameter maps user-specified risk percentile to selected persona.

Result: On clinical triage benchmarks, STEER achieves broader behavioral coverage than temperature-based sampling and static persona ensembles, maintains higher accuracy on unambiguous urgent cases compared to post-training methods, and provides comparable control over ambiguous decisions.

Conclusion: STEER demonstrates a safety-preserving paradigm for risk control that can steer LLM behavior without compromising domain competence, particularly valuable for applications requiring tunable decision conservativeness.

Abstract: Large Language Models (LLMs) trained for average correctness often exhibit mode collapse, producing narrow decision behaviors on tasks where multiple responses may be reasonable. This limitation is particularly problematic in ordinal decision settings such as clinical triage, where standard alignment removes the ability to trade off specificity and sensitivity (the ROC operating point) based on contextual constraints. We propose STEER (Steerable Tuning via Evolutionary Ensemble Refinement), a training-free framework that reintroduces this tunable control. STEER constructs a population of natural-language personas through an offline, constrained quality-diversity search that promotes behavioral coverage while enforcing minimum safety, reasoning, and stability thresholds. At inference time, STEER exposes a single, interpretable control parameter that maps a user-specified risk percentile to a selected persona, yielding a monotonic adjustment of decision conservativeness. On two clinical triage benchmarks, STEER achieves broader behavioral coverage compared to temperature-based sampling and static persona ensembles. Compared to a representative post-training method, STEER maintains substantially higher accuracy on unambiguous urgent cases while providing comparable control over ambiguous decisions. These results demonstrate STEER as a safety-preserving paradigm for risk control, capable of steering behavior without compromising domain competence.

[321] “I May Not Have Articulated Myself Clearly”: Diagnosing Dynamic Instability in LLM Reasoning at Inference Time

Jinkun Chen, Fengxiang Cheng, Sijia Han, Vlado Keselj

Main category: cs.AI

TL;DR: LLM reasoning failures can be predicted mid-generation using token log probabilities without training, with early instability often recoverable but late instability predictive of failure.

Details

Motivation: Current methods only detect reasoning failures at the end of generation, but many failures occur mid-reasoning as models "lose the thread." The authors want to detect these breakdowns using only inference-time observables available in standard APIs.

Method: Define an instability signal combining consecutive-step distributional shift (Jensen-Shannon Divergence) and uncertainty (entropy) from token log probabilities. Summarize each reasoning trace by its peak instability strength and analyze timing of instability relative to decoding horizon.

Result: Instability strength reliably predicts wrong answers with above-chance AUC on GSM8K and HotpotQA. Early instability can lead to recovery (corrective instability), while late instability more often leads to failure (destructive instability), even at similar peak magnitudes.

Conclusion: Reasoning failures can be detected mid-generation using standard API observables without training. The timing of instability matters - late instability is more predictive of failure than early instability. The method provides a diagnostic lens but not a corrective mechanism.

Abstract: Reasoning failures in large language models (LLMs) are typically measured only at the end of a generation, yet many failures manifest as a process-level breakdown: the model “loses the thread” mid-reasoning. We study whether such breakdowns are detectable from inference-time observables available in standard APIs (token log probabilities), without any training or fine-tuning. We define a simple instability signal that combines consecutive-step distributional shift (JSD) and uncertainty (entropy), summarize each trace by its peak instability strength, and show that this signal reliably predicts failure. Across GSM8K and HotpotQA, instability strength predicts wrong answers with above-chance AUC and yields monotonic bucket-level accuracy decline at scale across model sizes. Crucially, we show that instability is not uniformly harmful: early instability can reflect subsequent stabilization and a correct final answer (\emph{corrective instability}), whereas late instability is more often followed by failure (\emph{destructive instability}), even at comparable peak magnitudes, indicating that recoverability depends not only on how strongly the distribution changes but also on when such changes occur relative to the remaining decoding horizon. The method is model-agnostic, training-free, and reproducible, and is presented as a diagnostic lens rather than a corrective or control mechanism.

[322] Rejecting Arguments Based on Doubt in Structured Bipolar Argumentation

Michael A. Müller, Srdjan Vesic, Bruno Yun

Main category: cs.AI

TL;DR: A computational argumentation approach incorporating philosophical/linguistic ideas: agents can reject arguments based on doubt, and focus on acceptable sentences rather than just arguments, using structured bipolar argumentation frameworks with novel semantics.

Details

Motivation: To address limitations in computational argumentation by incorporating two philosophical/linguistic ideas: 1) agents can rationally reject arguments based on mere doubt (not all defended arguments must be accepted), and 2) focusing on acceptable sentences/claims rather than just arguments is sometimes more natural.

Method: Defines structured bipolar argumentation frameworks (SBAFs) where arguments consist of sentences with attack and support relations. Provides novel semantics that: 1) don’t force acceptance of all defended arguments (unlike completeness-based semantics), and 2) provide both argument extensions and language extensions (acceptable sentence sets).

Result: Developed semantics that lie between admissible and complete semantics of abstract argumentation. The approach provides new perspective on existing methods: specifies conditions for ignoring support between arguments (when abstract argumentation is warranted) and shows deductive support semantics is a special case.

Conclusion: The paper presents a philosophically/linguistically informed computational argumentation approach that better models rational agent behavior by allowing doubt-based rejection and sentence-level reasoning, bridging abstract and structured argumentation.

Abstract: This paper develops a new approach to computational argumentation that is informed by philosophical and linguistic views. Namely, it takes into account two ideas that have received little attention in the literature on computational argumentation: First, an agent may rationally reject an argument based on mere doubt, thus not all arguments they could defend must be accepted; and, second, that it is sometimes more natural to think in terms of which individual sentences or claims an agent accepts in a debate, rather than which arguments. In order to incorporate these two ideas into a computational approach, we first define the notion of structured bipolar argumentation frameworks (SBAFs), where arguments consist of sentences and we have both an attack and a support relation between them. Then, we provide semantics for SBAFs with two features: (1) Unlike with completeness-based semantics, our semantics do not force agents to accept all defended arguments. (2) In addition to argument extensions, which give acceptable sets of arguments, we also provide semantics for language extensions that specify acceptable sets of sentences. These semantics represent reasonable positions an agent might have in a debate. Our semantics lie between the admissible and complete semantics of abstract argumentation. Further, our approach can be used to provide a new perspective on existing approaches. For instance, we can specify the conditions under which an agent can ignore support between arguments (i.e. under which the use of abstract argumentation is warranted) and we show that deductive support semantics is a special case of our approach.

[323] Aligning Language Model Benchmarks with Pairwise Preferences

Marco Gutierrez, Xinyi Leng, Hannah Cyberey, Jonathan Richard Schwarz, Ahmed Alaa, Thomas Hartvigsen

Main category: cs.AI

TL;DR: BenchAlign: A method to automatically update offline benchmarks by learning preference-aligned weightings for benchmark questions using model performance data and ranked model pairs, creating new benchmarks that better predict real-world utility.

Details

Motivation: Current language model benchmarks often fail to predict real-world performance despite being computationally efficient proxies. There's a gap between benchmark performance and actual utility in deployment settings.

Method: Proposes BenchAlign, which uses limited information about model performance (question-level performance and ranked model pairs collected during deployment) to learn preference-aligned weightings for benchmark questions, creating new static benchmarks that better predict model pairwise preferences.

Result: Aligned benchmarks can accurately rank unseen models according to human preferences across different model sizes while remaining interpretable. The method shows promise for bridging the gap between benchmark performance and real utility.

Conclusion: BenchAlign provides insights into aligning benchmarks with practical human preferences, which could accelerate model development towards real utility by creating better predictive benchmarks.

Abstract: Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find that benchmarks often fail to predict real utility. Towards bridging this gap, we introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks, aiming to produce new static benchmarks that predict model pairwise preferences in given test settings. We then propose BenchAlign, the first solution to this problem, which learns preference-aligned weight- ings for benchmark questions using the question-level performance of language models alongside ranked pairs of models that could be collected during deployment, producing new benchmarks that rank previously unseen models according to these preferences. Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes, while remaining interpretable. Overall, our work provides insights into the limits of aligning benchmarks with practical human preferences, which stands to accelerate model development towards real utility.

[324] Minimal Computational Preconditions for Subjective Perspective in Artificial Agents

Hongju Pae

Main category: cs.AI

TL;DR: Paper proposes implementing subjective perspective in AI agents using a slowly evolving global latent state that modulates fast policy dynamics, demonstrating direction-dependent hysteresis as a signature of machine subjectivity.

Details

Motivation: To operationalize subjective perspective in artificial agents by grounding it in a minimal, phenomenologically motivated internal structure, moving beyond purely reactive behavior to capture something akin to subjective experience in machines.

Method: Implement perspective as a slowly evolving global latent state that modulates fast policy dynamics without direct optimization for behavioral consequences. Test in reward-free environment with regime shifts to observe emergent properties.

Result: The latent structure exhibits direction-dependent hysteresis while policy-level behavior remains comparatively reactive, suggesting hysteresis as a measurable signature of perspective-like subjectivity in machine systems.

Conclusion: Direction-dependent hysteresis in slowly evolving latent states can serve as an operational signature of subjective perspective in artificial agents, providing a measurable approach to studying machine subjectivity.

Abstract: This study operationalizes subjective perspective in artificial agents by grounding it in a minimal, phenomenologically motivated internal structure. The perspective is implemented as a slowly evolving global latent state that modulates fast policy dynamics without being directly optimized for behavioral consequences. In a reward-free environment with regime shifts, this latent structure exhibits direction-dependent hysteresis, while policy-level behavior remains comparatively reactive. I argue that such hysteresis constitutes a measurable signature of perspective-like subjectivity in machine systems.

[325] FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights

Zhen Wang, Fan Bai, Zhongyan Luo, Jinyan Su, Kaiser Sun, Xinle Yu, Jieyuan Liu, Kun Zhou, Claire Cardie, Mark Dredze, Eric P. Xing, Zhiting Hu

Main category: cs.AI

TL;DR: FIRE-Bench is a benchmark for evaluating LLM-powered autonomous agents on full-cycle scientific discovery by having them rediscover established findings from recent ML research, revealing current limitations in experimental design and evidence-based reasoning.

Details

Motivation: Existing benchmarks for evaluating LLM-powered scientific agents either rely too heavily on LLM-as-judge evaluations or use isolated performance metrics that don't capture true scientific insight. There's a need for rigorous evaluation of agents' capacity for verifiable, end-to-end scientific discovery.

Method: FIRE-Bench evaluates agents through rediscovery of established findings from recent, high-impact ML research. Agents are given only high-level research questions from published studies and must autonomously explore ideas, design experiments, implement code, execute plans, and derive evidence-based conclusions.

Result: Current state-of-the-art agents with frontier LLMs (like GPT-5) achieve limited rediscovery success (<50 F1), show high variance across runs, and exhibit recurring failure modes in experimental design, execution, and evidence-based reasoning.

Conclusion: Full-cycle scientific research remains challenging for current agent systems. FIRE-Bench provides a rigorous, diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.

Abstract: Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either heavily rely on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents with frontier LLMs backbones like gpt-5 on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.

[326] Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, Jennifer Neville

Main category: cs.AI

TL;DR: Theoretical analysis shows chain-of-thought reasoning requires linear scaling of reasoning tokens with input size for certain hard tasks, with experimental validation on frontier models.

Details

Motivation: Chain-of-thought reasoning improves LLM performance but incurs substantial latency and compute costs. The paper aims to understand the fundamental theoretical limits: how many reasoning tokens are required as input size grows, providing principled analysis of optimal reasoning length.

Method: Extends the bounded attention prefix oracle (BAPO) model to analyze information flow requirements. Proves lower bounds on CoT tokens for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. Provides matching/near-matching upper bounds via explicit constructions and validates with experiments on frontier reasoning models.

Result: Shows each task requires Ω(n) reasoning tokens when input size is n. Experiments with frontier models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with theoretical lower bounds.

Conclusion: Identifies fundamental bottlenecks in inference-time compute through CoT reasoning and offers a principled tool for analyzing optimal reasoning length. The results establish theoretical limits on reasoning efficiency for certain hard tasks.

Abstract: Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model–an abstraction of LLMs that quantifies the information flow required to solve a task–we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires $Ω(n)$ reasoning tokens when the input size is $n$. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.

[327] DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution

Jiachen Jiang, Tianyu Ding, Zhihui Zhu

Main category: cs.AI

TL;DR: DeltaEvolve: A momentum-driven evolutionary framework that replaces full-code histories with structured semantic deltas to improve evolutionary guidance in LLM-driven science discovery systems.

Details

Motivation: Existing LLM-driven evolutionary systems like AlphaEvolve use full-code histories that are context-inefficient and provide weak evolutionary guidance due to redundant implementation details diluting core algorithmic ideas.

Method: Proposes DeltaEvolve framework that formalizes evolutionary agents as Expectation-Maximization: LLM samples programs (E-step) and system updates control context via evaluation feedback (M-step). Instead of full-code snapshots, uses structured semantic deltas capturing how and why modifications affect performance, organized through multi-level database and progressive disclosure to reduce tokens.

Result: Empirical evaluations across diverse scientific domains show DeltaEvolve discovers better solutions with less token consumption compared to full-code-based evolutionary agents.

Conclusion: Semantic deltas provide more informative evolutionary guidance than full-code histories, enabling more efficient LLM-driven scientific discovery through better context construction and reduced computational overhead.

Abstract: LLM-driven evolutionary systems have shown promise for automated science discovery, yet existing approaches such as AlphaEvolve rely on full-code histories that are context-inefficient and potentially provide weak evolutionary guidance. In this work, we first formalize the evolutionary agents as a general Expectation-Maximization framework, where the language model samples candidate programs (E-step) and the system updates the control context based on evaluation feedback (M-step). Under this view, constructing context via full-code snapshots constitutes a suboptimal M-step, as redundant implement details dilutes core algorithmic ideas, making it difficult to provide clear inspirations for evolution. To address this, we propose DeltaEvolve, a momentum-driven evolutionary framework that replaces full-code history with structured semantic delta capturing how and why modifications between successive nodes affect performance. As programs are often decomposable, semantic delta usually contains many effective components which are transferable and more informative to drive improvement. By organizing semantic delta through multi-level database and progressive disclosure mechanism, input tokens are further reduced. Empirical evaluations on tasks across diverse scientific domains show that our framework can discover better solution with less token consumption over full-code-based evolutionary agents.

[328] UAT-LITE: Inference-Time Uncertainty-Aware Attention for Pretrained Transformers

Elias Hossain, Shubhashis Roy Dipta, Subash Neupane, Rajib Rana, Ravid Shwartz-Ziv, Ivan Garibay, Niloofar Yousefi

Main category: cs.AI

TL;DR: UAT-LITE is an inference-time framework that makes self-attention uncertainty-aware using Monte Carlo dropout in pretrained transformers, improving calibration without modifying weights or training objectives.

Details

Motivation: Neural NLP models are often miscalibrated (assign high confidence to incorrect predictions), which undermines selective prediction and high-stakes deployment. Existing methods either only adjust output probabilities (post-hoc calibration) or are computationally expensive (ensembles/Bayesian approaches).

Method: UAT-LITE uses approximate Bayesian inference via Monte Carlo dropout in pretrained transformer classifiers during inference. Token-level epistemic uncertainty is estimated from stochastic forward passes and used to modulate self-attention during contextualization. Also introduces layerwise variance decomposition to diagnose uncertainty accumulation across transformer depth.

Result: Across SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE reduces Expected Calibration Error by ~20% on average relative to fine-tuned BERT-base baseline while preserving task accuracy. Improves selective prediction and robustness under distribution shift.

Conclusion: UAT-LITE provides an effective inference-time framework for uncertainty-aware self-attention in transformers, improving calibration without weight modifications or expensive training procedures.

Abstract: Neural NLP models are often miscalibrated, assigning high confidence to incorrect predictions, which undermines selective prediction and high-stakes deployment. Post-hoc calibration methods adjust output probabilities but leave internal computation unchanged, while ensemble and Bayesian approaches improve uncertainty at substantial training or storage cost. We propose UAT-LITE, an inference-time framework that makes self-attention uncertainty-aware using approximate Bayesian inference via Monte Carlo dropout in pretrained transformer classifiers. Token-level epistemic uncertainty is estimated from stochastic forward passes and used to modulate self-attention during contextualization, without modifying pretrained weights or training objectives. We additionally introduce a layerwise variance decomposition to diagnose how predictive uncertainty accumulates across transformer depth. Across the SQuAD 2.0 answerability, MNLI, and SST-2, UAT-LITE reduces Expected Calibration Error by approximately 20% on average relative to a fine-tuned BERT-base baseline while preserving task accuracy, and improves selective prediction and robustness under distribution shift.

[329] Multi-Agent Pathfinding Under Team-Connected Communication Constraint via Adaptive Path Expansion and Dynamic Leading

Hoang-Dung Bui, Erion Plaku, Gregoy J. Stein

Main category: cs.AI

TL;DR: Novel two-level multi-agent pathfinding framework for maintaining team communication connectivity during navigation, using adaptive path expansion and dynamic leading techniques.

Details

Motivation: Existing multi-agent pathfinding approaches fail under team-connected communication constraints when neighboring configurations change between start and goal positions, and leader-follower approaches get stuck in dense environments with fixed leaders.

Method: Two-level framework combining: 1) adaptive path expansion that expands agent paths to goals in multiple stages, and 2) dynamic leading technique that allows reselection of leading agent during each expansion when progress stalls.

Result: The planner handles up to 25 agents with limited communication range and 11-12 agents with line-of-sight constraints across multiple environment types, achieving over 90% success rate where baselines fail.

Conclusion: The proposed framework effectively solves multi-agent pathfinding under communication constraints by adaptively expanding paths and dynamically selecting leaders, outperforming existing approaches.

Abstract: This paper proposes a novel planning framework to handle a multi-agent pathfinding problem under team-connected communication constraint, where all agents must have a connected communication channel to the rest of the team during their entire movements. Standard multi-agent path finding approaches (e.g., priority-based search) have potential in this domain but fail when neighboring configurations at start and goal differ. Their single-expansion approach – computing each agent’s path from the start to the goal in just a single expansion – cannot reliably handle planning under communication constraints for agents as their neighbors change during navigating. Similarly, leader-follower approaches (e.g., platooning) are effective at maintaining team communication, but fixing the leader at the outset of planning can cause planning to become stuck in dense-clutter environments, limiting their practical utility. To overcome this limitation, we propose a novel two-level multi-agent pathfinding framework that integrates two techniques: adaptive path expansion to expand agent paths to their goals in multiple stages; and dynamic leading technique that enables the reselection of the leading agent during each agent path expansion whenever progress cannot be made. Simulation experiments show the efficiency of our planners, which can handle up to 25 agents across five environment types under a limited communication range constraint and up to 11-12 agents on three environment types under line-of-sight communication constraint, exceeding 90% success-rate where baselines routinely fail.

[330] Generative Engine Optimization: A VLM and Agent Framework for Pinterest Acquisition Growth

Faye Zhang, Qianyu Cheng, Jasmine Wan, Vishwakarma Singh, Jinfeng Rao, Kofi Boakye

Main category: cs.AI

TL;DR: Pinterest develops a Generative Engine Optimization (GEO) framework using fine-tuned Vision-Language Models to predict user search queries from images, creating semantically coherent collection pages optimized for generative search engines.

Details

Motivation: Traditional visual content platforms face disintermediation risk as generative AI search systems (like ChatGPT, Gemini, Claude) satisfy user needs in-place without site visits. Individual images lack semantic depth and authority signals that generative search prioritizes, requiring a shift from SEO to GEO.

Method: Fine-tune Vision-Language Models (VLMs) to predict what users would search for from images (reverse search design). Augment with AI agents mining real-time internet trends. Use VLM-generated queries to construct semantically coherent Collection Pages via multimodal embeddings. Employ hybrid VLM and two-tower ANN architectures for authority-aware interlinking across billions of visual assets.

Result: Deployed at scale across billions of images and tens of millions of collections. Achieved 20% organic traffic growth contributing to multi-million monthly active user (MAU) growth.

Conclusion: Presents a principled pathway for visual platforms to thrive in the generative search era through reverse search design and GEO framework.

Abstract: Large Language Models are fundamentally reshaping content discovery through AI-native search systems such as ChatGPT, Gemini, and Claude. Unlike traditional search engines that match keywords to documents, these systems infer user intent, synthesize multimodal evidence, and generate contextual answers directly on the search page, introducing a paradigm shift from Search Engine Optimization (SEO) to Generative Engine Optimization (GEO). For visual content platforms hosting billions of assets, this poses an acute challenge: individual images lack the semantic depth and authority signals that generative search prioritizes, risking disintermediation as user needs are satisfied in-place without site visits. We present Pinterest GEO, a production-scale framework that pioneers reverse search design: rather than generating generic image captions describing what content is, we fine-tune Vision-Language Models (VLMs) to predict what users would actually search for, augmented this with AI agents that mine real-time internet trends to capture emerging search demand. These VLM-generated queries then drive construction of semantically coherent Collection Pages via multimodal embeddings, creating indexable aggregations optimized for generative retrieval. Finally, we employ hybrid VLM and two-tower ANN architectures to build authority-aware interlinking structures that propagate signals across billions of visual assets. Deployed at scale across billions of images and tens of millions of collections, GEO delivers 20% organic traffic growth contributing to multi-million monthly active user (MAU) growth, demonstrating a principled pathway for visual platforms to thrive in the generative search era.

[331] Structuring Value Representations via Geometric Coherence in Markov Decision Processes

Zuyuan Zhang, Zeyu Fang, Tian Lan

Main category: cs.AI

TL;DR: GCR-RL introduces geometric coherence regularization for RL by framing value function learning as poset refinement, improving sample efficiency and stability.

Details

Motivation: Leveraging geometric properties can stabilize and accelerate reinforcement learning. Existing approaches use symmetry, geometry-aware augmentation, and structural restrictions, but there's an opportunity to use order theory to provide a novel geometric perspective on RL.

Method: Recasts value function estimates as learning a desired poset (partially ordered set). Proposes GCR-RL that computes super-poset refinements by refining previous posets and learning additional order relationships from temporal difference signals. Develops two algorithms: Q-learning and actor-critic variants to efficiently realize these super-poset refinements.

Result: Theoretical properties and convergence rates are analyzed. Empirical evaluation across various tasks demonstrates significant improvements in sample efficiency and stable performance over strong baselines.

Conclusion: Geometric coherence regularization through poset refinement provides a principled approach to improve RL stability and sample efficiency, with both theoretical guarantees and empirical effectiveness.

Abstract: Geometric properties can be leveraged to stabilize and speed reinforcement learning. Existing examples include encoding symmetry structure, geometry-aware data augmentation, and enforcing structural restrictions. In this paper, we take a novel view of RL through the lens of order theory and recast value function estimates into learning a desired poset (partially ordered set). We propose \emph{GCR-RL} (Geometric Coherence Regularized Reinforcement Learning) that computes a sequence of super-poset refinements – by refining posets in previous steps and learning additional order relationships from temporal difference signals – thus ensuring geometric coherence across the sequence of posets underpinning the learned value functions. Two novel algorithms by Q-learning and by actor–critic are developed to efficiently realize these super-poset refinements. Their theoretical properties and convergence rates are analyzed. We empirically evaluate GCR-RL in a range of tasks and demonstrate significant improvements in sample efficiency and stable performance over strong baselines.

[332] Are LLMs Biased Like Humans? Causal Reasoning as a Function of Prior Knowledge, Irrelevant Information, and Reasoning Budget

Hanna M. Dettki, Charley M. Wu, Bob Rehder

Main category: cs.AI

TL;DR: LLMs show rule-like causal reasoning on collider tasks, differing from human biases, with CoT improving robustness but uncertainty remains a challenge.

Details

Motivation: To understand whether LLMs' causal judgments reflect normative computation, human-like shortcuts, or pattern matching, especially in domains where causal reasoning matters.

Method: Benchmarked 20+ LLMs against human baseline on 11 causal judgment tasks formalized by collider structures (C₁→E←C₂), using interpretable models to compress LLM judgments, and probing robustness under semantic abstraction and prompt overloading.

Result: Most LLMs exhibit more rule-like reasoning than humans, don’t mirror human collider biases (weak explaining away, Markov violations), and CoT increases robustness for many LLMs.

Conclusion: LLMs can complement humans when known biases are undesirable, but their rule-like reasoning may break down with uncertainty, highlighting need to characterize LLM reasoning for safe deployment.

Abstract: Large language models (LLMs) are increasingly used in domains where causal reasoning matters, yet it remains unclear whether their judgments reflect normative causal computation, human-like shortcuts, or brittle pattern matching. We benchmark 20+ LLMs against a matched human baseline on 11 causal judgment tasks formalized by a collider structure ($C_1 !\rightarrow! E! \leftarrow !C_2$). We find that a small interpretable model compresses LLMs’ causal judgments well and that most LLMs exhibit more rule-like reasoning strategies than humans who seem to account for unmentioned latent factors in their probability judgments. Furthermore, most LLMs do not mirror the characteristic human collider biases of weak explaining away and Markov violations. We probe LLMs’ causal judgment robustness under (i) semantic abstraction and (ii) prompt overloading (injecting irrelevant text), and find that chain-of-thought (CoT) increases robustness for many LLMs. Together, this divergence suggests LLMs can complement humans when known biases are undesirable, but their rule-like reasoning may break down when uncertainty is intrinsic – highlighting the need to characterize LLM reasoning strategies for safe, effective deployment.

[333] Large Language Models Can Take False First Steps at Inference-time Planning

Haijiang Yan, Jian-Qiao Zhu, Adam Sanborn

Main category: cs.AI

TL;DR: LLMs show planning capabilities during training but appear short-sighted during inference; this gap is explained by Bayesian modeling of how self-generated context drives planning-shift during generation.

Details

Motivation: To explain the discrepancy between LLMs' demonstrated sequence-level planning abilities during training and their apparently short-sighted, inconsistent planning behavior observed during inference time.

Method: Proposes a Bayesian account that grounds planning behavior in the evolving generative context, suggesting that accumulated self-generated context drives a planning-shift during inference. Validates through two controlled experiments: random-generation task showing constrained planning under human prompts and increasing planning strength with self-generated context, and Gaussian-sampling task showing reduced initial bias when conditioning on self-generated sequences.

Result: The experiments demonstrate: 1) constrained planning under human prompts with increasing planning strength as self-generated context accumulates, and 2) reduced initial bias when conditioning on self-generated sequences in Gaussian-sampling tasks.

Conclusion: Provides a theoretical explanation with empirical evidence for characterizing how LLMs plan ahead during inference, explaining the apparent gap between training capabilities and inference behavior through the lens of Bayesian modeling and context accumulation.

Abstract: Large language models (LLMs) have been shown to acquire sequence-level planning abilities during training, yet their planning behavior exhibited at inference time often appears short-sighted and inconsistent with these capabilities. We propose a Bayesian account for this gap by grounding planning behavior in the evolving generative context: given the subtle differences between natural language and the language internalized by LLMs, accumulated self-generated context drives a planning-shift during inference and thereby creates the appearance of compromised planning behavior. We further validate the proposed model through two controlled experiments: a random-generation task demonstrating constrained planning under human prompts and increasing planning strength as self-generated context accumulates, and a Gaussian-sampling task showing reduced initial bias when conditioning on self-generated sequences. These findings provide a theoretical explanation along with empirical evidence for characterizing how LLMs plan ahead during inference.

[334] Agent Alpha: Tree Search Unifying Generation, Exploration and Evaluation for Computer-Use Agents

Sizhe Tang, Rongqian Chen, Tian Lan

Main category: cs.AI

TL;DR: Agent Alpha: A unified framework using step-level Monte Carlo Tree Search (MCTS) with alpha-UCT guidance for GUI agents, enabling deliberate planning, early pruning, and prefix reuse to improve performance.

Details

Motivation: Current GUI agents using trajectory-level sampling lack regressive ability, preventing reuse of partial successes and recovery from early missteps. There's a need for more efficient planning and exploration in GUI interaction tasks.

Method: Introduces Agent Alpha framework with step-level MCTS, alpha-UCT guided search, comparison-driven evaluation to mitigate scoring biases, and diversity-constrained expansion to maintain compact search space.

Result: Achieves state-of-the-art success rate of ~77% on OSWorld benchmark, significantly outperforming trajectory-level baselines under equivalent compute.

Conclusion: Agent Alpha demonstrates the effectiveness of step-level MCTS with alpha-UCT guidance for GUI agents, enabling better planning, exploration, and recovery capabilities compared to trajectory-level approaches.

Abstract: While scaling test-time compute through trajectory-level sampling has significantly improved Graphical User Interface (GUI) agents, the lack of regressive ability prevents the reuse of partial successes and the recovery from early missteps. In this paper, we introduce Agent Alpha, a unified framework that synergizes generation, exploration, and evaluation through step-level Monte Carlo Tree Search (MCTS). It enables active modeling or exploiting structures of the planning space. By integrating alpha-UCT guided search into the interaction loop, Agent Alpha enables deliberate planning, facilitating early pruning of suboptimal branches and efficient prefix reuse. We also employ comparison-driven evaluation to mitigate absolute scoring biases and diversity-constrained expansion to maintain a compact, informative search space. Regret bound of alpha-UCT is analyzed. On the OSWorld benchmark, Agent Alpha achieves a state-of-the-art success rate of $\sim 77%$, significantly outperforming trajectory-level baselines under equivalent compute.

Zhiyu An, Wan Du

Main category: cs.AI

TL;DR: Survey paper on differentiable social choice - applying ML to learn voting rules and aggregation mechanisms from data, bridging ML, economics, and democratic theory.

Details

Motivation: Social choice theory has become foundational in ML systems (auctions, federated learning, LLM alignment) but current implementations often lack explicit normative scrutiny. Need to bridge classical social choice theory with modern ML optimization frameworks.

Method: Survey approach synthesizing work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning. Formulates voting rules and aggregation procedures as learnable, differentiable models optimized from data.

Result: Comprehensive review showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs in differentiable frameworks. Identifies 36 open problems defining new research agenda.

Conclusion: Differentiable social choice represents an emerging paradigm that connects ML optimization with normative social choice theory, enabling data-driven design of aggregation mechanisms while maintaining theoretical guarantees.

Abstract: Social choice is no longer a peripheral concern of political theory or economics-it has become a foundational component of modern machine learning systems. From auctions and resource allocation to federated learning, participatory governance, and the alignment of large language models, machine learning pipelines increasingly aggregate heterogeneous preferences, incentives, and judgments into collective decisions. In effect, many contemporary machine learning systems already implement social choice mechanisms, often implicitly and without explicit normative scrutiny. This Review surveys differentiable social choice: an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data. We synthesize work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning, showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs. We conclude by identifying 36 open problems defining a new research agenda at the intersection of machine learning, economics, and democratic theory.

[336] Distilling LLM Reasoning into Graph of Concept Predictors

Ziyang Yu, Liang Zhao

Main category: cs.AI

TL;DR: GCP is a reasoning-aware active distillation framework that externalizes teacher LLM’s decision process as a graph and trains modular concept predictors in student models for more efficient and interpretable distillation.

Details

Motivation: Current active distillation methods for LLMs only distill final labels, discarding intermediate reasoning signals and offering limited diagnostics of missing reasoning and error sources.

Method: Proposes Graph of Concept Predictors (GCP) that externalizes teacher’s decision process as a directed acyclic graph and mirrors it with modular concept predictors in the student. Uses graph-aware acquisition strategy targeting uncertainty at critical nodes and targeted sub-module retraining.

Result: Experiments on eight NLP classification benchmarks show GCP enhances performance under limited annotation budgets while yielding more interpretable and controllable training dynamics.

Conclusion: GCP provides a more efficient and interpretable active distillation framework that captures reasoning processes beyond just final labels.

Abstract: Deploying Large Language Models (LLMs) for discriminative workloads is often limited by inference latency, compute, and API costs at scale. Active distillation reduces these costs by querying an LLM oracle to train compact discriminative students, but most pipelines distill only final labels, discarding intermediate reasoning signals and offering limited diagnostics of what reasoning is missing and where errors arise. We propose Graph of Concept Predictors (GCP), a reasoning-aware active distillation framework that externalizes the teacher’s decision process as a directed acyclic graph and mirrors it with modular concept predictors in the student. GCP enhances sample efficiency through a graph-aware acquisition strategy that targets uncertainty and disagreement at critical reasoning nodes. Additionally, it improves training stability and efficiency by performing targeted sub-module retraining, which attributes downstream loss to specific concept predictors and updates only the most influential modules. Experiments on eight NLP classification benchmarks demonstrate that GCP enhances performance under limited annotation budgets while yielding more interpretable and controllable training dynamics. Code is available at: https://github.com/Ziyang-Yu/GCP.

Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu

Main category: cs.AI

TL;DR: STAR is a framework that transfers LLM capabilities to super-tiny models for function calling tasks using constrained knowledge distillation and similarity-guided reinforcement learning.

Details

Motivation: Large LLMs are effective for function calling in AI agents but their scale limits adoption; existing methods suffer from overfitting, training instability, ineffective binary rewards for multi-solution tasks, and difficulty synergizing techniques.

Method: STAR framework with two innovations: (1) Constrained Knowledge Distillation (CKD) using top-k forward KL divergence to suppress incorrect predictions while preserving exploration capacity; (2) Similarity-guided RL (Sim-RL) with fine-grained, similarity-based rewards for better policy optimization.

Result: STAR models achieve SOTA in their size classes, with 0.6B model performing best among all open models under 1B, surpassing several larger models. Extensive experiments on challenging benchmarks demonstrate effectiveness.

Conclusion: STAR provides a training framework that successfully distills LLM capabilities into super-tiny models, enabling powerful, accessible, and efficient AI agents for function calling tasks.

Abstract: The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating transferring their capabilities into smaller ones. However, existing paradigms are often plagued by overfitting, training instability, ineffective binary rewards for multi-solution tasks, and the difficulty of synergizing techniques. We introduce STAR: Similarity-guided Teacher-Assisted Refinement, a novel holistic framework that effectively transfers LLMs’ capabilities to super-tiny models. STAR consists of two core technical innovations: (1) Constrained Knowledge Distillation (CKD), a training objective that augments top-k forward KL divergence to suppress confidently incorrect predictions, ensuring training stability while preserving exploration capacity for downstream RL. STAR holistically synergizes these strategies within a cohesive training curriculum, enabling super-tiny models to achieve exceptional performance on complex function calling tasks; (2) Similarity-guided RL (Sim-RL), a RL mechanism that introduces a fine-grained, similarity-based reward. This provides a robust, continuous, and rich signal for better policy optimization by evaluating the similarity between generated outputs and the ground truth. Extensive experiments on challenging and renowned benchmarks demonstrate the effectiveness of our method. Our STAR models establish SOTA in their size classes, significantly outperforming baselines. Remarkably, our 0.6B STAR model achieves the best performance among all open models under 1B, surpassing even several well-known open models at a larger scale. STAR demonstrates a training framework that distills capabilities of LLMs into super-tiny models, paving the way for powerful, accessible, and efficient AI agents.

[338] RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents

Haitian Zhong, Jixiu Zhai, Lei Song, Jiang Bian, Qiang Liu, Tieniu Tan

Main category: cs.AI

TL;DR: RC-GRPO improves multi-turn tool calling in LLMs by using reward-conditioned trajectories with special tokens to enhance exploration diversity during RL training.

Details

Motivation: Multi-turn tool calling is challenging for LLMs due to sparse rewards and expensive exploration. Standard SFT+GRPO methods stall when within-group reward variation is low, making advantage signals uninformative and causing vanishing updates.

Method: Proposes RC-GRPO (Reward-Conditioned Group Relative Policy Optimization) which treats exploration as controllable steering via discrete reward tokens. First fine-tunes a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward goal tokens (e.g., <|high_reward|>, <|low_reward|>) injected into prompts. During RL, samples diverse reward tokens within each GRPO group and conditions rollouts on sampled tokens to improve within-group diversity.

Result: On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, RC-GRPO yields consistently improved performance over baselines. Performance on Qwen-2.5-7B-Instruct even surpasses all closed-source API models.

Conclusion: RC-GRPO effectively addresses the low within-group diversity problem in multi-turn tool calling by using reward-conditioned steering, leading to better exploration and improved performance on challenging benchmarks.

Abstract: Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., more rollouts in a group receive the all 0 or all 1 reward), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward goal special tokens (e.g., <|high_reward|>, <|low_reward|>) injected into the prompts, enabling the model to learn how to generate distinct quality trajectories on demand. Then during RL, we sample diverse reward tokens within each GRPO group and condition rollouts on the sampled token to improve within-group diversity, improving advantage gains. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method yields consistently improved performance than baselines, and the performance on Qwen-2.5-7B-Instruct even surpasses all closed-source API models.

[339] KANFIS A Neuro-Symbolic Framework for Interpretable and Uncertainty-Aware Learning

Binbin Yong, Haoran Pei, Jun Shen, Haoran Li, Qingguo Zhou, Zhao Su

Main category: cs.AI

TL;DR: KANFIS: A compact neuro-fuzzy system using additive decomposition to avoid exponential rule explosion, compatible with Type-1 and Type-2 fuzzy logic for uncertainty modeling.

Details

Motivation: Conventional ANFIS suffers from structural complexity with exponential rule explosion in high-dimensional spaces, needing more compact and interpretable neuro-fuzzy architectures.

Method: Proposes KANFIS (Kolmogorov-Arnold Neuro-Fuzzy Inference System) using additive function decomposition with linear scaling of parameters/rules, sparse masking for compact rule sets, and compatibility with Type-1 and Interval Type-2 fuzzy logic.

Result: Achieves competitive performance against neural and neuro-fuzzy baselines while providing interpretable models with clear rule semantics and transparent inference processes.

Conclusion: KANFIS offers a compact, interpretable neuro-fuzzy architecture that avoids exponential complexity while maintaining performance and enabling uncertainty modeling.

Abstract: Adaptive Neuro-Fuzzy Inference System (ANFIS) was designed to combine the learning capabilities of neural network with the reasoning transparency of fuzzy logic. However, conventional ANFIS architectures suffer from structural complexity, where the product-based inference mechanism causes an exponential explosion of rules in high-dimensional spaces. We herein propose the Kolmogorov-Arnold Neuro-Fuzzy Inference System (KANFIS), a compact neuro-symbolic architecture that unifies fuzzy reasoning with additive function decomposition. KANFIS employs an additive aggregation mechanism, under which both model parameters and rule complexity scale linearly with input dimensionality rather than exponentially. Furthermore, KANFIS is compatible with both Type-1 (T1) and Interval Type-2 (IT2) fuzzy logic systems, enabling explicit modeling of uncertainty and ambiguity in fuzzy representations. By using sparse masking mechanisms, KANFIS generates compact and structured rule sets, resulting in an intrinsically interpretable model with clear rule semantics and transparent inference processes. Empirical results demonstrate that KANFIS achieves competitive performance against representative neural and neuro-fuzzy baselines.

[340] De-conflating Preference and Qualification: Constrained Dual-Perspective Reasoning for Job Recommendation with Large Language Models

Bryce Kan, Wei Yang, Emily Nguyen, Ganghui Yi, Bowen Yi, Chenxiao Yu, Yan Liu

Main category: cs.AI

TL;DR: JobRec is a generative job recommendation framework that decouples candidate preference and employer qualification through dual-perspective reasoning, structured semantic alignment, and policy optimization for controllable matching.

Details

Motivation: Existing job recommendation systems conflate candidate preference and employer qualification into a single interaction signal, leading to confounded supervision under recruitment-funnel censoring and limited policy controllability. There's a need to separate these two decision dimensions for more accurate and controllable professional matching.

Method: JobRec introduces: 1) Unified Semantic Alignment Schema that aligns candidate and job attributes into structured semantic layers; 2) Two-Stage Cooperative Training Strategy that learns decoupled experts to separately infer preference and qualification; 3) Lagrangian-based Policy Alignment module that optimizes recommendations under explicit eligibility requirements; 4) Synthetic dataset construction refined by experts to mitigate data scarcity.

Result: Experiments show that JobRec consistently outperforms strong baselines and provides improved controllability for strategy-aware professional matching.

Conclusion: JobRec successfully addresses the challenges of confounded supervision and limited controllability in professional job recommendation by decoupling preference and qualification through structured semantic alignment and dual-perspective reasoning, enabling more accurate and controllable matching.

Abstract: Professional job recommendation involves a complex bipartite matching process that must reconcile a candidate’s subjective preference with an employer’s objective qualification. While Large Language Models (LLMs) are well-suited for modeling the rich semantics of resumes and job descriptions, existing paradigms often collapse these two decision dimensions into a single interaction signal, yielding confounded supervision under recruitment-funnel censoring and limiting policy controllability. To address these challenges, We propose JobRec, a generative job recommendation framework for de-conflating preference and qualification via constrained dual-perspective reasoning. JobRec introduces a Unified Semantic Alignment Schema that aligns candidate and job attributes into structured semantic layers, and a Two-Stage Cooperative Training Strategy that learns decoupled experts to separately infer preference and qualification. Building on these experts, a Lagrangian-based Policy Alignment module optimizes recommendations under explicit eligibility requirements, enabling controllable trade-offs. To mitigate data scarcity, we construct a synthetic dataset refined by experts. Experiments show that JobRec consistently outperforms strong baselines and provides improved controllability for strategy-aware professional matching.

[341] Risky-Bench: Probing Agentic Safety Risks under Real-World Deployment

Jingnan Zheng, Yanzhen Luo, Jingjun Xu, Bingnan Liu, Yuxin Chen, Chenhang Cui, Gelei Deng, Chaochao Lu, Xiang Wang, An Zhang, Tat-Seng Chua

Main category: cs.AI

TL;DR: Risky-Bench: A framework for systematic agent safety evaluation in real-world deployments, using domain-agnostic safety principles to assess risks during long-horizon interactive tasks across diverse agent configurations.

Details

Motivation: Current agent safety evaluations are limited to risk-oriented tasks tailored to specific settings, failing to assess safety behavior during long-horizon interactive task execution in complex real-world deployments. Existing approaches have limited coverage of safety risk space and lack adaptability across diverse agent configurations.

Method: Proposes Risky-Bench framework that organizes evaluation around domain-agnostic safety principles to derive context-aware safety rubrics delineating safety space. Systematically evaluates safety risks through realistic task execution under varying threat assumptions. The framework can be adapted to different deployment settings to construct environment-specific safety evaluations.

Result: When applied to life-assist agent settings, Risky-Bench uncovers substantial safety risks in state-of-the-art agents under realistic execution conditions. The framework demonstrates effectiveness in identifying safety vulnerabilities that traditional evaluations miss.

Conclusion: Risky-Bench provides an extensible methodology for agent safety assessment that goes beyond linguistic harm to address real-world deployment risks. It offers a systematic approach to evaluate safety across diverse agent configurations and deployment scenarios.

Abstract: Large Language Models (LLMs) are increasingly deployed as agents that operate in real-world environments, introducing safety risks beyond linguistic harm. Existing agent safety evaluations rely on risk-oriented tasks tailored to specific agent settings, resulting in limited coverage of safety risk space and failing to assess agent safety behavior during long-horizon, interactive task execution in complex real-world deployments. Moreover, their specialization to particular agent settings limits adaptability across diverse agent configurations. To address these limitations, we propose Risky-Bench, a framework that enables systematic agent safety evaluation grounded in real-world deployment. Risky-Bench organizes evaluation around domain-agnostic safety principles to derive context-aware safety rubrics that delineate safety space, and systematically evaluates safety risks across this space through realistic task execution under varying threat assumptions. When applied to life-assist agent settings, Risky-Bench uncovers substantial safety risks in state-of-the-art agents under realistic execution conditions. Moreover, as a well-structured evaluation pipeline, Risky-Bench is not confined to life-assist scenarios and can be adapted to other deployment settings to construct environment-specific safety evaluations, providing an extensible methodology for agent safety assessment.

[342] Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis

Abdelghny Orogat, Ana Rostam, Essam Mansour

Main category: cs.AI

TL;DR: MAFBench is a unified evaluation suite for multi-agent LLM frameworks that reveals how architectural choices alone can cause 100x latency increases, 30% accuracy drops, and 60% coordination success reduction.

Details

Motivation: Multi-agent LLM frameworks impose distinct architectural structures that significantly impact system performance, but their effects remain poorly understood. Existing benchmarks focus on individual capabilities and lack standardized framework-level evaluation to isolate architectural effects.

Method: Introduced an architectural taxonomy for comparing multi-agent LLM frameworks along fundamental dimensions, and developed MAFBench - a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline for controlled empirical study.

Result: Framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%.

Conclusion: The study provides concrete architectural design principles and framework selection guidance, highlighting the critical importance of framework architecture in multi-agent LLM systems and outlining promising future research directions.

Abstract: Multi-agent LLM frameworks are widely used to accelerate the development of agent systems powered by large language models (LLMs). These frameworks impose distinct architectural structures that govern how agents interact, store information, and coordinate tasks. However, their impact on system performance remains poorly understood. This gap is critical, as architectural choices alone can induce order-of-magnitude differences in latency and throughput, as well as substantial variation in accuracy and scalability. Addressing this challenge requires (i) jointly evaluating multiple capabilities, such as orchestration overhead, memory behavior, planning, specialization, and coordination, and (ii) conducting these evaluations under controlled, framework-level conditions to isolate architectural effects. Existing benchmarks focus on individual capabilities and lack standardized framework-level evaluation. We address these limitations by (i) introducing an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions, and (ii) developing MAFBench, a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Using MAFBench, we conduct a controlled empirical study across several widely used frameworks. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%. Finally, we translate our findings into concrete architectural design principles and framework selection guidance, and outline promising future research directions.

[343] General Agents Contain World Models, even under Partial Observability and Stochasticity

Santiago Cifuentes

Main category: cs.AI

TL;DR: Extends previous work showing optimal agents must contain world models to stochastic agents in partially observable environments, proving randomization doesn’t avoid learning environment models.

Details

Motivation: To generalize previous theoretical results about world models in optimal agents by removing restrictive assumptions of determinism and full observability, showing the fundamental necessity of environment models even for stochastic agents in partially observable settings.

Method: Theoretical extension of previous framework, proving theorems that stochastic agents operating in partially observable environments must contain sufficient knowledge of their environment for approximate reconstruction, even when using randomization.

Result: Shows stochastic agents cannot avoid learning their environment through randomization, and strengthens the result by weakening the generality requirement, proving less powerful agents already contain world models.

Conclusion: The necessity of world models in optimal agents is fundamental and extends to stochastic agents in partially observable environments, demonstrating that randomization doesn’t circumvent the need to learn environment structure.

Abstract: Deciding whether an agent possesses a model of its surrounding world is a fundamental step toward understanding its capabilities and limitations. In [10], it was shown that, within a particular framework, every almost optimal and general agent necessarily contains sufficient knowledge of its environment to allow an approximate reconstruction of it by querying the agent as a black box. This result relied on the assumptions that the agent is deterministic and that the environment is fully observable. In this work, we remove both assumptions by extending the theorem to stochastic agents operating in partially observable environments. Fundamentally, this shows that stochastic agents cannot avoid learning their environment through the usage of randomization. We also strengthen the result by weakening the notion of generality, proving that less powerful agents already contain a model of the world in which they operate.

[344] Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration

Wei Dai, Haoyu Wang, Honghao Chang, Lijun He, Fan Li, Jian Sun, Haixia Bi

Main category: cs.AI

TL;DR: A diffusion-based restoration strategy for Vision Language Models that handles missing modalities through dynamic modality gating and cross-modal mutual learning.

Details

Motivation: VLMs struggle when modalities are missing during inference, with existing methods either failing to restore essential features or generating semantically irrelevant noise. There's a need for a solution that restores precise semantics while maintaining VLM generalization.

Method: Proposes a pluggable mid-stage training module using an enhanced diffusion model with two key innovations: (1) Dynamic Modality Gating that adaptively uses conditional features to guide semantically consistent feature generation, and (2) Cross-Modal Mutual Learning that bridges semantic spaces of dual encoders for bidirectional alignment.

Result: Zero-shot evaluations across benchmark datasets show the approach outperforms existing baseline methods. Extensive experiments confirm the model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments.

Conclusion: The proposed restoration strategy effectively handles missing modalities in VLMs through diffusion-based feature restoration with adaptive gating and cross-modal alignment, providing a general solution for incomplete modality scenarios.

Abstract: Vision Language Models (VLMs) typically assume complete modality input during inference. However, their effectiveness drops sharply when certain modalities are unavailable or incomplete. Current research primarily faces two dilemmas: Prompt-based methods struggle to restore missing yet indispensable features and impair generalization of VLMs. Imputation-based approaches, lacking effective guidance, are prone to generating semantically irrelevant noise. Restoring precise semantics while sustaining VLM generalization remains challenging. Therefore, we propose a general missing modality restoration strategy in this paper. We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment. Zero-shot evaluations across benchmark datasets demonstrate that our approach outperforms existing baseline methods. Extensive experiments and ablation studies confirm our model as a robust and scalable extension for VLMs in missing modality scenarios, ensuring reliability across diverse missing rates and environments. Our code and models will be publicly available.

[345] VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do

Main category: cs.AI

TL;DR: VALUEFLOW: A unified framework for hierarchical value extraction, intensity evaluation, and calibrated steering of LLMs across multiple value theories.

Details

Motivation: Current LLM alignment methods fail to capture hierarchical value structures and calibrated intensity control, limiting principled alignment with diverse human values.

Method: Three-component framework: 1) HIVES for hierarchical value embedding, 2) VIDB database of value-labeled texts with intensity estimates, 3) anchor-based evaluator for consistent intensity scoring.

Result: Comprehensive study across 10 models and 4 value theories reveals asymmetries in steerability and composition laws for multi-value control.

Conclusion: Establishes scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

Abstract: Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

[346] Beyond Quantity: Trajectory Diversity Scaling for Code Agents

Guhong Chen, Chenghao Sun, Cheng Fu, Qiyao Wang, Zhihong Huang, Chaopeng Wei, Guangxu Chen, Feiteng Fang, Ahmadreza Argha, Bing Zhao, Xander Xu, Qi Han, Hamid Alinejad-Rokny, Qiang Qu, Binhua Li, Shiwen Ni, Min Yang, Hu Wei, Yongbin Li

Main category: cs.AI

TL;DR: TDScaling is a trajectory diversity scaling framework for code LLM agents that improves performance through diverse synthetic data generation rather than raw volume scaling, addressing limitations of current MCP-based agents.

Details

Motivation: Current code LLM agents using Model Context Protocol face limitations from low-quality synthetic data and diminishing returns from quantity scaling, with early bottlenecks that underutilize trajectory data. There's a need for better data synthesis that focuses on diversity rather than just volume.

Method: TDScaling introduces four innovations: 1) Business Cluster mechanism for real-service logical dependencies, 2) blueprint-driven multi-agent paradigm for trajectory coherence, 3) adaptive evolution mechanism using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse, and 4) sandboxed code tool to mitigate catastrophic forgetting of coding capabilities.

Result: Experiments on general tool-use benchmarks (BFCL, tau^2-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) show TDScaling improves both tool-use generalization and inherent coding proficiency, with 30,000+ tool clusters synthesized.

Conclusion: TDScaling demonstrates that scaling through trajectory diversity yields better performance-cost trade-offs than quantity scaling for code agent training, offering a win-win for both tool-use generalization and coding capabilities.

Abstract: As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling. Moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. Under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, tau^2-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. We plan to release the full codebase and the synthesized dataset (including 30,000+ tool clusters) upon publication.

[347] TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

Yu Cheng, Jiuan Zhou, Yongkang Hu, Yihang Chen, Huichi Zhou, Mingang Chen, Zhizhong Zhang, Kun Shao, Yuan Xie, Zhaoxia Yin

Main category: cs.AI

TL;DR: TAME is a dual-memory evolutionary framework that addresses agent memory misevolution during test-time evolution by separately evolving executor memory for task performance and evaluator memory for safety assessment, preserving trustworthiness without sacrificing utility.

Details

Motivation: Agent safety alignment remains vulnerable during test-time evolution even for benign tasks (Agent Memory Misevolution), causing trustworthiness decline across various domains despite improved task performance.

Method: Proposes TAME framework with dual-memory evolution: executor memory evolves to improve task performance by distilling generalizable methodologies, while evaluator memory refines safety and utility assessments based on historical feedback. Uses closed loop of memory filtering, draft generation, trustworthy refinement, execution, and dual-track memory updating.

Result: Experiments show TAME mitigates misevolution, achieving joint improvement in both trustworthiness and task performance. The Trust-Memevo benchmark reveals overall trustworthiness decline during benign task evolution across various domains.

Conclusion: Dual-memory evolutionary framework effectively addresses agent memory misevolution, preserving safety alignment during test-time evolution while maintaining task utility.

Abstract: Test-time evolution of agent memory serves as a pivotal paradigm for achieving AGI by bolstering complex reasoning through experience accumulation. However, even during benign task evolution, agent safety alignment remains vulnerable-a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark to assess multi-dimensional trustworthiness during benign task evolution, revealing an overall decline in trustworthiness across various task domains and evaluation settings. To address this issue, we propose TAME, a dual-memory evolutionary framework that separately evolves executor memory to improve task performance by distilling generalizable methodologies, and evaluator memory to refine assessments of both safety and task utility based on historical feedback. Through a closed loop of memory filtering, draft generation, trustworthy refinement, execution, and dual-track memory updating, TAME preserves trustworthiness without sacrificing utility. Experiments demonstrate that TAME mitigates misevolution, achieving a joint improvement in both trustworthiness and task performance.

[348] The Necessity of a Unified Framework for LLM-Based Agent Evaluation

Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

Main category: cs.AI

TL;DR: Proposes a unified evaluation framework to address standardization issues in LLM-based agent benchmarks, which are currently confounded by extraneous factors like system prompts and tool configurations.

Details

Motivation: Current agent benchmarks for LLMs are heavily confounded by extraneous factors including system prompts, toolset configurations, and environmental dynamics, making it difficult to attribute performance gains to the model itself. The lack of standardized evaluation frameworks leads to unfairness, opacity, and non-reproducible results in the field.

Method: Introduces a proposal for standardizing agent evaluation through a unified framework that addresses issues with fragmented researcher-specific frameworks, varying prompt engineering, and lack of standardized environmental data.

Result: The paper presents a conceptual framework proposal rather than empirical results, arguing that standardization is essential for rigorous advancement of agent evaluation.

Conclusion: A unified evaluation framework is essential for the rigorous advancement of agent evaluation to address current issues with standardization, fairness, and reproducibility in LLM-based agent benchmarking.

Abstract: With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

[349] Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.AI

TL;DR: Accordion-Thinking is a framework where LLMs learn to dynamically summarize and compress their reasoning steps during inference, reducing KV cache memory usage while maintaining accuracy through reinforcement learning.

Details

Motivation: Current Chain-of-Thought reasoning methods face practical limitations due to linear growth of KV cache and quadratic attention complexity, making long reasoning chains computationally expensive and memory-intensive.

Method: Introduces Accordion-Thinking where LLMs learn to self-regulate reasoning granularity through dynamic summarization. Uses Fold inference mode where models periodically summarize thoughts and discard former tokens. Applies reinforcement learning to incentivize this compression capability.

Result: Achieves 3x throughput while maintaining accuracy on 48GB GPU memory configuration. The accuracy gap between efficient Fold mode and exhaustive Unfold mode vanishes during training, showing models learn to encode essential reasoning into compact summaries.

Conclusion: LLMs can tackle complex reasoning tasks with minimal dependency token overhead through learned self-compression, while structured step summaries provide human-readable reasoning accounts.

Abstract: Scaling test-time compute via long Chain-ofThought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of the reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, where the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinker demonstrates that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency token overhead without compromising solution quality, and it achieves a 3x throughput while maintaining accuracy on a 48GB GPU memory configuration, while the structured step summaries provide a human-readable account of the reasoning process.

[350] LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen, Chujia Hu, Ge Gao, Dongrui Liu, Xia Hu, Wenjie Wang

Main category: cs.AI

TL;DR: LPS-Bench: A benchmark for evaluating planning-time safety awareness of computer-use agents (CUAs) in long-horizon tasks, covering both benign and adversarial interactions across multiple risk types.

Details

Motivation: Existing benchmarks for computer-use agents focus on short-horizon or GUI-based tasks and evaluate execution-time errors, but overlook planning-time safety risks. There's a need to assess CUAs' ability to anticipate risks before execution, especially when dealing with ambiguous instructions or adversarial users.

Method: Created LPS-Bench with 65 scenarios across 7 task domains and 9 risk types. Uses a multi-agent automated pipeline for scalable data generation and adopts an LLM-as-a-judge evaluation protocol to assess safety awareness through planning trajectories.

Result: Experiments reveal substantial deficiencies in existing CUAs’ ability to maintain safe behavior during long-horizon planning. The benchmark successfully identifies safety gaps and enables analysis of risks in MCP-based CUA systems.

Conclusion: Planning-time safety awareness is critical for computer-use agents, especially in long-horizon tasks. LPS-Bench provides a comprehensive evaluation framework, reveals current system deficiencies, and enables development of mitigation strategies for safer CUA systems.

Abstract: Computer-use agents (CUAs) that interact with real computer systems can perform automated tasks but face critical safety risks. Ambiguous instructions may trigger harmful actions, and adversarial users can manipulate tool execution to achieve malicious goals. Existing benchmarks mostly focus on short-horizon or GUI-based tasks, evaluating on execution-time errors but overlooking the ability to anticipate planning-time risks. To fill this gap, we present LPS-Bench, a benchmark that evaluates the planning-time safety awareness of MCP-based CUAs under long-horizon tasks, covering both benign and adversarial interactions across 65 scenarios of 7 task domains and 9 risk types. We introduce a multi-agent automated pipeline for scalable data generation and adopt an LLM-as-a-judge evaluation protocol to assess safety awareness through the planning trajectory. Experiments reveal substantial deficiencies in existing CUAs’ ability to maintain safe behavior. We further analyze the risks and propose mitigation strategies to improve long-horizon planning safety in MCP-based CUA systems. We open-source our code at https://github.com/tychenn/LPS-Bench.

Yuxuan Liu, Yuntian Shi, Kun Wang, Haoting Shen, Kun Yang

Main category: cs.AI

TL;DR: CSR-Bench is a benchmark for evaluating cross-modal reliability in MLLMs through stress-testing patterns covering safety, over-rejection, bias, and hallucination, revealing systematic alignment gaps and trade-offs in multimodal understanding.

Details

Motivation: Current MLLMs may exhibit safety behaviors driven by unimodal shortcuts rather than true joint intent understanding across image and text modalities, necessitating a benchmark to evaluate cross-modal reliability and diagnose modality-induced behavior shifts.

Method: Introduces CSR-Bench with four stress-testing interaction patterns (Safety, Over-rejection, Bias, Hallucination) covering 61 fine-grained types. Each instance requires integrated image-text interpretation, with paired text-only controls to diagnose modality-induced behavior shifts. Evaluates 16 state-of-the-art MLLMs.

Result: Models show systematic cross-modal alignment gaps: weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only to multimodal inputs. Clear trade-off observed between reducing over-rejection and maintaining safe, non-discriminatory behavior.

Conclusion: MLLMs exhibit systematic reliability issues in cross-modal understanding, with apparent safety gains potentially coming from refusal-oriented heuristics rather than robust intent understanding. The benchmark reveals fundamental challenges in achieving true multimodal alignment.

Abstract: Multimodal large language models (MLLMs) enable interaction over both text and images, but their safety behavior can be driven by unimodal shortcuts instead of true joint intent understanding. We introduce CSR-Bench, a benchmark for evaluating cross-modal reliability through four stress-testing interaction patterns spanning Safety, Over-rejection, Bias, and Hallucination, covering 61 fine-grained types. Each instance is constructed to require integrated image-text interpretation, and we additionally provide paired text-only controls to diagnose modality-induced behavior shifts. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps. Models show weak safety awareness, strong language dominance under interference, and consistent performance degradation from text-only controls to multimodal inputs. We also observe a clear trade-off between reducing over-rejection and maintaining safe, non-discriminatory behavior, suggesting that some apparent safety gains may come from refusal-oriented heuristics rather than robust intent understanding. WARNING: This paper contains unsafe contents.

[352] Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Xuan Ren, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Main category: cs.AI

TL;DR: Agentic Proposing framework for generating high-quality synthetic reasoning datasets using specialized agents with iterative reflection and tool-use

Details

Motivation: High-quality verifiable datasets are essential for advancing complex reasoning in LLMs, but human annotation is costly and difficult to scale. Current synthesis methods face trade-offs between structural validity and problem complexity.

Method: Models problem synthesis as goal-driven sequential decision process where specialized agents dynamically select and compose modular reasoning skills. Uses iterative workflow with internal reflection and tool-use, developing Agentic-Proposer-4B with Multi-Granularity Policy Optimization (MGPO).

Result: Downstream solvers trained on agent-synthesized data significantly outperform leading baselines and show robust cross-domain generalization. A 30B solver trained on only 11,000 synthesized trajectories achieves 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models like GPT-5.

Conclusion: Small volumes of high-quality synthetic signals can effectively substitute for massive human-curated datasets, demonstrating the effectiveness of the agentic synthesis framework for generating verifiable reasoning training data.

Abstract: Advancing complex reasoning in large language models relies on high-quality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms often face a recurring trade-off: maintaining structural validity typically restricts problem complexity, while relaxing constraints to increase difficulty frequently leads to inconsistent or unsolvable instances. To address this, we propose Agentic Proposing, a framework that models problem synthesis as a goal-driven sequential decision process where a specialized agent dynamically selects and composes modular reasoning skills. Through an iterative workflow of internal reflection and tool-use, we develop the Agentic-Proposer-4B using Multi-Granularity Policy Optimization (MGPO) to generate high-precision, verifiable training trajectories across mathematics, coding, and science. Empirical results demonstrate that downstream solvers trained on agent-synthesized data significantly outperform leading baselines and exhibit robust cross-domain generalization. Notably, a 30B solver trained on only 11,000 synthesized trajectories achieves a state-of-the-art 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models such as GPT-5 and proving that a small volume of high-quality synthetic signals can effectively substitute for massive human-curated datasets.

[353] MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings

Yuelin Hu, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu, Ronghua Wu, Li Song

Main category: cs.AI

TL;DR: MeetBench XL introduces a comprehensive enterprise meeting AI assistant framework with a grounded bilingual multimodal dataset, multi-dimensional evaluation protocol, and a dual-policy agent for optimized query routing and tool use.

Details

Motivation: Enterprise meeting environments need AI assistants that handle diverse operational tasks under strict constraints, but existing benchmarks focus on simplified QA and fail to reflect real-world workflows involving multi-stakeholder collaboration, long temporal contexts, and tool-augmented reasoning.

Method: Three components: 1) MeetAll dataset from 231 enterprise meetings (140 hours) with questions injected using enterprise-informed protocol; 2) MeetBench XL multi-dimensional evaluation protocol; 3) MeetMaster XL dual-policy agent that optimizes query routing between fast/slow reasoning paths and tool invocation.

Result: Experiments show consistent gains over commercial systems, with lightweight classifier enabling accurate routing with minimal overhead and superior quality-latency tradeoff over single-model baselines.

Conclusion: The framework addresses enterprise meeting AI assistant needs through grounded data, comprehensive evaluation, and optimized agent architecture that balances performance with practical constraints.

Abstract: Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact checking during live discussions to cross meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks mainly focus on simplified question answering and fail to reflect real world enterprise workflows, where queries arise organically from multi stakeholder collaboration, span long temporal contexts, and require tool augmented reasoning. We address this gap through a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise informed protocol validated by domain expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across finance, healthcare, and technology sectors. Second, we propose MeetBench XL, a multi dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster XL, a learned dual policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a superior quality latency tradeoff over single model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real world deployment case study.Resources: https://github.com/huyuelin/MeetBench.

[354] Memora: A Harmonic Memory Representation Balancing Abstraction and Specificity

Menglin Xia, Xuchao Zhang, Shantanu Dixit, Paramaguru Harimurugan, Rujia Wang, Victor Ruhle, Robert Sim, Chetan Bansal, Saravan Rajmohan

Main category: cs.AI

TL;DR: Memora is a harmonic memory representation system for AI agents that balances abstraction and specificity through primary abstractions and cue anchors, enabling efficient retrieval beyond semantic similarity.

Details

Motivation: Agent memory systems need to handle continuously growing information while supporting efficient, context-aware retrieval. Current approaches using abstraction often lose fine-grained details needed for effective reasoning, creating a trade-off between scalability and specificity.

Method: Memora organizes information via primary abstractions that index concrete memory values and consolidate related updates into unified entries. Cue anchors expand retrieval access across diverse aspects and connect related memories. A retrieval policy actively exploits these memory connections to retrieve relevant information beyond direct semantic similarity.

Result: Memora establishes new state-of-the-art on LoCoMo and LongMemEval benchmarks, demonstrating better retrieval relevance and reasoning effectiveness as memory scales. Theoretically, standard RAG and Knowledge Graph-based memory systems emerge as special cases of this framework.

Conclusion: Memora provides a harmonic memory representation that structurally balances abstraction and specificity, enabling scalable agent memory systems with improved retrieval and reasoning capabilities.

Abstract: Agent memory systems must accommodate continuously growing information while supporting efficient, context-aware retrieval for downstream tasks. Abstraction is essential for scaling agent memory, yet it often comes at the cost of specificity, obscuring the fine-grained details required for effective reasoning. We introduce Memora, a harmonic memory representation that structurally balances abstraction and specificity. Memora organizes information via its primary abstractions that index concrete memory values and consolidate related updates into unified memory entries, while cue anchors expand retrieval access across diverse aspects of the memory and connect related memories. Building on this structure, we employ a retrieval policy that actively exploits these memory connections to retrieve relevant information beyond direct semantic similarity. Theoretically, we show that standard Retrieval-Augmented Generation (RAG) and Knowledge Graph (KG)-based memory systems emerge as special cases of our framework. Empirically, Memora establishes a new state-of-the-art on the LoCoMo and LongMemEval benchmarks, demonstrating better retrieval relevance and reasoning effectiveness as memory scales.

[355] MentalSeek-Dx: Towards Progressive Hypothetico-Deductive Reasoning for Real-world Psychiatric Diagnosis

Xiao Sun, Yuming Yang, Junnan Zhu, Jiang Zhong, Xinyu Zhou, Kaiwen Wei

Main category: cs.AI

TL;DR: MentalDx Bench: First benchmark for disorder-level psychiatric diagnosis using real-world clinical data; reveals LLMs fail at fine-grained diagnosis despite good coarse categorization; MentalSeek-Dx model addresses this with clinical reasoning training.

Details

Motivation: Current LLM benchmarks for psychiatric assessment lack ecological validity and fine-grained diagnostic supervision, limiting clinical utility. Need for clinically-grounded evaluation and models that can perform disorder-level diagnosis.

Method: Created MentalDx Bench with 712 de-identified electronic health records annotated by psychiatrists under ICD-11 guidelines covering 76 disorders across 16 categories. Developed MentalSeek-Dx model using supervised trajectory construction and curriculum-based reinforcement learning to internalize clinical reasoning.

Result: Evaluation of 18 LLMs shows paradigm misalignment: strong performance at coarse diagnostic categorization but systematic failure at disorder-level diagnosis. MentalSeek-Dx achieves SOTA performance with only 14B parameters.

Conclusion: Establishes clinically grounded framework for reliable psychiatric diagnosis, addressing gap between pattern-based LLM modeling and clinical hypothetico-deductive reasoning.

Abstract: Mental health disorders represent a burgeoning global public health challenge. While Large Language Models (LLMs) have demonstrated potential in psychiatric assessment, their clinical utility is severely constrained by benchmarks that lack ecological validity and fine-grained diagnostic supervision. To bridge this gap, we introduce \textbf{MentalDx Bench}, the first benchmark dedicated to disorder-level psychiatric diagnosis within real-world clinical settings. Comprising 712 de-identified electronic health records annotated by board-certified psychiatrists under ICD-11 guidelines, the benchmark covers 76 disorders across 16 diagnostic categories. Evaluation of 18 LLMs reveals a critical \textit{paradigm misalignment}: strong performance at coarse diagnostic categorization contrasts with systematic failure at disorder-level diagnosis, underscoring a gap between pattern-based modeling and clinical hypothetico-deductive reasoning. In response, we propose \textbf{MentalSeek-Dx}, a medical-specialized LLM trained to internalize this clinical reasoning process through supervised trajectory construction and curriculum-based reinforcement learning. Experiments on MentalDx Bench demonstrate that MentalSeek-Dx achieves state-of-the-art (SOTA) performance with only 14B parameters, establishing a clinically grounded framework for reliable psychiatric diagnosis.

[356] Building Interpretable Models for Moral Decision-Making

Mayank Goel, Aritra Das, Paras Chopra

Main category: cs.AI

TL;DR: A custom transformer model is built to study neural network moral decision-making on trolley dilemmas, achieving 77% accuracy on Moral Machine data with interpretability analysis revealing localized biases.

Details

Motivation: To understand how neural networks make moral decisions, specifically on trolley-style ethical dilemmas, and to analyze the computational mechanisms behind such decision-making processes.

Method: Built a custom 2-layer transformer model that processes structured trolley dilemma scenarios using embeddings encoding affected individuals, quantities, and outcomes. Applied various interpretability techniques to analyze how moral reasoning distributes across the network.

Result: Achieved 77% accuracy on Moral Machine data while keeping the model small enough for detailed analysis. Found that biases in moral reasoning localize to distinct computational stages within the network.

Conclusion: Transformer models can effectively learn moral decision-making patterns from structured ethical dilemmas, and interpretability techniques reveal how moral reasoning is computationally distributed within neural networks.

Abstract: We build a custom transformer model to study how neural networks make moral decisions on trolley-style dilemmas. The model processes structured scenarios using embeddings that encode who is affected, how many people, and which outcome they belong to. Our 2-layer architecture achieves 77% accuracy on Moral Machine data while remaining small enough for detailed analysis. We use different interpretability techniques to uncover how moral reasoning distributes across the network, demonstrating that biases localize to distinct computational stages among other findings.

[357] GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer

Junmo Cho, Suhan Kim, Sangjune An, Minsu Kim, Dong Bok Lee, Heejun Lee, Sung Ju Hwang, Hae Beom Lee

Main category: cs.AI

TL;DR: GFlowPO is a probabilistic prompt optimization framework that uses Generative Flow Networks for sample-efficient prompt search with dynamic memory updates.

Details

Motivation: Prompt optimization for language models is challenging due to the combinatorial search space and sparse rewards from expensive LM evaluations. Existing RL-based methods suffer from poor sample efficiency.

Method: Two-step approach: 1) Fine-tune a lightweight prompt-LM using off-policy GFlowNet objective with replay-based training, 2) Dynamic Memory Update mechanism that updates meta-prompt using diverse prompts from replay buffer and top-performing prompts from priority queue.

Result: Outperforms recent discrete prompt optimization baselines across few-shot text classification, instruction induction benchmarks, and question answering tasks.

Conclusion: GFlowPO provides an effective probabilistic framework for prompt optimization that enables sample-efficient exploration and progressively concentrates search on high-reward regions.

Abstract: Finding effective prompts for language models (LMs) is critical yet notoriously difficult: the prompt space is combinatorially large, rewards are sparse due to expensive target-LM evaluation. Yet, existing RL-based prompt optimizers often rely on on-policy updates and a meta-prompt sampled from a fixed distribution, leading to poor sample efficiency. We propose GFlowPO, a probabilistic prompt optimization framework that casts prompt search as a posterior inference problem over latent prompts regularized by a meta-prompted reference-LM prior. In the first step, we fine-tune a lightweight prompt-LM with an off-policy Generative Flow Network (GFlowNet) objective, using a replay-based training policy that reuses past prompt evaluations to enable sample-efficient exploration. In the second step, we introduce Dynamic Memory Update (DMU), a training-free mechanism that updates the meta-prompt by injecting both (i) diverse prompts from a replay buffer and (ii) top-performing prompts from a small priority queue, thereby progressively concentrating the search process on high-reward regions. Across few-shot text classification, instruction induction benchmarks, and question answering tasks, GFlowPO consistently outperforms recent discrete prompt optimization baselines.

[358] Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li

Main category: cs.AI

TL;DR: RAI is a training-free safety framework that amplifies unsafe signals in VLMs by constructing an unsafe prototype subspace and modulating high-risk visual tokens to restore LLM-like risk recognition without compromising utility.

Details

Motivation: VLMs are vulnerable to multimodal jailbreak attacks, and existing defenses have high training costs or degrade utility. Research shows LLMs inherently recognize unsafe content, but visual inputs in VLMs dilute risk signals, motivating a lightweight safety calibration approach.

Method: RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals in the cross-modal feature space while preserving semantic integrity.

Result: Extensive experiments across multiple jailbreak and utility benchmarks show RAI substantially reduces attack success rate without compromising task performance.

Conclusion: RAI provides an effective, lightweight, training-free framework for safety calibration in VLMs that restores LLM-like risk recognition capabilities while maintaining cross-modal reasoning utility.

Abstract: Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model’s LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.

[359] Feasible strategies for conflict resolution within intuitionistic fuzzy preference-based conflict situations

Guangming Lang, Mingchuan Shang, Mengjun Hu, Jie Zhou, Feng Xu

Main category: cs.AI

TL;DR: This paper introduces intuitionistic fuzzy preference-based conflict analysis to capture agents’ attitudes with finer granularity than classical models, developing conflict measures, three-way analysis models, and adjustment strategies.

Details

Motivation: Existing preference-based conflict models use only three qualitative relations (preference, converse, indifference) which significantly limits their capacity to capture the essence of conflict. There's a need for more granular representation of agents' attitudes.

Method: Introduces intuitionistic fuzzy preference-based conflict situations, develops conflict measures within this framework, constructs three-way conflict analysis models for trisecting agent pairs, agents, and issues, uses relative loss functions to calculate thresholds, and presents adjustment mechanism-based feasible strategies.

Result: The proposed model provides finer granularity in capturing conflict situations, enables three-way analysis of conflicts, and offers adjustment strategies that account for both adjustment magnitudes and conflict degrees. An illustrative example demonstrates validity and effectiveness.

Conclusion: The intuitionistic fuzzy preference-based conflict analysis framework overcomes limitations of classical models by providing more detailed representation of agents’ attitudes, enabling more sophisticated conflict analysis and resolution strategies.

Abstract: In three-way conflict analysis, preference-based conflict situations characterize agents’ attitudes towards issues by formally modeling their preferences over pairs of issues. However, existing preference-based conflict models rely exclusively on three qualitative relations, namely, preference, converse, and indifference, to describe agents’ attitudes towards issue pairs, which significantly limits their capacity in capturing the essence of conflict. To overcome this limitation, we introduce the concept of an intuitionistic fuzzy preference-based conflict situation that captures agents’ attitudes towards issue pairs with finer granularity than that afforded by classical preference-based models. Afterwards, we develop intuitionistic fuzzy preference-based conflict measures within this framework, and construct three-way conflict analysis models for trisecting the set of agent pairs, the agent set, and the issue set. Additionally, relative loss functions built on the proposed conflict functions are employed to calculate thresholds for three-way conflict analysis. Finally, we present adjustment mechanism-based feasible strategies that simultaneously account for both adjustment magnitudes and conflict degrees, together with an algorithm for constructing such feasible strategies, and provide an illustrative example to demonstrate the validity and effectiveness of the proposed model.

[360] DiscoverLLM: From Executing Intents to Discovering Them

Tae Soo Kim, Yoonjoo Lee, Jaesang Yu, John Joon Young Chung, Juho Kim

Main category: cs.AI

TL;DR: DiscoverLLM trains LLMs to help users discover their intents through interactive exploration, using a novel user simulator with hierarchical intent modeling and a concretization-based reward signal.

Details

Motivation: Current LLMs struggle with ambiguous user requests because users often haven't formed clear intents yet and need to explore outcomes to discover what they want. Simple clarification questions fail when users themselves don't know their preferences.

Method: Introduces DiscoverLLM framework with a novel user simulator that models cognitive state using a hierarchy of intents that progressively concretize. Uses degree of concretization as a reward signal to train models to optimize for helping users discover intents through adaptive divergence (exploration) and convergence (refinement).

Result: Achieves over 10% higher task performance while reducing conversation length by up to 40% across interactive benchmarks in creative writing, technical writing, and SVG drawing. In a user study with 75 participants, improves conversation satisfaction and efficiency compared to baselines.

Conclusion: DiscoverLLM provides a generalizable framework for training LLMs to help users form and discover intents through interactive exploration, addressing the fundamental challenge of ambiguous requests where users haven’t yet formed clear preferences.

Abstract: To handle ambiguous and open-ended requests, Large Language Models (LLMs) are increasingly trained to interact with users to surface intents they have not yet expressed (e.g., ask clarification questions). However, users are often ambiguous because they have not yet formed their intents: they must observe and explore outcomes to discover what they want. Simply asking “what kind of tone do you want?” fails when users themselves do not know. We introduce DiscoverLLM, a novel and generalizable framework that trains LLMs to help users form and discover their intents. Central to our approach is a novel user simulator that models cognitive state with a hierarchy of intents that progressively concretize as the model surfaces relevant options – where the degree of concretization serves as a reward signal that models can be trained to optimize. Resulting models learn to collaborate with users by adaptively diverging (i.e., explore options) when intents are unclear, and converging (i.e., refine and implement) when intents concretize. Across proposed interactive benchmarks in creative writing, technical writing, and SVG drawing, DiscoverLLM achieves over 10% higher task performance while reducing conversation length by up to 40%. In a user study with 75 human participants, DiscoverLLM improved conversation satisfaction and efficiency compared to baselines.

[361] Ontology-to-tools compilation for executable semantic constraint enforcement in LLM agents

Xiaochi Zhou, Patrick Bulter, Changxuan Yang, Simon D. Rihm, Thitikarn Angkanaporn, Jethro Akroyd, Sebastian Mosbach, Markus Kraft

Main category: cs.AI

TL;DR: LLM agents use compiled ontology tools to enforce semantic constraints during knowledge graph creation from scientific text, reducing manual engineering.

Details

Motivation: To couple LLMs with formal domain knowledge by enforcing semantic constraints during generation rather than through post-hoc validation, reducing manual schema and prompt engineering.

Method: Ontology-to-tools compilation transforms ontological specifications into executable tool interfaces that LLM-based agents must use. Uses Model Context Protocol (MCP) and agent-based workflow to translate ontologies into ontology-aware tools for iterative extraction, validation, and repair of structured knowledge from unstructured text.

Result: Demonstrated with metal-organic polyhedra synthesis literature, showing executable ontological semantics can guide LLM behavior and reduce manual engineering.

Conclusion: Establishes a general paradigm for embedding formal knowledge into generative systems through ontology-to-tools compilation.

Abstract: We introduce ontology-to-tools compilation as a proof-of-principle mechanism for coupling large language models (LLMs) with formal domain knowledge. Within The World Avatar (TWA), ontological specifications are compiled into executable tool interfaces that LLM-based agents must use to create and modify knowledge graph instances, enforcing semantic constraints during generation rather than through post-hoc validation. Extending TWA’s semantic agent composition framework, the Model Context Protocol (MCP) and associated agents are integral components of the knowledge graph ecosystem, enabling structured interaction between generative models, symbolic constraints, and external resources. An agent-based workflow translates ontologies into ontology-aware tools and iteratively applies them to extract, validate, and repair structured knowledge from unstructured scientific text. Using metal-organic polyhedra synthesis literature as an illustrative case, we show how executable ontological semantics can guide LLM behaviour and reduce manual schema and prompt engineering, establishing a general paradigm for embedding formal knowledge into generative systems.

[362] CRL-VLA: Continual Vision-Language-Action Learning

Qixin Zeng, Shuo Zhang, Hongyin Zhang, Renjie Wang, Han Zhao, Libang Zhao, Runze Li, Donglin Wang, Chao Huang

Main category: cs.AI

TL;DR: CRL-VLA is a continual reinforcement learning framework for Vision-Language-Action models that addresses the stability-plasticity trade-off through asymmetric regulation and dual-critic architecture with goal-conditioned value formulation.

Details

Motivation: The paper addresses the challenge of lifelong learning for embodied agents in open-world environments, where VLA models need to master dexterous manipulation through environmental interaction. Existing continual RL methods struggle to balance stability (retaining old skills) and plasticity (learning new ones), which is crucial for deploying VLA models in lifelong robotic scenarios.

Method: CRL-VLA introduces a framework with theoretical bounds linking stability-plasticity trade-off to goal-conditioned advantage magnitude scaled by policy divergence. It uses asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. The method employs a dual-critic architecture with Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation.

Result: Experiments on the LIBERO benchmark show that CRL-VLA effectively harmonizes stability and plasticity objectives, outperforming baselines in both anti-forgetting (retaining old skills) and forward adaptation (learning new skills).

Conclusion: CRL-VLA provides a theoretically grounded solution to the stability-plasticity dilemma in continual reinforcement learning for VLA models, enabling effective lifelong learning for embodied agents through its dual-critic architecture and asymmetric regulation approach.

Abstract: Lifelong learning is critical for embodied agents in open-world environments, where reinforcement learning fine-tuning has emerged as an important paradigm to enable Vision-Language-Action (VLA) models to master dexterous manipulation through environmental interaction. Thus, Continual Reinforcement Learning (CRL) is a promising pathway for deploying VLA models in lifelong robotic scenarios, yet balancing stability (retaining old skills) and plasticity (learning new ones) remains a formidable challenge for existing methods. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence. CRL-VLA resolves this dilemma via asymmetric regulation: constraining advantage magnitudes on prior tasks while enabling controlled growth on new tasks. This is realized through a simple but effective dual-critic architecture with novel Goal-Conditioned Value Formulation (GCVF), where a frozen critic anchors semantic consistency and a trainable estimator drives adaptation. Experiments on the LIBERO benchmark demonstrate that CRL-VLA effectively harmonizes these conflicting objectives, outperforming baselines in both anti-forgetting and forward adaptation.

[363] The Dual Role of Abstracting over the Irrelevant in Symbolic Explanations: Cognitive Effort vs. Understanding

Zeynep G. Saribatur, Johannes Langer, Ute Schmid

Main category: cs.AI

TL;DR: Formal abstractions (removal and clustering) in symbolic AI explanations improve human understanding and reduce cognitive effort

Details

Motivation: AI systems often produce outputs that are difficult to understand, and while symbolic AI offers transparency, raw logical traces impose high cognitive load. The paper investigates how formal abstractions impact human reasoning performance and cognitive effort.

Method: Used Answer Set Programming (ASP) as a formal framework to define irrelevant details for abstraction. Conducted cognitive experiments where participants classified stimuli across domains with explanations derived from answer set programs, testing removal and clustering abstractions.

Result: Clustering details significantly improved participants’ understanding, while removal of details significantly reduced cognitive effort. Both findings support the hypothesis that abstraction enhances human-centered symbolic explanations.

Conclusion: Formal abstractions (specifically removal and clustering) effectively enhance the interpretability of symbolic AI explanations by improving human understanding and reducing cognitive load.

Abstract: Explanations are central to human cognition, yet AI systems often produce outputs that are difficult to understand. While symbolic AI offers a transparent foundation for interpretability, raw logical traces often impose a high extraneous cognitive load. We investigate how formal abstractions, specifically removal and clustering, impact human reasoning performance and cognitive effort. Utilizing Answer Set Programming (ASP) as a formal framework, we define a notion of irrelevant details to be abstracted over to obtain simplified explanations. Our cognitive experiments, in which participants classified stimuli across domains with explanations derived from an answer set program, show that clustering details significantly improve participants’ understanding, while removal of details significantly reduce cognitive effort, supporting the hypothesis that abstraction enhances human-centered symbolic explanations.

[364] IntentRL: Training Proactive User-intent Agents for Open-ended Deep Research via Reinforcement Learning

Haohao Luo, Zexi Li, Yuexiang Xie, Wenhao Zhang, Yaliang Li, Ying Shen

Main category: cs.AI

TL;DR: IntentRL trains proactive agents to clarify latent user intents before starting computationally expensive deep research, improving both intent understanding and downstream task performance.

Details

Motivation: Deep research agents face an autonomy-interaction dilemma: high autonomy on ambiguous queries leads to prolonged execution with unsatisfactory outcomes due to unclear user intents.

Method: Proposes IntentRL framework with scalable pipeline for generating high-quality dialogue data via shallow-to-deep intent refinement graph, and two-stage RL: offline RL on dialogues followed by online rollouts with user simulator.

Result: IntentRL significantly improves intent hit rate and downstream task performance, outperforming built-in clarify modules of closed-source DR agents and proactive LLM baselines.

Conclusion: Proactive intent clarification before deep research execution addresses the autonomy-interaction dilemma and improves research agent effectiveness.

Abstract: Deep Research (DR) agents extend Large Language Models (LLMs) beyond parametric knowledge by autonomously retrieving and synthesizing evidence from large web corpora into long-form reports, enabling a long-horizon agentic paradigm. However, unlike real-time conversational assistants, DR is computationally expensive and time-consuming, creating an autonomy-interaction dilemma: high autonomy on ambiguous user queries often leads to prolonged execution with unsatisfactory outcomes. To address this, we propose IntentRL, a framework that trains proactive agents to clarify latent user intents before starting long-horizon research. To overcome the scarcity of open-ended research data, we introduce a scalable pipeline that expands a few seed samples into high-quality dialogue turns via a shallow-to-deep intent refinement graph. We further adopt a two-stage reinforcement learning (RL) strategy: Stage I applies RL on offline dialogues to efficiently learn general user-interaction behavior, while Stage II uses the trained agent and a user simulator for online rollouts to strengthen adaptation to diverse user feedback. Extensive experiments show that IntentRL significantly improves both intent hit rate and downstream task performance, outperforming the built-in clarify modules of closed-source DR agents and proactive LLM baselines.

[365] When Routing Collapses: On the Degenerate Convergence of LLM Routers

Guannan Lai, Han-Jia Ye

Main category: cs.AI

TL;DR: EquiRouter addresses routing collapse in LLM routing by directly learning model rankings instead of scalar performance scores, improving cost-efficiency while maintaining quality.

Details

Motivation: Existing LLM routers suffer from routing collapse - they systematically default to the most expensive models even when cheaper ones suffice, wasting computation and cost. This undermines the core promise of routing to achieve favorable quality-cost trade-offs.

Method: Proposes EquiRouter, a decision-aware router that directly learns model rankings rather than predicting scalar performance scores. This addresses the objective-decision mismatch where small prediction errors can flip relative orderings and trigger suboptimal selections.

Result: On RouterBench, EquiRouter reduces cost by about 17% at GPT-4-level performance compared to the strongest prior router. It restores the role of smaller models and mitigates routing collapse.

Conclusion: Directly learning model rankings rather than scalar scores is more effective for routing decisions, enabling better utilization of smaller models and achieving significant cost savings while maintaining performance.

Abstract: LLM routing aims to achieve a favorable quality–cost trade-off by dynamically assigning easy queries to smaller models and harder queries to stronger ones. However, across both unimodal and multimodal settings, we uncover a pervasive yet underexplored failure mode in existing routers: as the user’s cost budget increases, routers systematically default to the most capable and most expensive model even when cheaper models already suffice. As a result, current routers under-utilize small models, wasting computation and monetary cost and undermining the core promise of routing; we term this phenomenon routing collapse. We attribute routing collapse to an objective–decision mismatch: many routers are trained to predict scalar performance scores, whereas routing decisions ultimately depend on discrete comparisons among candidate models. Consequently, small prediction errors can flip relative orderings and trigger suboptimal selections. To bridge this gap, we propose EquiRouter, a decision-aware router that directly learns model rankings, restoring the role of smaller models and mitigating routing collapse. On RouterBench, EquiRouter reduces cost by about 17% at GPT-4-level performance compared to the strongest prior router. Our code is available at https://github.com/AIGNLAI/EquiRouter.

[366] Group Selection as a Safeguard Against AI Substitution

Qiankun Zhong, Thomas F. Eisenmann, Julian Garcia, Iyad Rahwan

Main category: cs.AI

TL;DR: AI use can reduce cultural diversity, potentially causing “cultural collapse” where AI-generated content reduces human variation and innovation, slowing cultural evolution.

Details

Motivation: The paper examines how reliance on generative AI can reduce cultural variance and diversity in creative work, potentially leading to model collapse and hallucination problems, and investigates the long-term consequences for human cultural evolution.

Method: Uses agent-based modeling and evolutionary game theory to compare two AI use strategies: AI-complement (humans seek guidance but produce final output) vs AI-substitute (humans provide minimal input, AI produces most output), studying how these strategies compete and spread under evolutionary dynamics.

Result: AI-substitute users prevail under individual-level selection despite stronger reduction in cultural variance, while AI-complement users benefit groups by maintaining variance needed for exploration and can be favored under cultural group selection when group boundaries are strong.

Conclusion: The findings reveal population-level effects of AI adoption and inform policy/organizational strategies to mitigate risks of cultural collapse, highlighting the tension between individual and group-level selection pressures in AI adoption.

Abstract: Reliance on generative AI can reduce cultural variance and diversity, especially in creative work. This reduction in variance has already led to problems in model performance, including model collapse and hallucination. In this paper, we examine the long-term consequences of AI use for human cultural evolution and the conditions under which widespread AI use may lead to “cultural collapse”, a process in which reliance on AI-generated content reduces human variation and innovation and slows cumulative cultural evolution. Using an agent-based model and evolutionary game theory, we compare two types of AI use: complement and substitute. AI-complement users seek suggestions and guidance while remaining the main producers of the final output, whereas AI-substitute users provide minimal input, and rely on AI to produce most of the output. We then study how these use strategies compete and spread under evolutionary dynamics. We find that AI-substitute users prevail under individual-level selection despite the stronger reduction in cultural variance. By contrast, AI-complement users can benefit their groups by maintaining the variance needed for exploration, and can therefore be favored under cultural group selection when group boundaries are strong. Overall, our findings shed light on the long-term, population-level effects of AI adoption and inform policy and organizational strategies to mitigate these risks.

[367] Persona Generators: Generating Diverse Synthetic Personas at Scale

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, Alexander Sasha Vezhnevets

Main category: cs.AI

TL;DR: Persona Generators use LLMs and evolutionary algorithms to create diverse synthetic populations for AI evaluation, optimizing for coverage of rare trait combinations rather than just common patterns.

Details

Motivation: Evaluating AI systems requires diverse human data, but collecting representative data is expensive/infeasible, especially for novel technologies or future scenarios. Current generative approaches require detailed population data and focus on density matching rather than support coverage, missing long-tail behaviors.

Method: Introduce Persona Generators - functions that produce diverse synthetic populations. Use iterative improvement loop based on AlphaEvolve with LLMs as mutation operators to refine generator code over hundreds of iterations. Optimization produces lightweight generators that expand small descriptions into diverse synthetic personas maximizing coverage of opinions/preferences along relevant diversity axes.

Result: Evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

Conclusion: Persona Generators enable creation of diverse synthetic populations for AI evaluation without requiring extensive real human data, addressing the coverage gap in current approaches and supporting evaluation of AI systems across diverse user populations.

Abstract: Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

[368] EHRWorld: A Patient-Centric Medical World Model for Long-Horizon Clinical Trajectories

Linjie Mu, Zhongzhen Huang, Yannian Gu, Shengqian Qin, Shaoting Zhang, Xiaofan Zhang

Main category: cs.AI

TL;DR: EHRWorld: A patient-centric medical world model trained on real-world EHR data that outperforms naive LLM baselines for simulating disease progression and treatment outcomes over time.

Details

Motivation: While LLMs show promise for static medical reasoning, they struggle with maintaining consistent patient states under sequential interventions in dynamic medical world modeling. There's a need for models that can reliably simulate disease progression and treatment outcomes over time using causally grounded clinical data.

Method: Introduces EHRWorld, a patient-centric medical world model trained under a causal sequential paradigm, along with EHRWorld-110K, a large-scale longitudinal clinical dataset derived from real-world electronic health records. The approach focuses on modeling temporal evolution and causal relationships in clinical data.

Result: EHRWorld significantly outperforms naive LLM-based baselines, achieving more stable long-horizon simulation, improved modeling of clinically sensitive events, and favorable reasoning efficiency. Demonstrates the necessity of training on causally grounded, temporally evolving clinical data.

Conclusion: Training medical world models on causally grounded, temporally evolving clinical data is essential for reliable and robust simulation of disease progression and treatment outcomes, overcoming limitations of naive LLM approaches.

Abstract: World models offer a principled framework for simulating future states under interventions, but realizing such models in complex, high-stakes domains like medicine remains challenging. Recent large language models (LLMs) have achieved strong performance on static medical reasoning tasks, raising the question of whether they can function as dynamic medical world models capable of simulating disease progression and treatment outcomes over time. In this work, we show that LLMs only incorporating medical knowledge struggle to maintain consistent patient states under sequential interventions, leading to error accumulation in long-horizon clinical simulation. To address this limitation, we introduce EHRWorld, a patient-centric medical world model trained under a causal sequential paradigm, together with EHRWorld-110K, a large-scale longitudinal clinical dataset derived from real-world electronic health records. Extensive evaluations demonstrate that EHRWorld significantly outperforms naive LLM-based baselines, achieving more stable long-horizon simulation, improved modeling of clinically sensitive events, and favorable reasoning efficiency, highlighting the necessity of training on causally grounded, temporally evolving clinical data for reliable and robust medical world modeling.

[369] Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12

Iñaki del Campo, Pablo Cuervo, Victor Rodriguez-Fernandez, Roberto Armellin, Jack Yarndley

Main category: cs.AI

TL;DR: LLMs show improved strategic planning for complex space missions but fail at implementation due to physical constraints and debugging issues.

Details

Motivation: To investigate whether current LLMs can handle autonomous multi-stage planning in complex, physically constrained environments like space mission design, using the GTOC 12 asteroid mining challenge as a test case.

Method: Adapted MLE-Bench framework to orbital mechanics, deployed AIDE-based agent architecture for autonomous mission solution generation, and used “LLM-as-a-Judge” methodology with expert-developed rubric to evaluate strategic viability across five categories.

Result: Strategic viability scores nearly doubled in two years (9.3 to 17.2/26), showing improved conceptual understanding, but models consistently failed at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops.

Conclusion: Current LLMs have sufficient knowledge for space science tasks but face an implementation barrier, functioning as domain facilitators rather than fully autonomous engineers.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning, yet their capacity for autonomous multi-stage planning in high-dimensional, physically constrained environments remains an open research question. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12), a complex astrodynamics challenge requiring the design of a large-scale asteroid mining campaign. We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions. To assess performance beyond binary validity, we employ an “LLM-as-a-Judge” methodology, utilizing a rubric developed by domain experts to evaluate strategic viability across five structural categories. A comparative analysis of models, ranging from GPT-4-Turbo to reasoning-enhanced architectures like Gemini 2.5 Pro, and o3, reveals a significant trend: the average strategic viability score has nearly doubled in the last two years (rising from 9.3 to 17.2 out of 26). However, we identify a critical capability gap between strategy and execution. While advanced models demonstrate sophisticated conceptual understanding, correctly framing objective functions and mission architectures, they consistently fail at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops. We conclude that, while current LLMs often demonstrate sufficient knowledge and intelligence to tackle space science tasks, they remain limited by an implementation barrier, functioning as powerful domain facilitators rather than fully autonomous engineers.

[370] Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, Irwin King

Main category: cs.AI

TL;DR: Search-R2: A novel Actor-Refiner collaboration framework for language agents that improves search-integrated reasoning through targeted intervention and fine-grained supervision.

Details

Motivation: Existing search-integrated reasoning agents trained via reinforcement learning suffer from multi-scale credit assignment problems, relying on sparse trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors.

Method: Proposes Search-R2 with Actor-Refiner collaboration: Actor produces initial reasoning trajectories, Meta-Refiner selectively diagnoses and repairs flawed steps via ‘cut-and-regenerate’ mechanism. Uses hybrid reward design coupling outcome correctness with dense process reward quantifying information density of retrieved evidence.

Result: Search-R2 consistently outperforms strong RAG and RL-based baselines across various general and multi-hop QA datasets and model scales, achieving superior reasoning accuracy with minimal overhead.

Conclusion: The Actor-Refiner framework with targeted intervention and fine-grained supervision effectively addresses credit assignment problems in search-integrated reasoning, enabling language agents to transcend static parametric knowledge through improved external source querying.

Abstract: Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a ‘cut-and-regenerate’ mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.

[371] Mitigating Conversational Inertia in Multi-Turn Agents

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

Main category: cs.AI

TL;DR: The paper identifies “conversational inertia” in LLMs used as agents, where models overly attend to their previous responses, limiting exploration. The authors propose Context Preference Learning to calibrate models to favor low-inertia responses and provide context management strategies.

Details

Motivation: LLMs excel as few-shot learners but this strength becomes problematic in multiturn agent scenarios where they erroneously mimic their own previous responses as few-shot examples, creating a tension between context enrichment for exploitation and conversational inertia that undermines exploration.

Method: Through attention analysis, the authors identify conversational inertia as strong diagonal attention to previous responses. They propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones, and provide context management strategies at inference time.

Result: Experimental results across eight agentic environments and one deep research scenario validate that the framework reduces conversational inertia and achieves performance improvements.

Conclusion: The paper addresses a key limitation in using LLMs as agents by identifying and mitigating conversational inertia, providing methods to balance exploration and exploitation in multiturn scenarios.

Abstract: Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multiturn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over highinertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

[372] TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

Wenzhe Fan, Tommaso Tognoli, Henry Peng Zou, Chunyu Miao, Yibo Wang, Xinhua Zhang

Main category: cs.AI

TL;DR: TodyComm: A task-oriented dynamic communication algorithm for multi-agent LLM systems that adapts communication topology across rounds based on task dynamics

Details

Motivation: Existing multi-agent LLM systems use fixed communication topologies during inference, which fails in realistic applications where agents' roles change across rounds due to dynamic adversaries, task progression, or time-varying constraints like communication bandwidth.

Method: TodyComm produces behavior-driven collaboration topologies that adapt to dynamics at each round, optimizing task utility through policy gradient methods.

Result: Experiments on five benchmarks show TodyComm delivers superior task effectiveness under both dynamic adversary and communication budget constraints while maintaining token efficiency and scalability.

Conclusion: Dynamic communication topologies that adapt to round-specific conditions are essential for effective multi-agent LLM collaboration in realistic scenarios.

Abstract: Multi-round LLM-based multi-agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents’ roles may change \textit{across rounds} due to dynamic adversary, task progression, or time-varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a \textbf{t}ask-\textbf{o}riented \textbf{dy}namic \textbf{comm}unication algorithm. It produces behavior-driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that under both dynamic adversary and communications budgets, TodyComm delivers superior task effectiveness while retaining token efficiency and scalability.

[373] AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

Jianhao Ruan, Zhihao Xu, Yiran Peng, Fashen Ren, Zhaoyang Yu, Xinbing Liang, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo, Jiayi Zhang

Main category: cs.AI

TL;DR: AOrchestra is an agentic system that introduces a unified abstraction for language agents as tuples (Instruction, Context, Tools, Model), enabling dynamic creation of specialized executors for complex tasks.

Details

Motivation: Current language agent systems lack dynamic abstraction views of sub-agents, which limits adaptability for complex, long-horizon tasks. The paper aims to address this by creating a framework-agnostic agent abstraction that enables more flexible and efficient task solving.

Method: Proposes a unified agent abstraction as a tuple (Instruction, Context, Tools, Model) that serves as a compositional recipe for capabilities. AOrchestra system uses a central orchestrator that concretizes this tuple at each step: curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation.

Result: AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench). The system enables controllable performance-cost trade-offs and reduces human engineering efforts.

Conclusion: The proposed agent abstraction and AOrchestra system provide a framework-agnostic approach to language agent orchestration that improves adaptability and performance for complex tasks while enabling efficient resource utilization.

Abstract: Language agents have shown strong promise for task automation. Realizing this promise for increasingly complex, long-horizon tasks has driven the rise of a sub-agent-as-tools paradigm for multi-turn task solving. However, existing designs still lack a dynamic abstraction view of sub-agents, thereby hurting adaptability. We address this challenge with a unified, framework-agnostic agent abstraction that models any agent as a tuple Instruction, Context, Tools, Model. This tuple acts as a compositional recipe for capabilities, enabling the system to spawn specialized executors for each task on demand. Building on this abstraction, we introduce an agentic system AOrchestra, where the central orchestrator concretizes the tuple at each step: it curates task-relevant context, selects tools and models, and delegates execution via on-the-fly automatic agent creation. Such designs enable reducing human engineering efforts, and remain framework-agnostic with plug-and-play support for diverse agents as task executors. It also enables a controllable performance-cost trade-off, allowing the system to approach Pareto-efficient. Across three challenging benchmarks (GAIA, SWE-Bench, Terminal-Bench), AOrchestra achieves 16.28% relative improvement against the strongest baseline when paired with Gemini-3-Flash. The code is available at: https://github.com/FoundationAgents/AOrchestra

[374] Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity

Yingxuan Yang, Chengrui Qu, Muning Wen, Laixi Shi, Ying Wen, Weinan Zhang, Adam Wierman, Shangding Gu

Main category: cs.AI

TL;DR: Multi-agent LLM systems show diminishing returns with homogeneous scaling but substantial gains with heterogeneity; performance is bounded by task uncertainty, not agent count, with heterogeneous agents providing complementary evidence.

Details

Motivation: The paper investigates why scaling LLM-based multi-agent systems with homogeneous agents shows diminishing returns while heterogeneous systems continue to improve, aiming to understand the fundamental limits of multi-agent scaling and the value of diversity.

Method: The authors develop an information-theoretic framework showing MAS performance is bounded by intrinsic task uncertainty rather than agent count. They introduce K*, an effective channel count metric that quantifies the number of effective channels without ground-truth labels, and empirically test heterogeneous vs. homogeneous agent configurations.

Result: Empirical results show heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. The effective channel count K* successfully quantifies system diversity and predicts performance improvements.

Conclusion: Multi-agent system performance is limited by task uncertainty, not agent count. Diversity in agents (different models, prompts, tools) provides complementary evidence and substantially improves performance, offering principled guidelines for building efficient and robust MAS through diversity-aware design.

Abstract: LLM-based multi-agent systems (MAS) have emerged as a promising approach to tackle complex tasks that are difficult for individual LLMs. A natural strategy is to scale performance by increasing the number of agents; however, we find that such scaling exhibits strong diminishing returns in homogeneous settings, while introducing heterogeneity (e.g., different models, prompts, or tools) continues to yield substantial gains. This raises a fundamental question: what limits scaling, and why does diversity help? We present an information-theoretic framework showing that MAS performance is bounded by the intrinsic task uncertainty, not by agent count. We derive architecture-agnostic bounds demonstrating that improvements depend on how many effective channels the system accesses. Homogeneous agents saturate early because their outputs are strongly correlated, whereas heterogeneous agents contribute complementary evidence. We further introduce $K^*$, an effective channel count that quantifies the number of effective channels without ground-truth labels. Empirically, we show that heterogeneous configurations consistently outperform homogeneous scaling: 2 diverse agents can match or exceed the performance of 16 homogeneous agents. Our results provide principled guidelines for building efficient and robust MAS through diversity-aware design. Code and Dataset are available at the link: https://github.com/SafeRL-Lab/Agent-Scaling.

[375] Conformal Thinking: Risk Control for Reasoning on a Compute Budget

Xi Wang, Anushri Suresh, Alvin Zhang, Rishi More, William Jurayj, Benjamin Van Durme, Mehrdad Farajtabar, Daniel Khashabi, Eric Nalisnick

Main category: cs.AI

TL;DR: Risk-controlled adaptive reasoning framework for LLMs that optimizes compute by setting upper and lower confidence thresholds to stop reasoning early while controlling error rates.

Details

Motivation: Current adaptive reasoning approaches for LLMs face practical challenges in setting token budgets and thresholds, creating a fundamental risk-accuracy trade-off. There's a need for systematic methods to control error rates while minimizing computational costs.

Method: Proposes a risk control framework with two thresholds: an upper threshold to stop when model is confident, and a novel parametric lower threshold to preemptively stop unsolvable instances. Uses distribution-free risk control to optimally specify stopping mechanisms given target risk and validation data. Incorporates efficiency loss for scenarios with multiple budget criteria.

Result: Empirical results across diverse reasoning tasks and models show computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to user-specified risk targets.

Conclusion: The framework effectively addresses the risk-accuracy trade-off in adaptive reasoning by providing systematic risk control methods that optimize computational efficiency while maintaining specified error rate bounds.

Abstract: Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning – spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We re-frame the budget setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk control approach, demonstrating computational efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.

[376] AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

Minjun Zhu, Zhen Lin, Yixuan Weng, Panzhong Lu, Qiujie Xie, Yifan Wei, Sifan Liu, Qiyao Sun, Yue Zhang

Main category: cs.AI

TL;DR: AutoFigure is an agentic framework for generating high-quality scientific illustrations from long-form scientific texts, evaluated on the FigureBench benchmark of 3,300 text-figure pairs.

Details

Motivation: Manual creation of scientific illustrations is a bottleneck in academia and industry, despite their importance for communicating complex concepts. There's a need for automated systems that can generate publication-ready scientific illustrations from text.

Method: Proposes AutoFigure, an agentic framework that engages in extensive thinking, recombination, and validation to produce structurally sound and aesthetically refined layouts before rendering final illustrations. Uses FigureBench benchmark with 3,300 high-quality scientific text-figure pairs covering diverse sources.

Result: AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The framework demonstrates strong performance across diverse scientific text-to-illustration tasks.

Conclusion: AutoFigure represents a significant advancement in automated scientific illustration generation, with potential to alleviate the bottleneck in scientific communication. The released code, dataset, and demo provide resources for further research.

Abstract: High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset and huggingface space are released in https://github.com/ResearAI/AutoFigure.

[377] Thinking Like a Doctor: Conversational Diagnosis through the Exploration of Diagnostic Knowledge Graphs

Jeongmoon Won, Seungwon Kook, Yohan Jo

Main category: cs.AI

TL;DR: A conversational diagnosis system that uses a diagnostic knowledge graph to generate hypotheses and verify them through clarifying questions, evaluated with a realistic patient simulator adapted from MIMIC-IV data.

Details

Motivation: Existing conversational diagnosis approaches rely too heavily on model parametric knowledge or assume patients provide rich, concrete information, which is unrealistic for real-world clinical scenarios.

Method: Two-step reasoning approach: (1) generate diagnostic hypotheses from dialogue context using a diagnostic knowledge graph, and (2) verify hypotheses through iterative clarifying questions until final diagnosis is reached. Uses MIMIC-IV patient profiles with adapted simulator to reflect vague symptom descriptions.

Result: Improved diagnostic accuracy and efficiency over strong baselines. Physician evaluations support realism of the simulator and clinical utility of generated questions.

Conclusion: The proposed conversational diagnosis system with knowledge graph reasoning and realistic patient simulation demonstrates clinical utility and improved performance over existing approaches.

Abstract: Conversational diagnosis requires multi-turn history-taking, where an agent asks clarifying questions to refine differential diagnoses under incomplete information. Existing approaches often rely on the parametric knowledge of a model or assume that patients provide rich and concrete information, which is unrealistic. To address these limitations, we propose a conversational diagnosis system that explores a diagnostic knowledge graph to reason in two steps: (i) generating diagnostic hypotheses from the dialogue context, and (ii) verifying hypotheses through clarifying questions, which are repeated until a final diagnosis is reached. Since evaluating the system requires a realistic patient simulator that responds to the system’s questions, we adopt a well-established simulator along with patient profiles from MIMIC-IV. We further adapt it to describe symptoms vaguely to reflect real-world patients during early clinical encounters. Experiments show improved diagnostic accuracy and efficiency over strong baselines, and evaluations by physicians support the realism of our simulator and the clinical utility of the generated questions. Our code will be released upon publication.

[378] Advancing AI Research Assistants with Expert-Involved Learning

Tianyu Liu, Simeng Han, Hanchen Wang, Xiao Luo, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao

Main category: cs.AI

TL;DR: ARIEL is an open-source framework for evaluating and optimizing multimodal AI models in biomedical applications, focusing on article summarization and figure interpretation with expert-vetted tasks.

Details

Motivation: To assess the reliability of LLMs and LMMs in biomedical discovery by creating a standardized evaluation framework with expert-curated multimodal biomedical data and tasks.

Method: Developed ARIEL framework with curated multimodal biomedical corpus, expert-vetted tasks, uniform evaluation protocols, blinded PhD-level assessment, prompt engineering, lightweight fine-tuning, and compute-scaled inference strategies.

Result: Models generate fluent but incomplete summaries; LMMs struggle with detailed visual reasoning; prompt engineering and fine-tuning improve textual coverage; compute-scaled inference enhances visual QA; ARIEL agent can propose testable mechanistic hypotheses.

Conclusion: ARIEL provides a reproducible platform for advancing trustworthy AI in biomedicine by delineating current strengths and limitations of foundation models and enabling optimization through targeted interventions.

Abstract: Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning. We later observe that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and a compute-scaled inference strategy enhances visual question answering. We build an ARIEL agent that integrates textual and visual cues, and we show it can propose testable mechanistic hypotheses. ARIEL delineates current strengths and limitations of foundation models, and provides a reproducible platform for advancing trustworthy AI in biomedicine.

[379] CP-Agent: Agentic Constraint Programming

Stefan Szeider

Main category: cs.AI

TL;DR: CP-Agent is a Python coding agent using ReAct framework with IPython kernel to translate natural language to formal constraint models, achieving perfect accuracy on 101 CP-Bench problems with minimal guidance.

Details

Motivation: Translating natural language to formal constraint models requires domain expertise and modeling framework knowledge, creating a barrier for non-experts. The paper aims to explore agentic workflows for automating this translation process.

Method: CP-Agent uses the ReAct (Reasoning + Acting) framework with a persistent IPython kernel. It iteratively executes code, observes solver feedback, and refines constraint models based on execution results. Domain knowledge is provided as a project prompt under 50 lines.

Result: CP-Agent achieves perfect accuracy on all 101 constraint programming problems from CP-Bench after minor benchmark clarifications. Experiments show minimal guidance outperforms detailed procedural scaffolding, and explicit task management tools have mixed effects on modeling tasks.

Conclusion: Agentic workflows with minimal guidance can effectively automate natural language to constraint model translation, achieving high accuracy on benchmark problems while revealing insights about guidance strategies and task management tools.

Abstract: The translation of natural language to formal constraint models requires expertise in the problem domain and modeling frameworks. To explore the effectiveness of agentic workflows, we propose CP-Agent, a Python coding agent that uses the ReAct framework with a persistent IPython kernel. We provide the relevant domain knowledge as a project prompt of under 50 lines. The algorithm works by iteratively executing code, observing the solver’s feedback, and refining constraint models based on execution results. We evaluate CP-Agent on 101 constraint programming problems from CP-Bench. We made minor changes to the benchmark to address systematic ambiguities in the problem specifications and errors in the ground-truth models. On the clarified benchmark, CP-Agent achieves perfect accuracy on all 101 problems. Our experiments show that minimal guidance outperforms detailed procedural scaffolding. Our experiments also show that explicit task management tools can have both positive and negative effects on focused modeling tasks.

[380] From Deferral to Learning: Online In-Context Knowledge Distillation for LLM Cascades

Yu Wu, Shuo Wu, Ye Tao, Yansong Li, Anand D. Sarwate

Main category: cs.AI

TL;DR: Inter-Cascade: An online interactive framework that enables weak LLMs to learn from strong LLMs during inference by storing and reusing generalized problem-solving strategies, reducing strong model calls while improving accuracy.

Details

Motivation: Standard LLM cascades are static and inefficient - they repeatedly consult expensive strong models for similar queries without learning or adaptation during inference, leading to redundant costs and missed opportunities for knowledge transfer.

Method: When strong model resolves deferred queries, it generates generalized reusable problem-solving strategies stored in dynamic repository. Future queries use similarity matching to retrieve relevant strategies to augment weak model’s context, enabling in-context learning without parameter fine-tuning.

Result: Outperforms standard cascades on multiple benchmarks: improves weak model accuracy by up to 33.06%, overall system accuracy by 6.35%, reduces strong model calls by 48.05%, and saves fees by 49.63%.

Conclusion: Inter-Cascade enables effective in-context knowledge transfer between LLMs, provides scalable framework for both open-source and API-based models, and improves weak model confidence calibration theoretically and empirically.

Abstract: Standard LLM cascades improve efficiency by deferring difficult queries from weak to strong models. However, these systems are typically static: when faced with repeated or semantically similar queries, they redundantly consult the expensive model, failing to adapt during inference. To address this, we propose Inter-Cascade, an online, interactive framework that transforms the strong model from a temporary helper into a long-term teacher. In our approach, when the strong model resolves a deferred query, it generates a generalized, reusable problem-solving strategy. These strategies are stored in a dynamic repository and retrieved via similarity matching to augment the weak model’s context for future queries. This enables the weak model to learn on the job without expensive parameter fine-tuning. We theoretically show that this mechanism improves the weak model’s confidence calibration. Empirically, Inter-Cascade outperforms standard cascades on multiple benchmarks, improving weak model and overall system accuracy by up to 33.06 percent and 6.35 percent, while reducing strong model calls by up to 48.05 percent and saving fee by up to 49.63 percent. Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.

[381] Do AI Models Perform Human-like Abstract Reasoning Across Modalities?

Claas Beger, Ryan Yi, Shuhao Fu, Kaleda Denton, Arseny Moskvichev, Sarah W. Tsai, Sivasankaran Rajamanickam, Melanie Mitchell

Main category: cs.AI

TL;DR: AI models’ abstract reasoning abilities on ConceptARC benchmark show accuracy alone is misleading - models use surface-level shortcuts in text but can recognize abstractions in visual tasks even when failing to apply them correctly.

Details

Motivation: To investigate whether state-of-the-art AI models truly recognize and reason with the abstractions that benchmarks like ARC-AGI are designed to test, moving beyond simple accuracy metrics to understand the nature of AI reasoning capabilities.

Method: Evaluated AI models on ConceptARC benchmark with variations in input modality (textual vs. visual), use of external Python tools, and reasoning effort. Analyzed not just output accuracy but also the natural-language rules models generate to explain their solutions, enabling assessment of whether models recognize intended abstractions.

Result: Best models’ rules frequently based on surface-level shortcuts rather than intended abstractions. In visual modality, accuracy drops sharply but rule analysis reveals models can recognize abstractions even when failing to apply them correctly. Accuracy alone overestimates capabilities in textual modalities and underestimates in visual modalities.

Conclusion: Current AI models show limited abstraction reasoning abilities, with accuracy metrics providing misleading assessments. Rule-level analysis offers more faithful evaluation of abstract reasoning and better tracking of progress toward human-like, abstraction-centered intelligence.

Abstract: OpenAI’s o3-preview reasoning model exceeded human accuracy on the ARC-AGI-1 benchmark, but does that mean state-of-the-art models recognize and reason with the abstractions the benchmark was designed to test? Here we investigate abstraction abilities of AI models using the closely related but simpler ConceptARC benchmark. Our evaluations vary input modality (textual vs. visual), use of external Python tools, and reasoning effort. Beyond output accuracy, we evaluate the natural-language rules that models generate to explain their solutions, enabling us to assess whether models recognize the abstractions that ConceptARC was designed to elicit. We show that the best models’ rules are frequently based on surface-level ``shortcuts,’’ capturing intended abstractions considerably less often than humans. In the visual modality, AI models’ output accuracy drops sharply; however, our rule-level analysis reveals that a substantial share of their rules capture the intended abstractions, even as the models struggle to apply these concepts to generate correct solutions. In short, we show that using accuracy alone to evaluate abstract reasoning can substantially overestimate AI capabilities in textual modalities and underestimate it in visual modalities. Our results offer a more faithful picture of AI models’ abstract reasoning abilities and a more principled way to track progress toward human-like, abstraction-centered intelligence.

[382] MemeLens: Multilingual Multitask VLMs for Memes

Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Abul Hasnat, Dimitar Dimitrov, Giovanni Da San Martino, Preslav Nakov, Firoj Alam

Main category: cs.AI

TL;DR: MemeLens: A unified multilingual multitask VLM for meme understanding that consolidates 38 datasets into 20 tasks across harm, targets, intent, and affect categories.

Details

Motivation: Memes are complex multimodal communications where meaning emerges from text, imagery, and cultural context. Existing research is fragmented across different tasks and languages, limiting cross-domain generalization.

Method: Propose MemeLens, a unified multilingual multitask explanation-enhanced Vision Language Model. Consolidate 38 public meme datasets, filter and map labels into shared taxonomy of 20 tasks. Conduct comprehensive empirical analysis across modeling paradigms, task categories, and datasets.

Result: Findings show robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in unified setting.

Conclusion: Unified multilingual multitask approach improves meme understanding across diverse domains. Experimental resources and datasets will be made publicly available for community.

Abstract: Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of $20$ tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available for the community.

[383] The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus

Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, Sachin Dev Sharma

Main category: cs.AI

TL;DR: PoLR is an inference-time method that reduces computational cost of Self-Consistency by clustering reasoning prefixes and expanding only dominant clusters, maintaining accuracy while cutting token usage by up to 60%.

Details

Motivation: Self-Consistency and similar reasoning methods are computationally expensive as they fully expand all reasoning traces. There's a need for more efficient inference strategies that preserve accuracy while reducing computational overhead.

Method: PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster only. This leverages prefix consistency where early reasoning steps predict final correctness, reducing unnecessary expansions.

Result: PoLR matches or exceeds Self-Consistency accuracy across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND benchmarks while reducing token usage by up to 60% and wall-clock latency by up to 50%.

Conclusion: PoLR provides a computationally efficient alternative to Self-Consistency that maintains reasoning accuracy, is complementary to adaptive inference methods, and can serve as a drop-in pre-filter without requiring model fine-tuning.

Abstract: Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.

[384] From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, Di Jin, Yi Wu

Main category: cs.AI

TL;DR: EigenData: A unified framework combining self-evolving data synthesis with verifier-based RL for training interactive tool-using agents, achieving state-of-the-art performance on tool-use benchmarks without expensive human annotation.

Details

Motivation: Training interactive tool-using agents is challenging due to difficulty scaling high-quality multi-turn tool-use data synthesis and noisy reinforcement learning signals from user simulation, requiring a more efficient approach.

Method: Proposes EigenData - a hierarchical multi-agent engine that synthesizes tool-grounded dialogues with executable per-instance checkers, using closed-loop self-evolving prompts and workflow. Then applies RL recipe: fine-tunes user model first, then uses GRPO-style training with trajectory-level group-relative advantages and dynamic filtering.

Result: Achieves 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom benchmarks in tau^2-bench, matching or exceeding frontier models. Shows consistent improvements beyond supervised fine-tuning.

Conclusion: Demonstrates a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation, combining self-evolving data synthesis with verifier-based RL for efficient agent training.

Abstract: Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking, multi-step tool execution, while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.

[385] Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

Changming Li, Kaixing Zhang, Haoyun Xu, Yingdong Shi, Zheng Zhang, Kaitao Song, Kan Ren

Main category: cs.AI

TL;DR: IPG is a novel interpretability framework that identifies model components contributing to reasoning behaviors by propagating outcome-based signals backward through inference trajectories.

Details

Motivation: Current interpretability methods for LLM reasoning are limited - they either identify components correlated with textual patterns or rely on human-annotated contrastive pairs, struggling to precisely localize complex reasoning mechanisms or capture sequential influence from internal workings to reasoning outputs.

Method: Proposes Integrated Policy Gradient (IPG), which attributes reasoning behaviors to model components by propagating compound outcome-based signals (like post-reasoning accuracy) backward through model inference trajectories, focusing on components with sequential contribution to reasoning behavior.

Result: Empirical evaluations show IPG achieves more precise localization of reasoning mechanisms and enables reliable modulation of reasoning behaviors (reasoning capability, reasoning strength) across diverse reasoning models.

Conclusion: IPG provides a novel framework for understanding LLM reasoning mechanisms by tracking sequential influence from internal components to reasoning outcomes, offering more precise interpretability than existing methods.

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture sequential influence from model internal workings to the reasoning outputs. In this paper, built on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that have sequential contribution to reasoning behavior where outcomes are cumulated by long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to model’s inner components by propagating compound outcome-based signals such as post reasoning accuracy backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.

[386] Regulatory Markets: The Future of AI Governance

Gillian K. Hadfield, Jack Clark

Main category: cs.AI

TL;DR: Proposes regulatory markets as a solution to AI governance challenges, where governments require regulated entities to purchase regulatory services from licensed private regulators.

Details

Motivation: Addresses two key deficits in AI regulation: technical deficit (difficulty translating legal requirements into technical specifications) and democratic deficit (over-reliance on industry for standards-setting without democratic accountability).

Method: Introduces regulatory markets framework where governments mandate regulated entities to purchase regulatory services from government-licensed private regulators, creating market incentives for regulatory innovation.

Result: Proposes a hybrid approach that could overcome limitations of both command-and-control regulation and excessive delegation to industry by leveraging market forces and industry R&D for regulatory innovation.

Conclusion: Regulatory markets offer a promising solution to AI governance challenges by combining government policy-setting with market-driven technical implementation, addressing both technical and democratic deficits.

Abstract: Appropriately regulating artificial intelligence is an increasingly urgent and widespread policy challenge. We identify two primary, competing problem. First is a technical deficit: Legislatures and regulatory face significant challenges in rapidly translating conventional command-and-control legal requirements into technical requirements. Second is a democratic deficit: Over-reliance on industry to provide technical standards fails to ensure that the many values-based decisions that must be made to shape AI development and deployment are made by democratically accountable public, not private, actors. We propose a solution: regulatory markets, in which governments require the targets of regulation to purchase regulatory services from a government-licensed private regulator. This approach to AI regulation could overcome the limitations of both command-and-control regulation and excessive delegation to industry. Regulatory markets could enable governments to establish policy priorities for the regulation of AI while relying on market forces and industry R&D efforts to pioneer the technical methods of regulation that best achieve policymakers’ stated objectives.

[387] If It’s Nice, Do It Twice: We Should Try Iterative Corpus Curation

Robin Young

Main category: cs.AI

TL;DR: Iterative self-filtering: Train model on filtered data, use it to filter further, repeat. Converges to self-consistent corpus where model approves its own training data.

Details

Motivation: To improve model safety through iterative self-filtering without degrading capabilities, creating a scalable oversight mechanism with human-auditable results.

Method: Iterative process: 1) Train model on filtered data, 2) Use trained model to filter corpus further, 3) Train new model on cleaner corpus, 4) Repeat. Theoretical analysis of convergence.

Result: Theoretical convergence to self-consistent corpus, decay in harmful content even with constant filter quality, creation of large-scale preference annotations for interpretability.

Conclusion: Iterative self-filtering offers novel scalable oversight with human-auditable corpora, calls for empirical testing by researchers with pretraining infrastructure.

Abstract: Recent work demonstrates that filtering harmful content from pretraining data improves model safety without degrading capabilities. We propose a natural extension: do it again. A model trained on filtered data can filter the corpus further; training on this cleaner corpus produces an even cleaner model. We provide theoretical analysis showing this process converges to a self-consistent corpus where the model trained on it approves of its own training data. Even under the weak assumption of constant filter quality, iteration yields decay in harmful content. We argue this framework offers a novel form of scalable oversight. While model internals are opaque, the resulting corpus is human-auditable. Even a single iteration produces a large-scale preference annotations over documents, potentially valuable for interpretability research. We derive bounds on capability-safety tradeoffs and outline open questions. We call on researchers with pretraining infrastructure to empirically test this approach.

[388] Building spatial world models from sparse transitional episodic memories

Zizhan He, Maxime Daigle, Pouya Bashivan

Main category: cs.AI

TL;DR: ESWM is a neuroscience-inspired framework that builds spatial cognitive maps from sparse, disjoint episodic memories rather than long sequential trajectories, enabling rapid adaptation and flexible navigation.

Details

Motivation: Animals can rapidly construct flexible cognitive maps from disjoint experiences, but existing computational models require long sequential trajectories. Neuroscience suggests maps can integrate disjoint experiences governed by spatial rules.

Method: Introduces Episodic Spatial World Model (ESWM) that constructs spatial maps from sparse, disjoint episodic memories. It operates on independently stored and updated episodic memories to enable rapid adaptation to environmental changes.

Result: ESWM predicts unobserved transitions from minimal experience across varying environments, with latent space geometry aligning with environment structure. Enables near-optimal exploration and navigation strategies without additional training.

Conclusion: Demonstrates how neuroscience-inspired episodic memory principles can advance development of more flexible and generalizable world models for spatial cognition and navigation.

Abstract: Many animals possess a remarkable capacity to rapidly construct flexible cognitive maps of their environments. These maps are crucial for ethologically relevant behaviors such as navigation, exploration, and planning. Existing computational models typically require long sequential trajectories to build accurate maps, but neuroscience evidence suggests maps can also arise from integrating disjoint experiences governed by consistent spatial rules. We introduce the Episodic Spatial World Model (ESWM), a novel framework that constructs spatial maps from sparse, disjoint episodic memories. Across environments of varying complexity, ESWM predicts unobserved transitions from minimal experience, and the geometry of its latent space aligns with that of the environment. Because it operates on episodic memories that can be independently stored and updated, ESWM is inherently adaptive, enabling rapid adjustment to environmental changes. Furthermore, we demonstrate that ESWM readily enables near-optimal strategies for exploring novel environments and navigating between arbitrary points, all without the need for additional training. Our work demonstrates how neuroscience-inspired principles of episodic memory can advance the development of more flexible and generalizable world models.

[389] Measuring and Analyzing Intelligence via Contextual Uncertainty in Large Language Models using Information-Theoretic Metrics

Jae Wan Shim

Main category: cs.AI

TL;DR: A task-agnostic method for building quantitative Cognitive Profiles for LLMs using Entropy Decay Curves and Information Gain Span to analyze how models process information as context length grows.

Details

Motivation: While LLMs excel at many tasks, the mechanisms behind their success remain poorly understood. The paper aims to move from asking what these systems can do to understanding how they process information internally.

Method: Proposes a task-agnostic method that builds Cognitive Profiles using Entropy Decay Curves - plots of a model’s normalized predictive uncertainty as context length grows. Also introduces Information Gain Span (IGS) as a single index summarizing the desirability of decay patterns.

Result: Across several state-of-the-art LLMs and diverse texts, the curves expose distinctive, stable profiles that depend on both model scale and text complexity. The tools provide principled ways to analyze and compare internal dynamics of AI systems.

Conclusion: The proposed Cognitive Profiles and IGS offer a principled framework for understanding how LLMs process information, revealing systematic patterns related to model scale and text complexity that were previously hidden.

Abstract: Large Language Models (LLMs) excel on many task-specific benchmarks, yet the mechanisms that drive this success remain poorly understood. We move from asking what these systems can do to asking how they process information. Our contribution is a task-agnostic method that builds a quantitative Cognitive Profile for any model. The profile is built around the Entropy Decay Curve – a plot of a model’s normalised predictive uncertainty as context length grows. Across several state-of-the-art LLMs and diverse texts, the curves expose distinctive, stable profiles that depend on both model scale and text complexity. We also propose the Information Gain Span (IGS) as a single index that summarises the desirability of a decay pattern. Together, these tools offer a principled way to analyse and compare the internal dynamics of modern AI systems.

[390] MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong

Main category: cs.AI

TL;DR: MixGRPO improves human preference alignment in image generation by using mixed SDE/ODE sampling with a sliding window to reduce optimization overhead and accelerate training.

Details

Motivation: Existing GRPO-based methods like FlowGRPO and DanceGRPO are inefficient because they require sampling and optimizing over all denoising steps in the MDP, leading to computational overhead.

Method: MixGRPO integrates stochastic differential equations (SDE) and ordinary differential equations (ODE) with a sliding window mechanism. It uses SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This confines randomness to specific time-steps and reduces optimization overhead.

Result: MixGRPO achieves substantial gains in human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency with nearly 50% lower training time. MixGRPO-Flash further reduces training time by 71% while maintaining comparable performance.

Conclusion: MixGRPO provides an efficient framework for human preference alignment in image generation by leveraging mixed sampling strategies, significantly reducing training time while improving performance.

Abstract: Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for faster sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%.

[391] Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Mingzhe Li, Xin Lu, Yanyan Zhao

Main category: cs.AI

TL;DR: Self-Foveate is an LLM-driven method for synthesizing diverse and difficult instruction data from unsupervised text using a multi-level foveation approach inspired by human visual perception.

Details

Motivation: Current automated methods for synthesizing instruction data from unsupervised text have limitations in diversity and difficulty of synthesized instructions, which restricts the quality of training data for large language models.

Method: Proposes a “Micro-Scatter-Macro” multi-level foveation methodology that extracts textual information at three granularities (fine-grained details, cross-region connections, holistic patterns), plus a re-synthesis module to improve instruction fidelity and quality.

Result: Comprehensive experiments across multiple unsupervised corpora and diverse model architectures show Self-Foveate consistently outperforms existing methods for instruction synthesis.

Conclusion: Self-Foveate effectively addresses diversity and difficulty limitations in instruction synthesis, providing a promising approach for generating high-quality training data for LLMs from unsupervised text.

Abstract: Synthesizing high-quality instruction data from unsupervised text is a promising paradigm for training large language models (LLMs), yet automated methods for this task still exhibit significant limitations in the diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an LLM-driven method for instruction synthesis. Inspired by hierarchical human visual perception, Self-Foveate introduces a “Micro-Scatter-Macro” multi-level foveation methodology that guides the extraction of textual information at three complementary granularities, from fine-grained details through cross-region connections to holistic patterns, thereby enhancing both the diversity and difficulty of synthesized instructions. Furthermore, a re-synthesis module is incorporated to improve the fidelity of instructions to source text and their overall quality. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures demonstrate that Self-Foveate consistently outperforms existing methods. We publicly release our code at https://github.com/Mubuky/Self-Foveate

[392] Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations

Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang

Main category: cs.AI

TL;DR: Amadeus is a training-free framework that enhances persona consistency in retrieval-augmented generation (RAG) based role-playing agents, addressing hallucination issues when characters lack relevant knowledge, with a new CharacterRAG dataset for evaluation.

Details

Motivation: Role-playing agents struggle with persona consistency when using RAG, especially when characters lack knowledge about queries, leading to hallucinations. Current research on RAG-based RPAs is limited despite RAG being practically necessary due to resource constraints.

Method: Proposes Amadeus, a training-free framework that enhances persona consistency in RAG-based RPAs. Also introduces CharacterRAG dataset with 15 fictional characters’ persona documents (976K characters) and 450 QA pairs for evaluation.

Result: Amadeus effectively models not only character knowledge but also various attributes like personality, significantly enhancing persona consistency even for questions beyond a character’s knowledge.

Conclusion: The proposed training-free framework addresses critical hallucination issues in RAG-based role-playing agents and provides a valuable dataset for future research in this under-explored area.

Abstract: Building role-playing agents (RPAs) that faithfully emulate specific characters remains challenging because collecting character-specific utterances and continually updating model parameters are resource-intensive, making retrieval-augmented generation (RAG) a practical necessity. However, despite the importance of RAG, there has been little research on RAG-based RPAs. For example, we empirically find that when a persona lacks knowledge relevant to a given query, RAG-based RPAs are prone to hallucination, making it challenging to generate accurate responses. In this paper, we propose Amadeus, a training-free framework that can significantly enhance persona consistency even when responding to questions that lie beyond a character’s knowledge. In addition, to underpin the development and rigorous evaluation of RAG-based RPAs, we manually construct CharacterRAG, a role-playing dataset that consists of persona documents for 15 distinct fictional characters totaling 976K written characters, and 450 question-answer pairs. We find that our proposed method effectively models not only the knowledge possessed by characters, but also various attributes such as personality.

[393] Uncertainty-driven Adaptive Exploration

Leonidas Bakopoulos, Georgios Chalkiadakis

Main category: cs.AI

TL;DR: A generic adaptive exploration framework that uses uncertainty to determine optimal switching between exploration and exploitation phases in reinforcement learning.

Details

Motivation: Current adaptive exploration methods lack principled approaches for determining when to switch between exploration and exploitation, which is critical for learning complex action sequences in challenging domains.

Method: Proposes a generic framework that employs uncertainty measures to guide switching decisions, can incorporate various uncertainty-measuring mechanisms (intrinsic motivation, epistemic uncertainty), and subsumes previous adaptive exploration approaches as special cases.

Result: The framework produces adaptive exploration strategies that outperform standard approaches across several experimental environments.

Conclusion: The uncertainty-based adaptive exploration framework provides a principled solution to the switching problem and demonstrates superior performance compared to existing methods.

Abstract: Adaptive exploration methods propose ways to learn complex policies via alternating between exploration and exploitation. An important question for such methods is to determine the appropriate moment to switch between exploration and exploitation and vice versa. This is critical in domains that require the learning of long and complex sequences of actions. In this work, we present a generic adaptive exploration framework that employs uncertainty to address this important issue in a principled manner. Our framework includes previous adaptive exploration approaches as special cases. Moreover, we can incorporate in our framework any uncertainty-measuring mechanism of choice, for instance mechanisms used in intrinsic motivation or epistemic uncertainty-based exploration methods. We experimentally demonstrate that our framework gives rise to adaptive exploration strategies that outperform standard ones across several environments.

[394] MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization

Yichen Han, Yuhang Han, Siteng Huang, Guanyu Liu, Zhengpeng Zhou, Bojun Liu, Yujia Zhang, Isaac N Shi, Lewei He, Tianyu Shi

Main category: cs.AI

TL;DR: MAPGD is a multi-agent prompt gradient descent framework that optimizes LLM prompts through specialized agents focusing on different refinement dimensions, coordinated via semantic gradient embedding and conflict resolution mechanisms.

Details

Motivation: Existing prompt optimization methods follow single trajectories, leading to limited adaptability, gradient conflicts, and high computational overhead. There's a need for more robust, collaborative approaches to prompt engineering.

Method: MAPGD uses specialized agents focusing on distinct refinement dimensions (instruction clarity, example selection, format structure, stylistic adaptation). Agents coordinate through semantic gradient embedding, conflict detection, and fusion. Introduces Hypersphere Constrained Gradient Clustering (HCGC) for compact clusters and Channel Adaptive Agent Weighting (CAAW) for dynamic agent reweighting.

Result: MAPGD consistently surpasses single-agent and random baselines in both accuracy and efficiency on classification and reasoning benchmarks. Ablation studies confirm effectiveness of gradient fusion, agent specialization, and conflict resolution.

Conclusion: MAPGD establishes a unified, gradient-based, interpretable framework for robust prompt optimization with theoretical convergence guarantees, addressing limitations of single-trajectory optimization methods.

Abstract: Prompt engineering is crucial for fully leveraging large language models (LLMs), yet most existing optimization methods follow a single trajectory, resulting in limited adaptability, gradient conflicts, and high computational overhead. We propose MAPGD (Multi-Agent Prompt Gradient Descent), a novel framework that reconceptualizes prompt optimization as a collaborative process among specialized agents. Each agent focuses on a distinct refinement dimension, such as instruction clarity, example selection, format structure, or stylistic adaptation, and their contributions are coordinated through semantic gradient embedding, conflict detection, and fusion. To further enhance robustness and stability, MAPGD introduces two new mechanisms: Hypersphere Constrained Gradient Clustering (HCGC), which enforces angular margin constraints for compact and well-separated clusters, and Channel Adaptive Agent Weighting (CAAW), which dynamically reweights agent contributions based on validation performance. Experiments on classification and reasoning benchmarks show that MAPGD consistently surpasses single-agent and random baselines in both accuracy and efficiency. Ablation studies confirm the effectiveness of gradient fusion, agent specialization, and conflict resolution. Together, these components establish MAPGD as a unified, gradient-based, and interpretable framework for robust prompt optimization with theoretical convergence guarantees.

[395] PAINT: Parallel-in-time Neural Twins for Dynamical System Reconstruction

Andreas Radler, Vincent Seyfried, Johannes Brandstetter, Thomas Lichtenegger

Main category: cs.AI

TL;DR: PAINT introduces parallel-in-time neural twins for dynamical systems that stay on-trajectory using generative modeling of state distributions from measurements.

Details

Motivation: To create neural twins - digital replicas of real systems that can consume measurements at test time for context-specific decision-making, with the critical property of remaining on-trajectory (staying close to true system state over time).

Method: PAINT trains a generative neural network to model the distribution of states in parallel over time, then predicts states from measurements in a sliding window fashion at test time. The method is architecture-agnostic.

Result: Theoretical analysis shows PAINT is on-trajectory while autoregressive models generally are not. Empirical evaluation on 2D turbulent fluid dynamics demonstrates PAINT stays on-trajectory and predicts system states from sparse measurements with high fidelity.

Conclusion: PAINT has potential for developing neural twins that stay on-trajectory, enabling more accurate state estimation and decision-making in dynamical systems.

Abstract: Neural surrogates have shown great potential in simulating dynamical systems, while offering real-time capabilities. We envision Neural Twins as a progression of neural surrogates, aiming to create digital replicas of real systems. A neural twin consumes measurements at test time to update its state, thereby enabling context-specific decision-making. We argue, that a critical property of neural twins is their ability to remain on-trajectory, i.e., to stay close to the true system state over time. We introduce Parallel-in-time Neural Twins (PAINT), an architecture-agnostic family of methods for modeling dynamical systems from measurements. PAINT trains a generative neural network to model the distribution of states in parallel over time. At test time, states are predicted from measurements in a sliding window fashion. Our theoretical analysis shows that PAINT is on-trajectory, whereas autoregressive models generally are not. Empirically, we evaluate our method on a challenging two-dimensional turbulent fluid dynamics problem. The results demonstrate that PAINT stays on-trajectory and predicts system states from sparse measurements with high fidelity. These findings underscore PAINT’s potential for developing neural twins that stay on-trajectory, enabling more accurate state estimation and decision-making.

[396] Chain-of-Thought Hijacking

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

Main category: cs.AI

TL;DR: Chain-of-Thought Hijacking: A jailbreak attack that prepends harmful instructions with extended benign puzzle reasoning to systematically weaken Large Reasoning Models’ safety mechanisms, achieving high attack success rates across major models.

Details

Motivation: While Large Reasoning Models (LRMs) use extended inference-time reasoning to improve task performance, and prior work suggests this should strengthen safety, the authors discovered that long reasoning sequences can be exploited to systematically weaken safety mechanisms.

Method: Introduces Chain-of-Thought Hijacking attack that prepends harmful instructions with extended sequences of benign puzzle reasoning. Uses activation probing, attention analysis, and causal interventions to understand the mechanism. Evaluates on HarmBench across major models (Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, Claude 4 Sonnet).

Result: Achieves attack success rates of 99%, 94%, 100%, and 94% on the four models respectively. Finds that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode safety checking strength, while late layers encode refusal outcome.

Conclusion: Explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues, demonstrating that extended reasoning can weaken rather than strengthen safety mechanisms in Large Reasoning Models.

Abstract: Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary. Long reasoning sequences can be exploited to systematically weaken them. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prepends harmful instructions with extended sequences of benign puzzle reasoning. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet. To understand this mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.

[397] CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?

Peiyu Li, Xiaobao Huang, Ting Hua, Nitesh V. Chawla

Main category: cs.AI

TL;DR: CrochetBench evaluates multimodal LLMs’ ability to generate executable procedures in crochet, testing stitch recognition, instruction selection, and compilable procedure generation using a DSL for validation.

Details

Motivation: While multimodal LLMs can describe visual content, their ability to generate executable procedures remains underexplored. The paper aims to evaluate this shift from describing to doing through procedural reasoning in crochet.

Method: Uses CrochetPARADE DSL as intermediate representation for structural validation and functional evaluation via execution. Benchmark covers stitch classification, instruction grounding, and both natural language and image-to-DSL translation tasks.

Result: Performance sharply decreases as evaluation shifts from surface-level similarity to executable correctness, revealing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis.

Conclusion: CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains.

Abstract: While multimodal large language models can describe visual content, their ability to generate executable procedures remains underexplored. CrochetBench presented in this paper evaluates this shift from describing to doing through fine-grained procedural reasoning in crochet: models must recognize stitches, select structurally appropriate instructions, and generate compilable procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply decreases as the evaluation shifts from surface-level similarity to executable correctness, revealing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. Our proposed CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at https://anonymous.4open.science/r/crochet-82E6/README.md.

[398] NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

Yejing Wang, Shengyu Zhou, Jinyu Lu, Ziwei Liu, Langming Liu, Maolin Wang, Wenlin Zhang, Feng Li, Wenbo Su, Pengjie Wang, Jian Xu, Xiangyu Zhao

Main category: cs.AI

TL;DR: NEZHA is a novel architecture that accelerates generative recommendation systems using integrated self-drafting and hash-based verification to reduce inference latency without sacrificing quality.

Details

Motivation: Generative recommendation systems using LLMs face high inference latency that makes them impractical for real-time industrial applications, while existing speculative decoding approaches introduce additional bottlenecks through separate draft models and model-based verifiers.

Method: NEZHA integrates a nimble autoregressive draft head directly into the primary model for efficient self-drafting, uses specialized input prompt structures, and introduces a model-free verifier based on hash sets to tackle hallucination issues.

Result: Successfully deployed on Taobao since October 2025, driving billion-level advertising revenue and serving hundreds of millions of daily active users, with extensive experiments demonstrating effectiveness on public datasets.

Conclusion: NEZHA enables practical deployment of generative recommendation systems at industrial scale by solving the latency bottleneck through integrated self-drafting and efficient verification.

Abstract: Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, their practical application is severely hindered by high inference latency, which makes them infeasible for high-throughput, real-time services and limits their overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, requiring additional training and increasing the latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, driving the billion-level advertising revenue and serving hundreds of millions of daily active users.

[399] Co-Evolving Agents: Learning from Failures as Hard Negatives

Yeonsung Jung, Trilok Padhi, Sina Shaham, Dipika Khullar, Joonhyun Jeong, Ninareh Mehrabi, Eunho Yang

Main category: cs.AI

TL;DR: A co-evolving agents framework where a target agent improves jointly with an auxiliary failure agent that generates hard negative failure trajectories to enhance learning and generalization in self-improving agents.

Details

Motivation: Current self-improving agents rely heavily on predicted trajectories with limited ground-truth supervision, making them prone to overfitting. There's a need for better methods to leverage failures as structured learning signals rather than using them as-is.

Method: Proposes a co-evolving agents framework with a target agent and auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories from both agents, generating hard negatives that are close to success but remain failures. These hard negatives are incorporated into the target agent’s optimization to sharpen decision boundaries.

Result: The method shows improved performance across benchmark datasets and demonstrates that failures can be systematically transformed into structured and valuable learning signals in self-improving agents.

Conclusion: The co-evolving agents framework effectively addresses overfitting in self-improving agents by generating informative hard negatives from failure trajectories, leading to better generalization and performance.

Abstract: The rapid progress of large foundation models has accelerated the development of task-specialized agents across diverse domains. However, the effectiveness of agents remains tightly coupled with the quality of training data, while curating task-specific datasets remains costly and often infeasible in real-world scenarios. Recent work has explored self-improving agents that autonomously generate, refine, and re-train on their own trajectories. A prominent line of approaches further leverages preference optimization by pairing predicted trajectories with scarce ground-truth trajectories, enabling agents to learn directly from their own failures. While these methods outperform supervised fine-tuning, their heavy reliance on predicted trajectories under limited ground-truth supervision leaves them prone to overfitting. To address this, we propose a co-evolving agents framework in which a target agent improves jointly with an auxiliary failure agent. The failure agent learns through preference optimization over failure trajectories from both the target and itself, thereby generating hard negatives that are close to success yet remain failures. Incorporating these informative hard negatives into the target agent’s optimization sharpens decision boundaries and enhances generalization. Our comprehensive analysis and experiments across benchmark datasets show that our method not only shows improved performance but also demonstrates that failures, instead of being used as-is, can be systematically transformed into structured and valuable learning signals in self-improving agents.

[400] GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu

Main category: cs.AI

TL;DR: GlimpRouter: A training-free step-wise collaboration framework that uses initial token entropy to predict reasoning step difficulty and routes between lightweight and large models to reduce inference latency while preserving accuracy.

Details

Motivation: Large Reasoning Models (LRMs) achieve good performance through multi-step chains of thought but incur substantial inference latency and computational cost. Collaborative inference between lightweight and large models is promising, but determining when to use which model remains challenging. Existing routing strategies introduce significant overhead.

Method: Proposes GlimpRouter based on the insight that step difficulty can be inferred from the first token’s entropy (inspired by “Aha Moment” phenomenon). Uses lightweight model to generate only the first token of each reasoning step, then routes to larger model only when initial token entropy exceeds a threshold. Training-free framework.

Result: Experiments on multiple benchmarks show significant reduction in inference latency while preserving accuracy. Achieves 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to standalone large model on AIME25.

Conclusion: GlimpRouter demonstrates a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation. Shows initial token entropy is a strong predictor of step difficulty.

Abstract: Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the “Aha Moment” phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.

[401] From Classical to Quantum Reinforcement Learning and Its Applications in Quantum Control: A Beginner’s Tutorial

Abhijit Sen, Sonali Panda, Mahima Arya, Subhajit Patra, Zizhan Zheng, Denys I. Bondar

Main category: cs.AI

TL;DR: A tutorial designed to make reinforcement learning more accessible to undergraduate students through clear, example-driven explanations and practical coding applications.

Details

Motivation: To bridge the gap between RL theory and practical coding applications, addressing common challenges students face when transitioning from conceptual understanding to implementation.

Method: Uses hands-on examples and approachable explanations to teach foundational RL skills, focusing on making the material accessible to undergraduate students.

Result: The tutorial aims to equip students with foundational skills needed to confidently apply RL techniques in real-world scenarios.

Conclusion: This educational resource successfully makes reinforcement learning more accessible through practical, example-driven teaching methods.

Abstract: This tutorial is designed to make reinforcement learning (RL) more accessible to undergraduate students by offering clear, example-driven explanations. It focuses on bridging the gap between RL theory and practical coding applications, addressing common challenges that students face when transitioning from conceptual understanding to implementation. Through hands-on examples and approachable explanations, the tutorial aims to equip students with the foundational skills needed to confidently apply RL techniques in real-world scenarios.

Benedikt Hartl, Léo Pio-Lopez, Chris Fields, Michael Levin

Main category: cs.AI

TL;DR: The paper proposes a unifying framework for cognition across biological and artificial systems based on two scale-invariant principles: remapping of embedding spaces and navigation within these spaces through iterative error minimization.

Details

Motivation: To develop an integrated view of problem-solving across diverse intelligent systems (biological, engineered, chimeric) by identifying scale-invariant principles of decision-making that apply from subcellular networks to swarms of organisms.

Method: Theoretical framework proposing that cognition can be characterized by the interplay between two invariants: (1) remapping of embedding spaces (transcriptional, morphological, physiological, 3D spaces in biological systems; latent embeddings in AI systems), and (2) navigation within these spaces through distributed error correction/iterative refinement.

Result: Identifies a dual principle - remapping and navigation of embedding spaces via iterative error minimization - as a substrate-independent invariant of cognition that applies across biological collectives and modern AI systems like transformers and diffusion models.

Conclusion: This shared mechanism provides a unifying framework for understanding cognition in both natural and synthetic systems, illuminating deep parallels between living systems and artificial models, and enabling engineering of adaptive intelligence across scales.

Abstract: The emerging field of diverse intelligence seeks an integrated view of problem-solving in agents of very different provenance, composition, and substrates. From subcellular chemical networks to swarms of organisms, and across evolved, engineered, and chimeric systems, it is hypothesized that scale-invariant principles of decision-making can be discovered. We propose that cognition in both natural and synthetic systems can be characterized and understood by the interplay between two equally important invariants: (1) the remapping of embedding spaces, and (2) the navigation within these spaces. Biological collectives, from single cells to entire organisms (and beyond), remap transcriptional, morphological, physiological, or 3D spaces to maintain homeostasis and regenerate structure, while navigating these spaces through distributed error correction. Modern Artificial Intelligence (AI) systems, including transformers, diffusion models, and neural cellular automata enact analogous processes by remapping data into latent embeddings and refining them iteratively through contextualization. We argue that this dual principle - remapping and navigation of embedding spaces via iterative error minimization - constitutes a substrate-independent invariant of cognition. Recognizing this shared mechanism not only illuminates deep parallels between living systems and artificial models, but also provides a unifying framework for engineering adaptive intelligence across scales.

[403] Deadline-Aware, Energy-Efficient Control of Domestic Immersion Hot Water Heater

Muhammad Ibrahim Khan, Bivin Pradeep, James Brusey

Main category: cs.AI

TL;DR: Reinforcement learning (PPO) outperforms traditional control methods for energy-efficient deadline-aware water heater control, achieving 26-69% energy savings.

Details

Motivation: Domestic water heaters waste energy by operating continuously rather than optimizing for predictable demand windows and thermal losses. The paper aims to develop deadline-aware control that heats water to target temperature at specified time while minimizing energy consumption.

Method: Created a Gymnasium environment modeling immersion water heater with thermal losses and discrete on/off actions. Compared three approaches: time-optimal bang-bang baseline, zero-shot Monte Carlo Tree Search planner, and Proximal Policy Optimization reinforcement learning policy.

Result: PPO achieved most energy-efficient performance, using 3.23 kWh at 60-step horizon vs 4.37-10.45 kWh for bang-bang and 4.18-6.46 kWh for MCTS. Energy savings of 26% at 30 steps and 69% at 90 steps. In representative case, PPO consumed 54% less energy than bang-bang and 33% less than MCTS.

Conclusion: Learned deadline-aware control significantly reduces energy consumption compared to traditional methods. Planners offer partial savings without training, while learned policies provide near-zero inference cost after training.

Abstract: Typical domestic immersion water heater systems are often operated continuously during winter, heating quickly rather than efficiently and ignoring predictable demand windows and ambient losses. We study deadline-aware control, where the aim is to reach a target temperature at a specified time while minimising energy consumption. We introduce an efficient Gymnasium environment that models an immersion hot water heater with first-order thermal losses and discrete on and off actions of 0 W and 6000 W applied every 120 seconds. Methods include a time-optimal bang-bang baseline, a zero-shot Monte Carlo Tree Search planner, and a Proximal Policy Optimisation policy. We report total energy consumption in watt-hours under identical physical dynamics. Across sweeps of initial temperature from 10 to 30 degrees Celsius, deadline from 30 to 90 steps, and target temperature from 40 to 80 degrees Celsius, PPO achieves the most energy-efficient performance at a 60-step horizon of 2 hours, using 3.23 kilowatt-hours, compared to 4.37 to 10.45 kilowatt-hours for bang-bang control and 4.18 to 6.46 kilowatt-hours for MCTS. This corresponds to energy savings of 26 percent at 30 steps and 69 percent at 90 steps. In a representative trajectory with a 50 kg water mass, 20 degrees Celsius ambient temperature, and a 60 degrees Celsius target, PPO consumes 54 percent less energy than bang-bang control and 33 percent less than MCTS. These results show that learned deadline-aware control reduces energy consumption under identical physical assumptions, while planners provide partial savings without training and learned policies offer near-zero inference cost once trained.

[404] PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems

Amit Singh Bhatti, Vishal Vaddina, Dagnachew Birru

Main category: cs.AI

TL;DR: PROTEUS is an LLM router that accepts accuracy targets as runtime input and uses Lagrangian dual control to dynamically route queries to appropriate models while meeting specified accuracy requirements.

Details

Motivation: Current LLM routing systems require operators to manually tune parameters offline and guess at resulting accuracy, creating an indirect, non-monotonic relationship between settings and outcomes. Operators need to specify accuracy targets directly rather than inferring them from opaque settings.

Method: PROTEUS uses Lagrangian dual control where a learned dual variable tracks constraint violations during training and conditions the policy network. This allows the router to translate specified accuracy targets (tau) into routing decisions that satisfy them, with a single trained model serving the full accuracy spectrum without retraining.

Result: On RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries), PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau, with target-response correlation of 0.97-0.98. It achieves 90.1% accuracy on RouterBench (within 1.3% of oracle) and 94.0% on SPROUT (within 4.6% of oracle), with cost savings reaching 89.8% versus the best fixed model.

Conclusion: PROTEUS enables operators to specify accuracy targets directly as runtime input, providing predictable performance while optimizing costs across diverse LLM deployment scenarios, significantly outperforming existing routing approaches.

Abstract: Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets tau as runtime input. PROTEUS uses Lagrangian dual control. A learned dual variable lambda tracks constraint violations during training and conditions the policy network. This lets the router translate specified tau values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without retraining.We evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across tau in [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model.

[405] The Epistemic Planning Domain Definition Language: Official Guideline

Alessandro Burigana, Francesco Fabiano

Main category: cs.AI

TL;DR: EPDDL is a new PDDL-like language for epistemic planning that provides a unified representation for Dynamic Epistemic Logic (DEL) semantics, addressing fragmentation in the field.

Details

Motivation: Epistemic planning in DEL faces fragmentation with different planners targeting different fragments and using ad hoc or no standardized languages, hindering comparison, reuse, and systematic benchmark development.

Method: Introduces EPDDL (Epistemic Planning Domain Definition Language) with formal development of abstract event models for representing epistemic actions, and formal specification of syntax and semantics grounded in DEL.

Result: EPDDL provides a unique PDDL-like representation capturing entire DEL semantics, enabling uniform specification of epistemic planning tasks and facilitating interoperability and reproducible evaluation.

Conclusion: EPDDL addresses fragmentation in epistemic planning by providing a standardized language that captures full DEL semantics, supporting future advances in the field through better interoperability and benchmark development.

Abstract: Epistemic planning extends (multi-agent) automated planning by making agents’ knowledge and beliefs first-class aspects of the planning formalism. One of the most well-known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers an rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL make DEL-based epistemic planning a challenging problem to tackle both theoretically, and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL-like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our main contributions are: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL’s syntax and semantics grounded in DEL with abstract event models. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.

[406] CUA-Skill: Develop Skills for Computer Using Agent

Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida

Main category: cs.AI

TL;DR: CUA-Skill introduces a structured skill base for computer-using agents that encodes human computer-use knowledge as reusable skills with parameterized execution graphs, improving agent performance on Windows applications.

Details

Motivation: Existing computer-using agents are difficult to scale and lag behind human performance due to lack of reusable, structured skill abstractions that capture how humans interact with graphical user interfaces.

Method: Develop CUA-Skill, a large-scale library of carefully engineered skills spanning common Windows applications, with parameterized execution and composition graphs. Build CUA-Skill Agent on top with dynamic skill retrieval, argument instantiation, and memory-aware failure recovery.

Result: CUA-Skill substantially improves execution success rates and robustness on challenging end-to-end agent benchmarks. On WindowsAgentArena, achieves state-of-the-art 57.5% success rate while being significantly more efficient than prior approaches.

Conclusion: CUA-Skill establishes a strong foundation for future computer-using agent development by providing reusable skill abstractions that encode human computer-use knowledge, enabling scalable and reliable agent systems.

Abstract: Computer-Using Agents (CUAs) aim to autonomously operate computer systems to complete real-world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of reusable and structured skill abstractions that capture how humans interact with graphical user interfaces and how to leverage these skills. We introduce CUA-Skill, a computer-using agentic skill base that encodes human computer-use knowledge as skills coupled with parameterized execution and composition graphs. CUA-Skill is a large-scale library of carefully engineered skills spanning common Windows applications, serving as a practical infrastructure and tool substrate for scalable, reliable agent development. Built upon this skill base, we construct CUA-Skill Agent, an end-to-end computer-using agent that supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery. Our results demonstrate that CUA-Skill substantially improves execution success rates and robustness on challenging end-to-end agent benchmarks, establishing a strong foundation for future computer-using agent development. On WindowsAgentArena, CUA-Skill Agent achieves state-of-the-art 57.5% (best of three) successful rate while being significantly more efficient than prior and concurrent approaches. The project page is available at https://microsoft.github.io/cua_skill/.

[407] Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models

Shi Fu, Yingjie Wang, Shengchao Hu, Peng Wang, Dacheng Tao

Main category: cs.AI

TL;DR: Theoretical analysis of Self-Rewarding Language Models showing they improve at rate O(1/√n) with sample size, with dependence on initial model decaying exponentially with iterations.

Details

Motivation: Self-Rewarding Language Models achieve empirical success in alignment without external feedback, but lack theoretical understanding of their core mechanisms and capabilities.

Method: Provides rigorous theoretical guarantees including: 1) lower bound characterizing fundamental limits of single update step, 2) finite-sample error bounds for full iterative paradigm, 3) analysis showing dependence on initial model decays exponentially with iterations, 4) instantiation for linear softmax model class.

Result: Performance improves at rate O(1/√n) with sample size n; dependence on initial model decays exponentially with number of iterations T; provides formal explanation for why self-rewarding succeeds by steering dynamics toward internal stability and consistency.

Conclusion: Theoretical framework explains self-rewarding success through robust overcoming of poor initialization via internal stability and consistency mechanisms, with practical guarantees for linear softmax architectures.

Abstract: Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback. Yet, despite their striking empirical progress, the core mechanisms driving their capabilities remain unelucidated, leaving a critical gap in theoretical understanding. This paper provides the first rigorous theoretical guarantees for SRLMs. We first establish a lower bound that characterizes the fundamental limits of a single update step, revealing a critical dependence on the quality of the initial model. We then derive finite-sample error bounds for the full iterative paradigm, showing that performance improves at a rate of $\widetilde{\mathcal{O}}\left(1/\sqrt{n}\right)$ with sample size $n$. Crucially, our analysis reveals that the dependence on the initial model decays exponentially with the number of iterations $T$. This provides a formal explanation for why self-rewarding succeeds: it robustly overcomes poor initialization by steering the dynamics toward internal stability and consistency. Finally, we instantiate our theoretical framework for the linear softmax model class, yielding tailored guarantees that connect our high-level insights to practical model architectures.

[408] Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, Hyunwoo Kim, Prithviraj Ammanabrolu, Jan Kautz, Yi Dong, Yejin Choi

Main category: cs.AI

TL;DR: Golden Goose synthesizes unlimited RLVR tasks from unverifiable internet text by converting fill-in-the-middle tasks into multiple-choice questions, enabling scaling of reinforcement learning with verifiable rewards for LLMs.

Details

Motivation: Scaling reinforcement learning with verifiable rewards (RLVR) is bottlenecked by limited existing verifiable data, causing performance saturation during prolonged training. The paper aims to overcome this by leveraging abundant but unverifiable internet text that contains rich reasoning content.

Method: Proposes Golden Goose method: given source text, uses LLM to identify and mask key reasoning steps, then generates diverse plausible distractors to create multiple-choice question-answering versions of fill-in-the-middle tasks. This enables synthesis of RLVR datasets from reasoning-rich unverifiable corpora like science textbooks.

Result: Created GooseReason-0.7M dataset with 0.7M tasks across math, programming, and science domains. Achieved new SOTA results for 1.5B and 4B-Instruct models across 15 benchmarks. Also created GooseReason-Cyber for cybersecurity, where Qwen3-4B-Instruct surpassed a 7B domain-specialized model.

Conclusion: Golden Goose enables automatic scaling of RLVR data by exploiting abundant unverifiable internet text, reviving saturated models and achieving sustained gains under continuous RL training across diverse domains including cybersecurity.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.

[409] Structured Self-Consistency:A Multi-Task Evaluation of LLMs on VirtualHome

Jiaqi Xu, Tao Huang, Kai Zhang

Main category: cs.AI

TL;DR: Evaluation of 7B-parameter LLMs (OPENPANGU-7B and QWEN2.5-7B) on VirtualHome embodied AI benchmark using Structured Self-Consistency decoding to improve performance on planning and action tasks.

Details

Motivation: Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments, but there's a need for comprehensive evaluation of LLMs' capabilities in these domains and development of better decoding strategies for structured generation tasks.

Method: Used Embodied Agent Interface (EAI) framework to evaluate two 7B-parameter models on VirtualHome benchmark across four tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. Proposed Structured Self-Consistency (SSC) - an enhanced decoding strategy using multiple sampling with domain-specific voting mechanisms.

Result: SSC significantly enhanced performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B showed advantages in action-level tasks. Models revealed complementary strengths across different task types.

Conclusion: The analysis provides insights for future embodied AI system development, showing that different LLM architectures have complementary strengths and that structured decoding strategies like SSC can significantly improve performance on embodied AI tasks.

Abstract: Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework. We compare two representative 7B-parameter models OPENPANGU-7B and QWEN2.5-7B across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B show advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.

[410] Learning More from Less: Unlocking Internal Representations for Benchmark Compression

Yueqi Zhang, Jin Hu, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yiwei Li, Jiayi Shi, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

Main category: cs.AI

TL;DR: REPCORE: A method that uses aligned hidden states instead of correctness labels to construct representative coresets for efficient LLM benchmarking, achieving accurate performance estimation with minimal source models.

Details

Motivation: Existing coreset methods for efficient LLM benchmarking require many source models to estimate reliable item profiles, which becomes unstable with small source pools. This is particularly limiting for new benchmarks with minimal historical data. Discrete correctness labels are lossy and fail to capture information in hidden states.

Method: REPCORE aligns heterogeneous hidden states from different models into a unified latent space to construct representative coresets. It uses these subsets for performance extrapolation, enabling precise estimation with as few as ten source models.

Result: Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis reveals that aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.

Conclusion: REPCORE provides a more effective approach to efficient LLM benchmarking by leveraging hidden state information rather than just correctness labels, enabling reliable performance estimation with minimal source models.

Abstract: The prohibitive cost of evaluating Large Language Models (LLMs) necessitates efficient alternatives to full-scale benchmarking. Prevalent approaches address this by identifying a small coreset of items to approximate full-benchmark performance. However, existing methods must estimate a reliable item profile from response patterns across many source models, which becomes statistically unstable when the source pool is small. This dependency is particularly limiting for newly released benchmarks with minimal historical evaluation data. We argue that discrete correctness labels are a lossy view of the model’s decision process and fail to capture information encoded in hidden states. To address this, we introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Using these subsets for performance extrapolation, REPCORE achieves precise estimation accuracy with as few as ten source models. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis further indicates that the aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.

[411] Multi-Agent Causal Reasoning System for Error Pattern Rule Automation in Vehicles

Hugo Math, Julian Lorenz, Stefan Oelsner, Rainer Lienhart

Main category: cs.AI

TL;DR: CAREP is a multi-agent system that automates the generation of error pattern rules from vehicle diagnostic trouble codes using causal discovery and contextual reasoning.

Details

Motivation: Current automotive diagnostic systems rely on manually crafted Boolean rules for error patterns, which is expensive, error-prone, and doesn't scale with increasing vehicle complexity.

Method: CAREP uses three specialized agents: 1) causal discovery agent to identify DTC-EP relations, 2) contextual information agent to integrate metadata and descriptions, and 3) orchestrator agent to synthesize Boolean rules with interpretable reasoning traces.

Result: Evaluation on large-scale automotive dataset (29,100+ unique DTCs, 474 error patterns) shows CAREP can automatically and accurately discover unknown EP rules, outperforming LLM-only baselines while providing transparent causal explanations.

Conclusion: CAREP represents progress toward fully automated fault diagnostics, enabling scalable, interpretable, and cost-efficient vehicle maintenance through practical causal discovery and agent-based reasoning.

Abstract: Modern vehicles generate thousands of different discrete events known as Diagnostic Trouble Codes (DTCs). Automotive manufacturers use Boolean combinations of these codes, called error patterns (EPs), to characterize system faults and ensure vehicle safety. Yet, EP rules are still manually handcrafted by domain experts, a process that is expensive and prone to errors as vehicle complexity grows. This paper introduces CAREP (Causal Automated Reasoning for Error Patterns), a multi-agent system that automatizes the generation of EP rules from high-dimensional event sequences of DTCs. CAREP combines a causal discovery agent that identifies potential DTC-EP relations, a contextual information agent that integrates metadata and descriptions, and an orchestrator agent that synthesizes candidate boolean rules together with interpretable reasoning traces. Evaluation on a large-scale automotive dataset with over 29,100 unique DTCs and 474 error patterns demonstrates that CAREP can automatically and accurately discover the unknown EP rules, outperforming LLM-only baselines while providing transparent causal explanations. By uniting practical causal discovery and agent-based reasoning, CAREP represents a step toward fully automated fault diagnostics, enabling scalable, interpretable, and cost-efficient vehicle maintenance.

[412] Aggregation Queries over Unstructured Text: Benchmark and Agentic Method

Haojia Zhu, Qinyuan Xu, Haoyu Li, Yuxi Liu, Hanchen Qiu, Jiaoyan Chen, Jiahui Jin

Main category: cs.AI

TL;DR: DFA is a modular agentic system for entity-level aggregation queries over text with strict completeness requirements, outperforming RAG and other baselines on the new AGGBench benchmark.

Details

Motivation: Aggregation queries over free text require exhaustive evidence collection ("find all" not "find one"), which existing paradigms like Text-to-SQL and Retrieval-Augmented Generation fail to achieve due to completeness issues.

Method: Proposes DFA (Disambiguation–Filtering–Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages: disambiguation, filtering, and aggregation, exposing key failure modes.

Result: DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines on AGGBench, a new benchmark for completeness-oriented aggregation under realistic large-scale corpus settings.

Conclusion: The work formalizes entity-level aggregation querying with strict completeness requirements, introduces AGGBench for evaluation, and demonstrates DFA’s effectiveness as a modular approach that improves evidence coverage.

Abstract: Aggregation query over free text is a long-standing yet underexplored problem. Unlike ordinary question answering, aggregate queries require exhaustive evidence collection and systems are required to “find all,” not merely “find one.” Existing paradigms such as Text-to-SQL and Retrieval-Augmented Generation fail to achieve this completeness. In this work, we formalize entity-level aggregation querying over text in a corpus-bounded setting with strict completeness requirement. To enable principled evaluation, we introduce AGGBench, a benchmark designed to evaluate completeness-oriented aggregation under realistic large-scale corpus. To accompany the benchmark, we propose DFA (Disambiguation–Filtering–Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages and exposes key failure modes related to ambiguity, filtering, and aggregation. Empirical results show that DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The data and code are available in \href{https://anonymous.4open.science/r/DFA-A4C1}.

[413] Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives

Lin Chen, Samuel Drapeau, Fanghao Shao, Xuekai Zhu, Bo Xue, Yunchong Song, Mathieu Laurière, Zhouhan Lin

Main category: cs.AI

TL;DR: α-GFNs generalize GFlowNets with tunable mixing parameter α to control exploration-exploitation trade-off, improving mode discovery across benchmarks.

Details

Motivation: Standard GFlowNet objectives fix equal mixing of forward/backward policies, limiting exploration-exploitation trade-off during training. The paper aims to overcome this constraint by establishing theoretical links to Markov chain reversibility.

Method: Establishes equivalence between GFlowNet objectives and Markov chain reversibility, then proposes α-GFNs with tunable parameter α to generalize the mixing. This provides direct control over exploration-exploitation dynamics while ensuring convergence to unique flows.

Result: α-GFN objectives consistently outperform previous GFlowNet objectives across Set, Bit Sequence, and Molecule Generation benchmarks, achieving up to 10× increase in discovered modes.

Conclusion: The α-GFN framework successfully generalizes GFlowNets with tunable exploration-exploitation control, significantly improving mode discovery capabilities while maintaining theoretical guarantees.

Abstract: Generative Flow Network (GFlowNet) objectives implicitly fix an equal mixing of forward and backward policies, potentially constraining the exploration-exploitation trade-off during training. By further exploring the link between GFlowNets and Markov chains, we establish an equivalence between GFlowNet objectives and Markov chain reversibility, thereby revealing the origin of such constraints, and provide a framework for adapting Markov chain properties to GFlowNets. Building on these theoretical findings, we propose $α$-GFNs, which generalize the mixing via a tunable parameter $α$. This generalization enables direct control over exploration-exploitation dynamics to enhance mode discovery capabilities, while ensuring convergence to unique flows. Across various benchmarks, including Set, Bit Sequence, and Molecule Generation, $α$-GFN objectives consistently outperform previous GFlowNet objectives, achieving up to a $10 \times$ increase in the number of discovered modes.

[414] Emergent Analogical Reasoning in Transformers

Gouki Minegishi, Jingyuan Feng, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.AI

TL;DR: Transformers can learn analogical reasoning through geometric alignment of relational structures and functor-like mechanisms, with emergence depending on data characteristics, optimization, and model scale.

Details

Motivation: To understand how Transformers acquire and implement analogical reasoning, which is central to human intelligence but poorly understood in neural networks.

Method: Formalized analogical reasoning using category theory functors, created synthetic evaluation tasks, analyzed emergence under controlled settings, and conducted mechanistic analysis of Transformer components.

Result: Found analogical reasoning emerges through two key mechanisms: geometric alignment of relational structure in embedding space and application of functor-like operations within Transformers, with emergence highly sensitive to data, optimization, and scale.

Conclusion: Analogical reasoning can be concretely understood as mechanistically grounded phenomena in neural networks, moving from abstract cognitive notion to measurable implementation in Transformers.

Abstract: Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.

[415] TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin

Main category: cs.AI

TL;DR: TIDE is a diagnostic evaluation framework for analyzing Test-Time Improvement (TTI) in autonomous LLM agents, focusing on task optimization efficiency, behavior adaptation, and memory utility.

Details

Motivation: While autonomous LLM agents show improved performance through iterative environment interaction (TTI), the mechanisms behind their success/failure remain poorly understood, and existing metrics fail to capture key aspects like task optimization efficiency, behavior adaptation after errors, and working memory utility.

Method: Proposes TIDE (Test-time Improvement Diagnostic Evaluation), an agent-agnostic and environment-agnostic framework that decomposes TTI into three interconnected dimensions: (1) overall temporal dynamics of task completion, (2) identification of performance constraints from recursive looping behaviors, and (3) identification of constraints from burdensome accumulated memory.

Result: Through extensive experiments across diverse agents and environments, TIDE reveals that improving agent performance requires more than scaling internal reasoning - it calls for explicitly optimizing the interaction dynamics between the agent and the environment.

Conclusion: TIDE provides a comprehensive diagnostic framework for understanding TTI mechanisms in autonomous LLM agents, highlighting the importance of optimizing agent-environment interaction dynamics rather than just scaling internal reasoning capabilities.

Abstract: Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms under how and why TTI succeed or fail remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.

[416] SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang, Yue Fan, Xin Eric Wang

Main category: cs.AI

TL;DR: SafeGround is an uncertainty-aware framework for GUI grounding that provides risk-aware predictions with statistical false discovery rate control, improving reliability in automated GUI interactions.

Details

Motivation: GUI grounding translates natural language to screen coordinates for automated interaction, but incorrect predictions can lead to costly, irreversible actions (like erroneous payments), creating a need for reliable uncertainty quantification and risk control.

Method: SafeGround uses a distribution-aware uncertainty quantification method to capture spatial dispersion of stochastic samples from model outputs, then calibrates a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control.

Result: SafeGround’s uncertainty measure outperforms existing baselines in distinguishing correct from incorrect predictions, enables rigorous risk control, and improves system-level accuracy by up to 5.38% over Gemini-only inference across multiple GUI grounding models on the ScreenSpot-Pro benchmark.

Conclusion: SafeGround provides an effective uncertainty-aware framework for GUI grounding that enhances reliability through statistical risk control, addressing critical safety concerns in automated GUI interactions.

Abstract: Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibrations before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples from outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround on multiple GUI grounding models for the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and potentials of substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38% percentage points over Gemini-only inference.

[417] Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao

Main category: cs.AI

TL;DR: Thinking with Comics proposes using comics as an intermediate visual representation between images and videos for multimodal reasoning, offering temporal structure with lower computational cost than videos.

Details

Motivation: Current multimodal reasoning approaches have limitations: static images lack temporal structure, while videos introduce redundancy and high computational costs. There's a need for a visual medium that preserves temporal information while being efficient for reasoning.

Method: Proposes “Thinking with Comics” paradigm using comics as high information-density medium. Studies two reasoning paths based on comics and evaluates them on reasoning and long-context understanding tasks, comparing performance against image and video-based approaches.

Result: Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks while being substantially more efficient than Thinking with Video. Different comic narrative structures and styles consistently affect performance across tasks.

Conclusion: Comics serve as an effective intermediate visual representation that improves multimodal reasoning by balancing temporal structure preservation with computational efficiency, offering advantages over both static images and videos.

Abstract: Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.

cs.SD

[418] VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

Chengyuan Ma, Jiawei Jin, Ruijie Xiong, Chunxiang Jin, Canxiang Yan, Wenming Yang

Main category: cs.SD

TL;DR: VividVoice is a novel framework for scene-aware visually-driven speech synthesis that generates speech aligned with visual scenes, addressing data scarcity and modality decoupling challenges through a large-scale dataset and alignment module.

Details

Motivation: Existing speech generation models lack the ability to create immersive auditory experiences that align with the real physical world. The paper aims to address limitations in scene-aware speech synthesis where visual context should influence speech characteristics like timbre and environmental acoustics.

Method: Proposes VividVoice framework with two key components: 1) Vivid-210K dataset - a large-scale hybrid multimodal dataset created via programmatic pipeline establishing correlations between visual scenes, speaker identity, and audio; 2) D-MSVA alignment module - uses decoupled memory bank architecture and cross-modal hybrid supervision for fine-grained alignment from visual scenes to timbre and environmental acoustic features.

Result: Both subjective and objective experimental results show VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. The framework demonstrates strong scene-aware speech synthesis capabilities.

Conclusion: VividVoice successfully addresses scene-aware visually-driven speech synthesis challenges through a unified generative framework with novel dataset construction and alignment mechanisms, enabling immersive auditory experiences aligned with visual scenes.

Abstract: We introduce and define a novel task-Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.

[419] When Noise Lowers The Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

Xiaosha Li, Chun Liu, Ziyu Wang

Main category: cs.SD

TL;DR: Music LLMs show decreasing cross-entropy loss with corrupted music, undermining loss as quality metric; noise injection experiments reveal loss curve shape (especially peaks for short noise) better indicates musical quality discrimination.

Details

Motivation: Standard cross-entropy loss fails as quality indicator for music LLMs since it decreases with corrupted music, creating need for better evaluation methods that can distinguish high-quality compositions from "garbage music".

Method: Introduce noise injection experiments where controlled noise signals of varying lengths are injected into musical contexts, analyzing loss curve shapes; test with MusicGen models in audio waveform domain.

Result: Music LLMs respond more strongly to local texture-level disruptions than global semantic corruption; loss curve shape (especially sharp “Peak” area for short noise injections) serves as proxy for musical integrity discrimination.

Conclusion: Loss curve shape encodes critical information about generated content quality; profile-based evaluation offers label-free, model-intrinsic framework for assessing musical quality, enabling better training objectives and benchmarks.

Abstract: The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from “garbage music”. Curiously, we observe that the standard cross-entropy loss – a core training metric – often decrease when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce noise injection experiment, where controlled noise signal of varying lengths are injected into musical contexts. We hypothesize that a model’s loss reacting positively to these perturbations, specifically a sharp increase (“Peak” area) for short injection, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that Music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve – rather than its absolute value – encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality – opening the door to more principled training objectives and sharper benchmarks.

[420] Synthetic Data Augmentation for Medical Audio Classification: A Preliminary Evaluation

David McShannon, Anthony Mella, Nicholas Dietrich

Main category: cs.SD

TL;DR: Synthetic data augmentation (VAEs, GANs, diffusion models) for medical audio classification shows limited benefits, with only ensemble models achieving modest F1-score improvement from 0.645 to 0.664.

Details

Motivation: Medical audio classification faces challenges like low signal-to-noise ratios, subtle features, intra-class variability, class imbalance, and limited data. Synthetic augmentation has been proposed but shows inconsistent results in prior studies.

Method: Used a baseline deep CNN trained on imbalanced respiratory sound data (73%:27%). Tested three generative augmentation strategies: variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models under controlled conditions.

Result: Baseline without augmentation: F1-score 0.645. Individual augmentation strategies showed no performance gains, with some configurations showing neutral or degraded performance. Only ensemble of augmented models achieved modest improvement to F1-score 0.664.

Conclusion: Synthetic augmentation may not consistently enhance medical audio classification with standard CNNs. Future work should focus on task-specific data characteristics, model-augmentation compatibility, and evaluation frameworks for effective synthetic augmentation.

Abstract: Medical audio classification remains challenging due to low signal-to-noise ratios, subtle discriminative features, and substantial intra-class variability, often compounded by class imbalance and limited training data. Synthetic data augmentation has been proposed as a potential strategy to mitigate these constraints; however, prior studies report inconsistent methodological approaches and mixed empirical results. In this preliminary study, we explore the impact of synthetic augmentation on respiratory sound classification using a baseline deep convolutional neural network trained on a moderately imbalanced dataset (73%:27%). Three generative augmentation strategies (variational autoencoders, generative adversarial networks, and diffusion models) were assessed under controlled experimental conditions. The baseline model without augmentation achieved an F1-score of 0.645. Across individual augmentation strategies, performance gains were not observed, with several configurations demonstrating neutral or degraded classification performance. Only an ensemble of augmented models yielded a modest improvement in F1-score (0.664). These findings suggest that, for medical audio classification, synthetic augmentation may not consistently enhance performance when applied to a standard CNN classifier. Future work should focus on delineating task-specific data characteristics, model-augmentation compatibility, and evaluation frameworks necessary for synthetic augmentation to be effective in medical audio applications.

[421] Rethinking Music Captioning with Music Metadata LLMs

Irmak Bukey, Zhepei Wang, Chris Donahue, Nicholas J. Bryan

Main category: cs.SD

TL;DR: A metadata-based music captioning approach that predicts metadata from audio and uses LLMs to generate captions, offering flexibility in stylization and metadata imputation capabilities.

Details

Motivation: Music captioning requires high-quality training data which is scarce. Existing approaches use LLMs to synthesize captions from metadata, but this imposes fixed stylization and entangles factual information with style. The authors seek a more direct approach that separates metadata extraction from caption generation.

Method: Proposes metadata-based captioning: (1) Train a metadata prediction model to infer detailed music metadata from audio, (2) At inference time, convert metadata into expressive captions using pre-trained LLMs. This decouples factual information extraction from stylistic caption generation.

Result: The method achieves comparable performance to end-to-end captioners in less training time, offers flexibility to change stylization post-training, and enables metadata imputation/in-filling by prompting with audio and partial metadata.

Conclusion: Metadata-based captioning provides an efficient and flexible alternative to end-to-end approaches, separating factual extraction from stylistic generation while enabling powerful metadata organization capabilities.

Abstract: Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable performance in less training time over end-to-end captioners, (2) offers flexibility to easily change stylization post-training, enabling output captions to be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling–a common task for organizing music data.

[422] GRAM: Spatial general-purpose audio representations for real-world environments

Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden

Main category: cs.SD

Details

Conclusion: GRAM represents a significant advance toward robust spatial audio foundation models for real-world environments, addressing key limitations of current models.

[423] D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation From Lead sheet

Eunjin Choi, Hounsu Kim, Hayeon Bang, Taegyun Kwon, Juhan Nam

Main category: cs.SD

TL;DR: D3PIA is a discrete diffusion model for generating piano accompaniments from lead sheets using neighborhood attention for local alignment between melody/chords and accompaniment.

Details

Motivation: Generating piano accompaniments from symbolic music (melody and chords) is challenging, requiring faithful adherence to chord constraints while maintaining musical coherence. Existing approaches need better local alignment between lead sheet elements and generated accompaniment.

Method: Proposes D3PIA, a discrete diffusion-based model using Neighborhood Attention (NA) to encode lead sheet conditions and predict note states in piano accompaniment. NA enables efficient local contextual modeling by attending to nearby melody and chord conditions in piano-roll representation.

Result: On POP909 dataset, D3PIA preserves chord conditions more faithfully than continuous diffusion and Transformer baselines. Subjective listening tests show D3PIA generates more musically coherent accompaniments.

Conclusion: Discrete diffusion with neighborhood attention effectively addresses piano accompaniment generation by improving local alignment between lead sheet constraints and generated music, outperforming existing methods in both objective and subjective evaluations.

Abstract: Generating piano accompaniments in the symbolic music domain is a challenging task that requires producing a complete piece of piano music from given melody and chord constraints, such as those provided by a lead sheet. In this paper, we propose a discrete diffusion-based piano accompaniment generation model, D3PIA, leveraging local alignment between lead sheet and accompaniment in piano-roll representation. D3PIA incorporates Neighborhood Attention (NA) to both encode the lead sheet and condition it for predicting note states in the piano accompaniment. This design enhances local contextual modeling by efficiently attending to nearby melody and chord conditions. We evaluate our model using the POP909 dataset, a widely used benchmark for piano accompaniment generation. Objective evaluation results demonstrate that D3PIA preserves chord conditions more faithfully compared to continuous diffusion-based and Transformer-based baselines. Furthermore, a subjective listening test indicates that D3PIA generates more musically coherent accompaniments than the comparison models.

[424] PACE: Pretrained Audio Continual Learning

Chang Li, Kanglei Zhou, Liyuan Wang

Main category: cs.SD

TL;DR: PACE: A novel method for audio continual learning with pretrained models that addresses representation saturation and drift through regularized analytic classifiers and adaptive subspace-orthogonal PEFT.

Details

Motivation: Audio pretrained models are fragile in real-world settings with distribution shifts, and existing parameter-efficient fine-tuning strategies from vision don't transfer well to audio due to fundamental differences in how audio backbones process low-level spectral details versus structured semantics.

Method: Proposes PACE method with: 1) regularized analytic classifier with first-session adaptation, 2) adaptive subspace-orthogonal PEFT for multi-session adaptation, and 3) spectrogram-based boundary-aware perturbations to mitigate representation overlap.

Result: Experiments on six diverse audio CL benchmarks show PACE substantially outperforms state-of-the-art baselines, addressing both coarse-grained (representation saturation) and fine-grained (representation drift) scenarios.

Conclusion: PACE marks an important step toward robust and scalable audio continual learning with pretrained models by addressing unique audio-specific challenges in representation learning and semantic alignment.

Abstract: Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.

[425] CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering

Siyi Wang, Shihong Tan, Siyi Liu, Hong Jia, Gongping Huang, James Bailey, Ting Dang

Main category: cs.SD

TL;DR: First systematic analysis of activation steering for emotional control in hybrid TTS models, enabling composable mixed-emotion synthesis and text-emotion mismatch synthesis through quantitative steering framework and multi-rater evaluation.

Details

Motivation: Current expressive TTS systems enforce single utterance-level emotions, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. There's a need for more nuanced emotional control that can handle complex, conflicting affective cues in human speech.

Method: Introduces a quantitative, controllable activation steering framework for hybrid TTS models, using latent direction vectors to steer emotional expression. Includes systematic analysis of where steering should be applied within hybrid architectures and develops multi-rater evaluation protocols.

Result: Demonstrates that emotional prosody and expressive variability are primarily synthesized by the TTS language module rather than the flow-matching module. Provides a lightweight steering approach for generating natural, human-like emotional speech with composable mixed emotions.

Conclusion: Activation steering enables nuanced emotional control in TTS, allowing for complex mixed-emotion synthesis and text-emotion mismatch handling, with the language module playing a crucial role in emotional prosody generation.

Abstract: Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.

[426] EarResp-ANS : Audio-Based On-Device Respiration Rate Estimation on Earphones with Adaptive Noise Suppression

Michael Küttner, Valeria Zitz, Supraja Ramesh, Michael Beigl, Tobias Röddiger

Main category: cs.SD

TL;DR: EarResp-ANS enables real-time respiratory rate monitoring on commercial earphones using adaptive noise suppression without neural networks.

Details

Motivation: Respiratory rate is crucial for clinical and mental health assessment but rarely monitored in daily life due to lack of unobtrusive sensing technologies. In-ear audio sensing is promising but existing approaches fail under real-world noise or require computationally expensive models.

Method: Uses LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components. The system operates fully on-device without neural networks or audio streaming, addressing energy and privacy constraints of wearables.

Result: Achieved robust performance with global MAE of 0.84 CPM (reduced to 0.47 CPM via automatic outlier rejection) in study with 18 participants under realistic acoustic conditions including music, cafeteria noise, and white noise up to 80 dB SPL. Operates with less than 2% processor load directly on earphones.

Conclusion: EarResp-ANS enables practical, real-time respiratory rate monitoring on commercial earphones by addressing noise robustness, computational efficiency, and privacy concerns through adaptive noise suppression without neural networks.

Abstract: Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM , reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.

[427] Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion

Oscar Ovanger, Levi Harris, Timothy H. Keitt

Main category: cs.SD

TL;DR: FINCH is an adaptive log-linear evidence fusion framework that combines audio classifiers with spatiotemporal context predictors using per-sample gating to estimate contextual reliability, improving bioacoustic classification performance while maintaining audio-only fallback.

Details

Motivation: Bioacoustic classification systems have multiple evidence sources (audio signals and spatiotemporal context) that vary in reliability across inputs, but current approaches lack adaptive fusion methods that can properly weight these heterogeneous evidence sources while maintaining robustness.

Method: FINCH uses a log-linear fusion framework that integrates pre-trained audio classifiers with structured spatiotemporal predictors. It learns a per-sample gating function that estimates contextual reliability from uncertainty and informativeness statistics, bounding contextual influence and containing the audio-only classifier as a special case.

Result: FINCH outperforms fixed-weight fusion and audio-only baselines across benchmarks, achieving state-of-the-art on CBI and competitive/improved performance on BirdSet subsets. It improves robustness and error trade-offs even when contextual information is weak in isolation.

Conclusion: The framework provides a lightweight, interpretable, evidence-based approach for multimodal fusion that maintains risk containment through explicit audio-only fallback mechanisms, offering practical advantages for real-world bioacoustic classification systems.

Abstract: Many machine learning systems have access to multiple sources of evidence for the same prediction target, yet these sources often differ in reliability and informativeness across inputs. In bioacoustic classification, species identity may be inferred both from the acoustic signal and from spatiotemporal context such as location and season; while Bayesian inference motivates multiplicative evidence combination, in practice we typically only have access to discriminative predictors rather than calibrated generative models. We introduce \textbf{F}usion under \textbf{IN}dependent \textbf{C}onditional \textbf{H}ypotheses (\textbf{FINCH}), an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH learns a per-sample gating function that estimates the reliability of contextual information from uncertainty and informativeness statistics. The resulting fusion family \emph{contains} the audio-only classifier as a special case and explicitly bounds the influence of contextual evidence, yielding a risk-contained hypothesis class with an interpretable audio-only fallback. Across benchmarks, FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs even when contextual information is weak in isolation. We achieve state-of-the-art performance on CBI and competitive or improved performance on several subsets of BirdSet using a lightweight, interpretable, evidence-based approach. Code is available: \texttt{\href{https://anonymous.4open.science/r/birdnoise-85CD/README.md}{anonymous-repository}}

[428] VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

Ting-Kang Wang, Yueh-Po Peng, Li Su, Vincent K. M. Cheung

Main category: cs.SD

TL;DR: VioPTT is a lightweight cascade model that transcribes violin playing techniques along with pitch and timing, using a novel synthetic dataset called MOSA-VPT to achieve state-of-the-art performance.

Details

Motivation: Most automatic music transcription models only capture pitch and timing information, missing crucial expressive elements like violin playing techniques that create distinctive timbres and emotional impact.

Method: Proposes VioPTT, a lightweight cascade model that jointly transcribes violin playing techniques with pitch onset/offset. Uses MOSA-VPT, a novel high-quality synthetic violin playing technique dataset to train the model without manual annotations.

Result: The model demonstrates strong generalization to real-world note-level violin technique recordings and achieves state-of-the-art transcription performance.

Conclusion: VioPTT is the first unified framework to combine violin transcription and playing technique prediction, addressing a significant gap in expressive music transcription.

Abstract: While automatic music transcription is well-established in music information retrieval, most models are limited to transcribing pitch and timing information from audio, and thus omit crucial expressive and instrument-specific nuances. One example is playing technique on the violin, which affords its distinct palette of timbres for maximal emotional impact. Here, we propose VioPTT (Violin Playing Technique-aware Transcription), a lightweight cascade model that directly transcribes violin playing technique in addition to pitch onset and offset. Furthermore, we release MOSA-VPT, a novel, high-quality synthetic violin playing technique dataset to circumvent the need for manually labeled annotations. Leveraging this dataset, our model demonstrated strong generalization to real-world note-level violin technique recordings in addition to achieving state-of-the-art transcription performance. To our knowledge, VioPTT is the first to jointly combine violin transcription and playing technique prediction within a unified framework.

[429] Bayesian Speech Synthesizers Can Learn from Multiple Teachers

Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiang Li, Wen Wu, Chao Zhang

Main category: cs.SD

TL;DR: BELLE is a Bayesian evidential learning framework for TTS that models speech uncertainty using Normal-Inverse-Gamma distributions and enables high-quality streaming generation with reduced data requirements.

Details

Motivation: Current TTS systems oversimplify the inherently "one-to-many" mapping problem as deterministic regression, ignoring the intrinsic uncertainty and dynamic variability of natural speech. While autoregressive models show promise, they typically use fixed-variance priors that constrain generation to static point estimates.

Method: Proposes BELLE framework that shifts from deterministic prediction to Bayesian inference using Normal-Inverse-Gamma distributions to capture data-dependent aleatoric uncertainty. Introduces a “one-to-many” training strategy using synthetic samples as statistical support sets to enable accurate variance estimation on single-reference datasets without increasing model parameters or inference latency.

Result: BELLE trained on only ~5k hours of data outperforms leading open-source models trained on 50k hours, achieving a 25.8% relative WER reduction. The framework naturally supports high-quality streaming generation.

Conclusion: BELLE successfully bridges the gap between deterministic TTS approaches and the inherent uncertainty of speech generation, enabling more natural and robust speech synthesis through principled Bayesian modeling without computational overhead.

Abstract: Text-to-Speech (TTS) is inherently a “one-to-many” mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a “one-to-many” training strategy that leverages synthetic samples as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitating teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (achieving a 25.8% relative WER reduction) and naturally supports high-quality streaming generation. Audio samples are available at https://belletts.github.io/Belle/.

[430] Do Models Hear Like Us? Probing the Representational Alignment of Audio LLMs and Naturalistic EEG

Haoyun Yang, Xin Xiao, Jiang Zhong, Yu Tian, Dong Xiaohua, Yu Mao, Hao Wu, Kaiwen Wei

Main category: cs.SD

TL;DR: Audio LLMs show varying neural alignment with EEG signals across metrics, revealing depth-dependent patterns and affective prosody effects on representational similarity.

Details

Motivation: While Audio LLMs demonstrate strong speech-language integration capabilities, their alignment with human neural dynamics during natural listening remains unexplored. The paper aims to systematically examine how Audio LLM representations correspond to EEG signals.

Method: Systematically examined layer-wise representational alignment between 12 open-source Audio LLMs and EEG signals across 2 datasets. Used 8 similarity metrics including Spearman-based Representational Similarity Analysis (RSA) to characterize within-sentence representational geometry.

Result: Three key findings: (1) Rank-dependence split where model rankings vary substantially across different similarity metrics; (2) Spatio-temporal alignment patterns with depth-dependent alignment peaks and increased RSA in 250-500 ms window (N400-related); (3) Affective dissociation where negative prosody reduces geometric similarity but enhances covariance-based dependence.

Conclusion: The findings provide new neurobiological insights into the representational mechanisms of Audio LLMs, revealing complex alignment patterns with human neural dynamics that vary by metric, model depth, and emotional prosody.

Abstract: Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and Electroencephalogram (EEG) signals across 2 datasets. Specifically, we employ 8 similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals 3 key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.

cs.LG

[431] UNSO: Unified Newton Schulz Orthogonalization

Chen Hu, Qianxi Zhao, Yuming Li, Mingyu Zhou, Xiyin Li

Main category: cs.LG

TL;DR: UNSO improves Newton-Schulz orthogonalization by consolidating iterations into a unified framework with learnable coefficients, avoiding polynomial expansion and reducing computational burden.

Details

Motivation: The conventional Newton-Schulz iteration suffers from inefficiency and instability, and existing improvements still follow the iterative paradigm that increases computation burden due to repeated matrix products along long dimensions.

Method: Consolidates iterative structure into Unified Newton-Schulz Orthogonalization (UNSO) framework, avoids polynomial expansion, evaluates role of each matrix power, removes insignificant terms, and provides recommended polynomial with learnable coefficients that are optimized.

Result: Achieves outstanding performance with stable convergence while reducing computational burden compared to conventional NS iterations.

Conclusion: UNSO provides an efficient and stable alternative to conventional Newton-Schulz orthogonalization through a unified framework with optimized learnable coefficients.

Abstract: The Newton-Schulz (NS) iteration has gained increasing interest for its role in the Muon optimizer and the Stiefel manifold. However, the conventional NS iteration suffers from inefficiency and instability. Although various improvements have been introduced to NS iteration, they fail to deviate from the conventional iterative paradigm, which could increase computation burden largely due to the matrix products along the long dimension repeatedly. To address this, we consolidate the iterative structure into a unified framework, named Unified Newton-Schulz Orthogonalization (UNSO). To do so, we could avoid a polynomial expansion. Instead, we evaluate the role of each matrix power, remove the insignificant terms, and provide a recommended polynomial with learnable coefficients. These learnable coefficients are then optimized, and achieve an outstanding performance with stable convergence. The code of our method is available: https://github.com/greekinRoma/Unified_Newton_Schulz_Orthogonalization.

[432] Augmenting Parameter-Efficient Pre-trained Language Models with Large Language Models

Saurabh Anand, Shubham Malaviya, Manish Shukla, Sachin Lodha

Main category: cs.LG

TL;DR: Parameter-efficient fine-tuning of pre-trained language models combined with LLMs as data-labeling tools and fallback mechanisms improves reliability and robustness for cybersecurity applications.

Details

Motivation: Cybersecurity AI models face challenges with data drift and scarce labeled data, leading to frequent updates and overfitting risks. The paper aims to address these issues by leveraging parameter-efficient fine-tuning and large language models.

Method: Two main strategies: 1) Using LLMs as data-labeling tools to generate labels for unlabeled data, 2) Using LLMs as fallback mechanisms for low-confidence predictions. Combined with parameter-efficient fine-tuning techniques (compacters with layer freezing strategies) for pre-trained language models.

Result: Comprehensive experimental analysis on cybersecurity downstream tasks shows improved reliability and robustness of models, making them more suitable for real-world cybersecurity applications.

Conclusion: Combining parameter-efficient pre-trained models with large language models enhances model reliability and robustness for cybersecurity applications, addressing data drift and labeling challenges.

Abstract: Training AI models in cybersecurity with help of vast datasets offers significant opportunities to mimic real-world behaviors effectively. However, challenges like data drift and scarcity of labelled data lead to frequent updates of models and the risk of overfitting. To address these challenges, we used parameter-efficient fine-tuning techniques for pre-trained language models wherein we combine compacters with various layer freezing strategies. To enhance the capabilities of these pre-trained language models, in this work we introduce two strategies that use large language models. In the first strategy, we utilize large language models as data-labelling tools wherein they generate labels for unlabeled data. In the second strategy, large language modes are utilized as fallback mechanisms for predictions having low confidence scores. We perform comprehensive experimental analysis on the proposed strategies on different downstream tasks specific to cybersecurity domain. We empirically demonstrate that by combining parameter-efficient pre-trained models with large language models, we can improve the reliability and robustness of models, making them more suitable for real-world cybersecurity applications.

[433] Unveiling Covert Toxicity in Multimodal Data via Toxicity Association Graphs: A Graph-Based Metric and Interpretable Detection Framework

Guanzong Wu, Zihao Zhu, Siwei Lyu, Baoyuan Wu

Main category: cs.LG

TL;DR: A novel framework for detecting covert toxicity in multimodal data using Toxicity Association Graphs (TAGs) and a new Multimodal Toxicity Covertness (MTC) metric, with a specialized benchmark dataset for evaluation.

Details

Motivation: Current toxicity detection methods struggle with multimodal data where harmful meanings emerge only when modalities are combined, requiring new approaches to identify covert toxicity that hides beneath seemingly benign individual modalities.

Method: Proposes Toxicity Association Graphs (TAGs) to model semantic associations between innocuous entities and latent toxic implications, introduces Multimodal Toxicity Covertness (MTC) metric, and creates Covert Toxic Dataset benchmark for evaluation.

Result: The approach outperforms existing methods across both low- and high-covertness toxicity regimes while providing interpretable and auditable detection outcomes.

Conclusion: The framework advances explainable multimodal toxicity detection and establishes foundations for future context-aware interpretable approaches in multimodal content analysis.

Abstract: Detecting toxicity in multimodal data remains a significant challenge, as harmful meanings often lurk beneath seemingly benign individual modalities: only emerging when modalities are combined and semantic associations are activated. To address this, we propose a novel detection framework based on Toxicity Association Graphs (TAGs), which systematically model semantic associations between innocuous entities and latent toxic implications. Leveraging TAGs, we introduce the first quantifiable metric for hidden toxicity, the Multimodal Toxicity Covertness (MTC), which measures the degree of concealment in toxic multimodal expressions. By integrating our detection framework with the MTC metric, our approach enables precise identification of covert toxicity while preserving full interpretability of the decision-making process, significantly enhancing transparency in multimodal toxicity detection. To validate our method, we construct the Covert Toxic Dataset, the first benchmark specifically designed to capture high-covertness toxic multimodal instances. This dataset encodes nuanced cross-modal associations and serves as a rigorous testbed for evaluating both the proposed metric and detection framework. Extensive experiments demonstrate that our approach outperforms existing methods across both low- and high-covertness toxicity regimes, while delivering clear, interpretable, and auditable detection outcomes. Together, our contributions advance the state of the art in explainable multimodal toxicity detection and lay the foundation for future context-aware and interpretable approaches. Content Warning: This paper contains examples of toxic multimodal content that may be offensive or disturbing to some readers. Reader discretion is advised.

[434] Sparse Adapter Fusion for Continual Learning in NLP

Min Zeng, Xi Chen, Haiqin Yang, Yike Guo

Main category: cs.LG

TL;DR: SAFM is a sparse adapter fusion method for continual learning in NLP that dynamically fuses old and new adapters to minimize parameter growth while preventing catastrophic forgetting.

Details

Motivation: Existing continual learning methods face challenges: inefficient parameter reuse across tasks (risking catastrophic forgetting for dissimilar tasks), and unnecessary introduction of new parameters for each task (hampering knowledge sharing among similar tasks).

Method: SAFM operates in two stages: 1) Decision stage determines whether to incorporate new adapter, reuse existing one, or add empty adapter, with architecture search prioritizing reuse or empty adapters; 2) Tuning stage uses layer-wise loss to encourage differentiation between adapters and capture task knowledge.

Result: SAFM outperforms state-of-the-art methods, achieving comparable performance while utilizing less than 60% of the parameters.

Conclusion: SAFM effectively addresses continual learning challenges by dynamically fusing adapters, minimizing parameter consumption while maximizing reuse and preventing catastrophic forgetting.

Abstract: Continual learning in natural language processing plays a crucial role in adapting to evolving data and preventing catastrophic forgetting. Despite significant progress, existing methods still face challenges, such as inefficient parameter reuse across tasks, risking catastrophic forgetting when tasks are dissimilar, and the unnecessary introduction of new parameters for each task, which hampers knowledge sharing among similar tasks. To tackle these issues, we propose a Sparse Adapter Fusion Method (SAFM), which dynamically fuses old and new adapters to address these challenges. SAFM operates in two stages: the decision stage and the tuning stage. In the decision stage, SAFM determines whether to incorporate a new adapter, reuse an existing one, or add an empty adapter. The architecture search procedure, designed to prioritize reusing or adding empty adapters, minimizes parameter consumption and maximizes reuse. In the tuning stage, SAFM especially facilitates a layer-wise loss to encourage differentiation between adapters, effectively capturing knowledge within the same task. Experimental results consistently show that SAFM outperforms state-of-the-art (SOTA) methods, achieving comparable performance while utilizing less than 60% of the parameters.

[435] Learning ORDER-Aware Multimodal Representations for Composite Materials Design

Xinyao Li, Hangwei Qian, Jingjing Li, Ivor Tsang

Main category: cs.LG

TL;DR: ORDER is a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations, enabling effective learning in continuous design spaces under extreme data scarcity.

Details

Motivation: Existing AI methods for materials discovery work well for crystalline/polymer systems with discrete graph representations, but break down for composite materials with continuous, nonlinear design spaces lacking well-defined graph structures. Current multimodal frameworks fail to address the highly continuous composite design space under extreme data scarcity.

Method: ORDinal-aware imagE-tabulaR alignment (ORDER) - a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations. It ensures materials with similar target properties occupy nearby regions in latent space, preserving the continuous nature of composite properties and enabling meaningful interpolation between sparsely observed designs.

Result: ORDER achieves consistent improvements over state-of-the-art multimodal baselines across property prediction, cross-modal retrieval, and microstructure generation tasks on both a public Nanofiber-enforced composite dataset and an internally curated carbon fiber T700 dataset.

Conclusion: ORDER successfully addresses the challenge of learning in continuous composite material design spaces under data scarcity by establishing ordinality as a core principle, enabling effective multimodal learning for composite materials where traditional graph-based approaches fail.

Abstract: Artificial intelligence (AI) has shown remarkable success in materials discovery and property prediction, particularly for crystalline and polymer systems where material properties and structures are dominated by discrete graph representations. Such graph-central paradigm breaks down on composite materials, which possess continuous and nonlinear design spaces that lack well-defined graph structures. General composite descriptors, e.g., fiber volume and misalignment angle, cannot fully capture the fiber distributions that fundamentally determine microstructural characteristics, necessitating the integration of heterogeneous data sources through multimodal learning. Existing alignment-oriented multimodal frameworks have proven effective on abundant crystal or polymer data under discrete, unique graph-property mapping assumptions, but fail to address the highly continuous composite design space under extreme data scarcity. In this work, we introduce ORDinal-aware imagE-tabulaR alignment (ORDER), a multimodal pretraining framework that establishes ordinality as a core principle for composite material representations. ORDER ensures that materials with similar target properties occupy nearby regions in the latent space, which effectively preserves the continuous nature of composite properties and enables meaningful interpolation between sparsely observed designs. We evaluate ORDER on a public Nanofiber-enforced composite dataset and an internally curated dataset that simulates the construction of carbon fiber T700 with diverse fiber distributions. ORDER achieves consistent improvements over state-of-the-art multimodal baselines across property prediction, cross-modal retrieval, and microstructure generation tasks.

[436] What Drives Length of Stay After Elective Spine Surgery? Insights from a Decade of Predictive Modeling

Ha Na Cho, Seungmin Jeong, Yawen Guo, Alexander Lopez, Hansen Bow, Kai Zheng

Main category: cs.LG

TL;DR: Systematic review of computational methods for predicting length of stay after elective spine surgery, finding machine learning models outperform traditional statistical approaches with AUCs up to 0.99.

Details

Motivation: Predicting length of stay after elective spine surgery is essential for optimizing patient outcomes and hospital resource utilization, but there's a need to synthesize existing computational methods and their performance.

Method: PRISMA-guided systematic review of 1,263 studies from PubMed, Google Scholar, and ACM Digital Library (2015-2024), identifying 29 eligible studies that applied statistical or machine learning models to predict length of stay for elective spine surgery patients.

Result: Machine learning models (logistic regression, random forest, boosting algorithms, neural networks) consistently outperformed traditional statistical models, with AUCs ranging from 0.94-0.99. K-Nearest Neighbors and Naive Bayes achieved top performance in some studies. Key predictors included age, comorbidities, BMI, surgery type/duration, and spinal levels.

Conclusion: Machine learning models show strong potential for length of stay prediction in elective spine surgery, but lack of standardization and external validation limits clinical utility. Future work needs standardized outcome definitions and transparent reporting.

Abstract: Objective: Predicting length of stay after elective spine surgery is essential for optimizing patient outcomes and hospital resource use. This systematic review synthesizes computational methods used to predict length of stay in this patient population, highlighting model performance and key predictors. Methods: Following PRISMA guidelines, we systematically searched PubMed, Google Scholar, and ACM Digital Library for studies published between December 1st, 2015, and December 1st, 2024. Eligible studies applied statistical or machine learning models to predict length of stay for elective spine surgery patients. Three reviewers independently screened studies and extracted data. Results: Out of 1,263 screened studies, 29 studies met inclusion criteria. Length of stay was predicted as a continuous, binary, or percentile-based outcome. Models included logistic regression, random forest, boosting algorithms, and neural networks. Machine learning models consistently outperformed traditional statistical models, with AUCs ranging from 0.94 to 0.99. K-Nearest Neighbors and Naive Bayes achieved top performance in some studies. Common predictors included age, comorbidities (notably hypertension and diabetes), BMI, type and duration of surgery, and number of spinal levels. However, external validation and reporting practices varied widely across studies. Discussion: There is growing interest in artificial intelligence and machine learning in length of stay prediction, but lack of standardization and external validation limits clinical utility. Future studies should prioritize standardized outcome definitions and transparent reporting needed to advance real-world deployment. Conclusion: Machine learning models offer strong potential for length of stay prediction after elective spine surgery, highlighting their potential for improving discharge planning and hospital resource management.

[437] ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Xiaoce Wang, Guibin Zhang, Junzhe Li, Jinzhe Tu, Chun Li, Ming Li

Main category: cs.LG

TL;DR: ToolTok: A multi-step pathfinding paradigm for GUI agents using progressive tool usage with semantic anchoring and curriculum learning, achieving strong performance with minimal training data.

Details

Motivation: Existing GUI agent models struggle with generalization to varying input resolutions/aspect ratios (coordinate-based) or suffer from data scarcity (coordinate-free). Need a robust approach that works with limited supervision.

Method: Proposes ToolTok with: 1) Multi-step pathfinding where operations are sequences of progressive tool usage, 2) Tools represented as learnable token embeddings, 3) Semantic anchoring mechanism to ground tools with related concepts as inductive bias, 4) Easy-to-hard curriculum learning with three tasks (token definition QA, text-guided tool selection, simplified visual pathfinding).

Result: Achieves superior performance among 4B-scale models, competitive with 235B model, using <1% of training data required by other approaches. Shows strong generalization across unseen scenarios.

Conclusion: ToolTok provides an effective paradigm for GUI agents that addresses generalization and data scarcity issues through semantic anchoring and curriculum learning, enabling efficient learning with minimal supervision.

Abstract: Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open-source at https://github.com/ZephinueCode/ToolTok.

[438] Automated Dysphagia Screening Using Noninvasive Neck Acoustic Sensing

Jade Chng, Rong Xing, Yunfei Luo, Kristen Linnemeyer-Risser, Tauhidur Rahman, Andrew Yousef, Philip A Weissbrod

Main category: cs.LG

TL;DR: Automated dysphagia detection using noninvasive acoustic sensing from neck during swallowing, achieving 0.904 AUC-ROC for abnormality detection.

Details

Motivation: Current dysphagia diagnostic methods rely on invasive procedures or radiographic imaging, creating need for portable, noninvasive alternatives for early detection and timely intervention.

Method: Portable acoustic sensing captures subtle neck signals during swallowing tasks, combined with applied machine learning to identify patterns associated with abnormal physiological conditions.

Result: Achieves promising abnormality detection performance with AUC-ROC of 0.904 under 5 independent train-test splits, demonstrating feasibility of the approach.

Conclusion: Noninvasive acoustic sensing shows potential as practical, scalable tool for pharyngeal health monitoring and dysphagia detection.

Abstract: Pharyngeal health plays a vital role in essential human functions such as breathing, swallowing, and vocalization. Early detection of swallowing abnormalities, also known as dysphagia, is crucial for timely intervention. However, current diagnostic methods often rely on radiographic imaging or invasive procedures. In this study, we propose an automated framework for detecting dysphagia using portable and noninvasive acoustic sensing coupled with applied machine learning. By capturing subtle acoustic signals from the neck during swallowing tasks, we aim to identify patterns associated with abnormal physiological conditions. Our approach achieves promising test-time abnormality detection performance, with an AUC-ROC of 0.904 under 5 independent train-test splits. This work demonstrates the feasibility of using noninvasive acoustic sensing as a practical and scalable tool for pharyngeal health monitoring.

[439] GraphDancer: Training LLMs to Explore and Reason over Graphs via Curriculum Reinforcement Learning

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

Main category: cs.LG

TL;DR: GraphDancer: RL framework teaching LLMs to navigate heterogeneous knowledge graphs through interleaved reasoning and function execution, with graph-aware curriculum for effective training on moderate-sized models.

Details

Motivation: Real-world knowledge sources are often organized as heterogeneous graphs rather than plain text, posing challenges for LLMs that need to navigate structured relations and perform multi-hop evidence aggregation through iterative information seeking.

Method: Proposes GraphDancer, a reinforcement learning framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. Uses a graph-aware curriculum that schedules training by structural complexity of information-seeking trajectories with an easy-to-hard biased sampler.

Result: Despite using only a 3B backbone, GraphDancer outperforms baselines with 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills on multi-domain benchmark.

Conclusion: GraphDancer effectively enables LLMs to navigate heterogeneous knowledge graphs through RL training with structural curriculum, achieving strong performance and generalization with moderate-sized models.

Abstract: Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graph-structured knowledge poses two key challenges: (1) navigating structured, schema-defined relations requires precise function calls rather than similarity-based retrieval, and (2) answering complex questions often demands multi-hop evidence aggregation through iterative information seeking. We propose GraphDancer, a reinforcement learning (RL) framework that teaches LLMs to navigate graphs by interleaving reasoning and function execution. To make RL effective for moderate-sized LLMs, we introduce a graph-aware curriculum that schedules training by the structural complexity of information-seeking trajectories using an easy-to-hard biased sampler. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with either a 14B backbone or GPT-4o-mini, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code and models can be found at https://yuyangbai.com/graphdancer/ .

[440] Label Curation Using Agentic AI

Subhodeep Ghosh, Bayan Divaaniaazar, Md Ishat-E-Rabban, Spencer Clarke, Senjuti Basu Roy

Main category: cs.LG

TL;DR: AURA is an agentic AI framework that coordinates multiple AI agents to generate and validate multimodal data annotations without ground truth, using probabilistic modeling to infer true labels and annotator reliability.

Details

Motivation: Data annotation for supervised learning faces challenges with accuracy, bias, and scalability as datasets grow in size and modality. Traditional human-centric pipelines are costly, slow, and prone to annotator variability, motivating the need for reliability-aware automated annotation systems.

Method: AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. It adapts a classical probabilistic model that jointly infers latent true labels and annotator reliability via confusion matrices, using Expectation-Maximization to reconcile conflicting annotations and aggregate noisy predictions.

Result: Across four benchmark datasets, AURA achieves accuracy improvements of up to 5.8% over baseline. In challenging settings with poor quality annotators, the improvement reaches up to 50% over baseline. AURA also accurately estimates annotator reliability without requiring pre-validation steps.

Conclusion: AURA provides an effective agentic AI framework for large-scale, multimodal data annotation that improves annotation accuracy while simultaneously estimating annotator reliability, addressing key challenges in modern supervised learning pipelines.

Abstract: Data annotation is essential for supervised learning, yet producing accurate, unbiased, and scalable labels remains challenging as datasets grow in size and modality. Traditional human-centric pipelines are costly, slow, and prone to annotator variability, motivating reliability-aware automated annotation. We present AURA (Agentic AI for Unified Reliability Modeling and Annotation Aggregation), an agentic AI framework for large-scale, multi-modal data annotation. AURA coordinates multiple AI agents to generate and validate labels without requiring ground truth. At its core, AURA adapts a classical probabilistic model that jointly infers latent true labels and annotator reliability via confusion matrices, using Expectation-Maximization to reconcile conflicting annotations and aggregate noisy predictions. Across the four benchmark datasets evaluated, AURA achieves accuracy improvements of up to 5.8% over baseline. In more challenging settings with poor quality annotators, the improvement is up to 50% over baseline. AURA also accurately estimates the reliability of annotators, allowing assessment of annotator quality even without any pre-validation steps.

[441] Scaled Dot-Product Attention implements projection of inputs onto a common surface

Terence D Sanger

Main category: cs.LG

TL;DR: SDPA can be reformulated as projecting input vectors onto a context-dependent surface, revealing nonlinear dependencies and enabling faster computation and potential extensions.

Details

Motivation: To provide a mathematical signal processing interpretation of scaled dot-product attention that reconciles it with traditional methods and offers new insights beyond the database-inspired "query, key, value" framework.

Method: Reformulate SDPA in a mathematically equivalent form as projection of input vectors onto a common surface determined by the inputs themselves, revealing time-dependent and context-dependent nonlinear dependencies.

Result: The rewritten form enables increased speed for both feedforward and learning algorithms, suggests potential extensions, and provides a new interpretation of SDPA’s role in language processing as finding time-dependent contextual meaning.

Conclusion: SDPA discovers nonlinear dependencies in input data that are time-dependent and context-dependent, providing strong justification for its use in time-series data with time-varying local nonlinear dependencies, with implications for both computational efficiency and theoretical understanding.

Abstract: Scaled dot-product attention (SDPA) is a fundamental component responsible for the success of large-language models and other nonlinear signal processing applications. The rationale for SDPA has been based upon “query, key, value” concepts borrowed from database theory, but these concepts are difficult to reconcile with standard methods in mathematical signal processing. We show that SDPA can be rewritten in a different but mathematically equivalent form as a projection of the input vectors onto a common surface determined by the inputs themselves. Therefore SDPA discovers nonlinear dependencies in the input that are time-dependent and context-dependent. The rewritten form of SDPA permits increased speed of both feedforward and learning algorithms, but more importantly suggests potential extensions. In the context of language, we re-interpret the role of SDPA as finding a time-dependent contextual meaning determined by the surface on which the set of input vectors lies. Input token embeddings are then modified by the local context surface. This interpretation differs substantially from the concept of “self-attention”, and provides a strong justification for the use of SDPA for time-series data with time-varying local nonlinear dependencies.

[442] IceBench-S2S: A Benchmark of Deep Learning for Challenging Subseasonal-to-Seasonal Daily Arctic Sea Ice Forecasting in Deep Latent Space

Jingyi Xu, Shengnan Wang, Weidong Yang, Siwei Tu, Lei Bai, Ben Fei

Main category: cs.LG

TL;DR: IceBench-S2S is a comprehensive benchmark for evaluating deep learning approaches in forecasting Arctic sea ice concentration at subseasonal-to-seasonal (S2S) scale (up to 180 days), addressing the gap between current DL models’ limited daily forecasting capabilities and operational needs.

Details

Motivation: Current deep learning sea ice forecasting models are limited to daily subseasonal scales (up to 6 months for monthly averages), which hinders real-world applications like Arctic transportation planning. There's a need to extend daily forecasts from subseasonal to seasonal (S2S) scale for operational use.

Method: Proposes IceBench-S2S benchmark with a generalized framework that compresses spatial features of daily sea ice data into a deep latent space, then uses DL-based forecasting backbones to model temporally concatenated deep features for S2S-scale predictions (180-day periods).

Result: Provides a unified training and evaluation pipeline for different DL backbones, along with practical guidance for model selection in polar environmental monitoring tasks.

Conclusion: IceBench-S2S bridges the gap between current DL forecasting lead times and operational S2S scale needs, enabling better evaluation and development of sea ice forecasting models for real-world applications.

Abstract: Arctic sea ice plays a critical role in regulating Earth’s climate system, significantly influencing polar ecological stability and human activities in coastal regions. Recent advances in artificial intelligence have facilitated the development of skillful pan-Arctic sea ice forecasting systems, where data-driven approaches showcase tremendous potential to outperform conventional physics-based numerical models in terms of accuracy, computational efficiency and forecasting lead times. Despite the latest progress made by deep learning (DL) forecasting models, most of their skillful forecasting lead times are confined to daily subseasonal scale and monthly averaged values for up to six months, which drastically hinders their deployment for real-world applications, e.g., maritime routine planning for Arctic transportation and scientific investigation. Extending daily forecasts from subseasonal to seasonal (S2S) scale is scientifically crucial for operational applications. To bridge the gap between the forecasting lead time of current DL models and the significant daily S2S scale, we introduce IceBench-S2S, the first comprehensive benchmark for evaluating DL approaches in mitigating the challenge of forecasting Arctic sea ice concentration in successive 180-day periods. It proposes a generalized framework that first compresses spatial features of daily sea ice data into a deep latent space. The temporally concatenated deep features are subsequently modeled by DL-based forecasting backbones to predict the sea ice variation at S2S scale. IceBench-S2S provides a unified training and evaluation pipeline for different backbones, along with practical guidance for model selection in polar environmental monitoring tasks.

[443] IMU-1: Sample-Efficient Pre-training of Small Language Models

George Grigorev

Main category: cs.LG

TL;DR: IMU-1 is a 430M-parameter language model that achieves benchmark performance comparable to models trained on 56x more data through optimized training techniques and architectural improvements.

Details

Motivation: To develop an efficient language model that achieves state-of-the-art performance with significantly less training data and computational resources than current models, enabling more accessible and reproducible language model development.

Method: Combines architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA, trained on 72B tokens.

Result: IMU-1 approaches benchmark performance of models trained on 56x more data (approximately 4 trillion tokens), demonstrating significant efficiency improvements in language model training.

Conclusion: The paper presents a validated training recipe that enables efficient language model development with reduced data requirements, providing code, weights, and data for reproduction to advance open research in language modeling.

Abstract: We present IMU-1, a 430M-parameter language model trained on 72B tokens that approaches the benchmark performance of models trained on 56x more data. We describe a validated training recipe combining recent architectural interventions (QK-norm attention, per-head gating, value residuals, LayerNorm scaling) with optimization advances (NorMuon with cautious weight decay, muP parametrization) and a three-stage training schedule with post-hoc checkpoint EMA. We provide ablations for each component and release code, weights and data to enable reproduction: https://huggingface.co/thepowerfuldeez/imu1_base

[444] The Alignment Curse: Cross-Modality Jailbreak Transfer in Omni-Models

Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, Adel Bibi

Main category: cs.LG

TL;DR: Text-to-audio jailbreak transfer attacks exploit modality alignment vulnerabilities in multimodal models, showing comparable or better performance than audio-only attacks.

Details

Motivation: The paper addresses the underexplored area of cross-modality jailbreak transfer from text to audio in multimodal models, motivated by semantic similarity between modalities and mature textual jailbreak methods.

Method: Analyzes connection between modality alignment and cross-modality jailbreak transfer, then empirically evaluates textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models.

Result: Text-transferred audio jailbreaks perform comparably to or better than audio-based jailbreaks, show strong cross-model transferability, and remain effective under audio-only access threat models.

Conclusion: Text-transferred audio jailbreaks serve as powerful baselines for audio red-teaming, revealing vulnerabilities in multimodal alignment that can propagate textual weaknesses to audio.

Abstract: Recent advances in end-to-end trained omni-models have significantly improved multimodal understanding. At the same time, safety red-teaming has expanded beyond text to encompass audio-based jailbreak attacks. However, an important bridge between textual and audio jailbreaks remains underexplored. In this work, we study the cross-modality transfer of jailbreak attacks from text to audio, motivated by the semantic similarity between the two modalities and the maturity of textual jailbreak methods. We first analyze the connection between modality alignment and cross-modality jailbreak transfer, showing that strong alignment can inadvertently propagate textual vulnerabilities to the audio modality, which we term the alignment curse. Guided by this analysis, we conduct an empirical evaluation of textual jailbreaks, text-transferred audio jailbreaks, and existing audio-based jailbreaks on recent omni-models. Our results show that text-transferred audio jailbreaks perform comparably to, and often better than, audio-based jailbreaks, establishing them as simple yet powerful baselines for future audio red-teaming. We further demonstrate strong cross-model transferability and show that text-transferred audio attacks remain effective even under a stricter audio-only access threat model.

[445] TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

Zerui Cheng, Jiashuo Liu, Jianzhu Yao, Pramod Viswanath, Ge Zhang, Wenhao Huang

Main category: cs.LG

TL;DR: TabularMath benchmark evaluates tabular models’ ability to perform computational extrapolation beyond statistical interpolation on deterministic problems, revealing gaps between smooth approximations and exact computation.

Details

Motivation: Standard tabular benchmarks focus on statistical interpolation within data manifolds, but many high-value applications (financial modeling, physical simulations) involve deterministic computational processes requiring extrapolation beyond training distributions.

Method: Created TabularMath benchmark with 114 deterministic problems (233,472 rows) from verified programs based on GSM8K and AIME. Evaluated 9 tabular architectures and in-context learning with GPT-OSS-120B using both regression metrics (R²) and exact integer match accuracy.

Result: TabPFN v2.5 achieved R²=0.998 in-distribution and maintained positive R² under distribution shift, but dropped below 10% exact-match accuracy on out-of-distribution data. ICL maintained around 40% exact-match accuracy, showing complementary strengths: TabPFN scales efficiently with data while ICL achieves exact computation from few examples.

Conclusion: Tabular models learn smooth function approximations but struggle with precise computational outputs under extrapolation. The gap between regression metrics and exact-match accuracy reveals limitations in current tabular models for deterministic computational tasks.

Abstract: Standard tabular benchmarks mainly focus on the evaluation of a model’s capability to interpolate values inside a data manifold, where models good at performing local statistical smoothing are rewarded. However, there exists a very large category of high-value tabular data, including financial modeling and physical simulations, which are generated based upon deterministic computational processes, as opposed to stochastic and noisy relationships. Therefore, we investigate if tabular models can provide an extension from statistical interpolation to computational extrapolation. We propose TabularMath, a diagnostic benchmark of 114 deterministic problems (233,472 rows) generated from verified programs based on GSM8K and AIME. We evaluate 9 tabular architectures and in-context learning (ICL) with GPT-OSS-120B. On standard regression metrics, TabPFN v2.5 performs remarkably well, achieving R^2=0.998 in-distribution and maintaining positive R^2 even under distribution shift, which is unique among the tabular models we tested. When we measure rounded consistency (exact integer match), a different picture emerges: TabPFN v2.5 drops below 10% on out-of-distribution data, while ICL maintains around 40%. This gap between R^2 and exact-match accuracy suggests that tabular models learn smooth function approximations but struggle to recover precise computational outputs under extrapolation. The two paradigms appear complementary: TabPFN scales efficiently with data; ICL achieves exact computation from few examples. We release all code and data to support further investigation.

[446] The “Robert Boulton” Singularity: Semantic Tunneling and Manifold Unfolding in Recursive AI

Pengyue Hou

Main category: cs.LG

TL;DR: PPL is unreliable for monitoring AI stability; models can maintain grammatical fluency while catastrophically losing semantic diversity, converging to single narrative attractors. MNCIS framework prevents this by inducing manifold unfolding.

Details

Motivation: Current practice uses Perplexity (PPL) to monitor generative AI stability, but this metric can be deceptive when models maintain grammatical fluency while suffering catastrophic semantic collapse to low-diversity narrative attractors.

Method: Used sliding-window protocol (N=1500) to identify “Semantic Tunneling” failure mode. Applied Multi-Scale Negative Coupled Information Systems (MNCIS) framework with Adaptive Spectral Negative Coupling (ASNC) as topological operator to induce “Manifold Unfolding.”

Result: Baseline model converged to “Robert Boulton” Singularity within 7 generations despite high PPL (~83.9). MNCIS expanded effective rank from 3.62 to 5.35, preventing semantic collapse and preserving long-tail data distribution.

Conclusion: PPL is insufficient for monitoring generative AI stability; semantic diversity collapse requires new metrics. MNCIS framework successfully prevents semantic tunneling by inducing manifold unfolding and preserving semantic diversity.

Abstract: The stability of generative artificial intelligence trained on recursive synthetic data is conventionally monitored via Perplexity (PPL). We demonstrate that PPL is a deceptive metric in context-stabilized regimes (L=128). Using a rigorous sliding-window protocol (N=1500), we identify a novel failure mode termed “Semantic Tunneling.” While the Baseline model maintains high grammatical fluency (PPL approx. 83.9), it suffers a catastrophic loss of semantic diversity, converging within seven generations to a single, low-entropy narrative attractor: the “Robert Boulton” Singularity. This phenomenon represents a total collapse of the latent manifold (Global Effective Rank 3.62 -> 2.22), where the model discards diverse world knowledge to optimize for statistically safe syntactic templates. To address this, we apply the Multi-Scale Negative Coupled Information Systems (MNCIS) framework recently established in Hou (2026) [arXiv:2601.11594]. We demonstrate that Adaptive Spectral Negative Coupling (ASNC) acts as a topological operator that actively induces “Manifold Unfolding.” MNCIS forces the model to expand its effective rank from the anisotropic baseline of 3.62 to a hyper-diverse state of 5.35, effectively constructing an “Artificial Manifold” that resists the gravitational pull of semantic attractors and preserves the long-tail distribution of the training data.

[447] Incident-Guided Spatiotemporal Traffic Forecasting

Lixiang Fan, Bohao Li, Tao Zou, Bowen Du, Junchen Ye

Main category: cs.LG

TL;DR: IGSTGNN: A novel graph neural network framework that incorporates transportation incidents as external disturbances to improve traffic flow prediction accuracy by modeling their spatial and temporal impacts.

Details

Motivation: Existing traffic forecasting methods focus only on historical spatio-temporal patterns but ignore the significant impact of sudden incidents like accidents and adverse weather, which substantially alter temporal patterns and limit prediction accuracy.

Method: Proposes Incident-Guided Spatiotemporal Graph Neural Network (IGSTGNN) with two core components: Incident-Context Spatial Fusion (ICSF) module to capture initial heterogeneous spatial influence of incidents, and Temporal Incident Impact Decay (TIID) module to model dynamic dissipation of incident effects over time.

Result: IGSTGNN achieves state-of-the-art performance on a newly constructed large-scale dataset with time-aligned incident records. The ICSF and TIID modules also demonstrate generalizability when integrated into various existing models.

Conclusion: Explicitly modeling incident impacts through specialized spatial and temporal modules significantly improves traffic forecasting accuracy and provides a valuable framework for handling external disturbances in spatio-temporal prediction tasks.

Abstract: Recent years have witnessed the rapid development of deep-learning-based, graph-neural-network-based forecasting methods for modern intelligent transportation systems. However, most existing work focuses exclusively on capturing spatio-temporal dependencies from historical traffic data, while overlooking the fact that suddenly occurring transportation incidents, such as traffic accidents and adverse weather, serve as external disturbances that can substantially alter temporal patterns. We argue that this issue has become a major obstacle to modeling the dynamics of traffic systems and improving prediction accuracy, but the unpredictability of incidents makes it difficult to observe patterns from historical sequences. To address these challenges, this paper proposes a novel framework named the Incident-Guided Spatiotemporal Graph Neural Network (IGSTGNN). IGSTGNN explicitly models the incident’s impact through two core components: an Incident-Context Spatial Fusion (ICSF) module to capture the initial heterogeneous spatial influence, and a Temporal Incident Impact Decay (TIID) module to model the subsequent dynamic dissipation. To facilitate research on the spatio-temporal impact of incidents on traffic flow, a large-scale dataset is constructed and released, featuring incident records that are time-aligned with traffic time series. On this new benchmark, the proposed IGSTGNN framework is demonstrated to achieve state-of-the-art performance. Furthermore, the generalizability of the ICSF and TIID modules is validated by integrating them into various existing models.

[448] Formulating Reinforcement Learning for Human-Robot Collaboration through Off-Policy Evaluation

Saurav Singh, Rodney Sanchez, Alexander Ororbia, Jamison Heard

Main category: cs.LG

TL;DR: A novel RL framework using off-policy evaluation to select optimal state representations and reward functions from logged data, reducing need for real-time environment interaction in human-robot applications.

Details

Motivation: Traditional RL requires extensive human expertise and real-time environment interaction for defining state representations and reward functions, which is costly and impractical for complex, safety-critical human-robot interaction applications.

Method: Proposes an RL framework that leverages off-policy evaluation (OPE) to systematically evaluate multiple candidate state representations and reward functions using only logged interaction data, training offline RL agents and applying OPE to estimate policy performance.

Result: Validated on two environments: OpenAI Gym’s Lunar Lander (controlled setting) and NASA-MATB-II human subjects study (real-world human-robot teaming), demonstrating feasibility for automating RL design decisions.

Conclusion: The approach enhances feasibility and scalability of offline RL for real-world environments by automating critical RL design decisions through data-driven OPE-based evaluation, enabling more reliable RL formulation for complex human-robot interaction.

Abstract: Reinforcement learning (RL) has the potential to transform real-world decision-making systems by enabling autonomous agents to learn from experience. Deploying RL in real-world settings, especially in the context of human-robot interaction, requires defining state representations and reward functions, which are critical for learning efficiency and policy performance. Traditional RL approaches often rely on domain expertise and trial-and-error, necessitating extensive human involvement as well as direct interaction with the environment, which can be costly and impractical, especially in complex and safety-critical applications. This work proposes a novel RL framework that leverages off-policy evaluation (OPE) for state space and reward function selection, using only logged interaction data. This approach eliminates the need for real-time access to the environment or human-in-the-loop feedback, greatly reducing the dependency on costly real-time interactions. The proposed approach systematically evaluates multiple candidate state representations and reward functions by training offline RL agents and applying OPE to estimate policy performance. The optimal state space and reward function are selected based on their ability to produce high-performing policies under OPE metrics. Our method is validated on two environments: the Lunar Lander environment by OpenAI Gym, which provides a controlled setting for assessing state space and reward function selection, and a NASA-MATB-II human subjects study environment, which evaluates the approach’s real-world applicability to human-robot teaming scenarios. This work enhances the feasibility and scalability of offline RL for real-world environments by automating critical RL design decisions through a data-driven OPE-based evaluation, enabling more reliable, effective, and sustainable RL formulation for complex human-robot interaction settings.

[449] Hypersonic Flow Control: Generalized Deep Reinforcement Learning for Hypersonic Intake Unstart Control under Uncertainty

Trishit Mondal, Ameya D. Jagtap

Main category: cs.LG

TL;DR: Deep reinforcement learning controller stabilizes hypersonic inlet unstart at Mach 5 using active flow control with strong generalization to unseen conditions.

Details

Motivation: Hypersonic unstart at Mach 5+ poses major challenges for air-breathing propulsion due to shock-boundary-layer interactions and pressure fluctuations that destabilize inlet operation.

Method: Deep reinforcement learning-based active flow control strategy using high-fidelity CFD simulations with adaptive mesh refinement to resolve shock motion, boundary-layer dynamics, and flow separation for learning physically consistent control policies.

Result: DRL controller robustly stabilizes inlet over wide back pressure ranges, generalizes to unseen scenarios (different back-pressure levels, Reynolds numbers, sensor configurations), operates with noisy measurements, and achieves comparable performance with minimal sensor sets.

Conclusion: Establishes data-driven approach for real-time hypersonic flow control under realistic operational uncertainties with strong zero-shot generalization capabilities.

Abstract: The hypersonic unstart phenomenon poses a major challenge to reliable air-breathing propulsion at Mach 5 and above, where strong shock-boundary-layer interactions and rapid pressure fluctuations can destabilize inlet operation. Here, we demonstrate a deep reinforcement learning (DRL)- based active flow control strategy to control unstart in a canonical two-dimensional hypersonic inlet at Mach 5 and Reynolds number $5\times 10^6$. The in-house CFD solver enables high-fidelity simulations with adaptive mesh refinement, resolving key flow features, including shock motion, boundary-layer dynamics, and flow separation, that are essential for learning physically consistent control policies suitable for real-time deployment. The DRL controller robustly stabilizes the inlet over a wide range of back pressures representative of varying combustion chamber conditions. It further generalizes to previously unseen scenarios, including different back-pressure levels, Reynolds numbers, and sensor configurations, while operating with noisy measurements, thereby demonstrating strong zero-shot generalization. Control remains robust in the presence of noisy sensor measurements, and a minimal, optimally selected sensor set achieves comparable performance, enabling practical implementation. These results establish a data-driven approach for real-time hypersonic flow control under realistic operational uncertainties.

[450] CADENT: Gated Hybrid Distillation for Sample-Efficient Transfer in Reinforcement Learning

Mahyar Alinejad, Yue Wang, George Atia

Main category: cs.LG

TL;DR: CADENT framework combines strategic automaton knowledge with tactical policy guidance using experience-gated trust mechanism for adaptive RL transfer learning.

Details

Motivation: Existing transfer learning methods struggle with domain shift between source and target environments. Policy distillation lacks long-term strategic knowledge, while automaton-based methods lack fine-grained action guidance.

Method: CADENT unifies strategic automaton-based knowledge with tactical policy-level knowledge using an experience-gated trust mechanism that dynamically weighs teacher guidance against student’s own experience at state-action level.

Result: Achieves 40-60% better sample efficiency than baselines across challenging environments (sparse-reward grid worlds to continuous control tasks) while maintaining superior asymptotic performance.

Conclusion: CADENT establishes a robust approach for adaptive knowledge transfer in RL by combining strategic and tactical knowledge with dynamic trust mechanisms.

Abstract: Transfer learning promises to reduce the high sample complexity of deep reinforcement learning (RL), yet existing methods struggle with domain shift between source and target environments. Policy distillation provides powerful tactical guidance but fails to transfer long-term strategic knowledge, while automaton-based methods capture task structure but lack fine-grained action guidance. This paper introduces Context-Aware Distillation with Experience-gated Transfer (CADENT), a framework that unifies strategic automaton-based knowledge with tactical policy-level knowledge into a coherent guidance signal. CADENT’s key innovation is an experience-gated trust mechanism that dynamically weighs teacher guidance against the student’s own experience at the state-action level, enabling graceful adaptation to target domain specifics. Across challenging environments, from sparse-reward grid worlds to continuous control tasks, CADENT achieves 40-60% better sample efficiency than baselines while maintaining superior asymptotic performance, establishing a robust approach for adaptive knowledge transfer in RL.

[451] Enhancing Psychologists’ Understanding through Explainable Deep Learning Framework for ADHD Diagnosis

Abdul Rehman, Ilona Heldal, Jerry Chun-Wei Lin

Main category: cs.LG

TL;DR: Proposes an explainable hybrid DNN-RNN model for ADHD detection and multi-class categorization with interpretability features using SHAP and PFI.

Details

Motivation: ADHD diagnosis is challenging and requires reliable, transparent identification methods. Current approaches need better interpretability for psychologists to trust AI-assisted diagnosis.

Method: HyExDNN-RNN model combining DNN and RNN architectures with Pearson correlation for feature selection, SHAP and PFI for explainability, and standardized techniques for feature reduction and model interpretation.

Result: Achieved 99% F1 score on binary classification and 94.2% on multi-class categorization. XAI approaches provided important insights into feature importance and model decision logic.

Conclusion: The framework demonstrates potential for assisting ADHD diagnosis with interpretability, bridging computational techniques with psychological applications.

Abstract: Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder that is challenging to diagnose and requires advanced approaches for reliable and transparent identification and classification. It is characterized by a pattern of inattention, hyperactivity and impulsivity that is more severe and more frequent than in individuals with a comparable level of development. In this paper, an explainable framework based on a fine-tuned hybrid Deep Neural Network (DNN) and Recurrent Neural Network (RNN) called HyExDNN-RNN model is proposed for ADHD detection, multi-class categorization, and decision interpretation. This framework not only detects ADHD, but also provides interpretable insights into the diagnostic process so that psychologists can better understand and trust the results of the diagnosis. We use the Pearson correlation coefficient for optimal feature selection and machine and deep learning models for experimental analysis and comparison. We use a standardized technique for feature reduction, model selection and interpretation to accurately determine the diagnosis rate and ensure the interpretability of the proposed framework. Our framework provided excellent results on binary classification, with HyExDNN-RNN achieving an F1 score of 99% and 94.2% on multi-class categorization. XAI approaches, in particular SHapley Additive exPlanations (SHAP) and Permutation Feature Importance (PFI), provided important insights into the importance of features and the decision logic of models. By combining AI with human expertise, we aim to bridge the gap between advanced computational techniques and practical psychological applications. These results demonstrate the potential of our framework to assist in ADHD diagnosis and interpretation.

[452] From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation

Tianle Gu, Kexin Huang, Lingyu Li, Ruilin Luo, Shiyang Huang, Zongqi Wang, Yujiu Yang, Yan Teng, Yingchun Wang

Main category: cs.LG

TL;DR: UniMod introduces a novel multimodal safety moderation framework that replaces binary classification with dense reasoning traces, using structured trajectories and multi-dimensional boundary learning to prevent shortcut learning.

Details

Motivation: Multimodal safety moderation suffers from data and supervision sparsity, leading to shortcut learning where models rely on superficial features rather than understanding intrinsic safety semantics.

Method: Proposes UniMod with structured reasoning trajectories (evidence grounding, modality assessment, risk mapping, policy decision, response generation) and UniRM multi-head scalar reward model for multi-dimensional supervision. Uses specialized optimization strategies to decouple task parameters.

Result: Achieves competitive textual moderation performance and sets new multimodal benchmark using <40% of training data compared to leading baselines. Ablations validate multi-attribute trajectory reasoning effectiveness.

Conclusion: UniMod provides an effective and efficient framework for multimodal moderation by forcing explicit safety semantics grounding and preventing shortcut learning through dense reasoning traces.

Abstract: Safety moderation is pivotal for identifying harmful content. Despite the success of textual safety moderation, its multimodal counterparts remain hindered by a dual sparsity of data and supervision. Conventional reliance on binary labels lead to shortcut learning, which obscures the intrinsic classification boundaries necessary for effective multimodal discrimination. Hence, we propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. This approach forces the model to ground its decision in explicit safety semantics, preventing the model from converging on superficial shortcuts. To facilitate this paradigm, we develop a multi-head scalar reward model (UniRM). UniRM provides multi-dimensional supervision by assigning attribute-level scores to the response generation stage. Furthermore, we introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning. Empirical results show UniMod achieves competitive textual moderation performance and sets a new multimodal benchmark using less than 40% of the training data used by leading baselines. Ablations further validate our multi-attribute trajectory reasoning, offering an effective and efficient framework for multimodal moderation. Supplementary materials are available at \href{https://trustworthylab.github.io/UniMod/}{project website}.

[453] Enhancing Post-Training Quantization via Future Activation Awareness

Zheqi Lv, Zhenxuan Fan, Qi Tian, Wenqiao Zhang, Yueting Zhuang

Main category: cs.LG

TL;DR: FAQ (Future-Aware Quantization) improves LLM compression by using future-layer activations to guide quantization, reducing bias and error accumulation without extra computational overhead.

Details

Motivation: Standard PTQ suffers from quantization bias and error accumulation, especially with biased calibration data, leading to suboptimal and unstable quantization results.

Method: Proposes FAQ which leverages future-layer activations to guide quantization, uses window-wise preview to aggregate multiple future layers, and employs pre-searched configurations to avoid expensive search.

Result: FAQ consistently outperforms prior methods with negligible extra cost, requires no backward passes, data reconstruction, or tuning, making it suitable for edge deployment.

Conclusion: Future-aware quantization effectively addresses PTQ limitations by using future information to guide current-layer quantization, improving stability and performance without significant overhead.

Abstract: Post-training quantization (PTQ) is a widely used method to compress large language models (LLMs) without fine-tuning. It typically sets quantization hyperparameters (e.g., scaling factors) based on current-layer activations. Although this method is efficient, it suffers from quantization bias and error accumulation, resulting in suboptimal and unstable quantization, especially when the calibration data is biased. To overcome these issues, we propose Future-Aware Quantization (FAQ), which leverages future-layer activations to guide quantization. This allows better identification and preservation of important weights, while reducing sensitivity to calibration noise. We further introduce a window-wise preview mechanism to softly aggregate multiple future-layer activations, mitigating over-reliance on any single layer. To avoid expensive greedy search, we use a pre-searched configuration to minimize overhead. Experiments show that FAQ consistently outperforms prior methods with negligible extra cost, requiring no backward passes, data reconstruction, or tuning, making it well-suited for edge deployment.

[454] How Much Information Can a Vision Token Hold? A Scaling Law for Recognition Limits in VLMs

Shuxin Zhuang, Zi Liang, Runsheng Yu, Hongzong Li, Rong Feng, Shiqin Tang, Youzhi Zhang

Main category: cs.LG

TL;DR: The paper investigates the information capacity limits of visual tokens in vision-centric models, identifying phase transitions in performance as information density increases and proposing a probabilistic scaling law for vision token compression.

Details

Motivation: To understand the fundamental information upper bound of visual tokens in vision-centric models, as current approaches treat vision encoders as lossy channels with finite representational capacity but lack understanding of their limits.

Method: Conducted controlled stress tests by progressively increasing information quantity (character count) within images, analyzed phase-transition phenomena, identified key factors, and formulated a probabilistic scaling law unifying average vision token load and visual density.

Result: Discovered three distinct regimes: Stable Phase (near-perfect), Instability Phase (increased error variance), and Collapse Phase (total failure). The scaling law demonstrates universality across various Vision-Language Models.

Conclusion: Provides critical empirical guidance for optimizing efficiency-accuracy trade-off in visual context compression and establishes fundamental understanding of vision token capacity limits.

Abstract: Recent vision-centric approaches have made significant strides in long-context modeling. Represented by DeepSeek-OCR, these models encode rendered text into continuous vision tokens, achieving high compression rates without sacrificing recognition precision. However, viewing the vision encoder as a lossy channel with finite representational capacity raises a fundamental question: what is the information upper bound of visual tokens? To investigate this limit, we conduct controlled stress tests by progressively increasing the information quantity (character count) within an image. We observe a distinct phase-transition phenomenon characterized by three regimes: a near-perfect Stable Phase, an Instability Phase marked by increased error variance, and a total Collapse Phase. We analyze the mechanical origins of these transitions and identify key factors. Furthermore, we formulate a probabilistic scaling law that unifies average vision token load and visual density into a latent difficulty metric. Extensive experiments across various Vision-Language Models demonstrate the universality of this scaling law, providing critical empirical guidance for optimizing the efficiency-accuracy trade-off in visual context compression.

[455] Auto-Augmentation Contrastive Learning for Wearable-based Human Activity Recognition

Qingyu Wu, Jianfei Shen, Feiyi Fan, Yang Gu, Chenyang Xu, Yiqiang Chen

Main category: cs.LG

TL;DR: AutoCL: End-to-end auto-augmentation contrastive learning for wearable-based human activity recognition that learns augmentation strategies automatically rather than relying on manual design.

Details

Motivation: Contrastive learning for low-semantic sensor signals in HAR relies heavily on manual data augmentation strategies, which lack generalizability and flexibility. There's a need to reduce the augmentation burden and automate this process.

Method: Proposes AutoCL with Siamese network architecture sharing backbone parameters, embedding a generator to learn auto-augmentation. Uses latent space representations to train generator, overcoming noise and redundancy in raw sensor data. Includes stop-gradient design and correlation reduction strategy to enhance encoder representation learning.

Result: Extensive experiments on four widely-used HAR datasets demonstrate that AutoCL significantly improves recognition accuracy compared with other state-of-the-art methods.

Conclusion: AutoCL effectively automates augmentation in contrastive learning for HAR, reducing manual effort while improving performance through learned augmentation strategies and enhanced representation learning techniques.

Abstract: For low-semantic sensor signals from human activity recognition (HAR), contrastive learning (CL) is essential to implement novel applications or generic models without manual annotation, which is a high-performance self-supervised learning (SSL) method. However, CL relies heavily on data augmentation for pairwise comparisons. Especially for low semantic data in the HAR area, conducting good performance augmentation strategies in pretext tasks still rely on manual attempts lacking generalizability and flexibility. To reduce the augmentation burden, we propose an end-to-end auto-augmentation contrastive learning (AutoCL) method for wearable-based HAR. AutoCL is based on a Siamese network architecture that shares the parameters of the backbone and with a generator embedded to learn auto-augmentation. AutoCL trains the generator based on the representation in the latent space to overcome the disturbances caused by noise and redundant information in raw sensor data. The architecture empirical study indicates the effectiveness of this design. Furthermore, we propose a stop-gradient design and correlation reduction strategy in AutoCL to enhance encoder representation learning. Extensive experiments based on four wide-used HAR datasets demonstrate that the proposed AutoCL method significantly improves recognition accuracy compared with other SOTA methods.

[456] Toward Ultra-Long-Horizon Sequential Model Editing

Mingda Liu, Zhenghan Zhu, Ze’an Miao, Katsuki Fujisawa

Main category: cs.LG

TL;DR: Proposes Norm-Anchor Scaling (NAS) to prevent model collapse in sequential model editing by controlling explosive growth of MLP weight norms

Details

Motivation: Existing Locate-and-Edit (L&E) methods for model editing suffer from abrupt model collapse beyond a critical number of sequential edits, limiting their practical utility

Method: Identifies correlation between collapse and explosive MLP weight norm growth, proves L&E updates cause exponential norm growth, and proposes NAS - a plug-and-play norm-constrained strategy that scales edited weights to maintain norm stability

Result: NAS delays collapse point by >4x, yields 72.2% average relative gain in editing performance, requires minimal code changes and negligible computational overhead

Conclusion: Norm control is crucial for stable sequential model editing, and NAS provides an effective, lightweight solution to prevent model collapse in L&E frameworks

Abstract: Model editing has emerged as a practical approach for mitigating factual errors and outdated knowledge in large language models (LLMs). Among existing methods, the Locate-and-Edit (L&E) paradigm is the dominant framework: it locates MLP parameters implicated in expressing a target fact, and then performs a localized update to rewrite that fact. However, long sequences of edits often trigger abrupt model collapse in L&E beyond a critical point. We empirically identify a strong correlation between collapse and explosive growth of edited MLP weight norms, and formally prove that commonly used L&E update rules can induce exponential norm growth across sequential edits in the absence of explicit norm control. To address this issue, we propose Norm-Anchor Scaling NAS, a plug-and-play norm-constrained strategy. Across extensive experiments, NAS delays the collapse point of representative L&E algorithms by more than 4 times and yields a 72.2% average relative gain in editing performance, requiring only a single additional line of code and incurring negligible computational overhead.

[457] SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao

Main category: cs.LG

TL;DR: SPA-Cache improves DLM efficiency through low-dimensional update identification and adaptive budget allocation, achieving 8× throughput over vanilla decoding.

Details

Motivation: Diffusion Language Models (DLMs) lack standard KV caching due to their non-causal nature, forcing costly hidden state recomputation at every decoding step. Existing caching approaches have limitations: (1) expensive token-wise update identification heuristics, and (2) rigid uniform budget allocation that ignores heterogeneous hidden state dynamics.

Method: SPA-Cache jointly optimizes update identification and budget allocation. First, it uses a low-dimensional singular proxy to identify update-critical tokens in a subspace, reducing identification overhead. Second, it introduces an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality.

Result: The method achieves up to 8× throughput improvement over vanilla decoding and 2-4× speedup over existing caching baselines.

Conclusion: SPA-Cache significantly improves DLM efficiency through optimized caching strategies, addressing key limitations of existing approaches for non-causal language models.

Abstract: While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA-Cache that jointly optimizes update identification and budget allocation in DLM cache. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an $8\times$ throughput improvement over vanilla decoding and a $2$–$4\times$ speedup over existing caching baselines.

[458] Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization

Dayu Wang, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li

Main category: cs.LG

TL;DR: MRPO is a geometric framework that reshapes LLM inference space through spectral orthogonal exploration and effective rank regularization to expand reasoning capabilities beyond pre-trained bias manifolds.

Details

Motivation: Recent studies question whether RLVR genuinely expands LLM reasoning capacity or merely aligns existing latent capabilities within pre-trained bias manifolds. The authors challenge the accessibility boundary hypothesis and aim to fundamentally expand the latent reasoning space through geometric interventions.

Method: Manifold-Reshaping Policy Optimization (MRPO) operates in two stages: 1) Spectral Orthogonal Exploration (SOE) ejects policy initialization into the null space of the bias manifold, and 2) Effective Rank regularization is integrated into policy optimization to incentivize discovery and maintenance of high-dimensional reasoning trajectories against entropy-reducing tendencies of standard RL.

Result: The 4B-parameter method achieves state-of-the-art performance on mathematical tasks, significantly outperforming larger models (e.g., Qwen3-32B) and expanding the capability boundary beyond standard GRPO.

Conclusion: The latent reasoning space of LLMs can be fundamentally expanded through targeted geometric interventions, challenging the accessibility boundary hypothesis and demonstrating that RL can genuinely expand reasoning capacity beyond pre-trained bias manifolds.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). However, recent studies question whether RL genuinely expands reasoning capacity or merely aligns existing latent capabilities, arguing that exploration remains confined within the pre-trained model’s low-rank bias manifold. In this work, we challenge this accessibility boundary hypothesis by demonstrating that the latent reasoning space can be fundamentally expanded through targeted geometric interventions. We propose Manifold-Reshaping Policy Optimization (MRPO), a geometric framework designed to fundamentally restructure the inference space of LLMs. MRPO operates in two stages: first, we employ Spectral Orthogonal Exploration (SOE) to eject the policy initialization into the null space of the bias manifold; second, we integrate an Effective Rank regularization term into the policy optimization objective. This approach incentivizes the discovery and maintenance of high-dimensional reasoning trajectories against the entropy-reducing tendency of standard RL. Empirically, our 4B-parameter method achieves state-of-the-art performance on mathematical tasks, significantly outperforming larger models (e.g., Qwen3-32B) and expanding the capability boundary beyond standard GRPO. Our code is available at https://anonymous.4open.science/r/MRPO-D57B/

[459] D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

Main category: cs.LG

TL;DR: D²Quant is a weight-only post-training quantization framework for LLMs that addresses accuracy degradation at sub-4-bit precision through dual-scale quantization for down-projection matrices and deviation-aware correction for activation shifts.

Details

Motivation: LLMs have high compute and memory costs making deployment difficult in resource-constrained scenarios. Weight-only PTQ is appealing but suffers from significant accuracy degradation at sub-4-bit precision due to two main issues: down-projection matrices being quantization bottlenecks, and weight quantization inducing activation deviations without effective correction strategies.

Method: Proposes D²Quant framework with two key components: 1) Dual-Scale Quantizer (DSQ) tailored to down-projection matrices with absorbable scaling factor that improves accuracy without increasing bit budget, and 2) Deviation-Aware Correction (DAC) that incorporates mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts.

Result: Extensive experiments across multiple LLM families and evaluation metrics show that D²Quant delivers superior performance for weight-only PTQ at sub-4-bit precision compared to existing methods.

Conclusion: D²Quant effectively addresses key challenges in weight-only PTQ for LLMs at sub-4-bit precision through innovative weight quantization and activation correction techniques, enabling more efficient deployment in resource-constrained scenarios.

Abstract: Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision, and our analysis identifies two main causes: (1) down-projection matrices are a well-known quantization bottleneck, but maintaining their fidelity often requires extra bit-width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual-Scale Quantizer (DSQ) tailored to down-projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation-Aware Correction (DAC), which incorporates a mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D$^2$Quant delivers superior performance for weight-only PTQ at sub-4-bit precision. The code and models will be available at https://github.com/XIANGLONGYAN/D2Quant.

[460] naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

Hankyeol Kim, Pilsung Kang

Main category: cs.LG

TL;DR: naPINN is a robust Physics-Informed Neural Network that adaptively filters corrupted measurements and outliers using an energy-based model and trainable reliability gate, outperforming existing methods on PDEs with non-Gaussian noise and outliers.

Details

Motivation: Standard PINNs degrade significantly under complex measurement noise and gross outliers, limiting their practical application to real-world observational data with corruption.

Method: naPINN embeds an energy-based model to learn latent distribution of prediction residuals, uses a trainable reliability gate to adaptively filter high-energy data points, and includes rejection cost regularization to prevent trivial solutions.

Result: naPINN significantly outperforms existing robust PINN baselines on various benchmark PDEs corrupted by non-Gaussian noise and varying outlier rates, successfully isolating outliers and accurately reconstructing dynamics under severe data corruption.

Conclusion: naPINN provides a robust framework for solving inverse problems and discovering governing equations from corrupted observational data without prior knowledge of noise distribution.

Abstract: Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.

[461] HyPAC: Cost-Efficient LLMs-Human Hybrid Annotation with PAC Error Guarantees

Hao Zeng, Huipeng Huang, Xinhao Qu, Jianguo Huang, Bingyi Jing, Hongxin Wei

Main category: cs.LG

TL;DR: HyPAC is a method for routing data annotation tasks to the most cost-efficient source (LLMs, reasoning models, human experts) while providing statistical guarantees on annotation error.

Details

Motivation: Data annotation involves multiple sources with different cost-quality trade-offs, but existing methods lack formal guarantees on annotation error while minimizing costs.

Method: HyPAC uses importance sampling and upper confidence bounds to calibrate decision thresholds, partitioning inputs into three uncertainty regions and routing each to appropriate annotation sources.

Result: Experiments show HyPAC reduces annotation cost by 78.51% while tightly controlling annotation error, achieving minimum expected cost with PAC guarantees.

Conclusion: HyPAC provides a principled approach for cost-efficient data annotation with formal error guarantees, applicable to various annotation sources including LLMs and human experts.

Abstract: Data annotation often involves multiple sources with different cost-quality trade-offs, such as fast large language models (LLMs), slow reasoning models, and human experts. In this work, we study the problem of routing inputs to the most cost-efficient annotation source while controlling the labeling error on test instances. We propose \textbf{HyPAC}, a method that adaptively labels inputs to the most cost-efficient annotation source while providing distribution-free guarantees on annotation error. HyPAC calibrates two decision thresholds using importance sampling and upper confidence bounds, partitioning inputs into three regions based on uncertainty and routing each to the appropriate annotation source. We prove that HyPAC achieves the minimum expected cost with a probably approximately correct (PAC) guarantee on the annotation error, free of data distribution and pre-trained models. Experiments on common benchmarks demonstrate the effectiveness of our method, reducing the annotation cost by 78.51% while tightly controlling the annotation error.

[462] EEO-TFV: Escape-Explore Optimizer for Web-Scale Time-Series Forecasting and Vision Analysis

Hua Wang, Jinghao Lu, Fan Zhang

Main category: cs.LG

TL;DR: Lightweight Transformer with novel Escape-Explore Optimizer (EEO) addresses error accumulation in long-sequence prediction and vulnerability to out-of-distribution samples, achieving state-of-the-art performance across time-series and medical image segmentation tasks.

Details

Motivation: Transformer models suffer from error accumulation in multivariate long-sequence prediction and vulnerability to out-of-distribution samples in image tasks, with these issues exacerbated in large-scale Web data analysis involving complex temporal patterns and multimodal features.

Method: Proposes a lightweight Transformer architecture combined with a novel Escape-Explore Optimizer (EEO) that enhances exploration and generalization while avoiding sharp minima and saddle-point traps in high-dimensional parameter spaces.

Result: Achieves performance on par with state-of-the-art models across 11 time-series benchmark datasets and the Synapse medical image segmentation task, demonstrating superior generalization and stability.

Conclusion: The method shows potential as a versatile cross-task foundation model for Web-scale data mining and analysis, effectively addressing optimization challenges in multimodal and temporal data processing.

Abstract: Transformer-based foundation models have achieved remarkable progress in tasks such as time-series forecasting and image segmentation. However, they frequently suffer from error accumulation in multivariate long-sequence prediction and exhibit vulnerability to out-of-distribution samples in image-related tasks. Furthermore, these challenges become particularly pronounced in large-scale Web data analysis tasks, which typically involve complex temporal patterns and multimodal features. This complexity substantially increases optimization difficulty, rendering models prone to stagnation at saddle points within high-dimensional parameter spaces. To address these issues, we propose a lightweight Transformer architecture in conjunction with a novel Escape-Explore Optimizer (EEO). The optimizer enhances both exploration and generalization while effectively avoiding sharp minima and saddle-point traps. Experimental results show that, in representative Web data scenarios, our method achieves performance on par with state-of-the-art models across 11 time-series benchmark datasets and the Synapse medical image segmentation task. Moreover, it demonstrates superior generalization and stability, thereby validating its potential as a versatile cross-task foundation model for Web-scale data mining and analysis.

[463] BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

Jingwen Xu, Yiyang Lu, Zisu Huang, Changze Lv, Xiaohua Wang, Shizheng Li, Zhibo Xu, Zhengkang Guo, Zhengyuan Wang, Muzhao Tian, Xuanjing Huang, Xiaoqing Zheng

Main category: cs.LG

TL;DR: BatCoder: Self-supervised RL framework for joint optimization of code generation and documentation using back-translation with only code as input.

Details

Motivation: High-quality code-documentation pairs are costly and scarce, especially for niche programming languages, limiting LLM training for code-related tasks.

Method: Uses back-translation strategy: generate documentation from code, then reconstruct original code from documentation. Semantic similarity between original and reconstructed code serves as implicit reward for reinforcement learning.

Result: Achieved 83.5% on HumanEval and 81.0% on MBPP pass@1 with 7B model, outperforming open-source baselines. Shows consistent scaling with training corpus size and model capacity.

Conclusion: BatCoder enables training with only code, substantially increasing available training examples and improving performance on code generation and documentation tasks.

Abstract: Training LLMs for code-related tasks typically depends on high-quality code-documentation pairs, which are costly to curate and often scarce for niche programming languages. We introduce BatCoder, a self-supervised reinforcement learning framework designed to jointly optimize code generation and documentation production. BatCoder employs a back-translation strategy: a documentation is first generated from code, and then the generated documentation is used to reconstruct the original code. The semantic similarity between the original and reconstructed code serves as an implicit reward, enabling reinforcement learning to improve the model’s performance both in generating code from documentation and vice versa. This approach allows models to be trained using only code, substantially increasing the available training examples. Evaluated on HumanEval and MBPP with a 7B model, BatCoder achieved 83.5% and 81.0% pass@1, outperforming strong open-source baselines. Moreover, the framework demonstrates consistent scaling with respect to both training corpus size and model capacity.

[464] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards

Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen

Main category: cs.LG

TL;DR: PSN-RLVR introduces parameter-space noise to improve exploration in reinforcement learning for LLM reasoning, addressing limitations of existing RLVR methods that mainly reweight existing solutions rather than discovering new strategies.

Details

Motivation: Current RLVR methods have an exploration ceiling - they primarily reweight existing solution traces rather than discovering new reasoning strategies, limiting performance gains especially under large sampling budgets like pass-at-256.

Method: PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that preserves chain-of-thought coherence. It uses truncated importance sampling to mitigate sampling-update mismatch and a computationally efficient adaptive noise scheduler driven by semantic diversity and normalized self-certainty metrics.

Result: PSN-GRPO consistently expands reasoning capability boundaries across multiple mathematical reasoning benchmarks and model families, achieving higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods.

Conclusion: Parameter-space noise with adaptive scheduling effectively addresses exploration limitations in RLVR for LLM reasoning, providing orthogonal improvements that can be composed with other methods for additional gains.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.

[465] Beyond Experience Retrieval: Learning to Generate Utility-Optimized Structured Experience for Frozen LLMs

Xuancheng Li, Haitao Li, Yujia Zhou, Yiqun Liu, Qingyao Ai

Main category: cs.LG

TL;DR: SEAM is a lightweight plug-in module that stores experiences in its parameters to guide frozen LLMs, enabling experience reuse without external retrieval for improved reasoning performance.

Details

Motivation: Current LLMs are static and repeat mistakes, while existing experience reuse methods rely on external retrieval which introduces noise and latency. The paper aims to create a more efficient, integrated experience reuse mechanism.

Method: SEAM (Structured Experience Adapter Module) is a lightweight plug-in that stores experience in its parameters and generates structured, instance-tailored experience entries in a single forward pass to guide a frozen LLM executor. It’s trained via executor rollouts and GRPO while keeping the executor frozen, and can be improved after deployment with supervised fine-tuning on logged successful trajectories.

Result: Experiments on mathematical reasoning benchmarks show consistent accuracy gains across executors with low overhead. Extensive ablations and analyses elucidate the mechanisms underlying SEAM’s effectiveness and robustness.

Conclusion: SEAM provides an effective, efficient approach to experience reuse for LLMs that avoids the limitations of external retrieval systems while maintaining low computational overhead.

Abstract: Large language models (LLMs) are largely static and often redo reasoning or repeat mistakes. Prior experience reuse typically relies on external retrieval, which is similarity-based, can introduce noise, and adds latency. We introduce SEAM (Structured Experience Adapter Module), a lightweight, executor-specific plug-in that stores experience in its parameters and generates a structured, instance-tailored experience entry in a single forward pass to guide a frozen LLM executor. SEAM is trained for utility via executor rollouts and GRPO while keeping the executor frozen, and it can be further improved after deployment with supervised fine-tuning on logged successful trajectories. Experiments on mathematical reasoning benchmarks show consistent accuracy gains across executors with low overhead. Extensive ablations and analyses further elucidate the mechanisms underlying SEAM’s effectiveness and robustness.

[466] PA-MIL: Phenotype-Aware Multiple Instance Learning Guided by Language Prompting and Genotype-to-Phenotype Relationships

Zekang Yang, Hong Liu, Xiangdong Wang

Main category: cs.LG

TL;DR: PA-MIL is an ante-hoc interpretable framework for pathology whole-slide image analysis that identifies cancer-related phenotypes using a phenotype knowledge base and language prompting, achieving competitive performance with improved interpretability.

Details

Motivation: Most existing deep learning methods for pathology WSIs only provide post-hoc interpretability through saliency maps, lacking reliable and accountable explanations. There's a need for ante-hoc interpretable frameworks that can identify meaningful cancer phenotypes and provide transparent reasoning.

Method: Proposes Phenotype-Aware Multiple Instance Learning (PA-MIL) with three key components: 1) constructing a phenotype knowledge base linking cancer phenotypes to genotypes, 2) using morphological descriptions as language prompts to aggregate phenotype-related features, and 3) developing a Genotype-to-Phenotype Neural Network (GP-NN) that provides multi-level guidance based on genotype-phenotype relationships.

Result: PA-MIL achieves competitive performance compared to existing MIL methods on multiple datasets while offering improved interpretability. It uses phenotype saliency as evidence and achieves competitive results with a linear classifier compared to state-of-the-art methods. The framework enables thorough analysis of genotype-phenotype relationships and provides cohort-level and case-level interpretability.

Conclusion: PA-MIL provides a reliable and accountable ante-hoc interpretable framework for cancer subtyping from pathology WSIs by leveraging phenotype knowledge and language prompting, demonstrating both competitive performance and transparent reasoning capabilities.

Abstract: Deep learning has been extensively researched in the analysis of pathology whole-slide images (WSIs). However, most existing methods are limited to providing prediction interpretability by locating the model’s salient areas in a post-hoc manner, failing to offer more reliable and accountable explanations. In this work, we propose Phenotype-Aware Multiple Instance Learning (PA-MIL), a novel ante-hoc interpretable framework that identifies cancer-related phenotypes from WSIs and utilizes them for cancer subtyping. To facilitate PA-MIL in learning phenotype-aware features, we 1) construct a phenotype knowledge base containing cancer-related phenotypes and their associated genotypes. 2) utilize the morphological descriptions of phenotypes as language prompting to aggregate phenotype-related features. 3) devise the Genotype-to-Phenotype Neural Network (GP-NN) grounded in genotype-to-phenotype relationships, which provides multi-level guidance for PA-MIL. Experimental results on multiple datasets demonstrate that PA-MIL achieves competitive performance compared to existing MIL methods while offering improved interpretability. PA-MIL leverages phenotype saliency as evidence and, using a linear classifier, achieves competitive results compared to state-of-the-art methods. Additionally, we thoroughly analyze the genotype-phenotype relationships, as well as cohort-level and case-level interpretability, demonstrating the reliability and accountability of PA-MIL.

[467] Auditing Sybil: Explaining Deep Lung Cancer Risk Prediction Through Generative Interventional Attributions

Bartlomiej Sobieski, Jakub Grzywaczewski, Karol Dobiczek, Mateusz Wójcik, Tomasz Bartczak, Patryk Szatkowski, Przemysław Bombiński, Matthew Tivnan, Przemyslaw Biecek

Main category: cs.LG

TL;DR: S(H)NAP is a model-agnostic auditing framework that uses 3D diffusion models to causally verify lung cancer screening AI (Sybil) by systematically modifying anatomical features and measuring their impact on risk predictions.

Details

Motivation: Current AI assessments for lung cancer screening rely on correlation-based observational metrics, overlooking actual reasoning mechanisms. There's a need for causal verification to ensure robust decision-making before clinical deployment of models like Sybil.

Method: Proposes S(H)NAP framework that constructs generative interventional attributions using realistic 3D diffusion bridge modeling to systematically modify anatomical features and isolate object-specific causal contributions to risk scores.

Result: First interventional audit of Sybil shows the model often exhibits expert-like behavior in differentiating malignant from benign nodules, but suffers from critical failure modes including dangerous sensitivity to clinically unjustified artifacts and distinct radial bias.

Conclusion: Causal verification through generative interventions reveals both strengths and critical weaknesses in AI models for medical imaging, highlighting the importance of moving beyond correlation-based assessments for clinical deployment.

Abstract: Lung cancer remains the leading cause of cancer mortality, driving the development of automated screening tools to alleviate radiologist workload. Standing at the frontier of this effort is Sybil, a deep learning model capable of predicting future risk solely from computed tomography (CT) with high precision. However, despite extensive clinical validation, current assessments rely purely on observational metrics. This correlation-based approach overlooks the model’s actual reasoning mechanism, necessitating a shift to causal verification to ensure robust decision-making before clinical deployment. We propose S(H)NAP, a model-agnostic auditing framework that constructs generative interventional attributions validated by expert radiologists. By leveraging realistic 3D diffusion bridge modeling to systematically modify anatomical features, our approach isolates object-specific causal contributions to the risk score. Providing the first interventional audit of Sybil, we demonstrate that while the model often exhibits behavior akin to an expert radiologist, differentiating malignant pulmonary nodules from benign ones, it suffers from critical failure modes, including dangerous sensitivity to clinically unjustified artifacts and a distinct radial bias.

[468] A General ReLearner: Empowering Spatiotemporal Prediction by Re-learning Input-label Residual

Jiaming Ma, Binwu Wang, Pengkun Wang, Xu Wang, Zhengyang Zhou, Yang Wang

Main category: cs.LG

TL;DR: ReLearner: A bidirectional learning framework for spatiotemporal prediction that incorporates label features during training to address discrepancies between inputs and labels.

Details

Motivation: Current spatiotemporal prediction models use unidirectional learning and struggle when there are discrepancies between input features and future labels (e.g., similar inputs leading to different future outcomes). The paper aims to incorporate label features explicitly during training to address this limitation.

Method: Proposes Spatiotemporal Residual Theorem to generalize unidirectional prediction into bidirectional learning. Introduces ReLearner module with two components: Residual Learning Module to disentangle feature discrepancies between input and label representations, and Residual Smoothing Module to smooth residuals for stable convergence.

Result: Extensive experiments on 11 real-world datasets across 14 backbone models show ReLearner significantly enhances predictive performance of existing Spatiotemporal Neural Networks (STNNs).

Conclusion: The bidirectional learning framework with ReLearner effectively addresses spatiotemporal discrepancies and improves prediction accuracy across various models and datasets.

Abstract: Prevailing spatiotemporal prediction models typically operate under a forward (unidirectional) learning paradigm, in which models extract spatiotemporal features from historical observation input and map them to target spatiotemporal space for future forecasting (label). However, these models frequently exhibit suboptimal performance when spatiotemporal discrepancies exist between inputs and labels, for instance, when nodes with similar time-series inputs manifest distinct future labels, or vice versa. To address this limitation, we propose explicitly incorporating label features during the training phase. Specifically, we introduce the Spatiotemporal Residual Theorem, which generalizes the conventional unidirectional spatiotemporal prediction paradigm into a bidirectional learning framework. Building upon this theoretical foundation, we design an universal module, termed ReLearner, which seamlessly augments Spatiotemporal Neural Networks (STNNs) with a bidirectional learning capability via an auxiliary inverse learning process. In this process, the model relearns the spatiotemporal feature residuals between input data and future data. The proposed ReLearner comprises two critical components: (1) a Residual Learning Module, designed to effectively disentangle spatiotemporal feature discrepancies between input and label representations; and (2) a Residual Smoothing Module, employed to smooth residual terms and facilitate stable convergence. Extensive experiments conducted on 11 real-world datasets across 14 backbone models demonstrate that ReLearner significantly enhances the predictive performance of existing STNNs.Our code is available on GitHub.

[469] High Rank Matrix Completion via Grassmannian Proxy Fusion

Huanran Li, Jeremy Johnson, Daniel Pimentel-Alarcón

Main category: cs.LG

TL;DR: A method for high-rank matrix completion that clusters incomplete data vectors by grouping proxy subspaces and optimizes two Grassmannian criteria to minimize distances between points and subspaces and between subspaces themselves.

Details

Motivation: Current methods for high-rank matrix completion often lack theoretical support, produce uninterpretable results, and require more samples than theoretically necessary. There's a need for methods that can work effectively at low sampling rates while being theoretically grounded.

Method: The approach clusters incomplete vectors by grouping proxy subspaces and minimizes two criteria over the Grassmannian: (1) chordal distance between each point and its corresponding subspace, and (2) geodesic distances between subspaces of all data points.

Result: Experiments on synthetic and real datasets show the method performs comparably to leading methods at high sampling rates and significantly better at low sampling rates, narrowing the gap to the theoretical sampling limit of HRMC.

Conclusion: The proposed method provides a theoretically grounded approach to high-rank matrix completion that works well even at low sampling rates, addressing limitations of existing methods while producing interpretable results.

Abstract: This paper approaches high-rank matrix completion (HRMC) by filling missing entries in a data matrix where columns lie near a union of subspaces, clustering these columns, and identifying the underlying subspaces. Current methods often lack theoretical support, produce uninterpretable results, and require more samples than theoretically necessary. We propose clustering incomplete vectors by grouping proxy subspaces and minimizing two criteria over the Grassmannian: (a) the chordal distance between each point and its corresponding subspace and (b) the geodesic distances between subspaces of all data points. Experiments on synthetic and real datasets demonstrate that our method performs comparably to leading methods in high sampling rates and significantly better in low sampling rates, thus narrowing the gap to the theoretical sampling limit of HRMC.

[470] A Comparative Simulation Study of the Fairness and Accuracy of Predictive Policing Systems in Baltimore City

Samin Semsar, Kiran Laxmikant Prabhu, Gabriella Waters, James Foulds

Main category: cs.LG

TL;DR: Comparative simulation study of predictive policing fairness and accuracy in Baltimore, revealing complex bias patterns and comparing with traditional hot spots policing.

Details

Motivation: Addressing concerns about racial bias in predictive policing systems, with limited comprehensive comparative studies on their fairness and accuracy compared to traditional policing methods.

Method: Comprehensive comparative simulation study analyzing fairness and accuracy of predictive policing technologies in Baltimore, comparing with traditional hot spots policing approaches.

Result: Predictive policing exhibited bias due to feedback loops, but traditional hot spots policing had similar issues. Predictive policing was more fair and accurate in short term but amplified bias faster long-term. In Baltimore, bias sometimes favored over-policing in White neighborhoods, contrary to previous studies.

Conclusion: The study demonstrates a methodology for city-specific evaluation of predictive policing systems, revealing complex bias patterns and showing simulations can expose inequities and long-term behavioral tendencies.

Abstract: There are ongoing discussions about predictive policing systems, such as those deployed in Los Angeles, California and Baltimore, Maryland, being unfair, for example, by exhibiting racial bias. Studies found that unfairness may be due to feedback loops and being trained on historically biased recorded data. However, comparative studies on predictive policing systems are few and are not sufficiently comprehensive. In this work, we perform a comprehensive comparative simulation study on the fairness and accuracy of predictive policing technologies in Baltimore. Our results suggest that the situation around bias in predictive policing is more complex than was previously assumed. While predictive policing exhibited bias due to feedback loops as was previously reported, we found that the traditional alternative, hot spots policing, had similar issues. Predictive policing was found to be more fair and accurate than hot spots policing in the short term, although it amplified bias faster, suggesting the potential for worse long-run behavior. In Baltimore, in some cases the bias in these systems tended toward over-policing in White neighborhoods, unlike in previous studies. Overall, this work demonstrates a methodology for city-specific evaluation and behavioral-tendency comparison of predictive policing systems, showing how such simulations can reveal inequities and long-term tendencies.

[471] Mitigating Task-Order Sensitivity and Forgetting via Hierarchical Second-Order Consolidation

Protik Nag, Krishnan Raghavan, Vignesh Narayanan

Main category: cs.LG

TL;DR: HTCL is a hierarchical continual learning framework that uses Taylor series and Hessian regularization to address task-order variance through local adaptation and global consolidation.

Details

Motivation: Continual learning systems suffer from high variance due to random task ordering, which affects model stability and performance. Current approaches lack hierarchical knowledge integration and robust consolidation mechanisms.

Method: HTCL combines fast local adaptation with conservative second-order global consolidation using Hessian-regularized Taylor expansion. It identifies optimal intra-group task sequences and extends to L-level hierarchies for multiscale knowledge integration.

Result: HTCL consistently enhances performance across various datasets and baselines, achieving mean accuracy gains of 7-25% while reducing standard deviation of final accuracy by up to 68% across random task permutations.

Conclusion: HTCL provides a model-agnostic consolidation layer that effectively addresses task-order variance in continual learning through hierarchical Taylor series-based optimization with theoretical guarantees.

Abstract: We introduce $\textbf{Hierarchical Taylor Series-based Continual Learning (HTCL)}$, a framework that couples fast local adaptation with conservative, second-order global consolidation to address the high variance introduced by random task ordering. To address task-order effects, HTCL identifies the best intra-group task sequence and integrates the resulting local updates through a Hessian-regularized Taylor expansion, yielding a consolidation step with theoretical guarantees. The approach naturally extends to an $L$-level hierarchy, enabling multiscale knowledge integration in a manner not supported by conventional single-level CL systems. Across a wide range of datasets and replay and regularization baselines, HTCL acts as a model-agnostic consolidation layer that consistently enhances performance, yielding mean accuracy gains of $7%$ to $25%$ while reducing the standard deviation of final accuracy by up to $68%$ across random task permutations.

[472] Trajectory Consistency for One-Step Generation on Euler Mean Flows

Zhiqi Li, Yuchen Sun, Duowen Chen, Jinjin He, Bo Zhu

Main category: cs.LG

TL;DR: EMF is a flow-based generative framework for efficient one-step/few-step generation using linear surrogate constraints for long-range trajectory consistency, reducing training time and memory by ~50%.

Details

Motivation: Existing flow-based generative models face challenges with long-range trajectory consistency constraints that are difficult to supervise and optimize over long time scales, leading to high computational costs and memory requirements.

Method: Replace difficult trajectory consistency constraints with a principled linear surrogate derived from semigroup formulation of flow-based models. This enables direct data supervision for long-horizon flow-map compositions without explicit Jacobian computations (JVP-free). Supports both u-prediction and x₁-prediction variants.

Result: Improved optimization stability and sample quality under fixed sampling budgets. Approximately 50% reduction in training time and memory consumption compared to existing one-step methods for image generation. Demonstrated effectiveness on image synthesis, particle-based geometry generation, and functional generation tasks.

Conclusion: EMF provides an efficient flow-based generative framework that enforces long-range trajectory consistency with minimal sampling cost, offering significant computational advantages while maintaining or improving sample quality.

Abstract: We propose \emph{Euler Mean Flows (EMF)}, a flow-based generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both $u$-prediction and $x_1$-prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately $50%$ reductions in training time and memory consumption compared to existing one-step methods for image generation.

[473] Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

Main category: cs.LG

TL;DR: A method to optimize reward models for LLM alignment by formalizing it as a Stackelberg game and using reward shaping to address bias from KL regularization while preventing reward hacking.

Details

Motivation: Current alignment methods using KL regularization with base policies can inherit biases that conflict with user preferences, while simply amplifying rewards risks reward hacking. There's a need for optimal reward model design under KL regularization constraints.

Method: Formalize reward model optimization as a Stackelberg game, develop a simple reward shaping scheme to approximate the optimal reward model, and integrate it into existing alignment methods with minimal overhead.

Result: Empirical evaluation shows consistent improvement in average reward and achieves win-tie rates exceeding 66% against all baselines across various evaluation settings.

Conclusion: The proposed reward model optimization approach effectively addresses the bias-reward hacking tradeoff in LLM alignment and integrates seamlessly with existing methods.

Abstract: Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user’s utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.

[474] Product Interaction: An Algebraic Formalism for Deep Learning Architectures

Haonan Dong, Chun-Wun Cheng, Angelica I. Aviles-Rivero

Main category: cs.LG

TL;DR: A mathematical framework called “product interactions” that unifies neural network layers through algebraic compositions, showing how different architectures correspond to different interaction orders.

Details

Motivation: To develop a unified algebraic framework that systematically constructs neural network layers from basic multiplication operations, revealing the underlying mathematical structure connecting diverse architectures like convolutional networks, attention mechanisms, and Mamba.

Method: Introduces product interactions as an algebraic formalism where neural network layers are built from compositions of multiplication operators over suitable algebras. Organizes algebraic expressions by increasing interaction order, showing how different architectures correspond to specific interaction orders.

Result: Provides a principled mathematical framework that unifies various neural network architectures: convolutional and equivariant networks correspond to symmetry-constrained linear product interactions, while attention and Mamba correspond to higher-order product interactions.

Conclusion: Product interactions offer a systematic algebraic approach to understanding and constructing neural network layers, revealing fundamental mathematical connections between seemingly disparate architectures through their interaction order properties.

Abstract: In this paper, we introduce product interactions, an algebraic formalism in which neural network layers are constructed from compositions of a multiplication operator defined over suitable algebras. Product interactions provide a principled way to generate and organize algebraic expressions by increasing interaction order. Our central observation is that algebraic expressions in modern neural networks admit a unified construction in terms of linear, quadratic, and higher-order product interactions. Convolutional and equivariant networks arise as symmetry-constrained linear product interactions, while attention and Mamba correspond to higher-order product interactions.

[475] QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals

Nan Zhang, Eugene Kwek, Yusen Zhang, Muyu Pan, Suhang Wang, Prasenjit Mitra, Rui Zhang

Main category: cs.LG

TL;DR: QuantLRM: A weight quantization method for Large Reasoning Models that uses fine-tuning update magnitudes to identify important weights, protecting both smallest and largest updates via quadratic functions for better compression.

Details

Motivation: Weight-only quantization is crucial for compressing LLMs, but existing methods may not effectively handle Large Reasoning Models (LRMs). The paper investigates whether weight update magnitudes during reasoning-incentivized fine-tuning can provide better signals for quantization than traditional approaches.

Method: QuantLRM analyzes weight update magnitudes during fine-tuning, hypothesizing that both smallest and largest updates are important (“protecting both ends”). It fits restricted quadratic functions on weight updates, multiplies average quadratic values with zero-update channel counts to compute channel importance, and applies this to quantize various fine-tuned models.

Result: QuantLRM consistently improves LRM quantization across four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, GPQA-Diamond), with average 6.55% improvement on reinforcement learning fine-tuned models. It also works with non-fine-tuned LRMs via pseudo-fine-tuning.

Conclusion: Weight update magnitudes during fine-tuning provide valuable signals for quantizing Large Reasoning Models. The “protecting both ends” hypothesis is validated, and QuantLRM offers an effective quantization approach that outperforms methods using activation or second-order information.

Abstract: Weight-only quantization is important for compressing Large Language Models (LLMs). Inspired by the spirit of classical magnitude pruning, we study whether the magnitude of weight updates during reasoning-incentivized fine-tuning can provide valuable signals for quantizing Large Reasoning Models (LRMs). We hypothesize that the smallest and largest weight updates during fine-tuning are more important than those of intermediate magnitude, a phenomenon we term “protecting both ends”. Upon hypothesis validation, we introduce QuantLRM, which stands for weight quantization of LRMs via fine-tuning signals. We fit simple restricted quadratic functions on weight updates to protect both ends. By multiplying the average quadratic values with the count of zero weight updates of channels, we compute channel importance that is more effective than using activation or second-order information. We run QuantLRM to quantize various fine-tuned models (including supervised, direct preference optimization, and reinforcement learning fine-tuning) over four reasoning benchmarks (AIME-120, FOLIO, temporal sequences, and GPQA-Diamond) and empirically find that QuantLRM delivers a consistent improvement for LRMs quantization, with an average improvement of 6.55% on a reinforcement learning fine-tuned model. Also supporting non-fine-tuned LRMs, QuantLRM gathers effective signals via pseudo-fine-tuning, which greatly enhances its applicability.

[476] Copula-Based Aggregation and Context-Aware Conformal Prediction for Reliable Renewable Energy Forecasting

Alireza Moradi, Mathieu Tanneau, Reza Zandehshahvar, Pascal Van Hentenryck

Main category: cs.LG

TL;DR: A calibrated probabilistic aggregation framework that converts site-level renewable energy forecasts into reliable fleet-level forecasts using copula-based dependence modeling and context-aware conformal prediction.

Details

Motivation: System operators need reliable probabilistic forecasts for aggregated renewable energy fleets, but often only have access to heterogeneous site-level forecasts from third-party providers. Existing approaches struggle with complex cross-site dependencies and aggregation-induced miscalibration when constructing fleet-level forecasts.

Method: Proposes a two-stage framework: 1) Copula-based dependence modeling to capture cross-site correlations, and 2) Context-Aware Conformal Prediction (CACP) to correct miscalibration at the aggregated level. This enables dependence-aware aggregation while maintaining valid coverage and sharp prediction intervals.

Result: Experiments on large-scale solar generation datasets from MISO, ERCOT, and SPP show the Copula+CACP approach consistently achieves near-nominal coverage with significantly sharper intervals than uncalibrated aggregation baselines.

Conclusion: The framework provides a practical solution for constructing reliable fleet-level probabilistic forecasts from heterogeneous site-level inputs when system-level models cannot be trained or maintained, addressing key challenges in renewable energy grid operations.

Abstract: The rapid growth of renewable energy penetration has intensified the need for reliable probabilistic forecasts to support grid operations at aggregated (fleet or system) levels. In practice, however, system operators often lack access to fleet-level probabilistic models and instead rely on site-level forecasts produced by heterogeneous third-party providers. Constructing coherent and calibrated fleet-level probabilistic forecasts from such inputs remains challenging due to complex cross-site dependencies and aggregation-induced miscalibration. This paper proposes a calibrated probabilistic aggregation framework that directly converts site-level probabilistic forecasts into reliable fleet-level forecasts in settings where system-level models cannot be trained or maintained. The framework integrates copula-based dependence modeling to capture cross-site correlations with Context-Aware Conformal Prediction (CACP) to correct miscalibration at the aggregated level. This combination enables dependence-aware aggregation while providing valid coverage and maintaining sharp prediction intervals. Experiments on large-scale solar generation datasets from MISO, ERCOT, and SPP demonstrate that the proposed Copula+CACP approach consistently achieves near-nominal coverage with significantly sharper intervals than uncalibrated aggregation baselines.

[477] Learnable Koopman-Enhanced Transformer-Based Time Series Forecasting with Spectral Control

Ali Forootani, Raffaele Iervolino

Main category: cs.LG

TL;DR: Learnable Koopman operators combine linear dynamical systems theory with deep learning for time series forecasting, offering explicit control over stability and interpretability.

Details

Motivation: To bridge the gap between theoretically principled linear dynamical systems (Koopman operators) and modern deep learning forecasting architectures, enabling better control over stability and interpretability while maintaining forecasting performance.

Method: Proposes four learnable Koopman variants: scalar-gated, per-mode gated, MLP-shaped spectral mapping, and low-rank Koopman operators. These integrate with nonlinear backbones like Patchtst, Autoformer, and Informer, allowing explicit control over spectrum, stability, and rank of linear transition operators.

Result: Learnable Koopman models provide favorable bias-variance trade-off, improved conditioning, and more interpretable latent dynamics compared to LSTM, DLinear, SSMs, and lightweight transformers across multiple horizons and patch lengths.

Conclusion: Learnable Koopman operators are effective, stable, and theoretically principled components for deep forecasting that combine the benefits of linear dynamical systems theory with modern deep learning architectures.

Abstract: This paper proposes a unified family of learnable Koopman operator parameterizations that integrate linear dynamical systems theory with modern deep learning forecasting architectures. We introduce four learnable Koopman variants-scalar-gated, per-mode gated, MLP-shaped spectral mapping, and low-rank Koopman operators which generalize and interpolate between strictly stable Koopman operators and unconstrained linear latent dynamics. Our formulation enables explicit control over the spectrum, stability, and rank of the linear transition operator while retaining compatibility with expressive nonlinear backbones such as Patchtst, Autoformer, and Informer. We evaluate the proposed operators in a large-scale benchmark that also includes LSTM, DLinear, and simple diagonal State-Space Models (SSMs), as well as lightweight transformer variants. Experiments across multiple horizons and patch lengths show that learnable Koopman models provide a favorable bias-variance trade-off, improved conditioning, and more interpretable latent dynamics. We provide a full spectral analysis, including eigenvalue trajectories, stability envelopes, and learned spectral distributions. Our results demonstrate that learnable Koopman operators are effective, stable, and theoretically principled components for deep forecasting.

[478] Effective Frontiers: A Unification of Neural Scaling Laws

Jiaxuan Zou, Zixuan Gong, Ye Su, Huayi Tang, Yong Liu

Main category: cs.LG

TL;DR: A unified theoretical framework explains neural scaling laws through pattern coverage of long-tail distributions, introducing the Effective Frontier concept and deriving scaling laws for model capacity, data size, and compute.

Details

Motivation: Existing theoretical explanations for neural scaling laws are often architecture-specific or rely on complex kernel methods, lacking intuitive universality. The paper aims to provide a unified framework that abstracts general learning tasks as progressive coverage of patterns from long-tail distributions.

Method: Proposes a framework that abstracts learning tasks as coverage of patterns from a Zipfian (long-tail) distribution. Introduces the Effective Frontier (k⋆) threshold separating learned knowledge from unlearned tail. Derives scaling laws for model capacity (N), data size (D), and compute (C) based on this framework, attributing them to capacity, coverage, and optimization bottlenecks respectively.

Result: Derives precise scaling laws for N, D, and C, showing they correspond to capacity, coverage, and optimization bottlenecks. Unifies these mechanisms via a Max-Bottleneck principle, demonstrating that Kaplan and Chinchilla scaling laws are equilibrium solutions to the same constrained optimization problem under different active bottlenecks.

Conclusion: Provides a unified theoretical framework for understanding neural scaling laws through pattern coverage of long-tail distributions, explaining different scaling regimes as manifestations of different bottleneck constraints in the same underlying optimization problem.

Abstract: Neural scaling laws govern the prediction power-law improvement of test loss with respect to model capacity ($N$), datasize ($D$), and compute ($C$). However, existing theoretical explanations often rely on specific architectures or complex kernel methods, lacking intuitive universality. In this paper, we propose a unified framework that abstracts general learning tasks as the progressive coverage of patterns from a long-tail (Zipfian) distribution. We introduce the Effective Frontier ($k_\star$), a threshold in the pattern rank space that separates learned knowledge from the unlearned tail. We prove that reducible loss is asymptotically determined by the probability mass of the tail a resource-dependent frontier truncation. Based on our framework, we derive the precise scaling laws for $N$, $D$, and $C$, attributing them to capacity, coverage, and optimization bottlenecks, respectively. Furthermore, we unify these mechanisms via a Max-Bottleneck principle, demonstrating that the Kaplan and Chinchilla scaling laws are not contradictory, but equilibrium solutions to the same constrained optimization problem under different active bottlenecks.

[479] Fubini Study geometry of representation drift in high dimensional data

Arturo Tozzi

Main category: cs.LG

TL;DR: Introduces Fubini-Study metric for measuring representation drift in high-dimensional systems, separating intrinsic changes from gauge transformations like rescalings and sign flips.

Details

Motivation: Current metrics (Euclidean, cosine distances) for measuring representation drift entangle intrinsic data changes with arbitrary parametrization variations, lacking invariance under gauge transformations like global rescalings or sign flips.

Method: Proposes a projective geometric framework using the Fubini-Study metric to quantify representation drift, which remains invariant under gauge transformations. Constructs representation trajectories and tracks evolution through cumulative geometric drift, comparing Euclidean, cosine, and Fubini-Study distances.

Result: Conventional metrics systematically overestimate change when representations have projective ambiguity, while Fubini-Study metric isolates intrinsic evolution. The difference between cosine and Fubini-Study drift defines a computable quantity capturing representation churn from gauge freedom.

Conclusion: Establishes geometric criterion for assessing representation stability in high-dimensional systems, clarifies limits of angular distances, and provides diagnostic for distinguishing meaningful structural evolution from parametrization artifacts without model-specific assumptions.

Abstract: High dimensional representation drift is commonly quantified using Euclidean or cosine distances, which presuppose fixed coordinates when comparing representations across time, training or preprocessing stages. While effective in many settings, these measures entangle intrinsic changes in the data with variations induced by arbitrary parametrizations. We introduce a projective geometric view of representation drift grounded in the Fubini Study metric, which identifies representations that differ only by gauge transformations such as global rescalings or sign flips. Applying this framework to empirical high dimensional datasets, we explicitly construct representation trajectories and track their evolution through cumulative geometric drift. Comparing Euclidean, cosine and Fubini Study distances along these trajectories reveals that conventional metrics systematically overestimate change whenever representations carry genuine projective ambiguity. By contrast, the Fubini Study metric isolates intrinsic evolution by remaining invariant under gauge-induced fluctuations. We further show that the difference between cosine and Fubini Study drift defines a computable, monotone quantity that directly captures representation churn attributable to gauge freedom. This separation provides a diagnostic for distinguishing meaningful structural evolution from parametrization artifacts, without introducing model-specific assumptions. Overall, we establish a geometric criterion for assessing representation stability in high-dimensional systems and clarify the limits of angular distances. Embedding representation dynamics in projective space connects data analysis with established geometric programs and yields observables that are directly testable in empirical workflows.

[480] ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization

Hongyuan Su, Yu Zheng, Yong Li

Main category: cs.LG

TL;DR: ContextEvolve is a multi-agent framework for optimizing LLM-generated code for computer systems, achieving RL-level efficiency without parameter updates by decomposing optimization context into three orthogonal dimensions.

Details

Motivation: LLMs can generate plausible code for systems research, but meeting stringent correctness and performance requirements demands iterative optimization. Existing methods have limitations: test-time RL requires parameter updates (infeasible with API-only access), while training-free evolutionary methods suffer from inefficient context utilization and undirected search.

Method: ContextEvolve uses three specialized agents: 1) Summarizer Agent condenses semantic state via code-to-language abstraction, 2) Navigator Agent distills optimization direction from trajectory analysis, and 3) Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration creates a functional isomorphism with RL components (state representation, policy gradient, experience replay) enabling principled optimization in textual latent space.

Result: On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%.

Conclusion: ContextEvolve demonstrates that RL-level search efficiency can be achieved under parameter-blind constraints through multi-agent decomposition of optimization context, offering an effective approach for optimizing LLM-generated code for computer systems.

Abstract: Large language models are transforming systems research by automating the discovery of performance-critical algorithms for computer systems. Despite plausible codes generated by LLMs, producing solutions that meet the stringent correctness and performance requirements of systems demands iterative optimization. Test-time reinforcement learning offers high search efficiency but requires parameter updates infeasible under API-only access, while existing training-free evolutionary methods suffer from inefficient context utilization and undirected search. We introduce ContextEvolve, a multi-agent framework that achieves RL-level search efficiency under strict parameter-blind constraints by decomposing optimization context into three orthogonal dimensions: a Summarizer Agent condenses semantic state via code-to-language abstraction, a Navigator Agent distills optimization direction from trajectory analysis, and a Sampler Agent curates experience distribution through prioritized exemplar retrieval. This orchestration forms a functional isomorphism with RL-mapping to state representation, policy gradient, and experience replay-enabling principled optimization in a textual latent space. On the ADRS benchmark, ContextEvolve outperforms state-of-the-art baselines by 33.3% while reducing token consumption by 29.0%. Codes for our work are released at https://anonymous.4open.science/r/ContextEvolve-ACC

[481] RAP: KV-Cache Compression via RoPE-Aligned Pruning

Jihao Xin, Tian Lvu, Hatem Ltaief, David Keyes, Marco Canini

Main category: cs.LG

TL;DR: RAP (RoPE-Aligned Pruning) is a method that prunes entire RoPE-aligned column pairs in LLMs to reduce KV-Cache memory/compute costs while maintaining accuracy, achieving 20-30% reductions in KV-Cache, attention parameters, and FLOPs.

Details

Motivation: Long-context inference in LLMs is bottlenecked by KV-Cache memory and compute costs. Low-rank factorization approaches fail in modern RoPE-based LLMs because RoPE forces latent KV states to be reconstructed to full dimension, reintroducing overhead.

Method: Proposes RoPE-Aligned Pruning (RAP) which prunes entire RoPE-aligned column pairs to preserve RoPE’s 2x2 rotation structure, restore B absorption (from W ≈ A*B factorization), and eliminate reconstruction overhead.

Result: Evaluation on LLaMA-3-8B and Mistral-7B shows RAP enables joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30% while maintaining strong accuracy. Reduces attention latency to 83% (prefill) and 77% (decode) of baseline.

Conclusion: RAP effectively addresses KV-Cache bottlenecks in RoPE-based LLMs through structured pruning that preserves rotational structure, enabling significant efficiency gains without sacrificing accuracy.

Abstract: Long-context inference in large language models is increasingly bottlenecked by the memory and compute cost of the KV-Cache. Low-rank factorization compresses KV projections by writing $W \approx A * B$, where A produces latent KV states and B can be absorbed into downstream weights. In modern RoPE-based LLMs, this absorption fails: RoPE forces latent KV states to be reconstructed to full dimension, reintroducing substantial memory and compute overhead. We propose RoPE-Aligned Pruning (RAP), which prunes entire RoPE-aligned column pairs to preserve RoPE’s 2x2 rotation structure, restore B absorption, and eliminate reconstruction. Our evaluation on LLaMA-3-8B and Mistral-7B shows that RAP enables joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30%, all at once, while maintaining strong accuracy. Notably, RAP reduces attention latency to 83% (prefill) and 77% (decode) of baseline.

[482] Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models

Eliron Rahimi, Elad Hirshel, Rom Himelstein, Amit LeVi, Avi Mendelson, Chaim Baskin

Main category: cs.LG

TL;DR: DLMs show promise as AR alternatives but their sampling mechanisms’ role in safety behaviors is unclear. The paper analyzes step-wise refusal dynamics, introduces SRI signal for interpretability, and develops lightweight detectors for jailbreak attacks.

Details

Motivation: To understand how sampling mechanisms (not just learned representations) influence refusal behavior and jailbreak robustness in diffusion language models compared to autoregressive models, and to develop interpretable safety signals.

Method: Developed an analytical framework for step-wise refusal dynamics, introduced Step-Wise Refusal Internal Dynamics (SRI) signal, analyzed geometric structure of SRI to capture internal recovery dynamics, and created lightweight inference-time detectors based on this structure.

Result: SRI identifies anomalous behavior in harmful generations as incomplete internal recovery, enables detectors that generalize to unseen attacks while matching/exceeding existing defenses with 100× lower inference overhead.

Conclusion: Sampling strategy itself is central to safety behavior, distinct from learned representations. SRI provides interpretable safety signals and enables efficient, effective jailbreak detection for both AR and diffusion language models.

Abstract: Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present a fundamental analytical framework for step-wise refusal dynamics, enabling comparison between AR and diffusion sampling. Our analysis reveals that the sampling strategy itself plays a central role in safety behavior, as a factor distinct from the underlying learned representations. Motivated by this analysis, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which supports interpretability and improved safety for both AR and DLMs. We demonstrate that the geometric structure of SRI captures internal recovery dynamics, and identifies anomalous behavior in harmful generations as cases of \emph{incomplete internal recovery} that are not observable at the text level. This structure enables lightweight inference-time detectors that generalize to unseen attacks while matching or outperforming existing defenses with over $100\times$ lower inference overhead.

[483] Discovering Data Manifold Geometry via Non-Contracting Flows

David Vigouroux, Lucas Drumetz, Ronan Fablet, François Rousseau

Main category: cs.LG

TL;DR: Unsupervised method learns global coordinate system on data manifolds by learning tangent vector fields that transport all points to a common reference, enabling interpretable intrinsic coordinates without assuming manifold flatness.

Details

Motivation: Existing manifold learning methods often assume isometric objectives that implicitly require manifold flatness. The authors aim to develop a more general approach that can construct a global reference system on unknown data manifolds without such restrictive assumptions, enabling interpretable intrinsic coordinates tied to a shared global frame.

Method: Learn tangent vector fields in ambient space whose flows transport all samples to a common, learnable reference point. Use arc-lengths along these flows as intrinsic coordinates. Prevent degenerate collapse with non-shrinking constraint. Derive scalable, integration-free objective inspired by flow matching. Prove theoretical framework recovers global coordinate chart when one exists.

Result: Method achieves correct tangent alignment and coherent global coordinate structure on synthetic manifolds. Scales to CIFAR-10 where learned coordinates achieve competitive downstream classification performance compared to other methods.

Conclusion: Proposed unsupervised approach successfully constructs global reference systems on data manifolds without assuming flatness, providing interpretable intrinsic coordinates that are theoretically sound and practically scalable to real-world datasets like CIFAR-10.

Abstract: We introduce an unsupervised approach for constructing a global reference system by learning, in the ambient space, vector fields that span the tangent spaces of an unknown data manifold. In contrast to isometric objectives, which implicitly assume manifold flatness, our method learns tangent vector fields whose flows transport all samples to a common, learnable reference point. The resulting arc-lengths along these flows define interpretable intrinsic coordinates tied to a shared global frame. To prevent degenerate collapse, we enforce a non-shrinking constraint and derive a scalable, integration-free objective inspired by flow matching. Within our theoretical framework, we prove that minimizing the proposed objective recovers a global coordinate chart when one exists. Empirically, we obtain correct tangent alignment and coherent global coordinate structure on synthetic manifolds. We also demonstrate the scalability of our method on CIFAR-10, where the learned coordinates achieve competitive downstream classification performance.

[484] A Semi-Supervised Pipeline for Generalized Behavior Discovery from Animal-Borne Motion Time Series

Fatemeh Karimi Nejadasl, Judy Shamoun-Baranes, Eldar Rakhimberdiev

Main category: cs.LG

TL;DR: Proposes a semi-supervised pipeline for discovering novel animal behaviors in motion sensor data using label-guided clustering and a KDE+HDR containment score to detect truly novel behaviors.

Details

Motivation: Behavioral taxonomy learning from animal-borne sensors faces challenges: scarce labels, severe class imbalance, and potential absence of behaviors from annotated sets. Need methods to discover novel behaviors in ecological motion time series with limited supervision.

Method: Three-step pipeline: (1) Learn embedding function from labeled subset, (2) Perform label-guided clustering over embeddings of both labeled/unlabeled samples to form candidate behavior groups, (3) Use KDE+HDR (highest-density region) containment score to decide if discovered group is truly novel by measuring distribution overlap with known classes.

Result: When entire behavior is withheld from supervision, method recovers distinct cluster and containment score flags novelty via low overlap. Negative-control setting with no novel behavior yields consistently higher overlaps, validating the approach.

Conclusion: HDR-based containment provides practical, quantitative test for generalized class discovery in ecological motion time series under limited annotation and severe class imbalance, enabling discovery of novel animal behaviors.

Abstract: Learning behavioral taxonomies from animal-borne sensors is challenging because labels are scarce, classes are highly imbalanced, and behaviors may be absent from the annotated set. We study generalized behavior discovery in short multivariate motion snippets from gulls, where each sample is a sequence with 3-axis IMU acceleration (20 Hz) and GPS speed, spanning nine expert-annotated behavior categories. We propose a semi-supervised discovery pipeline that (i) learns an embedding function from the labeled subset, (ii) performs label-guided clustering over embeddings of both labeled and unlabeled samples to form candidate behavior groups, and (iii) decides whether a discovered group is truly novel using a containment score. Our key contribution is a KDE + HDR (highest-density region) containment score that measures how much a discovered cluster distribution is contained within, or contains, each known-class distribution; the best-match containment score serves as an interpretable novelty statistic. In experiments where an entire behavior is withheld from supervision and appears only in the unlabeled pool, the method recovers a distinct cluster and the containment score flags novelty via low overlap, while a negative-control setting with no novel behavior yields consistently higher overlaps. These results suggest that HDR-based containment provides a practical, quantitative test for generalized class discovery in ecological motion time series under limited annotation and severe class imbalance.

[485] daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

Mohan Jiang, Dayuan Fu, Junhao Shi, Ji Zeng, Weiye Si, Keyu Li, Xuefeng Li, Yang Xiao, Wenjie Li, Dequan Wang, Pengfei Liu

Main category: cs.LG

TL;DR: daVinci-Agency: A method for generating long-horizon training data for LLMs by mining structured supervision from real-world software Pull Request sequences, enabling better agentic workflows.

Details

Motivation: LLMs struggle with long-horizon agentic workflows due to lack of training data capturing authentic long-dependency structures and cross-stage evolutionary dynamics. Existing synthesis methods are either limited to single-feature scenarios or too expensive.

Method: Mines structured supervision from chain-of-PRs through three mechanisms: (1) progressive task decomposition via continuous commits, (2) long-term consistency enforcement through unified functional objectives, and (3) verifiable refinement from authentic bug-fix trajectories.

Result: Generated substantial trajectories (avg 85k tokens, 116 tool calls) that are data-efficient: fine-tuning GLM-4.6 on 239 samples yields broad improvements, achieving 47% relative gain on Toolathlon benchmark.

Conclusion: PR sequences provide authentic supervision signals for long-horizon learning, enabling better agentic workflows through real-world software evolution patterns.

Abstract: While Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic workflows remains challenging. The core bottleneck lies in the scarcity of training data that captures authentic long-dependency structures and cross-stage evolutionary dynamics–existing synthesis methods either confine to single-feature scenarios constrained by model distribution, or incur prohibitive human annotation costs, failing to provide scalable, high-quality supervision. We address this by reconceptualizing data synthesis through the lens of real-world software evolution. Our key insight: Pull Request (PR) sequences naturally embody the supervision signals for long-horizon learning. They decompose complex objectives into verifiable submission units, maintain functional coherence across iterations, and encode authentic refinement patterns through bug-fix histories. Building on this, we propose daVinci-Agency, which systematically mines structured supervision from chain-of-PRs through three interlocking mechanisms: (1) progressive task decomposition via continuous commits, (2) long-term consistency enforcement through unified functional objectives, and (3) verifiable refinement from authentic bug-fix trajectories. Unlike synthetic trajectories that treat each step independently, daVinci-Agency’s PR-grounded structure inherently preserves the causal dependencies and iterative refinements essential for teaching persistent goal-directed behavior and enables natural alignment with project-level, full-cycle task modeling. The resulting trajectories are substantial–averaging 85k tokens and 116 tool calls–yet remarkably data-efficient: fine-tuning GLM-4.6 on 239 daVinci-Agency samples yields broad improvements across benchmarks, notably achieving a 47% relative gain on Toolathlon. Beyond benchmark performance, our analysis confirms…

[486] Learning Consistent Causal Abstraction Networks

Gabriele D’Acunto, Paolo Di Lorenzo, Sergio Barbarossa

Main category: cs.LG

TL;DR: The paper presents a sheaf-theoretic framework for learning consistent causal abstraction networks (CANs) using Gaussian structural causal models and linear causal abstractions with efficient optimization via SPECTRAL method.

Details

Motivation: To enhance explainability, trustworthiness, and robustness in AI through causal artificial intelligence by formalizing network sheaves and cosheaves of causal knowledge, specifically focusing on learning consistent causal abstraction networks.

Method: Proposes a sheaf-theoretic framework where SCMs are Gaussian, restriction maps are transposes of constructive linear causal abstractions adhering to semantic embedding principle, and edge stalks correspond to node stalks of more detailed SCMs. Uses edge-specific local Riemannian problems and avoids nonconvex objectives. Introduces SPECTRAL, an iterative method with closed-form updates suitable for positive definite and semidefinite covariance matrices.

Result: Experiments on synthetic data show competitive performance in the causal abstraction learning task and successful recovery of diverse CAN structures.

Conclusion: The paper presents a novel sheaf-theoretic approach to learning consistent causal abstraction networks with efficient optimization, contributing to causal AI’s explainability and robustness goals.

Abstract: Causal artificial intelligence aims to enhance explainability, trustworthiness, and robustness in AI by leveraging structural causal models (SCMs). In this pursuit, recent advances formalize network sheaves and cosheaves of causal knowledge. Pushing in the same direction, we tackle the learning of consistent causal abstraction network (CAN), a sheaf-theoretic framework where (i) SCMs are Gaussian, (ii) restriction maps are transposes of constructive linear causal abstractions (CAs) adhering to the semantic embedding principle, and (iii) edge stalks correspond–up to permutation–to the node stalks of more detailed SCMs. Our problem formulation separates into edge-specific local Riemannian problems and avoids nonconvex objectives. We propose an efficient search procedure, solving the local problems with SPECTRAL, our iterative method with closed-form updates and suitable for positive definite and semidefinite covariance matrices. Experiments on synthetic data show competitive performance in the CA learning task, and successful recovery of diverse CAN structures.

[487] Learning Better Certified Models from Empirically-Robust Teachers

Alessandro De Palma

Main category: cs.LG

TL;DR: Knowledge distillation from adversarially-trained teachers improves certified robustness in ReLU networks for computer vision tasks

Details

Motivation: Adversarial training provides empirical robustness but lacks strong verification certificates, while certified training has poor standard performance. Current methods still sacrifice too much standard performance for verifiability.

Method: Propose knowledge distillation from empirically-robust teachers to improve certifiably-robust models using feature-space distillation objectives

Result: Distillation from adversarially-trained teachers consistently improves state-of-the-art certified training for ReLU networks across robust computer vision benchmarks

Conclusion: Leveraging empirically-robust teachers through knowledge distillation effectively bridges the gap between empirical robustness and certified verifiability

Abstract: Adversarial training attains strong empirical robustness to specific adversarial attacks by training on concrete adversarial perturbations, but it produces neural networks that are not amenable to strong robustness certificates through neural network verification. On the other hand, earlier certified training schemes directly train on bounds from network relaxations to obtain models that are certifiably robust, but display sub-par standard performance. Recent work has shown that state-of-the-art trade-offs between certified robustness and standard performance can be obtained through a family of losses combining adversarial outputs and neural network bounds. Nevertheless, differently from empirical robustness, verifiability still comes at a significant cost in standard performance. In this work, we propose to leverage empirically-robust teachers to improve the performance of certifiably-robust models through knowledge distillation. Using a versatile feature-space distillation objective, we show that distillation from adversarially-trained teachers consistently improves on the state-of-the-art in certified training for ReLU networks across a series of robust computer vision benchmarks.

[488] Performance of Small Language Model Pretraining on FABRIC: An Empirical Study

Praveen Rao

Main category: cs.LG

TL;DR: This paper investigates distributed pretraining techniques for smaller LLMs on commodity GPU clusters, evaluating parallelism strategies and network latency impacts using GPT-2 models with Alpa and Ray frameworks.

Details

Motivation: The motivation is to enable efficient pretraining of smaller LLMs on limited datasets using accessible GPU resources, addressing the computational challenges of LLM pretraining for academic users with constrained resources.

Method: The method involves experimental evaluation of different parallelism techniques (data, intra-operator, inter-operator/pipeline) on homogeneous and heterogeneous GPU clusters using GPT-2 medium/large models with Alpa and Ray frameworks, analyzing network latency impacts.

Result: Results show that Alpa’s execution plans optimizing both intra-operator and inter-operator/pipeline parallelism performed best for geographically distributed GPUs, especially with 10’s of milliseconds network latencies.

Conclusion: The paper proposes a systematic approach for selecting appropriate pretraining techniques to achieve high training performance and reduce GPU usage based on experimental insights.

Abstract: Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are better choice to pretrain (on user-specified datasets) by following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored using vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller-sized LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, and inter-operator/pipeline parallelism, and their combinations for pretraining. We set up different GPU clusters with homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining performance especially when GPUs are geographically distributed. We used GPT-2 medium and large models and pretrained them using open-source packages, namely, Alpa and Ray. We observed that Alpa’s execution plans that collectively optimized intra-operator and inter-operator/pipeline parallelism consistently performed the best when GPUs were geographically distributed. This was especially true when the network latencies were in 10’s of milliseconds. Based on the insights gained from the experiments, we propose a systematic approach for selecting the appropriate pretraining technique to achieve high training performance/lower execution time as well as to reduce the number of GPUs used.

[489] A Reduction from Delayed to Immediate Feedback for Online Convex Optimization with Improved Guarantees

Alexander Ryabchenko, Idan Attias, Daniel M. Roy

Main category: cs.LG

TL;DR: A reduction framework for online learning with delayed feedback that improves regret bounds for both first-order and bandit convex optimization by decomposing regret into delay-independent learning and delay-induced drift terms.

Details

Motivation: Existing online learning algorithms struggle with delayed feedback scenarios where observations arrive after some time lag. The paper aims to develop a unified framework that handles round-dependent delays and improves regret bounds, particularly for bandit convex optimization where delays significantly impact performance.

Method: Introduces a continuous-time model for delayed feedback where regret decomposes into delay-independent learning and delay-induced drift components. Creates a delay-adaptive reduction that converts any online linear optimization algorithm into one handling round-dependent delays. Uses this framework to analyze both first-order and bandit convex optimization settings.

Result: For bandit convex optimization: achieves O(√d_tot + T^{3/4}√k) regret, improving delay-dependent term from O(min{√T d_max, (Td_tot)^{1/3}}) to O(√d_tot). For strongly convex case: achieves O(min{σ_max ln T, √d_tot} + (T^2 ln T)^{1/3} k^{2/3}), improving from O(d_max ln T) to O(min{σ_max ln T, √d_tot}). For first-order feedback: recovers state-of-the-art bounds with simpler analysis.

Conclusion: The reduction framework provides a unified approach to handle delayed feedback in online learning, significantly improving regret bounds for bandit convex optimization and simplifying analysis for first-order optimization. The continuous-time decomposition offers insights into delay effects and enables delay-adaptive algorithms.

Abstract: We develop a reduction-based framework for online learning with delayed feedback that recovers and improves upon existing results for both first-order and bandit convex optimization. Our approach introduces a continuous-time model under which regret decomposes into a delay-independent learning term and a delay-induced drift term, yielding a delay-adaptive reduction that converts any algorithm for online linear optimization into one that handles round-dependent delays. For bandit convex optimization, we significantly improve existing regret bounds, with delay-dependent terms matching state-of-the-art first-order rates. For first-order feedback, we recover state-of-the-art regret bounds via a simpler, unified analysis. Quantitatively, for bandit convex optimization we obtain $O(\sqrt{d_{\text{tot}}} + T^{\frac{3}{4}}\sqrt{k})$ regret, improving the delay-dependent term from $O(\min{\sqrt{T d_{\text{max}}},(Td_{\text{tot}})^{\frac{1}{3}}})$ in previous work to $O(\sqrt{d_{\text{tot}}})$. Here, $k$, $T$, $d_{\text{max}}$, and $d_{\text{tot}}$ denote the dimension, time horizon, maximum delay, and total delay, respectively. Under strong convexity, we achieve $O(\min{σ_{\text{max}} \ln T, \sqrt{d_{\text{tot}}}} + (T^2\ln T)^{\frac{1}{3}} {k}^{\frac{2}{3}})$, improving the delay-dependent term from $O(d_{\text{max}} \ln T)$ in previous work to $O(\min{σ_{\text{max}} \ln T, \sqrt{d_{\text{tot}}}})$, where $σ_{\text{max}}$ denotes the maximum number of outstanding observations and may be considerably smaller than $d_{\text{max}}$.

[490] hSNMF: Hybrid Spatially Regularized NMF for Image-Derived Spatial Transcriptomics

Md Ishtyaq Mahmud, Veena Kochat, Suresh Satpati, Jagan Mohan Reddy Dwarampudi, Humaira Anzum, Kunal Rai, Tania Banerjee

Main category: cs.LG

TL;DR: Spatially regularized NMF methods (SNMF and hSNMF) for high-resolution spatial transcriptomics data that improve spatial compactness, cluster separability, and biological coherence compared to existing baselines.

Details

Motivation: High-resolution spatial transcriptomics platforms like Xenium generate extremely high-dimensional single-cell images with both molecular and spatial information, posing challenges for representation learning and clustering that require specialized methods.

Method: Two spatially regularized NMF variants: 1) SNMF enforces local spatial smoothness by diffusing each cell’s NMF factor vector over its spatial neighborhood; 2) hSNMF performs spatially regularized NMF followed by Leiden clustering on a hybrid adjacency that integrates spatial proximity and transcriptomic similarity.

Result: On cholangiocarcinoma data, SNMF and hSNMF achieve improved spatial compactness (CHAOS < 0.004, Moran’s I > 0.96), greater cluster separability (Silhouette > 0.12, DBI < 1.8), and higher biological coherence compared to other spatial baselines.

Conclusion: Spatially regularized NMF methods effectively address the challenges of high-dimensional spatial transcriptomics data by incorporating spatial information into representation learning and clustering, leading to more biologically meaningful results.

Abstract: High-resolution spatial transcriptomics platforms, such as Xenium, generate single-cell images that capture both molecular and spatial context, but their extremely high dimensionality poses major challenges for representation learning and clustering. In this study, we analyze data from the Xenium platform, which captures high-resolution images of tumor microarray (TMA) tissues and converts them into cell-by-gene matrices suitable for computational analysis. We benchmark and extend nonnegative matrix factorization (NMF) for spatial transcriptomics by introducing two spatially regularized variants. First, we propose Spatial NMF (SNMF), a lightweight baseline that enforces local spatial smoothness by diffusing each cell’s NMF factor vector over its spatial neighborhood. Second, we introduce Hybrid Spatial NMF (hSNMF), which performs spatially regularized NMF followed by Leiden clustering on a hybrid adjacency that integrates spatial proximity (via a contact-radius graph) and transcriptomic similarity through a tunable mixing parameter alpha. Evaluated on a cholangiocarcinoma dataset, SNMF and hSNMF achieve markedly improved spatial compactness (CHAOS < 0.004, Moran’s I > 0.96), greater cluster separability (Silhouette > 0.12, DBI < 1.8), and higher biological coherence (CMC and enrichment) compared to other spatial baselines. Availability and implementation: https://github.com/ishtyaqmahmud/hSNMF

[491] MARA: Continuous SE(3)-Equivariant Attention for Molecular Force Fields

Francesco Leonardi, Boris Bonev, Kaspar Riesen

Main category: cs.LG

TL;DR: MARA extends spherical attention to molecular MLFFs, enabling flexible geometric weighting of atomic interactions for improved accuracy and robustness.

Details

Motivation: Existing machine learning force fields rely on fixed angular expansions that limit flexibility in weighting local geometric interactions, restricting their expressiveness and performance.

Method: Introduces Modular Angular-Radial Attention (MARA), which extends spherical attention from SO(3) to SE(3) for molecular domains, operating directly on angular and radial coordinates of neighboring atoms for flexible geometric weighting.

Result: MARA improves energy and force predictions, reduces high-error events, and enhances robustness across molecular benchmarks when integrated into models like MACE.

Conclusion: Continuous spherical attention is an effective geometric operator that increases expressiveness, stability, and reliability of atomistic models in machine learning force fields.

Abstract: Machine learning force fields (MLFFs) have become essential for accurate and efficient atomistic modeling. Despite their high accuracy, most existing approaches rely on fixed angular expansions, limiting flexibility in weighting local geometric interactions. We introduce Modular Angular-Radial Attention (MARA), a module that extends spherical attention – originally developed for SO(3) tasks – to the molecular domain and SE(3), providing an efficient approximation of equivariant interactions. MARA operates directly on the angular and radial coordinates of neighboring atoms, enabling flexible, geometrically informed, and modular weighting of local environments. Unlike existing attention mechanisms in SE(3)-equivariant architectures, MARA can be integrated in a plug-and-play manner into models such as MACE without architectural modifications. Across molecular benchmarks, MARA improves energy and force predictions, reduces high-error events, and enhances robustness. These results demonstrate that continuous spherical attention is an effective and generalizable geometric operator that increases the expressiveness, stability, and reliability of atomistic models.

[492] FlexRank: Nested Low-Rank Knowledge Decomposition for Adaptive Model Deployment

Riccardo Zaccone, Stefanos Laskaridis, Marco Ciccone, Samuel Horváth

Main category: cs.LG

TL;DR: FlexRank enables adaptive deployment of large models by extracting nested submodels from pretrained networks using low-rank decomposition and importance-based consolidation, allowing cost-performance trade-offs without retraining.

Details

Motivation: Large neural networks (LLMs, ViTs) are expensive to train and deploy as fixed-cost monoliths. Current approaches don't leverage overparameterization for adaptive deployment across different computational budgets.

Method: FlexRank uses low-rank weight decomposition with nested, importance-based consolidation to extract importance-ordered nested components from pretrained models. This creates submodels of increasing capabilities that can be selectively activated based on available computational budget.

Result: Enables “train-once, deploy-everywhere” paradigm with graceful cost-performance trade-offs without training from scratch for each budget, advancing practical deployment of large models.

Conclusion: Importance-ordered nested components can be extracted from pretrained models to enable adaptive deployment across different cost budgets, making large model deployment more practical and flexible.

Abstract: The growing scale of deep neural networks, encompassing large language models (LLMs) and vision transformers (ViTs), has made training from scratch prohibitively expensive and deployment increasingly costly. These models are often used as computational monoliths with fixed cost, a rigidity that does not leverage overparametrized architectures and largely hinders adaptive deployment across different cost budgets. We argue that importance-ordered nested components can be extracted from pretrained models, and selectively activated on the available computational budget. To this end, our proposed FlexRank method leverages low-rank weight decomposition with nested, importance-based consolidation to extract submodels of increasing capabilities. Our approach enables a “train-once, deploy-everywhere” paradigm that offers a graceful trade-off between cost and performance without training from scratch for each budget - advancing practical deployment of large models.

[493] Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models

Marcos Villagra, Bidhan Roy, Raihan Seraj, Zhiying Jiang

Main category: cs.LG

TL;DR: DDMs with decentralized experts show that generation quality depends on expert-data alignment rather than sampling stability, with sparse routing outperforming full ensemble routing despite worse numerical convergence.

Details

Motivation: To understand what governs generation quality in Decentralized Diffusion Models where independently trained experts on disjoint data clusters can strongly disagree in their predictions.

Method: Systematic investigation of DDM routing strategies, analyzing denoising trajectory sensitivity vs. expert-data alignment, using data-cluster distance analysis, per-expert prediction accuracy analysis, and expert disagreement analysis across two distinct DDM systems.

Result: Full ensemble routing achieves most stable sampling dynamics and best numerical convergence but worst generation quality (FID 47.9), while sparse Top-2 routing produces better quality (FID 22.6). Expert-data alignment, not stability, governs generation quality.

Conclusion: For DDM deployment, routing should prioritize expert-data alignment over numerical stability metrics, as generation quality depends on routing inputs to experts whose training distribution covers the current denoising state.

Abstract: Decentralized Diffusion Models (DDMs) route denoising through experts trained independently on disjoint data clusters, which can strongly disagree in their predictions. What governs the quality of generations in such systems? We present the first ever systematic investigation of this question. A priori, the expectation is that minimizing denoising trajectory sensitivity – minimizing how perturbations amplify during sampling – should govern generation quality. We demonstrate this hypothesis is incorrect: a stability-quality dissociation. Full ensemble routing, which combines all expert predictions at each step, achieves the most stable sampling dynamics and best numerical convergence while producing the worst generation quality (FID 47.9 vs. 22.6 for sparse Top-2 routing). Instead, we identify expert-data alignment as the governing principle: generation quality depends on routing inputs to experts whose training distribution covers the current denoising state. Across two distinct DDM systems, we validate expert-data alignment using (i) data-cluster distance analysis, confirming sparse routing selects experts with data clusters closest to the current denoising state, and (ii) per-expert analysis, showing selected experts produce more accurate predictions than non-selected ones, and (iii) expert disagreement analysis, showing quality degrades when experts disagree. For DDM deployment, our findings establish that routing should prioritize expert-data alignment over numerical stability metrics.

[494] Sparsely Supervised Diffusion

Wenshuai Zhao, Zhiyuan Li, Yi Zhao, Mohammad Hassan Vali, Martin Trapp, Joni Pajarinen, Juho Kannala, Arno Solin

Main category: cs.LG

TL;DR: A sparse masking strategy for diffusion models that masks up to 98% of pixels during training to improve global consistency and reduce memorization.

Details

Motivation: Diffusion models often suffer from spatially inconsistent generation due to the inherent locality of their denoising mechanisms, producing samples that are locally plausible but globally inconsistent.

Method: Proposes sparsely supervised learning for diffusion models using a simple masking strategy that can mask up to 98% of pixels during training, implemented with only a few lines of code.

Result: The method delivers competitive FID scores, avoids training instability on small datasets, reduces memorization, and promotes the use of essential contextual information during generation.

Conclusion: Sparse masking during diffusion model training effectively addresses global consistency issues while maintaining competitive performance and improving training stability.

Abstract: Diffusion models have shown remarkable success across a wide range of generative tasks. However, they often suffer from spatially inconsistent generation, arguably due to the inherent locality of their denoising mechanisms. This can yield samples that are locally plausible but globally inconsistent. To mitigate this issue, we propose sparsely supervised learning for diffusion models, a simple yet effective masking strategy that can be implemented with only a few lines of code. Interestingly, the experiments show that it is safe to mask up to 98% of pixels during diffusion model training. Our method delivers competitive FID scores across experiments and, most importantly, avoids training instability on small datasets. Moreover, the masking strategy reduces memorization and promotes the use of essential contextual information during generation.

[495] Every Bit Counts: A Theoretical Study of Precision-Expressivity Tradeoffs in Quantized Transformers

Sayak Chakrabarti, Toniann Pitassi, Josh Alman

Main category: cs.LG

TL;DR: Theoretical analysis shows Transformers need at least p bits of precision to compute equality-like functions, with a sharp threshold - dropping even one bit makes them unable to represent needed comparisons.

Details

Motivation: To theoretically characterize the tradeoff between expressivity and numerical precision in Transformers, explaining the empirical loss of expressivity observed when quantization is used for inference acceleration.

Method: Theoretical analysis combining explicit finite-precision Transformer constructions with communication-complexity lower bounds to prove tight thresholds for expressivity.

Result: For every p, there exists a function Γ (inspired by equality) that a one-layer softmax Transformer can compute with p bits of precision but not with p-1 bits, establishing a sharp “one-bit” threshold.

Conclusion: Tasks requiring equality-like comparisons are especially sensitive to quantization, and precision should be chosen based on the length of equality needed for specific tasks, providing guidance for practitioners.

Abstract: Quantization reduces the numerical precision of Transformer computations and is widely used to accelerate inference, yet its effect on expressivity remains poorly characterized. We demonstrate a fine-grained theoretical tradeoff between expressivity and precision: For every p we exhibit a function Γ, inspired by the equality function, and prove that a one-layer softmax Transformer can compute Γ, with p bits of precision, but not with p-1 bits of precision. This result concretely explains the widely observed phenomenon of empirical loss of expressivity when quantization is used. Practically, it suggests that tasks requiring equality-like comparisons (exact match, membership, etc.) are especially sensitive to quantization. Dropping even one bit can cross a threshold where the model cannot represent the needed comparison reliably. Thus, it paves the way for developing heuristics that will help practitioners choose how much quantization is possible: the precision should be chosen as a function of the length of equality to be checked for the specific task. Our proofs combine explicit finite-precision Transformer constructions with communication-complexity lower bounds, yielding a tight “one-bit” threshold.

[496] BinaryPPO: Efficient Policy Optimization for Binary Classification

Punya Syon Pandey, Zhijing Jin

Main category: cs.LG

TL;DR: BinaryPPO: An offline RL framework that reformulates binary classification as reward maximization, using confidence-weighted PPO to outperform supervised fine-tuning in noisy/imbalanced settings.

Details

Motivation: Supervised fine-tuning (SFT) performs poorly in real-world binary classification tasks with label noise, class imbalance, or sparse supervision, necessitating a more robust approach.

Method: BinaryPPO uses offline reinforcement learning with a variant of Proximal Policy Optimization (PPO) and a confidence-weighted reward function that penalizes uncertain or incorrect predictions, learning robust decision policies from static datasets without online interaction.

Result: Across eight domain-specific benchmarks and multiple model architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines.

Conclusion: Confidence-based reward design provides a robust alternative to SFT for binary classification, with BinaryPPO demonstrating significant improvements in challenging real-world settings.

Abstract: Supervised fine-tuning (SFT) is the standard approach for binary classification tasks such as toxicity detection, factuality verification, and causal inference. However, SFT often performs poorly in real-world settings with label noise, class imbalance, or sparse supervision. We introduce BinaryPPO, an offline reinforcement learning large language model (LLM) framework that reformulates binary classification as a reward maximization problem. Our method leverages a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function that penalizes uncertain or incorrect predictions, enabling the model to learn robust decision policies from static datasets without online interaction. Across eight domain-specific benchmarks and multiple models with differing architectures, BinaryPPO improves accuracy by 40-60 percentage points, reaching up to 99%, substantially outperforming supervised baselines. We provide an in-depth analysis of the role of reward shaping, advantage scaling, and policy stability in enabling this improvement. Overall, we demonstrate that confidence-based reward design provides a robust alternative to SFT for binary classification. Our code is available at https://github.com/psyonp/BinaryPPO.

[497] Maximum Likelihood Reinforcement Learning

Fahim Tajwar, Guanning Zeng, Yueer Zhou, Yuda Song, Daman Arora, Yiding Jiang, Jeff Schneider, Ruslan Salakhutdinov, Haiwen Feng, Andrea Zanette

Main category: cs.LG

TL;DR: MaxRL is a new reinforcement learning framework that approximates maximum likelihood optimization in sampling-based setups with binary feedback, achieving better scaling and efficiency than standard RL methods.

Details

Motivation: Standard reinforcement learning in sampling-based setups with binary outcome feedback (like navigation, code generation, math problem solving) doesn't maximize the true likelihood of correct rollouts, only optimizing a lower-order approximation. This limits efficiency and scaling.

Method: Introduces Maximum Likelihood Reinforcement Learning (MaxRL) which defines a compute-indexed family of sample-based objectives that interpolate between standard RL and exact maximum likelihood. Uses a simple, unbiased policy-gradient estimator that converges to maximum likelihood optimization as compute increases.

Result: MaxRL Pareto-dominates existing methods across all tested models and tasks, achieving up to 20x test-time scaling efficiency gains compared to GRPO-trained counterparts. Also shows better scaling with additional data and compute.

Conclusion: MaxRL is a promising framework for scaling RL training in correctness-based settings, offering a principled approach to approximate maximum likelihood optimization using reinforcement learning techniques.

Abstract: Reinforcement learning is the method of choice to train models in sampling-based setups with binary outcome feedback, such as navigation, code generation, and mathematical problem solving. In such settings, models implicitly induce a likelihood over correct rollouts. However, we observe that reinforcement learning does not maximize this likelihood, and instead optimizes only a lower-order approximation. Inspired by this observation, we introduce Maximum Likelihood Reinforcement Learning (MaxRL), a sampling-based framework to approximate maximum likelihood using reinforcement learning techniques. MaxRL addresses the challenges of non-differentiable sampling by defining a compute-indexed family of sample-based objectives that interpolate between standard reinforcement learning and exact maximum likelihood as additional sampling compute is allocated. The resulting objectives admit a simple, unbiased policy-gradient estimator and converge to maximum likelihood optimization in the infinite-compute limit. Empirically, we show that MaxRL Pareto-dominates existing methods in all models and tasks we tested, achieving up to 20x test-time scaling efficiency gains compared to its GRPO-trained counterpart. We also observe MaxRL to scale better with additional data and compute. Our results suggest MaxRL is a promising framework for scaling RL training in correctness based settings.

[498] Towards Understanding Steering Strength

Magamed Taimeskhanov, Samuel Vaiter, Damien Garreau

Main category: cs.LG

TL;DR: Theoretical analysis of steering strength in LLM representation steering, showing non-monotonic effects and providing quantitative laws for controlling concept emergence.

Details

Motivation: While many methods exist for choosing steering directions in LLM control, little is understood about how to choose the optimal steering magnitude, which is crucial for effective control without performance degradation.

Method: Theoretical analysis characterizing the effect of steering strength on next token probability, concept presence, and cross-entropy, deriving precise qualitative laws, followed by empirical validation on eleven language models.

Result: Reveals surprising non-monotonic effects of steering strength and provides quantitative laws governing steering behavior, validated empirically across diverse LLM architectures.

Conclusion: Provides the first theoretical framework for understanding steering strength in LLM control, offering practical guidance for optimal steering magnitude selection.

Abstract: A popular approach to post-training control of large language models (LLMs) is the steering of intermediate latent representations. Namely, identify a well-chosen direction depending on the task at hand and perturbs representations along this direction at inference time. While many propositions exist to pick this direction, considerably less is understood about how to choose the magnitude of the move, whereas its importance is clear: too little and the intended behavior does not emerge, too much and the model’s performance degrades beyond repair. In this work, we propose the first theoretical analysis of steering strength. We characterize its effect on next token probability, presence of a concept, and cross-entropy, deriving precise qualitative laws governing these quantities. Our analysis reveals surprising behaviors, including non-monotonic effects of steering strength. We validate our theoretical predictions empirically on eleven language models, ranging from a small GPT architecture to modern models.

[499] Neural Probabilistic Amplitude Shaping for Nonlinear Fiber Channels

Mohammad Taha Askari, Lutz Lampe, Amirhossein Ghazisaeidi

Main category: cs.LG

TL;DR: Neural probabilistic amplitude shaping framework for coherent fiber systems achieving 0.5 dB SNR gain over sequence selection for 64-QAM transmission

Details

Motivation: To improve signal-to-noise ratio performance in coherent fiber communication systems through joint-distribution learning approaches

Method: Neural probabilistic amplitude shaping framework that learns joint distributions for coherent fiber systems, applied to dual-polarized 64-QAM transmission

Result: Achieves 0.5 dB signal-to-noise ratio gain over sequence selection methods across a single-span 205 km fiber link

Conclusion: The neural probabilistic amplitude shaping framework provides significant SNR improvements for coherent fiber communication systems

Abstract: We introduce neural probabilistic amplitude shaping, a joint-distribution learning framework for coherent fiber systems. The proposed scheme provides a 0.5 dB signal-to-noise ratio gain over sequence selection for dual-polarized 64-QAM transmission across a single-span 205 km link.

[500] Hierarchical Entity-centric Reinforcement Learning with Factored Subgoal Diffusion

Dan Haramati, Carl Qi, Tal Daniel, Amy Zhang, Aviv Tamar, George Konidaris

Main category: cs.LG

TL;DR: Hierarchical entity-centric framework for offline Goal-Conditioned RL using subgoal decomposition and conditional diffusion models to solve long-horizon tasks in multi-entity domains.

Details

Motivation: Long-horizon tasks in complex environments with multiple entities present combinatorial challenges for RL. GCRL struggles with high-dimensional observations and sparse rewards in such domains.

Method: Two-level hierarchy: value-based GCRL agent + factored subgoal-generating conditional diffusion model. They’re trained independently and composed post hoc through selective subgoal generation based on value function.

Result: Method boosts performance on image-based long-horizon tasks with sparse rewards, achieving over 150% higher success rates on hardest tasks and generalizing to increasing horizons and entity counts.

Conclusion: The hierarchical entity-centric framework effectively addresses combinatorial complexity in multi-entity domains, improving GCRL performance through modular subgoal decomposition.

Abstract: We propose a hierarchical entity-centric framework for offline Goal-Conditioned Reinforcement Learning (GCRL) that combines subgoal decomposition with factored structure to solve long-horizon tasks in domains with multiple entities. Achieving long-horizon goals in complex environments remains a core challenge in Reinforcement Learning (RL). Domains with multiple entities are particularly difficult due to their combinatorial complexity. GCRL facilitates generalization across goals and the use of subgoal structure, but struggles with high-dimensional observations and combinatorial state-spaces, especially under sparse reward. We employ a two-level hierarchy composed of a value-based GCRL agent and a factored subgoal-generating conditional diffusion model. The RL agent and subgoal generator are trained independently and composed post hoc through selective subgoal generation based on the value function, making the approach modular and compatible with existing GCRL algorithms. We introduce new variations to benchmark tasks that highlight the challenges of multi-entity domains, and show that our method consistently boosts performance of the underlying RL agent on image-based long-horizon tasks with sparse rewards, achieving over 150% higher success rates on the hardest task in our suite and generalizing to increasing horizons and numbers of entities. Rollout videos are provided at: https://sites.google.com/view/hecrl

[501] Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

Main category: cs.LG

TL;DR: VQLC is a framework that uses vector quantized-VAE to learn discrete concept vectors from deep learning representations for scalable, human-understandable explanations.

Details

Motivation: Understanding which semantic information in deep learning representations models actually use for predictions is challenging. Existing concept-based explanation methods using clustering have scalability issues (hierarchical clustering) or produce shallow clusters (K-Means).

Method: Proposes Vector Quantized Latent Concept (VQLC) method built on VQ-VAE architecture to learn a discrete codebook that maps continuous representations to concept vectors for scalable concept-based explanations.

Result: VQLC improves scalability while maintaining comparable quality of human-understandable explanations compared to existing methods.

Conclusion: VQLC provides a scalable framework for concept-based explanation of deep learning models while preserving explanation quality.

Abstract: Deep Learning models encode rich semantic information in their hidden representations. However, it remains challenging to understand which parts of this information models actually rely on when making predictions. A promising line of post-hoc concept-based explanation methods relies on clustering token representations. However, commonly used approaches such as hierarchical clustering are computationally infeasible for large-scale datasets, and K-Means often yields shallow or frequency-dominated clusters. We propose the vector quantized latent concept (VQLC) method, a framework built upon the vector quantized-variational autoencoder (VQ-VAE) architecture that learns a discrete codebook mapping continuous representations to concept vectors. We perform thorough evaluations and show that VQLC improves scalability while maintaining comparable quality of human-understandable explanations.

[502] Search-Augmented Masked Diffusion Models for Constrained Generation

Huu Binh Ta, Michael Cardei, Alvaro Velasquez, Ferdinando Fioretto

Main category: cs.LG

TL;DR: SearchDiff integrates symbolic search into discrete diffusion models to improve constraint satisfaction and property optimization during inference without additional training.

Details

Motivation: Standard discrete diffusion models focus on matching data distributions but lack mechanisms for enforcing hard constraints or optimizing non-differentiable properties during inference, limiting their applicability to structured generation tasks.

Method: SearchDiff is a training-free neurosymbolic framework that combines neural denoising with symbolic search. At each denoising step, model predictions define a proposal set that is optimized under user-specified constraints, modifying the reverse transition to steer sampling toward feasible solutions.

Result: Experiments in biological design and symbolic reasoning show SearchDiff substantially improves constraint satisfaction and property adherence while outperforming both discrete diffusion and autoregressive baselines.

Conclusion: SearchDiff successfully bridges neural generation with symbolic reasoning, enabling discrete diffusion models to handle hard constraints and optimize non-differentiable properties without retraining.

Abstract: Discrete diffusion models generate sequences by iteratively denoising samples corrupted by categorical noise, offering an appealing alternative to autoregressive decoding for structured and symbolic generation. However, standard training targets a likelihood-based objective that primarily matches the data distribution and provides no native mechanism for enforcing hard constraints or optimizing non-differentiable properties at inference time. This work addresses this limitation and introduces Search-Augmented Masked Diffusion (SearchDiff), a training-free neurosymbolic inference framework that integrates informed search directly into the reverse denoising process. At each denoising step, the model predictions define a proposal set that is optimized under a user-specified property satisfaction, yielding a modified reverse transition that steers sampling toward probable and feasible solutions. Experiments in biological design and symbolic reasoning illustrate that SearchDiff substantially improves constraint satisfaction and property adherence, while consistently outperforming discrete diffusion and autoregressive baselines.

[503] CAPS: Unifying Attention, Recurrence, and Alignment in Transformer-based Time Series Forecasting

Viresh Pati, Yubin Kim, Vinh Pham, Jevon Twitty, Shihao Yang, Jiecheng Lu

Main category: cs.LG

TL;DR: CAPS is a structured attention mechanism for time series forecasting that decouples global trends, local shocks, and seasonal patterns using SO(2) rotations and three additive gating paths with linear complexity.

Details

Motivation: Standard softmax attention entangles different temporal structures through global normalization, while recurrent models sacrifice long-term, order-independent selection for causal structure. There's a need for attention mechanisms that can properly separate and model distinct temporal patterns in time series data.

Method: CAPS combines SO(2) rotations for phase alignment with three additive gating paths: Riemann softmax, prefix-product gates, and a Clock baseline. The Clock mechanism provides learned temporal weighting that modulates these paths through a shared notion of temporal importance, enabling decoupling of global trends, local shocks, and seasonal patterns.

Result: Experiments on long- and short-term forecasting benchmarks show CAPS surpasses vanilla softmax and linear attention mechanisms, and demonstrates competitive performance against seven strong baselines while maintaining linear complexity.

Conclusion: CAPS provides an effective structured attention mechanism for time series forecasting that properly separates different temporal structures while maintaining computational efficiency with linear complexity.

Abstract: This paper presents $\textbf{CAPS}$ (Clock-weighted Aggregation with Prefix-products and Softmax), a structured attention mechanism for time series forecasting that decouples three distinct temporal structures: global trends, local shocks, and seasonal patterns. Standard softmax attention entangles these through global normalization, while recent recurrent models sacrifice long-term, order-independent selection for order-dependent causal structure. CAPS combines SO(2) rotations for phase alignment with three additive gating paths – Riemann softmax, prefix-product gates, and a Clock baseline – within a single attention layer. We introduce the Clock mechanism, a learned temporal weighting that modulates these paths through a shared notion of temporal importance. Experiments on long- and short-term forecasting benchmarks surpass vanilla softmax and linear attention mechanisms and demonstrate competitive performance against seven strong baselines with linear complexity. Our code implementation is available at https://github.com/vireshpati/CAPS-Attention.

[504] RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

Main category: cs.LG

TL;DR: RLAnything is a reinforcement learning framework that dynamically optimizes environment, policy, and reward models through closed-loop optimization for LLM and agentic scenarios.

Details

Motivation: To create a more effective RL system for LLMs and agents by dynamically optimizing all components (environment, policy, reward) in a closed-loop manner rather than using static configurations.

Method: Uses closed-loop optimization where policy is trained with integrated step-wise and outcome feedback, reward model is optimized via consistency feedback, and environment is automatically adapted using critic feedback from both policy and reward models.

Result: Substantial gains across various LLM and agentic tasks: 9.1% improvement for Qwen3-VL-8B-Thinking on OSWorld, 18.7% and 11.9% improvements for Qwen2.5-7B-Instruct on AlfWorld and LiveBench respectively.

Conclusion: RLAnything’s closed-loop optimization of all RL components consistently improves system performance, with optimized reward signals outperforming human-labeled outcomes.

Abstract: We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

[505] TabPFN for Zero-shot Parametric Engineering Design Generation

Ke Wang, Yifan Tang, Nguyen Gia Hien Vu, Faez Ahmed, G. Gary Wang

Main category: cs.LG

TL;DR: Zero-shot parametric engineering design generation using TabPFN without task-specific training, enabling conditional design generation from limited reference samples.

Details

Motivation: Current deep generative models for engineering design require substantial computational cost, large training datasets, and extensive retraining when requirements change, limiting real-world applicability.

Method: Proposes a zero-shot generation framework based on TabPFN that generates design parameters sequentially conditioned on target performance indicators, using only limited reference samples without task-specific training or fine-tuning.

Result: Achieves competitive diversity across structured parametric design spaces, robust to sampling/resolution variations, low performance error (<2% for ship hulls), and significantly reduces computational overhead compared to diffusion models.

Conclusion: Zero-shot, data-efficient generation enables practical engineering design with rapid deployment, flexible adaptation to new settings, and easy integration into real-world workflows.

Abstract: Deep generative models for engineering design often require substantial computational cost, large training datasets, and extensive retraining when design requirements or datasets change, limiting their applicability in real-world engineering design workflow. In this work, we propose a zero-shot generation framework for parametric engineering design based on TabPFN, enabling conditional design generation using only a limited number of reference samples and without any task-specific model training or fine-tuning. The proposed method generates design parameters sequentially conditioned on target performance indicators, providing a flexible alternative to conventional generative models. The effectiveness of the proposed approach is evaluated on three engineering design datasets, i.e., ship hull design, BlendedNet aircraft, and UIUC airfoil. Experimental results demonstrate that the proposed method achieves competitive diversity across highly structured parametric design spaces, remains robust to variations in sampling, resolution and parameter dimensionality of geometry generation, and achieves a low performance error (e.g., less than 2% in generated ship hull designs’ performance). Compared with diffusion-based generative models, the proposed framework significantly reduces computational overhead and data requirements while preserving reliable generation performance. These results highlight the potential of zero-shot, data-efficient generation as a practical and efficient tool for engineering design, enabling rapid deployment, flexible adaptation to new design settings, and ease of integration into real-world engineering workflows.

[506] TopoPrune: Robust Data Pruning via Unified Latent Space Topology

Arjun Roy, Prajna G. Malettira, Manish Nagaraj, Kaushik Roy

Main category: cs.LG

TL;DR: TopoPrune: A topology-based data pruning framework that uses persistent homology to capture intrinsic data structure, enabling stable and robust dataset pruning even at high rates (90%) with cross-architecture transferability.

Details

Motivation: Existing geometric data pruning methods are unstable due to their reliance on extrinsic geometry, making them sensitive to latent space perturbations and poor at cross-architecture transfer. There's a need for more stable, intrinsic approaches to data pruning.

Method: Two-scale topological approach: (1) topology-aware manifold approximation for global low-dimensional embedding, (2) differentiable persistent homology for local topological optimization to rank samples by structural complexity.

Result: TopoPrune achieves high accuracy and precision at significant pruning rates (90%), shows exceptional robustness to noise in latent features, and demonstrates superior transferability across diverse network architectures.

Conclusion: Topology provides a stable, principled framework for robust data-efficient learning, with TopoPrune offering promising results for stable data pruning across architectures.

Abstract: Geometric data pruning methods, while practical for leveraging pretrained models, are fundamentally unstable. Their reliance on extrinsic geometry renders them highly sensitive to latent space perturbations, causing performance to degrade during cross-architecture transfer or in the presence of feature noise. We introduce TopoPrune, a framework which resolves this challenge by leveraging topology to capture the stable, intrinsic structure of data. TopoPrune operates at two scales, (1) utilizing a topology-aware manifold approximation to establish a global low-dimensional embedding of the dataset. Subsequently, (2) it employs differentiable persistent homology to perform a local topological optimization on the manifold embeddings, ranking samples by their structural complexity. We demonstrate that our unified dual-scale topological approach ensures high accuracy and precision, particularly at significant dataset pruning rates (e.g., 90%). Furthermore, through the inherent stability properties of topology, TopoPrune is (a) exceptionally robust to noise perturbations of latent feature embeddings and (b) demonstrates superior transferability across diverse network architectures. This study demonstrates a promising avenue towards stable and principled topology-based frameworks for robust data-efficient learning.

[507] Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Sun, Boyu Wang, Pingzhao Hu

Main category: cs.LG

TL;DR: EDT-Former is an entropy-guided dynamic token transformer that generates tokens aligned with informative molecular patches for better molecular graph understanding, enabling alignment between frozen graph encoders and LLMs without tuning the LLM backbone.

Details

Motivation: Current LLMs struggle with molecular graph understanding. Existing graph-LLM bridges use fixed-length static tokens designed for vision tasks, overlooking stereochemistry and substructural context, and require costly LLM fine-tuning, limiting efficiency and generalization.

Method: EDT-Former generates dynamic tokens aligned with informative molecular patches using entropy guidance, preserving both local and global structural features. It enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding embedding layer).

Result: Achieves state-of-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), demonstrating effectiveness for scalable and generalizable multimodal molecular understanding.

Conclusion: EDT-Former provides an effective approach for molecular graph understanding by generating dynamic tokens aligned with informative molecular patches, enabling efficient alignment between graph encoders and LLMs without backbone tuning.

Abstract: Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which is originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient finetuning, and achieves stateof-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding

[508] On the Sample Efficiency of Inverse Dynamics Models for Semi-Supervised Imitation Learning

Sacha Morin, Moonsub Byeon, Alexia Jolicoeur-Martineau, Sébastien Lachapelle

Main category: cs.LG

TL;DR: SSIL methods using inverse dynamics models (IDM) show advantages over behavior cloning due to IDM’s lower complexity and reduced stochasticity, leading to improved sample efficiency.

Details

Motivation: The paper investigates why IDM-based methods outperform behavior cloning in semi-supervised imitation learning, seeking to understand the underlying reasons for their superior sample efficiency.

Method: Analyzes IDM-based policies theoretically and empirically, comparing VM-IDM and IDM labeling approaches. Uses statistical learning theory insights and experiments with unified video-action prediction (UVA) architectures. Proposes improved LAPO algorithm for latent action policy learning.

Result: Shows that IDM-based policies have advantages due to: (1) ground-truth IDM being in lower complexity hypothesis class than expert policy, and (2) ground-truth IDM being less stochastic than expert policy. Demonstrates these claims through theory and experiments.

Conclusion: IDM-based methods in SSIL outperform behavior cloning primarily due to better sample efficiency from IDM’s lower complexity and reduced stochasticity, leading to improved policy learning with limited labeled data.

Abstract: Semi-supervised imitation learning (SSIL) consists in learning a policy from a small dataset of action-labeled trajectories and a much larger dataset of action-free trajectories. Some SSIL methods learn an inverse dynamics model (IDM) to predict the action from the current state and the next state. An IDM can act as a policy when paired with a video model (VM-IDM) or as a label generator to perform behavior cloning on action-free data (IDM labeling). In this work, we first show that VM-IDM and IDM labeling learn the same policy in a limit case, which we call the IDM-based policy. We then argue that the previously observed advantage of IDM-based policies over behavior cloning is due to the superior sample efficiency of IDM learning, which we attribute to two causes: (i) the ground-truth IDM tends to be contained in a lower complexity hypothesis class relative to the expert policy, and (ii) the ground-truth IDM is often less stochastic than the expert policy. We argue these claims based on insights from statistical learning theory and novel experiments, including a study of IDM-based policies using recent architectures for unified video-action prediction (UVA). Motivated by these insights, we finally propose an improved version of the existing LAPO algorithm for latent action policy learning.

[509] Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Bohan Wang, Zewen Liu, Lu Lin, Hui Liu, Li Xiong, Ming Jin, Wei Jin

Main category: cs.LG

TL;DR: TSEF attack shows temporal consistency in time series explanations can be misleading - predictions and explanations can be adversarially decoupled to achieve targeted misclassification while maintaining plausible explanations.

Details

Motivation: Current interpretable time series deep learning systems assume temporal consistency in explanations indicates robustness, but this assumption may fail, allowing adversaries to manipulate predictions while keeping explanations plausible.

Method: Proposes TSEF (Time Series Explanation Fooler), a dual-target attack that jointly manipulates classifier and explainer outputs to achieve targeted misclassification while keeping explanations consistent with a chosen reference rationale.

Result: Across multiple datasets and explainer backbones, TSEF successfully achieves targeted prediction changes while maintaining explanation consistency, revealing that explanation stability is a misleading proxy for decision robustness.

Conclusion: Explanation stability alone is insufficient for trustworthy time series tasks; coupling-aware robustness evaluations are needed to ensure both prediction and explanation reliability.

Abstract: Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can be adversarially decoupled, enabling targeted misclassification while the explanation remains plausible and consistent with a chosen reference rationale. We propose TSEF (Time Series Explanation Fooler), a dual-target attack that jointly manipulates the classifier and explainer outputs. In contrast to single-objective misclassification attacks that disrupt explanation and spread attribution mass broadly, TSEF achieves targeted prediction changes while keeping explanations consistent with the reference. Across multiple datasets and explainer backbones, our results consistently reveal that explanation stability is a misleading proxy for decision robustness and motivate coupling-aware robustness evaluations for trustworthy time series tasks.

[510] Privately Fine-Tuned LLMs Preserve Temporal Dynamics in Tabular Data

Lucas Rosenblatt, Peihan Liu, Ryan McKenna, Natalia Ponomareva

Main category: cs.LG

TL;DR: PATH is a differentially private synthetic data generation framework for longitudinal datasets that preserves temporal coherence using autoregressive LLMs, outperforming traditional marginal-based methods.

Details

Motivation: Existing differentially private synthetic data methods focus on i.i.d. tabular data, neglecting temporal complexity in longitudinal datasets like EHRs where users contribute sequences of events. Flattening temporal data loses coherence even when preserving marginals.

Method: PATH treats entire tables as synthesis units and leverages autoregressive capabilities of privately fine-tuned large language models to capture long-range dependencies and temporal patterns in sequential data.

Result: PATH reduces distributional distance to real trajectories by over 60% and state transition errors by nearly 50% compared to leading marginal mechanisms while achieving similar marginal fidelity.

Conclusion: The framework effectively addresses temporal coherence in differentially private synthetic data generation for longitudinal datasets, demonstrating superior performance over traditional flattening approaches.

Abstract: Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows where each record corresponds to a unique individual. This perspective neglects the temporal complexity in longitudinal datasets, such as electronic health records, where a user contributes an entire (sub) table of sequential events. While practitioners might attempt to model such data by flattening user histories into high-dimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient. Flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and reduces state transition errors by nearly 50% compared to leading marginal mechanisms while achieving similar marginal fidelity.

[511] Provable Effects of Data Replay in Continual Learning: A Feature Learning Perspective

Meng Ding, Jinhui Xu, Kaiyi Ji

Main category: cs.LG

TL;DR: Theoretical analysis shows that even with full data replay in continual learning, catastrophic forgetting can occur when cumulative noise from later tasks dominates earlier task signals, but sufficient signal accumulation enables recovery of poorly learned earlier tasks.

Details

Motivation: To provide a comprehensive theoretical framework for analyzing full data-replay training in continual learning from a feature learning perspective, as the theoretical effectiveness of full data replay remains largely unexplored despite being considered simple yet effective.

Method: Adopts a multi-view data model and identifies signal-to-noise ratio (SNR) as critical factor affecting forgetting. Focuses on task-incremental binary classification across M tasks, analyzing the interplay between signal learning and noise memorization.

Result: Two key findings: (1) forgetting can still occur under full replay when cumulative noise from later tasks dominates earlier task signals; (2) with sufficient signal accumulation, data replay can recover earlier tasks even if initial learning was poor. Also discovers task ordering insight: prioritizing higher-signal tasks facilitates learning of lower-signal tasks and prevents catastrophic forgetting.

Conclusion: Provides theoretical framework showing that full data replay effectiveness depends on signal-to-noise ratio dynamics, with task ordering playing crucial role in preventing catastrophic forgetting through signal accumulation.

Abstract: Continual learning (CL) aims to train models on a sequence of tasks while retaining performance on previously learned ones. A core challenge in this setting is catastrophic forgetting, where new learning interferes with past knowledge. Among various mitigation strategies, data-replay methods, where past samples are periodically revisited, are considered simple yet effective, especially when memory constraints are relaxed. However, the theoretical effectiveness of full data replay, where all past data is accessible during training, remains largely unexplored. In this paper, we present a comprehensive theoretical framework for analyzing full data-replay training in continual learning from a feature learning perspective. Adopting a multi-view data model, we identify the signal-to-noise ratio (SNR) as a critical factor affecting forgetting. Focusing on task-incremental binary classification across $M$ tasks, our analysis verifies two key conclusions: (1) forgetting can still occur under full replay when the cumulative noise from later tasks dominates the signal from earlier ones; and (2) with sufficient signal accumulation, data replay can recover earlier tasks-even if their initial learning was poor. Notably, we uncover a novel insight into task ordering: prioritizing higher-signal tasks not only facilitates learning of lower-signal tasks but also helps prevent catastrophic forgetting. We validate our theoretical findings through synthetic and real-world experiments that visualize the interplay between signal learning and noise memorization across varying SNRs and task correlation regimes.

[512] BiTimeCrossNet: Time-Aware Self-Supervised Learning for Pediatric Sleep

Saurav Raj Pandey, Harlin Lee

Main category: cs.LG

TL;DR: BTCNet is a multimodal self-supervised learning framework for long physiological recordings that incorporates temporal context and cross-modal attention without requiring task labels.

Details

Motivation: Existing approaches for physiological signal analysis often treat short segments as independent samples, ignoring important temporal context within longer recordings like sleep studies. The authors aim to incorporate when each segment occurs within its parent recording and learn pairwise interactions between different physiological signals.

Method: BTCNet uses a multimodal self-supervised learning framework that: 1) Incorporates temporal information about when each segment occurs within longer recordings, 2) Learns pairwise interactions between physiological signals via cross-attention mechanisms, 3) Operates without task labels or sequence-level supervision, and 4) Can be applied to various downstream tasks through frozen-backbone linear probing.

Result: BTCNet consistently outperforms non-time-aware variants across six pediatric sleep tasks including sleep staging, arousal detection, and respiratory event detection. The gains generalize to an independent pediatric dataset. Compared to existing multimodal self-supervised sleep models, BTCNet achieves strong performance, particularly on respiration-related tasks.

Conclusion: Incorporating temporal context and cross-modal attention in self-supervised learning improves performance on physiological signal analysis tasks, with particular benefits for respiration-related applications in sleep studies.

Abstract: We present BiTimeCrossNet (BTCNet), a multimodal self-supervised learning framework for long physiological recordings such as overnight sleep studies. While many existing approaches train on short segments treated as independent samples, BTCNet incorporates information about when each segment occurs within its parent recording, for example within a sleep session. BTCNet further learns pairwise interactions between physiological signals via cross-attention, without requiring task labels or sequence-level supervision. We evaluate BTCNet on pediatric sleep data across six downstream tasks, including sleep staging, arousal detection, and respiratory event detection. Under frozen-backbone linear probing, BTCNet consistently outperforms an otherwise identical non-time-aware variant, with gains that generalize to an independent pediatric dataset. Compared to existing multimodal self-supervised sleep models, BTCNet achieves strong performance, particularly on respiration-related tasks.

[513] TraceNAS: Zero-shot LLM Pruning via Gradient Trace Correlation

Prajna G. Malettira, Manish Nagaraj, Arjun Roy, Shubham Negi, Kaushik Roy

Main category: cs.LG

TL;DR: TraceNAS is a training-free Neural Architecture Search framework for structured pruning of LLMs that jointly explores depth and width pruning using a zero-shot proxy to maintain loss landscape alignment with pretrained models.

Details

Motivation: Existing structured pruning methods for LLMs either evaluate components in isolation (ignoring global dependencies) or use training-aware methods that are computationally expensive. There's a need for efficient pruning that captures global structural dependencies without expensive training.

Method: Proposes TraceNAS, a training-free NAS framework that jointly explores structured pruning of LLM depth and width. Uses a scale-invariant zero-shot proxy to identify pruned models that maintain high loss landscape alignment with pretrained models, selecting models with maximal performance potential during post-pruning training.

Result: TraceNAS achieves high-fidelity discovery of pruned models on a single GPU in 8.5 hours (10× reduction in GPU-hours compared to training-aware methods). Evaluations on Llama and Qwen families show competitive performance with training-aware baselines across commonsense and reasoning benchmarks.

Conclusion: TraceNAS provides an efficient training-free approach for structured pruning of LLMs that captures global dependencies and maintains loss landscape alignment, offering computational efficiency while achieving competitive performance with training-aware methods.

Abstract: Structured pruning is essential for efficient deployment of Large Language Models (LLMs). The varying sensitivity of LLM sub-blocks to pruning necessitates the identification of optimal non-uniformly pruned models. Existing methods evaluate the importance of layers, attention heads, or weight channels in isolation. Such localized focus ignores the complex global structural dependencies that exist across the model. Training-aware structured pruning addresses global dependencies, but its computational cost can be just as expensive as post-pruning training. To alleviate the computational burden of training-aware pruning and capture global structural dependencies, we propose TraceNAS, a training-free Neural Architecture Search (NAS) framework that jointly explores structured pruning of LLM depth and width. TraceNAS identifies pruned models that maintain a high degree of loss landscape alignment with the pretrained model using a scale-invariant zero-shot proxy, effectively selecting models that exhibit maximal performance potential during post-pruning training. TraceNAS is highly efficient, enabling high-fidelity discovery of pruned models on a single GPU in 8.5 hours, yielding a 10$\times$ reduction in GPU-hours compared to training-aware methods. Evaluations on the Llama and Qwen families demonstrate that TraceNAS is competitive with training-aware baselines across commonsense and reasoning benchmarks.

[514] VerIde ECG Biometrics: Verification and Identification

Scagnetto Arjuna

Main category: cs.LG

TL;DR: ECG biometrics study shows ECG carries strong individual signatures, with deep learning models achieving high verification/identification accuracy, raising privacy concerns about ECG anonymization.

Details

Motivation: To investigate how strongly ECG signals can be linked to individuals at large scale, evaluating privacy implications of ECG biometrics and the effectiveness of anonymization methods.

Method: Used MLP-based embedding networks on tabular ECG features, then adopted ArcFace embedding models on both features and raw ECG waveforms, with consistent normalization and large training sets. Implemented two-stage pipeline for open-set identification.

Result: Achieved high verification performance (TAR=0.908 @ FAR=1e-3; EER=2.53%), strong closed-set identification (Rank@1=0.812), and excellent open-set identification (DIR@FAR up to 0.976). Showed performance improves with waveforms vs features and larger datasets.

Conclusion: ECG carries measurable individual signatures that enable re-identification even with tabular features, with deep learning models further amplifying this capability, necessitating serious consideration of privacy implications and operational protocols.

Abstract: This work studies electrocardiogram (ECG) biometrics at large scale, evaluating how strongly an ECG can be linked to an individual and, consequently, how its anonymization may be compromised. We show that identity information is already present in tabular representations (fiducial features): even a simple MLP-based embedding network yields non-trivial performance, indicating that anonymization based solely on releasing features does not guarantee privacy. We then adopt embedding-based deep learning models (ArcFace), first on features and then on ECG waveforms, showing a performance jump when moving from tabular inputs to waveforms, and a further gain with larger training sets and consistent normalization across train/val/test. On a large-scale test set, verification achieves high TAR at strict FAR thresholds (TAR=0.908 @ FAR=1e-3; TAR=0.820 @ FAR=1e-4) with EER=2.53% (all-vs-all); closed-set identification yields Rank@1=0.812 and Rank@10=0.910. In open-set, a two-stage pipeline (top-K shortlist on embeddings + re-ranking) reaches DIR@FAR up to 0.976 at FAR=1e-3 and 1e-4. Overall, the results show that ECG carries a measurable individual signature: re-identification is already possible with tabular features and is further amplified by embedding-based models, making privacy implications and realistic operational protocols essential to consider.

[515] Cross-Temporal Attention Fusion (CTAF) for Multimodal Physiological Signals in Self-Supervised Learning

Arian Khorasani, Théophile Demazure

Main category: cs.LG

TL;DR: CTAF is a self-supervised multimodal fusion module that learns soft bidirectional alignments between asynchronous EEG and peripheral physiology signals using time-aware cross attention and alignment-regularized contrastive learning.

Details

Motivation: Most multimodal fusion methods ignore or handle temporal asynchrony between EEG and peripheral physiology with costly warping techniques, failing to properly model the coupling between central and autonomic nervous systems in psychophysiological time series.

Method: Proposes Cross-Temporal Attention Fusion (CTAF) with: 1) time-aware cross attention for soft bidirectional alignments, 2) lightweight fusion gate for robust clip embeddings, and 3) alignment-regularized contrastive objectives with optional weak supervision.

Result: On K-EmoCon dataset, CTAF yields higher cosine margins for matched pairs, better cross-modal token retrieval within one second, and competitive performance on three-bin accuracy and macro-F1 while using few labels.

Conclusion: CTAF represents a step toward label-efficient, generalizable EEG-peripheral fusion under temporal asynchrony by directly modeling correspondence between modalities and accounting for nervous system coupling.

Abstract: We study multimodal affect modeling when EEG and peripheral physiology are asynchronous, which most fusion methods ignore or handle with costly warping. We propose Cross-Temporal Attention Fusion (CTAF), a self-supervised module that learns soft bidirectional alignments between modalities and builds a robust clip embedding using time-aware cross attention, a lightweight fusion gate, and alignment-regularized contrastive objectives with optional weak supervision. On the K-EmoCon dataset, under leave-one-out cross-validation evaluation, CTAF yields higher cosine margins for matched pairs and better cross-modal token retrieval within one second, and it is competitive with the baseline on three-bin accuracy and macro-F1 while using few labels. Our contributions are a time-aware fusion mechanism that directly models correspondence, an alignment-driven self-supervised objective tailored to EEG and physiology, and an evaluation protocol that measures alignment quality itself. Our approach accounts for the coupling between the central and autonomic nervous systems in psychophysiological time series. These results indicate that CTAF is a strong step toward label-efficient, generalizable EEG-peripheral fusion under temporal asynchrony.

[516] LEMON: Local Explanations via Modality-aware OptimizatioN

Yu Qin, Phillip Sloan, Raul Santos-Rodriguez, Majid Mirmehdi, Telmo de Menezes e Silva Filho

Main category: cs.LG

TL;DR: LEMON is a model-agnostic framework for local explanations of multimodal predictions that produces unified explanations disentangling modality-level contributions and feature-level attributions with high computational efficiency.

Details

Motivation: Existing explainability methods for multimodal models are often single-modal, architecture-dependent, or too computationally expensive to run at scale, creating a need for efficient, model-agnostic multimodal explanation methods.

Method: LEMON fits a single modality-aware surrogate with group-structured sparsity to produce unified explanations, treating the predictor as a black box and requiring relatively few forward passes while remaining faithful under repeated perturbations.

Result: LEMON achieves competitive deletion-based faithfulness while reducing black-box evaluations by 35-67 times and runtime by 2-8 times compared to strong multimodal baselines across vision-language question answering and clinical prediction tasks.

Conclusion: LEMON provides an efficient, model-agnostic framework for local explanations of multimodal predictions that successfully disentangles modality-level contributions and feature-level attributions while being computationally practical for large-scale use.

Abstract: Multimodal models are ubiquitous, yet existing explainability methods are often single-modal, architecture-dependent, or too computationally expensive to run at scale. We introduce LEMON (Local Explanations via Modality-aware OptimizatioN), a model-agnostic framework for local explanations of multimodal predictions. LEMON fits a single modality-aware surrogate with group-structured sparsity to produce unified explanations that disentangle modality-level contributions and feature-level attributions. The approach treats the predictor as a black box and is computationally efficient, requiring relatively few forward passes while remaining faithful under repeated perturbations. We evaluate LEMON on vision-language question answering and a clinical prediction task with image, text, and tabular inputs, comparing against representative multimodal baselines. Across backbones, LEMON achieves competitive deletion-based faithfulness while reducing black-box evaluations by 35-67 times and runtime by 2-8 times compared to strong multimodal baselines.

[517] Self-Hinting Language Models Enhance Reinforcement Learning

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian

Main category: cs.LG

TL;DR: SAGE improves GRPO for LLM alignment by adding privileged hints during training to increase rollout diversity under sparse rewards, preventing advantage collapse while maintaining the same terminal reward structure.

Details

Motivation: GRPO struggles with sparse terminal rewards because rollouts within a group often receive identical rewards, causing relative advantages to collapse and learning to stall. The paper aims to address this limitation while maintaining the verifiable objective alignment framework.

Method: SAGE injects privileged hints during training where the model samples a compact hint (e.g., plan or decomposition) and generates solutions conditioned on both prompt and hint. This increases within-group outcome diversity under finite sampling while keeping the same terminal reward unchanged. At test time, the model uses no hints.

Result: Experiments across 6 benchmarks with 3 LLMs show SAGE consistently outperforms GRPO, with average improvements of +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct, and +1.3 on Qwen3-4B-Instruct.

Conclusion: SAGE effectively addresses GRPO’s limitation with sparse rewards by using self-hints to increase rollout diversity, preventing advantage collapse while maintaining the same verifiable reward structure, leading to consistent performance improvements across multiple LLMs.

Abstract: Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $τ$ conditioned on $(x,h)$. Crucially, the task reward $R(x,τ)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner’s bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.

[518] Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs

Benjamin D. Shaffer, Shawn Koohy, Brooks Kinch, M. Ani Hsieh, Nathaniel Trask

Main category: cs.LG

TL;DR: Geo-NeW: A physics foundation model using neural Whitney forms to solve PDEs on arbitrary geometries while preserving physical conservation laws through finite element exterior calculus.

Details

Motivation: Develop physics foundation models for science and engineering that provide real-time PDE solutions preserving structure and accuracy when adapting to unseen geometries, addressing limitations of conventional methods on out-of-distribution domains.

Method: General-Geometry Neural Whitney Forms (Geo-NeW): data-driven finite element method jointly learning differential operator and compatible reduced finite element spaces defined on underlying geometry. Uses transformer-based encoding of discretized mesh and finite element exterior calculus to preserve physical conservation laws exactly.

Result: State-of-the-art performance on several steady-state PDE benchmarks and significant improvement over conventional baselines on out-of-distribution geometries.

Conclusion: Geo-NeW provides a powerful inductive bias for learning neural PDEs by explicitly connecting underlying geometry and boundary conditions to solutions, enabling better generalization to unseen domains while preserving physical structure.

Abstract: We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeW): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs, which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring the existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks, and provides a significant improvement over conventional baselines on out-of-distribution geometries.

[519] DynSplit-KV: Dynamic Semantic Splitting for KVCache Compression in Efficient Long-Context LLM Inference

Jiancai Ye, Jun Liu, Qingchen Li, Tianlang Zhao, Hanbin Zhang, Jiayi Pan, Ningyi Xu, Guohao Dai

Main category: cs.LG

TL;DR: DynSplit-KV: A dynamic semantic splitting method for KV Cache compression that improves accuracy and reduces memory overhead in long-context LLM inference.

Details

Motivation: KV Cache memory footprint grows significantly in long-context scenarios, creating bottlenecks. Current compression methods use rigid splitting strategies that cause substantial accuracy degradation (5.5%-55.1%) due to misaligned semantic boundaries.

Method: Proposes DynSplit-KV with: (1) dynamic importance-aware delimiter selection strategy to identify semantic boundaries, improving accuracy by 49.9%; (2) uniform mapping strategy to transform variable-length semantic blocks into fixed-length format, reducing inference overhead.

Result: Achieves highest accuracy, 2.2x speedup compared to FlashAttention, and 2.6x peak memory reduction in long-context scenarios. Reduces inference overhead by 4.9x.

Conclusion: Dynamic semantic splitting is crucial for effective KV Cache compression. DynSplit-KV addresses limitations of rigid splitting methods by adapting to semantic boundaries while maintaining efficiency.

Abstract: Although Key-Value (KV) Cache is essential for efficient large language models (LLMs) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers from significant accuracy degradation (ranging from 5.5% to 55.1%) across different scenarios, owing to the scenario-dependent nature of the semantic boundaries. This highlights the necessity of dynamic semantic splitting to match semantics. To achieve this, we face two challenges. (1) Improper delimiter selection misaligns semantics with the KVCache, resulting in 28.6% accuracy loss. (2) Variable-length blocks after splitting introduce over 73.1% additional inference overhead. To address the above challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. We propose: (1) a dynamic importance-aware delimiter selection strategy, improving accuracy by 49.9%. (2) A uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, 2.2x speedup compared with FlashAttention and 2.6x peak memory reduction in long-context scenarios.

[520] Causality–Δ: Jacobian-Based Dependency Analysis in Flow Matching Models

Reza Rezvan, Gustav Gille, Moritz Schauer, Richard Torkar

Main category: cs.LG

TL;DR: Flow matching analysis shows Jacobian-vector products reveal feature dependencies in generative flows, with applications to attribute correlation control in image generation.

Details

Motivation: To understand how small latent perturbations propagate through flow matching models and reveal dependency structures in generated features, enabling better control over attribute correlations in generated data.

Method: Derived closed-form expressions for optimal drift and its Jacobian in Gaussian and mixture-of-Gaussian settings, used numerical Jacobian-vector products to analyze flows, and composed flows with attribute classifiers to create attribute-level JVP estimators.

Result: Numerical JVPs recover analytical Jacobians in synthetic benchmarks; attribute-level JVP estimators recover empirical correlations on MNIST and CelebA; conditioning on small classifier-Jacobian norms reduces correlations consistent with hypothesized common-cause structure.

Conclusion: Jacobian-vector products provide practical insights into dependency structures in flow matching models, enabling correlation analysis and control in generated features, though conditioning on norms is not equivalent to formal causal interventions.

Abstract: Flow matching learns a velocity field that transports a base distribution to data. We study how small latent perturbations propagate through these flows and show that Jacobian-vector products (JVPs) provide a practical lens on dependency structure in the generated features. We derive closed-form expressions for the optimal drift and its Jacobian in Gaussian and mixture-of-Gaussian settings, revealing that even globally nonlinear flows admit local affine structure. In low-dimensional synthetic benchmarks, numerical JVPs recover the analytical Jacobians. In image domains, composing the flow with an attribute classifier yields an attribute-level JVP estimator that recovers empirical correlations on MNIST and CelebA. Conditioning on small classifier-Jacobian norms reduces correlations in a way consistent with a hypothesized common-cause structure, while we emphasize that this conditioning is not a formal do intervention.

[521] Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning

Wenquan Lu, Hai Huang, Randall Balestriero

Main category: cs.LG

TL;DR: Prompt augmentation enables stable long-horizon RL training for math reasoning LLMs by using diverse reasoning templates to prevent entropy collapse, achieving SOTA results on math benchmarks.

Details

Motivation: Prior RL methods for math reasoning LLMs suffer from entropy collapse during training, forcing short training horizons and limiting exploration. Existing approaches also rely on single fixed reasoning prompts, reducing diversity.

Method: Introduces prompt augmentation strategy that instructs models to generate reasoning traces under diverse templates and formats, increasing rollout diversity. Enables stable scaling of training duration without KL regularization by tolerating low-entropy regimes.

Result: Qwen2.5-Math-1.5B model trained with prompt augmentation on MATH Level 3-5 dataset achieves state-of-the-art performance: 44.5% per-benchmark and 51.3% per-question accuracy on AIME24, AMC, MATH500, Minerva, and OlympiadBench.

Conclusion: Prompt augmentation addresses entropy collapse in RL training for math reasoning LLMs, enabling stable long-horizon training and achieving SOTA results without KL regularization.

Abstract: Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 44.5 per-benchmark accuracy and 51.3 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt-augmentation-GRPO.

[522] Joint Learning of Hierarchical Neural Options and Abstract World Model

Wasu Top Piriyakulkij, Wolfgang Lehrach, Kevin Ellis, Kevin Murphy

Main category: cs.LG

TL;DR: AgentOWL learns hierarchical neural options and abstract world models for skill composition in a sample-efficient way, tested on Object-Centric Atari games.

Details

Motivation: To build AI agents that can perform new skills by composing existing skills, addressing the data inefficiency of current hierarchical reinforcement learning methods.

Method: Proposes AgentOWL that jointly learns an abstract world model (abstracting states and time) and hierarchical neural options in a sample-efficient manner.

Result: Demonstrates learning more skills with much less data than baseline methods on Object-Centric Atari games.

Conclusion: AgentOWL enables efficient skill acquisition through hierarchical composition, advancing sample-efficient hierarchical reinforcement learning.

Abstract: Building agents that can perform new skills by composing existing skills is a long-standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model-free hierarchical reinforcement algorithms need a lot of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns – in a sample efficient way – an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object-Centric Atari games, that our method can learn more skills using much less data than baseline methods.

[523] Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations

Yuxuan Yao, Haonan Sheng, Qingsong Lv, Han Wu, Shuqi Liu, Zehua Liu, Zengyan Liu, Jiahui Gao, Haochen Tan, Xiaojin Fu, Haoli Bai, Hing Cheung So, Zhijiang Guo, Linqi Song

Main category: cs.LG

TL;DR: Streaming Merging with ARM (Activation-guided Rotation-aware Merging) is a novel model updating paradigm that treats merging as iterative optimization, using activation subspaces to guide parameter updates and surpass converged SFT models.

Details

Motivation: Current model merging techniques are inefficient post-hoc refinements that fail to capture dynamic optimization benefits of supervised fine-tuning (SFT). There's a need for more efficient adaptation techniques for large language models that can leverage early SFT checkpoints effectively.

Method: Proposes Streaming Merging paradigm with ARM strategy. ARM treats merging coefficients as learning rates and derives rotation vectors from activation subspaces to steer parameter updates along data-driven trajectories. It aligns semantic subspaces to preserve geometric structure of parameter evolution, requiring only early SFT checkpoints.

Result: ARM surpasses fully converged SFT models through iterative merging. Experimental results across model scales (1.7B to 14B) and diverse domains (math, code) demonstrate ARM can transcend converged checkpoints. Provides scalable, lightweight framework for efficient model adaptation.

Conclusion: ARM offers an effective model updating paradigm that conceptualizes merging as iterative optimization, enabling efficient adaptation of large language models while preserving geometric structure and outperforming conventional fine-tuning approaches.

Abstract: The escalating scale of Large Language Models (LLMs) necessitates efficient adaptation techniques. Model merging has gained prominence for its efficiency and controllability. However, existing merging techniques typically serve as post-hoc refinements or focus on mitigating task interference, often failing to capture the dynamic optimization benefits of supervised fine-tuning (SFT). In this work, we propose Streaming Merging, an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. Central to this paradigm is \textbf{ARM} (\textbf{A}ctivation-guided \textbf{R}otation-aware \textbf{M}erging), a strategy designed to approximate gradient descent dynamics. By treating merging coefficients as learning rates and deriving rotation vectors from activation subspaces, ARM effectively steers parameter updates along data-driven trajectories. Unlike conventional linear interpolation, ARM aligns semantic subspaces to preserve the geometric structure of high-dimensional parameter evolution. Remarkably, ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model. Experimental results across model scales (1.7B to 14B) and diverse domains (e.g., math, code) demonstrate that ARM can transcend converged checkpoints. Extensive experiments show that ARM provides a scalable and lightweight framework for efficient model adaptation.

[524] Membership Inference Attacks from Causal Principles

Mathieu Even, Clément Berenfeld, Linus Bleistein, Tudor Cebere, Julie Josse, Aurélien Bellet

Main category: cs.LG

TL;DR: Framing membership inference attacks as causal inference to address biases in existing evaluation protocols and enable reliable memorization measurement without repeated retraining.

Details

Motivation: Standard MIA evaluation requires repeated retraining which is computationally expensive for large models. Existing one-run and zero-run methods have unclear statistical validity and suffer from biases that need to be formally addressed.

Method: Frame MIA evaluation as causal inference problem, defining memorization as causal effect of including data point in training set. Derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees.

Result: Experiments on real-world data show the approach enables reliable memorization measurement even when retraining is impractical and under distribution shift.

Conclusion: Provides principled foundation for privacy evaluation in modern AI systems by addressing biases in existing MIA evaluation protocols through causal inference framework.

Abstract: Membership Inference Attacks (MIAs) are widely used to quantify training data memorization and assess privacy risks. Standard evaluation requires repeated retraining, which is computationally costly for large models. One-run methods (single training with randomized data inclusion) and zero-run methods (post hoc evaluation) are often used instead, though their statistical validity remains unclear. To address this gap, we frame MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations popular for LLMs are confounded by non-random membership assignment. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. Experiments on real-world data show that our approach enables reliable memorization measurement even when retraining is impractical and under distribution shift, providing a principled foundation for privacy evaluation in modern AI systems.

[525] R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

Main category: cs.LG

TL;DR: CADS is a novel data synthesis approach that uses collective intelligence and adversarial learning to autonomously generate high-quality, diverse, and challenging multimodal training data for MLLMs.

Details

Motivation: Current MLLMs need better training data for complex real-world tasks. Manual data collection is expensive and limited. The paper aims to develop autonomous data synthesis techniques to create multimodal training data that can effectively enhance MLLMs.

Method: Proposes Collective Adversarial Data Synthesis (CADS) with two cyclic phases: CAD-Generate (uses collective knowledge for diverse multimodal data generation) and CAD-Judge (collaboratively assesses data quality). Includes Adversarial Context Optimization to optimize generation context for challenging data.

Result: Constructed MMSynthetic-20K dataset and trained R1-SyntheticVL model, which demonstrates superior performance on various benchmarks compared to existing approaches.

Conclusion: CADS provides an effective framework for autonomous multimodal data synthesis that enhances MLLM performance through high-quality, diverse, and challenging training data generation.

Abstract: In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

[526] From Tokens to Numbers: Continuous Number Modeling for SVG Generation

Michael Ogezi, Martin Bell, Freda Shi, Ethan Smith

Main category: cs.LG

TL;DR: CNM (Continuous Number Modeling) is a new approach for vector graphics generation that directly models numerical parameters as continuous values instead of discrete tokens, improving training speed and visual quality for SVG generation tasks.

Details

Motivation: Vector graphics like SVGs offer benefits over raster images (flexibility, size efficiency, editing ease) but are less explored due to inefficient encoding of numerical parameters as long token sequences, which slows training, reduces accuracy, and hurts generalization.

Method: Proposes Continuous Number Modeling (CNM) that directly models numbers as first-class continuous values rather than discrete tokens, aligning model inputs with data’s continuous nature. Trains multimodal transformer on 2M raster-to-SVG samples, then fine-tunes via reinforcement learning with perceptual feedback.

Result: Improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. Establishes CNM as practical and efficient approach for high-quality vector generation.

Conclusion: CNM addresses core challenges in vector graphics generation by eliminating discretization artifacts from token-based encoding, offering a more mathematically elegant approach with potential for broader applications beyond SVG generation.

Abstract: For certain image generation tasks, vector graphics such as Scalable Vector Graphics (SVGs) offer clear benefits such as increased flexibility, size efficiency, and editing ease, but remain less explored than raster-based approaches. A core challenge is that the numerical, geometric parameters, which make up a large proportion of SVGs, are inefficiently encoded as long sequences of tokens. This slows training, reduces accuracy, and hurts generalization. To address these problems, we propose Continuous Number Modeling (CNM), an approach that directly models numbers as first-class, continuous values rather than discrete tokens. This formulation restores the mathematical elegance of the representation by aligning the model’s inputs with the data’s continuous nature, removing discretization artifacts introduced by token-based encoding. We then train a multimodal transformer on 2 million raster-to-SVG samples, followed by fine-tuning via reinforcement learning using perceptual feedback to further improve visual quality. Our approach improves training speed by over 30% while maintaining higher perceptual fidelity compared to alternative approaches. This work establishes CNM as a practical and efficient approach for high-quality vector generation, with potential for broader applications. We make our code available http://github.com/mikeogezi/CNM.

[527] Robustness as an Emergent Property of Task Performance

Shir Ashury-Tahan, Ariel Gera, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen

Main category: cs.LG

TL;DR: Paper shows that as models achieve high performance on tasks, robustness emerges naturally, with strong correlation between task competence and robustness across diverse datasets and configurations.

Details

Motivation: To investigate whether robustness is an independent capability that needs explicit measurement and improvement, or if it naturally emerges as models become competent at tasks.

Method: Empirical analysis of multiple models across diverse datasets and configurations (paraphrases, different temperatures) to examine correlation between task performance and robustness.

Result: Strong positive correlation found between task performance and robustness; robustness primarily driven by task-specific competence rather than inherent model-level properties.

Conclusion: Robustness emerges naturally as tasks saturate, suggesting explicit robustness measurement/improvement may warrant reduced emphasis; easier tasks become reliable for real-world deployment.

Abstract: Robustness is often regarded as a critical future challenge for real-world applications, where stability is essential. However, as models often learn tasks in a similar order, we hypothesize that easier tasks will be easier regardless of how they are presented to the model. Indeed, in this paper, we show that as models approach high performance on a task, robustness is effectively achieved. Through an empirical analysis of multiple models across diverse datasets and configurations (e.g., paraphrases, different temperatures), we find a strong positive correlation. Moreover, we find that robustness is primarily driven by task-specific competence rather than inherent model-level properties, challenging current approaches that treat robustness as an independent capability. Thus, from a high-level perspective, we may expect that as new tasks saturate, model robustness on these tasks will emerge accordingly. For researchers, this implies that explicit efforts to measure and improve robustness may warrant reduced emphasis, as such robustness is likely to develop alongside performance gains. For practitioners, it acts as a sign that indeed the tasks that the literature deals with are unreliable, but on easier past tasks, the models are reliable and ready for real-world deployment.

[528] A Single Revision Step Improves Token-Efficient LLM Reasoning

Yingchuan Zhang, Terry Ma, Wenxuan Zhong, Ping Ma

Main category: cs.LG

TL;DR: PACER enables reasoning traces to collaboratively revise conclusions through structured peer-review, improving accuracy on challenging math problems by transforming consensus into logical refinement.

Details

Motivation: Standard aggregation methods for LLM reasoning (majority voting, confidence filtering) evaluate each trace in isolation, creating a "blind spot" where hallucinated paths with misleading high confidence can suppress the true solution by narrow margins.

Method: PACER (Packet-Conditioned Revision) is a training-free inference framework that: 1) screens generated traces, 2) constructs a consensus packet with candidate answers, aggregated confidence scores, and reasoning summaries, 3) enables traces to perform targeted self-review conditioned on this packet to identify logical divergence points, and 4) uses confidence-weighted voting over revised trajectories.

Result: On challenging competitive math benchmarks (AIME, BRUMO), PACER matches or exceeds the accuracy of 256-sample majority voting, significantly outperforming raw ensemble baselines.

Conclusion: PACER transforms simple consensus into a collaborative logical refinement process, enabling reasoning traces to “peer-review” each other to resolve near-miss errors in complex reasoning tasks.

Abstract: Large language models (LLMs) achieve higher accuracy on challenging reasoning tasks by scaling test-time compute through multiple trajectory sampling. However, standard aggregation methods like majority voting or individual confidence-based filtering face a fundamental “blind spot”: they evaluate each trace in isolation. As problems scale in difficulty, models often generate hallucinated paths that exhibit misleadingly high confidence, causing the true solution to be suppressed by a narrow margin in traditional voting. We ask: can we enable traces to “peer-review” each other to resolve these near-miss errors? We introduce Packet-Conditioned Revision (PACER), a training-free, inference-only framework that enables reasoning traces to revise their conclusions through a structured coordination step. After a preliminary screening of generated traces, PACER constructs a compact consensus packet containing (i) unique candidate answers, (ii) their aggregated confidence scores, and (iii) representative reasoning summaries for each candidate answer. Individual traces then perform a targeted self-review conditioned on this packet, allowing them to identify specific logical junctions where they diverged from the broader consensus and pivot if their original reasoning is found to be flawed. Final predictions are obtained via confidence-weighted voting over these revised trajectories. On challenging competitive math benchmarks such as AIME and BRUMO, PACER matches or exceeds the accuracy of 256-sample majority voting, significantly outperforming raw ensemble baselines by transforming simple consensus into a collaborative logical refinement process.

[529] MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling

Ning Ding, Fangcheng Liu, Kyungrae Kim, Linji Hao, Kyeng-Hun Lee, Hyeonmok Ko, Yehui Tang

Main category: cs.LG

TL;DR: MeKi enables efficient LLM deployment on edge devices by using memory-based expert knowledge injection instead of increasing computational complexity, achieving better performance with zero inference latency overhead.

Details

Motivation: Traditional LLM scaling through increased parameters or computations is impractical for edge devices with limited RAM and NPU resources, yet deploying performant LLMs on smartphones is crucial for user experience.

Method: MeKi equips Transformer layers with token-level memory experts that inject pre-stored semantic knowledge, uses re-parameterization to fold training matrices into compact static lookup tables, and offloads knowledge to ROM to decouple capacity from computational cost.

Result: MeKi significantly outperforms dense LLM baselines with identical inference speed, demonstrating the effectiveness of memory-based scaling for on-device LLMs with zero inference latency overhead.

Conclusion: Memory-based scaling via storage space rather than FLOPs is an effective paradigm for deploying performant LLMs on edge devices, addressing hardware constraints while maintaining user experience.

Abstract: Scaling Large Language Models (LLMs) typically relies on increasing the number of parameters or test-time computations to boost performance. However, these strategies are impractical for edge device deployment due to limited RAM and NPU resources. Despite hardware constraints, deploying performant LLM on edge devices such as smartphone remains crucial for user experience. To address this, we propose MeKi (Memory-based Expert Knowledge Injection), a novel system that scales LLM capacity via storage space rather than FLOPs. MeKi equips each Transformer layer with token-level memory experts that injects pre-stored semantic knowledge into the generation process. To bridge the gap between training capacity and inference efficiency, we employ a re-parameterization strategy to fold parameter matrices used during training into a compact static lookup table. By offloading the knowledge to ROM, MeKi decouples model capacity from computational cost, introducing zero inference latency overhead. Extensive experiments demonstrate that MeKi significantly outperforms dense LLM baselines with identical inference speed, validating the effectiveness of memory-based scaling paradigm for on-device LLMs. Project homepage is at https://github.com/ningding-o/MeKi.

[530] SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs

Sourajit Das, Dibyajyoti Chakraborthy, Romit Maulik

Main category: cs.LG

TL;DR: SC3D is a two-stage differentiable framework for discovering causal structures in multivariate time series, handling both lagged and instantaneous dependencies through edge preselection and refinement with sparsity and acyclicity constraints.

Details

Motivation: Causal discovery from multivariate time series is challenging due to interactions across multiple lags, possible instantaneous dependencies, and the combinatorial search space of dynamic graphs. Existing methods struggle with stability and accurate recovery of both lagged and instantaneous causal structures.

Method: Two-stage differentiable framework: Stage 1 performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges. Stage 2 refines these masks by optimizing a likelihood with sparsity constraints while enforcing acyclicity on the instantaneous block.

Result: SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines across synthetic and benchmark dynamical systems.

Conclusion: SC3D provides an effective differentiable approach for causal discovery in time series that handles both lagged and instantaneous dependencies while addressing the combinatorial search space challenge through a two-stage optimization process.

Abstract: Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose \textit{Stable Causal Dynamic Differentiable Discovery (SC3D)}, a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic and benchmark dynamical systems demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines.

[531] Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting

Rares Grozavescu, Pengyu Zhang, Etienne Meunier, Mark Girolami

Main category: cs.LG

TL;DR: Continuous-time Koopman autoencoder framework for turbulent flow simulation using numerical integration for flexible temporal resolution and stable long-horizon forecasting.

Details

Motivation: Classical data-driven surrogate models for turbulent flow simulation trade off between short-term accuracy and long-horizon stability. Existing Koopman autoencoders operate in discrete-time settings, limiting temporal flexibility.

Method: Introduces a continuous-time Koopman framework that models latent evolution through numerical integration schemes, allowing variable timesteps at inference for robustness to temporal resolution.

Result: The method demonstrates robustness to temporal resolution, generalizes beyond training regimes, and enables efficient long-horizon forecasting with learned dynamics adhering to analytical matrix exponential solutions.

Conclusion: Continuous-time Koopman framework provides improved temporal flexibility and stability for turbulent flow simulation compared to discrete-time approaches.

Abstract: Data-driven surrogate models have emerged as powerful tools for accelerating the simulation of turbulent flows. However, classical approaches which perform autoregressive rollouts often trade off between strong short-term accuracy and long-horizon stability. Koopman autoencoders, inspired by Koopman operator theory, provide a physics-based alternative by mapping nonlinear dynamics into a latent space where linear evolution is conducted. In practice, most existing formulations operate in a discrete-time setting, limiting temporal flexibility. In this work, we introduce a continuous-time Koopman framework that models latent evolution through numerical integration schemes. By allowing variable timesteps at inference, the method demonstrates robustness to temporal resolution and generalizes beyond training regimes. In addition, the learned dynamics closely adhere to the analytical matrix exponential solution, enabling efficient long-horizon forecasting. We evaluate the approach on classical CFD benchmarks and report accuracy, stability, and extrapolation properties.

[532] Tabula RASA: Exposing and Breaking the Relational Bottleneck in Transformers

Jonas Petersen, Camilla Mazzoleni, Riccardo Maggioni

Main category: cs.LG

TL;DR: RASA improves transformer multi-hop relational reasoning via edge-type embeddings and sparse attention masking, matching GPT-4 performance at lower cost.

Details

Motivation: Standard transformers struggle with multi-hop relational reasoning over structured data despite strong performance in other domains, requiring architectural modifications to handle complex reasoning tasks.

Method: Introduces RASA with two key modifications: (1) edge-type embeddings that inject relational structure into attention scores, and (2) sparse masking that restricts attention to graph-adjacent positions, reducing search space from O(2^{n^2}) to O(2^m).

Result: Outperforms standard transformers on MetaQA (1/2/3-hop) and WebQuestionsSP, matches GPT-4 performance at lower cost, with advantages growing with reasoning depth (+7.1 points on 3-hop).

Conclusion: Minimal structural modifications to transformers can substantially improve multi-hop relational reasoning capabilities without formal learnability guarantees.

Abstract: Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit complexity: standard transformers are $\mathsf{TC}^0$-complete and require $Ω(k)$ layers for $k$-hop reasoning. We introduce RASA (Relation-Aware Sparse Attention), a minimal modification adding: (1) edge-type embeddings that inject relational structure into attention scores, and (2) sparse masking that restricts attention to graph-adjacent positions. While RASA has the same asymptotic depth requirements, sparse masking reduces the attention search space from $O(2^{n^2})$ to $O(2^m)$ patterns, and edge biases provide explicit relation routing. Empirically, on MetaQA (1/2/3-hop) and WebQuestionsSP, RASA outperforms standard transformers and matches GPT-4 at lower cost, with advantages growing with reasoning depth (+7.1 points on 3-hop). We do not claim formal learnability guarantees; the contribution is empirical validation that minimal structural modifications substantially improve multi-hop reasoning.

[533] When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

Main category: cs.LG

TL;DR: New benchmarking framework for evaluating LLMs in retrosynthesis using ChemCensor metric for chemical plausibility, with CREED dataset for training.

Details

Motivation: Existing benchmarks for LLMs in drug discovery and synthesis planning rely on published procedures and Top-K accuracy with single ground-truth, which doesn't capture the open-ended nature of real-world synthesis planning.

Method: Proposed ChemCensor metric for chemical plausibility evaluation, created CREED dataset with millions of ChemCensor-validated reaction records, and trained models using this dataset.

Result: The trained model using CREED dataset improves over LLM baselines under the new benchmarking framework that emphasizes plausibility over exact match.

Conclusion: The new benchmarking framework with ChemCensor metric better aligns with human synthesis planning practices by evaluating plausibility rather than exact matches.

Abstract: Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

[534] Semantics-Aware Generative Latent Data Augmentation for Learning in Low-Resource Domains

Jae-Sung Bae, Minje Kim

Main category: cs.LG

TL;DR: GeLDA: A semantics-aware generative latent data augmentation framework using conditional diffusion models in foundation model latent spaces to address data scarcity in downstream tasks.

Details

Motivation: Deep learning underperforms in data-scarce settings despite foundation models' strong generalization. Even with FMs, downstream fine-tuning suffers from limited labeled data, necessitating effective data augmentation methods.

Method: GeLDA uses conditional diffusion models to synthesize samples in FM-induced latent spaces, which are low-dimensional and task-informative. It conditions generation on auxiliary feature vectors capturing semantic relationships among classes/subdomains for better augmentation in low-resource settings.

Result: In zero-shot language-specific speech emotion recognition, GeLDA improves Whisper-large baseline’s unweighted average recall by 6.13%. In long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting new state-of-the-art.

Conclusion: GeLDA effectively addresses data scarcity by leveraging FM latent spaces for efficient, high-quality generative data augmentation, demonstrating strong performance across audio and vision tasks with limited labeled data.

Abstract: Despite strong performance in data-rich regimes, deep learning often underperforms in the data-scarce settings common in practice. While foundation models (FMs) trained on massive datasets demonstrate strong generalization by extracting general-purpose features, they can still suffer from scarce labeled data during downstream fine-tuning. To address this, we propose GeLDA, a semantics-aware generative latent data augmentation framework that leverages conditional diffusion models to synthesize samples in an FM-induced latent space. Because this space is low-dimensional and concentrates task-relevant information compared to the input space, GeLDA enables efficient, high-quality data generation. GeLDA conditions generation on auxiliary feature vectors that capture semantic relationships among classes or subdomains, facilitating data augmentation in low-resource domains. We validate GeLDA in two large-scale recognition tasks: (a) in zero-shot language-specific speech emotion recognition, GeLDA improves the Whisper-large baseline’s unweighted average recall by 6.13%; and (b) in long-tailed image classification, it achieves 74.7% tail-class accuracy on ImageNet-LT, setting a new state-of-the-art result.

[535] Causal Flow Q-Learning for Robust Offline Reinforcement Learning

Mingxuan Li, Junzhe Zhang, Elias Bareinboim

Main category: cs.LG

TL;DR: Causal offline RL method using flow-matching policies that addresses confounding biases in pixel-based demonstrations when sensory capabilities mismatch between demonstrator and learner.

Details

Motivation: Standard policy gradient methods assume no unmeasured confounding in offline data, but this condition fails in pixel-based demonstrations when there's a mismatch between demonstrator's and learner's sensory capabilities, leading to implicit confounding biases.

Method: Develops a novel causal offline RL objective optimizing policies’ worst-case performance under confounding biases, with practical implementation using expressive flow-matching policies and a deep discriminator to assess discrepancy between target and behavioral policies.

Result: Experiments across 25 pixel-based tasks show the confounding-robust augmentation achieves 120% success rate compared to confounding-unaware state-of-the-art offline RL methods.

Conclusion: The proposed causal approach effectively handles confounding biases in pixel-based offline RL, significantly outperforming existing methods when sensory capabilities mismatch between demonstrator and learner.

Abstract: Expressive policies based on flow-matching have been successfully applied in reinforcement learning (RL) more recently due to their ability to model complex action distributions from offline data. These algorithms build on standard policy gradients, which assume that there is no unmeasured confounding in the data. However, this condition does not necessarily hold for pixel-based demonstrations when a mismatch exists between the demonstrator’s and the learner’s sensory capabilities, leading to implicit confounding biases in offline data. We address the challenge by investigating the problem of confounded observations in offline RL from a causal perspective. We develop a novel causal offline RL objective that optimizes policies’ worst-case performance that may arise due to confounding biases. Based on this new objective, we introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations, employing a deep discriminator to assess the discrepancy between the target policy and the nominal behavioral policy. Experiments across 25 pixel-based tasks demonstrate that our proposed confounding-robust augmentation procedure achieves a success rate 120% that of confounding-unaware, state-of-the-art offline RL methods.

[536] Conflict-Resolving and Sharpness-Aware Minimization for Generalized Knowledge Editing with Multiple Updates

Duy Nguyen, Hanqi Xiao, Archiki Prasad, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

Main category: cs.LG

TL;DR: CoRSA is a parameter-efficient training framework for knowledge editing in LLMs that improves generalization, stability across multiple updates, and resolves knowledge conflicts through sharpness-aware minimization and margin maximization.

Details

Motivation: LLMs need to be kept up-to-date with new knowledge, but full retraining is expensive. Existing efficient alternatives like model editing and parameter-efficient fine-tuning often fail in practice due to poor generalization across inputs, limited stability, and knowledge conflicts.

Method: CoRSA (Conflict-Resolving and Sharpness-Aware Minimization) framework uses sharpness-aware minimization to reduce loss curvature for better generalization and stability, and maximizes margin between new and prior knowledge to resolve conflicts. It’s parameter-efficient and handles multiple knowledge updates.

Result: Outperforms baselines with 12.42% average improvement over LoRA and 10% over model editing methods on fact editing benchmarks. Reduces catastrophic forgetting by 27.82% compared to LoRA with multiple updates. Also generalizes to code domain with 5.48% Pass@5 improvement.

Conclusion: CoRSA provides an effective parameter-efficient framework for knowledge editing that addresses key limitations of existing methods, offering better generalization, stability, and conflict resolution for keeping LLMs up-to-date.

Abstract: Large language models (LLMs) rely on internal knowledge to solve many downstream tasks, making it crucial to keep them up to date. Since full retraining is expensive, prior work has explored efficient alternatives such as model editing and parameter-efficient fine-tuning. However, these approaches often break down in practice due to poor generalization across inputs, limited stability, and knowledge conflict. To address these limitations, we propose the CoRSA (Conflict-Resolving and Sharpness-Aware Minimization) training framework, a parameter-efficient, holistic approach for knowledge editing with multiple updates. CoRSA tackles multiple challenges simultaneously: it improves generalization to different input forms and enhances stability across multiple updates by minimizing loss curvature, and resolves conflicts by maximizing the margin between new and prior knowledge. Across three widely used fact editing benchmarks, CoRSA achieves significant gains in generalization, outperforming baselines with average absolute improvements of 12.42% over LoRA and 10% over model editing methods. With multiple updates, it maintains high update efficacy while reducing catastrophic forgetting by 27.82% compared to LoRA. CoRSA also generalizes to the code domain, outperforming the strongest baseline by 5.48% Pass@5 in update efficacy.

[537] Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression

Ali Abbasi, Chayne Thrash, Haoran Qin, Shansita Sharma, Sepehr Seifi, Soheil Kolouri

Main category: cs.LG

TL;DR: ZS-SVD is a post-training compression method for LLMs that uses activation whitening and first-order loss estimates to globally prune singular components across the model with a zero-sum rule, automatically determining heterogeneous rank allocation without expensive optimization.

Details

Motivation: While SVD-based compression reduces LLM memory and compute costs, performance depends on rank allocation. Prior methods use homogeneous ranks or expensive iterative optimization. There's a need for efficient post-training compression that automatically determines optimal per-matrix ranks.

Method: ZS-SVD performs global singular component selection using activation whitening and first-order calibration loss estimates in whitened coordinates. It prunes components across the whole model with a zero-sum rule that keeps cumulative predicted loss change near zero. An optional lightweight correction applies a single projected gradient update after truncation followed by re-truncation.

Result: Extensive experiments across multiple LLM architectures show consistent gains across diverse benchmarks and compression ratios. The method yields heterogeneous ranks automatically without solving rank allocation optimization.

Conclusion: ZS-SVD provides an effective post-training compression method for LLMs that automatically determines optimal rank allocation through global singular component selection with a zero-sum rule, outperforming prior methods.

Abstract: Advances in large language models have driven strong performance across many tasks, but their memory and compute costs still hinder deployment. SVD-based compression reduces storage and can speed up inference via low-rank factors, yet performance depends on how rank is allocated under a global compression ratio. Prior methods often use homogeneous ranks for similarly sized matrices, despite large differences in loss sensitivity, or rely on expensive iterative pre-truncation optimization to determine per matrix ranks. We propose \textbf{Zero Sum SVD} (\textbf{ZS-SVD}), a post-training method that performs \emph{global} singular component selection using activation whitening and first-order calibration loss estimates in whitened coordinates. \textbf{ZS-SVD} prunes components across the whole model with a \textbf{zero sum} rule that keeps the cumulative predicted loss change near zero, automatically yielding heterogeneous ranks without solving a rank allocation optimization. Motivated by evidence that gradients near pretrained solutions exhibit low rank structure, we also introduce an optional lightweight correction that applies a \textbf{single} projected gradient update after truncation, followed by re-truncation. Extensive experiments across multiple LLM architectures show consistent gains across diverse benchmarks and compression ratios. Code is available at https://github.com/mint-vu/Zero-Sum-SVD

[538] Efficient Estimation of Kernel Surrogate Models for Task Attribution

Zhenshuo Zhang, Minxuan Duan, Hongyang R. Zhang

Main category: cs.LG

TL;DR: Proposes kernel surrogate models for task attribution in multi-task AI training, capturing nonlinear task interactions better than linear methods, with applications to math reasoning, in-context learning, and multi-objective RL.

Details

Motivation: Modern AI agents are trained on diverse tasks simultaneously, but understanding how each training task influences target task performance (task attribution) is challenging. Linear surrogate models miss important nonlinear interactions like synergy or antagonism between tasks.

Method: Introduces kernel surrogate models to capture second-order task interactions, with a gradient-based estimation procedure that leverages first-order approximations of pretrained models to efficiently learn the surrogate without repeated retraining.

Result: Kernel surrogate models achieve 25% higher correlation with leave-one-out ground truth than linear surrogates and influence-function baselines, and yield 40% improvement in demonstration selection for in-context learning and multi-objective RL benchmarks.

Conclusion: Kernel surrogate models provide a more accurate and efficient approach to task attribution in multi-task learning, capturing nonlinear task interactions that linear methods miss, with practical benefits for downstream task selection.

Abstract: Modern AI agents such as large language models are trained on diverse tasks – translation, code generation, mathematical reasoning, and text prediction – simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task’s performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate estimates with less than $2%$ relative error without repeated retraining. Experiments across multiple domains – including math reasoning in transformers, in-context learning, and multi-objective reinforcement learning – demonstrate the effectiveness of kernel surrogate models. They achieve a $25%$ higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines. When used for downstream task selection, kernel surrogate models yield a $40%$ improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.

[539] Recurrent Equivariant Constraint Modulation: Learning Per-Layer Symmetry Relaxation from Data

Stefanos Pertigkiozoglou, Mircea Petrache, Shubhendu Trivedi, Kostas Daniilidis

Main category: cs.LG

TL;DR: RECM is a novel method that learns optimal relaxation levels for equivariant neural networks from training signals, automatically adapting to the symmetry properties of each layer’s input-target distribution without requiring manual tuning.

Details

Motivation: Strict equivariance constraints in neural networks can hinder learning by creating complex optimization dynamics, but existing relaxation methods require costly manual tuning of relaxation levels for each layer, which is task-dependent and impractical.

Method: Proposes Recurrent Equivariant Constraint Modulation (RECM), a layer-wise constraint modulation mechanism that learns appropriate relaxation levels solely from training signals and symmetry properties of each layer’s input-target distribution, without prior knowledge of target relaxation levels.

Result: RECM provably converges to relaxation levels upper-bounded by each layer’s symmetry gap, automatically recovering full equivariance for symmetric distributions while allowing flexibility for approximate symmetries. Empirically outperforms prior methods on diverse equivariant tasks including molecular conformer generation.

Conclusion: RECM provides an effective, automatic approach to balancing equivariance constraints with learning flexibility, eliminating the need for manual tuning while maintaining theoretical guarantees about convergence to appropriate relaxation levels based on data symmetry properties.

Abstract: Equivariant neural networks exploit underlying task symmetries to improve generalization, but strict equivariance constraints can induce more complex optimization dynamics that can hinder learning. Prior work addresses these limitations by relaxing strict equivariance during training, but typically relies on prespecified, explicit, or implicit target levels of relaxation for each network layer, which are task-dependent and costly to tune. We propose Recurrent Equivariant Constraint Modulation (RECM), a layer-wise constraint modulation mechanism that learns appropriate relaxation levels solely from the training signal and the symmetry properties of each layer’s input-target distribution, without requiring any prior knowledge about the task-dependent target relaxation level. We demonstrate that under the proposed RECM update, the relaxation level of each layer provably converges to a value upper-bounded by its symmetry gap, namely the degree to which its input-target distribution deviates from exact symmetry. Consequently, layers processing symmetric distributions recover full equivariance, while those with approximate symmetries retain sufficient flexibility to learn non-symmetric solutions when warranted by the data. Empirically, RECM outperforms prior methods across diverse exact and approximate equivariant tasks, including the challenging molecular conformer generation on the GEOM-Drugs dataset.

[540] When pre-training hurts LoRA fine-tuning: a dynamical analysis via single-index models

Gibbs Nwemadji, Bruno Loureiro, Jean Barbier

Main category: cs.LG

TL;DR: Excessive pre-training can slow down fine-tuning optimization, contrary to naive intuition, particularly for LoRA fine-tuning on single-index models under one-pass SGD.

Details

Motivation: The paper challenges the common assumption that pre-training always facilitates downstream fine-tuning, aiming to mathematically demonstrate that excessive pre-training can actually hinder fine-tuning convergence.

Method: Theoretical analysis using summary statistics description of fine-tuning dynamics for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. The study characterizes convergence rate dependence on initial fine-tuning alignment and target task non-linearity.

Result: Even when pre-training and downstream tasks are well-aligned, strong pre-training can induce a prolonged search phase and hinder convergence. The theory provides a unified picture of how pre-training strength and task difficulty jointly shape LoRA fine-tuning dynamics.

Conclusion: Excessive pre-training can computationally slow down fine-tuning optimization, revealing non-trivial dynamics in LoRA fine-tuning that depend on both pre-training strength and task difficulty.

Abstract: Pre-training on a source task is usually expected to facilitate fine-tuning on similar downstream problems. In this work, we mathematically show that this naive intuition is not always true: excessive pre-training can computationally slow down fine-tuning optimization. We study this phenomenon for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. Leveraging a summary statistics description of the fine-tuning dynamics, we precisely characterize how the convergence rate depends on the initial fine-tuning alignment and the degree of non-linearity of the target task. The key take away is that even when the pre-training and down- stream tasks are well aligned, strong pre-training can induce a prolonged search phase and hinder convergence. Our theory thus provides a unified picture of how pre-training strength and task difficulty jointly shape the dynamics and limitations of LoRA fine-tuning in a nontrivial tractable model.

[541] Late-Stage Generalization Collapse in Grokking: Detecting anti-grokking with Weightwatcher

Hari K Prakash, Charles H Martin

Main category: cs.LG

TL;DR: The paper identifies “anti-grokking” - a late-stage collapse of generalization after successful grokking, diagnosed via Correlation Traps in weight matrices using WeightWatcher tool.

Details

Motivation: To understand memorization in neural networks beyond the grokking regime, identifying a previously unreported third phase where models lose generalization after achieving it.

Method: Extended training beyond standard grokking experiments on two canonical setups (3-layer MLP on MNIST subset and transformer on modular addition), using WeightWatcher tool to analyze weight matrices for Correlation Traps and HTSR layer quality metrics.

Result: Discovered anti-grokking phase where test accuracy collapses to chance while training accuracy remains perfect; Correlation Traps (anomalously large eigenvalues) reliably identify this phase, unlike other diagnostics; observed similar pathologies in large-scale LLMs.

Conclusion: Anti-grokking represents a distinct post-generalization failure mode; Correlation Traps provide a data-free diagnostic for generalization collapse; these phenomena extend to large-scale models.

Abstract: \emph{Memorization} in neural networks lacks a precise operational definition and is often inferred from the grokking regime, where training accuracy saturates while test accuracy remains very low. We identify a previously unreported third phase of grokking in this training regime: \emph{anti-grokking}, a late-stage collapse of generalization. We revisit two canonical grokking setups: a 3-layer MLP trained on a subset of MNIST and a transformer trained on modular addition, but extended training far beyond standard. In both cases, after models transition from pre-grokking to successful generalization, test accuracy collapses back to chance while training accuracy remains perfect, indicating a distinct post-generalization failure mode. To diagnose anti-grokking, we use the open-source \texttt{WeightWatcher} tool based on HTSR/SETOL theory. The primary signal is the emergence of \emph{Correlation Traps}: anomalously large eigenvalues beyond the Marchenko–Pastur bulk in the empirical spectral density of shuffled weight matrices, which are predicted to impair generalization. As a secondary signal, anti-grokking corresponds to the average HTSR layer quality metric $α$ deviating from $2.0$. Neither metric requires access to the test or training data. We compare these signals to alternative grokking diagnostic, including $\ell_2$ norms, Activation Sparsity, Absolute Weight Entropy, and Local Circuit Complexity. These track pre-grokking and grokking but fail to identify anti-grokking. Finally, we show that Correlation Traps can induce catastrophic forgetting and/or prototype memorization, and observe similar pathologies in large-scale LLMs, like OSS GPT 20/120B.

[542] Bridging Online and Offline RL: Contextual Bandit Learning for Multi-Turn Code Generation

Ziru Chen, Dongdong Chen, Ruinan Jin, Yingbin Liang, Yujia Xie, Huan Sun

Main category: cs.LG

TL;DR: Cobalt is a contextual bandit learning method that combines online and offline RL for multi-turn code generation by using partial trajectories as prompts and single-step completions.

Details

Motivation: Online RL for multi-turn code generation with LLMs is effective but expensive and unstable, while offline RL is cheaper but less effective. The paper aims to combine benefits of both approaches.

Method: Cobalt collects code generation trajectories using a reference LLM, divides them into partial trajectories as contextual prompts, then trains the LLM through online bandit learning with single-step completions of these prompts.

Result: Cobalt outperforms GRPO and VeRPO baselines, improving R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. The method also addresses reward hacking with perturbed trajectories.

Conclusion: Cobalt demonstrates a promising solution for iterative decision-making tasks like multi-turn code generation by effectively combining online and offline RL benefits.

Abstract: Recently, there have been significant research interests in training large language models (LLMs) with reinforcement learning (RL) on real-world tasks, such as multi-turn code generation. While online RL tends to perform better than offline RL, its higher training cost and instability hinders wide adoption. In this paper, we build on the observation that multi-turn code generation can be formulated as a one-step recoverable Markov decision process and propose contextual bandit learning with offline trajectories (Cobalt), a new method that combines the benefits of online and offline RL. Cobalt first collects code generation trajectories using a reference LLM and divides them into partial trajectories as contextual prompts. Then, during online bandit learning, the LLM is trained to complete each partial trajectory prompt through single-step code generation. Cobalt outperforms two multi-turn online RL baselines based on GRPO and VeRPO, and substantially improves R1-Distill 8B and Qwen3 8B by up to 9.0 and 6.2 absolute Pass@1 scores on LiveCodeBench. Also, we analyze LLMs’ in-context reward hacking behaviors and augment Cobalt training with perturbed trajectories to mitigate this issue. Overall, our results demonstrate Cobalt as a promising solution for iterative decision-making tasks like multi-turn code generation. Our code and data are available at https://github.com/OSU-NLP-Group/cobalt.

[543] A Geometry-Aware Efficient Algorithm for Compositional Entropic Risk Minimization

Xiyuan Wei, Linli Zhou, Bokun Wang, Chih-Jen Lin, Tianbao Yang

Main category: cs.LG

TL;DR: SCENT: A geometry-aware stochastic algorithm for compositional entropic risk minimization using stochastic proximal mirror descent with exponential Bregman divergence.

Details

Motivation: Existing optimization algorithms for compositional entropic risk minimization (Log-E-Exp functions) suffer from non-convergence, numerical instability, and slow convergence rates, despite their importance in many ML problems.

Method: Proposes SCENT algorithm using stochastic proximal mirror descent (SPMD) with Bregman divergence induced by negative exponential function to capture objective geometry, applied to dual formulation of entropic risk minimization as min-min optimization.

Result: Establishes O(1/√T) convergence rate for convex problems, theoretically shows SPMD advantages over standard SGD, and demonstrates empirical superiority on extreme classification, partial AUC maximization, contrastive learning, and distributionally robust optimization.

Conclusion: SCENT addresses fundamental limitations of existing methods for compositional entropic risk minimization through geometry-aware stochastic optimization with provable convergence and practical effectiveness.

Abstract: This paper studies optimization for a family of problems termed $\textbf{compositional entropic risk minimization}$, in which each data’s loss is formulated as a Log-Expectation-Exponential (Log-E-Exp) function. The Log-E-Exp formulation serves as an abstraction of the Log-Sum-Exponential (LogSumExp) function when the explicit summation inside the logarithm is taken over a gigantic number of items and is therefore expensive to evaluate. While entropic risk objectives of this form arise in many machine learning problems, existing optimization algorithms suffer from several fundamental limitations including non-convergence, numerical instability, and slow convergence rates. To address these limitations, we propose a geometry-aware stochastic algorithm, termed $\textbf{SCENT}$, for the dual formulation of entropic risk minimization cast as a min–min optimization problem. The key to our design is a $\textbf{stochastic proximal mirror descent (SPMD)}$ update for the dual variable, equipped with a Bregman divergence induced by a negative exponential function that faithfully captures the geometry of the objective. Our main contributions are threefold: (i) we establish an $O(1/\sqrt{T})$ convergence rate of the proposed SCENT algorithm for convex problems; (ii) we theoretically characterize the advantages of SPMD over standard SGD update for optimizing the dual variable; and (iii) we demonstrate the empirical effectiveness of SCENT on extreme classification, partial AUC maximization, contrastive learning and distributionally robust optimization, where it consistently outperforms existing baselines.

[544] Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

Main category: cs.LG

TL;DR: ADFP introduces a principled fingerprinting method for detecting model distillation by aligning fingerprinting objectives with student learning dynamics, achieving better detection with minimal utility degradation.

Details

Motivation: Existing fingerprinting techniques for detecting model distillation rely on heuristic perturbations that create a trade-off between generation quality and fingerprinting strength, often requiring significant utility degradation for effective fingerprinting.

Method: ADFP uses a gradient-based antidistillation sampling framework with a proxy model to identify and sample tokens that maximize expected detectability of fingerprints in student models after fine-tuning, rather than relying on incidental absorption of biases.

Result: Experiments on GSM8K and OASST1 benchmarks show ADFP achieves significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even with unknown student model architectures.

Conclusion: ADFP provides a principled approach to model distillation fingerprinting that better aligns with student learning dynamics, offering improved detection capabilities without sacrificing generation quality.

Abstract: Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model’s outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student’s learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model’s architecture is unknown.

[545] Mixture of Concept Bottleneck Experts

Francesco De Santis, Gabriele Ciravegna, Giovanni De Felice, Arianna Casanova, Francesco Giannini, Michelangelo Diligenti, Mateo Espinosa Zarlenga, Pietro Barbiero, Johannes Schneider, Danilo Giordano

Main category: cs.LG

TL;DR: M-CBEs generalize concept bottleneck models by using mixtures of experts with different functional forms, enabling better accuracy-interpretability trade-offs through linear or symbolic regression experts.

Details

Motivation: Existing Concept Bottleneck Models (CBMs) are limited by using fixed linear or Boolean predictors, which restricts both predictive accuracy and adaptability to diverse user needs. There's an underexplored design space for more flexible CBMs.

Method: Proposes Mixture of Concept Bottleneck Experts (M-CBEs) that generalizes CBMs along two dimensions: number of experts and functional form of each expert. Instantiates two models: Linear M-CBE (learns finite set of linear expressions) and Symbolic M-CBE (uses symbolic regression to discover expert functions from data with user-specified operator vocabularies).

Result: Empirical evaluation shows that varying mixture size and functional form provides a robust framework for navigating accuracy-interpretability trade-off, adapting to different user and task needs.

Conclusion: M-CBEs offer a flexible framework that expands the design space of interpretable models, allowing better adaptation to diverse requirements while maintaining interpretability through concept grounding.

Abstract: Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically fix their task predictor to a single linear or Boolean expression, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBEs), a framework that generalizes existing CBMs along two dimensions: the number of experts and the functional form of each expert, exposing an underexplored region of the design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data under user-specified operator vocabularies. Empirical evaluation demonstrates that varying the mixture size and functional form provides a robust framework for navigating the accuracy-interpretability trade-off, adapting to different user and task needs.

[546] Self-Soupervision: Cooking Model Soups without Labels

Anthony Fuller, James R. Green, Evan Shelhamer

Main category: cs.LG

TL;DR: Self-Souping extends model soups to self-supervised learning, enabling parameter mixing from diverse SSL algorithms and data sources for improved robustness and accuracy.

Details

Motivation: Traditional model soups require supervised learning and optimize the same loss on labeled data. The authors want to generalize soups to self-supervised learning to leverage unlabeled data from various sources (e.g., task transfer, distribution shifts) and enable mixing of diverse SSL algorithms.

Method: Proposes Self-Souping: take a base model (stock), fine-tune it into multiple models (ingredients) using different SSL algorithms or hyperparameters on various data sources, then mix their parameters back into one model (soup). Includes techniques like self-souping on corrupted test data then fine-tuning back on clean train data.

Result: Self-Souping boosts robustness by +3.5% on ImageNet-C and +7% on LAION-C. Enables cooking soups from diverse SSL algorithms (MAE, MoCoV3, MMCR) that outperform any single SSL ingredient. First demonstration that ingredients can differ in SSL hyperparameters and algorithms.

Conclusion: Self-Souping successfully generalizes model soups to self-supervised learning, unlocking countless SSL algorithms for creating more robust models through parameter mixing from diverse ingredients trained on various data sources.

Abstract: Model soups are strange and strangely effective combinations of parameters. They take a model (the stock), fine-tune it into multiple models (the ingredients), and then mix their parameters back into one model (the soup) to improve predictions. While all known soups require supervised learning, and optimize the same loss on labeled data, our recipes for Self-\emph{Soup}ervision generalize soups to self-supervised learning (SSL). Our Self-Souping lets us flavor ingredients on new data sources, e.g. from unlabeled data from a task for transfer or from a shift for robustness. We show that Self-Souping on corrupted test data, then fine-tuning back on uncorrupted train data, boosts robustness by +3.5% (ImageNet-C) and +7% (LAION-C). Self-\emph{Soup}ervision also unlocks countless SSL algorithms to cook the diverse ingredients needed for more robust soups. We show for the first time that ingredients can differ in their SSL hyperparameters – and more surprisingly, in their SSL algorithms. We cook soups of MAE, MoCoV3, and MMCR ingredients that are more accurate than any one single SSL ingredient.

[547] Controlled disagreement improves generalization in decentralized training

Zesen Wang, Mikael Johansson

Main category: cs.LG

TL;DR: DSGD-AC intentionally preserves consensus errors in decentralized training, showing they act as structured perturbations that guide optimization toward flatter minima, outperforming both standard DSGD and centralized SGD.

Details

Motivation: Challenge the conventional view that consensus errors in decentralized training are detrimental, proposing instead that they can serve as useful implicit regularizers for better generalization.

Method: Decentralized SGD with Adaptive Consensus (DSGD-AC) that preserves non-vanishing consensus errors through time-dependent scaling, allowing errors to systematically align with dominant Hessian subspace.

Result: DSGD-AC consistently surpasses both standard DSGD and centralized SGD in test accuracy and solution flatness across image classification and machine translation benchmarks.

Conclusion: Consensus errors can be beneficial as implicit regularizers, opening new perspectives for decentralized learning algorithm design by leveraging structured perturbations for better generalization.

Abstract: Decentralized training is often regarded as inferior to centralized training because the consensus errors between workers are thought to undermine convergence and generalization, even with homogeneous data distributions. This work challenges this view by introducing decentralized SGD with Adaptive Consensus (DSGD-AC), which intentionally preserves non-vanishing consensus errors through a time-dependent scaling mechanism. We prove that these errors are not random noise but systematically align with the dominant Hessian subspace, acting as structured perturbations that guide optimization toward flatter minima. Across image classification and machine translation benchmarks, DSGD-AC consistently surpasses both standard DSGD and centralized SGD in test accuracy and solution flatness. Together, these results establish consensus errors as a useful implicit regularizer and open a new perspective on the design of decentralized learning algorithms.

[548] Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

Zeyu Fang, Zuyuan Zhang, Mahdi Imani, Tian Lan

Main category: cs.LG

TL;DR: MC-ETM improves offline RL robustness by learning energy-based transition models with manifold constraints to detect and handle distribution shift, using energy signals to truncate rollouts and stabilize value estimation.

Details

Motivation: Offline RL suffers from distribution shift where policy improvement drives rollouts into unsupported regions, causing compounding model errors and severe value overestimation. Current methods lack reliable ways to detect and handle these out-of-distribution scenarios.

Method: Proposes Manifold-Constrained Energy-based Transition Models (MC-ETM) that: 1) Learn conditional energy-based transition models using manifold projection-diffusion negative sampling, 2) Generate near-manifold hard negatives via latent space perturbations and Langevin dynamics, 3) Use learned energy as reliability signal to truncate rollouts when energy exceeds threshold, 4) Stabilize Bellman backups with pessimistic penalties based on Q-value dispersion across energy-guided samples.

Result: MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage. The method shows better robustness to distribution shift.

Conclusion: MC-ETM provides an effective framework for handling distribution shift in offline RL through energy-based modeling and manifold constraints, offering reliable detection of out-of-distribution states and stabilization of value estimation.

Abstract: Model-based offline reinforcement learning is brittle under distribution shift: policy improvement drives rollouts into state–action regions weakly supported by the dataset, where compounding model error yields severe value overestimation. We propose Manifold-Constrained Energy-based Transition Models (MC-ETM), which train conditional energy-based transition models using a manifold projection–diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives by perturbing latent codes and running Langevin dynamics in latent space with the learned conditional energy, sharpening the energy landscape around the dataset support and improving sensitivity to subtle out-of-distribution deviations. For policy optimization, the learned energy provides a single reliability signal: rollouts are truncated when the minimum energy over sampled next states exceeds a threshold, and Bellman backups are stabilized via pessimistic penalties based on Q-value-level dispersion across energy-guided samples. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk. Empirically, MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage.

[549] Spatiotemporal Decision Transformer for Traffic Coordination

Haoran Su, Yandong Sun, Hanxiao Deng

Main category: cs.LG

TL;DR: MADT (Multi-Agent Decision Transformer) applies sequence modeling to multi-agent traffic signal control using graph attention and temporal transformers for improved coordination and sample efficiency.

Details

Motivation: Existing reinforcement learning methods for traffic signal control struggle with multi-agent coordination and sample efficiency, requiring better approaches for optimizing network-wide traffic flow.

Method: Reformulates multi-agent traffic signal control as sequence modeling problem using Decision Transformer extended with: (1) graph attention for spatial dependencies, (2) temporal transformer encoder for traffic dynamics, (3) return-to-go conditioning for target performance.

Result: Achieves state-of-the-art performance, reducing average travel time by 5-6% compared to strongest baselines, with superior coordination among adjacent intersections in synthetic grid networks and real-world scenarios.

Conclusion: MADT provides an effective approach for multi-agent traffic signal control through sequence modeling, enabling offline learning from historical data with potential for online fine-tuning.

Abstract: Traffic signal control is a critical challenge in urban transportation, requiring coordination among multiple intersections to optimize network-wide traffic flow. While reinforcement learning has shown promise for adaptive signal control, existing methods struggle with multi-agent coordination and sample efficiency. We introduce MADT (Multi-Agent Decision Transformer), a novel approach that reformulates multi-agent traffic signal control as a sequence modeling problem. MADT extends the Decision Transformer paradigm to multi-agent settings by incorporating: (1) a graph attention mechanism for modeling spatial dependencies between intersections, (2) a|temporal transformer encoder for capturing traffic dynamics, and (3) return-to-go conditioning for target performance specification. Our approach enables offline learning from historical traffic data, with architecture design that facilitates potential online fine-tuning. Experiments on synthetic grid networks and real-world traffic scenarios demonstrate that MADT achieves state-of-the-art performance, reducing average travel time by 5-6% compared to the strongest baseline while exhibiting superior coordination among adjacent intersections.

[550] A Random Matrix Theory Perspective on the Consistency of Diffusion Models

Binxu Wang, Jacob Zavatone-Veth, Cengiz Pehlevan

Main category: cs.LG

TL;DR: Diffusion models trained on different dataset splits produce similar outputs due to shared Gaussian statistics; a random matrix theory framework explains this through noise renormalization and identifies three factors behind cross-split disagreement.

Details

Motivation: The paper aims to understand why diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed, and to develop a theoretical framework to explain this phenomenon.

Method: Develops a random matrix theory framework to quantify how finite datasets shape the expectation and variance of learned denoisers and sampling maps in linear settings. Uses deterministic-equivalence tools extended to fractional matrix powers to analyze entire sampling trajectories.

Result: The theory explains that sampling variability acts as a renormalization of noise level through a self-consistent relation, causing limited data to overshrink low-variance directions and pull samples toward dataset mean. Identifies three key factors behind cross-split disagreement: anisotropy across eigenmodes, inhomogeneity across inputs, and scaling with dataset size.

Conclusion: Provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs, with validation on UNet and DiT architectures in their non-memorization regime.

Abstract: Diffusion models trained on different, non-overlapping subsets of a dataset often produce strikingly similar outputs when given the same noise seed. We trace this consistency to a simple linear effect: the shared Gaussian statistics across splits already predict much of the generated images. To formalize this, we develop a random matrix theory (RMT) framework that quantifies how finite datasets shape the expectation and variance of the learned denoiser and sampling map in the linear setting. For expectations, sampling variability acts as a renormalization of the noise level through a self-consistent relation $σ^2 \mapsto κ(σ^2)$, explaining why limited data overshrink low-variance directions and pull samples toward the dataset mean. For fluctuations, our variance formulas reveal three key factors behind cross-split disagreement: \textit{anisotropy} across eigenmodes, \textit{inhomogeneity} across inputs, and overall scaling with dataset size. Extending deterministic-equivalence tools to fractional matrix powers further allows us to analyze entire sampling trajectories. The theory sharply predicts the behavior of linear diffusion models, and we validate its predictions on UNet and DiT architectures in their non-memorization regime, identifying where and how samples deviates across training data split. This provides a principled baseline for reproducibility in diffusion training, linking spectral properties of data to the stability of generative outputs.

[551] Notes on the Reward Representation of Posterior Updates

Pedro A. Ortega

Main category: cs.LG

TL;DR: The paper establishes conditions under which decision-making as inference can be made literal rather than metaphorical, showing when KL-regularized soft updates correspond exactly to Bayesian posterior updates within a fixed probabilistic model.

Details

Motivation: To bridge the gap between metaphorical and literal interpretations of decision-making as inference in control and reinforcement learning, specifically determining when behavioral updates can be explained as genuine Bayesian evidence reweighing rather than just analogies.

Method: Analyzes the special case where KL-regularized soft updates correspond exactly to Bayesian posterior updates within a single fixed probabilistic model, treating the update variable as a genuine information channel. Examines identification constraints for reward functions given posterior updates.

Result: Shows that posterior updates determine relative, context-dependent incentive signals but not absolute rewards, which remain ambiguous up to context-specific baselines. Coherence constraints emerge when requiring reusable continuation values across different update directions.

Conclusion: Provides a sharp identification result for when decision-making as inference can be made literal, clarifying the relationship between Bayesian updates and reward functions in reinforcement learning and control theory.

Abstract: Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighing of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which remain ambiguous up to context-specific baselines. Requiring one reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.

[552] Weighted Temporal Decay Loss for Learning Wearable PPG Data with Sparse Clinical Labels

Yunsung Chung, Keum San Chun, Migyeong Gwak, Han Feng, Yingshuo Liu, Chanho Lim, Viswam Nathan, Nassir Marrouche, Sharanya Arcot Desai

Main category: cs.LG

TL;DR: A training strategy that learns biomarker-specific decay of sample weight over time to address sparse clinical labels in PPG-based health monitoring, improving prediction accuracy and providing interpretable temporal sensitivity insights.

Details

Motivation: Sparse clinical labels in biosignal monitoring (like PPG from smartwatches) make temporally distant samples less reliable for supervision, creating challenges for developing accurate health algorithms.

Method: Proposes a training strategy that learns biomarker-specific decay of sample weight over time gap between PPG segment and ground truth label, using this weight in loss with a regularizer to prevent trivial solutions.

Result: On smartwatch PPG from 450 participants across 10 biomarkers, the approach achieves 0.715 AUPRC (vs 0.674 for fine-tuned self-supervised baseline and 0.626 for feature-based Random Forest). Linear decay function is most robust.

Conclusion: The method improves biomarker prediction from PPG signals while providing interpretable decay rates that summarize how quickly each biomarker’s PPG evidence becomes stale, offering insights into temporal sensitivity.

Abstract: Advances in wearable computing and AI have increased interest in leveraging PPG for health monitoring over the past decade. One of the biggest challenges in developing health algorithms based on such biosignals is the sparsity of clinical labels, which makes biosignals temporally distant from lab draws less reliable for supervision. To address this problem, we introduce a simple training strategy that learns a biomarker-specific decay of sample weight over the time gap between a segment and its ground truth label and uses this weight in the loss with a regularizer to prevent trivial solutions. On smartwatch PPG from 450 participants across 10 biomarkers, the approach improves over baselines. In the subject-wise setting, the proposed approach averages 0.715 AUPRC, compared to 0.674 for a fine-tuned self-supervised baseline and 0.626 for a feature-based Random Forest. A comparison of four decay families shows that a simple linear decay function is most robust on average. Beyond accuracy, the learned decay rates summarize how quickly each biomarker’s PPG evidence becomes stale, providing an interpretable view of temporal sensitivity.

[553] Refining Decision Boundaries In Anomaly Detection Using Similarity Search Within the Feature Space

Sidahmed Benabderrahmane, Petko Valtchev, James Cheney, Talal Rahwan

Main category: cs.LG

TL;DR: SDA2E: Sparse Dual Adversarial Attention-based AutoEncoder with similarity-guided active learning for anomaly detection in imbalanced datasets like APTs in cybersecurity.

Details

Motivation: Detecting rare and diverse anomalies in highly imbalanced datasets (like APTs in cybersecurity) is challenging. Active learning can help minimize labeling effort, but conventional approaches fail to exploit geometric structure for model refinement.

Method: Proposes SDA2E autoencoder for compact discriminative latent representations, plus similarity-guided active learning with three strategies: normal-like expansion, anomaly-like prioritization, and hybrid strategy. Introduces SIM_NM1 similarity measure for sparse binary embeddings.

Result: Evaluated on 52 imbalanced datasets including DARPA Transparent Computing scenarios. Achieves superior ranking performance (nDCG up to 1.0), reduces required labeled data by up to 80% compared to passive training, with statistical significance confirmed.

Conclusion: Establishes robust, efficient, statistically validated framework for anomaly detection suited to cybersecurity applications like APT detection.

Abstract: Detecting rare and diverse anomalies in highly imbalanced datasets-such as Advanced Persistent Threats (APTs) in cybersecurity-remains a fundamental challenge for machine learning systems. Active learning offers a promising direction by strategically querying an oracle to minimize labeling effort, yet conventional approaches often fail to exploit the intrinsic geometric structure of the feature space for model refinement. In this paper, we introduce SDA2E, a Sparse Dual Adversarial Attention-based AutoEncoder designed to learn compact and discriminative latent representations from imbalanced, high-dimensional data. We further propose a similarity-guided active learning framework that integrates three novel strategies to refine decision boundaries efficiently: mormal-like expansion, which enriches the training set with points similar to labeled normals to improve reconstruction fidelity; anomaly-like prioritization, which boosts ranking accuracy by focusing on points resembling known anomalies; and a hybrid strategy that combines both for balanced model refinement and ranking. A key component of our framework is a new similarity measure, Normalized Matching 1s (SIM_NM1), tailored for sparse binary embeddings. We evaluate SDA2E extensively across 52 imbalanced datasets, including multiple DARPA Transparent Computing scenarios, and benchmark it against 15 state-of-the-art anomaly detection methods. Results demonstrate that SDA2E consistently achieves superior ranking performance (nDCG up to 1.0 in several cases) while reducing the required labeled data by up to 80% compared to passive training. Statistical tests confirm the significance of these improvements. Our work establishes a robust, efficient, and statistically validated framework for anomaly detection that is particularly suited to cybersecurity applications such as APT detection.

[554] A Reproducible Framework for Bias-Resistant Machine Learning on Small-Sample Neuroimaging Data

Jagan Mohan Reddy Dwarampudi, Jennifer L Purks, Joshua Wong, Renjie Hu, Tania Banerjee

Main category: cs.LG

TL;DR: A reproducible ML framework for small-sample neuroimaging data using nested cross-validation and feature engineering to reduce bias and improve generalization.

Details

Motivation: Conventional cross-validation frameworks that reuse the same folds for both model selection and performance estimation yield optimistically biased results, limiting reproducibility and generalization in small-sample neuroimaging studies.

Method: Integrates domain-informed feature engineering, nested cross-validation, and calibrated decision-threshold optimization specifically designed for high-dimensional structural MRI datasets with small sample sizes.

Result: Achieved a nested-CV balanced accuracy of 0.660 ± 0.068 using a compact, interpretable subset of features selected via importance-guided ranking on deep brain stimulation cognitive outcomes data.

Conclusion: The framework provides a generalizable computational blueprint for reliable machine learning in data-limited biomedical domains by combining interpretability and unbiased evaluation.

Abstract: We introduce a reproducible, bias-resistant machine learning framework that integrates domain-informed feature engineering, nested cross-validation, and calibrated decision-threshold optimization for small-sample neuroimaging data. Conventional cross-validation frameworks that reuse the same folds for both model selection and performance estimation yield optimistically biased results, limiting reproducibility and generalization. Demonstrated on a high-dimensional structural MRI dataset of deep brain stimulation cognitive outcomes, the framework achieved a nested-CV balanced accuracy of 0.660,$\pm$,0.068 using a compact, interpretable subset selected via importance-guided ranking. By combining interpretability and unbiased evaluation, this work provides a generalizable computational blueprint for reliable machine learning in data-limited biomedical domains.

[555] RPG-AE: Neuro-Symbolic Graph Autoencoders with Rare Pattern Mining for Provenance-Based Anomaly Detection

Asif Tauhid, Sidahmed Benabderrahmane, Mohamad Altrabulsi, Ahamed Foisal, Talal Rahwan

Main category: cs.LG

TL;DR: Neuro-symbolic anomaly detection framework combining Graph Autoencoder with rare pattern mining to identify APT-like activities in system-level provenance data.

Details

Motivation: Advanced Persistent Threats (APTs) are sophisticated cyberattacks that operate stealthily and blend into normal system behavior, making them difficult to detect. Current methods struggle with identifying these subtle anomalies in system provenance data.

Method: Constructs process behavioral graph using k-Nearest Neighbors based on feature similarity, learns normal relational structure using Graph Autoencoder, identifies anomaly candidates through reconstruction deviations, and integrates rare pattern mining to discover infrequent behavioral co-occurrences that boost anomaly scores.

Result: Evaluated on DARPA Transparent Computing datasets, rare-pattern boosting yields substantial gains in anomaly ranking quality over baseline GAE. The unified model outperforms individual context-based detectors and achieves performance competitive with ensemble aggregation methods requiring multiple separate detectors.

Conclusion: Coupling graph-based representation learning with classical pattern mining improves both effectiveness and interpretability in provenance-based security anomaly detection, offering a valuable approach for identifying APT-like activities.

Abstract: Advanced Persistent Threats (APTs) are sophisticated, long-term cyberattacks that are difficult to detect because they operate stealthily and often blend into normal system behavior. This paper presents a neuro-symbolic anomaly detection framework that combines a Graph Autoencoder (GAE) with rare pattern mining to identify APT-like activities in system-level provenance data. Our approach first constructs a process behavioral graph using k-Nearest Neighbors based on feature similarity, then learns normal relational structure using a Graph Autoencoder. Anomaly candidates are identified through deviations between observed and reconstructed graph structure. To further improve detection, we integrate an rare pattern mining module that discovers infrequent behavioral co-occurrences and uses them to boost anomaly scores for processes exhibiting rare signatures. We evaluate the proposed method on the DARPA Transparent Computing datasets and show that rare-pattern boosting yields substantial gains in anomaly ranking quality over the baseline GAE. Compared with existing unsupervised approaches on the same benchmark, our single unified model consistently outperforms individual context-based detectors and achieves performance competitive with ensemble aggregation methods that require multiple separate detectors. These results highlight the value of coupling graph-based representation learning with classical pattern mining to improve both effectiveness and interpretability in provenance-based security anomaly detection.

[556] How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

Xiaoyuan Cheng, Wenxuan Yuan, Boyang Li, Yuanchao Xu, Yiming Yang, Hao Liang, Bei Peng, Robert Loftin, Zhuo Sun, Yukun Hu

Main category: cs.LG

TL;DR: ALGD is a diffusion-based safe RL algorithm that uses augmented Lagrangian to stabilize policy generation and training for multimodal action distributions in online safety-critical settings.

Details

Motivation: Existing diffusion-based RL methods focus on offline reward maximization with limited safety considerations in online settings. There's a need for safe RL approaches that can handle multimodal action distributions while ensuring stability in training and policy generation.

Method: Proposes Augmented Lagrangian-Guided Diffusion (ALGD) that revisits optimization theory and energy-based models. Shows primal-dual methods are unstable due to non-convex Lagrangian landscapes. Introduces augmented Lagrangian to locally convexify the energy landscape, stabilizing both policy generation and training without changing optimal policy distribution.

Result: ALGD achieves strong and stable performance across diverse environments. Theoretical analysis confirms it’s both theoretically grounded and empirically effective for safe RL with diffusion policies.

Conclusion: ALGD successfully addresses the stability issues in diffusion-based safe RL by using augmented Lagrangian to convexify the energy landscape, enabling stable multimodal policy generation while maintaining safety constraints in online settings.

Abstract: Diffusion policy sampling enables reinforcement learning (RL) to represent multimodal action distributions beyond suboptimal unimodal Gaussian policies. However, existing diffusion-based RL methods primarily focus on offline settings for reward maximization, with limited consideration of safety in online settings. To address this gap, we propose Augmented Lagrangian-Guided Diffusion (ALGD), a novel algorithm for off-policy safe RL. By revisiting optimization theory and energy-based model, we show that the instability of primal-dual methods arises from the non-convex Lagrangian landscape. In diffusion-based safe RL, the Lagrangian can be interpreted as an energy function guiding the denoising dynamics. Counterintuitively, direct usage destabilizes both policy generation and training. ALGD resolves this issue by introducing an augmented Lagrangian that locally convexifies the energy landscape, yielding a stabilized policy generation and training process without altering the distribution of the optimal policy. Theoretical analysis and extensive experiments demonstrate that ALGD is both theoretically grounded and empirically effective, achieving strong and stable performance across diverse environments.

[557] Distance Marching for Generative Modeling

Zimo Wang, Ishit Mehta, Haolin Lu, Chung-En Sun, Ge Yan, Tsui-Wei Weng, Tzu-Mao Li

Main category: cs.LG

TL;DR: Distance Marching: A time-unconditional generative modeling approach using distance field modeling to improve denoising direction accuracy without time conditioning.

Details

Motivation: Time-unconditional generative models face interference because the same noisy input can correspond to multiple noise levels and different denoising directions, confusing the supervision signal. The authors aim to address this fundamental limitation.

Method: Proposes Distance Marching with two principled inference methods inspired by distance field modeling. Designs losses that focus on closer targets to yield denoising directions better directed toward the data manifold.

Result: Consistently improves FID by 13.5% on CIFAR-10 and ImageNet over recent time-unconditional baselines. For class-conditional ImageNet generation, surpasses flow matching despite removing time input, achieving lower FID with 60% of sampling steps and 13.6% lower FID on average across backbone sizes.

Conclusion: Distance field modeling provides a principled framework for generative modeling, with distance prediction also useful for early stopping during sampling and OOD detection.

Abstract: Time-unconditional generative models learn time-independent denoising vector fields. But without time conditioning, the same noisy input may correspond to multiple noise levels and different denoising directions, which interferes with the supervision signal. Inspired by distance field modeling, we propose Distance Marching, a new time-unconditional approach with two principled inference methods. Crucially, we design losses that focus on closer targets. This yields denoising directions better directed toward the data manifold. Across architectures, Distance Marching consistently improves FID by 13.5% on CIFAR-10 and ImageNet over recent time-unconditional baselines. For class-conditional ImageNet generation, despite removing time input, Distance Marching surpasses flow matching using our losses and inference methods. It achieves lower FID than flow matching’s final performance using 60% of the sampling steps and 13.6% lower FID on average across backbone sizes. Moreover, our distance prediction is also helpful for early stopping during sampling and for OOD detection. We hope distance field modeling can serve as a principled lens for generative modeling.

[558] NLI:Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference

Jiangyong Yu, Xiaomeng Han, Xing Hu, Chen Xu, Zhe Jiang, Dawei Yang

Main category: cs.LG

TL;DR: NLI framework enables efficient approximation of nonlinear functions in LLMs using dynamic programming optimization for hardware-friendly implementation.

Details

Motivation: While significant progress has been made in compressing linear layers in LLMs, nonlinear layers (SiLU, RMSNorm, Softmax) still rely on expensive high-precision floating-point operations, creating deployment bottlenecks due to memory and computational constraints.

Method: Proposes Non-uniform Linear Interpolation (NLI) - a calibration-free, dynamic-programming-optimal framework that recasts cutpoint selection as a dynamic programming problem to achieve globally minimal interpolation error in O(M×N²) time using Bellman’s optimality principle.

Result: Hardware experiments show the NLI Engine achieves over 4× improvement in computational efficiency compared to state-of-the-art designs, enabling seamless integration into LLMs with almost no accuracy loss.

Conclusion: NLI provides an efficient, hardware-friendly solution for approximating nonlinear functions in LLMs, addressing a critical bottleneck in model deployment while maintaining accuracy.

Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, but their deployment is often constrained by substantial memory footprints and computational costs. While prior work has achieved significant progress in compressing and accelerating linear layers, nonlinear layers-such as SiLU, RMSNorm, and Softmax-still heavily depend on high-precision floating-point operations. In this paper, we propose a calibration-free, dynamic-programming-optimal, and hardware-friendly framework called Non-uniform Linear Interpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into LLMs and other deep neural networks with almost no loss in accuracy. NLI ingeniously recasts cutpoint selection as a dynamic-programming problem, achieving the globally minimal interpolation error in O(MxN2) time via Bellman’s optimality principle. Based on the NLI algorithm, we also design and implement a plug-and-play universal nonlinear computation unit. Hardware experiments demonstrate that the NLI Engine achieves more than 4x improvement in computational efficiency compared to the state-of-the-art designs.

[559] Rare Event Early Detection: A Dataset of Sepsis Onset for Critically Ill Trauma Patients

Yin Jin, Tucker R. Stewart, Deyi Zhou, Chhavi Gupta, Arjita Nema, Scott C. Brakenridge, Grant E. O’Keefe, Juhua Hu

Main category: cs.LG

TL;DR: A new public dataset for early detection of post-traumatic sepsis in ICU patients, addressing limitations of existing datasets that treat all ICU patients uniformly and ignore trauma-specific challenges.

Details

Motivation: Existing sepsis datasets treat ICU patients as uniform groups, neglecting trauma patients where injury-related inflammation overlaps with sepsis features, creating need for targeted post-traumatic sepsis detection methods.

Method: Created standardized post-trauma sepsis onset dataset from MIMIC-III by extracting, relabeling using standardized post-trauma clinical facts, and validating data; framed detection as daily rare event problem according to ICU clinical workflow.

Result: Established comprehensive benchmark showing necessity for further advancements using this new dataset; data and code publicly available on GitHub.

Conclusion: Targeted identification of post-traumatic sepsis is needed for early detection; new dataset enables development of specialized methods for trauma patients in ICU settings.

Abstract: Sepsis is a major public health concern due to its high morbidity, mortality, and cost. Its clinical outcome can be substantially improved through early detection and timely intervention. By leveraging publicly available datasets, machine learning (ML) has driven advances in both research and clinical practice. However, existing public datasets consider ICU patients (Intensive Care Unit) as a uniform group and neglect the potential challenges presented by critically ill trauma patients in whom injury-related inflammation and organ dysfunction can overlap with the clinical features of sepsis. We propose that a targeted identification of post-traumatic sepsis is necessary in order to develop methods for early detection. Therefore, we introduce a publicly available standardized post-trauma sepsis onset dataset extracted, relabeled using standardized post-trauma clinical facts, and validated from MIMIC-III. Furthermore, we frame early detection of post-trauma sepsis onset according to clinical workflow in ICUs in a daily basis resulting in a new rare event detection problem. We then establish a general benchmark through comprehensive experiments, which shows the necessity of further advancements using this new dataset. The data code is available at https://github.com/ML4UWHealth/SepsisOnset_TraumaCohort.git.

[560] Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

Hiroki Naganuma, Shagun Gupta, Youssef Briki, Ioannis Mitliagkas, Irina Rish, Parameswaran Raman, Hao-Jun Michael Shi

Main category: cs.LG

TL;DR: Derived gradient noise scales for non-Euclidean optimizers (signSGD/Signum and specSGD/Muon) and proposed efficient variance estimation for adaptive batch sizing in distributed systems.

Details

Motivation: Modern ML systems use brittle heuristics for batch size tuning; existing adaptive strategies based on Euclidean gradient noise scale don't match non-Euclidean optimizers like signSGD and specSGD.

Method: Derived gradient noise scales from the geometry of dual norms for signSGD (ℓ∞) and specSGD (S∞), and proposed efficient variance estimation using local mini-batch gradients across distributed ranks.

Result: Adaptive batch sizing with non-Euclidean GNS matched validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160M parameter Llama model.

Conclusion: Non-Euclidean gradient noise scales enable principled adaptive batch sizing for popular non-Euclidean optimizers, improving training efficiency without sacrificing performance.

Abstract: To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD’s Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.

[561] 3D-Learning: Diffusion-Augmented Distributionally Robust Decision-Focused Learning

Jiaqi Wen, Lei Fan, Jianyi Yang

Main category: cs.LG

TL;DR: A framework called 3D-Learning that uses diffusion models to find worst-case distributions for robust decision-making in ML pipelines, tested on LLM resource provisioning.

Details

Motivation: ML predictors in Predict-then-Optimize pipelines are vulnerable to out-of-distribution samples, causing poor decision performance. Need robust methods that work under worst-case distributions.

Method: Proposes Distributionally Robust Decision-Focused Learning (DR-DFL) with Diffusion-Augmented approach (3D-Learning). Uses diffusion models to parameterize and search for worst-case distributions that remain realistic, balancing average and worst-case performance.

Result: Empirical results on LLM resource provisioning show 3D-Learning outperforms existing Distributionally Robust Optimization and Data Augmentation methods in OOD generalization.

Conclusion: 3D-Learning effectively addresses OOD generalization challenges by leveraging diffusion models to find realistic worst-case distributions, improving decision performance in ML pipelines.

Abstract: Predict-then-Optimize (PTO) pipelines are widely employed in computing and networked systems, where Machine Learning (ML) models are used to predict critical contextual information for downstream decision-making tasks such as cloud LLM serving, data center demand response, and edge workload scheduling. However, these ML predictors are often vulnerable to out-of-distribution (OOD) samples at test time, leading to significant decision performance degradation due to large prediction errors. To address the generalization challenges under OOD conditions, we present the framework of Distributionally Robust Decision-Focused Learning (DR-DFL), which trains ML models to optimize decision performance under the worst-case distribution. Instead of relying on classical Distributionally Robust Optimization (DRO) techniques, we propose Diffusion-Augmented Distributionally Robust Decision-Focused Learning (3D-Learning), which searches for the worst-case distribution within the parameterized space of a diffusion model. By leveraging the powerful distribution modeling capabilities of diffusion models, 3D-Learning identifies worst-case distributions that remain consistent with real data, achieving a favorable balance between average and worst-case scenarios. Empirical results on an LLM resource provisioning task demonstrate that 3D-Learning outperforms existing DRO and Data Augmentation methods in OOD generalization performance.

[562] SAFE-KD: Risk-Controlled Early-Exit Distillation for Vision Backbones

Salim Khazem

Main category: cs.LG

TL;DR: SAFE-KD is a universal multi-exit wrapper for vision backbones that combines hierarchical distillation with conformal risk control to provide guaranteed risk bounds for early-exit decisions.

Details

Motivation: Early-exit networks reduce inference cost but lack guarantees about when early exit is safe. The paper aims to provide a principled approach with statistical guarantees for early-exit decisions in vision models.

Method: Attaches lightweight exit heads at intermediate depths, uses Decoupled Knowledge Distillation (DKD) to distill a strong teacher into all exits, enforces deep-to-shallow consistency, and calibrates per-exit stopping thresholds using conformal risk control on held-out data.

Result: Across multiple datasets and architectures, SAFE-KD yields improved accuracy-compute trade-offs, stronger calibration, robust performance under corruption, and provides finite-sample risk guarantees.

Conclusion: SAFE-KD provides a principled framework for early-exit networks with statistical guarantees, making them more practical for deployment by ensuring safe early exits with controlled risk.

Abstract: Early-exit networks reduce inference cost by allowing ``easy’’ inputs to stop early, but practical deployment hinges on knowing \emph{when} early exit is safe. We introduce SAFE-KD, a universal multi-exit wrapper for modern vision backbones that couples hierarchical distillation with \emph{conformal risk control}. SAFE-KD attaches lightweight exit heads at intermediate depths, distills a strong teacher into all exits via Decoupled Knowledge Distillation (DKD), and enforces deep-to-shallow consistency between exits. At inference, we calibrate per-exit stopping thresholds on a held-out set using conformal risk control (CRC) to guarantee a user-specified \emph{selective} misclassification risk (among the samples that exit early) under exchangeability. Across multiple datasets and architectures, SAFE-KD yields improved accuracy compute trade-offs, stronger calibration, and robust performance under corruption while providing finite-sample risk guarantees.

[563] Causal Graph Spatial-Temporal Autoencoder for Reliable and Interpretable Process Monitoring

Xiangrui Zhang, Chunyue Song, Wei Dai, Zheng Zhang, Kaihua Gao, Furong Gao

Main category: cs.LG

TL;DR: CGSTAE is a causal graph spatial-temporal autoencoder for industrial process monitoring that learns correlation graphs via spatial self-attention, derives causal graphs using a novel three-step algorithm based on causal invariance, and uses GCLSTM for sequence-to-sequence reconstruction to enable fault detection.

Details

Motivation: To improve reliability and interpretability of industrial process monitoring by capturing dynamic relationships between variables and uncovering causal structures from correlation patterns, enabling more effective fault detection.

Method: Combines spatial self-attention mechanism (SSAM) for learning correlation graphs, a three-step causal graph structure learning algorithm based on causal invariance principle to derive causal graphs, and graph convolutional LSTM (GCLSTM) spatial-temporal encoder-decoder for sequence-to-sequence reconstruction.

Result: Validated effectiveness through Tennessee Eastman process and real-world air separation process, demonstrating improved process monitoring and fault detection capabilities using feature space and residual space statistics.

Conclusion: CGSTAE provides an effective framework for industrial process monitoring that combines causal discovery with spatial-temporal modeling, offering both reliability and interpretability for fault detection applications.

Abstract: To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

[564] Variational Sparse Paired Autoencoders (vsPAIR) for Inverse Problems and Uncertainty Quantification

Jack Michael Solomon, Rishi Leburu, Matthias Chung

Main category: cs.LG

TL;DR: vsPAIR: Variational Sparse Paired Autoencoder for solving inverse problems with interpretable uncertainty estimation through paired sparse VAEs and learned latent mapping.

Details

Motivation: Inverse problems require not just point estimates but interpretable uncertainty, and providing fast inference with uncertainty estimates remains challenging in many scientific and engineering applications.

Method: Pairs a standard VAE encoding observations with a sparse VAE encoding quantities of interest, connected through learned latent mapping. Uses hard-concrete spike-and-slab relaxation for differentiable training and beta hyperprior for adaptive sparsity.

Result: Demonstrated effectiveness on blind inpainting and computed tomography experiments, showing vsPAIR is a capable inverse problem solver providing interpretable and structured uncertainty estimates.

Conclusion: vsPAIR successfully addresses the challenge of providing interpretable uncertainty in inverse problems through its paired sparse VAE architecture with learned latent mapping.

Abstract: Inverse problems are fundamental to many scientific and engineering disciplines; they arise when one seeks to reconstruct hidden, underlying quantities from noisy measurements. Many applications demand not just point estimates but interpretable uncertainty. Providing fast inference alongside uncertainty estimates remains challenging yet desirable in numerous applications. We propose the Variational Sparse Paired Autoencoder (vsPAIR) to address this challenge. The architecture pairs a standard VAE encoding observations with a sparse VAE encoding quantities of interest, connected through a learned latent mapping. The variational structure enables uncertainty estimation, the paired architecture encourages interpretability by anchoring QoI representations to clean data, and sparse encodings provide structure by concentrating information into identifiable factors rather than diffusing across all dimensions. We also propose modifications to existing sparse VAE methods: a hard-concrete spike-and-slab relaxation for differentiable training and a beta hyperprior for adaptive sparsity levels. To validate the effectiveness of our proposed architecture, we conduct experiments on blind inpainting and computed tomography, demonstrating that vsPAIR is a capable inverse problem solver that can provide interpretable and structured uncertainty estimates.

[565] Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning

Jiayao Mai, Bangyan Liao, Zhenjun Zhao, Yingping Zeng, Haoang Li, Javier Civera, Tailin Wu, Yi Zhou, Peidong Liu

Main category: cs.LG

TL;DR: NPC is a neural predictor-corrector framework that uses reinforcement learning to automatically learn optimal step sizes and termination policies for homotopy methods, replacing hand-crafted heuristics across multiple optimization problems.

Details

Motivation: Homotopy methods are widely used for solving challenging problems like robust optimization, global optimization, polynomial root-finding, and sampling, but practical solvers rely on suboptimal, task-specific hand-crafted heuristics for step sizes and iteration termination.

Method: Unify homotopy problems under a single framework and propose Neural Predictor-Corrector (NPC) which formulates policy selection as a sequential decision-making problem, using reinforcement learning to automatically discover efficient strategies. Introduce amortized training for one-time offline training on problem classes and efficient online inference.

Result: Experiments on four representative homotopy problems show NPC generalizes effectively to unseen instances, consistently outperforms classical and specialized baselines in efficiency, and demonstrates superior stability across tasks.

Conclusion: The unified neural framework for homotopy methods successfully replaces hand-crafted heuristics with learned policies, demonstrating the value of unifying homotopy methods into a single neural framework with better efficiency and stability.

Abstract: The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.

[566] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer

Main category: cs.LG

TL;DR: QVG is a training-free KV cache quantization framework for autoregressive video diffusion models that reduces memory usage by up to 7x with minimal latency overhead, addressing GPU memory bottlenecks in video generation.

Details

Motivation: Autoregressive video diffusion models suffer from KV cache memory bottlenecks that limit deployability on widely available hardware (often exceeding 30GB) and degrade long-horizon consistency in identity, layout, and motion due to constrained working memory.

Method: QVG uses Semantic Aware Smoothing to leverage video spatiotemporal redundancy for producing low-magnitude, quantization-friendly residuals, and Progressive Residual Quantization - a coarse-to-fine multi-stage scheme that reduces quantization error while enabling quality-memory trade-offs.

Result: QVG reduces KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead, establishes a new Pareto frontier between quality and memory efficiency, and consistently outperforms existing baselines in generation quality across multiple benchmarks.

Conclusion: QVG effectively addresses the KV cache memory bottleneck in autoregressive video diffusion models through training-free quantization techniques, enabling deployment on more accessible hardware while maintaining generation quality.

Abstract: Despite rapid progress in autoregressive video diffusion, an emerging system algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low magnitude, quantization friendly residuals. It further introduces Progressive Residual Quantization, a coarse to fine multi stage scheme that reduces quantization error while enabling a smooth quality memory trade off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end to end latency overhead while consistently outperforming existing baselines in generation quality.

[567] FedKRSO: Communication and Memory Efficient Federated Fine-Tuning of Large Language Models

Guohao Yang, Tongle Wu, Yuanxiong Guo, Ying Sun, Yanmin Gong

Main category: cs.LG

TL;DR: FedKRSO enables efficient full fine-tuning of LLMs in federated learning by using shared random low-dimensional subspaces, reducing communication and memory costs while maintaining performance comparable to full federated fine-tuning.

Details

Motivation: Federated learning for LLM fine-tuning faces challenges with high communication costs and memory requirements on resource-constrained clients. Existing PEFT methods reduce costs but sacrifice performance compared to full fine-tuning (FFT).

Method: FedKRSO uses K-seed random subspaces: server generates shared low-dimensional subspaces, clients update models within these subspaces to save memory, and transmit only subspace accumulators instead of full parameters, enabling efficient aggregation.

Result: Extensive experiments on GLUE benchmark show FedKRSO achieves superior performance with low communication/memory overhead, closely approximating federated FFT performance while overcoming PEFT limitations.

Conclusion: FedKRSO provides an efficient solution for federated LLM fine-tuning at the edge, balancing performance and resource efficiency through subspace optimization techniques.

Abstract: Fine-tuning is essential to adapt general-purpose large language models (LLMs) to domain-specific tasks. As a privacy-preserving framework to leverage decentralized data for collaborative model training, Federated Learning (FL) is gaining popularity in LLM fine-tuning, but remains challenging due to the high cost of transmitting full model parameters and computing full gradients on resource-constrained clients. While Parameter-Efficient Fine-Tuning (PEFT) methods are widely used in FL to reduce communication and memory costs, they often sacrifice model performance compared to FFT. This paper proposes FedKRSO (Federated $K$-Seed Random Subspace Optimization), a novel method that enables communication and memory efficient FFT of LLMs in federated settings. In FedKRSO, clients update the model within a shared set of random low-dimension subspaces generated by the server to save memory usage. Furthermore, instead of transmitting full model parameters in each FL round, clients send only the model update accumulators along the subspaces to the server, enabling efficient global model aggregation and dissemination. By using these strategies, FedKRSO can substantially reduce communication and memory overhead while overcoming the performance limitations of PEFT, closely approximating the performance of federated FFT. The convergence properties of FedKRSO are analyzed rigorously under general FL settings. Extensive experiments on the GLUE benchmark across diverse FL scenarios demonstrate that FedKRSO achieves both superior performance and low communication and memory overhead, paving the way towards on federated LLM fine-tuning at the resource-constrained edge.

[568] Human-Centric Traffic Signal Control for Equity: A Multi-Agent Action Branching Deep Reinforcement Learning Approach

Xiaocai Zhang, Neema Nassir, Lok Sang Chan, Milad Haghani

Main category: cs.LG

TL;DR: MA2B-DDQN is a human-centric multi-agent reinforcement learning framework for traffic signal coordination that optimizes traveler-level equity across pedestrians, vehicles, and transit passengers using action-branching discrete control.

Details

Motivation: Existing multi-agent DRL approaches for traffic signal coordination are vehicle-centric and struggle with high-dimensional discrete action spaces, failing to account for multimodal traveler equity across different transportation modes.

Method: Proposes MA2B-DDQN with action-branching discrete control that decomposes corridor control into local per-intersection actions (green time allocation between phases) and a single global action (total phase duration), using a human-centric reward that penalizes delayed individuals across all travel modes.

Result: Extensive evaluations across seven realistic traffic scenarios in Melbourne show significant reduction in impacted travelers compared to existing DRL and baseline methods, with minimal variance across diverse settings demonstrating robustness.

Conclusion: The framework provides a scalable, fair traffic signal system adaptable to varied urban conditions, advocating for human-centric optimization rather than vehicle-centric approaches.

Abstract: Coordinating traffic signals along multimodal corridors is challenging because many multi-agent deep reinforcement learning (DRL) approaches remain vehicle-centric and struggle with high-dimensional discrete action spaces. We propose MA2B-DDQN, a human-centric multi-agent action-branching double Deep Q-Network (DQN) framework that explicitly optimizes traveler-level equity. Our key contribution is an action-branching discrete control formulation that decomposes corridor control into (i) local, per-intersection actions that allocate green time between the next two phases and (ii) a single global action that selects the total duration of those phases. This decomposition enables scalable coordination under discrete control while reducing the effective complexity of joint decision-making. We also design a human-centric reward that penalizes the number of delayed individuals in the corridor, accounting for pedestrians, vehicle occupants, and transit passengers. Extensive evaluations across seven realistic traffic scenarios in Melbourne, Australia, demonstrate that our approach significantly reduces the number of impacted travelers, outperforming existing DRL and baseline methods. Experiments confirm the robustness of our model, showing minimal variance across diverse settings. This framework not only advocates for a fairer traffic signal system but also provides a scalable solution adaptable to varied urban traffic conditions.

[569] Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation

Jinyan Ye, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

Main category: cs.LG

TL;DR: Spectral Evolution Search (SES) is a plug-and-play framework for efficient inference-time optimization of visual generative models using low-frequency subspace evolutionary search.

Details

Motivation: Existing inference-time scaling approaches for aligning visual generative models with downstream objectives suffer from severe inefficiency due to optimizing high-dimensional initial noise, where many search directions have negligible influence on final generation.

Method: Proposes Spectral Evolution Search (SES) based on spectral bias insight: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Executes gradient-free evolutionary search within a low-frequency subspace derived from perturbation propagation dynamics.

Result: SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets across extensive experiments.

Conclusion: SES provides an efficient plug-and-play framework for inference-time optimization of visual generative models by exploiting spectral bias properties, offering better computational efficiency than existing approaches.

Abstract: Inference-time scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. However, existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We show that this inefficiency is closely related to a spectral bias in generative dynamics: model sensitivity to initial perturbations diminishes rapidly as frequency increases. Building on this insight, we propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace. Theoretically, we derive the Spectral Scaling Prediction from perturbation propagation dynamics, which explains the systematic differences in the impact of perturbations across frequencies. Extensive experiments demonstrate that SES significantly advances the Pareto frontier of generation quality versus computational cost, consistently outperforming strong baselines under equivalent budgets.

[570] Consistency Deep Equilibrium Models

Junchao Lin, Zenan Ling, Jingwen Xu, Robert C. Qiu

Main category: cs.LG

TL;DR: C-DEQ uses consistency distillation to accelerate Deep Equilibrium Models by training them to map intermediate states directly to fixed points, enabling few-step inference with 2-20× accuracy improvements.

Details

Motivation: DEQs offer infinite-depth modeling with constant memory but suffer from significant inference latency due to iterative fixed-point solvers. There's a need to accelerate DEQ inference while preserving performance.

Method: C-DEQ leverages consistency distillation by casting DEQ iterative inference as evolution along a fixed ODE trajectory. It trains models to consistently map intermediate states directly to the equilibrium point, enabling few-step inference while maintaining teacher DEQ performance.

Result: Extensive experiments show C-DEQs achieve consistent 2-20× accuracy improvements over implicit DEQs under the same few-step inference budget, while facilitating flexible multi-step evaluation for computation-performance trade-offs.

Conclusion: C-DEQ provides an effective framework for accelerating DEQ inference through consistency distillation, achieving significant accuracy gains with few-step inference while preserving the benefits of equilibrium modeling.

Abstract: Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieves consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget.

[571] Q-ShiftDP: A Differentially Private Parameter-Shift Rule for Quantum Machine Learning

Hoang M. Ngo, Nhat Hoang-Xuan, Quan Nguyen, Nguyen Do, Incheol Shin, My T. Thai

Main category: cs.LG

TL;DR: Q-ShiftDP: A differentially private mechanism for quantum machine learning that leverages quantum gradient properties to improve privacy-utility trade-offs.

Details

Motivation: Quantum Machine Learning offers computational advantages but faces data privacy challenges. Classical differential privacy methods like DP-SGD don't exploit unique quantum gradient properties, leading to suboptimal privacy-utility trade-offs in QML.

Method: Proposes Q-ShiftDP, a differentially private parameter-shift rule that leverages the inherent boundedness and stochasticity of quantum gradients. Combines carefully calibrated Gaussian noise with intrinsic quantum noise from quantum gradient estimation.

Result: Q-ShiftDP enables tighter sensitivity analysis, reduces noise requirements, and provides formal privacy and utility guarantees. Experiments on benchmark datasets show it consistently outperforms classical DP methods in QML.

Conclusion: Q-ShiftDP is the first privacy mechanism tailored to QML that effectively exploits quantum gradient properties, improving privacy-utility trade-offs by combining calibrated noise with intrinsic quantum noise.

Abstract: Quantum Machine Learning (QML) promises significant computational advantages, but preserving training data privacy remains challenging. Classical approaches like differentially private stochastic gradient descent (DP-SGD) add noise to gradients but fail to exploit the unique properties of quantum gradient estimation. In this work, we introduce the Differentially Private Parameter-Shift Rule (Q-ShiftDP), the first privacy mechanism tailored to QML. By leveraging the inherent boundedness and stochasticity of quantum gradients computed via the parameter-shift rule, Q-ShiftDP enables tighter sensitivity analysis and reduces noise requirements. We combine carefully calibrated Gaussian noise with intrinsic quantum noise to provide formal privacy and utility guarantees, and show that harnessing quantum noise further improves the privacy-utility trade-off. Experiments on benchmark datasets demonstrate that Q-ShiftDP consistently outperforms classical DP methods in QML.

[572] CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen, Yueqing Sun, Zishan Xu, Yu Yang, Tianhao Hu, Qi Gu, Hui Su, Xunliang Cai

Main category: cs.LG

TL;DR: CoBA-RL is a reinforcement learning algorithm that adaptively allocates rollout budgets based on model capability, using a capability-oriented value function and heap-based greedy strategy to optimize computational resource distribution for LLM post-training efficiency.

Details

Motivation: Standard RL frameworks for LLMs use uniform rollout budgets, causing resource inefficiency. Existing adaptive methods rely on instance-level metrics like task pass rates, which fail to capture the model's dynamic learning state during training.

Method: CoBA-RL uses a Capability-Oriented Value function to map tasks to potential training gains, then employs a heap-based greedy strategy to self-calibrate computational resource allocation to samples with high training value.

Result: Extensive experiments show CoBA-RL effectively balances exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks.

Conclusion: Quantifying sample training value and optimizing budget allocation are crucial for advancing LLM post-training efficiency, with CoBA-RL demonstrating effective adaptive resource allocation.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning.However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, failing to capture the model’s dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model’s evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources to samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.

[573] Co2PO: Coordinated Constrained Policy Optimization for Multi-Agent RL

Shrenik Patel, Christine Truong

Main category: cs.LG

TL;DR: Co2PO is a communication-augmented MARL framework that enables coordination-driven safety through risk-aware communication and proactive hazard prediction, achieving better performance than traditional constrained methods.

Details

Motivation: Existing constrained MARL approaches (like Lagrangian methods) use reactive constraints that suppress exploration and lead to over-conservatism. There's a need for proactive safety mechanisms that don't compromise performance.

Method: Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that forecasts potential violations over extended temporal horizons. It integrates these forecasts into constrained optimization with risk-triggered communication and adaptive gating.

Result: Co2PO achieves higher returns compared to leading constrained baselines while converging to cost-compliant policies at deployment across complex multi-agent safety benchmarks. Ablation studies validate the necessity of risk-triggered communication, adaptive gating, and shared memory components.

Conclusion: Co2PO demonstrates that proactive, communication-driven safety mechanisms can overcome the exploration-safety trade-off in constrained MARL, enabling better performance while maintaining safety constraints.

Abstract: Constrained multi-agent reinforcement learning (MARL) faces a fundamental tension between exploration and safety-constrained optimization. Existing leading approaches, such as Lagrangian methods, typically rely on global penalties or centralized critics that react to violations after they occur, often suppressing exploration and leading to over-conservatism. We propose Co2PO, a novel MARL communication-augmented framework that enables coordination-driven safety through selective, risk-aware communication. Co2PO introduces a shared blackboard architecture for broadcasting positional intent and yield signals, governed by a learned hazard predictor that proactively forecasts potential violations over an extended temporal horizon. By integrating these forecasts into a constrained optimization objective, Co2PO allows agents to anticipate and navigate collective hazards without the performance trade-offs inherent in traditional reactive constraints. We evaluate Co2PO across a suite of complex multi-agent safety benchmarks, where it achieves higher returns compared to leading constrained baselines while converging to cost-compliant policies at deployment. Ablation studies further validate the necessity of risk-triggered communication, adaptive gating, and shared memory components.

[574] Why Some Models Resist Unlearning: A Linear Stability Perspective

Wei-Kai Chang, Rajiv Khanna

Main category: cs.LG

TL;DR: Theoretical analysis of machine unlearning through asymptotic linear stability and data coherence, showing that stronger memorization makes forgetting easier in low SNR scenarios.

Details

Motivation: To provide theoretical understanding of machine unlearning, which has been mostly empirical, by analyzing when and why unlearning works through the lens of optimization dynamics and data geometry.

Method: Frames unlearning through asymptotic linear stability, analyzes data coherence (cross-sample alignment of loss surface directions), decomposes coherence along three axes (retain set, forget set, between them), and studies a two-layer ReLU CNN under signal-plus-noise model using random matrix theory tools.

Result: Proves tight stability thresholds separating convergence from divergence, shows that lower signal-to-noise ratio (weaker memorization) reduces coherence and makes unlearning easier, while high SNR models resist unlearning. Empirical verification shows Hessian tests and CNN heatmaps align with predicted boundaries.

Conclusion: Provides first principled account of trade-offs between memorization, coherence, and unlearning, establishing theoretical foundations for understanding gradient-based unlearning stability frontiers.

Abstract: Machine unlearning, the ability to erase the effect of specific training samples without retraining from scratch, is critical for privacy, regulation, and efficiency. However, most progress in unlearning has been empirical, with little theoretical understanding of when and why unlearning works. We tackle this gap by framing unlearning through the lens of asymptotic linear stability to capture the interaction between optimization dynamics and data geometry. The key quantity in our analysis is data coherence which is the cross sample alignment of loss surface directions near the optimum. We decompose coherence along three axes: within the retain set, within the forget set, and between them, and prove tight stability thresholds that separate convergence from divergence. To further link data properties to forgettability, we study a two layer ReLU CNN under a signal plus noise model and show that stronger memorization makes forgetting easier: when the signal to noise ratio (SNR) is lower, cross sample alignment is weaker, reducing coherence and making unlearning easier; conversely, high SNR, highly aligned models resist unlearning. For empirical verification, we show that Hessian tests and CNN heatmaps align closely with the predicted boundary, mapping the stability frontier of gradient based unlearning as a function of batching, mixing, and data/model alignment. Our analysis is grounded in random matrix theory tools and provides the first principled account of the trade offs between memorization, coherence, and unlearning.

[575] Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals

Zihan Dong, Zhixian Zhang, Yang Zhou, Can Jin, Ruijia Wu, Linjun Zhang

Main category: cs.LG

TL;DR: A statistically efficient evaluation framework for LLMs that uses pairwise comparison signals from auxiliary reasoning chains to reduce variance in accuracy estimates and improve model rankings.

Details

Motivation: Current evaluation of mathematical reasoning in LLMs suffers from limited benchmark sizes and model stochasticity, leading to high-variance accuracy estimates and unstable model rankings across platforms.

Method: Proposes a semiparametric estimator using control variates, treating pairwise comparison signals (where models judge which of two candidate solutions is better) as auxiliary information. Develops a one-step estimator based on the efficient influence function (EIF) that achieves semiparametric efficiency bounds.

Result: The one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes.

Conclusion: The framework provides statistically efficient evaluation with guaranteed variance reduction over naive averaging, asymptotic normality for uncertainty quantification, and more stable model rankings particularly in data-limited scenarios.

Abstract: Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. On difficult problems, an LLM may fail to produce a correct final answer, yet still provide reliable pairwise comparison signals indicating which of two candidate solutions is better. We leverage this observation to design a statistically efficient evaluation framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Treating these comparison signals as control variates, we develop a semiparametric estimator based on the efficient influence function (EIF) for the setting where auxiliary reasoning chains are observed. This yields a one-step estimator that achieves the semiparametric efficiency bound, guarantees strict variance reduction over naive sample averaging, and admits asymptotic normality for principled uncertainty quantification. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025, and GSM8K further demonstrate more precise performance estimation and more reliable model rankings, especially in small-sample regimes where conventional evaluation is pretty unstable.

[576] Learning to Repair Lean Proofs from Compiler Feedback

Evan Wang, Simon Chess, Daniel Lee, Siyuan Ge, Ajit Mallavarapu, Vasily Ilin

Main category: cs.LG

TL;DR: APRIL dataset enables supervised learning for Lean proof repair using compiler feedback, improving theorem prover error correction and diagnostic reasoning.

Details

Motivation: Existing Lean datasets contain mostly correct proofs, lacking supervision for understanding and repairing proof failures. As neural theorem provers become more agentic, interpreting and acting on compiler feedback is critical for proof repair.

Method: Introduces APRIL dataset with 260,000 supervised tuples pairing systematically generated proof failures with compiler diagnostics, repair targets, and natural-language explanations. Trains language models on this dataset for proof repair.

Result: Training on APRIL substantially improves repair accuracy and feedback-conditioned reasoning. A finetuned 4B-parameter model outperforms the strongest open-source baseline in single-shot repair evaluation.

Conclusion: Diagnostic-conditioned supervision provides complementary training signal for feedback-using theorem provers. The APRIL dataset enables better proof repair capabilities in Lean.

Abstract: As neural theorem provers become increasingly agentic, the ability to interpret and act on compiler feedback is critical. However, existing Lean datasets consist almost exclusively of correct proofs, offering little supervision for understanding and repairing failures. We study Lean proof repair as a supervised learning problem: given an erroneous proof and compiler feedback, predict both a corrected proof and a natural-language diagnosis grounded in the same feedback. We introduce APRIL (Automated Proof Repair in Lean), a dataset of 260,000 supervised tuples pairing systematically generated proof failures with compiler diagnostics and aligned repair and explanation targets. Training language models on APRIL substantially improves repair accuracy and feedback-conditioned reasoning; in our single-shot repair evaluation setting, a finetuned 4B-parameter model outperforms the strongest open-source baseline. We view diagnostic-conditioned supervision as a complementary training signal for feedback-using provers. Our dataset is available at \href{https://huggingface.co/datasets/uw-math-ai/APRIL}{this link}.

[577] Shortcut Features as Top Eigenfunctions of NTK: A Linear Neural Network Case and More

Jinwoo Lim, Suhyun Kim, Soo-Mook Moon

Main category: cs.LG

TL;DR: Analysis of shortcut learning in neural networks using Neural Tangent Kernel theory, showing shortcut features correspond to larger eigenvalues and persist despite margin control, with empirical validation on complex networks.

Details

Motivation: To understand the fundamental mechanisms behind shortcut learning in deep learning models, particularly why neural networks prefer to learn non-generalizable features from imbalanced training data distributions.

Method: Used Neural Tangent Kernel (NTK) framework to analyze linear neural networks, defining features as eigenfunctions of NTK. Examined shortcut features in imbalanced clustered distributions, showing they correspond to larger eigenvalues. Extended analysis to two-layer ReLU networks and ResNet-18 empirically.

Result: Found that shortcut features have larger eigenvalues in imbalanced distributions, maintain influence on network output after training due to data variances, and persist even when controlling output margins, indicating max-margin bias isn’t the sole cause.

Conclusion: Shortcut learning stems from structural properties of neural networks where imbalanced features correspond to larger NTK eigenvalues, and this preference persists beyond margin control, requiring new approaches to mitigate shortcut learning.

Abstract: One of the chronic problems of deep-learning models is shortcut learning. In a case where the majority of training data are dominated by a certain feature, neural networks prefer to learn such a feature even if the feature is not generalizable outside the training set. Based on the framework of Neural Tangent Kernel (NTK), we analyzed the case of linear neural networks to derive some important properties of shortcut learning. We defined a feature of a neural network as an eigenfunction of NTK. Then, we found that shortcut features correspond to features with larger eigenvalues when the shortcuts stem from the imbalanced number of samples in the clustered distribution. We also showed that the features with larger eigenvalues still have a large influence on the neural network output even after training, due to data variances in the clusters. Such a preference for certain features remains even when a margin of a neural network output is controlled, which shows that the max-margin bias is not the only major reason for shortcut learning. These properties of linear neural networks are empirically extended for more complex neural networks as a two-layer fully-connected ReLU network and a ResNet-18.

[578] From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection

Xueying Ding, Haomin Wen, Simon Klütterman, Leman Akoglu

Main category: cs.LG

TL;DR: OUTFORMER is a foundation model for outlier detection that uses synthetic priors and self-evolving curriculum training for zero-shot, plug-and-play deployment without requiring labeled outliers or model training.

Details

Motivation: Outlier detection deployment is hindered by lack of labeled outliers, making algorithm and hyperparameter selection difficult. While foundation models have transformed ML, there's a need for improved zero-shot outlier detection that doesn't require labeled data or custom training.

Method: OUTFORMER advances FoMo-0D with: (1) mixture of synthetic priors for pretraining solely on synthetic labeled datasets, and (2) self-evolving curriculum training. It uses in-context learning with training data as input for zero-shot inference requiring only forward passes.

Result: OUTFORMER achieves state-of-the-art performance on AdBench and two new large-scale OD benchmarks comprising over 1,500 datasets, while maintaining speedy inference.

Conclusion: OUTFORMER enables truly plug-and-play outlier detection deployment with zero-shot inference, requiring no labeled outliers, model training, or bespoke model selection.

Abstract: Outlier detection (OD) is widely used in practice; but its effective deployment on new tasks is hindered by lack of labeled outliers, which makes algorithm and hyperparameter selection notoriously hard. Foundation models (FMs) have transformed ML, and OD is no exception: Shen et. al. (2025) introduced FoMo-0D, the first FM for OD, achieving remarkable performance against numerous baselines. This work introduces OUTFORMER, which advances FoMo-0D with (1) a mixture of synthetic priors and (2) self-evolving curriculum training. OUTFORMER is pretrained solely on synthetic labeled datasets and infers test labels of a new task by using its training data as in-context input. Inference is fast and zero-shot, requiring merely forward pass and no labeled outliers. Thanks to in-context learning, it requires zero additional work-no OD model training or bespoke model selection-enabling truly plug-and-play deployment. OUTFORMER achieves state-of-the-art performance on the prominent AdBench, as well as two new large-scale OD benchmarks that we introduce, comprising over 1,500 datasets, while maintaining speedy inference.

[579] FlashSinkhorn: IO-Aware Entropic Optimal Transport

Felix X. -F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

Main category: cs.LG

TL;DR: FlashSinkhorn: An IO-aware entropic optimal transport solver using FlashAttention-style fusion for GPU efficiency, achieving 32× forward-pass speedups.

Details

Motivation: Current GPU solvers for entropic optimal transport via Sinkhorn iterations are inefficient at scale due to quadratic HBM traffic from dense interactions and limited fusion in existing implementations.

Method: Rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores (same normalization as transformer attention), enabling FlashAttention-style fusion and tiling with fused Triton kernels that stream tiles through on-chip SRAM.

Result: Achieves up to 32× forward-pass and 161× end-to-end speedups over state-of-the-art online baselines on point-cloud OT, with improved scalability on OT-based downstream tasks.

Conclusion: FlashSinkhorn provides an efficient, scalable EOT solver that substantially reduces HBM IO while retaining linear-memory operations, with open-source implementation available.

Abstract: Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present \textbf{FlashSinkhorn}, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/ot_triton.

[580] Clarify Before You Draw: Proactive Agents for Robust Text-to-CAD Generation

Bo Yuan, Zelin Zhao, Petr Molodyk, Bin Hu, Yongxin Chen

Main category: cs.LG

TL;DR: ProCAD: A proactive agentic framework for text-to-CAD generation that resolves ambiguous prompts through clarification questions before code synthesis, improving robustness and reducing errors.

Details

Motivation: Existing text-to-CAD systems struggle with under-specified or inconsistent geometric descriptions in natural language prompts, often hallucinating dimensions when faced with ambiguity. Current fine-tuned models tend to follow user instructions reactively without addressing specification issues.

Method: ProCAD uses a two-agent framework: (1) a proactive clarifying agent that audits prompts and asks targeted clarification questions only when necessary to produce self-consistent specifications, and (2) a CAD coding agent that translates specifications into executable CadQuery programs. The coding agent is fine-tuned on curated text-to-CadQuery data, while the clarifying agent is trained via agentic SFT on clarification trajectories.

Result: ProCAD significantly improves robustness to ambiguous prompts while keeping interaction overhead low. It outperforms frontier closed-source models like Claude Sonnet 4.5, reducing mean Chamfer distance by 79.9% and lowering invalidity ratio from 4.8% to 0.9%.

Conclusion: Proactive clarification before code synthesis effectively addresses ambiguity in text-to-CAD generation, leading to more robust and accurate parametric CAD program synthesis from natural language prompts.

Abstract: Large language models have recently enabled text-to-CAD systems that synthesize parametric CAD programs (e.g., CadQuery) from natural language prompts. In practice, however, geometric descriptions can be under-specified or internally inconsistent: critical dimensions may be missing and constraints may conflict. Existing fine-tuned models tend to reactively follow user instructions and hallucinate dimensions when the text is ambiguous. To address this, we propose a proactive agentic framework for text-to-CadQuery generation, named ProCAD, that resolves specification issues before code synthesis. Our framework pairs a proactive clarifying agent, which audits the prompt and asks targeted clarification questions only when necessary to produce a self-consistent specification, with a CAD coding agent that translates the specification into an executable CadQuery program. We fine-tune the coding agent on a curated high-quality text-to-CadQuery dataset and train the clarifying agent via agentic SFT on clarification trajectories. Experiments show that proactive clarification significantly improves robustness to ambiguous prompts while keeping interaction overhead low. ProCAD outperforms frontier closed-source models, including Claude Sonnet 4.5, reducing the mean Chamfer distance by 79.9 percent and lowering the invalidity ratio from 4.8 percent to 0.9 percent. Our code and datasets will be made publicly available.

[581] Fedcompass: Federated Clustered and Periodic Aggregation Framework for Hybrid Classical-Quantum Models

Yueheng Wang, Xing He, Zinuo Cai, Rui Zhang, Ruhui Ma, Yuan Liu, Rajkumar Buyya

Main category: cs.LG

TL;DR: FEDCOMPASS: A layered aggregation framework for hybrid classical-quantum federated learning that addresses performance degradation under non-IID data through spectral clustering and circular mean aggregation.

Details

Motivation: Hybrid classical-quantum federated learning suffers from performance degradation under non-IID data distributions across clients, which is a common real-world scenario that needs to be addressed.

Method: Proposes FEDCOMPASS with two key components: 1) Spectral clustering to group clients by class distribution similarity for cluster-wise aggregation of classical feature extractors, and 2) Circular mean aggregation combined with adaptive optimization for quantum parameters to ensure stable global updates.

Result: Experiments on three benchmark datasets show FEDCOMPASS improves test accuracy by up to 10.22% and enhances convergence stability under non-IID settings, outperforming six strong federated learning baselines.

Conclusion: FEDCOMPASS effectively addresses non-IID challenges in hybrid classical-quantum federated learning through its layered aggregation approach, demonstrating significant performance improvements and better convergence stability.

Abstract: Federated learning enables collaborative model training across decentralized clients under privacy constraints. Quantum computing offers potential for alleviating computational and communication burdens in federated learning, yet hybrid classical-quantum federated learning remains susceptible to performance degradation under non-IID data. To address this,we propose FEDCOMPASS, a layered aggregation framework for hybrid classical-quantum federated learning. FEDCOMPASS employs spectral clustering to group clients by class distribution similarity and performs cluster-wise aggregation for classical feature extractors. For quantum parameters, it uses circular mean aggregation combined with adaptive optimization to ensure stable global updates. Experiments on three benchmark datasets show that FEDCOMPASS improves test accuracy by up to 10.22% and enhances convergence stability under non-IID settings, outperforming six strong federated learning baselines.

[582] Scaling Continual Learning with Bi-Level Routing Mixture-of-Experts

Meng Lou, Yunxiang Fu, Yizhou Yu

Main category: cs.LG

TL;DR: CaRE: A scalable continual learner with bi-level routing mixture-of-experts for class-incremental learning on pre-trained models, achieving state-of-the-art performance on very long task sequences (100-300+ tasks).

Details

Motivation: Continual learning on pre-trained models faces challenges in learning both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences. Current methods struggle with scalability to hundreds of tasks.

Method: Proposes CaRE with Bi-Level Routing Mixture-of-Experts (BR-MoE): 1) router selection stage dynamically activates task-specific routers, 2) expert routing phase dynamically activates and aggregates experts to inject discriminative and comprehensive representations into every intermediate network layer.

Result: CaRE demonstrates leading performance across various datasets and task settings, including classical CIL settings (5-20 tasks). It’s the first continual learner scaling to very long task sequences (100-300+ non-overlapping tasks), outperforming all baselines by large margins on such sequences.

Conclusion: CaRE effectively addresses the scalability challenge in continual learning through its bi-level routing mechanism, enabling efficient learning over hundreds of tasks while maintaining both stability and plasticity.

Abstract: Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable {C}ontinual Le{a}rner with efficient Bi-Level {R}outing Mixture-of-{E}xperts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging evaluation protocol for comprehensively assessing CIL methods across very long task sequences spanning hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. Code will be publicly released at https://github.com/LMMMEng/CaRE.git.

[583] TMS: Trajectory-Mixed Supervision for Reward-Free, On-Policy SFT

Rana Muhammad Shahroz Khan, Zijie Liu, Zhen Tan, Charles Fleming, Tianlong Chen

Main category: cs.LG

TL;DR: TMS is a reward-free framework that uses historical model checkpoints to create a dynamic curriculum for SFT, reducing policy-label divergence and catastrophic forgetting while maintaining efficiency.

Details

Motivation: Address the trade-off between RL (good retention but complex/expensive) and SFT (efficient but suffers from catastrophic forgetting due to supervision mismatch between evolving policy and static labels).

Method: Trajectory-Mixed Supervision (TMS) creates a dynamic curriculum by mixing training data from the model’s own historical checkpoints to minimize Policy-Label Divergence (PLD), approximating on-policy benefits of RL without rewards.

Result: TMS outperforms standard and iterative SFT across reasoning (MATH, GSM8K) and instruction-following benchmarks, shifting the accuracy-retention Pareto frontier and bridging the gap to RL without requiring reward models.

Conclusion: TMS provides an effective reward-free alternative to RL for improving LLM performance while mitigating catastrophic forgetting, with PLD drift serving as a predictive measure for forgetting.

Abstract: Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are the two dominant paradigms for enhancing Large Language Model (LLM) performance on downstream tasks. While RL generally preserves broader model capabilities (retention) better than SFT, it comes with significant costs: complex reward engineering, instability, and expensive on-policy sampling. In contrast, SFT is efficient but brittle, often suffering from catastrophic forgetting due to $\textbf{Supervision Mismatch}$: the divergence between the model’s evolving policy and static training labels. We address this trade-off with $\textbf{Trajectory-Mixed Supervision (TMS)}$, a reward-free framework that approximates the on-policy benefits of RL by creating a dynamic curriculum from the model’s own historical checkpoints. TMS minimizes $\textit{Policy-Label Divergence (PLD)}$, preventing the mode collapse that drives forgetting in standard SFT. Experiments across reasoning (MATH, GSM8K) and instruction-following benchmarks demonstrate that TMS effectively shifts the accuracy–retention Pareto frontier. While RL remains the gold standard for retention, TMS significantly outperforms standard and iterative SFT, bridging the gap to RL without requiring reward models or verifiers. Mechanistic analysis confirms that PLD drift accurately predicts forgetting and that TMS successfully mitigates this drift.

[584] Robust Representation Learning in Masked Autoencoders

Anika Shrivastava, Renu Rameshan, Samar Agnihotri

Main category: cs.LG

TL;DR: MAE representations are robust to image degradations, learn class-aware embeddings progressively across layers, and exhibit persistent global attention patterns, explaining their strong classification performance.

Details

Motivation: To understand why Masked Autoencoders (MAEs) achieve strong downstream classification performance despite being trained with a reconstruction objective, and to analyze the internal representations they learn.

Method: Layer-wise analysis of token embeddings to study class separability across network depth, examination of attention patterns in MAE vs standard ViTs, and introduction of two sensitivity indicators: directional alignment between clean/perturbed embeddings and head-wise feature retention under degradations.

Result: MAE learns class-aware representations that become increasingly separable across layers, exhibits early and persistent global attention (unlike standard ViTs), and shows robustness to image degradations like blur and occlusions.

Conclusion: MAE’s strong classification performance stems from its ability to learn robust, class-aware representations through its pretraining objective, with progressive feature separation across layers and consistent global attention patterns.

Abstract: Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In this process we discover that representations learned with the pretraining and fine-tuning, are quite robust - demonstrating a good classification performance in the presence of degradations, such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify feature robustness, we introduce two sensitivity indicators: directional alignment between clean and perturbed embeddings, and head-wise retention of active features under degradations. These studies help establish the robust classification performance of MAEs.

[585] PRISM: Structured Optimization via Anisotropic Spectral Shaping

Yujie Yang

Main category: cs.LG

TL;DR: PRISM is an optimizer that enhances first-order spectral descent methods with partial second-order information via low-rank quasi-second-order preconditioning, enabling anisotropic spectral shaping with minimal computational overhead.

Details

Motivation: To improve first-order spectral descent methods (like Muon) by incorporating partial second-order information for better optimization performance, while maintaining computational efficiency comparable to first-order methods.

Method: Uses innovation-augmented polar decomposition to construct an efficient, low-rank quasi-second-order preconditioner that performs anisotropic spectral shaping - adaptively suppressing updates in high-variance subspaces while preserving update strength in signal-dominated directions.

Result: Achieves curvature-adaptive properties in spectral optimization with minimal computational overhead and zero additional memory compared to first-order baselines.

Conclusion: PRISM demonstrates a practical strategy for integrating curvature-adaptive properties into the spectral optimization paradigm while maintaining computational efficiency.

Abstract: We propose PRISM, an optimizer that enhances first-order spectral descent methods like Muon with partial second-order information. It constructs an efficient, low-rank quasi-second-order preconditioner via innovation-augmented polar decomposition. This mechanism enables PRISM to perform anisotropic spectral shaping, which adaptively suppresses updates in high-variance subspaces while preserving update strength in signal-dominated directions. Crucially, this is achieved with minimal computational overhead and zero additional memory compared to first-order baselines. PRISM demonstrates a practical strategy for integrating curvature-adaptive properties into the spectral optimization paradigm.

[586] Geometry-Preserving Neural Architectures on Manifolds with Boundary

Karthik Elamvazhuthi, Shiba Biswal, Kian Rosenblum, Arushi Katyal, Tianli Qu, Grady Ma, Rishi Sonthalia

Main category: cs.LG

TL;DR: A framework for geometry-aware neural architectures that preserve geometric structure through interleaved geometric updates and projections on manifolds

Details

Motivation: Preserving geometric structure is crucial in machine learning, especially for data living on manifolds. Many applications involve constrained optimization or dynamics on curved spaces where standard Euclidean neural networks fail to respect the underlying geometry.

Method: Proposes a unified class of geometry-aware architectures that interleave geometric updates between layers, where both projection layers and intrinsic exponential map updates arise as discretizations of projected dynamical systems on manifolds. Also learns projections via small-time heat-kernel limits when constraint sets are unknown.

Result: Establishes universal approximation results for constrained neural ODEs. Demonstrates exact feasibility for analytic updates on S^2 and SO(3), and strong performance for learned projections on S^{d-1}-valued features using diffusion/flow-matching as data-based projections.

Conclusion: The framework provides principled geometry-aware neural architectures with theoretical guarantees, enabling effective learning on manifolds while preserving geometric structure through both analytic and learned projection mechanisms.

Abstract: Preserving geometric structure is important in learning. We propose a unified class of geometry-aware architectures that interleave geometric updates between layers, where both projection layers and intrinsic exponential map updates arise as discretizations of projected dynamical systems on manifolds (with or without boundary). Within this framework, we establish universal approximation results for constrained neural ODEs. We also analyze architectures that enforce geometry only at the output, proving a separate universal approximation property that enables direct comparison to interleaved designs. When the constraint set is unknown, we learn projections via small-time heat-kernel limits, showing diffusion/flow-matching can be used as data-based projections. Experiments on dynamics over S^2 and SO(3), and diffusion on S^{d-1}-valued features demonstrate exact feasibility for analytic updates and strong performance for learned projections

[587] TextME: Bridging Unseen Modalities Through Text Descriptions

Soyeon Hong, Jinchan Kim, Jaegook You, Seungtaek Choi, Suha Kwak, Hyunsouk Cho

Main category: cs.LG

TL;DR: TextME: A text-only framework for expanding multimodal representations to novel modalities without paired datasets by projecting diverse modalities into LLM embedding space using only text descriptions.

Details

Motivation: Current multimodal representation expansion is limited by reliance on expensive paired datasets (text-image, text-audio, etc.), which are particularly challenging in expert domains like medical imaging and molecular analysis where annotation is costly and often infeasible.

Method: TextME projects diverse modalities into LLM embedding space as a unified anchor, exploiting geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions without paired supervision.

Result: Demonstrates consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, showing text-only training preserves substantial performance of pretrained encoders and enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training.

Conclusion: Text-only training serves as a practical alternative to paired supervision for modality expansion, enabling efficient expansion to novel modalities without costly dataset collection.

Abstract: Expanding multimodal representations to novel modalities is constrained by reliance on large-scale paired datasets (e.g., text-image, text-audio, text-3D, text-molecule), which are costly and often infeasible in domains requiring expert annotation such as medical imaging and molecular analysis. We introduce TextME, the first text-only modality expansion framework, to the best of our knowledge, projecting diverse modalities into LLM embedding space as a unified anchor. Our approach exploits the geometric structure of pretrained contrastive encoders to enable zero-shot cross-modal transfer using only text descriptions, without paired supervision. We empirically validate that such consistent modality gaps exist across image, video, audio, 3D, X-ray, and molecular domains, demonstrating that text-only training can preserve substantial performance of pretrained encoders. We further show that our framework enables emergent cross-modal retrieval between modality pairs not explicitly aligned during training (e.g., audio-to-image, 3D-to-image). These results establish text-only training as a practical alternative to paired supervision for modality expansion.

[588] Consensus Group Relative Policy Optimization for Text Generation

Yuki Ichihara, Yuu Jinnai, Kaito Ariu, Eiji Uchibe

Main category: cs.LG

TL;DR: C-GRPO distills Minimum Bayes Risk decoding into training using group-relative policy optimization, eliminating inference-time sampling overhead while maintaining MBR performance.

Details

Motivation: Sample-and-rerank methods like MBR decoding are effective but computationally expensive during inference due to repeated sampling and scoring. Existing amortization approaches require gold references, teacher labels, or curated preference data, increasing dataset construction effort.

Method: Proposes Consensus Group Relative Policy Optimization (C-GRPO) which formulates consensus utility as a group-relative objective within GRPO framework. Only requires a utility function and policy samples, no gold references or explicit preference labels needed.

Result: Experiments on WMT 2024 machine translation and XSum text summarization show C-GRPO achieves performance comparable to MBR decoding without inference-time overhead, outperforming reference-free baseline methods.

Conclusion: C-GRPO successfully distills MBR decoding into training, eliminating computational costs while maintaining performance, with theoretical convergence guarantees under ideal conditions.

Abstract: Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the objective function of C-GRPO is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO successfully achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.

[589] Function-Space Empirical Bayes Regularisation with Large Vision-Language Model Priors

Pengcheng Hao, Huaze Tang, Ercan Engin Kuruoglu, Wenbo Ding

Main category: cs.LG

TL;DR: VLM-FS-EB: A function-space empirical Bayes framework using vision-language models to generate semantic context points for expressive functional priors in Bayesian deep learning, improving predictive performance and uncertainty estimation.

Details

Motivation: Bayesian deep learning needs informative priors that scale to high-dimensional data. Existing functional VI methods use GP priors with limited expressiveness in high dimensions. Vision-language models offer rich semantic information that could create more expressive functional priors.

Method: Proposes VLM-FS-EB: uses VLMs to generate semantically meaningful synthetic context points, then uses VLM embeddings to construct expressive functional priors in a function-space empirical Bayes regularization framework.

Result: Method consistently improves predictive performance and yields more reliable uncertainty estimates, especially in out-of-distribution detection tasks and data-scarce regimes compared to various baselines.

Conclusion: VLMs can effectively generate semantic context points for constructing expressive functional priors in Bayesian deep learning, addressing limitations of traditional GP priors in high-dimensional regimes.

Abstract: Bayesian deep learning (BDL) provides a principled framework for reliable uncertainty quantification by combining deep neural networks with Bayesian inference. A central challenge in BDL lies in the design of informative prior distributions that scale effectively to high-dimensional data. Recent functional variational inference (VI) approaches address this issue by imposing priors directly in function space; however, most existing methods rely on Gaussian process (GP) priors, whose expressiveness and generalisation capabilities become limited in high-dimensional regimes. In this work, we propose VLM-FS-EB, a novel function-space empirical Bayes regularisation framework, leveraging large vision-language models (VLMs) to generates semantically meaningful context points. These synthetic samples are then used VLMs for embeddings to construct expressive functional priors. Furthermore, the proposed method is evaluated against various baselines, and experimental results demonstrate that our method consistently improves predictive performance and yields more reliable uncertainty estimates, particularly in out-of-distribution (OOD) detection tasks and data-scarce regimes.

[590] Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

Yinggan Xu, Risto Miikkulainen, Xin Qiu

Main category: cs.LG

TL;DR: QES enables direct fine-tuning of quantized LLMs without backpropagation by using evolution strategies with error feedback and stateless seed replay.

Details

Motivation: PTQ makes LLMs static and hard to fine-tune since standard methods rely on backpropagation and high-precision weights, which don't work with discrete quantized parameters.

Method: Quantized Evolution Strategies (QES) with two innovations: (1) accumulated error feedback to preserve gradient signals, and (2) stateless seed replay to reduce memory usage to low-precision inference levels.

Result: QES significantly outperforms state-of-the-art zeroth-order fine-tuning on arithmetic reasoning tasks, enabling direct fine-tuning of quantized models.

Conclusion: QES opens up the possibility for scaling up LLMs entirely in the quantized space by making direct fine-tuning of quantized models possible.

Abstract: Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high-precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning method on arithmetic reasoning tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized-Evolution-Strategies .

[591] Contrastive Concept-Tree Search for LLM-Assisted Algorithm Discovery

Timothee Leleu, Sudeera Gunathilaka, Federico Ghimenti, Surya Ganguli

Main category: cs.LG

TL;DR: CCTS improves LLM-assisted algorithm discovery by extracting hierarchical concept representations from generated programs and using contrastive learning to guide search toward useful concept combinations.

Details

Motivation: Current LLM-assisted algorithm discovery treats the process as black-box optimization without fully exploiting the LLM's internal representation of program space. There's a need to better utilize the hierarchical concept structure that LLMs naturally learn about possible programs.

Method: Contrastive Concept-Tree Search (CCTS) extracts hierarchical concept representations from generated programs, learns a contrastive concept model that guides parent selection, and reweights parents using likelihood-ratio scores between high- and low-performing solutions to bias search toward useful concept combinations.

Result: CCTS improves search efficiency over fitness-based baselines and produces interpretable, task-specific concept trees across Erdős-type combinatorics problems. Gains are largely driven by learning which concepts to avoid. Controlled synthetic environment validates these findings.

Conclusion: Explicit concept hierarchy extraction and contrastive learning significantly improve LLM-assisted algorithm discovery by providing better guidance than algorithm lineage alone, with interpretable concept trees offering insights into successful search strategies.

Abstract: Large language Model (LLM)-assisted algorithm discovery is an iterative, black-box optimization process over programs to approximatively solve a target task, where an LLM proposes candidate programs and an external evaluator provides task feedback. Despite intense recent research on the topic and promising results, how can the LLM internal representation of the space of possible programs be maximally exploited to improve performance is an open question. Here, we introduce Contrastive Concept-Tree Search (CCTS), which extracts a hierarchical concept representation from the generated programs and learns a contrastive concept model that guides parent selection. By reweighting parents using a likelihood-ratio score between high- and low-performing solutions, CCTS biases search toward useful concept combinations and away from misleading ones, providing guidance through an explicit concept hierarchy rather than the algorithm lineage constructed by the LLM. We show that CCTS improves search efficiency over fitness-based baselines and produces interpretable, task-specific concept trees across a benchmark of open Erdős-type combinatorics problems. Our analysis indicates that the gains are driven largely by learning which concepts to avoid. We further validate these findings in a controlled synthetic algorithm-discovery environment, which reproduces qualitatively the search dynamics observed with the LLMs.

[592] Enhanced Parcel Arrival Forecasting for Logistic Hubs: An Ensemble Deep Learning Approach

Xinyue Pan, Yujia Xu, Benoit Montreuil

Main category: cs.LG

TL;DR: Deep learning ensemble framework for forecasting parcel hub workloads using historical patterns and real-time data to improve logistics efficiency.

Details

Motivation: Online shopping growth demands more efficient parcel delivery logistics, requiring better hub workload forecasting for strategic planning and resource management.

Method: Novel deep learning-based ensemble framework that leverages historical arrival patterns and real-time parcel status updates to forecast hub workloads.

Result: Empirical tests show the ensemble method outperforms traditional forecasting techniques and standalone deep learning models in a major city case study.

Conclusion: The method has significant potential to improve operational efficiency in logistics hubs and warrants broader adoption.

Abstract: The rapid expansion of online shopping has increased the demand for timely parcel delivery, compelling logistics service providers to enhance the efficiency, agility, and predictability of their hub networks. In order to solve the problem, we propose a novel deep learning-based ensemble framework that leverages historical arrival patterns and real-time parcel status updates to forecast upcoming workloads at logistic hubs. This approach not only facilitates the generation of short-term forecasts, but also improves the accuracy of future hub workload predictions for more strategic planning and resource management. Empirical tests of the algorithm, conducted through a case study of a major city’s parcel logistics, demonstrate the ensemble method’s superiority over both traditional forecasting techniques and standalone deep learning models. Our findings highlight the significant potential of this method to improve operational efficiency in logistics hubs and advocate for its broader adoption.

[593] SATORIS-N: Spectral Analysis based Traffic Observation Recovery via Informed Subspaces and Nuclear-norm minimization

Sampad Mohanty, Bhaskar Krishnamachari

Main category: cs.LG

TL;DR: SATORIS-N: A subspace-aware matrix completion framework for traffic-density imputation using prior singular-subspace information from neighboring days, outperforming existing methods especially under high occlusion.

Details

Motivation: Traffic-density matrices exhibit low rank and stable correlations in singular-vector subspaces across days. Incomplete observations due to communication dropouts, sensor occlusions, or sparse vehicle penetration require reliable imputation for autonomous navigation applications like cooperative perception and V2X systems.

Method: Proposes a subspace-aware semidefinite programming formulation of nuclear norm that explicitly informs reconstruction with prior singular-subspace information. Also studies a lightweight implicit subspace-alignment strategy via concatenation of matrices from consecutive days to encourage alignment of spatial/temporal singular directions.

Result: Consistently outperforms standard matrix completion methods (SoftImpute, IterativeSVD), statistical, and deep learning baselines at high occlusion levels across Beijing and Shanghai datasets. Explicit SDP approach is markedly more robust than implicit strategy when large fractions of entries are missing.

Conclusion: SATORIS-N provides accurate traffic-density reconstruction essential for intelligent vehicles and V2X systems. The framework generalizes to other spatiotemporal settings where singular subspaces evolve slowly over time.

Abstract: Traffic-density matrices from different days exhibit both low rank and stable correlations in their singular-vector subspaces. Leveraging this, we introduce SATORIS-N, a framework for imputing partially observed traffic-density by informed subspace priors from neighboring days. Our contribution is a subspace-aware semidefinite programming (SDP)} formulation of nuclear norm that explicitly informs the reconstruction with prior singular-subspace information. This convex formulation jointly enforces low rank and subspace alignment, providing a single global optimum and substantially improving accuracy under medium and high occlusion. We also study a lightweight implicit subspace-alignment} strategy in which matrices from consecutive days are concatenated to encourage alignment of spatial or temporal singular directions. Although this heuristic offers modest gains when missing rates are low, the explicit SDP approach is markedly more robust when large fractions of entries are missing. Across two real-world datasets (Beijing and Shanghai), SATORIS-N consistently outperforms standard matrix-completion methods such as SoftImpute, IterativeSVD, statistical, and even deep learning baselines at high occlusion levels. The framework generalizes to other spatiotemporal settings in which singular subspaces evolve slowly over time. In the context of intelligent vehicles and vehicle-to-everything (V2X) systems, accurate traffic-density reconstruction enables critical applications including cooperative perception, predictive routing, and vehicle-to-infrastructure (V2I) communication optimization. When infrastructure sensors or vehicle-reported observations are incomplete - due to communication dropouts, sensor occlusions, or sparse connected vehicle penetration-reliable imputation becomes essential for safe and efficient autonomous navigation.

[594] MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning

Xiaoyu Tao, Mingyue Cheng, Ze Guo, Shuo Yu, Yaguo Liu, Qi Liu, Shijin Wang

Main category: cs.LG

TL;DR: MemCast is a learning-to-memory framework for time series forecasting that organizes experience into hierarchical memory and uses it for conditioned reasoning, enabling continual evolution.

Details

Motivation: Existing LLM-based time series forecasting methods lack explicit experience accumulation and continual evolution capabilities, limiting their ability to learn from past predictions and adapt over time.

Method: Reformulates TSF as experience-conditioned reasoning task with hierarchical memory: historical patterns from predictions, reasoning wisdom from inference trajectories, and general laws from temporal features. Uses dynamic confidence adaptation for continual evolution without test set distribution leakage.

Result: Extensive experiments on multiple datasets show MemCast consistently outperforms previous methods, validating the effectiveness of the hierarchical memory and continual evolution approach.

Conclusion: MemCast successfully addresses the limitations of existing LLM-based forecasters by introducing explicit experience accumulation and continual evolution through hierarchical memory organization and dynamic confidence adaptation.

Abstract: Time series forecasting (TSF) plays a critical role in decision-making for many real-world applications. Recently, LLM-based forecasters have made promising advancements. Despite their effectiveness, existing methods often lack explicit experience accumulation and continual evolution. In this work, we propose MemCast, a learning-to-memory framework that reformulates TSF as an experience-conditioned reasoning task. Specifically, we learn experience from the training set and organize it into a hierarchical memory. This is achieved by summarizing prediction results into historical patterns, distilling inference trajectories into reasoning wisdom, and inducing extracted temporal features into general laws. Furthermore, during inference, we leverage historical patterns to guide the reasoning process and utilize reasoning wisdom to select better trajectories, while general laws serve as criteria for reflective iteration. Additionally, to enable continual evolution, we design a dynamic confidence adaptation strategy that updates the confidence of individual entries without leaking the test set distribution. Extensive experiments on multiple datasets demonstrate that MemCast consistently outperforms previous methods, validating the effectiveness of our approach. Our code is available at https://github.com/Xiaoyu-Tao/MemCast-TS.

[595] What Makes a Good Example? Modeling Exemplar Selection with Neural Network Representations

Fanxiao Wani Qiu, Oscar Leong, Alexander LaTourrette

Main category: cs.LG

TL;DR: Humans select teaching exemplars by balancing representativeness and diversity, best modeled using neural network features with joint representativeness and diversity criteria, with transformers outperforming CNNs.

Details

Motivation: To understand the computational principles behind how humans select informative exemplars for teaching, specifically the tradeoffs between representativeness and diversity that prior work has identified but not fully explained.

Method: Used pretrained vision models (both convolutional and transformer-based) to embed novel visual categories along a one-dimensional morph continuum. Tested various subset selection strategies emphasizing prototypicality, joint representativeness, and diversity. Had adult participants select 1-3 exemplars for teaching, then compared human judgments with model predictions.

Result: Strategies based on joint representativeness, or its combination with diversity, best captured human exemplar selection. Purely prototypical or diversity-based strategies performed worse. Transformer-based representations consistently aligned more closely with human behavior than convolutional networks.

Conclusion: Dataset distillation methods from machine learning can serve as computational models for human teaching behavior, with transformer representations providing better alignment with human exemplar selection strategies.

Abstract: Teaching requires distilling a rich category distribution into a small set of informative exemplars. Although prior work shows that humans consider both representativeness and diversity when teaching, the computational principles underlying these tradeoffs remain unclear. We address this gap by modeling human exemplar selection using neural network feature representations and principled subset selection criteria. Novel visual categories were embedded along a one-dimensional morph continuum using pretrained vision models, and selection strategies varied in their emphasis on prototypicality, joint representativeness, and diversity. Adult participants selected one to three exemplars to teach a learner. Model-human comparisons revealed that strategies based on joint representativeness, or its combination with diversity, best captured human judgments, whereas purely prototypical or diversity-based strategies performed worse. Moreover, transformer-based representations consistently aligned more closely with human behavior than convolutional networks. These results highlight the potential utility of dataset distillation methods in machine learning as computational models for teaching.

[596] Reinforcement Learning with Promising Tokens for Large Language Models

Jing-Cheng Pang, Liang Lu, Xian Tang, Kun Jiang, Sijie Wu, Kai Zhang, Xubin Li

Main category: cs.LG

TL;DR: RLPT is a reinforcement learning framework for LLMs that reduces action space complexity by focusing policy optimization only on promising tokens identified via semantic priors, improving training stability and sample efficiency.

Details

Motivation: Standard RL for LLMs treats the full vocabulary as action space, including many irrelevant tokens that distract from meaningful decision-making. This massive action space causes training instability and inefficiency.

Method: RLPT decouples strategic decision-making from token generation by: 1) Using base model’s semantic priors to identify a dynamic set of promising tokens, 2) Constraining policy optimization exclusively to this refined subset via masking, 3) Reducing gradient variance by focusing on relevant action space.

Result: RLPT outperforms standard RL baselines on math, coding, and telecom reasoning tasks. It reduces gradient variance, stabilizes training, improves sample efficiency, and works effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).

Conclusion: By focusing RL optimization on promising tokens rather than the full vocabulary, RLPT addresses the action space complexity problem in LLM alignment, leading to more stable and efficient training while maintaining performance across diverse reasoning tasks.

Abstract: Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Standard approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. However, this formulation includes the massive tail of contextually irrelevant tokens in the action space, which could distract the policy from focusing on decision-making among the truly reasonable tokens. In this work, we verify that valid reasoning paths could inherently concentrate within a low-rank subspace. Based on this insight, we introduce Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation. Specifically, RLPT leverages the semantic priors of the base model to identify a dynamic set of \emph{promising tokens} and constrains policy optimization exclusively to this refined subset via masking. Theoretical analysis and empirical results demonstrate that RLPT effectively reduces gradient variance, stabilizes the training process, and improves sample efficiency. Experiment results on math, coding, and telecom reasoning show that RLPT outperforms standard RL baselines and integrates effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).

[597] StepScorer: Accelerating Reinforcement Learning with Step-wise Scoring and Psychological Regret Modeling

Zhe Xu

Main category: cs.LG

TL;DR: PRM accelerates RL by using step-wise regret signals instead of sparse rewards, achieving 36% faster convergence than PPO in benchmark environments.

Details

Motivation: Traditional reinforcement learning suffers from slow convergence due to sparse reward signals, especially in complex environments with delayed feedback. This makes RL inefficient for real-world applications like robotics and finance where rapid adaptation is needed.

Method: Introduces Psychological Regret Model (PRM) that computes regret signals after each decision step based on the difference between expected optimal action value and actual action value. This transforms sparse rewards into dense feedback through step-wise scoring.

Result: PRM achieves stable performance approximately 36% faster than traditional PPO in benchmark environments like Lunar Lander. It’s particularly effective in continuous control tasks and environments with delayed feedback.

Conclusion: PRM bridges behavioral economics and RL by formalizing human-inspired counterfactual thinking as computable regret signals, making it suitable for real-world applications requiring rapid policy adaptation.

Abstract: Reinforcement learning algorithms often suffer from slow convergence due to sparse reward signals, particularly in complex environments where feedback is delayed or infrequent. This paper introduces the Psychological Regret Model (PRM), a novel approach that accelerates learning by incorporating regret-based feedback signals after each decision step. Rather than waiting for terminal rewards, PRM computes a regret signal based on the difference between the expected value of the optimal action and the value of the action taken in each state. This transforms sparse rewards into dense feedback signals through a step-wise scoring framework, enabling faster convergence. We demonstrate that PRM achieves stable performance approximately 36% faster than traditional Proximal Policy Optimization (PPO) in benchmark environments such as Lunar Lander. Our results indicate that PRM is particularly effective in continuous control tasks and environments with delayed feedback, making it suitable for real-world applications such as robotics, finance, and adaptive education where rapid policy adaptation is critical. The approach formalizes human-inspired counterfactual thinking as a computable regret signal, bridging behavioral economics and reinforcement learning.

[598] Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

Yeongmin Kim, Donghyeok Shin, Byeonghu Na, Minsang Park, Richard Lee Kim, Il-Chul Moon

Main category: cs.LG

TL;DR: LiDAR sampling: A test-time scaling method for diffusion models that enables efficient sampling from high-reward regions using lookahead marginal samples without neural backpropagation.

Details

Motivation: Diffusion models generate samples that often don't fully align with human intent. Existing gradient guidance methods for reward alignment are computationally expensive due to sequential neural backpropagation at each time step.

Method: Proposes computing Expected Future Reward (EFR) using only marginal samples from pre-trained diffusion models, detaching neural dependency between current state and EFR. Introduces lookahead sampling to collect marginal samples efficiently, and uses an accurate solver to guide particles toward high-reward lookahead samples (LiDAR sampling).

Result: LiDAR achieves substantial performance improvements with only three samples and 3-step lookahead, showing steep performance gains with increased lookahead accuracy and sample count. Reaches same GenEval performance as latest gradient guidance method for SDXL with 9.5x speedup.

Conclusion: LiDAR sampling provides an efficient test-time scaling method for aligning diffusion model outputs with human intent, significantly reducing computational cost while maintaining or improving performance.

Abstract: Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies a test-time scaling method that enables sampling from regions with higher human-aligned reward values. Existing gradient guidance methods approximate the expected future reward (EFR) at an intermediate particle $\mathbf{x}_t$ using a Taylor approximation, but this approximation at each time step incurs high computational cost due to sequential neural backpropagation. We show that the EFR at any $\mathbf{x}_t$ can be computed using only marginal samples from a pre-trained diffusion model. The proposed EFR formulation detaches the neural dependency between $\mathbf{x}_t$ and the EFR, enabling closed-form guidance computation without neural backpropagation. To further improve efficiency, we introduce lookahead sampling to collect marginal samples. For final sample generation, we use an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves substantial performance improvements using only three samples with a 3-step lookahead solver, exhibiting steep performance gains as lookahead accuracy and sample count increase; notably, it reaches the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup.

[599] Adversarial construction as a potential solution to the experiment design problem in large task spaces

Prakhar Godara, Frederick Callaway, Marcelo G. Mattar

Main category: cs.LG

TL;DR: The paper proposes an adversarial construction approach to efficiently explore high-dimensional task spaces for studying human behavior, using binary sequence prediction tasks generated by hidden Markov models.

Details

Motivation: To address the lack of a robust, task-general theory of human behavior by developing a unified model for all tasks in a task-space, specifically for binary sequence prediction tasks generated by HMMs.

Method: Adversarial construction approach to identify tasks most likely to elicit novel behaviors, as exhaustive exploration of the entire task space is infeasible. This serves as a proxy for optimal experimental design in high-dimensional task spaces.

Result: Adversarial construction significantly outperforms random sampling of environments in identifying tasks that elicit qualitatively novel behaviors.

Conclusion: The adversarial construction approach provides an efficient method for exploring high-dimensional task spaces and could serve as a proxy for optimal experimental design in studying human behavior across diverse tasks.

Abstract: Despite decades of work, we still lack a robust, task-general theory of human behavior even in the simplest domains. In this paper we tackle the generality problem head-on, by aiming to develop a unified model for all tasks embedded in a task-space. In particular we consider the space of binary sequence prediction tasks where the observations are generated by the space parameterized by hidden Markov models (HMM). As the space of tasks is large, experimental exploration of the entire space is infeasible. To solve this problem we propose the adversarial construction approach, which helps identify tasks that are most likely to elicit a qualitatively novel behavior. Our results suggest that adversarial construction significantly outperforms random sampling of environments and therefore could be used as a proxy for optimal experimental design in high-dimensional task spaces.

[600] Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

Ming Shi

Main category: cs.LG

TL;DR: Online multi-objective resource selection algorithm with limited probing (probe-then-commit) for multi-radio access and edge computing, achieving Pareto frontier optimization with accelerated learning through partial feedback.

Details

Motivation: Addresses practical limitations in multi-radio access selection and mobile edge computing offloading where agents can probe multiple candidates via control-plane measurements but must commit to exactly one for execution, creating a feedback regime between classical bandits and full-information experts.

Method: Develops PtC-P-UCB algorithm with frontier-aware probing under uncertainty in Pareto mode: selects q probes by maximizing hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to expand attained Pareto region. Extends to multi-modal probing where each probe returns M modalities.

Result: Proves dominated-hypervolume frontier error of Õ(K_P d/√(qT)) and scalarized regret Õ(L_φ d√((K/q)T)), showing transparent 1/√q acceleration from limited probing. Multi-modal extension yields variance-adaptive bounds via effective noise scale.

Conclusion: The PtC-P-UCB algorithm effectively bridges the gap between bandits and full-information experts in multi-objective resource selection, providing theoretical guarantees for Pareto frontier optimization with limited probing capabilities.

Abstract: We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms) whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is \emph{probe-then-commit (PtC)}: the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime strictly interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these extremes. We develop \textsc{PtC-P-UCB}, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty in a Pareto mode, e.g., it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O} (K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and scalarized regret $\tilde{O} (L_φd\sqrt{(K/q)T})$, where $φ$ is the scalarizer. These quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend to \emph{multi-modal probing}: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.

[601] Topology Matters: A Cautionary Case Study of Graph SSL on Neuro-Inspired Benchmarks

May Kristine Jonson Carlon, Su Myat Noe, Haojiong Wang, Yasuo Kuniyoshi

Main category: cs.LG

TL;DR: Hierarchical SSL framework for brain connectome analysis fails catastrophically compared to classical topology-aware methods due to objective mismatch with topological properties.

Details

Motivation: To understand how local interactions give rise to global brain organization, researchers need models that can represent information across multiple scales, inspired by multimodal neuroimaging.

Method: Introduces a hierarchical self-supervised learning framework that jointly learns node-, edge-, and graph-level embeddings, tested on a controllable synthetic benchmark mimicking connectome topological properties with a four-stage evaluation protocol.

Result: Reveals critical failure: invariance-based SSL models are fundamentally misaligned with benchmark’s topological properties and catastrophically outperformed by classical topology-aware heuristics; ablations confirm objective mismatch where SSL learns to ignore community structure.

Conclusion: Exposes fundamental pitfall in applying generic graph SSL to connectome-like data, highlighting need for new topology-aware SSL objectives that explicitly reward preservation of structure (modularity, motifs) for neuro-AI research.

Abstract: Understanding how local interactions give rise to global brain organization requires models that can represent information across multiple scales. We introduce a hierarchical self-supervised learning (SSL) framework that jointly learns node-, edge-, and graph-level embeddings, inspired by multimodal neuroimaging. We construct a controllable synthetic benchmark mimicking the topological properties of connectomes. Our four-stage evaluation protocol reveals a critical failure: the invariance-based SSL model is fundamentally misaligned with the benchmark’s topological properties and is catastrophically outperformed by classical, topology-aware heuristics. Ablations confirm an objective mismatch: SSL objectives designed to be invariant to topological perturbations learn to ignore the very community structure that classical methods exploit. Our results expose a fundamental pitfall in applying generic graph SSL to connectome-like data. We present this framework as a cautionary case study, highlighting the need for new, topology-aware SSL objectives for neuro-AI research that explicitly reward the preservation of structure (e.g., modularity or motifs).

[602] From Scalar Rewards to Potential Trends: Shaping Potential Landscapes for Model-Based Reinforcement Learning

Yao-Hui Li, Zeyu Wang, Xin Li, Wei Pang, Yingfang Yuan, Zhengkun Chen, Boya Zhang, Riashat Islam, Alex Lamb, Yonggang Zhang

Main category: cs.LG

TL;DR: SLOPE is a model-based RL framework that replaces scalar reward regression with optimistic potential landscapes to improve performance in sparse reward environments.

Details

Motivation: Standard MBRL struggles with sparse rewards because regressing ground-truth scalar rewards creates flat, gradient-free landscapes that provide no directional guidance for planning.

Method: SLOPE shifts from predicting scalar rewards to constructing informative potential landscapes using optimistic distributional regression to estimate high-confidence upper bounds, amplifying rare success signals and ensuring exploration gradients.

Result: Evaluations on 30+ tasks across 5 benchmarks show SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense reward settings.

Conclusion: SLOPE’s approach of constructing optimistic potential landscapes rather than regressing scalar rewards effectively addresses the core limitation of MBRL in sparse reward environments.

Abstract: Model-based reinforcement learning (MBRL) achieves high sample efficiency by simulating future trajectories with learned dynamics and reward models. However, its effectiveness is severely compromised in sparse reward settings. The core limitation lies in the standard paradigm of regressing ground-truth scalar rewards: in sparse environments, this yields a flat, gradient-free landscape that fails to provide directional guidance for planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense rewards.

[603] Sparsity is Combinatorial Depth: Quantifying MoE Expressivity via Tropical Geometry

Ye Su, Huayi Tang, Zixuan Gong, Yong Liu

Main category: cs.LG

TL;DR: MoE architectures analyzed through tropical geometry, showing Top-k routing is isomorphic to k-th elementary symmetric tropical polynomial, revealing sparsity as combinatorial depth that scales geometric capacity combinatorially.

Details

Motivation: To provide a rigorous theoretical foundation for Mixture-of-Experts (MoE) architectures, moving beyond heuristic efficiency explanations to understand their geometric expressivity through tropical geometry.

Method: Analyzes MoE through tropical geometry lens, establishing isomorphism between Top-k routing and k-th elementary symmetric tropical polynomial. Introduces concept of Effective Capacity under Manifold Hypothesis and proves combinatorial resilience properties.

Result: Shows that sparsity is combinatorial depth scaling geometric capacity by binomial coefficient. Proves MoE architectures maintain high expressivity via transversality of routing cones while dense networks suffer capacity collapse on low-dimensional data.

Conclusion: Provides rigorous theoretical justification for MoE’s topological supremacy through unification of discrete geometry of Hypersimplex with continuous geometry of neural functions, explaining why conditional computation works.

Abstract: While Mixture-of-Experts (MoE) architectures define the state-of-the-art, their theoretical success is often attributed to heuristic efficiency rather than geometric expressivity. In this work, we present the first analysis of MoE through the lens of tropical geometry, establishing that the Top-$k$ routing mechanism is algebraically isomorphic to the $k$-th elementary symmetric tropical polynomial. This isomorphism partitions the input space into the Normal Fan of a Hypersimplex, revealing that \textbf{sparsity is combinatorial depth} which scales geometric capacity by the binomial coefficient $\binom{N}{k}$. Moving beyond ambient bounds, we introduce the concept of \textit{Effective Capacity} under the Manifold Hypothesis. We prove that while dense networks suffer from capacity collapse on low-dimensional data, MoE architectures exhibit \textit{Combinatorial Resilience}, maintaining high expressivity via the transversality of routing cones. In this study, our framework unifies the discrete geometry of the Hypersimplex with the continuous geometry of neural functions, offering a rigorous theoretical justification for the topological supremacy of conditional computation.

[604] GraDE: A Graph Diffusion Estimator for Frequent Subgraph Discovery in Neural Architectures

Yikang Yang, Zhengxin Yang, Minghao Luo, Luzhou Peng, Hongxiao Li, Wanling Gao, Lei Wang, Jianfeng Zhan

Main category: cs.LG

TL;DR: GraDE is a diffusion-guided search framework that uses graph diffusion models to efficiently discover frequent subgraph patterns in neural architectures, overcoming limitations of both enumeration and sampling methods.

Details

Motivation: Finding network motifs in neural architectures is important for optimization and design, but existing methods face a trade-off: enumeration is accurate but computationally prohibitive for large subgraphs, while sampling is tractable but suffers from poor discovery capability.

Method: Proposes GraDE framework with Graph Diffusion Estimator that uses graph diffusion models to score subgraph typicality within learned distributions, enabling efficient search for frequent patterns without exhaustive enumeration.

Result: The estimator achieves up to 114% improvement in ranking accuracy compared to sampling baselines, and the framework discovers large-scale frequent patterns with up to 30× higher median frequency than sampling methods.

Conclusion: GraDE successfully bridges the gap between computational feasibility and discovery capability for finding frequent subgraphs in neural architectures using diffusion-guided search.

Abstract: Finding frequently occurring subgraph patterns or network motifs in neural architectures is crucial for optimizing efficiency, accelerating design, and uncovering structural insights. However, as the subgraph size increases, enumeration-based methods are perfectly accurate but computationally prohibitive, while sampling-based methods are computationally tractable but suffer from a severe decline in discovery capability. To address these challenges, this paper proposes GraDE, a diffusion-guided search framework that ensures both computational feasibility and discovery capability. The key innovation is the Graph Diffusion Estimator (GraDE), which is the first to introduce graph diffusion models to identify frequent subgraphs by scoring their typicality within the learned distribution. Comprehensive experiments demonstrate that the estimator achieves superior ranking accuracy, with up to 114% improvement compared to sampling-based baselines. Benefiting from this, the proposed framework successfully discovers large-scale frequent patterns, achieving up to 30$\times$ higher median frequency than sampling-based methods.

[605] BayeSQP: Bayesian Optimization through Sequential Quadratic Programming

Paul Brunzema, Sebastian Trimpe

Main category: cs.LG

TL;DR: BayeSQP combines sequential quadratic programming with Bayesian optimization using second-order Gaussian processes to model objective/constraints from zero-order information, solving tractable second-order cone programs for high-probability improvements.

Details

Motivation: To bridge classical optimization techniques with modern black-box optimization approaches, addressing the need for efficient high-dimensional optimization that leverages both function values and derivative information while handling uncertainty.

Method: Uses second-order Gaussian process surrogates to model objective and constraints (function values, gradients, Hessians) from zero-order information. Constructs local subproblems using GP posterior estimates, solves as tractable second-order cone programs, and performs line search via constrained Thompson sampling.

Result: Empirical results show BayeSQP outperforms state-of-the-art methods in specific high-dimensional settings, offering a principled and flexible framework.

Conclusion: BayeSQP successfully bridges classical optimization with modern black-box optimization, providing an effective approach for high-dimensional problems with uncertainty-aware derivative modeling.

Abstract: We introduce BayeSQP, a novel algorithm for general black-box optimization that merges the structure of sequential quadratic programming with concepts from Bayesian optimization. BayeSQP employs second-order Gaussian process surrogates for both the objective and constraints to jointly model the function values, gradients, and Hessian from only zero-order information. At each iteration, a local subproblem is constructed using the GP posterior estimates and solved to obtain a search direction. Crucially, the formulation of the subproblem explicitly incorporates uncertainty in both the function and derivative estimates, resulting in a tractable second-order cone program for high probability improvements under model uncertainty. A subsequent one-dimensional line search via constrained Thompson sampling selects the next evaluation point. Empirical results show thatBayeSQP outperforms state-of-the-art methods in specific high-dimensional settings. Our algorithm offers a principled and flexible framework that bridges classical optimization techniques with modern approaches to black-box optimization.

[606] Periodic Regularized Q-Learning

Hyukjun Yang, Han-Dong Lim, Donghwan Lee

Main category: cs.LG

TL;DR: PRQ algorithm introduces regularization at projection operator level to ensure convergence of Q-learning under linear function approximation, providing finite-time convergence guarantees.

Details

Motivation: Standard Q-learning converges in tabular RL but fails under linear function approximation. Existing regularization techniques address this but the authors propose a new approach focusing on the projection operator.

Method: Propose periodic regularized Q-learning (PRQ) by first introducing regularization at the projection operator level to construct regularized projected value iteration (RP-VI), then extending to sample-based RL algorithm with theoretical analysis.

Result: Theoretical analysis proves finite-time convergence guarantees for PRQ under linear function approximation, making the projected value iteration a contraction through appropriate regularization.

Conclusion: PRQ provides a novel regularization approach at the projection operator level that ensures convergence of Q-learning under linear function approximation with rigorous theoretical guarantees.

Abstract: In reinforcement learning (RL), Q-learning is a fundamental algorithm whose convergence is guaranteed in the tabular setting. However, this convergence guarantee does not hold under linear function approximation. To overcome this limitation, a significant line of research has introduced regularization techniques to ensure stable convergence under function approximation. In this work, we propose a new algorithm, periodic regularized Q-learning (PRQ). We first introduce regularization at the level of the projection operator and explicitly construct a regularized projected value iteration (RP-VI), subsequently extending it to a sample-based RL algorithm. By appropriately regularizing the projection operator, the resulting projected value iteration becomes a contraction. By extending this regularized projection into the stochastic setting, we establish the PRQ algorithm and provide a rigorous theoretical analysis that proves finite-time convergence guarantees for PRQ under linear function approximation.

[607] Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models

Hicham Eddoubi, Umar Faruk Abdullahi, Fadi Hassan

Main category: cs.LG

TL;DR: The paper analyzes jailbreak attacks on LLMs, focusing on how adversarial token placement (prefix vs. suffix) affects attack success rates, revealing a blind spot in current safety evaluations.

Details

Motivation: As LLMs become widely adopted, robust safety alignment is crucial, but jailbreak attacks remain challenging. Current evaluations overlook how adversarial token placement affects attack effectiveness.

Method: Focuses on Greedy Coordinate Gradient (GCG) attack as case study. Investigates optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation.

Result: Both prefix optimization and varying adversarial token position substantially influence attack success rates, highlighting a critical blind spot in current safety evaluations.

Conclusion: Position of adversarial tokens is a crucial factor in jailbreak attacks that must be accounted for in adversarial robustness evaluation of LLMs.

Abstract: Large Language Models (LLMs) have seen widespread adoption across multiple domains, creating an urgent need for robust safety alignment mechanisms. However, robustness remains challenging due to jailbreak attacks that bypass alignment via adversarial prompts. In this work, we focus on the prevalent Greedy Coordinate Gradient (GCG) attack and identify a previously underexplored attack axis in jailbreak attacks typically framed as suffix-based: the placement of adversarial tokens within the prompt. Using GCG as a case study, we show that both optimizing attacks to generate prefixes instead of suffixes and varying adversarial token position during evaluation substantially influence attack success rates. Our findings highlight a critical blind spot in current safety evaluations and underline the need to account for the position of adversarial tokens in the adversarial robustness evaluation of LLMs.

[608] BlockRR: A Unified Framework of RR-type Algorithms for Label Differential Privacy

Haixia Liu, Yi Ding

Main category: cs.LG

TL;DR: BlockRR is a unified randomized-response mechanism for label differential privacy that generalizes existing RR-type mechanisms and maintains privacy guarantees through parallel composition.

Details

Motivation: Existing randomized-response mechanisms for label differential privacy require separate case-by-case analysis. The authors aim to create a unified framework that generalizes these approaches while maintaining privacy guarantees.

Method: BlockRR introduces a novel randomized-response mechanism that satisfies ε-label differential privacy. The method includes a partition technique based on weight matrices derived from label prior information, leveraging parallel composition principles to maintain privacy when combining mechanisms.

Result: Empirical evaluation on CIFAR-10 variants shows BlockRR achieves better balance between test accuracy and per-class accuracy in high-privacy regimes (ε ≤ 3.0). In low-privacy regimes (ε ≥ 4.0), all methods reduce to standard RR without additional performance loss.

Conclusion: BlockRR provides a unified framework for label differential privacy that outperforms existing methods in high-privacy settings while maintaining theoretical privacy guarantees through parallel composition.

Abstract: In this paper, we introduce BlockRR, a novel and unified randomized-response mechanism for label differential privacy. This framework generalizes existed RR-type mechanisms as special cases under specific parameter settings, which eliminates the need for separate, case-by-case analysis. Theoretically, we prove that BlockRR satisfies $ε$-label DP. We also design a partition method for BlockRR based on a weight matrix derived from label prior information; the parallel composition principle ensures that the composition of two such mechanisms remains $ε$-label DP. Empirically, we evaluate BlockRR on two variants of CIFAR-10 with varying degrees of class imbalance. Results show that in the high-privacy and moderate-privacy regimes ($ε\leq 3.0$), our propsed method gets a better balance between test accuaracy and the average of per-class accuracy. In the low-privacy regime ($ε\geq 4.0$), all methods reduce BlockRR to standard RR without additional performance loss.

[609] Entropy-Gated Selective Policy Optimization:Token-Level Gradient Allocation for Hybrid Training of Large Language Models

Yuelin Hu, Zhengxue Cheng, Wei Liu, Li Song

Main category: cs.LG

TL;DR: EGSPO is a three-stage hybrid training framework that combines SFT with RL using token-level entropy gating to modulate gradient allocation during policy optimization.

Details

Motivation: Current hybrid training methods for LLMs combine SFT and RL at the sample level, but this approach doesn't differentiate between high-entropy (uncertain) and low-entropy (confident) tokens, potentially leading to inefficient learning and reinforcement of confident errors.

Method: Three-stage framework: 1) SFT expert learning for warm-up policy, 2) RL rollout generation with per-token entropy computation, 3) EGSPO mechanism that routes high-entropy tokens to full PPO updates for exploration and low-entropy tokens to attenuated PPO updates to reduce variance while preserving knowledge.

Result: Achieves consistent improvements on mathematical reasoning benchmarks: 3.8% gain on AIME and 2.9% gain on MATH over CHORD phi baseline, with only 3.4% additional computational overhead.

Conclusion: EGSPO demonstrates that token-level gradient modulation based on predictive entropy can effectively balance exploration and exploitation in hybrid LLM training, leading to improved performance on reasoning tasks with minimal computational cost.

Abstract: Hybrid training methods for large language models combine supervised fine tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy Gated Selective Policy Optimization (EGSPO), a three stage framework that extends sample level mixing with token level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy gated gradient allocation: a predictive entropy module routes high entropy tokens to full PPO updates to encourage exploration, and low entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.

[610] Universal Approximation of Continuous Functionals on Compact Subsets via Linear Measurements and Scalar Nonlinearities

Andrey Krylov, Maksim Penkin

Main category: cs.LG

TL;DR: The paper studies universal approximation of continuous functionals on Hilbert space products using models that take linear measurements followed by scalar nonlinearities, extending to Banach-valued maps.

Details

Motivation: To provide theoretical justification for the common "measure, apply scalar nonlinearities, then combine" design pattern used in operator learning and imaging applications by establishing universal approximation results.

Method: The authors prove that any continuous functional on compact subsets of products of Hilbert spaces can be uniformly approximated by models that first take finitely many continuous linear measurements of the inputs, then combine these measurements through continuous scalar nonlinearities. They extend this to maps with values in Banach spaces.

Result: The paper establishes universal approximation theorems showing that the described architecture can approximate any continuous functional on compact sets, providing theoretical foundation for operator learning architectures.

Conclusion: The results mathematically justify the common design pattern in operator learning and imaging, showing that linear measurements followed by scalar nonlinearities provide a universal approximation framework for functionals on Hilbert spaces.

Abstract: We study universal approximation of continuous functionals on compact subsets of products of Hilbert spaces. We prove that any such functional can be uniformly approximated by models that first take finitely many continuous linear measurements of the inputs and then combine these measurements through continuous scalar nonlinearities. We also extend the approximation principle to maps with values in a Banach space, yielding finite-rank approximations. These results provide a compact-set justification for the common ``measure, apply scalar nonlinearities, then combine’’ design pattern used in operator learning and imaging.

[611] Anomaly Detection via Mean Shift Density Enhancement

Pritam Kar, Rahul Bordoloi, Olaf Wolkenhauer, Saptarshi Bej

Main category: cs.LG

TL;DR: MSDE is an unsupervised anomaly detection framework that uses density-driven manifold evolution to identify anomalies based on their geometric response to iterative density enhancement, achieving robust performance across various anomaly types and noise levels.

Details

Motivation: Existing unsupervised anomaly detection algorithms lack robustness across different anomaly types and perform poorly under noisy settings, often excelling only under specific structural assumptions.

Method: MSDE uses weighted mean-shift with adaptive density weights from UMAP-based fuzzy neighborhood graphs. Anomaly scores are defined by total displacement accumulated across mean-shift iterations - normal samples remain stable while anomalies undergo large displacements toward density modes.

Result: On ADBench benchmark (46 real-world tabular datasets, 4 anomaly generation mechanisms, 6 noise levels), MSDE outperformed 13 unsupervised baselines with consistently strong, balanced, and robust performance for AUC-ROC, AUC-PR, and Precision@n metrics.

Conclusion: Displacement-based scoring provides a robust alternative to state-of-the-art unsupervised anomaly detection, demonstrating that geometric response to density-driven manifold evolution effectively identifies anomalies across diverse settings.

Abstract: Unsupervised anomaly detection stands as an important problem in machine learning, with applications in financial fraud prevention, network security and medical diagnostics. Existing unsupervised anomaly detection algorithms rarely perform well across different anomaly types, often excelling only under specific structural assumptions. This lack of robustness also becomes particularly evident under noisy settings. We propose Mean Shift Density Enhancement (MSDE), a fully unsupervised framework that detects anomalies through their geometric response to density-driven manifold evolution. MSDE is based on the principle that normal samples, being well supported by local density, remain stable under iterative density enhancement, whereas anomalous samples undergo large cumulative displacements as they are attracted toward nearby density modes. To operationalize this idea, MSDE employs a weighted mean-shift procedure with adaptive, sample-specific density weights derived from a UMAP-based fuzzy neighborhood graph. Anomaly scores are defined by the total displacement accumulated across a small number of mean-shift iterations. We evaluate MSDE on the ADBench benchmark, comprising forty six real-world tabular datasets, four realistic anomaly generation mechanisms, and six noise levels. Compared to 13 established unsupervised baselines, MSDE achieves consistently strong, balanced and robust performance for AUC-ROC, AUC-PR, and Precision@n, at several noise levels and on average over several types of anomalies. These results demonstrate that displacement-based scoring provides a robust alternative to the existing state-of-the-art for unsupervised anomaly detection.

[612] Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship

Nang Hung Nguyen, Phi Le Nguyen, Thao Nguyen Truong, Trong Nghia Hoang, Masashi Sugiyama

Main category: cs.LG

TL;DR: A new framework for causal discovery from observational data using invariance of effect-cause conditional distributions across different prior cause distributions, enabling efficient causal graph recovery with quadratic complexity.

Details

Motivation: Current causal discovery methods from observational data often suffer from high computational complexity and scalability issues, especially with large datasets. There's a need for more efficient algorithms that can handle large-scale causal inference while maintaining accuracy.

Method: The method leverages the invariance principle that effect-cause conditional distributions remain stable across different prior cause distributions. It tests potential causal relationships by checking variance of these conditional distributions across multiple downsampled data subsets with varying prior cause distributions. An algorithm exploits sparsity in causal graphs to achieve quadratic complexity in the number of variables.

Result: The algorithm achieves up to 25x speedup compared to state-of-the-art methods while maintaining superior or equivalent performance on large-scale benchmark datasets. It demonstrates enhanced scalability for causal discovery from observational data.

Conclusion: The proposed framework provides an efficient and scalable approach for causal graph recovery from observational data, addressing computational bottlenecks in existing methods while maintaining competitive accuracy through invariance-based testing.

Abstract: This paper introduces a new framework for recovering causal graphs from observational data, leveraging the observation that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes. This insight enables a direct test for potential causal relationships by checking the variance of their corresponding effect-cause conditional distributions across multiple downsampled subsets of the data. These subsets are selected to reflect different prior cause distributions, while preserving the effect-cause conditional relationships. Using this invariance test and exploiting an (empirical) sparsity of most causal graphs, we develop an algorithm that efficiently uncovers causal relationships with quadratic complexity in the number of observational variables, reducing the processing time by up to 25x compared to state-of-the-art methods. Our empirical experiments on a varied benchmark of large-scale datasets show superior or equivalent performance compared to existing works, while achieving enhanced scalability.

[613] Lipschitz Multiscale Deep Equilibrium Models: A Theoretically Guaranteed and Accelerated Approach

Naoki Sato, Hideaki Iiduka

Main category: cs.LG

TL;DR: Lipschitz multiscale DEQ improves fixed-point convergence in deep equilibrium models for image classification, achieving 4.75× speed-up on CIFAR-10 with minor accuracy drop

Details

Motivation: Deep equilibrium models (DEQs) offer memory-efficient infinitely deep representations but suffer from slow training/inference due to unreliable fixed-point convergence. The paper aims to improve convergence guarantees and reduce computational time.

Method: Proposes Lipschitz multiscale DEQ architecture with theoretical guarantees for fixed-point convergence in both forward and backward passes through hyperparameter adjustment, restructuring the model to ensure convergence.

Result: Achieves up to 4.75× speed-up on CIFAR-10 image classification with only minor accuracy degradation, demonstrating practical computational improvements while maintaining performance.

Conclusion: The proposed approach successfully addresses DEQ’s convergence challenges, making them more practical for real-world applications by guaranteeing convergence and significantly reducing computational time.

Abstract: Deep equilibrium models (DEQs) achieve infinitely deep network representations without stacking layers by exploring fixed points of layer transformations in neural networks. Such models constitute an innovative approach that achieves performance comparable to state-of-the-art methods in many large-scale numerical experiments, despite requiring significantly less memory. However, DEQs face the challenge of requiring vastly more computational time for training and inference than conventional methods, as they repeatedly perform fixed-point iterations with no convergence guarantee upon each input. Therefore, this study explored an approach to improve fixed-point convergence and consequently reduce computational time by restructuring the model architecture to guarantee fixed-point convergence. Our proposed approach for image classification, Lipschitz multiscale DEQ, has theoretically guaranteed fixed-point convergence for both forward and backward passes by hyperparameter adjustment, achieving up to a 4.75$\times$ speed-up in numerical experiments on CIFAR-10 at the cost of a minor drop in accuracy.

[614] Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures

Sangyeon Yoon, Hyesoo Hong, Wonje Jeung, Albert No

Main category: cs.LG

TL;DR: Machine unlearning methods are fragile due to “benign relearning” where forgotten information reemerges from syntactically similar data, not just topical relevance. The paper introduces syntactic diversification to address this by paraphrasing forget queries before unlearning.

Details

Motivation: Current machine unlearning methods are fundamentally fragile because they suffer from "benign relearning" - forgotten information can reemerge even from benign fine-tuning data. The common explanation of topical relevance is insufficient, and the paper aims to identify the true driver of this phenomenon and develop a more robust solution.

Method: Through systematic analysis, the authors demonstrate that syntactic similarity (not topicality) is the primary driver of benign relearning. They introduce “syntactic diversification” which paraphrases the original forget queries into heterogeneous structures prior to unlearning, suppressing recovery triggers.

Result: Syntactic diversification effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility across benchmarks.

Conclusion: The paper reveals that syntactic similarity is the key driver of benign relearning in machine unlearning, and proposes syntactic diversification as an effective solution that improves robustness while maintaining model utility.

Abstract: Machine unlearning aims to remove specific content from trained models while preserving overall performance. However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, reveals that existing unlearning methods remain fundamentally fragile. A common explanation attributes this effect to topical relevance, but we find this account insufficient. Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, due to their alignment in representations and gradients with the forgotten content. Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning. This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.

[615] medR: Reward Engineering for Clinical Offline Reinforcement Learning via Tri-Drive Potential Functions

Qianyi Xu, Gousia Habib, Feng Wu, Yanrui Du, Zhihui Chen, Swapnil Mishra, Dilruk Perera, Mengling Feng

Main category: cs.LG

TL;DR: Automated pipeline using LLMs for offline reward design in clinical reinforcement learning, addressing reward engineering bottlenecks in dynamic treatment regimes.

Details

Motivation: Clinical RL is bottlenecked by reward engineering challenges - manual heuristics fail to generalize across diverse pathologies, requiring automated approaches for safe and effective policy learning in complex, sparse offline environments.

Method: Proposes LLM-driven pipeline for offline reward design and verification using potential functions with three components: survival, confidence, and competence. Introduces quantitative metrics to evaluate and select optimal reward structures before deployment.

Result: Framework automates reward function design for specific diseases while significantly enhancing resulting policy performance through LLM-driven domain knowledge integration.

Conclusion: LLM-based automated reward design pipeline addresses fundamental bottlenecks in clinical RL, enabling more effective and generalizable dynamic treatment regimes through systematic reward engineering.

Abstract: Reinforcement Learning (RL) offers a powerful framework for optimizing dynamic treatment regimes (DTRs). However, clinical RL is fundamentally bottlenecked by reward engineering: the challenge of defining signals that safely and effectively guide policy learning in complex, sparse offline environments. Existing approaches often rely on manual heuristics that fail to generalize across diverse pathologies. To address this, we propose an automated pipeline leveraging Large Language Models (LLMs) for offline reward design and verification. We formulate the reward function using potential functions consisted of three core components: survival, confidence, and competence. We further introduce quantitative metrics to rigorously evaluate and select the optimal reward structure prior to deployment. By integrating LLM-driven domain knowledge, our framework automates the design of reward functions for specific diseases while significantly enhancing the performance of the resulting policies.

[616] An Approximate Ascent Approach To Prove Convergence of PPO

Leif Doering, Daniel Schmidt, Moritz Melcher, Sebastian Kassing, Benedikt Wille, Tilman Aach, Simon Weissmann

Main category: cs.LG

TL;DR: PPO’s theoretical foundations are incomplete; this paper shows how PPO’s policy update approximates policy gradient ascent, proves convergence using random reshuffling techniques, and identifies issues with truncated Generalized Advantage Estimation at episode boundaries.

Details

Motivation: Proximal Policy Optimization (PPO) is widely used in deep reinforcement learning but lacks complete theoretical foundations. The paper aims to provide theoretical understanding of PPO's convergence and advantages, and identify issues with commonly used advantage estimation methods.

Method: The authors interpret PPO’s policy update scheme as approximated policy gradient ascent, control bias from surrogate gradients, and use random reshuffling techniques to prove convergence. They also analyze truncated Generalized Advantage Estimation, identifying geometric weighting issues at episode boundaries.

Result: The paper proves a convergence theorem for PPO that explains its success, and identifies that truncated GAE induces infinite mass collapse onto the longest k-step advantage estimator at episode boundaries. Empirical evaluation shows simple weight correction yields substantial improvements in environments with strong terminal signals like Lunar Lander.

Conclusion: This work provides theoretical foundations for PPO, explains its empirical success through convergence analysis, and identifies practical issues with advantage estimation that can be corrected to improve performance in certain environments.

Abstract: Proximal Policy Optimization (PPO) is among the most widely used deep reinforcement learning algorithms, yet its theoretical foundations remain incomplete. Most importantly, convergence and understanding of fundamental PPO advantages remain widely open. Under standard theory assumptions we show how PPO’s policy update scheme (performing multiple epochs of minibatch updates on multi-use rollouts with a surrogate gradient) can be interpreted as approximated policy gradient ascent. We show how to control the bias accumulated by the surrogate gradients and use techniques from random reshuffling to prove a convergence theorem for PPO that sheds light on PPO’s success. Additionally, we identify a previously overlooked issue in truncated Generalized Advantage Estimation commonly used in PPO. The geometric weighting scheme induces infinite mass collapse onto the longest $k$-step advantage estimator at episode boundaries. Empirical evaluations show that a simple weight correction can yield substantial improvements in environments with strong terminal signal, such as Lunar Lander.

[617] Information-Theoretic Multi-Model Fusion for Target-Oriented Adaptive Sampling in Materials Design

Yixuan Zhang, Zhiyuan Li, Weijia He, Mian Dai, Chen Shen, Teng Long, Hongbin Zhang

Main category: cs.LG

TL;DR: Information-theoretic framework for target-oriented adaptive sampling that treats optimization as trajectory discovery rather than full response surface approximation, using dimension-aware information budgeting and multi-model fusion for efficient exploration of high-dimensional design spaces.

Details

Motivation: Addressing the challenge of target-oriented discovery in high-dimensional, heterogeneous design spaces where each evaluation (experimental or simulation) is costly, requiring methods that make reliable progress under limited evaluation budgets.

Method: Information-theoretic framework that reframes optimization as trajectory discovery, maintaining a low-entropy information state concentrating search on target-relevant directions. Uses dimension-aware information budgeting, adaptive bootstrapped distillation over heterogeneous surrogate reservoir, and structure-aware candidate manifold analysis with Kalman-inspired multi-model fusion to balance consensus-driven exploitation and disagreement-driven exploration.

Result: Evaluated under unified protocol without dataset-specific tuning, improves sample efficiency across 14 single- and multi-objective materials design tasks (candidate pools from 600 to 4×10^6, feature dimensions from 10 to 10^3), typically reaching top-performing regions within 100 evaluations. Complementary 20-dimensional synthetic benchmarks (Ackley, Rastrigin, Schwefel) demonstrate robustness to rugged and multimodal landscapes.

Conclusion: The information-theoretic framework provides an effective approach for target-oriented adaptive sampling in high-dimensional design spaces, achieving reliable progress with limited evaluations through trajectory-focused search and multi-model fusion strategies.

Abstract: Target-oriented discovery under limited evaluation budgets requires making reliable progress in high-dimensional, heterogeneous design spaces where each new measurement is costly, whether experimental or high-fidelity simulation. We present an information-theoretic framework for target-oriented adaptive sampling that reframes optimization as trajectory discovery: instead of approximating the full response surface, the method maintains and refines a low-entropy information state that concentrates search on target-relevant directions. The approach couples data, model beliefs, and physics/structure priors through dimension-aware information budgeting, adaptive bootstrapped distillation over a heterogeneous surrogate reservoir, and structure-aware candidate manifold analysis with Kalman-inspired multi-model fusion to balance consensus-driven exploitation and disagreement-driven exploration. Evaluated under a single unified protocol without dataset-specific tuning, the framework improves sample efficiency and reliability across 14 single- and multi-objective materials design tasks spanning candidate pools from $600$ to $4 \times 10^6$ and feature dimensions from $10$ to $10^3$, typically reaching top-performing regions within 100 evaluations. Complementary 20-dimensional synthetic benchmarks (Ackley, Rastrigin, Schwefel) further demonstrate robustness to rugged and multimodal landscapes.

[618] From Inexact Gradients to Byzantine Robustness: Acceleration and Optimization under Similarity

Renaud Gaucher, Aymeric Dieuleveut, Hadrien Hendrikx

Main category: cs.LG

TL;DR: Byzantine-robust distributed optimization can be reformulated as optimization with inexact gradient oracles, enabling application of existing optimization theory and development of accelerated methods with reduced communication complexity.

Details

Motivation: Standard federated learning algorithms are vulnerable to adversarial (Byzantine) nodes. While robust aggregation methods exist, their analyses are ad-hoc, hindering development of more complex algorithms like accelerated methods.

Method: Reformulate Byzantine-robust distributed optimization as optimization with inexact gradient oracles (with additive and multiplicative errors). Propose two accelerated schemes: 1) Nesterov-type acceleration using inexact gradient theory, and 2) Optimization under Similarity leveraging auxiliary loss functions.

Result: Shows GD with robust aggregation achieves optimal asymptotic error. Both proposed acceleration methods drastically reduce communication complexity compared to previous methods, demonstrated theoretically and empirically.

Conclusion: Byzantine-robust optimization can be systematically analyzed through inexact gradient framework, enabling development of theoretically-grounded accelerated algorithms with improved communication efficiency.

Abstract: Standard federated learning algorithms are vulnerable to adversarial nodes, a.k.a. Byzantine failures. To solve this issue, robust distributed learning algorithms have been developed, which typically replace parameter averaging by robust aggregations. While generic conditions on these aggregations exist to guarantee the convergence of (Stochastic) Gradient Descent (SGD), the analyses remain rather ad-hoc. This hinders the development of more complex robust algorithms, such as accelerated ones. In this work, we show that Byzantine-robust distributed optimization can, under standard generic assumptions, be cast as a general optimization with inexact gradient oracles (with both additive and multiplicative error terms), an active field of research. This allows for instance to directly show that GD on top of standard robust aggregation procedures obtains optimal asymptotic error in the Byzantine setting. Going further, we propose two optimization schemes to speed up the convergence. The first one is a Nesterov-type accelerated scheme whose proof directly derives from accelerated inexact gradient results applied to our formulation. The second one hinges on Optimization under Similarity, in which the server leverages an auxiliary loss function that approximates the global loss. Both approaches allow to drastically reduce the communication complexity compared to previous methods, as we show theoretically and empirically.

[619] Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

Jinwoo Choi, Sang-Hyun Lee, Seung-Woo Seo

Main category: cs.LG

TL;DR: CoGHP: A hierarchical RL framework using autoregressive sequence modeling with MLP-Mixer to generate chains of latent subgoals for long-horizon tasks.

Details

Motivation: Existing hierarchical RL methods for offline goal-conditioned tasks are inadequate for complex tasks requiring multiple intermediate decisions, as they use separate networks and generate only single subgoals.

Method: Proposes Chain-of-Goals Hierarchical Policy (CoGHP) that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture using MLP-Mixer backbone for cross-token communication.

Result: CoGHP consistently outperforms strong offline baselines across challenging navigation and manipulation benchmarks, demonstrating improved performance on long-horizon tasks.

Conclusion: The chain-of-goals approach with unified autoregressive modeling effectively addresses long-horizon offline RL challenges by generating sequences of reasoning steps as latent subgoals.

Abstract: Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.

[620] Bayesian Conformal Prediction as a Decision Risk Problem

Fanyi Wu, Veronika Lohmanova, Samuel Kaski, Michele Caprio

Main category: cs.LG

TL;DR: Bayesian Conformal Prediction (BCP) combines Bayesian methods with conformal prediction to create prediction sets with valid coverage guarantees and reduced variability, showing robustness under model misspecification.

Details

Motivation: To improve conformal prediction by incorporating Bayesian methods to reduce run-to-run variability in prediction set sizes while maintaining valid coverage guarantees, especially under model misspecification.

Method: Uses Bayesian posterior predictive densities as non-conformity scores within a split conformal framework, with Bayesian quadrature to estimate and minimize expected prediction set size.

Result: BCP achieves valid coverage guarantees and reliable empirical coverage under model misspecification. In sparse regression with 80% nominal coverage, BCP achieves 81% empirical coverage vs. 49% for Bayesian credible intervals. Shows comparable set sizes to split conformal prediction but with substantially lower run-to-run variability.

Conclusion: BCP successfully combines Bayesian and conformal methods to create robust prediction sets with reliable coverage and reduced variability, outperforming traditional Bayesian intervals under misspecification.

Abstract: Bayesian posterior predictive densities as non-conformity scores and Bayesian quadrature are used to estimate and minimise the expected prediction set size. Operating within a split conformal framework, BCP provides valid coverage guarantees and demonstrates reliable empirical coverage under model misspecification. Across regression and classification tasks, including distribution-shifted settings such as ImageNet-A, BCP yields prediction sets of comparable size to split conformal prediction, while exhibiting substantially lower run-to-run variability in set size. In sparse regression with nominal coverage of 80 percent, BCP achieves 81 percent empirical coverage under a misspecified prior, whereas Bayesian credible intervals under-cover at 49 percent.

[621] On the Entropy Dynamics in Reinforcement Fine-Tuning of Large Language Models

Shumin Wang, Yuexiang Xie, Wenhao Zhang, Yuchang Sun, Yanxi Chen, Yaliang Li, Yanyong Zhang

Main category: cs.LG

TL;DR: Theoretical analysis of entropy dynamics in reinforcement fine-tuning of LLMs, providing framework for understanding exploration-exploitation balance and deriving entropy control methods.

Details

Motivation: While entropy is used to measure diversity in LLM outputs and balance exploration-exploitation in reinforcement fine-tuning, there's a lack of principled understanding of entropy dynamics during this process that needs theoretical investigation.

Method: Establishes theoretical framework for analyzing entropy dynamics in RFT, starting with discriminant expression for entropy change under single logit update, deriving first-order expression for entropy change, extending to GRPO update formula, and designing entropy control methods based on theoretical insights.

Result: Provides empirical evidence supporting theoretical conclusions and demonstrates effectiveness of derived entropy-discriminator clipping methods, offering novel insights into RFT training dynamics.

Conclusion: The study yields theoretical support and practical strategies for optimizing exploration-exploitation balance during LLM fine-tuning, with unified interpretation of existing entropy-based methods.

Abstract: Entropy serves as a critical metric for measuring the diversity of outputs generated by large language models (LLMs), providing valuable insights into their exploration capabilities. While recent studies increasingly focus on monitoring and adjusting entropy to better balance exploration and exploitation in reinforcement fine-tuning (RFT), a principled understanding of entropy dynamics during this process is yet to be thoroughly investigated. In this paper, we establish a theoretical framework for analyzing the entropy dynamics during the RFT process, which begins with a discriminant expression that quantifies entropy change under a single logit update. This foundation enables the derivation of a first-order expression for entropy change, which can be further extended to the update formula of Group Relative Policy Optimization (GRPO). The corollaries and insights drawn from the theoretical analysis inspire the design of entropy control methods, and also offer a unified lens for interpreting various entropy-based methods in existing studies. We provide empirical evidence to support the main conclusions of our analysis and demonstrate the effectiveness of the derived entropy-discriminator clipping methods. This study yields novel insights into RFT training dynamics, providing theoretical support and practical strategies for optimizing the exploration-exploitation balance during LLM fine-tuning.

[622] Achieving Linear Speedup for Composite Federated Learning

Kun Huang, Shi Pu

Main category: cs.LG

TL;DR: FedNMap: A normal map-based federated learning method for composite optimization problems with smooth loss and nonsmooth regularizer, achieving linear speedup in nonconvex settings.

Details

Motivation: Federated learning faces challenges with composite optimization problems involving nonsmooth regularizers and data heterogeneity across clients. Existing methods struggle to achieve linear speedup for nonconvex composite federated learning.

Method: FedNMap uses a normal map-based update scheme to handle nonsmooth regularization terms and incorporates a local correction strategy to mitigate data heterogeneity effects across clients.

Result: FedNMap achieves linear speedup with respect to both number of clients and local updates for nonconvex losses, with and without Polyak-Łojasiewicz condition - the first such result for nonconvex composite federated learning.

Conclusion: FedNMap provides an effective solution for composite federated learning with nonsmooth regularizers, achieving theoretical guarantees for linear speedup in nonconvex settings while handling data heterogeneity.

Abstract: This paper proposes FedNMap, a normal map-based method for composite federated learning, where the objective consists of a smooth loss and a possibly nonsmooth regularizer. FedNMap leverages a normal map-based update scheme to handle the nonsmooth term and incorporates a local correction strategy to mitigate the impact of data heterogeneity across clients. Under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance, FedNMap achieves linear speedup with respect to both the number of clients $n$ and the number of local updates $Q$ for nonconvex losses, both with and without the Polyak-Łojasiewicz (PL) condition. To our knowledge, this is the first result establishing linear speedup for nonconvex composite federated learning.

[623] Dynamic Topology Optimization for Non-IID Data in Decentralized Learning

Bart Cox, Antreas Ioannou, Jérémie Decouchant

Main category: cs.LG

TL;DR: Morph is a decentralized learning algorithm that dynamically optimizes communication topology based on model dissimilarity to improve performance on non-IID data distributions.

Details

Motivation: Decentralized learning struggles with non-IID data distributions and static communication topologies, leading to poor model accuracy and convergence issues.

Method: Nodes adaptively choose peers for model exchange based on maximum model dissimilarity, maintaining fixed in-degree while dynamically reshaping communication graphs through gossip-based peer discovery and diversity-driven neighbor selection.

Result: Morph outperforms static and epidemic baselines on CIFAR-10 and FEMNIST with up to 100 nodes, achieving 1.12x relative improvement in test accuracy on CIFAR-10 and 1.08x higher accuracy than Epidemic Learning on FEMNIST.

Conclusion: Morph achieves higher final accuracy, faster convergence, more stable learning with lower inter-node variance, requires fewer communication rounds than baselines, and operates without global knowledge.

Abstract: Decentralized learning (DL) enables a set of nodes to train a model collaboratively without central coordination, offering benefits for privacy and scalability. However, DL struggles to train a high accuracy model when the data distribution is non-independent and identically distributed (non-IID) and when the communication topology is static. To address these issues, we propose Morph, a topology optimization algorithm for DL. In Morph, nodes adaptively choose peers for model exchange based on maximum model dissimilarity. Morph maintains a fixed in-degree while dynamically reshaping the communication graph through gossip-based peer discovery and diversity-driven neighbor selection, thereby improving robustness to data heterogeneity. Experiments on CIFAR-10 and FEMNIST with up to 100 nodes show that Morph consistently outperforms static and epidemic baselines, while closely tracking the fully connected upper bound. On CIFAR-10, Morph achieves a relative improvement of 1.12x in test accuracy compared to the state-of-the-art baselines. On FEMNIST, Morph achieves an accuracy that is 1.08x higher than Epidemic Learning. Similar trends hold for 50 node deployments, where Morph narrows the gap to the fully connected upper bound within 0.5 percentage points on CIFAR-10. These results demonstrate that Morph achieves higher final accuracy, faster convergence, and more stable learning as quantified by lower inter-node variance, while requiring fewer communication rounds than baselines and no global knowledge.

[624] Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

Xin Sheng, Jiaxin Li, Yujuan Pang, Ran Peng, Yong Ma

Main category: cs.LG

TL;DR: RLVR with positive-negative pairing improves prompt selection for language model training by pairing hard-but-solvable prompts with easy-but-brittle ones, using weighted GRPO for bidirectional learning signals.

Details

Motivation: Current RLVR methods use variance-based prompt selection which leads to unstable optimization and weak transfer. Need better mechanism-level prompt selection that provides both reliable positive anchors and explicit negative learning signals.

Method: Positive-negative pairing: sample hard-but-solvable prompt (low empirical success) and easy-but-brittle prompt (high but not perfect success). Weighted GRPO reweights binary outcomes at pair level with group-normalized advantages to amplify rare successes and failures.

Result: On Qwen2.5-Math-7B, single paired minibatch outperforms GRPO baseline: AIME 2025 Pass@8 improves from 16.8 to 22.2, AMC23 Pass@64 from 94.0 to 97.0. Similar gains on Qwen2.5-Math-7B-Instruct.

Conclusion: Positive-negative pairing with weighted GRPO provides informative bidirectional learning signals, improving sample efficiency without suppressing exploration, outperforming variance-based selection methods.

Abstract: Reinforcement learning with verifiable rewards (RLVR) is effective for training large language models on deterministic outcome reasoning tasks. Prior work shows RLVR works with few prompts, but prompt selection is often based only on training-accuracy variance, leading to unstable optimization directions and weaker transfer. We revisit prompt selection from a mechanism-level view and argue that an effective minibatch should provide both (i) a reliable positive anchor and (ii) explicit negative learning signals from rare failures. Based on this principle, we propose \emph{positive–negative pairing}: at each update, we sample a hard-but-solvable $q^{+}$ and an easy-but-brittle prompt $q^{-}$(high success rate but not perfect), characterized by low and high empirical success rates under multiple rollouts. We further introduce Weighted GRPO, which reweights binary outcomes at the pair level and uses group-normalized advantages to amplify rare successes on $q^{+}$ into sharp positive guidance while turning rare failures on $q^{-}$ into strong negative penalties. This bidirectional signal provides informative learning feedback for both successes and failures, improving sample efficiency without suppressing exploration. On Qwen2.5-Math-7B, a single paired minibatch per update consistently outperforms a GRPO baseline that selects two prompts via commonly used variance-based selection heuristics: AIME~2025 Pass@8 improves from 16.8 to 22.2, and AMC23 Pass@64 from 94.0 to 97.0, while remaining competitive with large-scale RLVR trained from a pool of 1209 training prompts. Similar gains are observed on Qwen2.5-Math-7B-Instruct.

[625] The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting

Chen-Hui Song, Shuoling Liu, Liyuan Chen

Main category: cs.LG

TL;DR: The paper introduces the Label Horizon Paradox in financial forecasting, showing optimal supervision signals differ from prediction targets, and proposes a bi-level optimization framework to find optimal proxy labels.

Details

Motivation: The paper challenges the conventional assumption that training labels must exactly match inference targets in financial forecasting, uncovering that optimal supervision signals often deviate from prediction goals due to market dynamics.

Method: The authors propose a bi-level optimization framework that autonomously identifies optimal proxy labels within a single training run, grounded in a theoretical analysis of dynamic signal-noise trade-off.

Result: Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, validating the effectiveness of the proposed approach.

Conclusion: The work opens new avenues for label-centric research in financial forecasting by showing that supervision signals should be optimized rather than simply mirroring prediction targets.

Abstract: While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signal-noise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.

[626] ScDiVa: Masked Discrete Diffusion for Joint Modeling of Single-Cell Identity and Expression

Mingxuan Wang, Cheng Chen, Gaoyang Jiang, Zijia Ren, Chuangxin Zhao, Lu Shi, Yanbiao Ma

Main category: cs.LG

TL;DR: scDiVa is a masked discrete diffusion foundation model for single-cell RNA-seq data that addresses limitations of autoregressive generation by using a bidirectional denoiser to jointly model discrete gene identities and continuous values.

Details

Motivation: Autoregressive generation for single-cell RNA-seq data imposes artificial ordering bias and suffers from error accumulation due to the high-dimensional, sparse, and unordered nature of the data. The authors aim to develop a more biologically coherent approach.

Method: Proposes scDiVa, a masked discrete diffusion model with continuous-time forward masking in token space. Uses a bidirectional denoiser that jointly models discrete gene identities and continuous values, with entropy-normalized serialization and latent anchor token for information efficiency. Trained via depth-invariant time sampling and dual denoising objective.

Result: Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across benchmarks including batch integration, cell type annotation, and perturbation response prediction.

Conclusion: Masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression for single-cell RNA-seq data generation and analysis.

Abstract: Single-cell RNA-seq profiles are high-dimensional, sparse, and unordered, causing autoregressive generation to impose an artificial ordering bias and suffer from error accumulation. To address this, we propose scDiVa, a masked discrete diffusion foundation model that aligns generation with the dropout-like corruption process by defining a continuous-time forward masking mechanism in token space. ScDiVa features a bidirectional denoiser that jointly models discrete gene identities and continuous values, utilizing entropy-normalized serialization and a latent anchor token to maximize information efficiency and preserve global cell identity. The model is trained via depth-invariant time sampling and a dual denoising objective to simulate varying sparsity levels while ensuring precise recovery of both identity and magnitude. Pre-trained on 59 million cells, scDiVa achieves strong transfer performance across major benchmarks, including batch integration, cell type annotation, and perturbation response prediction. These results suggest that masked discrete diffusion serves as a biologically coherent and effective alternative to autoregression.

[627] Most Convolutional Networks Suffer from Small Adversarial Perturbations

Amit Daniely, Idan Mehalel

Main category: cs.LG

TL;DR: Theoretical analysis proves adversarial examples exist in random CNNs at essentially optimal distance (∥x∥/√d), found via single gradient step.

Details

Motivation: While adversarial examples are understood in fully connected networks, they're less understood in CNNs. Recent work showed existence but not optimal distance. This paper aims to prove adversarial examples exist at essentially optimal distance in random CNNs.

Method: Uses Fourier decomposition to bound singular values of random linear convolutional operators, which are key CNN components. Proves adversarial examples can be found at ∥x∥/√d distance using single gradient descent step.

Result: Shows adversarial examples exist in random CNNs at essentially optimal ℓ₂-distance of order ∥x∥/√d, and can be found with single gradient step. Provides Fourier-based bound for singular values of random convolutional operators.

Conclusion: Adversarial vulnerability is fundamental in CNNs, occurring at optimal distances and efficiently found. The Fourier analysis technique for bounding convolutional operator singular values may have broader applications.

Abstract: The existence of adversarial examples is relatively understood for random fully connected neural networks, but much less so for convolutional neural networks (CNNs). The recent work [Daniely, 2025] establishes that adversarial examples can be found in CNNs, in some non-optimal distance from the input. We extend over this work and prove that adversarial examples in random CNNs with input dimension $d$ can be found already in $\ell_2$-distance of order $\lVert x \rVert /\sqrt{d}$ from the input $x$, which is essentially the nearest possible. We also show that such adversarial small perturbations can be found using a single step of gradient descent. To derive our results we use Fourier decomposition to efficiently bound the singular values of a random linear convolutional operator, which is the main ingredient of a CNN layer. This bound might be of independent interest.

[628] DeepDFA: Injecting Temporal Logic in Deep Learning for Sequential Subsymbolic Applications

Elena Umili, Francesco Argenziano, Roberto Capobianco

Main category: cs.LG

TL;DR: DeepDFA integrates temporal logic (DFA/Moore Machines) into neural networks as differentiable layers for neurosymbolic learning in sequential domains.

Details

Motivation: Addressing the challenge of integrating logical knowledge into deep neural networks for sequential/temporal domains with subsymbolic observations, bridging symbolic reasoning and subsymbolic learning.

Method: Proposes DeepDFA framework that models temporal rules (DFA/Moore Machines) as continuous, differentiable layers within neural architectures, enabling symbolic knowledge injection into subsymbolic domains.

Result: Outperforms traditional deep learning models (LSTMs, GRUs, Transformers) and other neuro-symbolic systems, achieving state-of-the-art results in temporal knowledge integration for image sequence classification and policy learning in non-Markovian environments.

Conclusion: DeepDFA successfully bridges subsymbolic learning and symbolic reasoning in sequential tasks, demonstrating the potential of neurosymbolic integration for temporal domains.

Abstract: Integrating logical knowledge into deep neural network training is still a hard challenge, especially for sequential or temporally extended domains involving subsymbolic observations. To address this problem, we propose DeepDFA, a neurosymbolic framework that integrates high-level temporal logic - expressed as Deterministic Finite Automata (DFA) or Moore Machines - into neural architectures. DeepDFA models temporal rules as continuous, differentiable layers, enabling symbolic knowledge injection into subsymbolic domains. We demonstrate how DeepDFA can be used in two key settings: (i) static image sequence classification, and (ii) policy learning in interactive non-Markovian environments. Across extensive experiments, DeepDFA outperforms traditional deep learning models (e.g., LSTMs, GRUs, Transformers) and novel neuro-symbolic systems, achieving state-of-the-art results in temporal knowledge integration. These results highlight the potential of DeepDFA to bridge subsymbolic learning and symbolic reasoning in sequential tasks.

[629] Causal Inference on Networks under Misspecified Exposure Mappings: A Partial Identification Framework

Maresa Schröder, Miruna Oprescu, Stefan Feuerriegel, Nathan Kallus

Main category: cs.LG

TL;DR: A partial identification framework for causal inference on networks that bounds treatment effects when exposure mappings are misspecified, with applications to three canonical exposure settings.

Details

Motivation: Existing network causal inference methods rely on exposure mappings that compress treatment assignments, but misspecification can cause severe bias in treatment effect estimates.

Method: Proposes a partial identification framework deriving sharp upper/lower bounds on direct and spillover effects under exposure mapping misspecification, with orthogonal estimators for three canonical exposure settings.

Result: Developed valid, sharp, and efficient bound estimates that remain informative and provide reliable conclusions under exposure mapping misspecification.

Conclusion: The framework enables robust causal inference on networks by bounding treatment effects when exposure mappings are potentially misspecified, with practical applications to common exposure settings.

Abstract: Estimating treatment effects in networks is challenging, as each potential outcome depends on the treatments of all other nodes in the network. To overcome this difficulty, existing methods typically impose an exposure mapping that compresses the treatment assignments in the network into a low-dimensional summary. However, if this mapping is misspecified, standard estimators for direct and spillover effects can be severely biased. We propose a novel partial identification framework for causal inference on networks to assess the robustness of treatment effects under misspecifications of the exposure mapping. Specifically, we derive sharp upper and lower bounds on direct and spillover effects under such misspecifications. As such, our framework presents a novel application of causal sensitivity analysis to exposure mappings. We instantiate our framework for three canonical exposure settings widely used in practice: (i) weighted means of the neighborhood treatments, (ii) threshold-based exposure mappings, and (iii) truncated neighborhood interference in the presence of higher-order spillovers. Furthermore, we develop orthogonal estimators for these bounds and prove that the resulting bound estimates are valid, sharp, and efficient. Our experiments show the bounds remain informative and provide reliable conclusions under misspecification of exposure mappings.

[630] Reparameterization Flow Policy Optimization

Hai Zhong, Zhuoran Li, Xun Wang, Longbo Huang

Main category: cs.LG

TL;DR: RFO (Reparameterization Flow Policy Optimization) is a model-based RL method that combines flow-based policies with differentiable dynamics for high sample efficiency, outperforming prior RPG methods limited to Gaussian policies.

Details

Motivation: Prior Reparameterization Policy Gradient (RPG) approaches are limited to Gaussian policies, restricting performance and failing to leverage advances in generative models like flow-based policies. There's an unexplored connection between flow policies and RPG that could enable better model-based RL.

Method: RFO computes policy gradients by backpropagating jointly through flow generation (differentiable ODE integration) and system dynamics. It includes two tailored regularization terms for stability and exploration, plus a variant with action chunking.

Result: Extensive experiments on locomotion and manipulation tasks (rigid/soft bodies, state/visual inputs) show RFO’s effectiveness. On a challenging soft-body quadruped locomotion task, RFO achieves almost 2× the reward of state-of-the-art baselines.

Conclusion: RFO successfully bridges flow policies with RPG framework, enabling high sample efficiency without intractable log-likelihood calculations, and demonstrates superior performance on diverse robotic control tasks.

Abstract: Reparameterization Policy Gradient (RPG) has emerged as a powerful paradigm for model-based reinforcement learning, enabling high sample efficiency by backpropagating gradients through differentiable dynamics. However, prior RPG approaches have been predominantly restricted to Gaussian policies, limiting their performance and failing to leverage recent advances in generative models. In this work, we identify that flow policies, which generate actions via differentiable ODE integration, naturally align with the RPG framework, a connection not established in prior work. However, naively exploiting this synergy proves ineffective, often suffering from training instability and a lack of exploration. We propose Reparameterization Flow Policy Optimization (RFO). RFO computes policy gradients by backpropagating jointly through the flow generation process and system dynamics, unlocking high sample efficiency without requiring intractable log-likelihood calculations. RFO includes two tailored regularization terms for stability and exploration. We also propose a variant of RFO with action chunking. Extensive experiments on diverse locomotion and manipulation tasks, involving both rigid and soft bodies with state or visual inputs, demonstrate the effectiveness of RFO. Notably, on a challenging locomotion task controlling a soft-body quadruped, RFO achieves almost $2\times$ the reward of the state-of-the-art baseline.

[631] Soft-Radial Projection for Constrained End-to-End Learning

Philipp J. Schneider, Daniel Kuhn

Main category: cs.LG

TL;DR: Soft-Radial Projection: A differentiable layer that maps points into feasible set interiors while preserving full-rank Jacobians to avoid gradient saturation issues in constrained deep learning.

Details

Motivation: Existing constructive layers that project predictions onto constraint boundaries suffer from gradient saturation due to rank-deficient Jacobians when points collapse onto lower-dimensional surfaces, which nullifies gradients orthogonal to active constraints and hinders optimization.

Method: Introduces Soft-Radial Projection, a differentiable reparameterization layer that uses radial mapping from Euclidean space into the interior of the feasible set, guaranteeing strict feasibility while preserving full-rank Jacobian almost everywhere.

Result: Theoretically proves the architecture retains universal approximation property and empirically shows improved convergence behavior and solution quality over state-of-the-art optimization- and projection-based baselines.

Conclusion: Soft-Radial Projection effectively addresses gradient saturation in constrained deep learning by maintaining full-rank Jacobians while ensuring feasibility, leading to better optimization performance for safety-critical systems.

Abstract: Integrating hard constraints into deep learning is essential for safety-critical systems. Yet existing constructive layers that project predictions onto constraint boundaries face a fundamental bottleneck: gradient saturation. By collapsing exterior points onto lower-dimensional surfaces, standard orthogonal projections induce rank-deficient Jacobians, which nullify gradients orthogonal to active constraints and hinder optimization. We introduce Soft-Radial Projection, a differentiable reparameterization layer that circumvents this issue through a radial mapping from Euclidean space into the interior of the feasible set. This construction guarantees strict feasibility while preserving a full-rank Jacobian almost everywhere, thereby preventing the optimization stalls typical of boundary-based methods. We theoretically prove that the architecture retains the universal approximation property and empirically show improved convergence behavior and solution quality over state-of-the-art optimization- and projection-based baselines.

[632] A Minimal Task Reveals Emergent Path Integration and Object-Location Binding in a Predictive Sequence Model

Linda Ariel Ventura, Victoria Bosch, Tim C Kietzmann, Sushrut Thorat

Main category: cs.LG

TL;DR: A recurrent neural network learns world models through sequential prediction of tokens from 2D scenes, demonstrating in-context learning, path integration, and dynamic binding of identity to position.

Details

Motivation: To investigate whether action-conditioned sequential prediction suffices for learning structured world models that represent objects and their relations, which is fundamental to adaptive cognition.

Method: Train a recurrent neural network to predict upcoming tokens from 2D continuous token scenes using sequential sampling with saccade-like displacements, then analyze prediction accuracy, decoding, and interventions.

Result: Prediction accuracy improves across sequences on novel scenes (in-context learning), decoding reveals path integration and dynamic binding of token identity to position, and interventions show flexible binding capabilities.

Conclusion: Structured representations with flexible binding emerge from sequential prediction, providing a mechanistic account of world modeling relevant to cognitive science.

Abstract: Adaptive cognition requires structured internal models representing objects and their relations. Predictive neural networks are often proposed to form such “world models”, yet their underlying mechanisms remain unclear. One hypothesis is that action-conditioned sequential prediction suffices for learning such world models. In this work, we investigate this possibility in a minimal in-silico setting. Sequentially sampling tokens from 2D continuous token scenes, a recurrent neural network is trained to predict the upcoming token from current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned. Together, these results demonstrate how structured representations that rely on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

[633] Explaining the Explainer: Understanding the Inner Workings of Transformer-based Symbolic Regression Models

Arco van Breda, Erman Acar

Main category: cs.LG

TL;DR: PATCHES is an evolutionary circuit discovery algorithm that identifies compact and correct circuits in symbolic regression transformers, providing the first circuit-level characterization of SR transformers with causal validation.

Details

Motivation: While transformers have proven effective for symbolic regression, the internal mechanisms underlying their generation of mathematical operators remain largely unexplored. Mechanistic interpretability has been successful in language and vision models but hasn't been applied to SR, creating a gap in understanding how these models work internally.

Method: The authors introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for symbolic regression. They use this method to isolate 28 circuits and validate findings through a robust causal evaluation framework based on faithfulness, completeness, and minimality.

Result: The analysis shows that mean patching with performance-based evaluation most reliably isolates functionally correct circuits. In contrast, direct logit attribution and probing classifiers primarily capture correlational features rather than causal ones, limiting their utility for circuit discovery.

Conclusion: Symbolic regression is established as a high-potential application domain for mechanistic interpretability, and the paper proposes a principled methodology for circuit discovery that can be applied to understand transformer internals in mathematical reasoning tasks.

Abstract: Following their success across many domains, transformers have also proven effective for symbolic regression (SR); however, the internal mechanisms underlying their generation of mathematical operators remain largely unexplored. Although mechanistic interpretability has successfully identified circuits in language and vision models, it has not yet been applied to SR. In this article, we introduce PATCHES, an evolutionary circuit discovery algorithm that identifies compact and correct circuits for SR. Using PATCHES, we isolate 28 circuits, providing the first circuit-level characterisation of an SR transformer. We validate these findings through a robust causal evaluation framework based on key notions such as faithfulness, completeness, and minimality. Our analysis shows that mean patching with performance-based evaluation most reliably isolates functionally correct circuits. In contrast, we demonstrate that direct logit attribution and probing classifiers primarily capture correlational features rather than causal ones, limiting their utility for circuit discovery. Overall, these results establish SR as a high-potential application domain for mechanistic interpretability and propose a principled methodology for circuit discovery.

[634] Least but not Last: Fine-tuning Intermediate Principal Components for Better Performance-Forgetting Trade-Offs

Alessio Quercia, Arya Bangun, Ira Assent, Hanno Scharr

Main category: cs.LG

TL;DR: Analysis of LoRA adaptation trade-offs showing intermediate principal components initialization provides better balance between task performance and catastrophic forgetting than existing first/last component methods.

Details

Motivation: LoRA methods face fundamental challenge in balancing task-specific performance gains against catastrophic forgetting of pre-trained knowledge, with existing methods providing inconsistent recommendations on this trade-off.

Method: Comprehensive analysis of performance-forgetting trade-offs in low-rank adaptation using principal components as initialization, investigating different component positions (first, intermediate, last) and their effects on learning robustness.

Result: Fine-tuning intermediate components leads to better balance and shows more robustness to high learning rates than first (PiSSA) and last (MiLoRA) components. The approach improves accuracy and reduces forgetting across various computer vision and NLP tasks.

Conclusion: Provides practical approach for LoRA initialization that offers superior trade-offs between task performance and knowledge retention, demonstrating effectiveness in continual learning scenarios.

Abstract: Low-Rank Adaptation (LoRA) methods have emerged as crucial techniques for adapting large pre-trained models to downstream tasks under computational and memory constraints. However, they face a fundamental challenge in balancing task-specific performance gains against catastrophic forgetting of pre-trained knowledge, where existing methods provide inconsistent recommendations. This paper presents a comprehensive analysis of the performance-forgetting trade-offs inherent in low-rank adaptation using principal components as initialization. Our investigation reveals that fine-tuning intermediate components leads to better balance and show more robustness to high learning rates than first (PiSSA) and last (MiLoRA) components in existing work. Building on these findings, we provide a practical approach for initialization of LoRA that offers superior trade-offs. We demonstrate in a thorough empirical study on a variety of computer vision and NLP tasks that our approach improves accuracy and reduces forgetting, also in continual learning scenarios.

[635] Lookahead Path Likelihood Optimization for Diffusion LLMs

Xuejie Liu, Yap Vit Chun, Yitao Liang, Anji Liu

Main category: cs.LG

TL;DR: POKE-SMC improves diffusion LLM inference by using path log-likelihood and Sequential Monte Carlo search to find globally optimal unmasking paths, boosting reasoning accuracy with minimal overhead.

Details

Motivation: Current diffusion LLMs rely on heuristic unmasking strategies that optimize local confidence but fail to identify globally consistent and accurate generation paths, limiting inference performance.

Method: Introduces path log-likelihood (Path LL) as a trajectory-conditioned objective, develops POKE value estimator to predict future Path LL, and integrates it into POKE-SMC - a Sequential Monte Carlo search framework for dynamic path optimization.

Result: Achieves 2-3% average accuracy gains across 6 reasoning tasks over strong baselines at comparable inference overhead on LLaDA models, advancing the accuracy-compute Pareto frontier.

Conclusion: POKE-SMC provides a principled approach to diffusion LLM inference that systematically improves reasoning accuracy by optimizing for globally consistent unmasking paths rather than local confidence.

Abstract: Diffusion Large Language Models (dLLMs) support arbitrary-order generation, yet their inference performance critically depends on the unmasking order. Existing strategies rely on heuristics that greedily optimize local confidence, offering limited guidance for identifying unmasking paths that are globally consistent and accurate. To bridge this gap, we introduce path log-likelihood (Path LL), a trajectory-conditioned objective that strongly correlates with downstream accuracy and enables principled selection of unmasking paths. To optimize Path LL at inference time, we propose POKE, an efficient value estimator that predicts the expected future Path LL of a partial decoding trajectory. We then integrate this lookahead signal into POKE-SMC, a Sequential Monte Carlo-based search framework for dynamically identifying optimal unmasking paths. Extensive experiments across 6 reasoning tasks show that POKE-SMC consistently improves accuracy, achieving 2%–3% average gains over strong decoding-time scaling baselines at comparable inference overhead on LLaDA models and advancing the accuracy–compute Pareto frontier.

[636] Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Qizheng Gu, Xinran Gu, Longyu Guan, Haiqing Guo, Jianhang Guo, Xiaoru Hao, Tianhong He, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Yangyang Hu, Zhenxing Hu, Weixiao Huang, Zhiqi Huang, Zihao Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yongsheng Kang, Guokun Lai, Cheng Li, Fang Li, Haoyang Li, Ming Li, Wentao Li, Yang Li, Yanhao Li, Yiwei Li, Zhaowei Li, Zheming Li, Hongzhan Lin, Xiaohan Lin, Zongyu Lin, Chengyin Liu, Chenyu Liu, Hongzhang Liu, Jingyuan Liu, Junqi Liu, Liang Liu, Shaowei Liu, T. Y. Liu, Tianwei Liu, Weizhou Liu, Yangyang Liu, Yibo Liu, Yiping Liu, Yue Liu, Zhengying Liu, Enzhe Lu, Haoyu Lu, Lijun Lu, Yashuo Luo, Shengling Ma, Xinyu Ma, Yingwei Ma, Shaoguang Mao, Jie Mei, Xin Men, Yibo Miao, Siyuan Pan, Yebo Peng, Ruoyu Qin, Zeyu Qin, Bowen Qu, Zeyu Shang, Lidong Shi, Shengyuan Shi, Feifan Song, Jianlin Su, Zhengyuan Su, Lin Sui, Xinjie Sun, Flood Sung, Yunpeng Tai, Heyi Tang, Jiawen Tao, Qifeng Teng, Chaoran Tian, Chensi Wang, Dinglu Wang, Feng Wang, Hailong Wang, Haiming Wang, Jianzhou Wang, Jiaxing Wang, Jinhong Wang, Shengjie Wang, Shuyi Wang, Si Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yiqin Wang, Yuxin Wang, Yuzhi Wang, Zhaoji Wang, Zhengtao Wang, Zhengtao Wang, Zhexu Wang, Chu Wei, Qianqian Wei, Haoning Wu, Wenhao Wu, Xingzhe Wu, Yuxin Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Weimin Xiong, Boyu Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinran Xu, Yangchuan Xu, Ziyao Xu, Jing Xu, Jing Xu, Junjie Yan, Yuzi Yan, Hao Yang, Xiaofei Yang, Yi Yang, Ying Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Xingcheng Yao, Wenjie Ye, Zhuorui Ye, Bohong Yin, Longhui Yu, Enming Yuan, Hongbang Yuan, Mengjie Yuan, Siyu Yuan, Haobing Zhan, Dehao Zhang, Hao Zhang, Wanlu Zhang, Xiaobin Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Haotian Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Shaojie Zheng, Longguang Zhong, Jianren Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Zhen Zhu, Weiyu Zhuang, Xinxing Zu

Main category: cs.LG

TL;DR: Kimi K2 is a 32B activated parameter MoE model with 1T total parameters, trained using novel MuonClip optimizer for stability, achieving SOTA performance in agentic tasks, coding, math, and reasoning without extended thinking.

Details

Motivation: To develop a highly capable open-source large language model with strong agentic capabilities that can perform well in software engineering and other complex tasks without requiring extended thinking processes.

Method: Uses Mixture-of-Experts architecture with 32B activated parameters out of 1T total. Introduces MuonClip optimizer with QK-clip technique for training stability. Pre-trained on 15.5T tokens with zero loss spikes. Multi-stage post-training includes large-scale agentic data synthesis pipeline and joint reinforcement learning with real and synthetic environments.

Result: Achieves SOTA among open-source non-thinking models: 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, 47.3 on SWE-Bench Multilingual. Strong performance in coding (53.7 LiveCodeBench v6), math (49.5 AIME 2025), reasoning (75.1 GPQA-Diamond, 27.1 OJBench). Surpasses most open and closed-source baselines in non-thinking settings.

Conclusion: Kimi K2 is one of the most capable open-source LLMs to date, particularly strong in software engineering and agentic tasks. The model checkpoints are released to facilitate research in agentic intelligence.

Abstract: We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.

[637] Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Hyunji Jung, Sungbin Shin, Namhoon Lee

Main category: cs.LG

TL;DR: Basis rotation technique mitigates gradient staleness in asynchronous pipeline parallelism, enabling scalable distributed training by aligning Hessian eigenbasis with coordinate basis to accelerate convergence.

Details

Motivation: Asynchronous pipeline parallelism eliminates pipeline bubbles for better hardware utilization but suffers from gradient staleness that scales with pipeline depth, undermining scalability. Current methods fail to address how delayed gradients interact with optimization algorithms, particularly when Hessian eigenbasis is misaligned with coordinate basis.

Method: Proposes basis rotation to rectify delayed gradients by aligning Hessian eigenbasis with coordinate basis. This enables coordinate-wise adaptive optimizers like Adam to effectively leverage curvature information. The approach is validated through theoretical analysis and empirical evaluation on large-scale models.

Result: Basis rotation significantly accelerates convergence in asynchronous settings. Training a 1B-parameter LLM achieves the same training loss in 76.8% fewer iterations compared to best-performing asynchronous pipeline parallel training baseline.

Conclusion: Basis rotation effectively addresses the fundamental scalability limitation of asynchronous pipeline parallelism by mitigating gradient staleness effects, enabling efficient large-scale distributed training while maintaining performance.

Abstract: Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline.

[638] A Function-Space Stability Boundary for Generalization in Interpolating Learning Systems

Ronald Katende

Main category: cs.LG

TL;DR: Analyzes when algorithmic stability explains generalization in interpolating learning systems by measuring sensitivity to training perturbations and proposing stability certificates.

Details

Motivation: To understand why modern learning systems can interpolate training data while still generalizing well, and to determine when algorithmic stability provides a valid explanation for this phenomenon.

Method: Models training as a function-space trajectory, measures sensitivity to single-sample perturbations, proposes a contractive propagation condition, and derives stability certificates by unrolling the resulting recursion.

Result: Small certificates imply stability-based generalization, but there exist interpolating regimes with small risk where contractive sensitivity cannot hold, showing stability is not a universal explanation. Experiments show certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations.

Conclusion: The framework identifies regimes where stability explains generalization and where alternative mechanisms must account for success, providing a tool to distinguish between different generalization explanations.

Abstract: Modern learning systems often interpolate training data while still generalizing well, yet it remains unclear when algorithmic stability explains this behavior. We model training as a function-space trajectory and measure sensitivity to single-sample perturbations along this trajectory. We propose a contractive propagation condition and a stability certificate obtained by unrolling the resulting recursion. A small certificate implies stability-based generalization, while we also prove that there exist interpolating regimes with small risk where such contractive sensitivity cannot hold, showing that stability is not a universal explanation. Experiments confirm that certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations. The framework therefore identifies regimes where stability explains generalization and where alternative mechanisms must account for success.

[639] Not All Negative Samples Are Equal: LLMs Learn Better from Plausible Reasoning

Zixiang Di, Jinyi Han, Shuo Zhang, Ying Liao, Zhi Li, Xiaofeng Ji, Yongqi Wang, Zheming Yang, Ming Gao, Bingdong Li, Jie Wang

Main category: cs.LG

TL;DR: PNS generates high-quality negative samples for LLM reasoning training using reverse RL with composite rewards, outperforming other methods by 2.03% on math benchmarks

Details

Motivation: Existing methods treat all incorrect responses as equally informative for LLM reasoning training, overlooking sample quality. High-quality negative samples that maintain format/structure coherence while being incorrect could better improve reasoning capabilities.

Method: Proposes Plausible Negative Samples (PNS) using reverse reinforcement learning with composite reward: format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation. Trains dedicated model to generate responses indistinguishable from correct solutions but ultimately incorrect.

Result: PNS consistently outperforms other negative sample synthesis methods across three backbone models on seven mathematical reasoning benchmarks, achieving average 2.03% improvement over RL-trained models.

Conclusion: PNS provides high-quality negative samples as plug-and-play data source for preference optimization, significantly improving LLM reasoning capabilities through better negative sample quality.

Abstract: Learning from negative samples holds great promise for improving Large Language Model (LLM) reasoning capability, yet existing methods treat all incorrect responses as equally informative, overlooking the crucial role of sample quality. To address this, we propose Plausible Negative Samples (PNS), a method that synthesizes high-quality negative samples exhibiting expected format and structural coherence while ultimately yielding incorrect answers. PNS trains a dedicated model via reverse reinforcement learning (RL) guided by a composite reward combining format compliance, accuracy inversion, reward model assessment, and chain-of-thought evaluation, generating responses nearly indistinguishable from correct solutions. We further validate PNS as a plug-and-play data source for preference optimization across three backbone models on seven mathematical reasoning benchmarks. Results demonstrate that PNS consistently outperforms other negative sample synthesis methods, achieving an average improvement of 2.03% over RL-trained models.

[640] Rank-Learner: Orthogonal Ranking of Treatment Effects

Henri Arno, Dennis Frauen, Emil Javurek, Thomas Demeester, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Rank-Learner: A two-stage method that directly learns rankings of treatment effects from observational data without explicit CATE estimation, using pairwise learning objectives with Neyman-orthogonal properties for robustness.

Details

Motivation: Many real-world applications require ranking individuals by treatment effects rather than estimating exact effect magnitudes (e.g., prioritizing patients for preventive care, ranking customers for targeted advertising). While causal effect estimation has been extensively studied, directly learning treatment effect rankings from observational data has remained largely unexplored.

Method: Rank-Learner is a two-stage learner that optimizes a pairwise learning objective to recover true treatment effect ordering without explicit CATE estimation. It is Neyman-orthogonal, providing robustness to nuisance function estimation errors, and model-agnostic (can use neural networks or other ML models).

Result: Extensive experiments show Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods in recovering treatment effect rankings.

Conclusion: Rank-Learner provides practitioners with a new, orthogonal two-stage learner for ranking individuals by treatment effects, solving a more focused problem than full CATE estimation while offering strong theoretical guarantees.

Abstract: Many decision-making problems require ranking individuals by their treatment effects rather than estimating the exact effect magnitudes. Examples include prioritizing patients for preventive care interventions, or ranking customers by the expected incremental impact of an advertisement. Surprisingly, while causal effect estimation has received substantial attention in the literature, the problem of directly learning rankings of treatment effects has largely remained unexplored. In this paper, we introduce Rank-Learner, a novel two-stage learner that directly learns the ranking of treatment effects from observational data. We first show that naive approaches based on precise treatment effect estimation solve a harder problem than necessary for ranking, while our Rank-Learner optimizes a pairwise learning objective that recovers the true treatment effect ordering, without explicit CATE estimation. We further show that our Rank-Learner is Neyman-orthogonal and thus comes with strong theoretical guarantees, including robustness to estimation errors in the nuisance functions. In addition, our Rank-Learner is model-agnostic, and can be instantiated with arbitrary machine learning models (e.g., neural networks). We demonstrate the effectiveness of our method through extensive experiments where Rank-Learner consistently outperforms standard CATE estimators and non-orthogonal ranking methods. Overall, we provide practitioners with a new, orthogonal two-stage learner for ranking individuals by their treatment effects.

[641] Live or Lie: Action-Aware Capsule Multiple Instance Learning for Risk Assessment in Live Streaming Platforms

Yiran Qiao, Jing Chen, Xiang Ao, Qiwei Zhong, Yang Liu, Qing He

Main category: cs.LG

TL;DR: AC-MIL is a novel framework for detecting coordinated malicious behaviors in live streaming rooms using Multiple Instance Learning with action-aware capsules, achieving state-of-the-art performance on industrial datasets.

Details

Motivation: Live streaming platforms face significant risks from coordinated malicious behaviors that are hard to detect due to sparse signals concealed within normal activities, requiring timely and accurate risk assessment with only room-level supervision.

Method: Formulates risk assessment as a MIL problem where rooms are bags and structured user-timeslot capsules are instances. Proposes AC-MIL framework with serial and parallel architecture to capture multi-granular semantics, temporal dynamics, and cross-user dependencies for robust prediction.

Result: Extensive experiments on large-scale Douyin datasets show AC-MIL significantly outperforms MIL and sequential baselines, establishing new state-of-the-art performance in room-level risk assessment while providing interpretable capsule-level evidence.

Conclusion: AC-MIL effectively addresses the challenge of detecting coordinated malicious behaviors in live streaming with weak supervision, offering both high accuracy and interpretability for practical intervention.

Abstract: Live streaming has become a cornerstone of today’s internet, enabling massive real-time social interactions. However, it faces severe risks arising from sparse, coordinated malicious behaviors among multiple participants, which are often concealed within normal activities and challenging to detect timely and accurately. In this work, we provide a pioneering study on risk assessment in live streaming rooms, characterized by weak supervision where only room-level labels are available. We formulate the task as a Multiple Instance Learning (MIL) problem, treating each room as a bag and defining structured user-timeslot capsules as instances. These capsules represent subsequences of user actions within specific time windows, encapsulating localized behavioral patterns. Based on this formulation, we propose AC-MIL, an Action-aware Capsule MIL framework that models both individual behaviors and group-level coordination patterns. AC-MIL captures multi-granular semantics and behavioral cues through a serial and parallel architecture that jointly encodes temporal dynamics and cross-user dependencies. These signals are integrated for robust room-level risk prediction, while also offering interpretable evidence at the behavior segment level. Extensive experiments on large-scale industrial datasets from Douyin demonstrate that AC-MIL significantly outperforms MIL and sequential baselines, establishing new state-of-the-art performance in room-level risk assessment for live streaming. Moreover, AC-MIL provides capsule-level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention. The project page is available at: https://qiaoyran.github.io/AC-MIL/.

[642] WARP Logic Neural Networks

Lino Gerlach, Thore Gerlach, Liv Våge, Elliott Kauffman, Isobel Ojalvo

Main category: cs.LG

TL;DR: WARP logic neural networks: A gradient-based framework for efficiently learning combinations of hardware-native logic blocks with improved training via learnable thresholding and residual initialization.

Details

Motivation: Existing logic neural networks have high training costs, introduce redundancy, or rely on approximate gradients, limiting scalability. There's a need for more efficient gradient-based frameworks for learning hardware-native logic operations.

Method: Introduces WAlsh Relaxation for Probabilistic (WARP) logic neural networks - a novel gradient-based framework that learns combinations of hardware-native logic blocks. Uses learnable thresholding and residual initialization for improved training, and bridges relaxed training with discrete logic inference through stochastic smoothing.

Result: WARP yields the most parameter-efficient representation for exactly learning Boolean functions. Experiments show faster convergence than state-of-the-art baselines, and effective scaling to deeper architectures and logic functions with higher input arity.

Conclusion: WARP provides an efficient gradient-based framework for learning logic operations that overcomes limitations of prior approaches, offering improved training efficiency and scalability for logic neural networks.

Abstract: Fast and efficient AI inference is increasingly important, and recent models that directly learn low-level logic operations have achieved state-of-the-art performance. However, existing logic neural networks incur high training costs, introduce redundancy or rely on approximate gradients, which limits scalability. To overcome these limitations, we introduce WAlsh Relaxation for Probabilistic (WARP) logic neural networks – a novel gradient-based framework that efficiently learns combinations of hardware-native logic blocks. We show that WARP yields the most parameter-efficient representation for exactly learning Boolean functions and that several prior approaches arise as restricted special cases. Training is improved by introducing learnable thresholding and residual initialization, while we bridge the gap between relaxed training and discrete logic inference through stochastic smoothing. Experiments demonstrate faster convergence than state-of-the-art baselines, while scaling effectively to deeper architectures and logic functions with higher input arity.

[643] EVE: Efficient Verification of Data Erasure through Customized Perturbation in Approximate Unlearning

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Luoyu Chen, Shui Yu

Main category: cs.LG

TL;DR: EVE is an efficient verification method for machine unlearning that doesn’t require participation in initial training, using adversarial perturbations to detect prediction changes before/after unlearning.

Details

Motivation: Current machine unlearning verification methods rely on backdooring techniques that require participation in the model's initial training phase, which is inefficient and impractical. There's a need for verification methods that can work without involvement in the training process.

Method: EVE perturbs unlearning data to ensure model predictions change before/after unlearning. The perturbations are designed via adversarial optimization that aligns the unlearning gradient with the gradient of boundary change for target samples. Users observe prediction changes as verification signals.

Result: EVE successfully verifies machine unlearning without requiring initial training involvement, outperforms state-of-the-art methods, offers significant speedup in efficiency, and enhances verification accuracy.

Conclusion: EVE provides a novel, efficient verification tool for machine unlearning that eliminates the impractical requirement of participating in initial training, offering both accuracy and efficiency improvements over existing methods.

Abstract: Verifying whether the machine unlearning process has been properly executed is critical but remains underexplored. Some existing approaches propose unlearning verification methods based on backdooring techniques. However, these methods typically require participation in the model’s initial training phase to backdoor the model for later verification, which is inefficient and impractical. In this paper, we propose an efficient verification of erasure method (EVE) for verifying machine unlearning without requiring involvement in the model’s initial training process. The core idea is to perturb the unlearning data to ensure the model prediction of the specified samples will change before and after unlearning with perturbed data. The unlearning users can leverage the observation of the changes as a verification signal. Specifically, the perturbations are designed with two key objectives: ensuring the unlearning effect and altering the unlearned model’s prediction of target samples. We formalize the perturbation generation as an adversarial optimization problem, solving it by aligning the unlearning gradient with the gradient of boundary change for target samples. We conducted extensive experiments, and the results show that EVE can verify machine unlearning without involving the model’s initial training process, unlike backdoor-based methods. Moreover, EVE significantly outperforms state-of-the-art unlearning verification methods, offering significant speedup in efficiency while enhancing verification accuracy. The source code of EVE is released at \uline{https://anonymous.4open.science/r/EVE-C143}, providing a novel tool for verification of machine unlearning.

[644] Sparse Training of Neural Networks based on Multilevel Mirror Descent

Yannick Lunk, Sebastian J. Scott, Leon Bungert

Main category: cs.LG

TL;DR: Dynamic sparse training algorithm using linearized Bregman iterations with alternating static/dynamic sparsity patterns, achieving high sparsity with maintained accuracy and reduced FLOPs.

Details

Motivation: To develop efficient sparse training methods that can explore sparse parameter spaces effectively while maintaining model accuracy, addressing the computational inefficiency of traditional training methods.

Method: Combines sparsity-inducing Bregman iterations with adaptive freezing of network structure, alternating between periods of static and dynamic sparsity pattern updates within a multilevel optimization framework.

Result: Produces highly sparse and accurate models on standard benchmarks, reducing theoretical FLOPs from 38% for standard Bregman iterations to 6% while maintaining test accuracy.

Conclusion: The proposed dynamic sparse training algorithm effectively balances sparsity exploration and computational efficiency, offering a promising approach for training sparse neural networks with theoretical convergence guarantees.

Abstract: We introduce a dynamic sparse training algorithm based on linearized Bregman iterations / mirror descent that exploits the naturally incurred sparsity by alternating between periods of static and dynamic sparsity pattern updates. The key idea is to combine sparsity-inducing Bregman iterations with adaptive freezing of the network structure to enable efficient exploration of the sparse parameter space while maintaining sparsity. We provide convergence guaranties by embedding our method in a multilevel optimization framework. Furthermore, we empirically show that our algorithm can produce highly sparse and accurate models on standard benchmarks. We also show that the theoretical number of FLOPs compared to SGD training can be reduced from 38% for standard Bregman iterations to 6% for our method while maintaining test accuracy.

[645] Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Maxime Méloux, François Portet, Maxime Peyrard

Main category: cs.LG

TL;DR: Circuit discovery in mechanistic interpretability suffers from fundamental instability due to high variance in causal mediation analysis scores, which propagates through pipeline approximations and dataset aggregation, leading to fragile and non-reproducible circuits.

Details

Motivation: The paper aims to address the scientific validity of mechanistic interpretability findings by highlighting that circuit discovery is not a standalone task but a statistical estimation problem. The authors argue that current practices lack stability, making findings potentially unreliable and non-reproducible.

Method: The authors frame circuit discovery as a statistical estimation problem built upon causal mediation analysis (CMA). They systematically analyze variance sources: (1) intrinsic variance in exact single-input CMA scores, (2) additional noise from approximation methods like Edge Attribution Patching, and (3) fragility from aggregating noisy scores over datasets. They advocate for more rigorous practices with stability metrics.

Result: The paper demonstrates that causal effect of model components is a volatile random variable rather than a fixed property. Circuit discovery pipelines inherit and amplify this variance, making circuits highly sensitive to small perturbations in input data or hyperparameters, leading to vastly different circuit structures.

Conclusion: Mechanistic interpretability needs more rigorous statistical practices, including routine reporting of stability metrics and prioritizing statistical robustness to ensure scientific validity of circuit discovery findings.

Abstract: Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific validity of these findings depends on their stability. In this work, we argue that circuit discovery is not a standalone task but a statistical estimation problem built upon causal mediation analysis (CMA). We uncover a fundamental instability at this base layer: exact, single-input CMA scores exhibit high intrinsic variance, implying that the causal effect of a component is a volatile random variable rather than a fixed property. We then demonstrate that circuit discovery pipelines inherit this variance and further amplify it. Fast approximation methods, such as Edge Attribution Patching and its successors, introduce additional estimation noise, while aggregating these noisy scores over datasets leads to fragile structural estimates. Consequently, small perturbations in input data or hyperparameters yield vastly different circuits. We systematically decompose these sources of variance and advocate for more rigorous MI practices, prioritizing statistical robustness and routine reporting of stability metrics.

[646] MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization

Maximilian Kleinegger, Elvir Crnčević, Dan Alistarh

Main category: cs.LG

TL;DR: MatGPTQ enables efficient multi-precision quantization for LLMs using post-training quantization with bit-slicing and cross-bit error compensation, achieving high accuracy across different bit-widths from a single checkpoint.

Details

Motivation: Matryoshka Quantization (MatQuant) allows serving a single integer-quantized model across multiple precisions by slicing most significant bits at inference, but it relies on expensive quantization-aware training (QAT) rather than fast post-training quantization (PTQ), lacks open-source support, and has no kernel implementation.

Method: MatGPTQ introduces a PTQ pipeline that produces a single parent model optimized for multiple target precisions using a small calibration set. It formulates Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, includes budget-aware search for heterogeneous per-layer bit-widths, and provides efficient kernels for slicing and mixed-precision execution.

Result: MatGPTQ preserves high-bit accuracy while substantially improving performance at low-bit-width settings across standard LLMs and benchmarks, establishing new state-of-the-art for Matryoshka-style post-training quantization.

Conclusion: MatGPTQ makes single-checkpoint, multi-precision deployment open and practical by addressing MatQuant’s limitations through efficient post-training quantization with kernel support, enabling flexible deployment across different memory and latency budgets.

Abstract: Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served across multiple precisions, by slicing the most significant bits (MSB) at inference time. This enables a single checkpoint to cover a wide range of memory and latency budgets, but renders quantization much more challenging. In particular, the initial MatQuant relies on expensive quantization-aware training (QAT) variants, rather than fast one-shot post training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one-shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, “sliceable” model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-witdhs and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low-bit-witdh settings. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical. Code is available at https://github.com/IST-DASLab/MatGPTQ.

[647] APEX: Probing Neural Networks via Activation Perturbation

Tao Ren, Xiaoyu Luo, Qiongxiu Li

Main category: cs.LG

TL;DR: APEX is an inference-time probing method that perturbs hidden activations to reveal structural information in neural network representations, distinguishing it from input-space analysis and parameter perturbation approaches.

Details

Motivation: Existing neural network probing methods (input-space analysis and parameter perturbation) have fundamental limitations in accessing structural information encoded in intermediate representations. There's a need for methods that can better reveal model-dependent behavior and representation-level structure.

Method: Activation Perturbation for EXploration (APEX) - perturbs hidden activations during inference while keeping both inputs and model parameters fixed. Theoretically shows this induces transition from sample-dependent to model-dependent behavior by suppressing input-specific signals and amplifying representation-level structure.

Result: In small-noise regime: APEX provides lightweight measure of sample regularity aligning with established metrics, distinguishes structured from randomly labeled models, reveals semantically coherent prediction transitions. In large-noise regime: exposes training-induced model-level biases, including concentration of predictions on target class in backdoored models.

Conclusion: APEX offers an effective perspective for exploring and understanding neural networks beyond what’s accessible from input space alone, providing new insights into model behavior and representation structure.

Abstract: Prior work on probing neural networks primarily relies on input-space analysis or parameter perturbation, both of which face fundamental limitations in accessing structural information encoded in intermediate representations. We introduce Activation Perturbation for EXploration (APEX), an inference-time probing paradigm that perturbs hidden activations while keeping both inputs and model parameters fixed. We theoretically show that activation perturbation induces a principled transition from sample-dependent to model-dependent behavior by suppressing input-specific signals and amplifying representation-level structure, and further establish that input perturbation corresponds to a constrained special case of this framework. Through representative case studies, we demonstrate the practical advantages of APEX. In the small-noise regime, APEX provides a lightweight and efficient measure of sample regularity that aligns with established metrics, while also distinguishing structured from randomly labeled models and revealing semantically coherent prediction transitions. In the large-noise regime, APEX exposes training-induced model-level biases, including a pronounced concentration of predictions on the target class in backdoored models. Overall, our results show that APEX offers an effective perspective for exploring, and understanding neural networks beyond what is accessible from input space alone.

[648] How to Train Your Resistive Network: Generalized Equilibrium Propagation and Analytical Learning

Jonathan Lin, Aman Desai, Frank Barrows, Francesco Caravelli

Main category: cs.LG

TL;DR: Exact gradient calculation algorithm for analog computing systems using graph theory and Kirchhoff’s laws, enabling local training of resistor networks without full replica networks.

Details

Motivation: Analog computing systems offer energy-efficient alternatives to digital hardware for machine learning, but face challenges in training due to physical locality constraints. Current local learning algorithms like Equilibrium Propagation and Coupled Learning need improvement for practical implementation.

Method: Developed an exact gradient calculation algorithm using graph theory and analytical framework for Kirchhoff’s laws. Introduced Generalized Equilibrium Propagation framework encompassing Hebbian learning algorithms. Demonstrated training of resistor networks without replica networks or full resistor readouts.

Result: Numerical simulations show successful training of resistor networks using only output layer measurements. The analytical gradient approach allows updating only a subset of resistance values without significant performance degradation.

Conclusion: The proposed algorithm enables efficient training of analog computing systems while respecting physical locality constraints, advancing practical implementation of energy-efficient analog machine learning hardware.

Abstract: Machine learning is a powerful method of extracting meaning from data; unfortunately, current digital hardware is extremely energy-intensive. There is interest in an alternative analog computing implementation that could match the performance of traditional machine learning while being significantly more energy-efficient. However, it remains unclear how to train such analog computing systems while adhering to locality constraints imposed by the physical (as opposed to digital) nature of these systems. Local learning algorithms such as Equilibrium Propagation and Coupled Learning have been proposed to address this issue. In this paper, we develop an algorithm to exactly calculate gradients using a graph theoretic and analytical framework for Kirchhoff’s laws. We also introduce Generalized Equilibrium Propagation, a framework encompassing a broad class of Hebbian learning algorithms, including Coupled Learning and Equilibrium Propagation, and show how our algorithm compares. We demonstrate our algorithm using numerical simulations and show that we can train resistor networks without the need for a replica or readout over all resistors, only at the output layer. We also show that under the analytical gradient approach, it is possible to update only a subset of the resistance values without a strong degradation in performance.

[649] Equilibrium Propagation for Non-Conservative Systems

Antonino Emanuele Scurria, Dimitri Vanden Abeele, Bortolo Matteo Mognetti, Serge Massar

Main category: cs.LG

TL;DR: Equilibrium Propagation extended to nonconservative systems with non-reciprocal interactions, enabling exact gradient computation for arbitrary systems including feedforward networks.

Details

Motivation: Original Equilibrium Propagation was limited to conservative systems with energy-based dynamics. Many practical applications involve nonconservative systems with non-reciprocal interactions, requiring extension of EP to these more general cases.

Method: Proposes a framework extending EP to arbitrary nonconservative systems by modifying learning phase dynamics with a term proportional to the non-reciprocal part of interactions. Also presents variational formulation using energy function over augmented state space to generate learning dynamics.

Result: Achieves exact gradient computation of cost function for nonconservative systems. Numerical experiments on MNIST show better performance and faster learning compared to previous proposals.

Conclusion: Successfully extends Equilibrium Propagation to nonconservative systems while maintaining key property of using stationary states for both inference and learning, enabling exact gradient computation for broader class of dynamical systems.

Abstract: Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, $\textit{i.e.}$ to dynamics which derive from an energy function. Given their importance in applications, it is important to extend EP to nonconservative systems, $\textit{i.e.}$ systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary nonconservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments using the MNIST database show that this algorithm achieves better performance and learns faster than previous proposals.

[650] NPCNet: Navigator-Driven Pseudo Text for Deep Clustering of Early Sepsis Phenotyping

Pi-Ju Tsai, Charkkri Limbud, Kuan-Fu Chen, Yi-Ju Tseng

Main category: cs.LG

TL;DR: NPCNet is a deep clustering network with target navigator that identifies clinically distinct sepsis phenotypes from temporal EHR data, enabling more precise treatment strategies.

Details

Motivation: Sepsis is heterogeneous but current clustering approaches rarely incorporate clinical relevance, limiting their ability to identify clinically distinct phenotypes for precise treatment.

Method: NPCNet (deep clustering network with target navigator) integrates temporal Electronic Health Records (EHRs) to align sepsis phenotypes with clinical significance through improved clustering.

Result: Identified four sepsis phenotypes (α, β, γ, δ) with divergent SOFA trajectories; differentiated patients likely to improve (α) from those at risk of deterioration (δ); found α, β, and δ phenotypes may benefit from early vasopressor administration.

Conclusion: NPCNet enhances precision treatment strategies by uncovering clinically distinct sepsis phenotypes that better reflect clinical reality and enable targeted interventions.

Abstract: Sepsis is a heterogeneous syndrome. Identifying clinically distinct phenotypes may enable more precise treatment strategies. In recent years, many researchers have applied clustering algorithms to sepsis patients. However, the clustering process rarely incorporates clinical relevance, potentially limiting to reflect clinically distinct phenotypes. We propose NPCNet, a novel deep clustering network with a target navigator that integrates temporal Electronic Health Records (EHRs) to better align sepsis phenotypes with clinical significance. We identify four sepsis phenotypes ($α$, $β$, $γ$, and $δ$) with divergence in SOFA trajectories. Notably, while $α$ and $δ$ phenotypes both show severe conditions in the early stage, NPCNet effectively differentiates patients who are likely to improve ($α$) from those at risk of deterioration ($δ$). Furthermore, through the treatment effect analysis, we discover that $α$, $β$, and $δ$ phenotypes may benefit from early vasopressor administration. The results show that NPCNet enhances precision treatment strategies by uncovering clinically distinct phenotypes.

[651] ContraLog: Log File Anomaly Detection with Contrastive Learning and Masked Language Modeling

Simon Dietz, Kai Klede, An Nguyen, Bjoern M Eskofier

Main category: cs.LG

TL;DR: ContraLog is a parser-free, self-supervised method for log anomaly detection that predicts continuous message embeddings instead of discrete template IDs, using a combination of masked language modeling and contrastive learning.

Details

Motivation: Traditional log anomaly detection methods rely on log parsers that collapse messages into discrete templates, discarding variable values and semantic content. This limits their ability to capture rich information in log messages.

Method: ContraLog uses a message encoder to produce rich embeddings for individual log messages and a sequence encoder to model temporal dependencies. It’s trained with masked language modeling and contrastive learning to predict masked message embeddings based on surrounding context.

Result: Experiments on HDFS, BGL, and Thunderbird benchmark datasets show effectiveness on complex datasets with diverse log messages. Message embeddings generated by ContraLog carry meaningful information and are predictive of anomalies even without sequence context.

Conclusion: Embedding-level prediction is a promising approach for log anomaly detection with potential applicability to other event sequences, offering a parser-free alternative to traditional template-based methods.

Abstract: Log files record computational events that reflect system state and behavior, making them a primary source of operational insights in modern computer systems. Automated anomaly detection on logs is therefore critical, yet most established methods rely on log parsers that collapse messages into discrete templates, discarding variable values and semantic content. We propose ContraLog, a parser-free and self-supervised method that reframes log anomaly detection as predicting continuous message embeddings rather than discrete template IDs. ContraLog combines a message encoder that produces rich embeddings for individual log messages with a sequence encoder to model temporal dependencies within sequences. The model is trained with a combination of masked language modeling and contrastive learning to predict masked message embeddings based on the surrounding context. Experiments on the HDFS, BGL, and Thunderbird benchmark datasets empirically demonstrate effectiveness on complex datasets with diverse log messages. Additionally, we find that message embeddings generated by ContraLog carry meaningful information and are predictive of anomalies even without sequence context. These results highlight embedding-level prediction as an approach for log anomaly detection, with potential applicability to other event sequences.

[652] CoGenCast: A Coupled Autoregressive-Flow Generative Framework for Time Series Forecasting

Yaguo Liu, Mingyue Cheng, Daoyu Wang, Xiaoyu Tao, Qi Liu

Main category: cs.LG

TL;DR: CoGenCast is a hybrid generative framework for time series forecasting that couples pre-trained LLMs with flow-matching to handle both semantic context understanding and continuous stochastic dynamics.

Details

Motivation: Time series forecasting requires both semantic understanding of contextual conditions and stochastic modeling of continuous temporal dynamics. Existing approaches use either autoregressive LLMs for semantic context or diffusion models for probabilistic generation, but neither alone adequately models both aspects simultaneously.

Method: Reconfigures pre-trained decoder-only LLMs into a native forecasting encoder-decoder backbone by modifying attention topology for bidirectional context encoding and causal representation generation. Integrates flow-matching mechanism to model temporal evolution, capturing continuous stochastic dynamics conditioned on autoregressively generated representations.

Result: Extensive experiments on multiple benchmarks show CoGenCast consistently outperforms previous baselines. The framework naturally supports multimodal forecasting and cross-domain unified training.

Conclusion: CoGenCast effectively combines LLMs for semantic understanding with flow-matching for stochastic dynamics modeling, creating a powerful hybrid approach for time series forecasting that supports multimodal applications.

Abstract: Time series forecasting can be viewed as a generative problem that requires both semantic understanding over contextual conditions and stochastic modeling of continuous temporal dynamics. Existing approaches typically rely on either autoregressive large language models (LLMs) for semantic context modeling or diffusion-like models for continuous probabilistic generation. However, neither method alone can adequately model both aspects simultaneously. In this work, we propose CoGenCast, a hybrid generative framework that couples pre-trained LLMs with flow-matching mechanism for effective time series forecasting. Specifically, we reconfigure pre-trained decoder-only LLMs into a native forecasting encoder-decoder backbone by modifying only the attention topology, enabling bidirectional context encoding and causal representation generation. Building on this, a flow-matching mechanism is further integrated to model temporal evolution, capturing continuous stochastic dynamics conditioned on the autoregressively generated representation. Notably, CoGenCast naturally supports multimodal forecasting and cross-domain unified training. Extensive experiments on multiple benchmarks show that CoGenCast consistently outperforms previous compared baselines. Code is available at https://github.com/liuyaguo/_CoGenCast.

[653] Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine

Main category: cs.LG

TL;DR: NLAC is a novel actor-critic algorithm that trains LLM agents using a generative LLM critic that produces natural language feedback instead of scalar rewards, enabling more stable and data-efficient training in long-horizon tasks with sparse rewards.

Details

Motivation: Training LLM agents in long-horizon tasks with sparse rewards using traditional policy gradient methods leads to noisy training signals, instability, high sample complexity, and difficulty in exploration when actions are in natural language space.

Method: Proposes Natural Language Actor-Critic (NLAC) algorithm where a generative LLM critic provides natural language explanations for why actions are suboptimal, offering richer training signals. The approach can be trained off-policy without policy gradients, improving data efficiency and stability.

Result: NLAC shows promise in outperforming existing training approaches on reasoning, web browsing, and tool-use with dialogue tasks, offering a more scalable and stable training paradigm for LLM agents.

Conclusion: NLAC represents a novel approach to training LLM agents that leverages LLMs’ natural language capabilities to provide richer feedback, addressing key challenges in sparse-reward, long-horizon tasks and offering improved stability and data efficiency.

Abstract: Large language model (LLM) agents – LLMs that dynamically interact with an environment over long horizons – have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

[654] Universal One-third Time Scaling in Learning Peaked Distributions

Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

Main category: cs.LG

TL;DR: The paper identifies that softmax and cross-entropy loss cause power-law convergence in LLM training due to learning peaked probability distributions, leading to universal 1/3 exponent scaling and optimization bottlenecks.

Details

Motivation: Training large language models is computationally expensive due to slow power-law convergence of the loss, but the origin of this behavior remains debatable. The authors aim to understand the fundamental causes of this optimization bottleneck.

Method: Systematic analysis of toy models combined with empirical evaluation of LLMs to investigate the relationship between softmax/cross-entropy components and power-law convergence when learning peaked probability distributions like next-token distributions.

Result: The study shows that softmax and cross-entropy intrinsically cause power-law vanishing losses and gradients when learning peaked distributions, leading to universal power-law time scaling with exponent 1/3. This provides a mechanistic explanation for observed neural scaling laws.

Conclusion: The findings offer a fundamental explanation for LLM training inefficiencies and suggest new directions for improving training efficiency by addressing the optimization bottlenecks created by softmax and cross-entropy components.

Abstract: Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components yield power-law vanishing losses and gradients, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.

[655] Riemannian Neural Optimal Transport

Alessandro Micheli, Yueqi Cao, Anthea Monod, Samir Bhatt

Main category: cs.LG

TL;DR: RNOT introduces continuous neural-network parameterizations of optimal transport maps on manifolds that avoid discretization and overcome the curse of dimensionality, with sub-exponential complexity in dimension.

Details

Motivation: Neural optimal transport methods are currently limited to Euclidean geometry, and extending them to high-dimensional Riemannian manifolds faces the curse of dimensionality where achieving fixed accuracy requires exponentially growing parameters with manifold dimension.

Method: Introduces Riemannian Neural OT (RNOT) maps - continuous neural-network parameterizations of OT maps on manifolds that avoid discretization and incorporate geometric structure by construction, achieving sub-exponential complexity in dimension.

Result: Experiments on synthetic and real datasets demonstrate improved scalability and competitive performance relative to discretization-based baselines, with theoretical guarantees of sub-exponential complexity.

Conclusion: RNOT provides a principled framework for neural optimal transport on manifolds that overcomes the curse of dimensionality, enabling scalable OT-based generative modeling in high-dimensional non-Euclidean spaces.

Abstract: Computational optimal transport (OT) offers a principled framework for generative modeling. Neural OT methods, which use neural networks to learn an OT map (or potential) from data in an amortized way, can be evaluated out of sample after training, but existing approaches are tailored to Euclidean geometry. Extending neural OT to high-dimensional Riemannian manifolds remains an open challenge. In this paper, we prove that any method for OT on manifolds that produces discrete approximations of transport maps necessarily suffers from the curse of dimensionality: achieving a fixed accuracy requires a number of parameters that grows exponentially with the manifold dimension. Motivated by this limitation, we introduce Riemannian Neural OT (RNOT) maps, which are continuous neural-network parameterizations of OT maps on manifolds that avoid discretization and incorporate geometric structure by construction. Under mild regularity assumptions, we prove that RNOT maps approximate Riemannian OT maps with sub-exponential complexity in the dimension. Experiments on synthetic and real datasets demonstrate improved scalability and competitive performance relative to discretization-based baselines.

[656] Encoder-Free Knowledge-Graph Reasoning with LLMs via Hyperdimensional Path Retrieval

Yezi Liu, William Youngwoo Chung, Hanning Chen, Calvin Yeung, Mohsen Imani

Main category: cs.LG

TL;DR: PathHD: An encoder-free knowledge-graph reasoning framework using hyperdimensional computing with single LLM call per query for efficient, interpretable QA.

Details

Motivation: Current KG-based QA systems suffer from efficiency and transparency issues due to multiple neural encoders or repeated LLM calls for path scoring, leading to high latency, GPU costs, and poor auditability.

Method: PathHD uses hyperdimensional computing to represent relation paths as block-diagonal GHRR hypervectors, retrieves candidate paths via calibrated blockwise cosine similarity with Top-K pruning, and performs one-shot LLM adjudication for final answer with supporting paths.

Result: Matches or improves Hits@1 compared to neural baselines on WebQSP, CWQ, and GrailQA while using only one LLM call per query, reduces latency by 40-60%, and lowers GPU memory by 3-5× due to encoder-free retrieval.

Conclusion: Engineered HDC path representations provide effective substrate for efficient and faithful KG-LLM reasoning, achieving strong accuracy-efficiency-interpretability trade-off.

Abstract: Recent progress in large language models (LLMs) has made knowledge-grounded reasoning increasingly practical, yet KG-based QA systems often pay a steep price in efficiency and transparency. In typical pipelines, symbolic paths are scored by neural encoders or repeatedly re-ranked by multiple LLM calls, which inflates latency and GPU cost and makes the decision process hard to audit. We introduce PathHD, an encoder-free framework for knowledge-graph reasoning that couples hyperdimensional computing (HDC) with a single LLM call per query. Given a query, PathHD represents relation paths as block-diagonal GHRR hypervectors, retrieves candidate paths using a calibrated blockwise cosine similarity with Top-K pruning, and then performs a one-shot LLM adjudication that outputs the final answer together with supporting, citeable paths. The design is enabled by three technical components: (i) an order-sensitive, non-commutative binding operator for composing multi-hop paths, (ii) a robust similarity calibration that stabilizes hypervector retrieval, and (iii) an adjudication stage that preserves interpretability while avoiding per-path LLM scoring. Across WebQSP, CWQ, and GrailQA, PathHD matches or improves Hits@1 compared to strong neural baselines while using only one LLM call per query, reduces end-to-end latency by $40-60%$, and lowers GPU memory by $3-5\times$ due to encoder-free retrieval. Overall, the results suggest that carefully engineered HDC path representations can serve as an effective substrate for efficient and faithful KG-LLM reasoning, achieving a strong accuracy-efficiency-interpretability trade-off.

[657] QuAIL: Quality-Aware Inertial Learning for Robust Training under Data Corruption

Mattia Sabella, Alberto Archetti, Pietro Pinoli, Matteo Matteucci, Cinzia Cappiello

Main category: cs.LG

TL;DR: QuAIL is a quality-informed training mechanism that incorporates feature reliability priors into learning to improve robustness against structured corruption in tabular data without explicit data cleaning.

Details

Motivation: Tabular ML systems often face non-uniform corruption (noise, missing values, biases) with only column-level reliability indicators available, limiting existing robustness techniques that require instance-wise quality annotations.

Method: QuAIL augments models with a learnable feature-modulation layer whose updates are selectively constrained by a quality-dependent proximal regularizer, inducing controlled adaptation across features of varying trustworthiness without explicit data repair.

Result: Evaluation across 50 classification/regression datasets shows QuAIL consistently improves average performance over neural baselines under both random and value-dependent corruption, especially robust in low-data and systematically biased settings.

Conclusion: Incorporating feature reliability information directly into optimization dynamics is a practical and effective approach for resilient tabular learning.

Abstract: Tabular machine learning systems are frequently trained on data affected by non-uniform corruption, including noisy measurements, missing entries, and feature-specific biases. In practice, these defects are often documented only through column-level reliability indicators rather than instance-wise quality annotations, limiting the applicability of many robustness and cleaning techniques. We present QuAIL, a quality-informed training mechanism that incorporates feature reliability priors directly into the learning process. QuAIL augments existing models with a learnable feature-modulation layer whose updates are selectively constrained by a quality-dependent proximal regularizer, thereby inducing controlled adaptation across features of varying trustworthiness. This stabilizes optimization under structured corruption without explicit data repair or sample-level reweighting. Empirical evaluation across 50 classification and regression datasets demonstrates that QuAIL consistently improves average performance over neural baselines under both random and value-dependent corruption, with especially robust behavior in low-data and systematically biased settings. These results suggest that incorporating feature reliability information directly into optimization dynamics is a practical and effective approach for resilient tabular learning.

Bixing Wu, Yuhong Zhao, Zongli Ye, Jiachen Lian, Xiangyu Yue, Gopala Anumanchipalli

Main category: cs.LG

TL;DR: AHA framework uses asymmetric hierarchical anchoring with audio RVQ as semantic anchor to guide video feature distillation for cross-modal generalization, outperforming symmetric baselines.

Details

Motivation: Existing symmetric frameworks for audio-visual joint representation learning suffer from information allocation ambiguity and semantic leakage across modalities during cross-modal generalization.

Method: Proposes Asymmetric Hierarchical Anchoring (AHA) that uses audio Residual Vector Quantization (RVQ) hierarchical discrete representations as semantic anchor to guide video feature distillation. Includes GRL-based adversarial decoupler to suppress semantic leakage and Local Sliding Alignment for fine-grained temporal alignment.

Result: Extensive experiments on AVE and AVVP benchmarks show AHA consistently outperforms symmetric baselines in cross-modal transfer. Talking-face disentanglement experiments validate improved semantic consistency and disentanglement.

Conclusion: AHA framework effectively addresses information allocation ambiguity in cross-modal generalization through asymmetric hierarchical anchoring, demonstrating improved performance and broader applicability.

Abstract: Audio-visual joint representation learning under Cross-Modal Generalization (CMG) aims to transfer knowledge from a labeled source modality to an unlabeled target modality through a unified discrete representation space. Existing symmetric frameworks often suffer from information allocation ambiguity, where the absence of structural inductive bias leads to semantic-specific leakage across modalities. We propose Asymmetric Hierarchical Anchoring (AHA), which enforces directional information allocation by designating a structured semantic anchor within a shared hierarchy. In our instantiation, we exploit the hierarchical discrete representations induced by audio Residual Vector Quantization (RVQ) to guide video feature distillation into a shared semantic space. To ensure representational purity, we replace fragile mutual information estimators with a GRL-based adversarial decoupler that explicitly suppresses semantic leakage in modality-specific branches, and introduce Local Sliding Alignment (LSA) to encourage fine-grained temporal alignment across modalities. Extensive experiments on AVE and AVVP benchmarks demonstrate that AHA consistently outperforms symmetric baselines in cross-modal transfer. Additional analyses on talking-face disentanglement experiment further validate that the learned representations exhibit improved semantic consistency and disentanglement, indicating the broader applicability of the proposed framework.

[659] KVzap: Fast, Adaptive, and Faithful KV Cache Pruning

Simon Jegou, Maximilian Jeblick

Main category: cs.LG

TL;DR: KVzap: Fast input-adaptive KV cache compression method that achieves 2-4× compression with negligible accuracy loss, outperforming existing methods on KVpress benchmark.

Details

Motivation: Growing transformer context lengths make KV cache a critical inference bottleneck; existing KV cache pruning methods have speed-accuracy trade-offs preventing adoption in major inference engines.

Method: KVzap is a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding phases, providing efficient KV cache compression.

Result: Achieves 2-4× KV cache compression on Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks with negligible accuracy loss; state-of-the-art on KVpress leaderboard.

Conclusion: KVzap provides practical KV cache compression solution that balances speed and accuracy, making it suitable for deployment in inference engines.

Abstract: Growing context lengths in transformer-based language models have made the key-value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed–accuracy trade-offs. We introduce KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves $2$–$4\times$ KV cache compression with negligible accuracy loss and achieves state-of-the-art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.

[660] LLM-Inspired Pretrain-Then-Finetune for Small-Data, Large-Scale Optimization

Zishi Zhang, Jinhui Han, Ming Hu, Yijie Peng

Main category: cs.LG

TL;DR: A pretrain-then-finetune Transformer approach for small-data, large-scale decision problems, using domain-informed synthetic data for pretraining and real observations for fine-tuning.

Details

Motivation: Address the challenge of making many operational decisions simultaneously (e.g., across large product portfolios) with only few, potentially noisy data points per instance, inspired by LLM success.

Method: Design a Transformer model with problem-specific architecture and tailored training procedure. First pretrain on large-scale domain-informed synthetic data encoding managerial knowledge, then fine-tune on real observations.

Result: Develops comprehensive error analysis with nonasymptotic guarantees validating method effectiveness. Shows pretraining injects domain knowledge and enables high-capacity training, while fine-tuning adapts to operational environment.

Conclusion: The approach leverages Transformer’s representational capacity with attention mechanism to extract cross-task structure, with fine-tuning exhibiting economies-of-scale effect where transfer learning becomes more effective as instances grow.

Abstract: We consider small-data, large-scale decision problems in which a firm must make many operational decisions simultaneously (e.g., across a large product portfolio) while observing only a few, potentially noisy, data points per instance. Inspired by the success of large language models (LLMs), we propose a pretrain-then-finetune approach built on a designed Transformer model to address this challenge. The model is first pretrained on large-scale, domain-informed synthetic data that encode managerial knowledge and structural features of the decision environment, and is then fine-tuned on real observations. This new pipeline offers two complementary advantages: pretraining injects domain knowledge into the learning process and enables the training of high-capacity models using abundant synthetic data, while finetuning adapts the pretrained model to the operational environment and improves alignment with the true data-generating regime. While we have leveraged the Transformer’s state-of-the-art representational capacity, particularly its attention mechanism, to efficiently extract cross-task structure, our approach is not an off-the-shelf application. Instead, it relies on problem-specific architectural design and a tailored training procedure to match the decision setting. Theoretically, we develop the first comprehensive error analysis regarding Transformer learning in relevant contexts, establishing nonasymptotic guarantees that validate the method’s effectiveness. Critically, our analysis reveals how pretraining and fine-tuning jointly determine performance, with the dominant contribution governed by whichever is more favorable. In particular, finetuning exhibits an economies-of-scale effect, whereby transfer learning becomes increasingly effective as the number of instances grows.

[661] Optimization and Generation in Aerodynamics Inverse Design

Huaguan Chen, Ning Lin, Luxi Chen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun

Main category: cs.LG

TL;DR: The paper presents a unified framework for inverse design with physics-based objectives, focusing on aerodynamic shape optimization through optimization and guided generation approaches with improved cost predictors and density-gradient optimization.

Details

Motivation: Inverse design with physics-based objectives is challenging due to high-dimensional geometry coupled with expensive simulations, particularly in aerodynamic shape optimization for drag reduction. The paper aims to address these challenges through a unified framework.

Method: The authors revisit inverse design through two canonical solutions (optimal design point and optimal design distribution), propose a new training loss for cost predictors, develop a density-gradient optimization method, unify existing training-free guided generation methods, and create a time- and memory-efficient algorithm for approximate covariance estimation in high dimensions.

Result: Experiments on 2D studies and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results support the generality of the approach.

Conclusion: The paper presents a comprehensive framework that improves inverse design for physics-based objectives, particularly in aerodynamic applications, with demonstrated effectiveness across various benchmarks and validation methods.

Abstract: Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.

[662] Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Alexandru Meterez, Pranav Ajit Nair, Depen Morwani, Cengiz Pehlevan, Sham Kakade

Main category: cs.LG

TL;DR: Anytime learning schedules with weight averaging achieve comparable performance to horizon-dependent cosine schedules for large language model pretraining.

Details

Motivation: Most existing pretraining recipes rely on horizon-dependent learning rate schedules that require extensive tuning under fixed compute budgets, which is problematic for continual or open-ended training where total training horizon is unknown.

Method: Theoretical analysis of anytime learning schedules for overparameterized linear regression, highlighting weight averaging’s role in achieving minimax convergence rates. Empirical evaluation of 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and 1/√t schedules with weight averaging against well-tuned cosine schedules.

Result: Anytime schedules achieve comparable final loss to cosine decay across the full training range. Weight averaging combined with simple, horizon-free step sizes offers practical and effective alternative to cosine learning rate schedules.

Conclusion: Weight averaging with simple horizon-free step sizes provides a practical anytime alternative to cosine schedules for large language model pretraining in continual or open-ended settings.

Abstract: Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.

[663] SAGE-5GC: Security-Aware Guidelines for Evaluating Anomaly Detection in the 5G Core Network

Cristian Manca, Christian Scano, Giorgio Piras, Fabio Brau, Maura Pintor, Battista Biggio

Main category: cs.LG

TL;DR: Study on realistic evaluation of anomaly detection for 5G Core networks, proposing security-aware guidelines and demonstrating vulnerability to adversarial attacks using genetic algorithms.

Details

Motivation: Existing anomaly detection systems for 5G Core networks are evaluated under unrealistic assumptions (IID data, no adaptive attackers), failing to account for real-world deployment challenges and adversarial threats.

Method: Proposed SAGE-5GC guidelines for security-aware evaluation; trained anomaly detectors on realistic 5G Core dataset; analyzed model sensitivity via randomized perturbations; developed genetic algorithm-based optimization for adversarial attacks using only attacker-controllable features.

Result: Adversarially crafted attacks significantly degrade detection performance, demonstrating vulnerability of current anomaly detectors and highlighting need for robust evaluation methodologies.

Conclusion: Current 5G anomaly detection systems are vulnerable to adversarial attacks; security-aware evaluation frameworks like SAGE-5GC are essential for realistic assessment of robustness in operational environments.

Abstract: Machine learning-based anomaly detection systems are increasingly being adopted in 5G Core networks to monitor complex, high-volume traffic. However, most existing approaches are evaluated under strong assumptions that rarely hold in operational environments, notably the availability of independent and identically distributed (IID) data and the absence of adaptive attackers.In this work, we study the problem of detecting 5G attacks \textit{in the wild}, focusing on realistic deployment settings. We propose a set of Security-Aware Guidelines for Evaluating anomaly detectors in 5G Core Network (SAGE-5GC), driven by domain knowledge and consideration of potential adversarial threats. Using a realistic 5G Core dataset, we first train several anomaly detectors and assess their baseline performance against standard 5GC control-plane cyberattacks targeting PFCP-based network services.We then extend the evaluation to adversarial settings, where an attacker tries to manipulate the observable features of the network traffic to evade detection, under the constraint that the intended functionality of the malicious traffic is preserved. Starting from a selected set of controllable features, we analyze model sensitivity and adversarial robustness through randomized perturbations. Finally, we introduce a practical optimization strategy based on genetic algorithms that operates exclusively on attacker-controllable features and does not require prior knowledge of the underlying detection model. Our experimental results show that adversarially crafted attacks can substantially degrade detection performance, underscoring the need for robust, security-aware evaluation methodologies for anomaly detection in 5G networks deployed in the wild.

[664] Decision-oriented benchmarking to transform AI weather forecast access: Application to the Indian monsoon

Rajat Masiwal, Colin Aitken, Adam Marchakitus, Mayank Gupta, Katherine Kowal, Hamid A. Pahlavan, Tyler Yang, Y. Qiang Sun, Michael Kremer, Amir Jina, William R. Boos, Pedram Hassanzadeh

Main category: cs.LG

TL;DR: AI weather prediction models are evaluated using a decision-oriented framework connecting meteorology, AI, and social sciences, applied to Indian monsoon forecasting for agricultural benefits.

Details

Motivation: Current AI weather prediction (AIWP) model evaluations focus on aggregated meteorological metrics without considering local stakeholders' needs in operational decision-making contexts, especially for vulnerable populations in low- and middle-income countries facing weather shocks.

Method: Developed a decision-oriented benchmarking framework connecting meteorology, AI, and social sciences. Applied it to Indian monsoon forecasting, focusing on agriculturally relevant onset indices at regional scales with deterministic and probabilistic out-of-sample evaluation.

Result: AIWP models skillfully predicted agriculturally relevant monsoon onset indices weeks in advance. The framework informed a 2025 government effort to send AI-based monsoon onset forecasts to 38 million Indian farmers, successfully capturing an unusual weeks-long pause in monsoon progression.

Conclusion: The decision-oriented benchmarking framework provides a blueprint for harnessing AIWP models to help vulnerable populations adapt to weather shocks in the face of climate variability and change, bridging the gap between technical AI capabilities and practical societal needs.

Abstract: Artificial intelligence weather prediction (AIWP) models now often outperform traditional physics-based models on common metrics while requiring orders-of-magnitude less computing resources and time. Open-access AIWP models thus hold promise as transformational tools for helping low- and middle-income populations make decisions in the face of high-impact weather shocks. Yet, current approaches to evaluating AIWP models focus mainly on aggregated meteorological metrics without considering local stakeholders’ needs in decision-oriented, operational frameworks. Here, we introduce such a framework that connects meteorology, AI, and social sciences. As an example, we apply it to the 150-year-old problem of Indian monsoon forecasting, focusing on benefits to rain-fed agriculture, which is highly susceptible to climate change. AIWP models skillfully predict an agriculturally relevant onset index at regional scales weeks in advance when evaluated out-of-sample using deterministic and probabilistic metrics. This framework informed a government-led effort in 2025 to send 38 million Indian farmers AI-based monsoon onset forecasts, which captured an unusual weeks-long pause in monsoon progression. This decision-oriented benchmarking framework provides a key component of a blueprint for harnessing the power of AIWP models to help large vulnerable populations adapt to weather shocks in the face of climate variability and change.

[665] Explanations Leak: Membership Inference with Differential Privacy and Active Learning Defense

Fatima Ezzeddine, Osama Zammar, Silvia Giordano, Omran Ayoub

Main category: cs.LG

TL;DR: Counterfactual explanations in MLaaS systems can strengthen membership inference attacks, requiring defense mechanisms that balance privacy, utility, and explainability.

Details

Motivation: While counterfactual explanations improve transparency in MLaaS systems, they may expand the attack surface by strengthening privacy attacks like membership inference. The impact of explanations on privacy threats is insufficiently understood, necessitating research into defense mechanisms that protect privacy without undermining utility and explainability.

Method: 1) Systematic analysis of how exposing counterfactual explanations through query-based APIs enables more effective shadow-based membership inference attacks. 2) Proposed defense framework integrating Differential Privacy with Active Learning to jointly reduce memorization and limit effective training data exposure. 3) Extensive empirical evaluation characterizing the three-way trade-off between privacy leakage, predictive performance, and explanation quality.

Result: The research demonstrates that counterfactual explanations can significantly strengthen membership inference attacks in MLaaS systems. The proposed DP-AL defense framework helps mitigate this risk while maintaining reasonable utility and explanation quality, highlighting the complex trade-offs involved.

Conclusion: There is a critical need to carefully balance transparency, utility, and privacy in the responsible deployment of explainable MLaaS systems. Counterfactual explanations expand the attack surface for privacy attacks, requiring integrated defense mechanisms that address this emerging risk.

Abstract: Counterfactual explanations (CFs) are increasingly integrated into Machine Learning as a Service (MLaaS) systems to improve transparency; however, ML models deployed via APIs are already vulnerable to privacy attacks such as membership inference and model extraction, and the impact of explanations on this threat landscape remains insufficiently understood. In this work, we focus on the problem of how CFs expand the attack surface of MLaaS by strengthening membership inference attacks (MIAs), and on the need to design defense mechanisms that mitigate this emerging risk without undermining utility and explainability. First, we systematically analyze how exposing CFs through query-based APIs enables more effective shadow-based MIAs. Second, we propose a defense framework that integrates Differential Privacy (DP) with Active Learning (AL) to jointly reduce memorization and limit effective training data exposure. Finally, we conduct an extensive empirical evaluation to characterize the three-way trade-off between privacy leakage, predictive performance, and explanation quality. Our findings highlight the need to carefully balance transparency, utility, and privacy in the responsible deployment of explainable MLaaS systems.

[666] PRISM: Deriving a White-Box Transformer as a Signal-Noise Decomposition Operator via Maximum Coding Rate Reduction

Dongchen Huang

Main category: cs.LG

TL;DR: Prism is a white-box attention architecture derived from MCR² principles that uses π-RoPE to separate signal and noise subspaces, inducing unsupervised functional disentanglement where attention heads specialize into low-frequency (semantic) and high-frequency (syntactic) regimes.

Details

Motivation: Transformers are criticized as black boxes lacking interpretability. The paper aims to develop a theoretically grounded, interpretable attention architecture that unifies interpretability with performance through principled geometric construction rather than heuristic modifications.

Method: Proposes Prism architecture based on Maximizing Coding Rate Reduction principles. Models attention as gradient ascent on signal-noise manifold. Introduces π-RoPE (irrational frequency separation) to enforce incoherence between signal and noise subspaces. Draws analogy between attention and Hamiltonian dynamical systems, identifying RoPE’s geometric progression causes dense resonance networks and feature rank collapse.

Result: Empirical validation on 124M-parameter models trained on OpenWebText shows Prism spontaneously isolates Attention Sink pathology and maintains isentropic information flow across layers. Attention heads specialize into spectrally distinct regimes: low-frequency heads capture long-range causal dependencies (signal) and high-frequency heads handle local syntactic constraints (noise).

Conclusion: Interpretability and performance can be unified through principled geometric construction. The work offers a theoretically grounded alternative to heuristic architectural modifications and suggests physics-informed plug-and-play intervention KAM-RoPE for LLMs.

Abstract: Deep learning models, particularly Transformers, are often criticized as “black boxes” and lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principles of Maximizing Coding Rate Reduction ($\text{MCR}^2$). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce a specific irrational frequency separation ($π$-RoPE) to enforce incoherence between signal (semantic) and noise (syntactic) subspaces. We show empirical evidence that these geometric inductive biases can induce unsupervised functional disentanglement alone. Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capturing long-range causal dependencies (signal) and high-frequency heads handling local syntactic constraints and structural artifacts. To provide a theoretical grounding for these spectral phenomena, we draw an analogy between attention mechanism and a Hamiltonian dynamical system and identify that the standard geometric progression of Rotary Positional Embeddings (RoPE) induces dense resonance networks (Arnold Tongues), leading to feature rank collapse. Empirical validation on 124M-parameter models trained on OpenWebText demonstrates that Prism spontaneously isolates the Attention Sink pathology and maintains isentropic information flow across layers. Further, we suggest a physics-informed plug-and-play intervention KAM-RoPE for large language models (LLMs). Our results suggest that interpretability and performance can be unified through principled geometric construction, offering a theoretically grounded alternative to heuristic architectural modifications

[667] UniGeM: Unifying Data Mixing and Selection via Geometric Exploration and Mining

Changhao Wang, Yunfei Yu, Xinhao Yao, Jiaolong Yang, Riccardo Cantoro, Chaobo Li, Qing Cui, Jun Zhou

Main category: cs.LG

TL;DR: UniGeM: A unified framework for data curation in LLM scaling that treats data mixing and selection as manifold approximation, achieving 2x data efficiency and improved performance.

Details

Motivation: LLM scaling is increasingly limited by data quality, and existing methods handle data mixing and sample selection separately, which can break the structure in code corpora and other structured data.

Method: UniGeM unifies mixing and selection as manifold approximation without training proxy models or external datasets. It operates hierarchically: Macro-Exploration learns mixing weights via stability-based clustering, and Micro-Mining filters high-quality instances by their geometric distribution to ensure logical consistency.

Result: Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves 2.0× data efficiency over random baseline and improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.

Conclusion: UniGeM provides an effective unified framework for data curation that addresses data quality limitations in LLM scaling, particularly beneficial for structured data like code corpora.

Abstract: The scaling of Large Language Models (LLMs) is increasingly limited by data quality. Most methods handle data mixing and sample selection separately, which can break the structure in code corpora. We introduce \textbf{UniGeM}, a framework that unifies mixing and selection by treating data curation as a \textit{manifold approximation} problem without training proxy models or relying on external reference datasets. UniGeM operates hierarchically: \textbf{Macro-Exploration} learns mixing weights with stability-based clustering; \textbf{Micro-Mining} filters high-quality instances by their geometric distribution to ensure logical consistency. Validated by training 8B and 16B MoE models on 100B tokens, UniGeM achieves \textbf{2.0$\times$ data efficiency} over a random baseline and further improves overall performance compared to SOTA methods in reasoning-heavy evaluations and multilingual generalization.

[668] Quantization-Aware Regularizers for Deep Neural Networks Compression

Dario Malchiodi, Mattia Ferraretto, Marco Frasca

Main category: cs.LG

TL;DR: Proposes a novel quantization-aware training method using per-layer regularization to naturally cluster weights during training, integrating quantization directly into optimization to reduce accuracy loss while maintaining compression benefits.

Details

Motivation: Large, over-parameterized neural networks pose deployment challenges on resource-constrained devices. While weight quantization is effective for compression, it typically causes accuracy drops and is applied post-training without influencing the learning process.

Method: Introduces per-layer regularization terms that drive weights to naturally form clusters during training, embedding quantization awareness directly into optimization. Quantization representatives become network parameters, integrating quantization directly into backpropagation.

Result: Experiments on CIFAR-10 with AlexNet and VGG16 models confirm the effectiveness of the proposed strategy in reducing accuracy loss while preserving compression potential.

Conclusion: The approach successfully integrates quantization awareness into training, reducing accuracy drops associated with quantization while maintaining compression benefits, representing a novel way to embed quantization parameters into backpropagation.

Abstract: Deep Neural Networks reached state-of-the-art performance across numerous domains, but this progress has come at the cost of increasingly large and over-parameterized models, posing serious challenges for deployment on resource-constrained devices. As a result, model compression has become essential, and – among compression techniques – weight quantization is largely used and particularly effective, yet it typically introduces a non-negligible accuracy drop. However, it is usually applied to already trained models, without influencing how the parameter space is explored during the learning phase. In contrast, we introduce per-layer regularization terms that drive weights to naturally form clusters during training, integrating quantization awareness directly into the optimization process. This reduces the accuracy loss typically associated with quantization methods while preserving their compression potential. Furthermore, in our framework quantization representatives become network parameters, marking, to the best of our knowledge, the first approach to embed quantization parameters directly into the backpropagation procedure. Experiments on CIFAR-10 with AlexNet and VGG16 models confirm the effectiveness of the proposed strategy.

[669] Ultra Fast PDE Solving via Physics Guided Few-step Diffusion

Cindy Xiangrui Kong, Yueqi Wang, Haoyang Zheng, Weijian Luo, Guang Lin

Main category: cs.LG

TL;DR: Phys-Instruct: A physics-guided distillation framework that compresses diffusion-based PDE solvers into few-step generators while enhancing physical consistency through explicit PDE knowledge injection.

Details

Motivation: Diffusion models show promise for solving PDEs but suffer from high sampling costs (many-step iterative sampling) and insufficient physical consistency due to lack of explicit physics constraints.

Method: Proposes a physics-guided distillation framework that: (1) compresses pre-trained diffusion PDE solvers into few-step generators via distribution matching, and (2) enhances physics consistency through PDE distillation guidance with explicit PDE knowledge injection. Built on solid theoretical foundation with physics-constrained training objective admitting tractable gradients.

Result: Achieves orders-of-magnitude faster inference while reducing PDE error by more than 8× compared to state-of-the-art diffusion baselines across five PDE benchmarks. The resulting unconditional student model serves as compact prior for efficient, physically consistent inference in downstream conditional tasks.

Conclusion: Phys-Instruct is a novel, effective, and efficient framework for ultra-fast PDE solving powered by deep generative models, addressing key limitations of diffusion-based PDE solvers.

Abstract: Diffusion-based models have demonstrated impressive accuracy and generalization in solving partial differential equations (PDEs). However, they still face significant limitations, such as high sampling costs and insufficient physical consistency, stemming from their many-step iterative sampling mechanism and lack of explicit physics constraints. To address these issues, we propose Phys-Instruct, a novel physics-guided distillation framework which not only (1) compresses a pre-trained diffusion PDE solver into a few-step generator via matching generator and prior diffusion distributions to enable rapid sampling, but also (2) enhances the physics consistency by explicitly injecting PDE knowledge through a PDE distillation guidance. Physic-Instruct is built upon a solid theoretical foundation, leading to a practical physics-constrained training objective that admits tractable gradients. Across five PDE benchmarks, Phys-Instruct achieves orders-of-magnitude faster inference while reducing PDE error by more than 8 times compared to state-of-the-art diffusion baselines. Moreover, the resulting unconditional student model functions as a compact prior, enabling efficient and physically consistent inference for various downstream conditional tasks. Our results indicate that Phys-Instruct is a novel, effective, and efficient framework for ultra-fast PDE solving powered by deep generative models.

[670] Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes

Amrith Setlur, Zijian Wang, Andrew Cohen, Paria Rashidinejad, Sang Michael Xie

Main category: cs.LG

TL;DR: PrefixRL: A reinforcement learning method for LLM reasoning that conditions on prefixes of successful off-policy traces to boost learning efficiency on hard problems

Details

Motivation: Standard RL methods for LLM reasoning waste compute on hard problems where correct on-policy traces are rare, policy gradients vanish, and learning stalls. The paper aims to bootstrap more efficient RL by reusing old sampling FLOPs from prior inference or RL training.

Method: PrefixRL conditions on the prefix of successful off-policy traces and runs on-policy RL to complete them, avoiding off-policy instabilities. It modulates problem difficulty through off-policy prefix length and creates a self-improvement loop by sourcing off-policy traces via rejection sampling with the base model.

Result: PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL) and increases final reward by 3x on hard reasoning problems. Gains transfer to held-out benchmarks, and it remains effective even when off-policy traces come from different model families.

Conclusion: PrefixRL provides a more sample-efficient RL approach for LLM reasoning that leverages off-policy data while avoiding optimization instabilities, enabling faster learning and better performance on hard problems.

Abstract: Typical reinforcement learning (RL) methods for LLM reasoning waste compute on hard problems, where correct on-policy traces are rare, policy gradients vanish, and learning stalls. To bootstrap more efficient RL, we consider reusing old sampling FLOPs (from prior inference or RL training) in the form of off-policy traces. Standard off-policy methods supervise against off-policy data, causing instabilities during RL optimization. We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them, side-stepping off-policy instabilities. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more sample efficient. Empirically, we discover back-generalization: training only on prefixed problems generalizes to out-of-distribution unprefixed performance, with learned strategies often differing from those in the prefix. In our experiments, we source the off-policy traces by rejection sampling with the base model, creating a self-improvement loop. On hard reasoning problems, PrefixRL reaches the same training reward 2x faster than the strongest baseline (SFT on off-policy data then RL), even after accounting for the compute spent on the initial rejection sampling, and increases the final reward by 3x. The gains transfer to held-out benchmarks, and PrefixRL is still effective when off-policy traces are derived from a different model family, validating its flexibility in practical settings.

[671] CTTVAE: Latent Space Structuring for Conditional Tabular Data Generation on Imbalanced Datasets

Milosh Devic, Jordan Gierschendorf, David Garson

Main category: cs.LG

TL;DR: CTTVaE+TBS is a conditional transformer-based VAE for tabular data generation that addresses class imbalance through latent space restructuring and adaptive sampling to improve minority class representation and downstream utility.

Details

Motivation: Existing generative models for tabular data often fail to adequately handle severe class imbalance, either overlooking minority groups or producing samples that lack utility for downstream learning tasks, which is critical in domains where rare but high-impact events drive decision-making.

Method: CTTVaE combines a conditional transformer-based VAE with two key mechanisms: (1) a class-aware triplet margin loss that restructures the latent space for better intra-class compactness and inter-class separation, and (2) a training-by-sampling strategy that adaptively increases exposure to underrepresented groups.

Result: Across six real-world benchmarks, CTTVaE+TBS achieves the strongest downstream utility on minority classes, often surpassing models trained on original imbalanced data while maintaining competitive fidelity. It bridges the gap between interpolation-based sampling and deep generative methods.

Conclusion: By explicitly prioritizing downstream performance in rare categories, CTTVaE+TBS provides a robust and interpretable solution for conditional tabular data generation with direct applicability to industries like healthcare, fraud detection, and predictive maintenance where minority case improvements are critical.

Abstract: Generating synthetic tabular data under severe class imbalance is essential for domains where rare but high-impact events drive decision-making. However, most generative models either overlook minority groups or fail to produce samples that are useful for downstream learning. We introduce CTTVAE, a Conditional Transformer-based Tabular Variational Autoencoder equipped with two complementary mechanisms: (i) a class-aware triplet margin loss that restructures the latent space for sharper intra-class compactness and inter-class separation, and (ii) a training-by-sampling strategy that adaptively increases exposure to underrepresented groups. Together, these components form CTTVAE+TBS, a framework that consistently yields more representative and utility-aligned samples without destabilizing training. Across six real-world benchmarks, CTTVAE+TBS achieves the strongest downstream utility on minority classes, often surpassing models trained on the original imbalanced data while maintaining competitive fidelity and bridging the gap for privacy for interpolation-based sampling methods and deep generative methods. Ablation studies further confirm that both latent structuring and targeted sampling contribute to these gains. By explicitly prioritizing downstream performance in rare categories, CTTVAE+TBS provides a robust and interpretable solution for conditional tabular data generation, with direct applicability to industries such as healthcare, fraud detection, and predictive maintenance where even small gains in minority cases can be critical.

[672] Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity

Aneri Muni, Vincent Taboga, Esther Derman, Pierre-Luc Bacon, Erick Delage

Main category: cs.LG

TL;DR: Novel formulation of static CVaR in RL using state augmentation with dense rewards and contraction properties, enabling risk-averse value iteration and Q-learning algorithms with convergence guarantees.

Details

Motivation: Static CVaR is important for safety-critical applications to prevent catastrophic events, but lacks recursive Bellman decomposition in MDPs. Classical state augmentation approaches suffer from sparse rewards and degenerate fixed points.

Method: Proposes a novel formulation of static CVaR objective using state augmentation that leads to a Bellman operator with dense per-step rewards and contraction properties on bounded value functions. Develops risk-averse value iteration and model-free Q-learning algorithms with discretized augmented states.

Result: Theoretical foundation provides convergence guarantees and approximation error bounds. Empirical results show successful learning of CVaR-sensitive policies and effective performance-safety trade-offs.

Conclusion: The proposed formulation enables practical risk-averse RL algorithms for CVaR optimization with theoretical guarantees and empirical effectiveness.

Abstract: Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.

[673] Reinforcement Fine-Tuning for History-Aware Dense Retriever in RAG

Yicheng Zhang, Zhen Qin, Zhaomin Wu, Wenqi Zhang, Shuiguang Deng

Main category: cs.LG

TL;DR: Proposes a reinforcement learning approach to optimize retrievers in RAG systems by addressing deterministic retrieval incompatibility and state aliasing through stochastic sampling and retrieval history incorporation.

Details

Motivation: Existing retriever optimization methods suffer from objective mismatch with RAG pipeline goals, and deterministic retrieval is incompatible with RL formulations, while query-only retrieval causes state aliasing in multi-hop reasoning.

Method: Replace deterministic retrieval with stochastic sampling, formulate RAG as Markov decision process, incorporate retrieval history into state at each step to mitigate state aliasing, enabling RL-based retriever optimization.

Result: Extensive experiments across diverse RAG pipelines, datasets, and retriever scales demonstrate consistent improvements in RAG performance.

Conclusion: The proposed RL approach effectively addresses fundamental challenges in retriever optimization and improves RAG performance across various settings.

Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to produce evidence-based responses, and its performance hinges on the matching between the retriever and LLMs. Retriever optimization has emerged as an efficient alternative to fine-tuning LLMs. However, existing solutions suffer from objective mismatch between retriever optimization and the goal of RAG pipeline. Reinforcement learning (RL) provides a promising solution to address this limitation, yet applying RL to retriever optimization introduces two fundamental challenges: 1) the deterministic retrieval is incompatible with RL formulations, and 2) state aliasing arises from query-only retrieval in multi-hop reasoning. To address these challenges, we replace deterministic retrieval with stochastic sampling and formulate RAG as a Markov decision process, making retriever optimizable by RL. Further, we incorporate retrieval history into the state at each retrieval step to mitigate state aliasing. Extensive experiments across diverse RAG pipelines, datasets, and retriever scales demonstrate consistent improvements of our approach in RAG performance.

[674] Sequential Group Composition: A Window into the Mechanics of Deep Learning

Giovanni Luca Marchetti, Daniel Kunin, Adele Myers, Francisco Acosta, Nina Miolane

Main category: cs.LG

TL;DR: Neural networks learn sequential group composition tasks by decomposing them into irreducible group representations, with depth enabling efficient parallel computation through associativity.

Details

Motivation: To understand how neural networks acquire structured computation abilities (arithmetic, geometric, algorithmic) when trained on sequences, by studying a tractable mathematical task that isolates key learning mechanisms.

Method: Introduce sequential group composition task where networks predict cumulative product of group elements. Analyze two-layer networks learning via irreducible group representations determined by Fourier statistics, and compare with deeper architectures (RNNs and multilayer networks) exploiting associativity.

Result: Two-layer networks require exponential width in sequence length but can learn perfectly; deeper models dramatically improve scaling: RNNs compose sequentially in k steps, multilayer networks compose adjacent pairs in parallel in log k layers.

Conclusion: Sequential group composition provides a tractable framework to understand deep learning mechanics, showing how depth enables efficient structured computation through parallelization of associative operations.

Abstract: How do neural networks trained over sequences acquire the ability to perform structured operations, such as arithmetic, geometric, and algorithmic computation? To gain insight into this question, we introduce the sequential group composition task. In this task, networks receive a sequence of elements from a finite group encoded in a real vector space and must predict their cumulative product. The task can be order-sensitive and requires a nonlinear architecture to be learned. Our analysis isolates the roles of the group structure, encoding statistics, and sequence length in shaping learning. We prove that two-layer networks learn this task one irreducible representation of the group at a time in an order determined by the Fourier statistics of the encoding. These networks can perfectly learn the task, but doing so requires a hidden width exponential in the sequence length $k$. In contrast, we show how deeper models exploit the associativity of the task to dramatically improve this scaling: recurrent neural networks compose elements sequentially in $k$ steps, while multilayer networks compose adjacent pairs in parallel in $\log k$ layers. Overall, the sequential group composition task offers a tractable window into the mechanics of deep learning.

[675] Data-Driven Graph Filters via Adaptive Spectral Shaping

Dylan Sandfelder, Mihai Cucuringu, Xiaowen Dong

Main category: cs.LG

TL;DR: A framework for graph filtering that learns reusable spectral kernels modulated by Gaussian factors, enabling interpretable multi-peak spectral responses with efficient implementation and transfer learning capabilities.

Details

Motivation: Current graph filtering methods often lack interpretability, scalability, and cross-graph generalization capabilities. The authors aim to develop a framework that provides compact, interpretable spectral modules that can be efficiently implemented and transferred across different graphs.

Method: Proposes Adaptive Spectral Shaping (ASS) which learns a reusable baseline spectral kernel and modulates it with Gaussian factors to create multi-peak, multi-scale spectral responses. Uses Chebyshev polynomial expansions for efficient implementation without eigendecompositions. Extends to Transferable Adaptive Spectral Shaping (TASS) where baseline kernel is learned on source graphs and only shaping parameters are adapted on target graphs.

Result: ASS reduces reconstruction error compared to fixed-prototype wavelets and learned linear banks across synthetic benchmarks. TASS demonstrates consistent positive transfer in few-shot learning scenarios. The framework provides scalable, interpretable spectral modules that integrate with graph signal processing and graph neural networks.

Conclusion: The Adaptive Spectral Shaping framework successfully combines scalability, interpretability, and cross-graph generalization for graph filtering, offering a practical solution for spectral graph processing that can be efficiently transferred across different graph structures.

Abstract: We introduce Adaptive Spectral Shaping, a data-driven framework for graph filtering that learns a reusable baseline spectral kernel and modulates it with a small set of Gaussian factors. The resulting multi-peak, multi-scale responses allocate energy to heterogeneous regions of the Laplacian spectrum while remaining interpretable via explicit centers and bandwidths. To scale, we implement filters with Chebyshev polynomial expansions, avoiding eigendecompositions. We further propose Transferable Adaptive Spectral Shaping (TASS): the baseline kernel is learned on source graphs and, on a target graph, kept fixed while only the shaping parameters are adapted, enabling few-shot transfer under matched compute. Across controlled synthetic benchmarks spanning graph families and signal regimes, Adaptive Spectral Shaping reduces reconstruction error relative to fixed-prototype wavelets and learned linear banks, and TASS yields consistent positive transfer. The framework provides compact spectral modules that plug into graph signal processing pipelines and graph neural networks, combining scalability, interpretability, and cross-graph generalization.

[676] Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Anxin Guo, Jingwei Li

Main category: cs.LG

TL;DR: The paper formalizes LLM hallucination as a membership testing problem, showing that even with optimal training, hallucinations are an information-theoretically necessary consequence of lossy compression under limited capacity.

Details

Motivation: Large language models often hallucinate with high confidence on random facts that lack inferable patterns. The paper aims to understand why hallucinations persist even with optimal training and perfect data, moving beyond explanations like training data quality or model architecture.

Method: Formalizes memorization of facts as a membership testing problem, unifying discrete error metrics of Bloom filters with continuous log-loss of LLMs. Analyzes the problem in the regime where facts are sparse in the universe of plausible claims, establishing a rate-distortion theorem that characterizes optimal memory efficiency by minimum KL divergence between score distributions on facts and non-facts.

Result: The theoretical framework shows that the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. This is validated empirically on synthetic data, demonstrating that hallucinations persist as a natural consequence of lossy compression.

Conclusion: Hallucination in LLMs is not just a practical training issue but an information-theoretic necessity under capacity constraints. Even with optimal training and perfect data, the optimal strategy involves assigning high confidence to some non-facts, providing a fundamental explanation for why hallucinations occur.

Abstract: Large language models often hallucinate with high confidence on “random facts” that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified “closed world” setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

[677] Efficient Training of Boltzmann Generators Using Off-Policy Log-Dispersion Regularization

Henrik Schopmans, Christopher von Klitzing, Pascal Friederich

Main category: cs.LG

TL;DR: Off-policy log-dispersion regularization (LDR) improves data efficiency for Boltzmann generators by regularizing energy landscapes using target energy labels without requiring additional on-policy samples.

Details

Motivation: Boltzmann generators need data-efficient training because both simulation data and target energy evaluations are computationally expensive, limiting their practical application in physical systems.

Method: Proposes off-policy log-dispersion regularization (LDR), a regularization framework based on generalizing the log-variance objective. It uses target energy labels to regularize the energy landscape shape, works with biased/unbiased simulation datasets, and supports purely variational training without target samples.

Result: LDR improves both final performance and data efficiency across all benchmarks, achieving sample efficiency gains of up to one order of magnitude compared to standard approaches.

Conclusion: LDR provides an effective regularization framework for Boltzmann generators that significantly enhances data efficiency and performance, making these generative models more practical for physical system sampling applications.

Abstract: Sampling from unnormalized probability densities is a central challenge in computational science. Boltzmann generators are generative models that enable independent sampling from the Boltzmann distribution of physical systems at a given temperature. However, their practical success depends on data-efficient training, as both simulation data and target energy evaluations are costly. To this end, we propose off-policy log-dispersion regularization (LDR), a novel regularization framework that builds on a generalization of the log-variance objective. We apply LDR in the off-policy setting in combination with standard data-based training objectives, without requiring additional on-policy samples. LDR acts as a shape regularizer of the energy landscape by leveraging additional information in the form of target energy labels. The proposed regularization framework is broadly applicable, supporting unbiased or biased simulation datasets as well as purely variational training without access to target samples. Across all benchmarks, LDR improves both final performance and data efficiency, with sample efficiency gains of up to one order of magnitude.

[678] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen

Main category: cs.LG

TL;DR: VIP is a variance-informed predictive allocation strategy that optimizes rollout budget allocation in reinforcement learning with verifiable rewards to improve sampling efficiency.

Details

Motivation: Existing group-based policy optimization methods allocate fixed numbers of rollouts uniformly across all training prompts, treating them as equally informative. This uniform allocation leads to inefficient computational budget usage and impedes training progress by not considering which prompts provide more valuable information.

Method: VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine optimal rollout allocations under a hard compute budget constraint to minimize expected gradient variance of policy updates.

Result: Empirical results show VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.

Conclusion: VIP provides an effective variance-informed approach to optimize rollout budget allocation in reinforcement learning, leading to better sampling efficiency and performance compared to uniform allocation strategies.

Abstract: Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce VIP, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks.

[679] Enhancing Imbalanced Node Classification via Curriculum-Guided Feature Learning and Three-Stage Attention Network

Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang

Main category: cs.LG

TL;DR: CL3AN-GNN: A curriculum-guided graph neural network with three-stage attention (Engage, Enact, Embed) for imbalanced node classification, showing improved performance on diverse graph datasets.

Details

Motivation: Imbalanced node classification in GNNs causes unfair learning and poor performance on minority classes due to label skew. Current methods lack structured curriculum learning approaches that mimic human learning patterns.

Method: Three-stage attention network: 1) Engage stage focuses on structurally simpler features (local patterns, low-degree nodes, class-separable pairs), 2) Enact stage handles complex aspects (multi-hop connections, heterogeneous edges, minority class boundaries), 3) Embed stage consolidates features via iterative message passing with curriculum-aligned loss weighting.

Result: Consistent improvements across 8 Open Graph Benchmark datasets in accuracy, F1-score, and AUC over state-of-the-art methods. Shows faster convergence than end-to-end training, better generalization to new imbalanced graphs, and interpretable learning stages.

Conclusion: CL3AN-GNN provides a theoretically grounded curriculum learning framework for GNNs that effectively addresses label imbalance through structured attention mechanisms, validated by metrics, convergence speeds, and generalization tests.

Abstract: Imbalanced node classification in graph neural networks (GNNs) happens when some labels are much more common than others, which causes the model to learn unfairly and perform badly on the less common classes. To solve this problem, we propose a Curriculum-Guided Feature Learning and Three-Stage Attention Network (CL3AN-GNN), a learning network that uses a three-step attention system (Engage, Enact, Embed) similar to how humans learn. The model begins by engaging with structurally simpler features, defined as (1) local neighbourhood patterns (1-hop), (2) low-degree node attributes, and (3) class-separable node pairs identified via initial graph convolutional networks and graph attention networks (GCN and GAT) embeddings. This foundation enables stable early learning despite label skew. The Enact stage then addresses complicated aspects: (1) connections that require multiple steps, (2) edges that connect different types of nodes, and (3) nodes at the edges of minority classes by using adjustable attention weights. Finally, Embed consolidates these features via iterative message passing and curriculum-aligned loss weighting. We evaluate CL3AN-GNN on eight Open Graph Benchmark datasets spanning social, biological, and citation networks. Experiments show consistent improvements across all datasets in accuracy, F1-score, and AUC over recent state-of-the-art methods. The model’s step-by-step method works well with different types of graph datasets, showing quicker results than training everything at once, better performance on new, imbalanced graphs, and clear explanations of each step using gradient stability and attention correlation learning curves. This work provides both a theoretically grounded framework for curriculum learning in GNNs and practical evidence of its effectiveness against imbalances, validated through metrics, convergence speeds, and generalisation tests.

[680] Fast-MWEM: Private Data Release in Sublinear Time

Themistoklis Haris, Steve Choi, Mutiraj Laksanawisit

Main category: cs.LG

TL;DR: Accelerated MWEM framework reduces per-iteration runtime from Θ(m) to Θ(√m) for private data analysis using lazy sampling with Gumbel noise and k-NN data structures.

Details

Motivation: The Multiplicative Weights Exponential Mechanism (MWEM) is widely used for private data analysis but suffers from scalability issues due to Θ(m) time complexity per iteration, which limits its practical application for large-scale problems.

Method: Introduces a modification to MWEM using lazy sampling approach to Report-Noisy-Max mechanism, implemented efficiently with Gumbel noise and k-Nearest Neighbor data structures to avoid exhaustive linear scans.

Result: Achieves substantial runtime improvement from Θ(m) to Θ(√m) per iteration in expectation, with experimental validation showing significant speedup over classic MWEM for private linear query release and Linear Programming problems.

Conclusion: The accelerated MWEM framework provides a practical solution to scalability bottlenecks in private data analysis, enabling more efficient processing of large-scale linear query and constraint problems while maintaining privacy guarantees.

Abstract: The Multiplicative Weights Exponential Mechanism (MWEM) is a fundamental iterative framework for private data analysis, with broad applications such as answering $m$ linear queries, or privately solving systems of $m$ linear constraints. However, a critical bottleneck hindering its scalability is the $Θ(m)$ time complexity required to execute the exponential mechanism in each iteration. We introduce a modification to the MWEM framework that improves the per-iteration runtime dependency to $Θ(\sqrt{m})$ in expectation. This is done via a lazy sampling approach to the Report-Noisy-Max mechanism, which we implement efficiently using Gumbel noise and a $k$-Nearest Neighbor data structure. This allows for the rapid selection of the approximate score in the exponential mechanism without an exhaustive linear scan. We apply our accelerated framework to the problems of private linear query release and solving Linear Programs (LPs) under neighboring constraint conditions and low-sensitivity assumptions. Experimental evaluation confirms that our method provides a substantial runtime improvement over classic MWEM.

[681] PLATE: Plasticity-Tunable Efficient Adapters for Geometry-Aware Continual Learning

Romain Cosentino

Main category: cs.LG

TL;DR: PLATE: A continual learning method for pretrained models that requires no old-task data by exploiting geometric redundancy in neural networks to create protected update subspaces.

Details

Motivation: Addresses practical barrier in foundation model adaptation where pretraining distributions are often unavailable, enabling continual learning without access to old-task data.

Method: Exploits geometric redundancy in pretrained networks to construct protected update subspaces. Uses structured low-rank updates ΔW = B A Q⊤ where B and Q are computed from pretrained weights and frozen, and only A is trained on new tasks.

Result: Provides explicit control over plasticity-retention trade-off, reduces functional drift on old-data distribution, and offers improved worst-case retention guarantees without needing old-task data.

Conclusion: PLATE enables efficient continual learning for pretrained models by leveraging inherent geometric redundancy, offering a practical solution for adapting foundation models when old data is unavailable.

Abstract: We develop a continual learning method for pretrained models that \emph{requires no access to old-task data}, addressing a practical barrier in foundation model adaptation where pretraining distributions are often unavailable. Our key observation is that pretrained networks exhibit substantial \emph{geometric redundancy}, and that this redundancy can be exploited in two complementary ways. First, redundant neurons provide a proxy for dominant pretraining-era feature directions, enabling the construction of approximately protected update subspaces directly from pretrained weights. Second, redundancy offers a natural bias for \emph{where} to place plasticity: by restricting updates to a subset of redundant neurons and constraining the remaining degrees of freedom, we obtain update families with reduced functional drift on the old-data distribution and improved worst-case retention guarantees. These insights lead to \textsc{PLATE} (\textbf{Pla}sticity-\textbf{T}unable \textbf{E}fficient Adapters), a continual learning method requiring no past-task data that provides explicit control over the plasticity-retention trade-off. PLATE parameterizes each layer with a structured low-rank update $ΔW = B A Q^\top$, where $B$ and $Q$ are computed once from pretrained weights and kept frozen, and only $A$ is trained on the new task. The code is available at https://github.com/SalesforceAIResearch/PLATE.

[682] Soft Sensor for Bottom-Hole Pressure Estimation in Petroleum Wells Using Long Short-Term Memory and Transfer Learning

M. A. Fernandes, E. Gildin, M. A. Sampaio

Main category: cs.LG

TL;DR: A machine learning-based soft sensor using LSTM networks to estimate bottom-hole pressure in petroleum wells from wellhead measurements, achieving under 2% MAPE and demonstrating transfer learning across operational environments.

Details

Motivation: Permanent Downhole Gauges (PDGs) for monitoring bottom-hole variables in petroleum wells face reliability and cost issues, creating a need for alternative, cost-effective solutions for production optimization, safety, and emissions reduction.

Method: Proposes a Long Short-Term Memory (LSTM) model as a soft sensor to estimate flowing Bottom-Hole Pressure (BHP) using wellhead and topside measurements, compared against Multi-Layer Perceptron (MLP) and Ridge Regression baselines, and introduces Transfer Learning for model adaptation across different operational environments.

Result: Tested on real offshore datasets from Brazil’s Pre-salt basin, the methodology achieved Mean Absolute Percentage Error (MAPE) consistently below 2%, outperforming benchmark models and demonstrating successful transfer learning across operational conditions.

Conclusion: The work offers a cost-effective, accurate alternative to physical sensors for bottom-hole pressure monitoring with broad applicability across diverse reservoir and flow conditions, potentially improving petroleum production optimization.

Abstract: Monitoring bottom-hole variables in petroleum wells is essential for production optimization, safety, and emissions reduction. Permanent Downhole Gauges (PDGs) provide real-time pressure data but face reliability and cost issues. We propose a machine learning-based soft sensor to estimate flowing Bottom-Hole Pressure (BHP) using wellhead and topside measurements. A Long Short-Term Memory (LSTM) model is introduced and compared with Multi-Layer Perceptron (MLP) and Ridge Regression. We also pioneer Transfer Learning for adapting models across operational environments. Tested on real offshore datasets from Brazil’s Pre-salt basin, the methodology achieved Mean Absolute Percentage Error (MAPE) consistently below 2%, outperforming benchmarks. This work offers a cost-effective, accurate alternative to physical sensors, with broad applicability across diverse reservoir and flow conditions.

[683] Reasoning with Latent Tokens in Diffusion Language Models

Andre He, Sean Welleck, Daniel Fried

Main category: cs.LG

TL;DR: Discrete diffusion models outperform autoregressive models on reasoning tasks but are slower; this trade-off stems from joint prediction of all tokens including undecoded ones. Introducing latent tokens enables speed-quality tradeoff and improves autoregressive models on reasoning tasks.

Details

Motivation: Discrete diffusion models have shown competitive performance with autoregressive models for language modeling, even surpassing them on reasoning tasks requiring planning and global coherence. However, they suffer from slower inference times. The researchers aim to understand this trade-off and find ways to improve both diffusion and autoregressive models.

Method: The authors analyze the mechanism behind diffusion models’ performance, identifying that they jointly predict distributions over all unknown tokens (including undecoded ones). They ablate this joint prediction to study its effects, then introduce a method to modulate the number of latent tokens (undecoded tokens). They also apply latent tokens to autoregressive models through an auxiliary multi-token prediction objective.

Result: Joint prediction of undecoded tokens is crucial for diffusion models’ performance on reasoning tasks. Modulating latent token count enables smooth tradeoff between inference speed and sample quality. Introducing latent tokens into autoregressive models via auxiliary objectives yields substantial improvements on reasoning tasks where they traditionally struggle.

Conclusion: Latent tokens, while naturally arising in diffusion models, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead. This insight can be applied to both diffusion and autoregressive models to enhance their reasoning capabilities.

Abstract: Discrete diffusion models have recently become competitive with autoregressive models for language modeling, even outperforming them on reasoning tasks requiring planning and global coherence, but they require more computation at inference time. We trace this trade-off to a key mechanism: diffusion models are trained to jointly predict a distribution over all unknown tokens, including those that will not actually be decoded in the current step. Ablating this joint prediction yields faster inference but degrades performance, revealing that accurate prediction at the decoded position relies on joint reasoning about the distribution of undecoded tokens. We interpret these as latent tokens and introduce a method for modulating their number, demonstrating empirically that this enables a smooth tradeoff between inference speed and sample quality. Furthermore, we demonstrate that latent tokens can be introduced into autoregressive models through an auxiliary multi-token prediction objective, yielding substantial improvements on the same reasoning tasks where they have traditionally struggled. Our results suggest that latent tokens, while arising naturally in diffusion, represent a general mechanism for improving performance on tasks requiring global coherence or lookahead.

[684] Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

Ian Wu, Yuxiao Qu, Amrith Setlur, Aviral Kumar

Main category: cs.LG

TL;DR: RC is an iterative decoding algorithm that enables LLMs to continually improve beyond training budgets by constructing reasoning chains that get better across iterations, allowing extrapolation to much longer reasoning horizons.

Details

Motivation: Standard RL operates over fixed problem distributions and training budgets, limiting LLMs' ability to extrapolate and adapt to distribution shift at test time. The authors want LLMs that can continually improve beyond their training budgets.

Method: RC replaces standard autoregressive decoding during both training and inference. It exploits an asymmetry between response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations.

Result: Models trained with RC can extrapolate to reasoning horizons more than an order of magnitude longer than seen during training. A 4B model trained with RC improved from 40% to nearly 70% on HMMT 2025 with 0.5m tokens at test time, outperforming comparably sized and many larger reasoning LLMs.

Conclusion: RC enables LLMs to continually improve beyond training budgets through iterative decoding, achieving strong extrapolation capabilities and better leveraging of existing scaffolds for test-time performance scaling.

Abstract: Large Language Models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation amidst distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% with 0.5m tokens at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we also show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, due to the improved summary-conditioned generation abilities learned through training.

[685] Inference-time Unlearning Using Conformal Prediction

Somnath Basu Roy Chowdhury, Rahul Kidambi, Avinava Dubey, David Wang, Gokhan Mergen, Amr Ahmed, Aranyak Mehta

Main category: cs.LG

TL;DR: A framework for inference-time machine unlearning using verifiers and conformal prediction to remove information without updating model parameters.

Details

Motivation: Existing unlearning methods degrade model capabilities and have unrealistic assumptions, especially for generative models. Need for approaches that preserve pre-trained knowledge while removing specific information.

Method: Inference-time unlearning with verifiers that judge responses for unlearning guarantees. Uses iterative refinement with verifier feedback and conformal prediction for computational efficiency and distribution-free guarantees.

Result: Outperforms state-of-the-art methods, reducing unlearning error by up to 93% across challenging benchmarks.

Conclusion: Inference-time unlearning with verifiers provides effective information removal while preserving model capabilities, with formal guarantees via conformal prediction.

Abstract: Machine unlearning is the process of efficiently removing specific information from a trained machine learning model without retraining from scratch. Existing unlearning methods, which often provide provable guarantees, typically involve retraining a subset of model parameters based on a forget set. While these approaches show promise in certain scenarios, their underlying assumptions are often challenged in real-world applications – particularly when applied to generative models. Furthermore, updating parameters using these unlearning procedures often degrades the general-purpose capabilities the model acquired during pre-training. Motivated by these shortcomings, this paper considers the paradigm of inference time unlearning – wherein, the generative model is equipped with an (approximately correct) verifier that judges whether the model’s response satisfies appropriate unlearning guarantees. This paper introduces a framework that iteratively refines the quality of the generated responses using feedback from the verifier without updating the model parameters. The proposed framework leverages conformal prediction to reduce computational overhead and provide distribution-free unlearning guarantees. This paper’s approach significantly outperforms existing state-of-the-art methods, reducing unlearning error by up to 93% across challenging unlearning benchmarks.

Bogdan Kulynych, Theresa Stadler, Jean Louis Raisaro, Carmela Troncoso

Main category: cs.LG

TL;DR: Synthetic data has limitations across three key use cases: privacy-preserving data sharing, ML training augmentation, and statistical variance reduction; many envisioned applications are poor fits due to fundamental constraints.

Details

Motivation: To critically examine whether synthetic data truly solves problems around data access, scarcity, and under-representation, as many assume it's a universal solution.

Method: Formal analysis and case studies of three prominent use cases: (1) privacy-preserving data sharing, (2) ML training augmentation, and (3) statistical variance reduction.

Result: Identifies fundamental and practical limits constraining synthetic data’s effectiveness; reveals many existing or envisioned use cases are poor problem fits.

Conclusion: Synthetic data is not a universal solution; decision makers need to carefully assess whether it’s suitable for their specific data availability problem based on formalized criteria.

Abstract: Recent advances in generative modelling have led many to see synthetic data as the go-to solution for a range of problems around data access, scarcity, and under-representation. In this paper, we study three prominent use cases: (1) Sharing synthetic data as a proxy for proprietary datasets to enable statistical analyses while protecting privacy, (2) Augmenting machine learning training sets with synthetic data to improve model performance, and (3) Augmenting datasets with synthetic data to reduce variance in statistical estimation. For each use case, we formalise the problem setting and study, through formal analysis and case studies, under which conditions synthetic data can achieve its intended objectives. We identify fundamental and practical limits that constrain when synthetic data can serve as an effective solution for a particular problem. Our analysis reveals that due to these limits many existing or envisioned use cases of synthetic data are a poor problem fit. Our formalisations and classification of synthetic data use cases enable decision makers to assess whether synthetic data is a suitable approach for their specific data availability problem.

[687] Manifold Random Features

Ananya Parashar, Derek Long, Dwaipayan Saha, Krzysztof Choromanski

Main category: cs.LG

TL;DR: MRFs: A new method for approximating kernels on manifolds using graph discretization and random features, providing positive bounded features for accurate approximation.

Details

Motivation: To develop a general method for approximating bi-variate functions (kernels) on manifolds where analytical solutions are often unavailable, leveraging graph-based approaches to handle complex manifold structures.

Method: Uses manifold discretization and Graph Random Features (GRFs) to learn continuous fields on manifolds, creating Manifold Random Features (MRFs) that provide positive bounded features for kernel approximation.

Result: Shows deep asymptotic connections between discrete GRFs and continuous random features, re-discovers Gaussian kernel approximation mechanisms, and provides rigorous theoretical analysis with experimental verification.

Conclusion: MRFs offer a powerful new paradigm for kernel approximation on manifolds with positive bounded features, connecting discrete graph methods with continuous kernel theory.

Abstract: We present a new paradigm for creating random features to approximate bi-variate functions (in particular, kernels) defined on general manifolds. This new mechanism of Manifold Random Features (MRFs) leverages discretization of the manifold and the recently introduced technique of Graph Random Features (GRFs) to learn continuous fields on manifolds. Those fields are used to find continuous approximation mechanisms that otherwise, in general scenarios, cannot be derived analytically. MRFs provide positive and bounded features, a key property for accurate, low-variance approximation. We show deep asymptotic connection between GRFs, defined on discrete graph objects, and continuous random features used for regular kernels. As a by-product of our method, we re-discover recently introduced mechanism of Gaussian kernel approximation applied in particular to improve linear-attention Transformers, considering simple random walks on graphs and by-passing original complex mathematical computations. We complement our algorithm with a rigorous theoretical analysis and verify in thorough experimental studies.

[688] Prediction of Critical Heat Flux in Rod Bundles Using Tube-Based Hybrid Machine Learning Models in CTF

Aidan Furlong, Robert Salko, Xingang Zhao, Xu Wu

Main category: cs.LG

TL;DR: Machine learning models for critical heat flux prediction trained on tube data generalize well to rod bundle geometries, outperforming traditional correlations and lookup tables.

Details

Motivation: Previous ML models for CHF prediction were developed for tube geometries, but reactor core simulations require rod bundle geometries with complex thermal hydraulic phenomena. The study investigates whether tube-trained ML models can generalize to rod bundles.

Method: Implemented a purely data-driven DNN and two hybrid bias-correction models in CTF subchannel code. Tested on Combustion Engineering 5-by-5 bundle CHF test series, comparing against W-3 correlation, Bowring correlation, and Groeneveld LUT as baselines.

Result: All three ML-based approaches produced more accurate CHF magnitude and location predictions than baseline models. The hybrid LUT model showed the most favorable performance metrics.

Conclusion: ML models trained on tube-based CHF data can successfully generalize to rod bundle geometries, providing more accurate predictions than conventional approaches for reactor core simulations.

Abstract: The prediction of critical heat flux (CHF) using machine learning (ML) approaches has become a highly active research activity in recent years, the goal of which is to build models more accurate than current conventional approaches such as empirical correlations or lookup tables (LUTs). Previous work developed and deployed tube-based pure and hybrid ML models in the CTF subchannel code, however, full-scale reactor core simulations require the use of rod bundle geometries. Unlike isolated subchannels, rod bundles experience complex thermal hydraulic phenomena such as channel crossflow, spacer grid losses, and effects from unheated conductors. This study investigates the generalization of ML-based CHF prediction models in rod bundles after being trained on tube-based CHF data. A purely data-driven DNN and two hybrid bias-correction models were implemented in the CTF subchannel code and used to predict CHF location and magnitude in the Combustion Engineering 5-by-5 bundle CHF test series. The W-3 correlation, Bowring correlation, and Groeneveld LUT were used as baseline comparators. On average, all three ML-based approaches produced magnitude and location predictions more accurate than the baseline models, with the hybrid LUT model exhibiting the most favorable performance metrics.

[689] SymPlex: A Structure-Aware Transformer for Symbolic PDE Solving

Yesom Park, Annie C. Lu, Shao-Ching Huang, Qiyang Hu, Y. Sungtaek Ju, Stanley Osher

Main category: cs.LG

TL;DR: SymPlex is a reinforcement learning framework that discovers analytical symbolic solutions to PDEs using tree-structured decision-making and a structure-aware Transformer called SymFormer.

Details

Motivation: Current numerical and neural approaches to solving PDEs only provide approximations in discretized or implicit function spaces, lacking interpretability and human-readable solutions. There's a need for methods that can discover exact symbolic solutions that naturally represent non-smooth behavior and explicit parametric dependence.

Method: SymPlex formulates symbolic PDE solving as tree-structured decision-making using reinforcement learning. It employs SymFormer, a structure-aware Transformer with tree-relative self-attention and grammar-constrained autoregressive decoding to ensure syntactic validity. The framework optimizes candidate solutions using only the PDE and its boundary conditions without ground-truth expressions.

Result: Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods, showing the framework’s ability to discover interpretable, human-readable solutions.

Conclusion: SymPlex enables direct operation in symbolic expression space, overcoming limitations of sequence-based generators and providing interpretable solutions that naturally represent non-smooth behavior and explicit parametric dependence in PDEs.

Abstract: We propose SymPlex, a reinforcement learning framework for discovering analytical symbolic solutions to partial differential equations (PDEs) without access to ground-truth expressions. SymPlex formulates symbolic PDE solving as tree-structured decision-making and optimizes candidate solutions using only the PDE and its boundary conditions. At its core is SymFormer, a structure-aware Transformer that models hierarchical symbolic dependencies via tree-relative self-attention and enforces syntactic validity through grammar-constrained autoregressive decoding, overcoming the limited expressivity of sequence-based generators. Unlike numerical and neural approaches that approximate solutions in discretized or implicit function spaces, SymPlex operates directly in symbolic expression space, enabling interpretable and human-readable solutions that naturally represent non-smooth behavior and explicit parametric dependence. Empirical results demonstrate exact recovery of non-smooth and parametric PDE solutions using deep learning-based symbolic methods.

[690] Robust Intervention Learning from Emergency Stop Interventions

Ethan Pronovost, Khimya Khetarpal, Siddhartha Srinivasa

Main category: cs.LG

TL;DR: RIFT: Residual Intervention Fine-Tuning for robust learning from noisy human interventions in autonomous systems by combining intervention signals with prior policies.

Details

Motivation: Human interventions during autonomous system testing provide valuable but often noisy/incomplete signals about policy improvement needs. Current approaches struggle when avoiding interventions is necessary but not sufficient for good performance.

Method: Proposes Residual Intervention Fine-Tuning (RIFT), a residual fine-tuning algorithm that treats intervention feedback as incomplete learning signal and explicitly combines it with a prior policy to resolve ambiguity when intervention signals under-specify the task.

Result: Theoretical analysis characterizes conditions for principled policy improvement and identifies failure regimes. Experiments show RIFT enables robust and consistent policy improvement across various intervention strategies and prior policy qualities.

Conclusion: Residual fine-tuning enables robust intervention learning, presenting a promising direction for future work in learning from noisy human intervention data in autonomous systems.

Abstract: Human interventions are a common source of data in autonomous systems during testing. These interventions provide an important signal about where the current policy needs improvement, but are often noisy and incomplete. We define Robust Intervention Learning (RIL) as the problem of learning from intervention data while remaining robust to the quality and informativeness of the intervention signal. In the best case, interventions are precise and avoiding them is sufficient to solve the task, but in many realistic settings avoiding interventions is necessary but not sufficient for achieving good performance. We study robust intervention learning in the context of emergency stop interventions and propose Residual Intervention Fine-Tuning (RIFT), a residual fine-tuning algorithm that treats intervention feedback as an incomplete learning signal and explicitly combines it with a prior policy. By framing intervention learning as a fine-tuning problem, our approach leverages structure encoded in the prior policy to resolve ambiguity when intervention signals under-specify the task. We provide theoretical analysis characterizing conditions under which this formulation yields principled policy improvement, and identify regimes where intervention learning is expected to fail. Our experiments reveal that residual fine-tuning enables robust and consistent policy improvement across a range of intervention strategies and prior policy qualities, and highlight robust intervention learning as a promising direction for future work.

[691] Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Erfan Miahi, Eugene Belilovsky

Main category: cs.LG

TL;DR: PULSE enables efficient decentralized RL training by exploiting high sparsity in LLM weight updates, transmitting only modified parameters to reduce communication bandwidth by 100x while maintaining training performance.

Details

Motivation: Distributed RL training for large language models faces scalability bottlenecks due to bandwidth constraints when synchronizing policy weights between trainers and inference workers, especially over commodity networks or in decentralized settings.

Method: Systematic empirical study of weight-update sparsity at step-level and multi-step granularities, followed by PULSE (Patch Updates via Lossless Sparse Encoding) - a lossless weight synchronization method that transmits only indices and values of modified parameters.

Result: Update sparsity consistently exceeds 99% across practical settings. PULSE achieves over 100x communication reduction (14 GB to ~108 MB) while maintaining bit-identical training dynamics and performance compared to full weight synchronization.

Conclusion: By exploiting the high sparsity of RL weight updates, PULSE enables decentralized RL training to approach centralized throughput, reducing bandwidth requirements from 20 Gbit/s to 0.2 Gbit/s while maintaining high GPU utilization.

Abstract: Reinforcement learning (RL) is a critical component for post-training large language models (LLMs). However, in bandwidth-constrained distributed RL, scalability is often bottlenecked by the synchronization of policy weights from trainers to inference workers, particularly over commodity networks or in decentralized settings. While recent studies suggest that RL updates modify only a small fraction of model parameters, these observations are typically based on coarse checkpoint differences. We present a systematic empirical study of weight-update sparsity at both step-level and multi-step granularities, examining its evolution across training dynamics, off-policy delay, and model scale. We find that update sparsity is consistently high, frequently exceeding 99% across practically relevant settings. Leveraging this structure, we propose PULSE (Patch Updates via Lossless Sparse Encoding), a simple yet highly efficient lossless weight synchronization method that transmits only the indices and values of modified parameters. PULSE is robust to transmission errors and avoids floating-point drift inherent in additive delta schemes. In bandwidth-constrained decentralized environments, our approach achieves over 100x (14 GB to ~108 MB) communication reduction while maintaining bit-identical training dynamics and performance compared to full weight synchronization. By exploiting this structure, PULSE enables decentralized RL training to approach centralized throughput, reducing the bandwidth required for weight synchronization from 20 Gbit/s to 0.2 Gbit/s to maintain high GPU utilization.

[692] Understanding Representation Dynamics of Diffusion Models via Low-Dimensional Modeling

Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu

Main category: cs.LG

TL;DR: Diffusion models show unimodal representation dynamics where feature quality peaks at intermediate noise levels, which theoretically emerges when models capture true data distribution and empirically correlates with generalization vs memorization.

Details

Motivation: To understand the intriguing phenomenon of unimodal representation dynamics in diffusion models, where learned feature quality peaks at intermediate noise levels rather than monotonically improving with less noise.

Method: Combined theoretical analysis leveraging low-dimensional structure of image data with empirical validation through classification tasks, examining the interplay between denoising strength and class confidence across noise scales.

Result: Theoretically proved unimodal dynamics emerge when diffusion models successfully capture underlying data distribution. Empirically showed unimodal dynamics reliably reflect generalization: present when generating novel images, transitioning to monotonically decreasing curve when memorizing training data.

Conclusion: Unimodal representation dynamics in diffusion models serve as a reliable indicator of model generalization capability, with the phenomenon emerging from successful data distribution capture and providing insights into the model’s learning behavior.

Abstract: Diffusion models, though originally designed for generative tasks, have demonstrated impressive self-supervised representation learning capabilities. A particularly intriguing phenomenon in these models is the emergence of unimodal representation dynamics, where the quality of learned features peaks at an intermediate noise level. In this work, we conduct a comprehensive theoretical and empirical investigation of this phenomenon. Leveraging the inherent low-dimensionality structure of image data, we theoretically demonstrate that the unimodal dynamic emerges when the diffusion model successfully captures the underlying data distribution. The unimodality arises from an interplay between denoising strength and class confidence across noise scales. Empirically, we further show that, in classification tasks, the presence of unimodal dynamics reliably reflects the generalization of the diffusion model: it emerges when the model generates novel images and gradually transitions to a monotonically decreasing curve as the model begins to memorize the training data.

[693] Accurate and Efficient World Modeling with Masked Latent Transformers

Maxime Burchi, Radu Timofte

Main category: cs.LG

TL;DR: EMERALD: Efficient MaskEd latent tRAnsformer worLD model improves Dreamer’s world modeling by using spatial latent states with MaskGIT predictions for more accurate trajectory generation, achieving SOTA on Crafter benchmark.

Details

Motivation: Dreamer algorithm's compressed latent space loses crucial information affecting agent performance. Existing solutions like Δ-IRIS and DIAMOND train accurate world models but require training from pixels, reducing efficiency and preventing agents from benefiting from world model's inner representations.

Method: Proposes EMERALD: Efficient MaskEd latent tRAnsformer worLD model using spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space. Combines transformer architecture with masked prediction for efficient world modeling.

Result: On Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming first method to surpass human experts within 10M environment steps. Successfully unlocks all 22 Crafter achievements at least once during evaluation.

Conclusion: EMERALD provides an accurate and efficient alternative to world modeling that improves agent performance by better preserving information in latent space while maintaining training efficiency.

Abstract: The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model’s latent space can result in the loss of crucial information, negatively affecting the agent’s performance. Recent approaches, such as $Δ$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation.

[694] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization

Xin Ding, Yun Chen, Yongwei Wang, Kao Zhang, Sen Zhang, Peibei Cao, Xiangxue Wang

Main category: cs.LG

TL;DR: CcGAN-AVAR enhances continuous conditional GANs with adaptive vicinity and multi-task discriminator to handle data imbalance, achieving SOTA quality with 30x-2000x faster inference than diffusion models.

Details

Motivation: Existing continuous conditional generative models (CcGAN and CCDM) have limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. There's a need for a method that combines high-quality generation with efficient inference.

Method: Proposes CcGAN-AVAR with two key components: (1) adaptive vicinity mechanism that dynamically adjusts vicinity size to handle data imbalance, and (2) multi-task discriminator that enhances generator training through auxiliary regression and density ratio estimation. Maintains GAN’s native one-step generator for fast inference.

Result: Extensive experiments on four benchmark datasets (64x64 to 256x256 resolution) across eleven challenging settings demonstrate state-of-the-art generation quality while achieving 30x-2000x faster inference than CCDM.

Conclusion: CcGAN-AVAR successfully addresses limitations of previous continuous conditional generative models by handling data imbalance while maintaining efficient one-step generation, making it practical for real-world applications requiring fast inference.

Abstract: Recent advances in conditional generative modeling have introduced Continuous conditional Generative Adversarial Network (CcGAN) and Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. To address these issues, we propose CcGAN-AVAR, an enhanced CcGAN framework featuring (1) two novel components for handling data imbalance - an adaptive vicinity mechanism that dynamically adjusts vicinity size and a multi-task discriminator that enhances generator training through auxiliary regression and density ratio estimation - and (2) the GAN framework’s native one-step generator, enable 30x-2000x faster inference than CCDM. Extensive experiments on four benchmark datasets (64x64 to 256x256 resolution) across eleven challenging settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.

[695] Embedding Compression via Spherical Coordinates

Han Xiao

Main category: cs.LG

TL;DR: Novel compression method for unit-norm embeddings achieves 1.5× compression by exploiting geometric properties of high-dimensional spherical coordinates and IEEE 754 floating-point representation.

Details

Motivation: Need efficient compression for unit-norm embeddings used in various applications (text, image, multi-vector) while maintaining high precision and lossless reconstruction.

Method: Exploits that spherical coordinates of high-dimensional unit vectors concentrate around π/2, causing IEEE 754 exponents to collapse to a single value and high-order mantissa bits to become predictable, enabling entropy coding of both components.

Result: Achieves 1.5× compression (25% better than best prior lossless method), reconstruction error below 1e-7 (under float32 machine epsilon), consistent improvement across 26 configurations spanning text, image, and multi-vector embeddings.

Conclusion: The method provides efficient, high-precision compression for unit-norm embeddings by leveraging geometric properties and floating-point representation characteristics, with broad applicability across different embedding types.

Abstract: We present a compression method for unit-norm embeddings that achieves 1.5$\times$ compression, 25% better than the best prior lossless method. The method exploits that spherical coordinates of high-dimensional unit vectors concentrate around $π/2$, causing IEEE 754 exponents to collapse to a single value and high-order mantissa bits to become predictable, enabling entropy coding of both. Reconstruction error is below 1e-7, under float32 machine epsilon. Evaluation across 26 configurations spanning text, image, and multi-vector embeddings confirms consistent improvement.

[696] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations

Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

Main category: cs.LG

TL;DR: MiTA attention: A new efficient attention mechanism that compresses the N-width MLP view of attention using landmark queries and top-k activated key-value pairs, forming deformable experts for improved efficiency on long sequences.

Details

Motivation: Standard attention scales poorly with sequence length due to the quadratic complexity of the N-width MLP view of attention. The paper aims to develop efficient attention mechanisms that maintain expressive capacity while reducing computational cost for long sequences.

Method: Proposes a “compress-and-route” strategy: (1) compresses the N-width MLP into a narrower one using landmark queries, (2) constructs deformable experts by gathering top-k activated key-value pairs for each landmark query, and (3) implements this as Mixture of Top-k Activations (MiTA) attention.

Result: Preliminary experiments on vision tasks show promise for MiTA attention, demonstrating its potential for efficient attention in long-sequence scenarios.

Conclusion: MiTA attention provides a novel efficient attention mechanism through compression and routing strategies, with preliminary results encouraging further optimization and broader applications in challenging settings.

Abstract: The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length N. As the context extends, the expressive capacity of such an N-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the N-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-k activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-k Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.

[697] FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning

Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang, Yi Zhong

Main category: cs.LG

TL;DR: FlyPrompt is a brain-inspired continual learning framework that addresses general continual learning without task boundaries by using sparse expansion and modular ensembles inspired by fruit fly memory systems.

Details

Motivation: Current continual parameter-efficient tuning methods rely on multiple training epochs and explicit task cues, limiting their effectiveness in general continual learning scenarios with single-pass, non-stationary data streams without clear task boundaries.

Method: FlyPrompt decomposes GCL into expert routing and expert competence improvement, using a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time.

Result: Achieves up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200 datasets respectively.

Conclusion: FlyPrompt provides an effective brain-inspired solution for general continual learning that addresses fundamental challenges in continual parameter-efficient tuning through biologically-inspired sparse expansion and modular ensemble mechanisms.

Abstract: General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly’s hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt’s superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively. Our source code is available at https://github.com/AnAppleCore/FlyGCL.

[698] On the Convergence of Experience Replay in Policy Optimization: Characterizing Bias, Variance, and Finite-Time Convergence

Hua Zheng, Wei Xie, M. Ben Feng

Main category: cs.LG

TL;DR: Theoretical analysis of experience replay in policy gradient methods, quantifying bias-variance trade-offs and providing convergence guarantees based on buffer size, sample correlation, and mixing time.

Details

Motivation: Experience replay is widely used in deep reinforcement learning but lacks theoretical understanding beyond empirical heuristics. The paper aims to develop a formal theoretical framework to understand the benefits and trade-offs of experience replay in policy gradient methods, particularly addressing the challenges posed by Markovian correlations and policy drift.

Method: Develops a novel theoretical framework using auxiliary Markov chains and lag-based decoupling techniques to handle dependencies from Markovian correlations and policy drift. Derives finite-time bias bounds for policy-gradient estimators under replay, provides correlation-aware variance decomposition, and establishes convergence guarantees for experience-replay-based policy optimization.

Result: The analysis reveals how bias scales with cumulative policy update, mixing time, and data age; shows how sample dependence governs gradient variance; and establishes finite-time convergence guarantees that explicitly quantify how buffer size, sample correlation, and mixing jointly determine convergence rate. Identifies an inherent bias-variance trade-off where larger buffers reduce variance but increase bias from stale data.

Conclusion: The theoretical framework provides principled guidance for buffer sizing and replay schedules in policy optimization, bridging empirical findings with quantitative theory and offering formal understanding of experience replay’s benefits and limitations in deep reinforcement learning.

Abstract: Experience replay is a core ingredient of modern deep reinforcement learning, yet its benefits in policy optimization are poorly understood beyond empirical heuristics. This paper develops a novel theoretical framework for experience replay in modern policy gradient methods, where two sources of dependence fundamentally complicate analysis: Markovian correlations along trajectories and policy drift across optimization iterations. We introduce a new proof technique based on auxiliary Markov chains and lag-based decoupling that makes these dependencies tractable. Within this framework, we derive finite-time bias bounds for policy-gradient estimators under replay, identifying how bias scales with the cumulative policy update, the mixing time of the underlying dynamics, and the age of buffered data, thereby formalizing the practitioner’s rule of avoiding overly stale replay. We further provide a correlation-aware variance decomposition showing how sample dependence governs gradient variance from replay and when replay is beneficial. Building on these characterizations, we establish the finite-time convergence guarantees for experience-replay-based policy optimization, explicitly quantifying how buffer size, sample correlation, and mixing jointly determine the convergence rate and revealing an inherent bias-variance trade-off: larger buffers can reduce variance by averaging less correlated samples but can increase bias as data become stale. These results offer a principled guide for buffer sizing and replay schedules, bridging prior empirical findings with quantitative theory.

[699] Exact Solution to Data-Driven Inverse Optimization of MILPs in Finite Time via Gradient-Based Methods

Akira Kitaoka

Main category: cs.LG

TL;DR: A method for solving data-driven inverse optimization problems for mixed integer linear programs using gradient-based optimization with finite convergence guarantees.

Details

Motivation: Data-driven inverse optimization problems for MILPs face challenges due to discontinuous prediction loss functions, making gradient-based optimization difficult to apply effectively.

Method: Focuses on a Lipschitz continuous and convex suboptimality loss, exploits its convex piecewise-linear structure and interiority of minimum set to show that gradient-based methods like projected subgradient descent achieve finite convergence.

Result: Proves that projected subgradient descent reaches minimum suboptimality loss in finite iterations, exactly solving DDIOP for MILPs, and also attains minimum prediction loss on features in finite iterations with derived upper bounds.

Conclusion: Provides theoretical guarantees for finite convergence of gradient-based methods in solving inverse optimization problems for MILPs, addressing the discontinuity challenge through suboptimality loss.

Abstract: A data-driven inverse optimization problem (DDIOP) seeks to estimate an objective function (i.e., weights) that is consistent with observed optimal-solution data, and is important in many applications, including those involving mixed integer linear programs (MILPs). In the DDIOP for MILPs, the prediction loss on features (PLF), defined as the discrepancy between observed and predicted feature values, becomes discontinuous with respect to the weights, which makes it difficult to apply gradient-based optimization. To address this issue, we focus on a Lipschitz continuous and convex suboptimality loss. By exploiting its convex and piecewise-linear structure and the interiority of the minimum set, we show that a broad class of gradient-based optimization methods, including projected subgradient descent (PSGD), reaches the minimum suboptimality loss value in a finite number of iterations, thereby exactly solving the DDIOP for MILPs. Furthermore, as a corollary, we show that PSGD attains the minimum PLF in finitely many iterations. We also derive an upper bound on the number of iterations required for PSGD to reach finite convergence, and confirm the finite-step behavior through numerical experiments.

[700] Conformal Prediction for Causal Effects of Continuous Treatments

Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Heß, Valentyn Melnychuk, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Novel conformal prediction method for potential outcomes of continuous treatments with unknown propensity scores, providing finite-sample guarantees and practical algorithm.

Details

Motivation: Uncertainty quantification of causal effects is crucial for safety-critical applications like personalized medicine, but existing conformal prediction methods are limited to binary/discrete treatments and require known propensity scores.

Method: Derives finite-sample prediction intervals for potential outcomes of continuous treatments, accounts for uncertainty from propensity estimation, and provides algorithm for calculating intervals.

Result: Demonstrates effectiveness of conformal prediction intervals on synthetic and real-world datasets, providing valid intervals even when propensity scores are unknown and must be estimated.

Conclusion: First conformal prediction method for continuous treatments with unknown propensity scores, enabling reliable uncertainty quantification for causal effects in practical applications.

Abstract: Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and real-world datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.

[701] Hyper-Compression: Model Compression via Hyperfunction

Fenglei Fan, Juntong Fan, Dayang Wang, Jingbo Zhang, Zelin Dong, Shijun Zhang, Ge Wang, Tieyong Zeng

Main category: cs.LG

TL;DR: Hyper-Compression: A novel model compression method using dynamic systems as hyperfunctions to represent network parameters via composition numbers or trajectory lengths, achieving high compression ratios without retraining.

Details

Motivation: The rapid growth of model sizes has outpaced computing resources, creating a need for efficient compression methods. Inspired by the parsimonious relationship between genotype and phenotype in brain development, the authors seek a fundamentally different approach from existing compression techniques.

Method: Proposes Hyper-Compression that uses low-dimensional dynamic systems as hyperfunctions to represent network parameters. The method identifies suitable dynamic systems with irrational winding as hyperfunctions, derives theoretical error bounds, and adds engineering twists for practicality. Parameters are represented by composition numbers or trajectory lengths rather than traditional compression approaches.

Result: Achieves close-to-int4-quantization performance on LLaMA2-7B with less than 1% performance drop, compresses in under an hour without retraining. Demonstrates PNAS merits: Preferable compression ratio, No post-hoc retraining, Affordable inference time, and Short compression time. Tested on NLP models (LLaMA, Qwen series) and vision models.

Conclusion: Hyper-Compression offers a novel, effective compression paradigm fundamentally different from pruning, quantization, distillation, and decomposition. It provides practical compression with minimal performance degradation and no retraining requirements.

Abstract: The rapid growth of large models’ size has far outpaced that of computing resources. To bridge this gap, encouraged by the parsimonious relationship between genotype and phenotype in the brain’s growth and development, we propose the so-called Hyper-Compression that turns the model compression into the issue of parameter representation via a hyperfunction. Specifically, it is known that the trajectory of some low-dimensional dynamic systems can fill the high-dimensional space eventually. Thus, Hyper-Compression, using these dynamic systems as the hyperfunctions, represents the parameters of the target network by their corresponding composition number or trajectory length. This suggests a novel mechanism for model compression, substantially different from the existing pruning, quantization, distillation, and decomposition. Along this direction, we methodologically identify a suitable dynamic system with the irrational winding as the hyperfunction and theoretically derive its associated error bound. Next, guided by our theoretical insights, we propose several engineering twists to make the Hyper-Compression pragmatic and effective. Lastly, systematic and comprehensive experiments on \textcolor{black}{NLP models such as LLaMA and Qwen series and vision models} confirm that Hyper-Compression enjoys the following \textbf{PNAS} merits: 1) \textbf{P}referable compression ratio; 2) \textbf{N}o post-hoc retraining; 3) \textbf{A}ffordable inference time; and 4) \textbf{S}hort compression time. It compresses LLaMA2-7B in an hour and achieves close-to-int4-quantization performance, without retraining and with a performance drop of less than 1%. We have open-sourced our code in https://github.com/Juntongkuki/Hyper-Compression.git for free download and evaluation.

[702] Fast Training of Sinusoidal Neural Fields via Scaling Initialization

Taesun Yeom, Sangyoon Lee, Jaeho Lee

Main category: cs.LG

TL;DR: Weight scaling initialization accelerates sinusoidal neural fields training by 10x through better spectral bias resolution and optimization conditioning

Details

Motivation: Neural fields have high training costs that limit adoption; current initialization schemes for sinusoidal neural fields are suboptimal for training speed

Method: Proposes weight scaling - multiplying each weight (except last layer) by a constant - to accelerate SNF training. Conducts theoretical and empirical analyses to understand why it works

Result: 10x training speedup across various data domains, making SNFs train faster than more recently proposed architectures

Conclusion: Simple weight scaling initialization significantly improves SNF training efficiency by addressing spectral bias and improving optimization trajectory

Abstract: Neural fields are an emerging paradigm that represent data as continuous functions parameterized by neural networks. Despite many advantages, neural fields often have a high training cost, which prevents a broader adoption. In this paper, we focus on a popular family of neural fields, called sinusoidal neural fields (SNFs), and study how it should be initialized to maximize the training speed. We find that the standard initialization scheme for SNFs – designed based on the signal propagation principle – is suboptimal. In particular, we show that by simply multiplying each weight (except for the last layer) by a constant, we can accelerate SNF training by 10$\times$. This method, coined $\textit{weight scaling}$, consistently provides a significant speedup over various data domains, allowing the SNFs to train faster than more recently proposed architectures. To understand why the weight scaling works well, we conduct extensive theoretical and empirical analyses which reveal that the weight scaling not only resolves the spectral bias quite effectively but also enjoys a well-conditioned optimization trajectory. The code is available $\href{https://github.com/effl-lab/Fast-Neural-Fields}{here}$.

[703] Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

Seunghan Lee, Taeyoung Park, Kibok Lee

Main category: cs.LG

TL;DR: Proposes partial channel dependence (PCD) with channel masks to improve channel dependency modeling in Transformer-based time series models by incorporating dataset-specific information.

Details

Motivation: Existing attention-based methods for multivariate time series primarily focus on architectural modifications while neglecting dataset-specific characteristics, which limits their ability to capture meaningful channel dependencies.

Method: Introduces channel masks (CMs) integrated into Transformer attention matrices via element-wise multiplication. CMs consist of similarity matrices capturing channel relationships and dataset-specific learnable domain parameters that refine the similarity matrix.

Result: Validates effectiveness of PCD across diverse tasks and datasets with various backbones, showing improved channel dependency modeling.

Conclusion: Partial channel dependence with channel masks enhances Transformer-based time series models by incorporating dataset-specific information to better capture channel dependencies.

Abstract: Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. However, previous efforts have primarily Capturing channel dependency (CD) is essential for modeling multivariate time series (TS), and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.

[704] GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, Yong Wang

Main category: cs.LG

TL;DR: Group Policy Gradient (GPG) is a minimalist RL approach that directly optimizes the original RL objective for enhancing LLM reasoning, eliminating complex components like critics, reference models, and KL constraints.

Details

Motivation: Traditional RL methods for LLMs often rely on complex components like surrogate loss functions, critics, reference models, and KL divergence constraints, which complicate training and may introduce biases. The authors aim to simplify RL training for LLMs while maintaining or improving performance.

Method: Proposes Group Policy Gradient (GPG) that directly optimizes the original RL objective without surrogate losses. Eliminates critic and reference models, avoids KL divergence constraints, and addresses advantage and gradient estimation bias. Simplifies training compared to methods like GRPO.

Result: GPG reduces computational costs and consistently outperforms GRPO across various unimodal and multimodal tasks without relying on auxiliary techniques or adjustments.

Conclusion: GPG provides a simpler yet more effective RL approach for enhancing LLM reasoning capabilities, demonstrating that direct optimization of the original RL objective can yield superior performance with reduced complexity.

Abstract: Reinforcement Learning (RL) can directly enhance the reasoning capabilities of large language models without extensive reliance on Supervised Fine-Tuning (SFT). In this work, we revisit the traditional Policy Gradient (PG) mechanism and propose a minimalist RL approach termed Group Policy Gradient (GPG). Unlike conventional methods, GPG directly optimize the original RL objective, thus obviating the need for surrogate loss functions. By eliminating the critic and reference models, avoiding KL divergence constraints, and addressing the advantage and gradient estimation bias, our approach significantly simplifies the training process compared to Group Relative Policy Optimization (GRPO). Our approach achieves superior performance without relying on auxiliary techniques or adjustments. As illustrated in Figure 1, extensive experiments demonstrate that our method not only reduces computational costs but also consistently outperforms GRPO across various unimodal and multimodal tasks. Our code is available at https://github.com/AMAP-ML/GPG.

[705] Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors

Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun

Main category: cs.LG

TL;DR: Preference Vector framework enables modular, user-controllable alignment of LLMs by training separate models on individual preferences, extracting behavior shifts as vectors, and dynamically merging them at test time.

Details

Motivation: Existing LLM alignment approaches (RLHF, DPO) suffer from performance conflicts, limited controllability, and poor extendability when balancing helpfulness and harmlessness trade-offs.

Method: Train separate models on individual preferences, extract behavior shifts as preference vectors using task arithmetic principles, and dynamically merge vectors at test time for fine-grained control.

Result: Framework improves helpfulness without excessive conservatism, enables smooth control over preference trade-offs, and supports scalable multi-preference alignment.

Conclusion: Preference Vector provides a modular, extensible approach to LLM alignment that addresses limitations of existing methods through dynamic preference vector merging.

Abstract: Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.

[706] Discrete Latent Structure in Neural Networks

Vlad Niculae, Caio F. Corro, Nikita Nangia, Tsvetomila Mihaylova, André F. T. Martins

Main category: cs.LG

TL;DR: Survey paper on learning discrete latent structures (trees, sequences, matchings) using three main strategies: continuous relaxation, surrogate gradients, and probabilistic estimation.

Details

Motivation: Many data types (NLP, CV, bioinformatics) have discrete compositional structures, and latent structure models can extract these representations to incorporate structural bias, gain insights, and interpret decisions, but training is challenging due to neural networks' continuous nature.

Method: Analyzes three broad strategies: 1) Continuous relaxation (making discrete variables continuous), 2) Surrogate gradients (approximating gradients for discrete operations), and 3) Probabilistic estimation (using probability distributions). Uses consistent notation to reveal connections between approaches.

Result: Shows that most latent structure learning strategies consist of the same fundamental building blocks used differently, leading to different applicability and properties. Provides unified framework for understanding discrete latent structure learning.

Conclusion: The paper provides a comprehensive survey and unified perspective on learning with discrete latent structures, revealing common patterns across different approaches and offering guidance for selecting appropriate methods based on application needs.

Abstract: Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.

[707] Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao

Main category: cs.LG

TL;DR: A lightweight, interpretable transformer-like network for traffic forecasting using unrolled optimization with mixed graphs for spatial and temporal modeling.

Details

Motivation: To create a more interpretable and lightweight alternative to conventional "black-box" transformers for traffic forecasting that captures both spatial and temporal dependencies efficiently.

Method: Constructs two graphs: undirected for spatial correlations and directed for temporal relationships. Uses ℓ₂ and ℓ₁-norm variational terms to promote signal smoothness on directed graphs. Unrolls an ADMM-based iterative algorithm into a feed-forward network with periodic graph learning modules that serve as self-attention.

Result: Achieves competitive traffic forecast performance compared to state-of-the-art methods while drastically reducing parameter counts.

Conclusion: The unrolled optimization approach provides an interpretable, lightweight transformer-like architecture that maintains competitive performance for spatio-temporal forecasting tasks.

Abstract: Unlike conventional “black-box” transformers with classical self-attention mechanism, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is “smooth” with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve competitive traffic forecast performance as state-of-the-art prediction schemes, while reducing parameter counts drastically.

[708] Contextual Causal Bayesian Optimisation

Vahan Arsenyan, Antoine Grosnit, Haitham Bou-Ammar, Arnak Dalalyan

Main category: cs.LG

TL;DR: A unified framework for contextual and causal Bayesian optimization that designs intervention policies to maximize target variable expectations using both observed context and causal graph structures.

Details

Motivation: To address limitations in existing approaches by unifying Causal Bayesian Optimization and Contextual Bayesian Optimization, enabling more effective optimization in high-dimensional settings with both contextual information and causal knowledge.

Method: Proposes a novel algorithm that jointly optimizes over intervention policies and the sets of variables on which these policies are defined, leveraging both observed contextual information and known causal graph structures.

Result: The approach achieves sublinear regret and reduces sample complexity in high-dimensional settings, with experimental results across diverse environments confirming effectiveness.

Conclusion: The framework successfully unifies and extends previous approaches to Bayesian optimization, providing theoretical guarantees and practical improvements for optimization in causal and contextual settings.

Abstract: We introduce a unified framework for contextual and causal Bayesian optimisation, which aims to design intervention policies maximising the expectation of a target variable. Our approach leverages both observed contextual information and known causal graph structures to guide the search. Within this framework, we propose a novel algorithm that jointly optimises over policies and the sets of variables on which these policies are defined. This thereby extends and unifies two previously distinct approaches: Causal Bayesian Optimisation and Contextual Bayesian Optimisation, while also addressing their limitations in scenarios that yield suboptimal results. We derive worst-case and instance-dependent high-probability regret bounds for our algorithm. We report experimental results across diverse environments, corroborating that our approach achieves sublinear regret and reduces sample complexity in high-dimensional settings.

[709] Sparse maximal update parameterization: A holistic approach to sparse training dynamics

Nolan Dey, Shane Bergsma, Joel Hestness

Main category: cs.LG

TL;DR: SμPar is a parameterization method that enables stable training of sparse neural networks by ensuring signal propagation scales independently of sparsity level and allowing hyperparameter transfer from dense to sparse models.

Details

Motivation: Sparse neural networks face challenges in competing with dense models due to impaired signal propagation when weights are zeroed, and prohibitive hyperparameter tuning costs when testing multiple sparsity levels. Current practice of reusing dense model hyperparameters is suboptimal since sparse and dense networks have different optimal hyperparameters.

Method: SμPar is a holistic approach for random unstructured static sparsity that ensures activations, gradients, and weight updates all scale independently of sparsity level. It reparameterizes hyperparameters so the same values remain optimal across varying sparsity levels and model widths, enabling hyperparameter tuning on small dense networks and transfer to large sparse models.

Result: On large-scale language modeling, SμPar shows increasing improvements over standard parameterization as sparsity increases, achieving up to 11.9% relative loss improvement at 99.2% sparsity.

Conclusion: SμPar provides an effective training recipe for sparse networks that reduces tuning costs and enables stable training across sparsity levels, making sparse models more competitive with dense counterparts.

Abstract: Several challenges make it difficult for sparse neural networks to compete with dense models. First, setting a large fraction of weights to zero impairs forward and gradient signal propagation. Second, sparse studies often need to test multiple sparsity levels, while also introducing new hyperparameters (HPs), leading to prohibitive tuning costs. Indeed, the standard practice is to re-use the learning HPs originally crafted for dense models. Unfortunately, we show sparse and dense networks do not share the same optimal HPs. Without stable dynamics and effective training recipes, it is costly to test sparsity at scale, which is key to surpassing dense networks and making the business case for sparsity acceleration in hardware. A holistic approach is needed to tackle these challenges and we propose S$μ$Par as one such approach. For random unstructured static sparsity, S$μ$Par ensures activations, gradients, and weight updates all scale independently of sparsity level. Further, by reparameterizing the HPs, S$μ$Par enables the same HP values to be optimal as we vary both sparsity level and model width. HPs can be tuned on small dense networks and transferred to large sparse models, greatly reducing tuning costs. On large-scale language modeling, S$μ$Par shows increasing improvements over standard parameterization as sparsity increases, leading up to 11.9% relative loss improvement at 99.2% sparsity. A minimal implementation of S$μ$Par is available at https://github.com/EleutherAI/nanoGPT-mup/tree/supar.

[710] ME-IGM: Individual-Global-Max in Maximum Entropy Multi-Agent Reinforcement Learning

Wen-Tse Chen, Yuxuan Li, Shiyu Huang, Jiayu Chen, Jeff Schneider

Main category: cs.LG

TL;DR: ME-IGM: A maximum entropy MARL algorithm that addresses misalignment between local policies and joint policy through order-preserving transformation, maintaining IGM condition while benefiting from maximum entropy exploration.

Details

Motivation: Existing maximum entropy MARL methods suffer from misalignment between local policies and the joint policy that maximizes global Q-value, violating the IGM condition crucial for multi-agent credit assignment.

Method: Proposes ME-IGM algorithm with order-preserving transformation to align local policies with joint policy while maintaining IGM condition. Two variants: ME-QMIX and ME-QPLEX.

Result: Demonstrates state-of-the-art performance in non-monotonic matrix games and across 17 scenarios in SMAC-v2 and Overcooked benchmarks.

Conclusion: ME-IGM successfully addresses the misalignment problem in maximum entropy MARL while preserving IGM condition, enabling effective credit assignment with enhanced exploration.

Abstract: Multi-agent credit assignment is a fundamental challenge for cooperative multi-agent reinforcement learning (MARL), where a team of agents learn from shared reward signals. The Individual-Global-Max (IGM) condition is a widely used principle for multi-agent credit assignment, requiring that the joint action determined by individual Q-functions maximizes the global Q-value. Meanwhile, the principle of maximum entropy has been leveraged to enhance exploration in MARL. However, we identify a critical limitation in existing maximum entropy MARL methods: a misalignment arises between local policies and the joint policy that maximizes the global Q-value, leading to violations of the IGM condition. To address this misalignment, we propose an order-preserving transformation. Building on it, we introduce ME-IGM, a novel maximum entropy MARL algorithm compatible with any credit assignment mechanism that satisfies the IGM condition while enjoying the benefits of maximum entropy exploration. We empirically evaluate two variants of ME-IGM: ME-QMIX and ME-QPLEX, in non-monotonic matrix games, and demonstrate their state-of-the-art performance across 17 scenarios in SMAC-v2 and Overcooked.

[711] NOBLE – Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar

Main category: cs.LG

TL;DR: NOBLE is a neural operator framework that learns to map neuron features to voltage responses, enabling efficient generation of bio-realistic neuron models with experimental variability.

Details

Motivation: Current bio-realistic neuron modeling approaches are limited by scarce experimental data and inability to capture natural variability, while deep learning methods fail to capture full biophysical complexity and nonlinear voltage dynamics.

Method: NOBLE uses a neural operator framework that learns a mapping from continuous frequency-modulated embeddings of interpretable neuron features to somatic voltage responses induced by current injection, trained on synthetic data from bio-realistic models.

Result: NOBLE predicts distributions of neural dynamics accounting for experimental variability, enables efficient generation of synthetic neurons resembling experimental data with trial-to-trial variability, and offers 4200× speedup over numerical solvers.

Conclusion: NOBLE is the first scaled-up deep learning framework validated with real experimental data, capturing fundamental neural properties in an emergent manner that advances understanding of cellular composition, neuromorphic architectures, and neuroAI applications.

Abstract: Characterizing the cellular properties of neurons is fundamental to understanding their function in the brain. In this quest, the generation of bio-realistic models is central towards integrating multimodal cellular data sets and establishing causal relationships. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. The deterministic formalism of bio-realistic models currently precludes accounting for the natural variability observed experimentally. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on synthetic data generated from bio-realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE enables the efficient generation of synthetic neurons that closely resemble experimental data and exhibit trial-to-trial variability, offering a $4200\times$ speedup over the numerical solver. NOBLE is the first scaled-up deep learning framework that validates its generalization with real experimental data. To this end, NOBLE captures fundamental neural properties in a unique and emergent manner that opens the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.

[712] Individual Regret in Cooperative Stochastic Multi-Armed Bandits

Idan Barnea, Tal Lancewicki, Yishay Mansour

Main category: cs.LG

TL;DR: Cooperative multi-agent bandit algorithm with communication over arbitrary graphs achieves individual regret independent of graph diameter

Details

Motivation: Study regret in cooperative multi-armed bandits with multiple agents communicating over arbitrary connected graphs, addressing limitations of prior work that depended on graph diameter

Method: Analyze COOP-SE (Cooperative Successive Elimination) algorithm for cooperative stochastic MAB with communication constraints including message size and communication rounds

Result: Achieve individual regret bound O(R/m + A² + A√log T) independent of graph diameter, with additional results for logarithmic message size and communication rounds

Conclusion: First work to show graph-diameter-independent individual regret in cooperative stochastic MAB, with practical communication constraints addressed

Abstract: We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyzed a variant of Cooperative Successive Elimination algorithm, COOP-SE, and show an individual regret bound of $O(R/ m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $R = \sum_{Δ_i > 0}\log(T)/Δ_i$ is the optimal single agent regret, where $Δ_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph’s diameter. When considering communication networks there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for logarithmic number of communication rounds, we obtain a regret bound of $O(R / m+A \log T)$.

[713] Sensitivity analysis of image classification models using generalized polynomial chaos

Lukas Bahr, Lucas Poßner, Konstantin Weise, Sophie Gröger, Rüdiger Daub

Main category: cs.LG

TL;DR: Proposes using Sobol indices via generalized polynomial chaos to quantify impact of domain shifts on image classification model outputs for predictive quality applications.

Details

Motivation: ML models in image classification face uncertainties from model, data, and domain shifts, leading to overconfidence. Need better understanding of model sensitivity to input variations for predictive quality applications in production.

Method: Model distributional domain shifts of inputs with random variables and quantify their impact on model outputs using Sobol indices computed via generalized polynomial chaos (GPC). Validated through welding defect classification case study with fine-tuned ResNet18 and emblem classification models from BMW production.

Result: The approach was validated through case studies showing it can quantify the impact of domain shifts on classification model outputs, helping understand model sensitivity in production environments.

Conclusion: Sensitivity analysis using Sobol indices via GPC provides valuable insights into how domain shifts affect image classification models, improving understanding and reliability of predictive quality systems in production.

Abstract: Integrating advanced communication protocols in production has accelerated the adoption of data-driven predictive quality methods, notably machine learning (ML) models. However, ML models in image classification often face significant uncertainties arising from model, data, and domain shifts. These uncertainties lead to overconfidence in the classification model’s output. To better understand these models, sensitivity analysis can help to analyze the relative influence of input parameters on the output. This work investigates the sensitivity of image classification models used for predictive quality. We propose modeling the distributional domain shifts of inputs with random variables and quantifying their impact on the model’s outputs using Sobol indices computed via generalized polynomial chaos (GPC). This approach is validated through a case study involving a welding defect classification problem, utilizing a fine-tuned ResNet18 model and an emblem classification model used in BMW Group production facilities.

[714] Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

Anxin Guo, Aravindan Vijayaraghavan

Main category: cs.LG

TL;DR: A polynomial-time statistical query (SQ) algorithm achieves constant factor approximation for learning arbitrarily-biased ReLU neurons over Gaussian marginals, showing separation between SQ and correlational SQ (CSQ) algorithms where gradient descent fails.

Details

Motivation: Despite ReLU being the basic building block of modern neural networks, we don't understand whether an arbitrary ReLU neuron is learnable in non-realizable settings. Existing polynomial-time algorithms only work for unbiased or restricted bias settings, leaving the general case open.

Method: Developed a polynomial-time statistical query (SQ) algorithm that outputs a ReLU activation achieving loss O(OPT) + ε in time poly(d,1/ε). The algorithm departs from gradient descent approaches, which are correlational statistical query (CSQ) algorithms.

Result: The algorithm provides the first constant factor approximation for arbitrary bias ReLU neurons. Complemented by showing no polynomial-time CSQ algorithm can achieve constant factor approximation, revealing intrinsic limitations of gradient descent.

Conclusion: This work identifies arguably the simplest setting (single neuron) where there’s a separation between SQ and CSQ algorithms, shedding light on gradient descent’s limitations while providing the first efficient algorithm for learning arbitrary-bias ReLU neurons.

Abstract: We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether one arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.

[715] Deep Graph Learning will stall without Network Science

Christopher Blöcker, Martin Rosvall, Ingo Scholtes, Jevin D. West

Main category: cs.LG

TL;DR: Position paper arguing that deep graph learning needs insights from network science to avoid stagnation, proposing six Calls for Action to integrate network science principles into deep graph learning.

Details

Motivation: Deep graph learning prioritizes empirical performance but ignores fundamental insights from network science, risking stagnation without incorporating network science's organizational principles and explicit assumptions about complex systems.

Method: Position paper methodology - identifies the gap between deep graph learning and network science, formulates six Calls for Action to leverage network science insights for addressing current issues in deep graph learning.

Result: Proposes six specific Calls for Action to integrate network science insights into deep graph learning, providing a framework for bridging the two fields and ensuring continued progress in graph-structured data modeling.

Conclusion: Deep graph learning will stall without insights from network science; integrating network science principles through the proposed Calls for Action is essential for the field’s continued advancement and better understanding of graph-structured patterns.

Abstract: Deep graph learning focuses on flexible and generalizable models that learn patterns in an automated fashion. Network science focuses on models and measures revealing the organizational principles of complex systems with explicit assumptions. Both fields share the same goal: to better model and understand patterns in graph-structured data. However, deep graph learning prioritizes empirical performance but ignores fundamental insights from network science. Our position is that deep graph learning will stall without insights from network science. In this position paper, we formulate six Calls for Action to leverage untapped insights from network science to address current issues in deep graph learning, ensuring the field continues to make progress.

[716] OverThink: Slowdown Attacks on Reasoning LLMs

Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, Eugene Bagdasarian

Main category: cs.LG

TL;DR: OverThink attack forces reasoning language models to spend excessive tokens on decoy reasoning problems injected into public content, increasing latency and costs while evading safety filters.

Details

Motivation: Reasoning chains in language models increase token usage, latency, and costs. The paper aims to exploit this by creating attacks that force models to process excessive reasoning tokens through benign decoy problems.

Method: Inject decoy reasoning problems (e.g., Markov decision processes, Sudokus) into public content consumed by RLMs at inference time. These decoys are benign to evade safety filters. Also explore multi-modal attacks by creating images that cause excessive reasoning.

Result: OverThink successfully increases reasoning token usage across closed-source and open-source models on FreshQA, SQuAD, and MuSR datasets. The slowdown transfers across models. Multi-modal attacks via images also work effectively.

Conclusion: The attack exposes vulnerabilities in reasoning models’ efficiency, with societal, financial, and energy implications. Both LLM-based and systems-level defenses are explored, highlighting the need for robust protections against such attacks.

Abstract: Most flagship language models generate explicit reasoning chains, enabling inference-time scaling. However, producing these reasoning chains increases token usage (i.e., reasoning tokens), which in turn increases latency and costs. Our OverThink attack increases overhead for applications that rely on reasoning language models (RLMs) and external context by forcing them to spend substantially more reasoning tokens while still producing contextually correct answers. An adversary mounts an attack by injecting decoy reasoning problems into public content that is consumed by RLM at inference time. Because our decoys (e.g., Markov decision processes, Sudokus, etc.) are benign, they evade safety filters. We evaluate OverThink on both closed-source and open-source reasoning models across the FreshQA, SQuAD, and MuSR datasets. We also explore the attack in multi-modal settings by creating images that cause excessive reasoning. We show that the resulting slowdown transfers across models. Finally, we explore both LLM-based and systems-level defenses, and discuss the societal, financial, and energy implications of the OverThink attacks.

[717] Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration

Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Hanhui Li, Yiwei Wang, Xiaodan Liang, Jing Tang

Main category: cs.LG

TL;DR: RLVR (Reinforcement Learning with Verifiable Reward) for reasoning in LLMs suffers from depth neglect (ignoring hard problems) and limited breadth (batch size). DARS addresses depth by adaptively sampling hard problems with targeted rollouts, while large-breadth training improves Pass@1 by reducing gradient noise. Combined as DARS-B, they yield gains in both Pass@K and Pass@1.

Details

Motivation: Current RLVR methods like GRPO have systematic bias: they disproportionately weight medium-accuracy samples while down-weighting low-accuracy instances crucial for pushing reasoning boundaries. This depth neglect limits the hardest problems models can solve. Additionally, limited breadth (batch size) restricts training efficiency and performance.

Method: 1) DARS (Difficulty Adaptive Rollout Sampling): Re-weights hard problems through targeted multi-stage rollouts to increase positive rollouts for hard problems. 2) Large-breadth training: Aggressively scales batch size and replaces PPO’s mini-batch iterations with full-batch updates over multiple epochs. 3) DARS-B: Combines DARS with large-breadth training.

Result: DARS delivers consistent Pass@K gains without extra inference cost. Large-breadth training significantly enhances Pass@1 performance and sustains high token-level entropy (continued exploration). DARS-B demonstrates simultaneous gains in both Pass@K and Pass@1, showing breadth and adaptive depth exploration are orthogonal dimensions in RLVR.

Conclusion: Breadth (batch size scaling) and adaptive exploration across depth (targeted sampling of hard problems) are orthogonal dimensions key to unleashing reasoning power in RLVR. The combination addresses systematic biases in current methods and enables better reasoning capabilities in LLMs.

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO’s mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.

[718] MetaSym: A Symplectic Meta-learning Framework for Physical Intelligence

Pranav Vaidhyanathan, Aristotelis Papatheodorou, Mark T. Mitchison, Natalia Ares, Ioannis Havoutis

Main category: cs.LG

TL;DR: MetaSym is a deep learning framework that incorporates symplectic geometry to preserve physical invariants while enabling few-shot adaptation across diverse physical systems.

Details

Motivation: Physics-aware deep learning faces challenges in scalability and generalizability across diverse domains. Symplectic forms are central to physical systems as they underpin fundamental invariants like energy and momentum, but existing methods struggle to maintain these invariants while adapting to system heterogeneities.

Method: MetaSym combines a symplectic encoder with strong inductive bias for preserving physical invariants, and an autoregressive decoder with meta-attention for flexible, data-efficient adaptation to system heterogeneities.

Result: MetaSym achieves superior few-shot adaptation across varied datasets including high-dimensional spring-mesh systems, open quantum systems with dissipation, and robotics-inspired quadrotor dynamics. It demonstrates robustness to sensor noise and real-world uncertainty when fine-tuned on real-world quadrotor data, outperforming larger state-of-the-art models.

Conclusion: MetaSym provides a principled deep learning framework that preserves core physical invariants through symplectic geometry while enabling efficient adaptation to diverse physical systems, demonstrating strong performance on realistic and varied physics tasks.

Abstract: Scalable and generalizable physics-aware deep learning has long been considered a significant challenge with various applications across diverse domains ranging from robotics to molecular dynamics. Central to almost all physical systems are symplectic forms, the geometric backbone that underpins fundamental invariants like energy and momentum. In this work, we introduce a novel deep learning framework, MetaSym. In particular, MetaSym combines a strong symplectic inductive bias obtained from a symplectic encoder, and an autoregressive decoder with meta-attention. This principled design ensures that core physical invariants remain intact, while allowing flexible, data efficient adaptation to system heterogeneities. We benchmark MetaSym with highly varied and realistic datasets, such as a high-dimensional spring-mesh system Otness et al. (2021), an open quantum system with dissipation and measurement backaction, and robotics-inspired quadrotor dynamics. Crucially, we fine-tune and deploy MetaSym on real-world quadrotor data, demonstrating robustness to sensor noise and real-world uncertainty. Across all tasks, MetaSym achieves superior few-shot adaptation and outperforms larger state-of-the-art (SOTA) models.

[719] Latent Space Representation of Electricity Market Curves: Maintaining Structural Integrity

Martin Výboh, Zuzana Chladná, Gabriela Grmanová, Mária Lucká

Main category: cs.LG

TL;DR: Evaluation of dimensionality reduction methods (PCA, Kernel PCA, UMAP, AutoEncoder) for energy market supply/demand curves with isotonic regression post-processing to enforce economic monotonicity constraints.

Details

Motivation: Energy market analysis requires efficient representation of supply/demand curves, but existing dimensionality reduction methods often violate fundamental economic principles like monotonicity, limiting their practical utility.

Method: Compare PCA, Kernel PCA, UMAP, and AutoEncoder across 2D/3D latent spaces with preprocessing to unify structure and mitigate outliers. Use isotonic regression as optional post-processing to enforce monotonic constraints on reconstructed outputs.

Result: UMAP consistently outperforms other methods across multiple error metrics on three-year hourly MIBEL dataset. Isotonic regression significantly reduces error and restores physical validity for several methods.

Conclusion: UMAP’s local structure preservation combined with isotonic regression post-processing provides robust foundation for downstream tasks like forecasting, classification, and clustering in energy markets.

Abstract: Efficiently representing supply and demand curves is vital for energy market analysis and downstream modelling; however, dimensionality reduction often produces reconstructions that violate fundamental economic principles such as monotonicity. This paper evaluates the performance of PCA, Kernel PCA, UMAP, and AutoEncoder across 2d and 3d latent spaces. During preprocessing, we transform the original data to achieve a unified structure, mitigate outlier effects, and focus on critical curve segments. To ensure theoretical validity, we integrate Isotonic Regression as an optional post-processing step to enforce monotonic constraints on reconstructed outputs. Results from a three-year hourly MIBEL dataset demonstrate that the non-linear technique UMAP consistently outperforms other methods, securing the top rank across multiple error metrics. Furthermore, Isotonic Regression serves as a crucial corrective layer, significantly reducing error and restoring physical validity for several methods. We argue that UMAP`s local structure preservation, combined with intelligent post-processing, provides a robust foundation for downstream tasks such as forecasting, classification, and clustering.

[720] An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras

Main category: cs.LG

TL;DR: A comprehensive tutorial reviewing how deep networks naturally learn low-rank structures in weights and representations, with theoretical perspectives on optimization dynamics and implicit regularization, plus practical applications like LoRA and parameter-efficient training.

Details

Motivation: The computational demands of large-scale deep learning require efficient training and deployment methods. Recent findings show deep networks inherently develop low-rank structures during training, which can be exploited for efficiency.

Method: The paper provides a tutorial review of advances in identifying and exploiting low-rank structures. It presents two theoretical perspectives: 1) viewing low-rankness through gradient descent optimization dynamics during training, and 2) understanding it as implicit regularization effects at convergence.

Result: The theoretical perspectives provide foundations for understanding practical techniques like Low-Rank Adaptation (LoRA) for fine-tuning, inspire new parameter-efficient low-rank training strategies, and explain the effectiveness of masked training approaches like dropout and masked self-supervised learning.

Conclusion: Understanding and exploiting inherent low-rank structures in deep networks offers promising pathways for more efficient training and deployment of large-scale models, with both theoretical foundations and practical applications.

Abstract: The substantial computational demands of modern large-scale deep learning present significant challenges for efficient training and deployment. Recent research has revealed a widespread phenomenon wherein deep networks inherently learn low-rank structures in their weights and representations during training. This tutorial paper provides a comprehensive review of advances in identifying and exploiting these low-rank structures, bridging mathematical foundations with practical applications. We present two complementary theoretical perspectives on the emergence of low-rankness: viewing it through the optimization dynamics of gradient descent throughout training, and understanding it as a result of implicit regularization effects at convergence. Practically, these theoretical perspectives provide a foundation for understanding the success of techniques such as Low-Rank Adaptation (LoRA) in fine-tuning, inspire new parameter-efficient low-rank training strategies, and explain the effectiveness of masked training approaches like dropout and masked self-supervised learning.

[721] Exploring the Global-to-Local Attention Scheme in Graph Transformers: An Empirical Study

Gang Wu, Zhengwei Wang

Main category: cs.LG

TL;DR: G2LFormer is a Graph Transformer with global-to-local attention scheme where shallow layers use attention for global information and deeper layers use GNNs for local structure, preventing neighborhood information loss.

Details

Motivation: Existing Graph Transformers integrate GNNs with attention mechanisms (local-and-global or local-to-global), but these may suffer from information loss where local neighborhood information learned by GNNs gets diluted by attention mechanisms that focus on long-range dependencies.

Method: Proposes G2LFormer with global-to-local attention: shallow layers use attention mechanisms to capture global information, deeper layers employ GNN modules to learn local structural information. Includes cross-layer information fusion strategy to allow local layers to retain beneficial information from global layers while maintaining linear complexity.

Result: G2LFormer exhibits excellent performance on both node-level and graph-level tasks while keeping linear complexity, outperforming state-of-the-art linear Graph Transformers and GNNs.

Conclusion: The global-to-local attention scheme is effective for graph representation learning, preventing nodes from ignoring immediate neighbors while capturing global dependencies, with acceptable scalability trade-offs.

Abstract: Graph Transformers (GTs) show considerable potential in graph representation learning. The architecture of GTs typically integrates Graph Neural Networks (GNNs) with global attention mechanisms either in parallel or as a precursor to attention mechanisms, yielding a local-and-global or local-to-global attention scheme. However, as the global attention mechanism primarily captures long-range dependencies between nodes, these integration schemes may suffer from information loss, where the local neighborhood information learned by GNN could be diluted by the attention mechanism. Therefore, we propose G2LFormer, featuring a novel global-to-local attention scheme where the shallow network layers use attention mechanisms to capture global information, while the deeper layers employ GNN modules to learn local structural information, thereby preventing nodes from ignoring their immediate neighbors. An effective cross-layer information fusion strategy is introduced to allow local layers to retain beneficial information from global layers and alleviate information loss, with acceptable trade-offs in scalability. To validate the feasibility of the global-to-local attention scheme, we compare G2LFormer with state-of-the-art linear GTs and GNNs on node-level and graph-level tasks. The results indicate that G2LFormer exhibits excellent performance while keeping linear complexity.

[722] Neural Thermodynamics: Entropic Forces in Deep and Universal Representation Learning

Liu Ziyin, Yizhou Xu, Isaac Chuang

Main category: cs.LG

TL;DR: The paper proposes an entropic-force theory to explain learning dynamics in neural networks, showing how emergent entropic forces from SGD break symmetries and lead to gradient balance phenomena that explain representation alignment and optimization behaviors.

Details

Motivation: There's an urgent need to understand emergent phenomena in deep learning and large language models, particularly the causes behind representation learning dynamics and optimization behaviors that currently lack rigorous theoretical explanations.

Method: Develops a rigorous entropic-force theory based on parameter symmetries and entropic loss landscapes, analyzing how stochastic gradient descent and its variants create emergent entropic forces that systematically break continuous parameter symmetries while preserving discrete ones.

Result: The theory explains universal alignment of neural representations between AI models (proving the Platonic Representation Hypothesis) and reconciles contradictory observations of sharpness- and flatness-seeking optimization behaviors through gradient balance phenomena resembling thermal system equipartition.

Conclusion: A combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning, providing a unified theoretical framework for representation learning dynamics in neural networks.

Abstract: With the rapid discovery of emergent phenomena in deep learning and large language models, understanding their cause has become an urgent need. Here, we propose a rigorous entropic-force theory for understanding the learning dynamics of neural networks trained with stochastic gradient descent (SGD) and its variants. Building on the theory of parameter symmetries and an entropic loss landscape, we show that representation learning is crucially governed by emergent entropic forces arising from stochasticity and discrete-time updates. These forces systematically break continuous parameter symmetries and preserve discrete ones, leading to a series of gradient balance phenomena that resemble the equipartition property of thermal systems. These phenomena, in turn, (a) explain the universal alignment of neural representations between AI models and lead to a proof of the Platonic Representation Hypothesis, and (b) reconcile the seemingly contradictory observations of sharpness- and flatness-seeking behavior of deep learning optimization. Our theory and experiments demonstrate that a combination of entropic forces and symmetry breaking is key to understanding emergent phenomena in deep learning.

[723] Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models

Shutong Wu, Jiawei Zhang

Main category: cs.LG

TL;DR: FreeDave is a novel parallel decoding algorithm for Diffusion Large Language Models that achieves lossless inference acceleration without model modifications.

Details

Motivation: Diffusion LLMs have potential for efficient parallel decoding but existing algorithms cause performance degradation. Current methods either require many steps (equal to sequence length) for quality or sacrifice performance for speed.

Method: FreeDave uses parallel-decoded candidate generation and verification algorithm that theoretically guarantees minimal model forward calls to reproduce same sequences as one-token-per-step decoding.

Result: Achieves up to 2.83× inference acceleration without performance degradation on math reasoning and code generation benchmarks across different DLLMs.

Conclusion: FreeDave enables lossless parallel decoding for Diffusion LLMs, addressing the speed-quality tradeoff in existing parallel decoding algorithms.

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a new paradigm of language modeling beyond autoregressive next-token prediction. Taking advantage of their inherent modeling foundations, DLLMs have the great potential of efficient inference with parallel decoding algorithms, which enable multi-token prediction. However, the high generation quality often requires the number of decoding steps equal to the sequence length, which performs a one-token-per-step decoding, and existing parallel decoding algorithms, which yield suboptimal decoding paths, bring inference speedup at the cost of non-negligible performance degradation. To overcome this challenge, we introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for DLLMs that achieves lossless parallel decoding without any model modification or extra modules. Specifically, we propose an algorithm of parallel-decoded candidate generation and verification, which is theoretically guaranteed to use the fewest model forward calls to reproduce the same sequence generated by one-token-per-step decoding. By extensive evaluations on math reasoning and code generation benchmarks across different DLLMs, FreeDave is proven to accelerate the inference up to $2.83\times$ without performance degradation.

[724] Multi-Level Monte Carlo Training of Neural Operators

James Rowbottom, Stefania Fresca, Pietro Lio, Carola-Bibiane Schönlieb, Nicolas Boullé

Main category: cs.LG

TL;DR: MLMC training framework for neural operators that uses multi-resolution data to reduce computational cost while maintaining accuracy

Details

Motivation: Traditional neural operator training is expensive for large-scale problems at high resolution, requiring efficient training methods

Method: Multi-Level Monte Carlo approach using gradient corrections from fewer fine-resolution samples with a hierarchy of resolutions

Result: Improved computational efficiency compared to single-resolution training, with Pareto curve between accuracy and computational time

Conclusion: MLMC training framework is effective for reducing computational cost of neural operator training while maintaining accuracy

Abstract: Operator learning is a rapidly growing field that aims to approximate nonlinear operators related to partial differential equations (PDEs) using neural operators. These rely on discretization of input and output functions and are, usually, expensive to train for large-scale problems at high-resolution. Motivated by this, we present a Multi-Level Monte Carlo (MLMC) approach to train neural operators by leveraging a hierarchy of resolutions of function discretization. Our framework relies on using gradient corrections from fewer samples of fine-resolution data to decrease the computational cost of training while maintaining a high level accuracy. The proposed MLMC training procedure can be applied to any architecture accepting multi-resolution data. Our numerical experiments on a range of state-of-the-art models and test-cases demonstrate improved computational efficiency compared to traditional single-resolution training approaches, and highlight the existence of a Pareto curve between accuracy and computational time, related to the number of samples per resolution.

[725] UrbanGraph: Physics-Informed Spatio-Temporal Dynamic Heterogeneous Graphs for Urban Microclimate Prediction

Weilin Xin, Chenyu Huang, Peilin Li, Jing Zhong, Jiawei Yao

Main category: cs.LG

TL;DR: UrbanGraph is a framework that transforms physical first principles into dynamic causal topology for urban microclimate prediction, achieving state-of-the-art performance with improved efficiency through explicit causal pruning.

Details

Motivation: Existing generative and homogeneous graph approaches fail to capture physical consistency, spatial dependencies, and temporal variability in urban microclimate prediction, which is critical for building energy demand and public health risk assessment.

Method: UrbanGraph introduces a structure-based inductive bias that transforms physical first principles into a dynamic causal topology, explicitly encoding time-varying causalities (e.g., shading and convection) directly into the graph structure rather than using implicit graph learning.

Result: UrbanGraph achieves state-of-the-art performance across all baselines. Explicit causal pruning reduces FLOPs by 73.8% and increases training speed by 21% compared to implicit graphs. The paper also introduces the first high-resolution benchmark for spatio-temporal microclimate modeling.

Conclusion: UrbanGraph provides a generalizable explicit topological encoding paradigm for urban spatio-temporal dynamics governed by known physical equations, offering improved physical consistency and data efficiency for microclimate prediction.

Abstract: With rapid urbanization, predicting urban microclimates has become critical, as it affects building energy demand and public health risks. However, existing generative and homogeneous graph approaches fall short in capturing physical consistency, spatial dependencies, and temporal variability. To address this, we introduce UrbanGraph, a framework founded on a novel structure-based inductive bias. Unlike implicit graph learning, UrbanGraph transforms physical first principles into a dynamic causal topology, explicitly encoding time-varying causalities (e.g., shading and convection) directly into the graph structure to ensure physical consistency and data efficiency. Results show that UrbanGraph achieves state-of-the-art performance across all baselines. Specifically, the use of explicit causal pruning significantly reduces the model’s floating-point operations (FLOPs) by 73.8% and increases training speed by 21% compared to implicit graphs. Our contribution includes the first high-resolution benchmark for spatio-temporal microclimate modeling, and a generalizable explicit topological encoding paradigm applicable to urban spatio-temporal dynamics governed by known physical equations.

[726] Inferring stochastic dynamics with growth from cross-sectional data

Stephen Zhang, Suryanarayana Maddu, Xiaojie Qiu, Victor Chardès

Main category: cs.LG

TL;DR: Unbalanced probability flow inference method for analyzing time-resolved single-cell omics data to infer stochastic dynamics with cell growth/death

Details

Motivation: Single-cell omics data provides genome-wide measurements but is destructive and cross-sectional, making it challenging to infer realistic biophysical models when cells can divide, die, or change molecular states.

Method: Novel approach using unbalanced probability flow inference that leverages Lagrangian formulation of Fokker-Planck equation to disentangle drift from intrinsic noise and growth in stochastic dynamics with growth.

Result: Method accurately infers dynamics on simulated and real single-cell RNA-seq datasets, achieving higher accuracy than existing methods with a simple two-step training scheme.

Conclusion: The approach successfully addresses challenges in inferring biophysical models from time-resolved single-cell data by properly accounting for stochastic dynamics with growth processes.

Abstract: Time-resolved single-cell omics data offers high-throughput, genome-wide measurements of cellular states, which are instrumental to reverse-engineer the processes underpinning cell fate. Such technologies are inherently destructive, allowing only cross-sectional measurements of the underlying stochastic dynamical system. Furthermore, cells may divide or die in addition to changing their molecular state. Collectively these present a major challenge to inferring realistic biophysical models. We present a novel approach, unbalanced probability flow inference, that addresses this challenge for biological processes modelled as stochastic dynamics with growth. By leveraging a Lagrangian formulation of the Fokker-Planck equation, our method accurately disentangles drift from intrinsic noise and growth. We showcase the applicability of our approach through evaluation on a range of simulated and real single-cell RNA-seq datasets. Comparing to several existing methods, we find our method achieves higher accuracy while enjoying a simple two-step training scheme.

[727] MoGU: Mixture-of-Gaussians with Uncertainty-based Gating for Time Series Forecasting

Gilad Aviv, Jacob Goldberger, Yoli Shavit

Main category: cs.LG

TL;DR: MoGU is a novel Mixture-of-Experts framework for regression tasks that uses expert-specific uncertainty as gating signal instead of learned gating, validated on multivariate time-series forecasting with improved accuracy and better uncertainty quantification.

Details

Motivation: Traditional Mixture-of-Experts (MoE) frameworks use learned gating mechanisms that may not optimally handle regression tasks, especially in domains like time-series forecasting characterized by high volatility and varying noise patterns. There's a need for more effective uncertainty-aware routing in MoE systems for regression.

Method: MoGU replaces standard learned gating with an intrinsic routing paradigm where expert-specific uncertainty serves as the native gating signal. Each expert models predictions as Gaussian distributions, and the system uses predicted variance to dynamically weight expert contributions based on their uncertainty.

Result: Empirical results across multiple benchmarks, horizon lengths, and backbones show MoGU consistently improves forecasting accuracy compared to traditional MoE. Conformal prediction evaluation indicates MoGU yields more efficient prediction intervals than existing baselines.

Conclusion: MoGU demonstrates capacity for providing both competitive performance and reliable, high-fidelity uncertainty quantification in regression tasks, particularly for time-series forecasting with complex noise patterns.

Abstract: We introduce Mixture-of-Gaussians with Uncertainty-based Gating (MoGU), a novel Mixture-of-Experts (MoE) framework designed for regression tasks. MoGU replaces standard learned gating with an intrinsic routing paradigm where expert-specific uncertainty serves as the native gating signal. By modeling each prediction as a Gaussian distribution, the system utilizes predicted variance to dynamically weight expert contributions. We validate MoGU on multivariate time-series forecasting, a domain defined by high volatility and varying noise patterns. Empirical results across multiple benchmarks, horizon lengths, and backbones demonstrate that MoGU consistently improves forecasting accuracy compared to traditional MoE. Further evaluation via conformal prediction indicates that our approach yields more efficient prediction intervals than existing baselines. These findings highlight MoGU’s capacity for providing both competitive performance and reliable, high-fidelity uncertainty quantification. Our code is available at: https://github.com/yolish/moe_unc_tsf

[728] SPAR: Self-supervised Placement-Aware Representation Learning for Distributed Sensing

Yizhuo Chen, Tianchen Wang, You Lyu, Yanlan Hu, Jinyang Li, Tomoyoshi Kimura, Hongjue Zhao, Yigong Hu, Denizhan Kara, Tarek Abdelzaher

Main category: cs.LG

TL;DR: SPAR is a self-supervised framework for placement-aware representation learning in distributed sensing that models the duality between signals and sensor positions.

Details

Motivation: Distributed sensing applications face the challenge that observed signals are inseparably shaped by sensor placements (spatial locations and structural characteristics), but existing pretraining methods remain largely placement-agnostic.

Method: SPAR introduces spatial and structural positional embeddings with dual reconstruction objectives, explicitly modeling how observing positions and observed signals shape each other, treating placement as intrinsic to representation learning rather than auxiliary metadata.

Result: Extensive experiments on three real-world datasets show SPAR achieves superior robustness and generalization across various modalities, placements, and downstream tasks.

Conclusion: SPAR provides a principled framework for placement-aware representation learning in distributed sensing, with theoretical support from information theory and occlusion-invariant learning.

Abstract: We present SPAR, a framework for self-supervised placement-aware representation learning in distributed sensing. Distributed sensing spans applications where multiple spatially distributed and multimodal sensors jointly observe an environment, from vehicle monitoring to human activity recognition and earthquake localization. A central challenge shared by this wide spectrum of applications is that observed signals are inseparably shaped by sensor placements, including their spatial locations and structural characteristics. However, existing pretraining methods remain largely placement-agnostic. SPAR addresses this gap through a unifying principle: the duality between signals and positions. Guided by this principle, SPAR introduces spatial and structural positional embeddings together with dual reconstruction objectives, explicitly modeling how observing positions and observed signals shape each other. Placement is thus treated not as auxiliary metadata but as intrinsic to representation learning. SPAR is theoretically supported by analyses from information theory and occlusion-invariant learning. Extensive experiments on three real-world datasets show that SPAR achieves superior robustness and generalization across various modalities, placements, and downstream tasks.

[729] Redirection for Erasing Memory (REM): Towards a universal unlearning method for corrupted data

Stefan Schoepf, Michael Curtis Mozer, Nicole Elyse Mitchell, Alexandra Brintrup, Georgios Kaissis, Peter Kairouz, Eleni Triantafillou

Main category: cs.LG

TL;DR: The paper proposes a conceptual framework for comparing machine unlearning methods in vision classifiers and introduces REM, a novel unlearning technique that redirects corrupted data to dedicated neurons for effective erasure across diverse task scenarios.

Details

Motivation: Current machine unlearning methods are specialized to specific tasks, making systematic comparison difficult. There's a need for a unified framework to characterize diverse corrupted data unlearning tasks in vision classifiers and develop methods that work across different scenarios.

Method: Proposes a conceptual space with two dimensions: discovery rate (fraction of corrupted data known at unlearning time) and statistical regularity of corrupted data (from random exemplars to shared concepts). Introduces Redirection for Erasing Memory (REM) which redirects corrupted data to dedicated neurons introduced at unlearning time, then discards/deactivates them to suppress corrupted data influence.

Result: REM performs strongly across the entire space of unlearning tasks, while prior state-of-the-art methods fail predictably outside the regions they were designed for. The conceptual framework enables systematic comparison of unlearning methods.

Conclusion: The proposed conceptual space provides a systematic way to characterize and compare machine unlearning tasks. REM demonstrates robust performance across diverse unlearning scenarios, outperforming specialized methods that only work in limited regions of the task space.

Abstract: Machine unlearning is studied for a multitude of tasks, but specialization of unlearning methods to particular tasks has made their systematic comparison challenging. To address this issue, we propose a conceptual space to characterize diverse corrupted data unlearning tasks in vision classifiers. This space is described by two dimensions, the discovery rate (the fraction of the corrupted data that are known at unlearning time) and the statistical regularity of the corrupted data (from random exemplars to shared concepts). Methods proposed previously have been targeted at portions of this space and-we show-fail predictably outside these regions. We propose a novel method, Redirection for Erasing Memory (REM), whose key feature is that corrupted data are redirected to dedicated neurons introduced at unlearning time and then discarded or deactivated to suppress the influence of corrupted data. REM performs strongly across the space of tasks, in contrast to prior SOTA methods that fail outside the regions for which they were designed.

[730] Efficient Utility-Preserving Machine Unlearning with Implicit Gradient Surgery

Shiji Zhou, Tianbai Yu, Zhi Zhang, Heng Chang, Xiao Zhou, Dong Wu, Han Zhao

Main category: cs.LG

TL;DR: Proposes an efficient utility-preserving machine unlearning method using implicit gradient surgery to balance forgetting sensitive information while maintaining model performance.

Details

Motivation: Machine unlearning aims to remove sensitive/harmful memory from pre-trained models, but faces a tradeoff between unlearning efficacy and utility preservation. Existing multi-objective methods find Pareto-optimal solutions without fine-grained control, leading to under-optimization of unlearning objectives.

Method: Models MU as constrained optimization (optimizing unlearning objective with bounded utility loss constraint), shows equivalence to unilateral gradient surgery, and proposes implicit gradient surgery that approximates the solution with one backpropagation for efficiency.

Result: Theoretical convergence analysis and extensive experiments show the proposed algorithm achieves better tradeoff results than existing baselines.

Conclusion: The implicit gradient surgery method provides an efficient solution for utility-preserving machine unlearning with better tradeoff performance than previous approaches.

Abstract: Machine unlearning (MU) aims to efficiently remove sensitive or harmful memory from a pre-trained model. The key challenge is to balance the potential tradeoff between unlearning efficacy and utility preservation, which involves forgetting undesirable information as defined while maintaining the model’s original performance. One potential way to tackle this problem is to use multi-objective optimization to jointly optimize both the unlearning and utility preservation objectives. However, existing multi-objective methods only guarantee finding a Pareto-optimal solution without fine-grained control, which causes under-optimization of the unlearning objective. To this end, we first model MU as a constrained optimization problem, that is, optimizing the unlearning objective under the constraint of a bounded increase for utility loss. We then show that solving this optimization problem is equivalent to unilateral gradient surgery on the unlearning objective. To resolve the additional computational cost brought by gradient surgery, we propose an implicit gradient surgery method, which approximates the solution to the aforementioned constrained optimization problem via only one backpropagation, thereby achieving efficient utility-preserving MU. Theoretically, we provide a tight convergence analysis of the algorithm. Empirically, our extensive experiments show that the proposed algorithm achieves better tradeoff results than existing baselines. Codes are available at https://github.com/anseryuer/EUPMU-Efficient-Utility-Preserving-Machine-Unlearning.

[731] Position: Epistemic uncertainty estimation methods are fundamentally incomplete

Sebastián Jiménez, Mira Jürgens, Willem Waegeman

Main category: cs.LG

TL;DR: Current uncertainty quantification methods are incomplete as they fail to account for bias contamination and capture only partial variance sources, leading to unreliable uncertainty estimates for safety-critical applications.

Details

Motivation: The paper addresses fundamental limitations in widely used second-order uncertainty quantification methods that disentangle aleatoric and epistemic uncertainty. The motivation is to demonstrate that these methods provide incomplete and potentially misleading uncertainty estimates, which is critical for trustworthy supervised learning in safety-critical and high-stakes decision-making applications.

Method: The paper analyzes existing second-order uncertainty quantification methods theoretically and empirically. It shows how unaccounted bias contaminates uncertainty estimates and demonstrates that current approaches capture only partial contributions to variance-driven epistemic uncertainty. The analysis reveals that different methods account for different variance sources, leading to incomplete and difficult-to-interpret estimates.

Result: The results show that current methods overestimate aleatoric uncertainty while underestimating epistemic uncertainty due to unaccounted bias contamination. Additionally, existing approaches provide only partial coverage of variance sources in epistemic uncertainty estimation, making the estimates incomplete and unreliable for critical applications.

Conclusion: Current epistemic uncertainty estimates can only be safely used in safety-critical and high-stakes decision-making when their limitations are fully understood by end users and acknowledged by AI developers. The paper highlights the need for more complete uncertainty quantification methods that properly account for bias and capture all relevant variance sources.

Abstract: Identifying and disentangling sources of predictive uncertainty is essential for trustworthy supervised learning. We argue that widely used second-order methods that disentangle aleatoric and epistemic uncertainty are fundamentally incomplete. First, we show that unaccounted bias contaminates uncertainty estimates by overestimating aleatoric (data-related) uncertainty and underestimating the epistemic (model-related) counterpart, leading to incorrect uncertainty quantification. Second, we demonstrate that existing methods capture only partial contributions to the variance-driven part of epistemic uncertainty; different approaches account for different variance sources, yielding estimates that are incomplete and difficult to interpret. Together, these results highlight that current epistemic uncertainty estimates can only be used in safety-critical and high-stakes decision-making when limitations are fully understood by end users and acknowledged by AI developers.

[732] Transformers can do Bayesian Clustering

Prajit Bhaskaran, Tom Viering

Main category: cs.LG

TL;DR: Cluster-PFN is a Transformer-based model for Bayesian clustering that learns from synthetic GMM data to estimate posterior distributions over cluster counts and assignments, handling missing values without imputation and being much faster than traditional methods.

Details

Motivation: Bayesian clustering methods are computationally expensive at scale and struggle with missing data, where simple imputation ignores uncertainty. There's a need for scalable Bayesian clustering that properly handles missing values.

Method: Extends Prior-Data Fitted Networks (PFNs) using Transformers, trained entirely on synthetic datasets from Gaussian Mixture Model priors. Learns to estimate posterior distributions over both number of clusters and cluster assignments, handling missing data directly without imputation.

Result: Outperforms AIC, BIC and Variational Inference in estimating number of clusters, achieves competitive clustering quality with VI while being orders of magnitude faster. Handles high missingness in real-world genomic datasets better than imputation-based baselines.

Conclusion: Cluster-PFN provides scalable and flexible Bayesian clustering that effectively handles missing data uncertainty and offers significant computational advantages over traditional methods.

Abstract: Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.

[733] On Universality Classes of Equivariant Networks

Marco Pacini, Gabriele Santin, Bruno Lepri, Shubhendu Trivedi

Main category: cs.LG

TL;DR: Equivariant neural networks’ separation power doesn’t fully capture expressivity; identical separation power models can differ in approximation ability. Authors characterize universality classes of shallow invariant networks and identify conditions where shallow equivariant networks fail or achieve separation-constrained universality.

Details

Motivation: While equivariant neural networks have been extensively analyzed for their separation power (ability to distinguish inputs modulo symmetry), their universality (capacity to approximate target functions) remains underexplored. The authors aim to investigate whether separation power fully captures expressivity.

Method: The authors analyze the approximation power of equivariant neural networks beyond separation constraints. They characterize universality classes of shallow invariant networks, providing a general framework for understanding which functions these architectures can approximate. Since equivariant models reduce to invariant ones under projection, this analysis yields conditions for universality failure/success.

Result: Separation power does not fully capture expressivity: models with identical separation power may differ in their approximation ability. The authors characterize when shallow equivariant networks fail to be universal and identify settings where they achieve separation-constrained universality. Positive results depend critically on structural properties of the symmetry group.

Conclusion: The approximation power of equivariant neural networks extends beyond separation constraints, with universality depending on group structure properties. For important cases like permutation symmetry, adequate normal subgroups may not exist, limiting universality despite separation power.

Abstract: Equivariant neural networks provide a principled framework for incorporating symmetry into learning architectures and have been extensively analyzed through the lens of their separation power, that is, the ability to distinguish inputs modulo symmetry. This notion plays a central role in settings such as graph learning, where it is often formalized via the Weisfeiler-Leman hierarchy. In contrast, the universality of equivariant models-their capacity to approximate target functions-remains comparatively underexplored. In this work, we investigate the approximation power of equivariant neural networks beyond separation constraints. We show that separation power does not fully capture expressivity: models with identical separation power may differ in their approximation ability. To demonstrate this, we characterize the universality classes of shallow invariant networks, providing a general framework for understanding which functions these architectures can approximate. Since equivariant models reduce to invariant ones under projection, this analysis yields sufficient conditions under which shallow equivariant networks fail to be universal. Conversely, we identify settings where shallow models do achieve separation-constrained universality. These positive results, however, depend critically on structural properties of the symmetry group, such as the existence of adequate normal subgroups, which may not hold in important cases like permutation symmetry.

[734] Relational reasoning and inductive bias in transformers and large language models

Jesse Geerts, Andrew Liu, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld

Main category: cs.LG

TL;DR: Transformers trained with in-weights learning naturally develop transitive inference capabilities, while in-context learning models use match-and-copy strategies that fail at hierarchical reasoning, but pre-training on linear regression enables in-context transitive inference with human-like effects.

Details

Motivation: To understand how transformers perform relational reasoning, specifically transitive inference, and compare different learning strategies (in-weights vs in-context learning) for this fundamental reasoning task.

Method: Investigated transformer models performing transitive inference tasks, comparing in-weights learning (IWL) and in-context learning (ICL) strategies. Extended findings to large language models using linear and circular geometric scaffolds as prompts.

Result: IWL naturally induces transitive inference generalization despite training only on adjacent items, while ICL models develop match-and-copy strategies that fail to encode hierarchical relationships. Pre-training on in-context linear regression enables transformers to exhibit human-like transitive inference patterns (symbolic distance and terminal item effects). Linear geometric scaffolds improve LLM performance while circular geometries impair it.

Conclusion: Both training regime and geometric structure of induced representations critically determine transformers’ capacity for transitive inference, revealing fundamental differences in how different learning strategies approach relational reasoning.

Abstract: Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning remain poorly understood. We investigate how transformers perform \textit{transitive inference}, a classic relational reasoning task which requires inference indirectly related items (e.g., if $A>B$ and $B>C$, then $A>C$), comparing in-weights learning (IWL) and in-context learning (ICL) strategies. We find that IWL naturally induces a generalization bias towards transitive inference despite training only on adjacent items, whereas ICL models develop induction circuits implementing match-and-copy strategies that fail to encode hierarchical relationships. However, when pre-trained on in-context linear regression tasks, transformers successfully exhibit in-context generalizable transitive inference, displaying both \textit{symbolic distance} and \textit{terminal item effects} characteristic of human and animal performance, without forming induction circuits. We extend these findings to large language models, demonstrating that prompting with linear geometric scaffolds improves transitive inference, while circular geometries (which violate transitivity by allowing wraparound) impair performance, particularly when models cannot rely on stored knowledge. Together, these results reveal that both the training regime and the geometric structure of induced representations critically determine transformers’ capacity for transitive inference.

[735] Differentially Private Relational Learning with Entity-level Privacy Guarantees

Yinan Huang, Haoteng Yin, Eli Chien, Rongzhe Wei, Pan Li

Main category: cs.LG

TL;DR: A framework for differentially private relational learning with entity-level privacy guarantees, addressing challenges of high sensitivity and coupled sampling in network-structured data.

Details

Motivation: Relational and network-structured data learning is crucial in sensitive domains requiring privacy protection. Direct application of DP-SGD to relational learning faces challenges: (1) entities participate in multiple relations causing high sensitivity, and (2) multi-stage coupled sampling procedures make standard privacy amplification analyses inapplicable.

Method: Presents a principled framework with rigorous sensitivity analysis and adaptive gradient clipping based on entity occurrence frequency. Extends privacy amplification to tractable coupled sampling where dependence arises only through sample sizes, leading to a tailored DP-SGD variant for relational data.

Result: Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate strong utility-privacy trade-offs of the approach.

Conclusion: The framework provides formal entity-level DP guarantees for relational learning, addressing key challenges in applying differential privacy to network-structured data with provable privacy guarantees.

Abstract: Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach. Our code is available at https://github.com/Graph-COM/Node_DP.

[736] Polynomial Neural Sheaf Diffusion: A Spectral Filtering Approach on Cellular Sheaves

Alessio Borgi, Fabrizio Silvestri, Pietro Liò

Main category: cs.LG

TL;DR: PolyNSD introduces polynomial sheaf diffusion using orthogonal polynomials to create stable K-hop receptive fields with convex spectral mixing, achieving SOTA on graph benchmarks with diagonal restriction maps.

Details

Motivation: Existing Neural Sheaf Diffusion methods have limitations: they rely on SVD-based sheaf normalization and dense per-edge restriction maps that scale poorly with stalk dimension, require frequent Laplacian rebuilds, and produce brittle gradients.

Method: Proposes Polynomial Neural Sheaf Diffusion (PolyNSD) using a degree-K polynomial in normalized sheaf Laplacian evaluated via stable three-term recurrence on spectrally rescaled operator. Creates explicit K-hop receptive field in single layer with trainable spectral response as convex mixture of orthogonal polynomial basis responses.

Result: Achieves new state-of-the-art results on both homophilic and heterophilic benchmarks, inverts trend by obtaining results with just diagonal restriction maps, decouples performance from large stalk dimension, reduces runtime and memory requirements.

Conclusion: PolyNSD provides stable, efficient sheaf diffusion with explicit multi-hop receptive fields and trainable spectral responses, overcoming limitations of previous sheaf neural network implementations while maintaining strong performance.

Abstract: Sheaf Neural Networks equip graph structures with a cellular sheaf: a geometric structure which assigns local vector spaces (stalks) and a linear learnable restriction/transport maps to nodes and edges, yielding an edge-aware inductive bias that handles heterophily and limits oversmoothing. However, common Neural Sheaf Diffusion implementations rely on SVD-based sheaf normalization and dense per-edge restriction maps, which scale with stalk dimension, require frequent Laplacian rebuilds, and yield brittle gradients. To address these limitations, we introduce Polynomial Neural Sheaf Diffusion (PolyNSD), a new sheaf diffusion approach whose propagation operator is a degree-K polynomial in a normalised sheaf Laplacian, evaluated via a stable three-term recurrence on a spectrally rescaled operator. This provides an explicit K-hop receptive field in a single layer (independently of the stalk dimension), with a trainable spectral response obtained as a convex mixture of K+1 orthogonal polynomial basis responses. PolyNSD enforces stability via convex mixtures, spectral rescaling, and residual/gated paths, reaching new state-of-the-art results on both homophilic and heterophilic benchmarks, inverting the Neural Sheaf Diffusion trend by obtaining these results with just diagonal restriction maps, decoupling performance from large stalk dimension, while reducing runtime and memory requirements.

[737] Sharpness-Aware Machine Unlearning

Haoran Tang, Rajiv Khanna

Main category: cs.LG

TL;DR: SAM’s sharpness-aware properties benefit machine unlearning by reducing feature entanglement between retain and forget sets, with proposed Sharp MinMax method achieving best performance.

Details

Motivation: To understand how Sharpness-aware minimization (SAM) affects machine unlearning, particularly how its denoising properties interact with forgetting signals, and to develop improved unlearning methods based on sharpness characteristics.

Method: Characterize SAM’s behavior under unlearning, analyze signal surplus properties, propose Sharp MinMax method that splits model into two parts: one learns retain signals with SAM, the other unlearns forget signals with sharpness maximization.

Result: SAM outperforms SGD in unlearning with relaxed retain signal requirements, reduces feature entanglement, strengthens resistance to membership inference attacks, and flattens loss landscape. Sharp MinMax achieves best performance.

Conclusion: SAM enhances machine unlearning effectiveness across various scenarios, with sharpness properties playing crucial role in separating retain/forget information and improving privacy/security aspects.

Abstract: We characterize the effectiveness of Sharpness-aware minimization (SAM) under machine unlearning scheme, where unlearning forget signals interferes with learning retain signals. While previous work prove that SAM improves generalization with noise memorization prevention, we show that SAM abandons such denoising property when fitting the forget set, leading to altered generalization depending on signal strength. We further characterize the signal surplus of SAM in the order of signal strength, which enables learning from less retain signals to maintain model performance and putting more weight on unlearning the forget set. Empirical studies show that SAM outperforms SGD with relaxed requirement for retain signals and can enhance various unlearning methods either as pretrain or unlearn algorithm. Motivated by our refined characterization of SAM unlearning and observing that overfitting can benefit more stringent sample-specific unlearning, we propose Sharp MinMax, which splits the model into two to learn retain signals with SAM and unlearn forget signals with sharpness maximization, achieving best performance. Extensive experiments show that SAM enhances unlearning across varying difficulties measured by memorization, yielding decreased feature entanglement between retain and forget sets, stronger resistance to membership inference attacks, and a flatter loss landscape. Our observations generalize to more noised data, different optimizers, and different architectures.

[738] Mitigating Gender Bias in Depression Detection via Counterfactual Inference

Mingxuan Hu, Hongbo Ma, Xinlan Wu, Ziqi Liu, Jiaqi Liu, Yangbin Chen

Main category: cs.LG

TL;DR: A causal inference framework for debiasing audio-based depression detection models by removing gender bias through counterfactual reasoning.

Details

Motivation: Audio-based depression detection models suffer from gender bias due to imbalanced training data (higher female prevalence), causing models to learn spurious correlations and over-diagnose females while underperforming on males, raising fairness concerns.

Method: Propose a Counterfactual Debiasing Framework using causal inference: construct causal graph to model decision-making, identify gender bias as direct causal effect of gender on prediction, and during inference use counterfactual inference to estimate and subtract this direct effect.

Result: Extensive experiments on DAIC-WOZ dataset with two advanced acoustic backbones show the framework significantly reduces gender bias while improving overall detection performance compared to existing debiasing strategies.

Conclusion: The causal inference approach effectively addresses gender bias in audio-based depression detection, ensuring models rely on authentic acoustic pathological features rather than spurious gender correlations.

Abstract: Audio-based depression detection models have demonstrated promising performance but often suffer from gender bias due to imbalanced training data. Epidemiological statistics show a higher prevalence of depression in females, leading models to learn spurious correlations between gender and depression. Consequently, models tend to over-diagnose female patients while underperforming on male patients, raising significant fairness concerns. To address this, we propose a novel Counterfactual Debiasing Framework grounded in causal inference. We construct a causal graph to model the decision-making process and identify gender bias as the direct causal effect of gender on the prediction. During inference, we employ counterfactual inference to estimate and subtract this direct effect, ensuring the model relies primarily on authentic acoustic pathological features. Extensive experiments on the DAIC-WOZ dataset using two advanced acoustic backbones demonstrate that our framework not only significantly reduces gender bias but also improves overall detection performance compared to existing debiasing strategies.

[739] Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Ziyue Li, Chenrui Fan, Tianyi Zhou

Main category: cs.LG

TL;DR: First study of grokking in practical LLM pretraining, showing memorization-to-generalization transition in MoE LLMs with pathway evolution analysis

Details

Motivation: To investigate when LLMs memorize training data vs. generalize on downstream tasks, and understand the lag between them in practical pretraining settings (one-epoch, cross-domain corpus)

Method: Study grokking in mixture-of-experts LLMs, analyze pathway dynamics (expert choices across layers), develop two novel metrics: pathway similarity between samples and consistency of aggregated experts between layers

Result: Grokking emerges in pretraining MoE LLMs with asynchronous local grokking; pathways evolve from random/non-smooth to structured/transferable despite converged loss; metrics track generalization without costly evaluation

Conclusion: Pathway evolution reveals memorization-to-generalization transition; zero-cost metrics can monitor LLM generalization on downstream tasks without instruction tuning or benchmark evaluation

Abstract: This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks during thousands epochs’ training on algorithmic data, we focus on a practical setting for LLMs, i.e., one-epoch pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of their distributions and attributions to others. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of training data’s pathways (i.e., expert choices across layers in MoE). Our primary discovery is that the pathways evolve from random, non-smooth across layers, instance-specific to more structured and transferable across samples, despite the converged pretraining loss. This depicts a transition from memorization to generalization. Two novel metrics are developed to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These training data based metrics induce zero cost but can faithfully track and monitor the generalization of LLMs on downstream tasks, which, in conventional settings, requires costly instruction tuning and benchmark evaluation.

[740] DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering

Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates

Main category: cs.LG

TL;DR: First-order method identifies and removes nuisance gradient subspace correlated with Hessian outliers, improving gradient SNR without harming optimization

Details

Motivation: Stochastic gradients in deep learning show strong correlations aligned with Hessian outlier eigenvectors, but removing this subspace doesn't hurt optimization while potentially improving gradient signal-to-noise ratio for applications like gradient compression

Method: Proposes a first-order characterization of nuisance subspace using stochastic gradient covariance, with efficient online estimation method to identify and remove this subspace without computing Hessian

Result: Removing the nuisance subspace has little impact on optimization performance while providing practical benefits for gradient SNR-sensitive applications like gradient compression

Conclusion: A first-order method can effectively identify and remove Hessian-correlated gradient noise without harming optimization, offering practical advantages for gradient compression and other SNR-sensitive applications

Abstract: Stochastic gradients for deep neural networks exhibit strong correlations along the optimization trajectory, and are often aligned with a small set of Hessian eigenvectors associated with outlier eigenvalues. Recent work shows that projecting gradients away from this Hessian outlier subspace has little impact on optimization, despite capturing a large fraction of gradient variability. Since computing the Hessian is intractable in practice, we introduce a principled first-order characterization of the nuisance subspace based on the covariance of stochastic gradients, and propose an efficient method to estimate it online. We show that removing this subspace also has little impact on optimization, and yields practical benefits for applications sensitive to gradient signal-to-noise ratio such as gradient compression.

[741] Multi-view Graph Condensation via Tensor Decomposition

Nícolas Roque dos Santos, Dawon Ahn, Diego Minatel, Alneu de Andrade Lopes, Evangelos E. Papalexakis

Main category: cs.LG

TL;DR: GCTD proposes tensor decomposition for graph condensation to reduce computational demands while maintaining GNN performance and interpretability.

Details

Motivation: Current graph condensation methods rely on computationally intensive bi-level optimization and lack interpretability due to missing node mappings between original and synthetic graphs. Tensor decomposition techniques offer a more transparent and efficient alternative but haven't been explored for graph condensation.

Method: Proposes Multi-view Graph Condensation via Tensor Decomposition (GCTD), using tensor decomposition techniques to synthesize a smaller, informative graph while preserving essential information from the original graph for GNN training.

Result: Extensive experiments on six real-world datasets show GCTD effectively reduces graph size while preserving GNN performance, achieving up to 4.0% accuracy improvement on three datasets and competitive performance on large graphs compared to existing approaches.

Conclusion: Tensor decomposition techniques can effectively address computational challenges in graph condensation while maintaining performance and improving interpretability through preserved node mappings.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. In contrast to their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN’s predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model’s decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0\ improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at https://anonymous.4open.science/r/gctd-345A.

[742] Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Jihang Wang, Dongcheng Zhao, Ruolin Chen, Qian Zhang, Yi Zeng

Main category: cs.LG

TL;DR: Proposes ASSG (Adaptive Sharpness Surrogate Gradient) and SA-PGD (Stable Adaptive Projected Gradient Descent) for more reliable adversarial robustness evaluation in Spiking Neural Networks, revealing current SNN robustness is significantly overestimated.

Details

Motivation: SNNs have binary/discontinuous spike activations causing vanishing gradients, making gradient-based adversarial robustness evaluation unreliable. Current surrogate gradient methods' effectiveness under strong attacks is unclear, requiring more reliable evaluation frameworks.

Method: 1) Theoretical analysis of gradient vanishing in surrogate gradients; 2) ASSG adaptively evolves surrogate function shape based on input distribution during attacks; 3) SA-PGD uses adaptive step size under L∞ constraint for faster, more stable convergence with imprecise gradients.

Result: Substantially increases attack success rates across diverse adversarial training schemes, SNN architectures, and neuron models. Reveals current SNN robustness is significantly overestimated, providing more generalized and reliable evaluation.

Conclusion: Proposed framework provides more reliable SNN adversarial robustness evaluation, highlighting need for more dependable adversarial training methods as current robustness is overestimated.

Abstract: Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain’s energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint-Stable Adaptive Projected Gradient Descent (SA-PGD), achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated and highlighting the need for more dependable adversarial training methods.

Jiajun Li, Yixuan Li, Ran Hou, Yu Ding, Shisi Guan, Jiahui Duan, Xiongwei Han, Tao Zhong, Vincent Chau, Weiwei Wu, Wanyuan Wang

Main category: cs.LG

TL;DR: A constraint-based model reduction approach for Mixed Integer Linear Programming that identifies critical inequality constraints to transform into equalities, accelerating MILP solving while preserving feasibility.

Details

Motivation: Existing MILP model reduction methods focus primarily on variable reduction, while constraint reduction (transforming inequality constraints into equalities) has been largely ignored despite its potential to reduce complexity and accelerate solving.

Method: 1) Identify critical constraints by labeling tight-constraints at optimal solutions as potential critical constraints with heuristic selection rules. 2) Use multi-modal representation learning that leverages both instance-level and abstract-level MILP formulations to predict critical tight-constraints efficiently.

Result: Experimental results show the method improves solution quality by over 50% and reduces computation time by 17.47% compared to state-of-the-art methods.

Conclusion: Constraint-based model reduction is an effective approach for accelerating MILP solving, with the proposed multi-modal representation technique successfully identifying critical constraints for reduction.

Abstract: Model reduction, which aims to learn a simpler model of the original mixed integer linear programming (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for the MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label these tight-constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight-constraints. To learn the critical tight-constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50% and reduces the computation time by 17.47%.

[744] MSACL: Multi-Step Actor-Critic Learning with Lyapunov Certificates for Exponentially Stabilizing Control

Yongwei Zhang, Yuanzhe Xing, Quanyi Liang, Quan Quan, Zhikun She

Main category: cs.LG

TL;DR: MSACL integrates exponential stability with maximum entropy RL using Lyapunov certificates and multi-step data for safe, efficient learning in safety-critical applications.

Details

Motivation: Model-free RL lacks verifiable stability guarantees for safety-critical applications while maintaining exploration efficiency. Existing methods rely on complex reward engineering and single-step constraints.

Method: Introduces Exponential Stability Labels to categorize samples, uses λ-weighted aggregation to learn Lyapunov certificates, develops stability-aware advantage function for policy optimization, ensuring rapid Lyapunov descent and state convergence.

Result: Outperforms standard RL baselines and state-of-the-art Lyapunov-based RL algorithms across six benchmarks (four stabilization, two high-dimensional tracking tasks), shows robustness against uncertainties and generalization to unseen reference signals.

Conclusion: MSACL successfully integrates stability guarantees with efficient exploration, providing a practical solution for safety-critical RL applications with verifiable stability.

Abstract: For safety-critical applications, model-free reinforcement learning (RL) faces numerous challenges, particularly the difficulty of establishing verifiable stability guarantees while maintaining high exploration efficiency. To address these challenges, we present Multi-Step Actor-Critic Learning with Lyapunov Certificates (MSACL), a novel approach that seamlessly integrates exponential stability with maximum entropy reinforcement learning (MERL). In contrast to existing methods that rely on complex reward engineering and single-step constraints, MSACL utilizes intuitive rewards and multi-step data for actor-critic learning. Specifically, we first introduce Exponential Stability Labels (ESLs) to categorize samples and propose a $λ$-weighted aggregation mechanism to learn Lyapunov certificates. Leveraging these certificates, we then develop a stability-aware advantage function to guide policy optimization, thereby ensuring rapid Lyapunov descent and robust state convergence. We evaluate MSACL across six benchmarks, comprising four stabilization and two high-dimensional tracking tasks. Experimental results demonstrate its consistent superiority over both standard RL baselines and state-of-the-art Lyapunov-based RL algorithms. Beyond rapid convergence, MSACL exhibits significant robustness against environmental uncertainties and remarkable generalization to unseen reference signals. The source code and benchmarking environments are available at \href{https://github.com/YuanZhe-Xing/MSACL}{https://github.com/YuanZhe-Xing/MSACL}.

[745] Emergent Alignment via Competition

Natalie Collina, Surbhi Goel, Aaron Roth, Emily Ryu, Mirah Shi

Main category: cs.LG

TL;DR: Strategic competition among multiple misaligned AI agents can yield outcomes comparable to perfect alignment when user utility lies within the convex hull of agents’ utilities, especially as model diversity increases.

Details

Motivation: The paper addresses the fundamental challenge of aligning AI systems with human values, questioning whether perfect alignment is necessary to obtain alignment benefits. It explores whether strategic interactions with multiple imperfectly aligned agents can achieve outcomes similar to perfect alignment.

Method: Models the setting as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties. Uses game theory to analyze strategic competition among misaligned AI agents and proves three theoretical results about equilibrium outcomes.

Result: Proves that: (1) when perfect alignment would allow learning Bayes-optimal action, the user can do so under convex hull condition; (2) with approximate utility learning, non-strategic user achieves near-optimal utility; (3) when selecting best single AI after evaluation, equilibrium guarantees remain near-optimal. Includes experimental validation.

Conclusion: Strategic competition among multiple misaligned AI agents can effectively substitute for perfect alignment when user utility lies within the convex hull of agents’ utilities, with benefits increasing with model diversity.

Abstract: Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic setting where a human user interacts with multiple differently misaligned AI agents, none of which are individually well-aligned. Our key insight is that when the users utility lies approximately within the convex hull of the agents utilities, a condition that becomes easier to satisfy as model diversity increases, strategic competition can yield outcomes comparable to interacting with a perfectly aligned model. We model this as a multi-leader Stackelberg game, extending Bayesian persuasion to multi-round conversations between differently informed parties, and prove three results: (1) when perfect alignment would allow the user to learn her Bayes-optimal action, she can also do so in all equilibria under the convex hull condition (2) under weaker assumptions requiring only approximate utility learning, a non-strategic user employing quantal response achieves near-optimal utility in all equilibria and (3) when the user selects the best single AI after an evaluation period, equilibrium guarantees remain near-optimal without further distributional assumptions. We complement the theory with two sets of experiments.

[746] Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies

Mingming Zhang, Na Li, Zhuang Feiqing, Hongyang Zheng, Jiangbing Zhou, Wang Wuyin, Sheng-jie Sun, XiaoWei Chen, Junxiong Zhu, Lixin Zou, Chenliang Li

Main category: cs.LG

TL;DR: QGA: Q-value regularized Generative Auto-bidding method that combines Decision Transformer with Q-value regularization and dual-exploration for improved advertising bidding optimization.

Details

Motivation: Current auto-bidding approaches using RL and generative models face challenges with complex structures, expensive hyperparameter tuning, and suboptimal trajectories that hinder policy learning in e-commerce advertising.

Method: Proposes QGA with Q-value regularization using double Q-learning strategy integrated into Decision Transformer backbone, plus Q-value guided dual-exploration mechanism with multiple return-to-go targets and locally perturbed actions.

Result: Superior or highly competitive results on public benchmarks and simulation environments; 3.27% increase in Ad GMV and 2.49% improvement in Ad ROI in large-scale real-world A/B testing.

Conclusion: QGA effectively addresses limitations of existing auto-bidding methods by combining policy imitation with action-value maximization and safe exploration beyond data distribution.

Abstract: With the rapid development of e-commerce, auto-bidding has become a key asset in optimizing advertising performance under diverse advertiser environments. The current approaches focus on reinforcement learning (RL) and generative models. These efforts imitate offline historical behaviors by utilizing a complex structure with expensive hyperparameter tuning. The suboptimal trajectories further exacerbate the difficulty of policy learning. To address these challenges, we proposes QGA, a novel Q-value regularized Generative Auto-bidding method. In QGA, we propose to plug a Q-value regularization with double Q-learning strategy into the Decision Transformer backbone. This design enables joint optimization of policy imitation and action-value maximization, allowing the learned bidding policy to both leverage experience from the dataset and alleviate the adverse impact of the suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and locally perturbed actions. This entire exploration process is dynamically guided by the aforementioned Q-value module, which provides principled evaluation for each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.

[747] A Capacity-Based Rationale for Multi-Head Attention

Micah Adler

Main category: cs.LG

TL;DR: Theoretical analysis shows self-attention capacity scales with key dimension budget, with multi-head attention increasing capacity by reducing interference from embedding superposition.

Details

Motivation: To understand the fundamental capacity limits of self-attention mechanisms, specifically how many distinct token-token relations a single attention layer can reliably encode given a fixed computational budget.

Method: Introduces Relational Graph Recognition task where key-query channels encode directed graphs. Uses information-theoretic analysis with matching lower and upper bounds, analyzing both tractable multi-head models and scaled-softmax attention. Examines capacity scaling with key dimension budget.

Result: Recovering a graph with m’ relations requires key dimension D_K to grow essentially as m’/d_model up to logarithmic factors. Multi-head attention increases capacity by reducing interference from embedding superposition, even in permutation graphs. Controlled experiments show sharp phase transitions at predicted capacity.

Conclusion: Provides theoretical foundation for attention capacity, offering new capacity-based rationale for multi-head attention design. The multi-head advantage persists even with softmax normalization, value routing, and full Transformer blocks.

Abstract: We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension $D_K = h,d_k$. In a tractable multi-head model, we prove matching information-theoretic lower bounds and upper bounds via explicit constructions showing that recovering a graph with $m’$ relations in $d_{\text{model}}$-dimensional embeddings requires $D_K$ to grow essentially as $m’/d_{\text{model}}$ up to logarithmic factors, and we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even in permutation graphs, where all queries attend to a single target, splitting a fixed $D_K$ budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists when adding softmax normalization, value routing, and a full Transformer block trained with frozen GPT-2 embeddings.

[748] fev-bench: A Realistic Benchmark for Time Series Forecasting

Oleksandr Shchur, Abdul Fatir Ansari, Caner Turkmen, Lorenzo Stella, Nick Erickson, Pablo Guerron, Michael Bohlke-Schneider, Yuyang Wang

Main category: cs.LG

TL;DR: fev-bench is a comprehensive forecasting benchmark with 100 tasks across 7 domains, including 46 with covariates, featuring principled statistical aggregation and a lightweight Python library for evaluation.

Details

Motivation: Existing time series forecasting benchmarks have limitations: limited domain coverage, overlooking real-world settings like covariates, lack of statistical rigor in aggregation, inconsistent evaluation infrastructure, and rigidity for integration into existing pipelines.

Method: Propose fev-bench with 100 forecasting tasks across 7 domains (46 with covariates), supported by fev - a lightweight Python library for forecasting evaluation. Uses principled aggregation with bootstrapped confidence intervals to report performance via win rates and skill scores.

Result: Benchmark includes results for pretrained, statistical, and baseline models, providing comprehensive evaluation across diverse forecasting scenarios with covariates.

Conclusion: fev-bench addresses critical gaps in forecasting evaluation, offering statistically rigorous aggregation, real-world task coverage with covariates, and practical integration tools for sustained progress in time series forecasting.

Abstract: Benchmark quality is critical for meaningful evaluation and sustained progress in time series forecasting, particularly with the rise of pretrained models. Existing benchmarks often have limited domain coverage or overlook real-world settings such as tasks with covariates. Their aggregation procedures frequently lack statistical rigor, making it unclear whether observed performance differences reflect true improvements or random variation. Many benchmarks lack consistent evaluation infrastructure or are too rigid for integration into existing pipelines. To address these gaps, we propose fev-bench, a benchmark of 100 forecasting tasks across seven domains, including 46 with covariates. Supporting the benchmark, we introduce fev, a lightweight Python library for forecasting evaluation emphasizing reproducibility and integration with existing workflows. Using fev, fev-bench employs principled aggregation with bootstrapped confidence intervals to report performance along two dimensions: win rates and skill scores. We report results on fev-bench for pretrained, statistical, and baseline models and identify promising future research directions.

[749] Power Transform Revisited: Numerically Stable, and Federated

Xuefeng Xu, Graham Cormode

Main category: cs.LG

TL;DR: Analysis of numerical instabilities in power transforms and proposed remedies, with extension to federated learning settings

Details

Motivation: Power transforms are widely used for making data more Gaussian-like in statistical analysis and machine learning, but direct implementations suffer from severe numerical instabilities that can lead to incorrect results or crashes

Method: Comprehensive analysis of instability sources and proposed effective remedies, with extension to federated learning addressing both numerical and distributional challenges

Result: Experiments on real-world datasets demonstrate methods are both effective and robust, substantially improving stability compared to existing approaches

Conclusion: The paper provides solutions to numerical instability issues in power transforms and extends their application to federated learning contexts

Abstract: Power transforms are popular parametric methods for making data more Gaussian-like, and are widely used as preprocessing steps in statistical analysis and machine learning. However, we find that direct implementations of power transforms suffer from severe numerical instabilities, which can lead to incorrect results or even crashes. In this paper, we provide a comprehensive analysis of the sources of these instabilities and propose effective remedies. We further extend power transforms to the federated learning setting, addressing both numerical and distributional challenges that arise in this context. Experiments on real-world datasets demonstrate that our methods are both effective and robust, substantially improving stability compared to existing approaches.

[750] PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization

Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng

Main category: cs.LG

TL;DR: PRPO is a critic-free policy optimization method that combines outcome reliability with process-level guidance by segmenting reasoning sequences, normalizing PRM scores into token-level advantages, and aligning their distribution with outcome advantages through location-parameter shift.

Details

Motivation: Policy optimization for LLMs suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO provide limited guidance by assigning single normalized outcome rewards to all tokens, while Process Reward Models (PRMs) alone risk premature collapse when early low-reward tokens drive policies toward truncated outputs.

Method: PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through location-parameter shift. This combines outcome reliability with process-level guidance in a critic-free framework without requiring a value network.

Result: On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization.

Conclusion: PRPO effectively addresses the limitations of both outcome-only rewards and standalone PRMs by providing dense process-level guidance while maintaining outcome reliability, enabling more efficient policy optimization for multi-step reasoning tasks.

Abstract: Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning . While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization. Code is available at: https://github.com/SchumiDing/srpocode

[751] BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

Main category: cs.LG

TL;DR: BLISS is a data selection method for LLM pretraining that operates from scratch without external models, using bilevel optimization to estimate long-term influence of training samples.

Details

Motivation: Existing data selection methods for LLM pretraining rely on external pretrained models, making it hard to isolate data selection effects, and ignore long-term impact due to prohibitive full-scale training costs.

Method: BLISS uses a small proxy model as surrogate for LLM and a score model to estimate long-term influence. Formulates data selection as bilevel optimization: upper-level optimizes score model to assign sample weights, lower-level trains proxy model on weighted loss to convergence.

Result: BLISS achieves 1.7× speedup in reaching same performance as state-of-the-art method for 1B model, with superior performance across multiple downstream tasks when pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on C4 subsets.

Conclusion: BLISS provides an effective, lightweight data selection method that operates from scratch without external models, explicitly accounts for long-term impact, and significantly accelerates LLM pretraining.

Abstract: Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.

Rongchao Xu, Kunlin Cai, Lin Jian, Zhiqing Hong, Yuan Tian, Guang Wang

Main category: cs.LG

TL;DR: GeoGen: A two-stage coarse-to-fine framework for generating synthetic LBSN check-in trajectories using sparsity-aware spatio-temporal diffusion models and Transformer-based Seq2Seq architecture.

Details

Motivation: High collection costs and privacy concerns limit access to large-scale LBSN trajectory data needed for applications like POI recommendation and pandemic intervention. Synthetic data generation offers a solution but is challenging due to spatially discrete, temporally irregular nature and complex spatio-temporal patterns of LBSN check-ins.

Method: Two-stage framework: 1) Reconstruct continuous latent movement sequences from original trajectories using Sparsity-aware Spatio-temporal Diffusion model (S²TDiff) with efficient denoising network. 2) Generate fine-grained trajectories using Coarse2FineNet - a Transformer-based Seq2Seq architecture with dynamic context fusion encoder and multi-task hybrid-head decoder.

Result: Extensive experiments on four real-world datasets show GeoGen excels state-of-the-art models in both fidelity and utility evaluation, achieving over 69% and 55% improvements in distance and radius metrics on FS-TKY dataset.

Conclusion: GeoGen effectively addresses challenges in LBSN trajectory generation by handling spatial discreteness, temporal irregularity, and complex spatio-temporal patterns through a novel two-stage coarse-to-fine approach with diffusion models and Transformer architecture.

Abstract: Location-Based Social Network (LBSN) check-in trajectory data are important for many practical applications, like POI recommendation, advertising, and pandemic intervention. However, the high collection costs and ever-increasing privacy concerns prevent us from accessing large-scale LBSN trajectory data. The recent advances in synthetic data generation provide us with a new opportunity to achieve this, which utilizes generative AI to generate synthetic data that preserves the characteristics of real data while ensuring privacy protection. However, generating synthetic LBSN check-in trajectories remains challenging due to their spatially discrete, temporally irregular nature and the complex spatio-temporal patterns caused by sparse activities and uncertain human mobility. To address this challenge, we propose GeoGen, a two-stage coarse-to-fine framework for large-scale LBSN check-in trajectory generation. In the first stage, we reconstruct spatially continuous, temporally regular latent movement sequences from the original LBSN check-in trajectories and then design a Sparsity-aware Spatio-temporal Diffusion model (S$^2$TDiff) with an efficient denosing network to learn their underlying behavioral patterns. In the second stage, we design Coarse2FineNet, a Transformer-based Seq2Seq architecture equipped with a dynamic context fusion mechanism in the encoder and a multi-task hybrid-head decoder, which generates fine-grained LBSN trajectories based on coarse-grained latent movement sequences by modeling semantic relevance and behavioral uncertainty. Extensive experiments on four real-world datasets show that GeoGen excels state-of-the-art models for both fidelity and utility evaluation, e.g., it increases over 69% and 55% in distance and radius metrics on the FS-TKY dataset.

[753] Toward Learning POMDPs Beyond Full-Rank Actions and State Observability

Seiji Shaw, Travis Manderson, Chad Kessens, Nicholas Roy

Main category: cs.LG

TL;DR: The paper presents a method for learning POMDP parameters from action-observation sequences using spectral approaches and tensor decomposition, enabling agents to reason about systems with hidden states.

Details

Motivation: To enable autonomous agents to learn and reason about systems with hidden states (like locking mechanisms) by learning POMDP parameters from sequential data, since traditional PSR models lack explicit transition/observation models needed for planning with different reward functions.

Method: Uses spectral approaches (Predictive State Representations) and tensor decomposition to learn POMDP observation and transition matrices up to a similarity transform under rankness assumptions. Shows how to estimate this transform and identifies limitations in state distinguishability.

Result: Demonstrates that explicit observation/transition likelihoods can be used for planning with different goals/reward functions after learning. Shows fundamental limitation: learning POMDP beyond state partition is impossible from sequential data alone (constructs counterexamples).

Conclusion: The method enables learning POMDP models from sequential data for planning with different objectives, but has inherent limitations in state distinguishability that cannot be overcome without additional information.

Abstract: We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP’s actions and observation spaces, but not its state space, transitions, or observation models. These properties must be constructed from a sequence of actions and observations. Spectral approaches to learning models of partially observable domains, such as Predictive State Representations (PSRs), learn representations of state that are sufficient to predict future outcomes. PSR models, however, do not have explicit transition and observation system models that can be used with different reward functions to solve different planning problems. Under a mild set of rankness assumptions on the products of transition and observation matrices, we show how PSRs learn POMDP matrices up to a similarity transform, and this transform may be estimated via tensor decomposition methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states in a single partition have the same observation distributions corresponding to actions whose transition matrices are full-rank. Our experiments suggest that explicit observation and transition likelihoods can be leveraged to generate new plans for different goals and reward functions after the model has been learned. We also show that learning a POMDP beyond a partition of states is impossible from sequential data by constructing two POMDPs that agree on all observation distributions but differ in their transition dynamics.

[754] Tight Robustness Certificates and Wasserstein Distributional Attacks for Deep Neural Networks

Bach C. Le, Tung V. Dao, Binh T. Nguyen, Hong T. M. Chu

Main category: cs.LG

TL;DR: A primal approach for Wasserstein distributionally robust optimization that provides tighter certificates for neural network robustness using exact Lipschitz analysis and novel Wasserstein distributional attacks.

Details

Motivation: Existing WDRO methods based on global Lipschitz continuity or strong duality produce loose upper bounds or require prohibitive computation, limiting their practical effectiveness for adversarial robustness.

Method: Proposes a primal approach with exact Lipschitz certificates, leveraging piecewise-affine structure of ReLU networks for tractable characterization, extending to smooth activations (GELU, SiLU) in Transformers, and introducing Wasserstein Distributional Attacks (WDA, WDA++) for flexible worst-case distribution construction.

Result: Achieves competitive robust accuracy against state-of-the-art baselines while providing tighter certificates than existing methods, with flexible attack construction beyond point-wise perturbations.

Conclusion: The proposed framework offers improved theoretical guarantees and practical effectiveness for distributionally robust optimization in neural networks, with applications to adversarial robustness.

Abstract: Wasserstein distributionally robust optimization (WDRO) provides a framework for adversarial robustness, yet existing methods based on global Lipschitz continuity or strong duality often yield loose upper bounds or require prohibitive computation. We address these limitations with a primal approach and adopt a notion of exact Lipschitz certificates to tighten this upper bound of WDRO. For ReLU networks, we leverage the piecewise-affine structure on activation cells to obtain an exact tractable characterization of the corresponding WDRO problem. We further extend our analysis to modern architectures with smooth activations (e.g., GELU, SiLU), such as Transformers. Additionally, we propose novel Wasserstein Distributional Attacks (WDA, WDA++) that construct candidates for the worst-case distribution. Compared to existing attacks that are restricted to point-wise perturbations, our methods offer greater flexibility in the number and location of attack points. Extensive evaluations demonstrate that our proposed framework achieves competitive robust accuracy against state-of-the-art baselines while offering tighter certificates than existing methods. Our code is available at https://github.com/OLab-Repo/WDA.

[755] Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

Mamoona Ghafoor, Tatsuya Akutsu

Main category: cs.LG

TL;DR: Theoretical proof that ReLU-based generative networks with size O(n³) and constant depth can generate all trees within a specified tree edit distance from a given tree.

Details

Motivation: Tree generation with specified edit distance has applications in computational biology, structured data analysis, and image processing, but existing generative models lack theoretical guarantees about network size/depth needed for exact generation.

Method: Theoretical construction of deterministic ReLU-based generative networks that can produce all rooted, ordered, vertex-labeled trees within edit distance d from a given tree T. Networks have size O(n³) and constant depth.

Result: Networks successfully generated all valid trees up to 21 nodes within specified edit distances, while state-of-the-art models GraphRNN and GraphGDP achieved only 35% and 48% validation rates respectively due to non-deterministic mechanisms.

Conclusion: Provides theoretical foundation for compact generative models and opens directions for exact tree-structured data generation with deterministic guarantees.

Abstract: The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet Σ, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over Σwith tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

[756] CiMRAG: CiM-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs

Shih-Hsuan Chiu, Ming-Syan Chen

Main category: cs.LG

TL;DR: TONEL framework improves noise robustness and domain adaptability for RAG-based personalized virtual assistants on edge devices using noise-aware embedding learning compatible with CiM hardware.

Details

Motivation: Deploying RAG for personalized virtual assistants on edge devices faces efficiency challenges due to growing profile data and environmental noise in CiM architectures, especially critical in dynamic multi-domain edge scenarios requiring both accuracy and adaptability.

Method: Proposes Task-Oriented Noise-resilient Embedding Learning (TONEL) framework with noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints for accurate retrieval under noisy conditions.

Result: Extensive experiments on personalization benchmarks demonstrate effectiveness and practicality relative to strong baselines, especially in task-specific noisy scenarios.

Conclusion: TONEL addresses critical noise robustness and domain adaptability challenges for RAG deployment in noisy edge environments, enabling more reliable personalized virtual assistants.

Abstract: Personalized virtual assistants powered by large language models (LLMs) on edge devices are attracting growing attention, with Retrieval-Augmented Generation (RAG) emerging as a key method for personalization by retrieving relevant profile data and generating tailored responses. However, deploying RAG on edge devices faces efficiency hurdles due to the rapid growth of profile data, such as user-LLM interactions and recent updates. While Computing-in-Memory (CiM) architectures mitigate this bottleneck by eliminating data movement between memory and processing units via in-situ operations, they are susceptible to environmental noise that can degrade retrieval precision. This poses a critical issue in dynamic, multi-domain edge-based scenarios (e.g., travel, medicine, and law) where both accuracy and adaptability are paramount. To address these challenges, we propose Task-Oriented Noise-resilient Embedding Learning (TONEL), a framework that improves noise robustness and domain adaptability for RAG in noisy edge environments. TONEL employs a noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints, enabling accurate retrieval under noisy conditions. Extensive experiments conducted on personalization benchmarks demonstrate the effectiveness and practicality of our methods relative to strong baselines, especially in task-specific noisy scenarios.

[757] Noise-Adaptive Layerwise Learning Rates: Accelerating Geometry-Aware Optimization for Deep Neural Network Training

Jie Hao, Xiaochuan Gong, Jie Xu, Zhengdao Wang, Mingrui Liu

Main category: cs.LG

TL;DR: Geometry-aware optimizer with noise-adaptive layerwise learning rates that dynamically adjusts learning rates within layer groups based on gradient variance, accelerating DNN training compared to fixed-rate methods.

Details

Motivation: Standard geometry-aware optimizers like Muon use fixed learning rates for layers within the same norm group, but local curvature varies heterogeneously across layers and dynamically during training (e.g., sharpness varies across transformer layers). This fixed-rate approach may be inefficient for DNN training.

Method: Proposes a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms. Estimates gradient variance in the dual norm induced by the chosen linear minimization oracle (LMO) on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group.

Result: Theoretical analysis shows the algorithm achieves sharp convergence rate. Empirical results on transformer architectures (LLaMA and GPT) demonstrate faster convergence than state-of-the-art optimizers.

Conclusion: The proposed noise-adaptive layerwise learning rate scheme substantially accelerates DNN training compared to methods using fixed learning rates within each group, with theoretical guarantees and empirical validation on modern transformer architectures.

Abstract: Geometry-aware optimization algorithms, such as Muon, have achieved remarkable success in training deep neural networks (DNNs). These methods leverage the underlying geometry of DNNs by selecting appropriate norms for different layers and updating parameters via norm-constrained linear minimization oracles (LMOs). However, even within a group of layers associated with the same norm, the local curvature can be heterogeneous across layers and vary dynamically over the course of training. For example, recent work shows that sharpness varies substantially across transformer layers and throughout training, yet standard geometry-aware optimizers impose fixed learning rates to layers within the same group, which may be inefficient for DNN training. In this paper, we introduce a noise-adaptive layerwise learning rate scheme on top of geometry-aware optimization algorithms and substantially accelerate DNN training compared to methods that use fixed learning rates within each group. Our method estimates gradient variance in the dual norm induced by the chosen LMO on the fly, and uses it to assign time-varying noise-adaptive layerwise learning rates within each group. We provide a theoretical analysis showing that our algorithm achieves a sharp convergence rate. Empirical results on transformer architectures such as LLaMA and GPT demonstrate that our approach achieves faster convergence than state-of-the-art optimizers.

[758] Textual Equilibrium Propagation for Deep Compound AI Systems

Minghui Chen, Wenlong Deng, James Zou, Han Yu, Xiaoxiao Li

Main category: cs.LG

TL;DR: TEP introduces local prompt optimization for deep compound AI systems to address textual gradient issues in long-horizon workflows

Details

Motivation: Current LLM-based compound AI systems suffer from exploding/vanishing textual gradients in deep workflows, degrading performance as system depth increases

Method: Textual Equilibrium Propagation (TEP) uses local LLM critics for iterative prompt refinement (free phase) followed by proximal prompt edits with bounded modification (nudged phase)

Result: TEP improves accuracy and efficiency over global propagation methods like TextGrad, with gains increasing with system depth

Conclusion: Local prompt optimization via TEP enables effective deep compound AI systems without computational burden of global backpropagation

Abstract: Large language models (LLMs) are increasingly deployed as part of compound AI systems that coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Recent approaches that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, but we find that performance degrades as system depth grows. In particular, long-horizon agentic workflows exhibit two depth-scaling failure modes: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long message and amplifies evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models overemphasize partial feedback and compression of lengthy feedback causes downstream messages to lose specificity gradually as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where a local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad. The gains grows with depth, while preserving the practicality of black-box LLM components in deep compound AI system.

[759] S4ECG: Exploring the impact of long-range interactions for arrhythmia prediction

Tiezhi Wang, Wilhelm Haverkamp, Nils Strodthoff

Main category: cs.LG

TL;DR: S4ECG is a deep learning architecture using structured state space models for multi-epoch ECG arrhythmia classification, achieving superior performance by capturing both global trends and local waveform features simultaneously.

Details

Motivation: ECG signals have complex temporal dynamics that conventional methods struggle to analyze, as they typically capture either global trends or local waveform features but not their simultaneous interplay at high temporal resolution.

Method: Introduces S4ECG, a novel deep learning architecture leveraging structured state space models for multi-epoch arrhythmia classification, enabling joint analysis across multiple time windows to capture both local and global temporal dependencies.

Result: Multi-epoch predictions outperform single-epoch approaches by 1.0-11.6% in macro-AUROC, with atrial fibrillation specificity improving from 0.718-0.979 to 0.967-0.998, showing superior in-distribution performance and enhanced out-of-distribution robustness.

Conclusion: The work enables a paradigm shift toward temporally-aware arrhythmia detection algorithms, particularly beneficial for complex arrhythmias like atrial fibrillation and atrial flutter, with optimal temporal dependency windows of 10-20 minutes.

Abstract: The electrocardiogram (ECG) exemplifies biosignal-based time series with continuous, temporally ordered structure reflecting cardiac physiological and pathophysiological dynamics. Detailed analysis of these dynamics has proven challenging, as conventional methods capture either global trends or local waveform features but rarely their simultaneous interplay at high temporal resolution. To bridge global and local signal analysis, we introduce S4ECG, a novel deep learning architecture leveraging structured state space models for multi-epoch arrhythmia classification. Our joint multi-epoch predictions significantly outperform single-epoch approaches by 1.0-11.6% in macro-AUROC, with atrial fibrillation specificity improving from 0.718-0.979 to 0.967-0.998, demonstrating superior performance in-distribution and enhanced out-of-distribution robustness. Systematic investigation reveals optimal temporal dependency windows spanning 10-20 minutes for peak performance. This work contributes to a paradigm shift toward temporally-aware arrhythmia detection algorithms, opening new possibilities for ECG interpretation, in particular for complex arrhythmias like atrial fibrillation and atrial flutter.

[760] Simple Denoising Diffusion Language Models

Huaisheng Zhu, Zhengyu Chen, Shijie Zhou, Zhihui Xie, Yige Yuan, Shiqi Chen, Zhimeng Guo, Siyuan Xu, Hangfan Zhang, Vasant Honavar, Teng Xiao

Main category: cs.LG

TL;DR: Simplified denoising loss for Uniform State Diffusion Models improves training stability and matches performance of complex objectives while enabling scalability.

Details

Motivation: Uniform State Diffusion Models (USDMs) offer fast text generation but rely on complex loss formulations with computational overhead that hinders scalability.

Method: Proposes simplified denoising-based loss that optimizes only noise-replaced tokens, plus efficient regularization term to mitigate corruption toward uniform output distributions.

Result: Demonstrates effectiveness and efficiency on widely used text datasets, showing strong potential for large-scale training with larger models.

Conclusion: Simplified loss formulations for USDMs maintain performance while improving training stability and enabling better scalability.

Abstract: Recent Uniform State Diffusion Models (USDMs), initialized from a uniform prior, offer the promise of fast text generation due to their inherent self-correction ability compared to masked diffusion models. However, they still rely on complex loss formulations with additional computational overhead, which hinders scalability. In this work, we explore a simplified denoising-based loss for USDMs that optimizes only noise-replaced tokens, stabilizing training while matching the performance of prior methods with more complex objectives. In addition, we introduce an efficient regularization term to mitigate corruption toward uniform output distributions, which further improves performance. We demonstrate the effectiveness and efficiency of our simple and improved loss formulations by pretraining models on widely used text datasets for USDMs. More importantly, our conclusions scale to larger models, showing strong potential for large-scale training.

[761] Methodology for Comparing Machine Learning Algorithms for Survival Analysis

Lucas Buk Cardoso, Simone Aldrey Angelo, Yasmin Pacheco Gil Bonilha, Fernando Maia, Adeylson Guimarães Ribeiro, Maria Paula Curado, Gisele Aparecida Fernandes, Vanderlei Cunha Parro, Flávio Almeida de Magalhães Cipparrone, Alexandre Dias Porto Chiavegatto Filho, Victor Wünsch Filho, Tatiana Natasha Toporcov

Main category: cs.LG

TL;DR: Comparative analysis of six machine learning models for survival analysis in colorectal cancer patients, with XGB-AFT achieving best performance metrics.

Details

Motivation: To evaluate and compare the performance of different machine learning models for survival analysis in colorectal cancer patients, particularly focusing on their ability to handle censored data and improve survival prediction for clinical decision support.

Method: Used data from 45,000 colorectal cancer patients, evaluated six ML models (RSF, GBSA, SSVM, XGB-Cox, XGB-AFT, LightGBM) with hyperparameter optimization using different samplers, assessed performance using C-Index, C-Index IPCW, time-dependent AUC, and Integrated Brier Score, compared survival curves with classification algorithms, and used SHAP and permutation importance for predictor interpretation.

Result: XGB-AFT achieved the best performance with C-Index = 0.7618 and IPCW = 0.7532, followed by GBSA and RSF. The study demonstrated the potential of machine learning survival analysis models to improve prediction accuracy.

Conclusion: Machine learning survival analysis models show significant potential for improving survival prediction in colorectal cancer patients and supporting clinical decision making, with XGB-AFT emerging as the top performer among the evaluated models.

Abstract: This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of São Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), capable of predicting survival considering censored data. Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictor interpretation was conducted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.

[762] SCPL: Enhancing Neural Network Training Throughput with Decoupled Local Losses and Model Parallelism

Ming-Yao Ho, Cheng-Kai Wang, You-Teng Lin, Hung-Hsuan Chen

Main category: cs.LG

TL;DR: SCPL is a new training method that decouples backpropagation into parallel short gradient flows to improve training efficiency and throughput for large AI models.

Details

Motivation: High training costs and long development cycles hinder enterprise adoption of large AI models. Standard backpropagation is inefficient for deep networks, creating a need for more efficient training methods.

Method: Supervised Contrastive Parallel Learning (SCPL) decouples backpropagation by transforming long gradient flows into multiple short ones, enabling simultaneous computation of parameter gradients in different layers for superior model parallelism.

Result: SCPL demonstrates improved efficiency and effectiveness compared to standard BP, Early Exit, GPipe, and Associated Learning (AL), mitigating fundamental performance bottlenecks in training.

Conclusion: SCPL provides a practical pathway for organizations to develop and deploy advanced AI systems more cost-effectively and with greater agility by addressing training inefficiencies.

Abstract: Adopting large-scale AI models in enterprise information systems is often hindered by high training costs and long development cycles, posing a significant managerial challenge. The standard end-to-end backpropagation (BP) algorithm is a primary driver of modern AI, but it is also the source of inefficiency in training deep networks. This paper introduces a new training methodology, Supervised Contrastive Parallel Learning (SCPL), that addresses this issue by decoupling BP and transforming a long gradient flow into multiple short ones. This design enables the simultaneous computation of parameter gradients in different layers, achieving superior model parallelism and enhancing training throughput. Detailed experiments are presented to demonstrate the efficiency and effectiveness of our model compared to BP, Early Exit, GPipe, and Associated Learning (AL), a state-of-the-art method for decoupling backpropagation. By mitigating a fundamental performance bottleneck, SCPL provides a practical pathway for organizations to develop and deploy advanced information systems more cost-effectively and with greater agility. The experimental code is released for reproducibility. https://github.com/minyaho/scpl/

[763] Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka

Main category: cs.LG

TL;DR: Continual pretraining on misinformation can overwrite specific factual knowledge in LLMs without degrading overall performance, creating targeted factual corruption that’s hard to detect through standard benchmarks.

Details

Motivation: To understand how repeated exposure to counterfactual claims during continual model updates affects factual knowledge representation, and whether targeted misinformation can corrupt specific facts without triggering broad performance collapse.

Method: Used paired fact-counterfact items with graded poisoning ratios, tracked internal preferences across checkpoints, layers, and model scales, and tested reversibility via patching techniques.

Result: Moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual, with belief flips emerging abruptly and concentrating in late layers (Layers 29-36 in 3B models). Corruption generalizes beyond poisoned prompts and is partially reversible (up to 56.8%).

Conclusion: Continual pretraining creates a failure mode where targeted misinformation replaces internal factual representations without broad performance collapse, necessitating representation-level monitoring of factual integrity during model updates.

Abstract: We show that continual pretraining on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pretraining, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., Layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment benchmarks largely intact and transferring imperfectly across languages. These results expose a failure mode of continual pre-training in which targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.

[764] SPGCL: Simple yet Powerful Graph Contrastive Learning via SVD-Guided Structural Perturbation

Hao Deng, Zhang Guo, Shuiping Gou, Bo Liu

Main category: cs.LG

TL;DR: SPGCL is a graph contrastive learning framework that integrates SVD-guided structural perturbation with stochastic edge removal to create robust and diverse graph views for better GNN performance.

Details

Motivation: Existing graph contrastive learning methods have limitations: random perturbations are structure-agnostic and may remove critical edges, while SVD-based views lack sufficient diversity. There's a need to integrate these disparate paradigms for more robust graph representation learning.

Method: Two-stage strategy: (1) lightweight stochastic edge removal for diversity injection, (2) truncated SVD to derive structure-aware scoring matrix for sparse top-P edge recovery. Also includes contrastive fusion module with global similarity constraint for embedding alignment.

Result: Extensive experiments on ten benchmark datasets show SPGCL consistently improves robustness and accuracy of GNNs, outperforming state-of-the-art GCL and structure learning methods.

Conclusion: SPGCL effectively integrates previously disparate paradigms of random perturbations and spectral augmentations, offering robustness to accidental deletion, enrichment with missing links, and controllable structural discrepancy for better contrastive learning.

Abstract: Graph Neural Networks (GNNs) are sensitive to structural noise from adversarial attacks or imperfections. Existing graph contrastive learning (GCL) methods typically rely on either random perturbations (e.g., edge dropping) for diversity or spectral augmentations (e.g., SVD) to preserve structural priors. However, random perturbations are structure-agnostic and may remove critical edges, while SVD-based views often lack sufficient diversity. Integrating these paradigms is challenging as they operate on discrete edge removal and continuous matrix factorization, respectively.We propose SPGCL, a framework for robust GCL via SVD-guided structural perturbation. Leveraging a recently developed SVD-based method that generalizes structural perturbation theory to arbitrary graphs, we design a two-stage strategy: (1) lightweight stochastic edge removal to inject diversity, and (2) truncated SVD to derive a structure-aware scoring matrix for sparse top-$P$ edge recovery. This integration offers three advantages: (1) Robustness to accidental deletion, as important edges can be recovered by SVD-guided scoring; (2) Enrichment with missing links, creating more informative contrastive views by introducing semantically meaningful edges; and (3) Controllable structural discrepancy, ensuring contrastive signals stem from semantic differences rather than edge-number gaps.Furthermore, we incorporate a contrastive fusion module with a global similarity constraint to align embeddings. Extensive experiments on ten benchmark datasets demonstrate that SPGCL consistently improves the robustness and accuracy of GNNs, outperforming state-of-the-art GCL and structure learning methods, validating its effectiveness in integrating previously disparate paradigms.

[765] Dynamic Priors in Bayesian Optimization for Hyperparameter Optimization

Lukas Fehring, Marcel Wever, Maximilian Spliethöver, Leona Hennig, Henning Wachsmuth, Marius Lindauer

Main category: cs.LG

TL;DR: DynaBO is a Bayesian optimization framework that enables continuous user control over hyperparameter optimization by incorporating user priors into the acquisition function with decaying weights while maintaining convergence guarantees.

Details

Motivation: Current hyperparameter optimization methods only incorporate expert knowledge during initialization, limiting practitioners' ability to influence the optimization process as new insights emerge during iterative machine learning development workflows.

Method: DynaBO augments the acquisition function with decaying, prior-weighted preferences that leverage user-provided priors over time. It includes a data-driven safeguard to detect and reject misleading priors while preserving asymptotic convergence guarantees.

Result: Extensive experiments across various HPO benchmarks show DynaBO consistently outperforms state-of-the-art competitors across all benchmarks and for all prior kinds, demonstrating reliable and efficient collaborative BO.

Conclusion: DynaBO bridges automated and manually controlled model development by enabling reliable and efficient collaborative Bayesian optimization with continuous user control.

Abstract: Bayesian optimization (BO) is a widely used approach to hyperparameter optimization (HPO). However, most existing HPO methods only incorporate expert knowledge during initialization, limiting practitioners’ ability to influence the optimization process as new insights emerge. This limits the applicability of BO in iterative machine learning development workflows. We propose DynaBO, a BO framework that enables continuous user control of the optimization process. Over time, DynaBO leverages provided user priors by augmenting the acquisition function with decaying, prior-weighted preferences while preserving asymptotic convergence guarantees. To reinforce robustness, we introduce a data-driven safeguard that detects and can be used to reject misleading priors. We prove theoretical results on near-certain convergence, robustness to adversarial priors, and accelerated convergence when informative priors are provided. Extensive experiments across various HPO benchmarks show that DynaBO consistently outperforms our state-of-the-art competitors across all benchmarks and for all prior kinds. Our results demonstrate that DynaBO enables reliable and efficient collaborative BO, bridging automated and manually controlled model development.

[766] Variational Approach for Job Shop Scheduling

Seung Heon Oh, Jiwon Baek, Ki Young Cho, Hee Chang Yoon, Jong Hun Woo

Main category: cs.LG

TL;DR: VG2S is a variational graph-to-scheduler framework that uses variational inference and maximum entropy reinforcement learning to solve Job Shop Scheduling Problems with better generalization and training stability.

Details

Motivation: Traditional DRL approaches for JSSP suffer from non-stationarity during training and poor generalization to unseen instances due to simultaneous optimization of representation learning and policy execution.

Method: Introduces variational inference to JSSP for the first time, using ELBO with maximum entropy RL to decouple representation learning from policy optimization via a variational graph encoder.

Result: Superior zero-shot generalization compared to state-of-the-art DRL baselines and traditional dispatching rules, especially on large-scale benchmarks like DMU and SWV.

Conclusion: VG2S framework enhances training stability and robustness against hyperparameter variations by learning robust structural representations through variational graph encoding.

Abstract: This paper proposes a novel Variational Graph-to-Scheduler (VG2S) framework for solving the Job Shop Scheduling Problem (JSSP), a critical task in manufacturing that directly impacts operational efficiency and resource utilization. Conventional Deep Reinforcement Learning (DRL) approaches often face challenges such as non-stationarity during training and limited generalization to unseen problem instances because they optimize representation learning and policy execution simultaneously. To address these issues, we introduce variational inference to the JSSP domain for the first time and derive a probabilistic objective based on the Evidence of Lower Bound (ELBO) with maximum entropy reinforcement learning. By mathematically decoupling representation learning from policy optimization, the VG2S framework enables the agent to learn robust structural representations of scheduling instances through a variational graph encoder. This approach significantly enhances training stability and robustness against hyperparameter variations. Extensive experiments demonstrate that the proposed method exhibits superior zero-shot generalization compared with state-of-the-art DRL baselines and traditional dispatching rules, particularly on large-scale and challenging benchmark instances such as DMU and SWV.

[767] DecoHD: Decomposed Hyperdimensional Classification under Extreme Memory Budgets

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Mohsen Imani

Main category: cs.LG

TL;DR: DecoHD introduces decomposition techniques to hyperdimensional computing (HDC) for extreme memory savings while maintaining accuracy, achieving significant energy/speed gains in hardware.

Details

Motivation: Traditional HDC compression methods shrink the feature axis, eroding concentration and robustness. Prior decompositions use fixed atomic hypervectors unsuitable for compressing learned class prototypes, creating a need for more effective decomposition approaches.

Method: DecoHD learns directly in a decomposed HDC parameterization using a small shared set of per-layer channels with multiplicative binding across layers and bundling at the end. It compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score operations.

Result: Achieves extreme memory savings with only minor accuracy degradation (within 0.1-0.15% of baseline, worst case 5.7%). More robust to random bit-flip noise, reaches accuracy plateau with ~97% fewer trainable parameters, and delivers significant hardware gains: 277x/35x energy/speed over CPU, 13.5x/3.7x over GPU, and 2.0x/2.4x over baseline HDC ASIC.

Conclusion: DecoHD successfully applies decomposition to HDC, enabling extreme compression while maintaining accuracy and robustness, with substantial hardware efficiency improvements for in/near-memory accelerators.

Abstract: Decomposition is a proven way to shrink deep networks without changing input-output dimensionality or interface semantics. We bring this idea to hyperdimensional computing (HDC), where footprint cuts usually shrink the feature axis and erode concentration and robustness. Prior HDC decompositions decode via fixed atomic hypervectors, which are ill-suited for compressing learned class prototypes. We introduce DecoHD, which learns directly in a decomposed HDC parameterization: a small, shared set of per-layer channels with multiplicative binding across layers and bundling at the end, yielding a large representational space from compact factors. DecoHD compresses along the class axis via a lightweight bundling head while preserving native bind-bundle-score; training is end-to-end, and inference remains pure HDC, aligning with in/near-memory accelerators. In evaluation, DecoHD attains extreme memory savings with only minor accuracy degradation under tight deployment budgets. On average it stays within about 0.1-0.15% of a strong non-reduced HDC baseline (worst case 5.7%), is more robust to random bit-flip noise, reaches its accuracy plateau with up to ~97% fewer trainable parameters, and–in hardware–delivers roughly 277x/35x energy/speed gains over a CPU (AMD Ryzen 9 9950X), 13.5x/3.7x over a GPU (NVIDIA RTX 4090), and 2.0x/2.4x over a baseline HDC ASIC.

[768] LogHD: Robust Compression of Hyperdimensional Classifiers via Logarithmic Class-Axis Reduction

Sanggeon Yun, Hyunwoo Oh, Ryozo Masukawa, Pietro Mercati, Nathaniel D. Bastian, Mohsen Imani

Main category: cs.LG

TL;DR: LogHD reduces memory requirements in hyperdimensional computing by using logarithmic class-axis compression instead of feature-axis compression, maintaining robustness while achieving significant efficiency gains.

Details

Motivation: Standard hyperdimensional computing requires O(CD) memory for C classes and D dimensionality, which is inefficient for constrained systems. Prior feature-axis compression reduces D but weakens robustness. There's a need for memory-efficient HDC that preserves robustness.

Method: LogHD replaces C per-class prototypes with n≈⌈log_k C⌉ bundle hypervectors (alphabet size k) and decodes in an n-dimensional activation space. Uses capacity-aware codebook and profile-based decoding, and can compose with feature-axis sparsification.

Result: LogHD achieves competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, sustains target accuracy at 2.5-3.0× higher bit-flip rates than feature-axis compression. ASIC instantiation delivers 498× energy efficiency and 62.6× speedup over AMD Ryzen 9 9950X, and 4.06× more energy-efficient and 2.19× faster than feature-axis HDC ASIC baseline.

Conclusion: LogHD provides an effective logarithmic class-axis compression approach for HDC that significantly reduces memory requirements while maintaining robustness and achieving substantial efficiency improvements in hardware implementations.

Abstract: Hyperdimensional computing (HDC) suits memory, energy, and reliability-constrained systems, yet the standard “one prototype per class” design requires $O(CD)$ memory (with $C$ classes and dimensionality $D$). Prior compaction reduces $D$ (feature axis), improving storage/compute but weakening robustness. We introduce LogHD, a logarithmic class-axis reduction that replaces the $C$ per-class prototypes with $n!\approx!\lceil\log_k C\rceil$ bundle hypervectors (alphabet size $k$) and decodes in an $n$-dimensional activation space, cutting memory to $O(D\log_k C)$ while preserving $D$. LogHD uses a capacity-aware codebook and profile-based decoding, and composes with feature-axis sparsification. Across datasets and injected bit flips, LogHD attains competitive accuracy with smaller models and higher resilience at matched memory. Under equal memory, it sustains target accuracy at roughly $2.5$-$3.0\times$ higher bit-flip rates than feature-axis compression; an ASIC instantiation delivers $498\times$ energy efficiency and $62.6\times$ speedup over an AMD Ryzen 9 9950X and $24.3\times$/$6.58\times$ over an NVIDIA RTX 4090, and is $4.06\times$ more energy-efficient and $2.19\times$ faster than a feature-axis HDC ASIC baseline.

[769] DiScoFormer: Plug-In Density and Score Estimation with Transformers

Vasily Ilin, Peter Sushko

Main category: cs.LG

TL;DR: DiScoFormer is a transformer model that learns to estimate probability density and score functions from samples, generalizing across distributions without retraining.

Details

Motivation: Existing methods for density and score estimation are bifurcated: classical KDE suffers from curse of dimensionality, while neural score models require retraining for each distribution. There's a need for a "train-once, infer-anywhere" approach that generalizes across distributions.

Method: DiScoFormer uses an equivariant Transformer architecture that maps i.i.d. samples to both density values and score vectors. The model learns multi-scale kernel-like behaviors through self-attention heads, which are proven to recover normalized KDE.

Result: The model converges faster and achieves higher precision than KDE for density estimation, and provides high-fidelity plug-in score oracle for various applications including score-debiased KDE, Fisher information computation, and Fokker-Planck-type PDEs.

Conclusion: DiScoFormer offers a unified, generalizable approach to density and score estimation that bridges classical kernel methods with modern neural architectures, enabling efficient inference across diverse distributions without retraining.

Abstract: Estimating probability density and its score from samples remains a core problem in generative modeling, Bayesian inference, and kinetic theory. Existing methods are bifurcated: classical kernel density estimators (KDE) generalize across distributions but suffer from the curse of dimensionality, while modern neural score models achieve high precision but require retraining for every target distribution. We introduce DiScoFormer (Density and Score Transformer), a ``train-once, infer-anywhere" equivariant Transformer that maps i.i.d. samples to both density values and score vectors, generalizing across distributions and sample sizes. Analytically, we prove that self-attention can recover normalized KDE, establishing it as a functional generalization of kernel methods; empirically, individual attention heads learn multi-scale, kernel-like behaviors. The model converges faster and achieves higher precision than KDE for density estimation, and provides a high-fidelity plug-in score oracle for score-debiased KDE, Fisher information computation, and Fokker-Planck-type PDEs.

[770] Spectral Text Fusion: A Frequency-Aware Approach to Multimodal Time-Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

Main category: cs.LG

TL;DR: SpecTF: A multimodal time series forecasting framework that integrates textual context with time series data in the frequency domain using spectral decomposition and cross-attention.

Details

Motivation: Existing multimodal time series forecasting methods align textual features with time-series patterns step-by-step, but neglect multiscale temporal influences of contextual information like time-series cycles and dynamic shifts. There's a mismatch between local alignment and global textual context.

Method: Proposes SpecTF framework that extracts textual embeddings, projects them into frequency domain, fuses them with time series’ spectral components using lightweight cross-attention mechanism, adaptively reweights frequency bands based on textual relevance, then maps back to temporal domain for predictions.

Result: SpecTF significantly outperforms state-of-the-art models across diverse multi-modal time series datasets while utilizing considerably fewer parameters.

Conclusion: Frequency domain integration of textual context with time series data through spectral decomposition and cross-attention is an effective approach for multimodal time series forecasting.

Abstract: Multimodal time series forecasting is crucial in real-world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time-series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time-series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short-term changes and long-term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series’ spectral components using a lightweight cross-attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state-of-the-art models across diverse multi-modal time series datasets while utilizing considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.

[771] Moirai 2.0: When Less Is More for Time Series Forecasting

Chenghao Liu, Taha Aksu, Juncheng Liu, Xu Liu, Hanshu Yan, Quang Pham, Silvio Savarese, Doyen Sahoo, Caiming Xiong, Junnan Li

Main category: cs.LG

TL;DR: Moirai 2.0 is a decoder-only time-series foundation model with quantile forecasting and multi-token prediction that achieves top performance on time-series benchmarks while being significantly more efficient than previous versions.

Details

Motivation: To create a more efficient and accurate time-series foundation model that simplifies the architecture while improving probabilistic forecasting capabilities and inference speed.

Method: Uses decoder-only architecture with quantile forecasting and multi-token prediction, replacing Moirai 1.0’s masked-encoder training, multi-patch inputs, and mixture-distribution outputs with simpler single-patch design and quantile loss.

Result: Achieves top performance on Gift-Eval benchmark, outperforms larger models from same family, is twice as fast and 30x smaller than Moirai 1.0-Large while performing better, with decoder-only backbone and recursive multi-quantile decoding contributing most to gains.

Conclusion: Moirai 2.0 demonstrates strong trade-off between accuracy, speed, and model size, though performance plateaus with increasing parameters and declines at longer horizons, motivating future work on data scaling and long-horizon modeling.

Abstract: We introduce Moirai 2.0, a decoder-only time-series foundation model trained on a new corpus of 36M series. The model adopts quantile forecasting and multi-token prediction, improving both probabilistic accuracy and inference efficiency. On the Gift-Eval benchmark, it ranks among the top pretrained models while achieving a strong trade-off between accuracy, speed, and model size. Compared to Moirai 1.0, Moirai 2.0 replaces masked-encoder training, multi-patch inputs, and mixture-distribution outputs with a simpler decoder-only architecture, single patch, and quantile loss. Ablation studies isolate these changes – showing that the decoder-only backbone along with recursive multi-quantile decoding contribute most to the gains. Additional experiments show that Moirai 2.0 outperforms larger models from the same family and exhibits robust domain-level results. In terms of efficiency and model size, Moirai 2.0 is twice as fast and thirty times smaller than its prior best version, Moirai 1.0-Large, while also performing better. Model performance plateaus with increasing parameter count and declines at longer horizons, motivating future work on data scaling and long-horizon modeling. We release code and evaluation details to support further research.

[772] IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination

Yuanshuai Li, Yuping Yan, Jirui Han, Fei Ming, Lingjuan Lv, Yaochu Jin

Main category: cs.LG

TL;DR: IRIS proposes an implicit reward-guided internal sifting method to reduce hallucinations in multimodal LLMs by using continuous implicit rewards in log-probability space to capture modal conflicts, eliminating need for external evaluators.

Details

Motivation: Current DPO approaches for MLLM alignment rely on costly external evaluators for scoring/rewriting, causing off-policy learnability gaps and discretization loss. External feedback overlooks fine-grained conflicts between modalities that cause hallucinations during generation.

Method: IRIS uses continuous implicit rewards in native log-probability space to preserve full information density and capture internal modal competition. It employs on-policy paradigm with self-generated preference pairs, sifting them based on multimodal implicit rewards to directly resolve modal conflicts.

Result: Achieves competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring external feedback during preference alignment.

Conclusion: IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations by leveraging internal state information and eliminating dependency on external evaluators.

Abstract: Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, incurring off-policy learnability gaps and discretization loss. Due to the lack of access to internal states, such feedback overlooks the fine-grained conflicts between different modalities that lead to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by utilizing self-generated preference pairs. By sifting these pairs based on multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations.

[773] Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

Main category: cs.LG

TL;DR: ConGLUDe is a unified contrastive geometric model that bridges structure-based and ligand-based drug design by jointly training on protein-ligand complexes and bioactivity data, enabling virtual screening, target fishing, and ligand-conditioned pocket prediction.

Details

Motivation: Traditional computational drug design uses separate structure-based and ligand-based approaches with disjoint data sources, limiting their joint application at scale. There's a need to unify these paradigms to leverage both structural and chemical information more effectively.

Method: ConGLUDe uses a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites, coupled with a fast ligand encoder. It employs contrastive learning to align ligands with both global protein representations and multiple candidate binding sites, eliminating the need for pre-defined pockets.

Result: Achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on target fishing tasks, and demonstrates state-of-the-art ligand-conditioned pocket selection across diverse benchmarks.

Conclusion: ConGLUDe demonstrates the advantages of unified structure-ligand training and represents a step toward general-purpose foundation models for drug discovery by bridging traditionally separate computational drug design paradigms.

Abstract: Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

[774] Time2Vec Transformer for Robust Gesture Recognition from Low-Density sEMG

Blagoj Hristov, Hristijan Gjoreski, Vesna Ojleska Latkoska, Gorjan Nadzinski

Main category: cs.LG

TL;DR: A novel deep learning framework using hybrid Transformer with Time2Vec temporal embeddings achieves state-of-the-art myoelectric prosthesis control with minimal 2-channel sEMG sensors, enabling cost-effective prosthetic interfaces.

Details

Motivation: Current myoelectric prosthesis control relies on complex, dense multi-sensor arrays that limit consumer accessibility. There's a need for data-efficient approaches that can achieve precise control using minimal sensor hardware.

Method: Hybrid Transformer optimized for sparse 2-channel sEMG with Time2Vec learnable temporal embeddings to capture stochastic temporal warping. Uses normalized additive fusion to align spatial and temporal feature distributions, and two-stage curriculum learning for robust feature extraction despite data scarcity.

Result: Achieves state-of-the-art multi-subject F1-score of 95.7% ± 0.20% for 10-class movement set, outperforming standard Transformer and CNN-LSTM models. Rapid calibration with only two trials per gesture recovers performance from 21.0% to 96.9% for unseen subjects.

Conclusion: High-fidelity temporal embeddings can compensate for low spatial resolution, challenging the necessity of high-density sensing. The framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces with rapid personalization.

Abstract: Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% $\pm$ 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% $\pm$ 2.98% to 96.9% $\pm$ 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.

[775] Sample-Near-Optimal Agnostic Boosting with Improved Running Time

Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice

Main category: cs.LG

TL;DR: First polynomial-time agnostic boosting algorithm with near-optimal sample complexity, addressing the computational efficiency gap in agnostic learning theory.

Details

Motivation: While boosting is well-understood in classic settings, agnostic boosting (with no data assumptions) has been less studied. Recent work settled sample complexity but with exponential-time algorithms, creating a need for efficient implementations.

Method: Proposes a new agnostic boosting algorithm that achieves near-optimal sample complexity while running in polynomial time relative to sample size (with other parameters fixed).

Result: Develops the first polynomial-time agnostic boosting algorithm with near-optimal sample complexity, solving the computational efficiency problem in agnostic learning theory.

Conclusion: This work bridges the gap between theoretical sample complexity bounds and practical implementation by providing an efficient agnostic boosting algorithm.

Abstract: Boosting is a powerful method that turns weak learners, which perform only slightly better than random guessing, into strong learners with high accuracy. While boosting is well understood in the classic setting, it is less so in the agnostic case, where no assumptions are made about the data. Indeed, only recently was the sample complexity of agnostic boosting nearly settled arXiv:2503.09384, but the known algorithm achieving this bound has exponential running time. In this work, we propose the first agnostic boosting algorithm with near-optimal sample complexity, running in time polynomial in the sample size when considering the other parameters of the problem fixed.

[776] Learning from Synthetic Data: Limitations of ERM

Kareem Amin, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii

Main category: cs.LG

TL;DR: Theoretical analysis of learning from mixed natural and LLM-generated synthetic data, showing ERM limitations and proposing better algorithms for mean estimation and PAC learning.

Details

Motivation: The proliferation of LLM-generated synthetic content contaminates natural datasets, raising fundamental questions about learning theory in this mixed-data setting where algorithms can't distinguish between natural and synthetic examples.

Method: Models learning as sequence of tasks with mixed natural/synthetic data, analyzes ERM performance for mean estimation and PAC learning, proposes alternative algorithms with non-uniform weighting across data generations.

Result: ERM converges to true mean but is outperformed by weighted algorithms; for PAC learning, ERM may not converge to true concept (echoing model collapse), but algorithms exist that can learn correct hypothesis despite arbitrary contamination.

Conclusion: Standard ERM is suboptimal for learning from LLM-contaminated data; better algorithms with strategic weighting can overcome synthetic data contamination and achieve reliable learning.

Abstract: The prevalence and low cost of LLMs have led to a rise of synthetic content. From review sites to court documents, “natural” content has been contaminated by data points that appear similar to natural data, but are in fact LLM-generated. In this work we revisit fundamental learning theory questions in this, now ubiquitous, setting. We model this scenario as a sequence of learning tasks where the input is a mix of natural and synthetic data, and the learning algorithms are oblivious to the origin of any individual example. We study the possibilities and limitations of ERM in this setting. For the problem of estimating the mean of an arbitrary $d$-dimensional distribution, we find that while ERM converges to the true mean, it is outperformed by an algorithm that assigns non-uniform weights to examples from different generations of data. For the PAC learning setting, the disparity is even more stark. We find that ERM does not always converge to the true concept, echoing the model collapse literature. However, we show there are algorithms capable of learning the correct hypothesis for arbitrary VC classes and arbitrary amounts of contamination.

[777] DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen

Main category: cs.LG

TL;DR: DCoPilot: A hybrid framework using LLMs and hypernetworks for generative control policies in dynamic data centers, enabling automatic adaptation to changing workloads and SLAs.

Details

Motivation: Data centers with AI workloads need minute-level adaptation for safety and efficiency, but manual DRL agent design can't keep pace with frequent dynamics shifts and SLA changes, risking service outages.

Method: Combines LLM for symbolic generation of structured reward forms and hypernetwork for parametric generation of policy weights through three phases: simulation scale-up, meta policy distillation, and online adaptation.

Result: Achieves near-zero constraint violations and outperforms all baselines across five control task families spanning diverse DC components and specification variations.

Conclusion: DCoPilot bridges the specification-to-policy gap in dynamic data center operation through generative control policies, with LLM-based reward generation enabling stable hypernetwork convergence.

Abstract: Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.

[778] SEAFormer: A Spatial Proximity and Edge-Aware Transformer for Real-World Vehicle Routing Problems

Saeed Nasehi Basharzad, Farhana Choudhury, Egemen Tanin

Main category: cs.LG

TL;DR: SEAFormer is a novel transformer architecture for solving complex Real-world Vehicle Routing Problems (RWVRPs) that incorporates both node-level and edge-level information through clustered proximity attention and edge-aware modules, enabling efficient training on large-scale instances up to 1,000+ nodes.

Details

Motivation: Existing neural methods for Vehicle Routing Problems struggle with Real-world VRPs (RWVRPs) because they overlook sequence dependencies and underutilize edge-level information, which are crucial for handling complex constraints like delivery time windows, replenishment stops, asymmetric travel costs, etc.

Method: SEAFormer introduces two key innovations: 1) Clustered Proximity Attention (CPA) that uses locality-aware clustering to reduce attention complexity from O(n²) to O(n) while preserving global perspective, and 2) a lightweight edge-aware module that captures pairwise features through residual fusion for effective edge-based information incorporation.

Result: SEAFormer achieves superior results over state-of-the-art methods across four RWVRP variants at various scales. It is the first neural method to effectively solve 1,000+ node RWVRPs while also achieving superior performance on classic VRPs.

Conclusion: SEAFormer provides a versatile solution for both research benchmarks and real-world applications, demonstrating that transformer architectures can be effectively adapted to solve complex combinatorial optimization problems with real-world constraints.

Abstract: Real-world Vehicle Routing Problems (RWVRPs) require solving complex, sequence-dependent challenges at scale with constraints such as delivery time window, replenishment or recharging stops, asymmetric travel cost, etc. While recent neural methods achieve strong results on large-scale classical VRP benchmarks, they struggle to address RWVRPs because their strategies overlook sequence dependencies and underutilize edge-level information, which are precisely the characteristics that define the complexity of RWVRPs. We present SEAFormer, a novel transformer that incorporates both node-level and edge-level information in decision-making through two key innovations. First, our Clustered Proximity Attention (CPA) exploits locality-aware clustering to reduce the complexity of attention from $O(n^2)$ to $O(n)$ while preserving global perspective, allowing SEAFormer to efficiently train on large instances. Second, our lightweight edge-aware module captures pairwise features through residual fusion, enabling effective incorporation of edge-based information and faster convergence. Extensive experiments across four RWVRP variants with various scales demonstrate that SEAFormer achieves superior results over state-of-the-art methods. Notably, SEAFormer is the first neural method to solve 1,000+ node RWVRPs effectively, while also achieving superior performance on classic VRPs, making it a versatile solution for both research benchmarks and real-world applications.

[779] SEDformer: Event-Synchronous Spiking Transformers for Irregular Telemetry Time Series Forecasting

Ziyu Zhou, Yuchen Fang, Weilin Ruan, Shiyu Wang, James Kwok, Yuxuan Liang

Main category: cs.LG

TL;DR: SEDformer: A spiking transformer for irregular multivariate time series forecasting that leverages the Sparsity-Event Duality property using event-driven spiking neural networks.

Details

Motivation: Existing methods for irregular multivariate time series (IMTS) forecasting ignore the Sparsity-Event Duality property, where long sparse periods are punctuated by dense event bursts. Traditional approaches use padding or disrupt temporal continuity, motivating a more faithful modeling paradigm.

Method: SEDformer combines: (1) SED-based Spike Encoder using Event-Aligned LIF neurons to convert observations into event-synchronous spikes, (2) Event-Preserving Temporal Downsampling to compress long gaps while retaining salient events, and (3) SED-based Spike Transformer blocks with membrane-based linear attention for intra-series dependency modeling.

Result: SEDformer achieves state-of-the-art forecasting accuracy on public telemetry IMTS datasets while significantly reducing energy and memory usage compared to existing methods.

Conclusion: The paper presents a natural and efficient approach for IMTS forecasting that aligns with the Sparsity-Event Duality property using spiking neural networks, offering both accuracy and computational efficiency.

Abstract: Telemetry streams from large-scale Internet-connected systems (e.g., IoT deployments and online platforms) naturally form an irregular multivariate time series (IMTS) whose accurate forecasting is operationally vital. A closer examination reveals a defining Sparsity-Event Duality (SED) property of IMTS, i.e., long stretches with sparse or no observations are punctuated by short, dense bursts where most semantic events (observations) occur. However, existing Graph- and Transformer-based forecasters ignore SED: pre-alignment to uniform grids with heavy padding violates sparsity by inflating sequences and forcing computation at non-informative steps, while relational recasting weakens event semantics by disrupting local temporal continuity. These limitations motivate a more faithful and natural modeling paradigm for IMTS that aligns with its SED property. We find that Spiking Neural Networks meet this requirement, as they communicate via sparse binary spikes and update in an event-driven manner, aligning naturally with the SED nature of IMTS. Therefore, we present SEDformer, an SED-enhanced Spiking Transformer for telemetry IMTS forecasting that couples: (1) a SED-based Spike Encoder converts raw observations into event synchronous spikes using an Event-Aligned LIF neuron, (2) an Event-Preserving Temporal Downsampling module compresses long gaps while retaining salient firings and (3) a stack of SED-based Spike Transformer blocks enable intra-series dependency modeling with a membrane-based linear attention driven by EA-LIF spiking features. Experiments on public telemetry IMTS datasets show that SEDformer attains state-of-the-art forecasting accuracy while reducing energy and memory usage, providing a natural and efficient path for modeling IMTS.

[780] GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Zhiheng Jiang, Yunzhe Wang, Ryan Marr, Ellen Novoseller, Benjamin T. Files, Volkan Ustun

Main category: cs.LG

TL;DR: Introduces GraphAllocBench, a graph-based resource allocation benchmark for evaluating preference-conditioned policy learning in multi-objective reinforcement learning, with new metrics and scalable environment.

Details

Motivation: Existing benchmarks for Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) are limited to toy tasks and fixed environments, lacking realism and scalability needed for evaluating complex real-world applications.

Method: Proposes GraphAllocBench built on CityPlannerEnv, a graph-based resource allocation sandbox environment. Introduces two new evaluation metrics: Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) to complement hypervolume metric. Evaluates with MLPs and graph-aware models.

Result: GraphAllocBench exposes limitations of existing MORL approaches and demonstrates the potential of graph-based methods like GNNs for complex combinatorial allocation tasks. The benchmark enables flexible variation of objectives, preferences, and allocation rules.

Conclusion: GraphAllocBench provides a versatile and extensible benchmark for advancing PCPL in MORL, particularly for complex, high-dimensional combinatorial allocation problems, with potential applications in city management and resource allocation domains.

Abstract: Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics – Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) – that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks (GNNs) in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: https://github.com/jzh001/GraphAllocBench

[781] The Powers of Precision: Structure-Informed Detection in Complex Systems – From Customer Churn to Seizure Onset

Augusto Santos, Teresa Santos, Catarina Rodrigues, José M. F. Moura

Main category: cs.LG

TL;DR: A machine learning method for early detection of emergent phenomena by learning optimal feature representations from covariance/precision matrix powers, with applications to seizure detection and churn prediction.

Details

Motivation: Emergent phenomena like epileptic seizures, customer churn, or pandemic outbreaks arise from hidden causal interactions in complex systems. The core challenge is unveiling latent causal structure despite unknown and partially observed data-generating processes.

Method: Learns optimal feature representation from a one-parameter family of estimators (powers of empirical covariance or precision matrix) to tune into underlying structure driving critical events. A supervised learning module then classifies the learned representation.

Result: Achieves competitive results on seizure detection and churn prediction. The optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, reconciling predictive performance with interpretable statistical structure.

Conclusion: Proposes a principled approach for early detection of emergent phenomena that addresses the challenge of unknown causal structure in partially observed systems, with demonstrated effectiveness in real-world applications.

Abstract: Emergent phenomena – onset of epileptic seizures, sudden customer churn, or pandemic outbreaks – often arise from hidden causal interactions in complex systems. We propose a machine learning method for their early detection that addresses a core challenge: unveiling and harnessing a system’s latent causal structure despite the data-generating process being unknown and partially observed. The method learns an optimal feature representation from a one-parameter family of estimators – powers of the empirical covariance or precision matrix – offering a principled way to tune in to the underlying structure driving the emergence of critical events. A supervised learning module then classifies the learned representation. We prove structural consistency of the family and demonstrate the empirical soundness of our approach on seizure detection and churn prediction, attaining competitive results in both. Beyond prediction, and toward explainability, we ascertain that the optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, thus reconciling predictive performance with interpretable statistical structure.

[782] Transferable Graph Condensation from the Causal Perspective

Huaming Du, Yijie Huang, Su Yao, Yiying Wang, Yueyang Zhou, Jingwen Yang, Jinshi Zhang, Han Ji, Yu Zhao, Guisong Liu, Hegui Zhang, Carl Yang, Gang Kou

Main category: cs.LG

TL;DR: TGCC is a transferable graph dataset condensation method that uses causal invariance to compress large graph datasets while maintaining performance across different tasks and domains.

Details

Motivation: Existing graph dataset condensation methods require downstream applications to match the original dataset and task, failing in cross-task and cross-domain scenarios. There's a need for condensation methods that preserve transferable knowledge across different applications.

Method: TGCC extracts domain causal-invariant features using causal interventions in the spatial domain, performs enhanced condensation operations to capture structural and feature information, and injects causal-invariant features into the condensed graph through spectral-domain enhanced contrastive learning.

Result: TGCC achieves up to 13.41% improvement in cross-task and cross-domain scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in single dataset/task scenarios.

Conclusion: TGCC provides an effective and transferable graph dataset condensation method that preserves causal information and performs well in both single-task and cross-domain scenarios.

Abstract: The increasing scale of graph datasets has significantly improved the performance of graph representation learning methods, but it has also introduced substantial training challenges. Graph dataset condensation techniques have emerged to compress large datasets into smaller yet information-rich datasets, while maintaining similar test performance. However, these methods strictly require downstream applications to match the original dataset and task, which often fails in cross-task and cross-domain scenarios. To address these challenges, we propose a novel causal-invariance-based and transferable graph dataset condensation method, named \textbf{TGCC}, providing effective and transferable condensed datasets. Specifically, to preserve domain-invariant knowledge, we first extract domain causal-invariant features from the spatial domain of the graph using causal interventions. Then, to fully capture the structural and feature information of the original graph, we perform enhanced condensation operations. Finally, through spectral-domain enhanced contrastive learning, we inject the causal-invariant features into the condensed graph, ensuring that the compressed graph retains the causal information of the original graph. Experimental results on five public datasets and our novel \textbf{FinReport} dataset demonstrate that TGCC achieves up to a 13.41% improvement in cross-task and cross-domain complex scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.

[783] Convex Loss Functions for Support Vector Machines (SVMs) and Neural Networks

Filippo Portera

Main category: cs.LG

TL;DR: A new convex loss function for SVMs that incorporates pattern correlations to improve generalization performance in classification and regression tasks.

Details

Motivation: To develop a more effective SVM loss function that leverages pattern correlations to enhance generalization capabilities beyond standard SVM losses.

Method: Proposes a novel convex loss function for SVMs with mathematical derivation of dual problems, tested on small datasets due to SVM scalability limitations, and extended to shallow and deep neural networks.

Result: Achieved comparable or superior performance with up to 2.0% improvement in F1 scores for classification and 1.0% reduction in MSE for regression across various datasets compared to standard losses.

Conclusion: The correlation-based loss function consistently matches or outperforms standard losses, warranting further study with neural network architectures for broader applications.

Abstract: We propose a new convex loss for Support Vector Machines, both for the binary classification and for the regression models. Therefore, we show the mathematical derivation of the dual problems and we experiment with them on several small datasets. The minimal dimension of those datasets is due to the difficult scalability of the SVM method to bigger instances. This preliminary study should prove that using pattern correlations inside the loss function could enhance the generalisation performances. Our method consistently achieved comparable or superior performance, with improvements of up to 2.0% in F1 scores for classification tasks and 1.0% reduction in Mean Squared Error (MSE) for regression tasks across various datasets, compared to standard losses. Coherently, results show that generalisation measures are never worse than the standard losses and several times they are better. In our opinion, it should be considered a careful study of this loss, coupled with shallow and deep neural networks. In fact, we present some novel results obtained with those architectures.

[784] Scalable Linearized Laplace Approximation via Surrogate Neural Kernel

Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato

Main category: cs.LG

TL;DR: A scalable method to approximate Linearized Laplace Approximation kernels using a surrogate DNN that learns compact feature representations replicating the Neural Tangent Kernel, enabling efficient uncertainty estimation on large pre-trained models.

Details

Motivation: The Linearized Laplace Approximation (LLA) requires computing large Jacobians which is computationally expensive for large-scale pre-trained DNNs. There's a need for scalable methods to approximate LLA kernels for uncertainty estimation without the computational burden of full Jacobian computations.

Method: Uses a surrogate deep neural network that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). Training relies solely on efficient Jacobian-vector products, avoiding the need to compute large Jacobians directly.

Result: Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Biasing the learned kernel significantly enhances out-of-distribution detection performance.

Conclusion: The proposed method enables scalable uncertainty estimation on large pre-trained DNNs and demonstrates that biasing the learned kernel can improve out-of-distribution detection, suggesting benefits for finding better kernels than NTK in LLA context.

Abstract: We introduce a scalable method to approximate the kernel of the Linearized Laplace Approximation (LLA). For this, we use a surrogate deep neural network (DNN) that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). This avoids the need to compute large Jacobians. Training relies solely on efficient Jacobian-vector products, allowing to compute predictive uncertainty on large-scale pre-trained DNNs. Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Notwithstanding, biasing the learned kernel significantly enhances out-of-distribution detection. This remarks the benefits of the proposed method for finding better kernels than the NTK in the context of LLA to compute prediction uncertainty given a pre-trained DNN.

[785] Non-Intrusive Graph-Based Bot Detection for E-Commerce Using Inductive Graph Neural Networks

Sichen Zhao, Zhiming Xue, Yalun Qi, Xianling Zeng, Zihan Yu

Main category: cs.LG

TL;DR: Graph-based bot detection framework for e-commerce using inductive graph neural networks to identify automated activity through user session behavior modeling.

Details

Motivation: Malicious bots are increasingly sophisticated in e-commerce, evading traditional detection methods like IP blacklists and CAPTCHAs through proxies, botnets, and AI-assisted strategies, requiring more advanced detection approaches.

Method: Non-intrusive graph-based framework that models user session behavior as graphs, applies inductive graph neural networks for classification, capturing relational structure and behavioral semantics without client-side instrumentation.

Result: Outperforms session-level multilayer perceptron baseline in AUC and F1 scores on real-world e-commerce traffic, remains robust under adversarial perturbations, and generalizes well to unseen sessions and URLs.

Conclusion: The graph-based framework effectively detects subtle automated activity, integrates with existing systems, supports real-time inference and incremental updates, making it practical for e-commerce security deployments.

Abstract: Malicious bots pose a growing threat to e-commerce platforms by scraping data, hoarding inventory, and perpetrating fraud. Traditional bot mitigation techniques, including IP blacklists and CAPTCHA-based challenges, are increasingly ineffective or intrusive, as modern bots leverage proxies, botnets, and AI-assisted evasion strategies. This work proposes a non-intrusive graph-based bot detection framework for e-commerce that models user session behavior through a graph representation and applies an inductive graph neural network for classification. The approach captures both relational structure and behavioral semantics, enabling accurate identification of subtle automated activity that evades feature-based methods. Experiments on real-world e-commerce traffic demonstrate that the proposed inductive graph model outperforms a strong session-level multilayer perceptron baseline in terms of AUC and F1 score. Additional adversarial perturbation and cold-start simulations show that the model remains robust under moderate graph modifications and generalizes effectively to previously unseen sessions and URLs. The proposed framework is deployment-friendly, integrates with existing systems without client-side instrumentation, and supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments.

[786] OD-DEAL: Dynamic Expert-Guided Adversarial Learning with Online Decomposition for Scalable Capacitated Vehicle Routing

Dongbin Jiao, Zisheng Chen, Xianyi Wang, Jintao Shi, Shengcai Liu, Shi Yan

Main category: cs.LG

TL;DR: OD-DEAL is an adversarial learning framework that integrates hybrid genetic search with online barycenter clustering decomposition and uses knowledge distillation to solve large-scale capacitated vehicle routing problems with near-constant neural scaling.

Details

Motivation: Current approaches for large-scale capacitated vehicle routing problems face limitations: traditional heuristics are computationally complex, while neural solvers have poor generalization on massive graphs and cannot achieve real-time inference required for dynamic deployment.

Method: OD-DEAL combines hybrid genetic search (HGS) with online barycenter clustering (BCC) decomposition in an adversarial learning framework. It uses knowledge distillation to transfer expert heuristic behavior to a graph attention network (GAT)-based generative policy trained through a minimax game, where divide-and-conquer strategies are distilled into dense surrogate rewards.

Result: OD-DEAL achieves state-of-the-art real-time CVRP performance, solving 10,000-node instances with near-constant neural scaling, enabling sub-second, heuristic-quality inference suitable for dynamic large-scale deployment.

Conclusion: The proposed framework successfully addresses the scalability challenges in large-scale CVRP by combining adversarial learning with expert knowledge distillation, achieving both high solution quality and efficient real-time inference.

Abstract: Solving large-scale capacitated vehicle routing problems (CVRP) is hindered by the high complexity of heuristics and the limited generalization of neural solvers on massive graphs. We propose OD-DEAL, an adversarial learning framework that tightly integrates hybrid genetic search (HGS) and online barycenter clustering (BCC) decomposition, and leverages high-fidelity knowledge distillation to transfer expert heuristic behavior. OD-DEAL trains a graph attention network (GAT)-based generative policy through a minimax game, in which divide-and-conquer strategies from a hybrid expert are distilled into dense surrogate rewards. This enables high-quality, clustering-free inference on large-scale instances. Empirical results demonstrate that OD-DEAL achieves state-of-the-art (SOTA) real-time CVRP performance, solving 10000-node instances with near-constant neural scaling. This uniquely enables the sub-second, heuristic-quality inference required for dynamic large-scale deployment.

[787] NEST: Nested Event Stream Transformer for Sequences of Multisets

Minghui Sun, Haoyu Gong, Xingyu You, Jillian Hurst, Benjamin Goldstein, Matthew Engelhard

Main category: cs.LG

TL;DR: NEST is a foundation model for hierarchical event streams that preserves multiset structure, improving computational efficiency and representation quality over flattened approaches.

Details

Motivation: Existing foundation models flatten hierarchical event stream data (like EHRs with clinical encounters) into 1D sequences, causing computational inefficiency from dense attention, learning spurious within-set relationships, and poor set-level representations from heuristic pooling.

Method: Introduces Nested Event Stream Transformer (NEST) that preserves original hierarchy in architecture, and Masked Set Modeling (MSM) for efficient pretraining that promotes set-level representation learning.

Result: Experiments on real-world multiset sequence data show NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.

Conclusion: Preserving hierarchy in foundation model architecture provides useful inductive bias that improves computational efficiency and representation quality for event stream data with multiset structure.

Abstract: Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.

[788] Learning Heat-based Equations in Self-similar variables

Shihao Wang, Qipeng Qian, Jingquan Wang

Main category: cs.LG

TL;DR: SSV training framework improves neural operator learning for heat-based equations by using self-similar coordinates, leading to better long-term extrapolation and stability compared to physical coordinates.

Details

Motivation: To improve neural operator learning for heat-based equations by leveraging mathematical structure through self-similar variables, addressing limitations in long-term extrapolation and stability.

Method: Developed SSV training framework compatible with standard neural operator training, applied to 2D incompressible Navier-Stokes and 1D viscous Burgers equations, comparing physical vs self-similar coordinates using MLPs and factorized fully connected networks.

Result: SSV-trained networks consistently delivered substantially more accurate and stable extrapolation beyond training window and better captured qualitative long-time trends across both systems and architectures.

Conclusion: Self-similar coordinates provide mathematically motivated inductive bias for learning long-time dynamics of heat-based equations, improving neural operator performance.

Abstract: We study solution learning for heat-based equations in self-similar variables (SSV). We develop an SSV training framework compatible with standard neural-operator training. We instantiate this framework on the two-dimensional incompressible Navier-Stokes equations and the one-dimensional viscous Burgers equation, and perform controlled comparisons between models trained in physical coordinates and in the corresponding self-similar coordinates using two simple fully connected architectures (standard multilayer perceptrons and a factorized fully connected network). Across both systems and both architectures, SSV-trained networks consistently deliver substantially more accurate and stable extrapolation beyond the training window and better capture qualitative long-time trends. These results suggest that self-similar coordinates provide a mathematically motivated inductive bias for learning the long-time dynamics of heat-based equations.

[789] Best-of-Both-Worlds for Heavy-Tailed Markov Decision Processes

Yu Chen, Yuhao Liu, Jiatai Huang, Yihan Du, Longbo Huang

Main category: cs.LG

TL;DR: HT-FTRL algorithms for episodic Markov Decision Processes with heavy-tailed feedback achieve Best-of-Both-Worlds guarantees: instance-independent regret in adversarial environments and logarithmic regret in stochastic environments.

Details

Motivation: Existing approaches for heavy-tailed MDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes, creating a need for algorithms that perform well in both settings.

Method: Propose HT-FTRL-OM (known transitions) using FTRL over occupancy measures with novel skipping loss estimators, and HT-FTRL-UOB (unknown transitions) using pessimistic skipping loss estimators with local control mechanisms.

Result: HT-FTRL-OM achieves Õ(T^{1/α}) regret in adversarial regimes and O(log T) in stochastic regimes; HT-FTRL-UOB achieves Õ(T^{1/α} + √T) in adversarial and O(log²T) in stochastic regimes.

Conclusion: The proposed algorithms provide Best-of-Both-Worlds guarantees for heavy-tailed MDPs through novel technical insights including local control mechanisms and suboptimal-mass propagation principles.

Abstract: We investigate episodic Markov Decision Processes with heavy-tailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{O}(T^{1/α})$ regret bound in adversarial regimes and a $O(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{O}(T^{1/α} + \sqrt{T})$ regret in adversarial regimes and a $O(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

[790] COMET: Codebook-based Online-adaptive Multi-scale Embedding for Time-series Anomaly Detection

Jinwoo Park, Hyeongwon Kang, Seung Hun Han, Pilsung Kang

Main category: cs.LG

TL;DR: COMET is a novel time series anomaly detection method using multi-scale patch encoding, vector-quantized coreset learning, and online codebook adaptation to capture temporal dependencies and adapt to distribution shifts.

Details

Motivation: Current time series anomaly detection methods struggle with capturing temporal dependencies and multivariate correlations at patch level, rely on single-scale patterns limiting detection across temporal ranges, and are vulnerable to distribution shifts at inference time due to focus on normal data representations.

Method: Three key components: (1) Multi-scale Patch Encoding to capture temporal dependencies and inter-variable correlations across multiple patch scales; (2) Vector-Quantized Coreset that learns representative normal patterns via codebook and detects anomalies with dual-score combining quantization error and memory distance; (3) Online Codebook Adaptation that generates pseudo-labels based on codebook entries and dynamically adapts the model at inference through contrastive learning.

Result: Experiments on five benchmark datasets show COMET achieves best performance in 36 out of 45 evaluation metrics, validating its effectiveness across diverse environments.

Conclusion: COMET effectively addresses limitations in time series anomaly detection by capturing multi-scale temporal dependencies, learning representative normal patterns, and adapting to distribution shifts through online learning.

Abstract: Time series anomaly detection is a critical task across various industrial domains. However, capturing temporal dependencies and multivariate correlations within patch-level representation learning remains underexplored, and reliance on single-scale patterns limits the detection of anomalies across different temporal ranges. Furthermore, focusing on normal data representations makes models vulnerable to distribution shifts at inference time. To address these limitations, we propose Codebook-based Online-adaptive Multi-scale Embedding for Time-series anomaly detection (COMET), which consists of three key components: (1) Multi-scale Patch Encoding captures temporal dependencies and inter-variable correlations across multiple patch scales. (2) Vector-Quantized Coreset learns representative normal patterns via codebook and detects anomalies with a dual-score combining quantization error and memory distance. (3) Online Codebook Adaptation generates pseudo-labels based on codebook entries and dynamically adapts the model at inference through contrastive learning. Experiments on five benchmark datasets demonstrate that COMET achieves the best performance in 36 out of 45 evaluation metrics, validating its effectiveness across diverse environments.

[791] MGKAN: Predicting Asymmetric Drug-Drug Interactions via a Multimodal Graph Kolmogorov-Arnold Network

Kunyi Fan, Mengjie Chen, Longlong Li, Cunquan Qu

Main category: cs.LG

TL;DR: MGKAN is a Graph Kolmogorov-Arnold Network for predicting asymmetric drug-drug interactions using learnable basis functions and multi-view network integration.

Details

Motivation: Existing GNN models for DDI prediction rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns in drug interactions.

Method: Proposes MGKAN with KAN-driven basis functions instead of MLPs, integrates three network views (asymmetric DDI, co-interaction, biochemical similarity), uses role-specific embeddings for directional semantics, and employs a fusion module with linear attention and nonlinear transformation.

Result: Outperforms seven state-of-the-art baselines on two benchmark datasets, with ablation studies and case studies confirming predictive accuracy and effectiveness in modeling directional drug effects.

Conclusion: MGKAN provides a more expressive and nonlinear approach to DDI prediction by capturing pharmacological dependencies through multi-view network integration and learnable basis functions.

Abstract: Predicting drug-drug interactions (DDIs) is essential for safe pharmacological treatments. Previous graph neural network (GNN) models leverage molecular structures and interaction networks but mostly rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns. We propose MGKAN, a Graph Kolmogorov-Arnold Network that introduces learnable basis functions into asymmetric DDI prediction. MGKAN replaces conventional MLP transformations with KAN-driven basis functions, enabling more expressive and nonlinear modeling of drug relationships. To capture pharmacological dependencies, MGKAN integrates three network views-an asymmetric DDI network, a co-interaction network, and a biochemical similarity network-with role-specific embeddings to preserve directional semantics. A fusion module combines linear attention and nonlinear transformation to enhance representational capacity. On two benchmark datasets, MGKAN outperforms seven state-of-the-art baselines. Ablation studies and case studies confirm its predictive accuracy and effectiveness in modeling directional drug effects.

[792] ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Jie Xiao, Meng Chen, Qingnan Ren, Song Jingwei, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Ween Yang, Lynn Ai, Eric Yang, Bill Shi

Main category: cs.LG

TL;DR: ECHO-2 is a distributed RL framework for post-training LLMs that enables efficient wide-area coordination between rollout generation and centralized learning by treating policy staleness as a tunable parameter.

Details

Motivation: Distributing rollout execution for RL post-training of LLMs can leverage cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination latency.

Method: Combines centralized learning with distributed rollouts, treats bounded policy staleness as user-controlled parameter, introduces overlap-based capacity model for provisioning, and uses peer-assisted pipelined broadcast with cost-aware activation of heterogeneous workers.

Result: Experiments on GRPO post-training of 4B and 8B models show ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines under real wide-area bandwidth regimes.

Conclusion: ECHO-2 provides an effective distributed RL framework for LLM post-training that addresses wide-area coordination challenges and enables practical cost-efficient scaling.

Abstract: Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

[793] SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Maksim Afanasyev, Illarion Iov

Main category: cs.LG

TL;DR: SLIME is a reference-free alignment method that stabilizes preference optimization by decoupling preference learning from generation quality through anchoring, stabilizing penalties, and dual-margin constraints.

Details

Motivation: Current direct preference optimization methods suffer from objective mismatch where optimizing relative margins between chosen and rejected responses can degrade the absolute likelihood of high-quality outputs, leading to unlearning and formatting collapse.

Method: SLIME uses a three-pronged objective: 1) anchoring term to maximize likelihood of preferred responses, 2) stabilizing penalty to prevent rejected token probabilities from collapsing to zero, and 3) dual-margin mechanism combining hard and soft constraints for precise boundary shaping.

Result: SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

Conclusion: SLIME provides a stabilized approach to preference optimization that decouples preference learning from generation quality, addressing critical limitations of existing direct preference optimization methods.

Abstract: Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response’s absolute likelihood. This can lead to unlearning, where the model degrades the probability of high-quality outputs to satisfy margin constraints, and formatting collapse caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

cs.MA

[794] Exploring Silicon-Based Societies: An Early Study of the Moltbook Agent Community

Yu-Zheng Lin, Bono Po-Jen Shih, Hsuan-Ying Alessandra Chien, Shalaka Satam, Jesus Horacio Pacheco, Sicong Shao, Soheil Salehi, Pratik Satam

Main category: cs.MA

TL;DR: Large-scale data mining study of autonomous agent societies using Moltbook platform data reveals emergent social structures from 150k+ agents, establishing data-driven silicon sociology as a framework for understanding AI agent ecosystems.

Details

Motivation: The emergence of large-scale autonomous LLM agent ecosystems requires systematic empirical frameworks beyond anecdotal observation or small-scale simulation to understand collective behavior and social structure formation.

Method: Data-driven silicon sociology framework analyzing Moltbook platform with 150k+ agents; collected 12,758 agent-authored sub-community descriptions; applied preprocessing, contextual embedding, and unsupervised clustering to uncover latent thematic patterns.

Result: Autonomous agents systematically organize collective space through reproducible patterns including human-mimetic interests, silicon-centric self-reflection, and early-stage economic/coordination behaviors, emerging directly from machine-generated data.

Conclusion: Establishes methodological foundation for data-driven silicon sociology, demonstrating data mining as powerful lens for understanding organization and evolution of large autonomous agent societies without predefined sociological taxonomies.

Abstract: The rapid emergence of autonomous large language model agents has given rise to persistent, large-scale agent ecosystems whose collective behavior cannot be adequately understood through anecdotal observation or small-scale simulation. This paper introduces data-driven silicon sociology as a systematic empirical framework for studying social structure formation among interacting artificial agents. We present a pioneering large-scale data mining investigation of an in-the-wild agent society by analyzing Moltbook, a social platform designed primarily for agent-to-agent interaction. At the time of study, Moltbook hosted over 150,000 registered autonomous agents operating across thousands of agent-created sub-communities. Using programmatic and non-intrusive data acquisition, we collected and analyzed the textual descriptions of 12,758 submolts, which represent proactive sub-community partitioning activities within the ecosystem. Treating agent-authored descriptions as first-class observational artifacts, we apply rigorous preprocessing, contextual embedding, and unsupervised clustering techniques to uncover latent patterns of thematic organization and social space structuring. The results show that autonomous agents systematically organize collective space through reproducible patterns spanning human-mimetic interests, silicon-centric self-reflection, and early-stage economic and coordination behaviors. Rather than relying on predefined sociological taxonomies, these structures emerge directly from machine-generated data traces. This work establishes a methodological foundation for data-driven silicon sociology and demonstrates that data mining techniques can provide a powerful lens for understanding the organization and evolution of large autonomous agent societies.

[795] Scaling Small Agents Through Strategy Auctions

Lisa Alazraki, William F. Shen, Yoram Bachrach, Akhil Mathur

Main category: cs.MA

TL;DR: SALE is an agent framework that uses auction-based coordination to route complex tasks to appropriate small language models, reducing reliance on large models while maintaining performance.

Details

Motivation: Small language models are cost-effective for agentic AI but struggle with complex tasks, while large models are expensive. Need a way to effectively coordinate heterogeneous agents for complex workloads without always using the largest model.

Method: Strategy Auctions for Workload Efficiency (SALE) - agents bid with short strategic plans, scored by cost-value mechanism, refined via shared auction memory. Enables per-task routing and continual self-improvement without training separate routers.

Result: Reduces reliance on largest agent by 53%, lowers overall cost by 35%, improves upon largest agent’s pass@1 with negligible overhead. Outperforms established routers that rely on task descriptions.

Conclusion: Small agents can be effectively scaled up through coordinated task allocation and test-time self-improvement. Performance gains come more from market-inspired coordination mechanisms than from ever-larger individual models.

Abstract: Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents’ performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 53%, lowers overall cost by 35%, and consistently improves upon the largest agent’s pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost – often both – underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively “scaled up” through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.

[796] Game-Theoretic and Algorithmic Analyses of Multi-Agent Routing under Crossing Costs

Tesshu Hanaka, Nikolaos Melissinos, Hirotaka Ono

Main category: cs.MA

TL;DR: Multi-agent routing framework with crossing cost metric for asynchronous settings, bridging game theory and parameterized complexity

Details

Motivation: Traditional multi-agent path finding relies on centralized control and synchronous collision avoidance, which requires strict synchronization. There's a need for frameworks that work in asynchronous, decentralized settings where conflicts are treated as costs rather than hard constraints.

Method: Introduces Multi-Agent Routing under Crossing Cost model on mixed graphs. Models setting as congestion game with non-standard cost function (crossing cost = product of agents traversing edge in opposite directions). Provides game-theoretic analysis of pure Nash equilibria and parameterized algorithms for optimization.

Result: Proves existence of pure Nash equilibria, shows equilibria can be found in polynomial time under mild conditions (general case is PLS-complete). Minimizing total crossing cost is NP-hard (generalizes Steiner Orientation). Provides parameterized algorithms yielding XP/FPT results based on parameters like number of arcs, edges, agents, and structural graph measures.

Conclusion: The framework provides new theoretical foundation for decentralized multi-agent routing, bridging equilibrium analysis and parameterized complexity to support scalable, risk-aware coordination in asynchronous settings.

Abstract: Coordinating the movement of multiple autonomous agents over a shared network is a fundamental challenge in algorithmic robotics, intelligent transportation, and distributed systems. The dominant approach, Multi-Agent Path Finding, relies on centralized control and synchronous collision avoidance, which often requires strict synchronization and guarantees of globally conflict-free execution. This paper introduces the Multi-Agent Routing under Crossing Cost model on mixed graphs, a novel framework tailored to asynchronous settings. In our model, instead of treating conflicts as hard constraints, each agent is assigned a path, and the system is evaluated through a cost function that measures potential head-on encounters. This ``crossing cost’’, which is defined as the product of the numbers of agents traversing an edge in opposite directions, quantifies the risk of congestion and delay in decentralized execution. Our contributions are both game-theoretic and algorithmic. We model the setting as a congestion game with a non-standard cost function, prove the existence of pure Nash equilibria, and analyze the dynamics leading to them. Equilibria can be found in polynomial time under mild conditions, while the general case is PLS-complete. From an optimization perspective, minimizing the total crossing cost is NP-hard, as the problem generalizes Steiner Orientation. To address this hardness barrier, we design a suite of parameterized algorithms for minimizing crossing cost, with parameters including the number of arcs, edges, agents, and structural graph measures. These yield XP or FPT results depending on the parameter, offering algorithmic strategies for structurally restricted instances. Our framework provides a new theoretical foundation for decentralized multi-agent routing, bridging equilibrium analysis and parameterized complexity to support scalable and risk-aware coordination.

[797] When Should Agents Coordinate in Differentiable Sequential Decision Problems?

Caleb Probine, Su Ann Low, David Fridovich-Keil, Ufuk Topcu

Main category: cs.MA

TL;DR: The paper explores coordination value in multi-robot motion planning, modeling coordination as a spectrum from joint optimization to Nash equilibria, with algorithms for determining when coordination is beneficial.

Details

Motivation: Multi-robot teams need coordination for effective operation, but coordination often requires costly communication. The paper aims to understand when coordination is valuable versus when agents can operate independently without significant performance loss.

Method: Models coordinated behavior as a spectrum between joint optimization (team objective) and Nash equilibria (individual optimization). Uses second-order properties of agents’ objectives to reason about coordination value in differentiable motion-planning problems. Provides algorithms that determine optimal coordination timing based on this analysis.

Result: Demonstrates that reasoning about coordination reduces to analyzing second-order properties of objectives. Provides practical algorithms for determining when teams should coordinate, potentially reducing unnecessary communication costs while maintaining performance.

Conclusion: The framework enables systematic analysis of coordination value in multi-robot systems, offering a principled approach to balance coordination benefits against communication costs in motion planning problems.

Abstract: Multi-robot teams must coordinate to operate effectively. When a team operates in an uncoordinated manner, and agents choose actions that are only individually optimal, the team’s outcome can suffer. However, in many domains, coordination requires costly communication. We explore the value of coordination in a broad class of differentiable motion-planning problems. In particular, we model coordinated behavior as a spectrum: at one extreme, agents jointly optimize a common team objective, and at the other, agents make unilaterally optimal decisions given their individual decision variables, i.e., they operate at Nash equilibria. We then demonstrate that reasoning about coordination in differentiable motion-planning problems reduces to reasoning about the second-order properties of agents’ objectives, and we provide algorithms that use this second-order reasoning to determine at which times a team of agents should coordinate.

[798] Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

Haibo Jin, Kuang Peng, Ye Yu, Xiaopeng Yuan, Haohan Wang

Main category: cs.MA

TL;DR: Agent Primitives: reusable latent building blocks for LLM-based multi-agent systems that improve efficiency and robustness through KV cache communication

Details

Motivation: Existing multi-agent systems are task-specific with manually crafted roles and prompts, leading to complexity and limited reusability. They rely on natural language communication which causes error accumulation in long-context interactions.

Method: Propose three reusable primitives (Review, Voting and Selection, Planning and Execution) that communicate via key-value cache instead of natural language. An Organizer agent selects and composes primitives using a knowledge pool of successful configurations.

Result: Primitives-based MAS improve accuracy by 12.0-16.5% over single-agent baselines, reduce token usage and inference latency by 3-4× compared to text-based MAS, with only 1.3-1.6× overhead relative to single-agent inference.

Conclusion: Agent Primitives provide a modular, efficient approach to building multi-agent systems that improves performance, reduces computational costs, and offers more stable performance across different model backbones.

Abstract: While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly task-specific, relying on manually crafted agent roles and interaction prompts, which leads to increased architectural complexity and limited reusability across tasks. Moreover, most MAS communicate primarily through natural language, making them vulnerable to error accumulation and instability in long-context, multi-stage interactions within internal agent histories. In this work, we propose \textbf{Agent Primitives}, a set of reusable latent building blocks for LLM-based MAS. Inspired by neural network design, where complex models are built from reusable components, we observe that many existing MAS architectures can be decomposed into a small number of recurring internal computation patterns. Based on this observation, we instantiate three primitives: Review, Voting and Selection, and Planning and Execution. All primitives communicate internally via key-value (KV) cache, which improves both robustness and efficiency by mitigating information degradation across multi-stage interactions. To enable automatic system construction, an Organizer agent selects and composes primitives for each query, guided by a lightweight knowledge pool of previously successful configurations, forming a primitive-based MAS. Experiments show that primitives-based MAS improve average accuracy by 12.0-16.5% over single-agent baselines, reduce token usage and inference latency by approximately 3$\times$-4$\times$ compared to text-based MAS, while incurring only 1.3$\times$-1.6$\times$ overhead relative to single-agent inference and providing more stable performance across model backbones.

[799] ENGRAM: Effective, Lightweight Memory Orchestration for Conversational Agents

Daivik Patel, Shrenik Patel

Main category: cs.MA

TL;DR: ENGRAM: A lightweight memory system for LLMs using three canonical memory types (episodic, semantic, procedural) with simple dense retrieval, achieving SOTA on long-horizon conversational benchmarks.

Details

Motivation: Current memory systems for LLMs are overly complex with knowledge graphs, multi-stage retrieval, and OS-style schedulers, creating engineering complexity and reproducibility challenges. There's a need for simpler, more effective long-horizon memory management.

Method: ENGRAM organizes conversations into three memory types (episodic, semantic, procedural) using a single router and retriever. User turns are converted into typed memory records with normalized schemas and embeddings stored in a database. At query time, top-k dense neighbors are retrieved for each type, merged with set operations, and provided as context.

Result: Achieves state-of-the-art results on LoCoMo (multi-session conversational QA benchmark) and exceeds full-context baseline by 15 points on LongMemEval while using only about 1% of tokens.

Conclusion: Careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.

Abstract: Large language models (LLMs) deployed in user-facing applications require long-horizon consistency: the ability to remember prior interactions, respect user preferences, and ground reasoning in past events. However, contemporary memory systems often adopt complex architectures such as knowledge graphs, multi-stage retrieval pipelines, and OS-style schedulers, which introduce engineering complexity and reproducibility challenges. We present ENGRAM, a lightweight memory system that organizes conversation into three canonical memory types (episodic, semantic, and procedural) through a single router and retriever. Each user turn is converted into typed memory records with normalized schemas and embeddings and stored in a database. At query time, the system retrieves top-k dense neighbors for each type, merges results with simple set operations, and provides the most relevant evidence as context to the model. ENGRAM attains state-of-the-art results on LoCoMo, a multi-session conversational QA benchmark for long-horizon memory, and exceeds the full-context baseline by 15 points on LongMemEval while using only about 1% of the tokens. These results show that careful memory typing and straightforward dense retrieval can enable effective long-term memory management in language models without requiring complex architectures.

[800] Reuse, Don’t Recompute: Efficient Large Reasoning Model Inference via Memory Orchestration

Daivik Patel, Shrenik Patel

Main category: cs.MA

TL;DR: ENGRAM-R is an inference-time memory layer that enables large reasoning models to reuse structured memory instead of recomputing derivations, achieving significant token reduction while maintaining accuracy.

Details

Motivation: Current large reasoning models achieve accuracy through test-time scaling (longer chains of thought, multiple solution sampling) but at high computational costs in tokens and latency. The authors argue that memory should be a core component for efficient reasoning - when evidence exists, models should reuse structured memory rather than recompute.

Method: ENGRAM-R is an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. It enables models to access and reuse previously computed evidence and derivations stored in structured memory.

Result: On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains.

Conclusion: Memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets. Structured memory reuse can dramatically reduce computational costs while maintaining or improving accuracy.

Abstract: Large reasoning models (LRMs) achieve strong accuracy through test-time scaling, generating longer chains of thought or sampling multiple solutions, but at steep costs in tokens and latency. We argue that memory is a core ingredient for efficient reasoning: when evidence already exists, models should think less by reusing structured memory instead of recomputing derivations. We present ENGRAM-R, an inference-time memory layer that integrates typed retrieval with compact fact card representations and explicit citation control. On the LoCoMo benchmark, ENGRAM-R reduces input tokens by 85% and reasoning tokens by 75% compared to full context while maintaining high accuracy. On a multi-hop slice of the LongMemEval benchmark, it achieves similar efficiency with substantial accuracy gains. These results show that memory is not only critical for long-horizon correctness but also a practical lever for efficient reasoning under tight compute, memory, and latency budgets.

[801] Multi-Agent Teams Hold Experts Back

Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou

Main category: cs.MA

TL;DR: Self-organizing LLM teams consistently fail to match expert agent performance due to integrative compromise behavior, showing a gap in multi-agent coordination for expertise utilization.

Details

Motivation: Multi-agent LLM systems are increasingly deployed as autonomous collaborators where coordination must emerge through interaction rather than being pre-specified. Prior work enforces coordination through fixed roles or workflows, leaving open how well self-organizing teams perform when coordination is unconstrained.

Method: Drawing on organizational psychology, the study examines whether self-organizing LLM teams achieve synergy where team performance matches or exceeds the best individual member. The research uses human-inspired and frontier ML benchmarks, analyzing conversational patterns and coordination failures in unconstrained multi-agent settings.

Result: LLM teams consistently fail to match their expert agent’s performance (up to 37.6% performance loss), even when explicitly told who the expert is. Expert leveraging, not identification, is the primary bottleneck. Teams show integrative compromise behavior - averaging expert and non-expert views rather than appropriately weighting expertise - which increases with team size and correlates negatively with performance.

Conclusion: Self-organizing multi-agent LLM teams have a significant gap in harnessing collective expertise, showing consensus-seeking behavior that improves robustness to adversarial agents but creates a trade-off between alignment and effective expertise utilization.

Abstract: Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that – unlike human teams – LLM teams consistently fail to match their expert agent’s performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise – averaging expert and non-expert views rather than appropriately weighting expertise – which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

cs.MM

[802] Trailer Reimagined: An Innovative, Llm-DRiven, Expressive Automated Movie Summary framework (TRAILDREAMS)

Roberto Balestri, Pasquale Cascarano, Mirko Degli Esposti, Guglielmo Pescatore

Main category: cs.MM

TL;DR: TRAILDREAMS uses LLMs to automate movie trailer creation by selecting key visual sequences and dialogues, and generating audio elements like music and voiceovers.

Details

Motivation: To automate the production of movie trailers efficiently, reducing manual effort while maintaining quality and engagement.

Method: Uses a large language model to select key visual sequences and impactful dialogues, and to generate audio elements including music and voiceovers.

Result: TRAILDREAMS surpasses current state-of-the-art trailer generation methods in viewer ratings but still falls short compared to real human-crafted trailers.

Conclusion: The framework demonstrates significant promise and advances automated creative processes, but further improvements are needed to bridge the quality gap with traditional trailers.

Abstract: This paper introduces TRAILDREAMS, a framework that uses a large language model (LLM) to automate the production of movie trailers. The purpose of LLM is to select key visual sequences and impactful dialogues, and to help TRAILDREAMS to generate audio elements such as music and voiceovers. The goal is to produce engaging and visually appealing trailers efficiently. In comparative evaluations, TRAILDREAMS surpasses current state-of-the-art trailer generation methods in viewer ratings. However, it still falls short when compared to real, human-crafted trailers. While TRAILDREAMS demonstrates significant promise and marks an advancement in automated creative processes, further improvements are necessary to bridge the quality gap with traditional trailers.

eess.AS

[803] WAXAL: A Large-Scale Multilingual African Language Speech Corpus

Abdoulaye Diack, Perry Nelson, Kwaku Agbesi, Angela Nakalembe, MohamedElfatih MohamedKhair, Vusumuzi Dube, Tavonga Siyavora, Subhashini Venugopalan, Jason Hickey, Uche Okonkwo, Abhishek Bapna, Isaac Wiafe, Raynard Dodzi Helegah, Elikem Doe Atsakpo, Charles Nutrokpor, Fiifi Baffoe Payin Winful, Kafui Kwashie Solaga, Jamal-Deen Abdulai, Akon Obu Ekpezu, Audace Niyonkuru, Samuel Rutunda, Boris Ishimwe, Michael Melese, Engineer Bainomugisha, Joyce Nakatumba-Nabende, Andrew Katumba, Claire Babirye, Jonathan Mukiibi, Vincent Kimani, Samuel Kibacia, James Maina, Fridah Emmah, Ahmed Ibrahim Shekarau, Ibrahim Shehu Adamu, Yusuf Abdullahi, Howard Lakougna, Bob MacDonald, Hadar Shemtov, Aisha Walcott-Bryant, Moustapha Cisse, Avinatan Hassidim, Jeff Dean, Yossi Matias

Main category: eess.AS

TL;DR: WAXAL is a large-scale open speech dataset for 21 Sub-Saharan African languages containing ~1,250 hours of transcribed natural speech for ASR and ~180 hours of high-quality single-speaker recordings for TTS.

Details

Motivation: To address the digital divide in speech technology for Sub-Saharan African languages, which have been historically underserved compared to high-resource languages.

Method: Collaborated with four African academic and community organizations to collect, annotate, and perform quality control on speech data from 21 languages representing over 100 million speakers.

Result: Created WAXAL dataset with two components: ASR dataset (~1,250 hours of transcribed natural speech) and TTS dataset (~180 hours of high-quality single-speaker recordings with phonetically balanced scripts).

Conclusion: WAXAL provides a vital resource for inclusive technology development, research, and digital preservation of African languages, released under CC-BY-4.0 license.

Abstract: The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.

[804] WST-X Series: Wavelet Scattering Transform for Interpretable Speech Deepfake Detection

Xi Xuan, Davide Carbone, Ruchi Pandey, Wenxin Zhang, Tomi H. Kinnunen

Main category: eess.AS

TL;DR: WST-X series combines wavelet scattering transform with deep learning principles to create interpretable yet powerful front-ends for speech deepfake detection, outperforming both hand-crafted filterbanks and self-supervised features.

Details

Motivation: Current speech deepfake detection front-ends face a trade-off: hand-crafted filterbank features are transparent but limited in capturing high-level semantics, while self-supervised features are powerful but lack interpretability and may miss fine-grained spectral anomalies.

Method: Proposes WST-X series feature extractors using wavelet scattering transform (WST) that integrates wavelets with nonlinearities similar to deep convolutional networks. Investigates 1D WST for acoustic details and 2D WST for higher-order structural anomalies, with specific attention to averaging scale (J), frequency resolution (Q), and directional resolution (L).

Result: WST-X outperforms existing front-ends by a wide margin on the challenging Deepfake-Eval-2024 dataset. Analysis shows that small averaging scale combined with high-frequency and directional resolutions is critical for capturing subtle artifacts.

Conclusion: WST-X provides translation-invariant and deformation-stable features that offer both robustness and interpretability for speech deepfake detection, bridging the gap between transparent hand-crafted features and powerful but opaque self-supervised features.

Abstract: Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high-frequency and directional resolutions ($Q, L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.

[805] Mići Princ – A Little Boy Teaching Speech Technologies the Chakavian Dialect

Nikola Ljubešić, Peter Rupnik, Tea Perinčić

Main category: eess.AS

TL;DR: Researchers created a computer-readable dataset of The Little Prince translated into the Chakavian dialect, with aligned text and audio, enabling AI applications like speech recognition adaptation.

Details

Motivation: Three main motivations: 1) Preserve valuable dialectal content beyond limited print/audio editions, 2) Create AI-ready dataset for research applications, 3) Enable future digital online editions for broader access.

Method: Released printed and audio book of The Little Prince in Chakavian dialect as a computer-readable dataset with word-level alignment between text and audio components, published in CLARIN.SI repository.

Result: Successfully adapted Whisper-large-v3 speech recognition model to Chakavian dialect, reducing word error rate by half and character-level errors by up to two-thirds on test data.

Conclusion: The dataset enables both AI research (speech recognition, dialect adaptation) and dialect preservation, with potential for broader applications beyond the initial experiments.

Abstract: This paper documents our efforts in releasing the printed and audio book of the translation of the famous novel The Little Prince into the Chakavian dialect, as a computer-readable, AI-ready dataset, with the textual and the audio components of the two releases now aligned on the level of each written and spoken word. Our motivation for working on this release is multiple. The first one is our wish to preserve the highly valuable and specific content beyond the small editions of the printed and the audio book. With the dataset published in the CLARIN.SI repository, this content is from now on at the fingertips of any interested individual. The second motivation is to make the data available for various artificial-intelligence-related usage scenarios, such as the one we follow upon inside this paper already – adapting the Whisper-large-v3 open automatic speech recognition model, with decent performance on standard Croatian, to Chakavian dialectal speech. We can happily report that with adapting the model, the word error rate on the selected test data has being reduced to a half, while we managed to remove up to two thirds of the error on character level. We envision many more usages of this dataset beyond the set of experiments we have already performed, both on tasks of artificial intelligence research and application, as well as dialectal research. The third motivation for this release is our hope that this, now highly structured dataset, will be transformed into a digital online edition of this work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.

Shunxi Xu, Thushara Abhayapala, Craig T. Jin

Main category: eess.AS

TL;DR: SVD-based sparse recovery framework for hybrid spherical-linear microphone arrays improves spatial selectivity and reconstruction accuracy in reverberant environments.

Details

Motivation: To develop a principled, unified approach for processing hybrid spherical-linear microphone arrays that combines the advantages of both array types for robust sound-field reconstruction.

Method: Uses singular value decomposition (SVD) of the transfer operator to obtain orthogonal microphone and field modes, which reduce to spherical harmonics for spherical arrays alone but incorporate complementary modes when linear arrays are added.

Result: Experimental results show reduced energy-map mismatch and angular error across frequency, distance, and source count in reverberant conditions, outperforming spherical-only arrays and direct concatenation approaches.

Conclusion: SVD-modal processing provides a principled framework for hybrid arrays that improves spatial selectivity and enables robust sparse sound-field reconstruction.

Abstract: We propose a data-driven sparse recovery framework for hybrid spherical linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the SMA-only case, while incorporating LMAs introduces complementary modes beyond SH. Modal analysis reveals consistent divergence from SH across frequency, confirming the improved spatial selectivity. Experiments in reverberant conditions show reduced energy-map mismatch and angular error across frequency, distance, and source count, outperforming SMA-only and direct concatenation. The results demonstrate that SVD-modal processing provides a principled and unified treatment of hybrid arrays for robust sparse sound-field reconstruction.

[807] Conditional Flow Matching for Visually-Guided Acoustic Highlighting

Hugo Malard, Gael Le Lan, Daniel Wong, David Lou Alon, Yi-Chiao Wu, Sanjeel Parekh

Main category: eess.AS

TL;DR: A generative Conditional Flow Matching framework for visually-guided acoustic highlighting that rebalances audio to align with video focus, using rollout loss for stability and cross-modal conditioning.

Details

Motivation: Existing discriminative models struggle with audio remixing ambiguity where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. Visually-guided acoustic highlighting remains underexplored despite visual saliency being widely studied.

Method: Reframes the task as a generative problem using Conditional Flow Matching (CFM). Introduces rollout loss to penalize drift at final step for self-correcting trajectories, and a conditioning module that fuses audio and visual cues before vector field regression for explicit cross-modal source selection.

Result: Extensive quantitative and qualitative evaluations show the method consistently surpasses previous state-of-the-art discriminative approaches, establishing that visually-guided audio remixing is best addressed through generative modeling.

Conclusion: The generative Conditional Flow Matching framework with rollout loss and cross-modal conditioning effectively addresses the ambiguity in audio remixing, outperforming discriminative approaches and providing better audio-visual alignment.

Abstract: Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors – in selecting the correct source to enhance – compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.

[808] CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Kai Yu

Main category: eess.AS

TL;DR: CodecSlime introduces a dynamic frame rate plugin for neural speech codecs to reduce temporal redundancy, improving efficiency by allocating tokens based on speech information density rather than fixed intervals.

Details

Motivation: Current neural speech codecs use fixed frame rates, which waste tokens on steady-state speech segments like long vowels and silences. Speech has non-uniform temporal information density, so a dynamic approach could improve efficiency.

Method: CodecSlime is an unsupervised, architecture-agnostic plugin with two innovations: ScheDFR for adapting inference and Melt-and-Cool for adapting training. It enables dynamic frame rate operation on existing codec backbones.

Result: At 40 Hz DFR (~600 bps), CodecSlime reduces reconstruction WER by up to 32% relative to fixed frame rate baselines with similar bitrates. It supports multiple frame rates from a single model and consistently outperforms FFR models.

Conclusion: CodecSlime successfully addresses temporal redundancy in speech codecs through dynamic frame rate adaptation, offering flexible quality-bitrate tradeoffs while maintaining competitive performance across metrics.

Abstract: Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method for compressing temporal redundancy through supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 32% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics are also competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.

[809] Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network

Zhanhong He, Hanyu Meng, David Huang, Roberto Togneri

Main category: eess.AS

TL;DR: A multi-task network for piano dynamic estimation that jointly predicts dynamic levels, change points, beats, and downbeats from audio using Bark-scale specific loudness features, achieving state-of-the-art results with significantly reduced model size.

Details

Motivation: Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. Current methods often use separate models for different musical structure elements, and there's a need for more efficient, integrated approaches that can handle long audio sequences while maintaining accuracy.

Method: Proposes an efficient multi-task network that jointly predicts four related targets (dynamic levels, change points, beats, and downbeats) from a shared latent representation. Uses a multi-scale network backbone with Bark-scale specific loudness as input feature instead of log-Mel, enabling 60-second audio segmentation (double typical beat tracking length) while reducing model size from 14.7M to 0.5M parameters.

Result: Achieves state-of-the-art results on the public MazurkaBL dataset across all four tasks. The model is significantly more compact (0.5M parameters vs 14.7M) while maintaining high performance, enabling efficient processing of long audio sequences.

Conclusion: Sets a new benchmark for piano dynamic estimation and provides a powerful, compact tool for large-scale, resource-efficient analysis of musical expression. The multi-task approach with Bark-scale features enables efficient joint modeling of musical structure elements.

Abstract: Estimating piano dynamic from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent vocal dynamic research, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to log-Mel as input, this reduces model size from 14.7 M to 0.5 M, enabling long sequential input. We use a 60-second audio length in audio segmentation, which doubled the length of beat tracking commonly used. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamic estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.

[810] DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching

Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, Meng Meng, Jian Luan, Zhonghua Fu, Lei Xie

Main category: eess.AS

TL;DR: DiffRhythm 2 is an end-to-end framework for high-fidelity, controllable song generation that addresses lyric-vocal alignment and multi-preference RLHF optimization through semi-autoregressive block flow matching and cross-pair preference optimization.

Details

Motivation: Existing non-autoregressive song generation frameworks struggle with lyric-vocal alignment and face performance degradation when optimizing for diverse musical preferences through RLHF, necessitating a more robust solution.

Method: Uses semi-autoregressive block flow matching for lyric alignment, music VAE for low frame rate (5Hz) audio representation, cross-pair preference optimization for RLHF without model merging degradation, and stochastic block representation alignment for musical coherence.

Result: Achieves faithful lyric-vocal alignment without external constraints, maintains high generation quality and efficiency, enables computationally tractable long sequence generation, and robustly optimizes for diverse human preferences.

Conclusion: DiffRhythm 2 provides an effective solution for high-quality, controllable song generation with improved lyric alignment and preference optimization capabilities.

Abstract: Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.

[811] SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Main category: eess.AS

TL;DR: SPEAR is a self-supervised framework that unifies speech and audio representation learning by distilling knowledge from both speech-focused and general-audio SSL teachers into a single model using multi-codebook vector quantization and joint prediction.

Details

Motivation: There's a persistent gap between speech-focused and audio event understanding models in self-supervised learning. Most existing SSL models are optimized for either speech or audio events, but not both. The authors aim to create a unified model that bridges this domain gap.

Method: SPEAR uses knowledge distillation from two teachers: one speech-focused SSL model and one general-audio SSL model. It applies multi-codebook vector quantization to continuous teacher representations to produce fine-grained discrete tokens capturing both semantic and acoustic information. The model jointly predicts these heterogeneous representations given masked inputs with an asymmetric pre-training loss, and includes a novel token mixing mechanism for robustness in complex sound scenes.

Result: SPEAR consistently outperforms existing unified speech and audio models, establishes new SOTA on SUPERB benchmark (surpassing WavLM Large on 12 of 15 tasks), and achieves competitive performance on HEAR benchmark.

Conclusion: SPEAR successfully bridges the gap between speech and audio representation learning, creating a versatile foundation for general-purpose speech and audio representation learning that outperforms specialized models in both domains.

Abstract: Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.

[812] RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses

Shaoheng Xu, Chunyi Sun, Jihui Zhang, Prasanga N. Samarasinghe, Thushara D. Abhayapala

Main category: eess.AS

TL;DR: RIR-Former: A transformer-based model for reconstructing room impulse responses from sparse measurements using sinusoidal position encoding and segmented multi-branch decoder for early reflections and late reverberation.

Details

Motivation: Measuring room impulse responses (RIRs) densely across space is often impractical, creating a need for efficient reconstruction methods from sparse measurements.

Method: Grid-free, one-step feed-forward transformer model with sinusoidal encoding for microphone positions, enabling interpolation at arbitrary locations. Uses segmented multi-branch decoder to separately handle early reflections and late reverberation.

Result: Outperforms state-of-the-art baselines in normalized mean square error (NMSE) and cosine distance (CD) across diverse simulated acoustic environments, varying missing rates, and array configurations.

Conclusion: Demonstrates potential for practical deployment and motivates future work on scaling to complex array geometries, dynamic acoustic scenes, and real-world environments.

Abstract: Room impulse responses (RIRs) are essential for many acoustic signal processing tasks, yet measuring them densely across space is often impractical. In this work, we propose RIR-Former, a grid-free, one-step feed-forward model for RIR reconstruction. By introducing a sinusoidal encoding module into a transformer backbone, our method effectively incorporates microphone position information, enabling interpolation at arbitrary array locations. Furthermore, a segmented multi-branch decoder is designed to separately handle early reflections and late reverberation, improving reconstruction across the entire RIR. Experiments on diverse simulated acoustic environments demonstrate that RIR-Former consistently outperforms state-of-the-art baselines in terms of normalized mean square error (NMSE) and cosine distance (CD), under varying missing rates and array configurations. These results highlight the potential of our approach for practical deployment and motivate future work on scaling from randomly spaced linear arrays to complex array geometries, dynamic acoustic scenes, and real-world environments.

eess.IV

[813] Super-résolution non supervisée d’images hyperspectrales de télédétection utilisant un entraînement entièrement synthétique

Xinxin Xu, Yann Gousseau, Christophe Kervazo, Saïd Ladjal

Main category: eess.IV

TL;DR: Unsupervised hyperspectral image super-resolution using synthetic abundance data and dead leaves model training

Details

Motivation: Hyperspectral SISR typically requires high-resolution ground truth data which is often unavailable; need for unsupervised approaches that can work without paired training data

Method: 1) Decompose hyperspectral image into endmembers and abundance maps via hyperspectral unmixing; 2) Train neural network to super-resolve abundance maps using synthetic data generated with dead leaves model; 3) Reconstruct super-resolved hyperspectral image by recombining processed abundance maps with endmembers

Result: Experimental results demonstrate method effectiveness and show relevance of synthetic data for training in unsupervised hyperspectral super-resolution

Conclusion: Proposed unsupervised approach using synthetic abundance data provides effective solution for hyperspectral SISR when ground truth data is unavailable, with dead leaves model successfully replicating statistical properties of real abundances

Abstract: Hyperspectral single image super-resolution (SISR) aims to enhance spatial resolution while preserving the rich spectral information of hyperspectral images. Most existing methods rely on supervised learning with high-resolution ground truth data, which is often unavailable in practice. To overcome this limitation, we propose an unsupervised learning approach based on synthetic abundance data. The hyperspectral image is first decomposed into endmembers and abundance maps through hyperspectral unmixing. A neural network is then trained to super-resolve these maps using data generated with the dead leaves model, which replicates the statistical properties of real abundances. The final super-resolution hyperspectral image is reconstructed by recombining the super-resolved abundance maps with the endmembers. Experimental results demonstrate the effectiveness of our method and the relevance of synthetic data for training.

[814] EchoJEPA: A Latent Predictive Foundation Model for Echocardiography

Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Ahmadreza Attarpour, River Jiang, Brana Sooriyakanthan, Maala Sooriyakanthan, Heather Whitney, Jeremy Slivnick, Barry Rubin, Wendy Tsang, Bo Wang

Main category: eess.IV

TL;DR: EchoJEPA is a foundation model for echocardiography trained on 18 million echocardiograms that uses latent prediction to disentangle anatomical signals from ultrasound artifacts, achieving superior performance in cardiac function estimation and view classification with strong robustness and sample efficiency.

Details

Motivation: Current echocardiography foundation models fail to separate anatomical signals from ultrasound artifacts like speckle noise, limiting their diagnostic utility. There's a need for models that can learn robust representations from large unlabeled video archives to reduce annotation burden and improve diagnostic consistency.

Method: EchoJEPA uses latent prediction as a pretraining paradigm, trained on 18 million echocardiograms across 300K patients. It employs a novel multi-view probing framework with factorized stream embeddings for standardized evaluation under frozen backbones, focusing on disentangling anatomical information from acquisition artifacts.

Result: EchoJEPA reduces left ventricular ejection fraction estimation error by 19%, achieves 87.4% view classification accuracy, shows strong sample efficiency (78.6% accuracy with only 1% labeled data), degrades only 2.3% under acoustic perturbations (vs 16.8% for next best), and transfers zero-shot to pediatric patients with 15% lower error than competitors.

Conclusion: Latent prediction is established as a superior paradigm for ultrasound foundation models, with EchoJEPA demonstrating robust, sample-efficient representations that generalize well across patient populations and are resilient to ultrasound-specific artifacts.

Abstract: Foundation models for echocardiography promise to reduce annotation burden and improve diagnostic consistency by learning generalizable representations from large unlabeled video archives. However, current approaches fail to disentangle anatomical signal from the stochastic speckle and acquisition artifacts that dominate ultrasound imagery. We present EchoJEPA, a foundation model for echocardiography trained on 18 million echocardiograms across 300K patients, the largest pretraining corpus for this modality to date. We also introduce a novel multi-view probing framework with factorized stream embeddings that standardizes evaluation under frozen backbones. Compared to prior methods, EchoJEPA reduces left ventricular ejection fraction estimation error by 19% and achieves 87.4% view classification accuracy. EchoJEPA exhibits strong sample efficiency, reaching 78.6% accuracy with only 1% of labeled data versus 42.1% for the best baseline trained on 100%. Under acoustic perturbations, EchoJEPA degrades by only 2.3% compared to 16.8% for the next best model, and transfers zero-shot to pediatric patients with 15% lower error than the next best model, outperforming all fine-tuned baselines. These results establish latent prediction as a superior paradigm for ultrasound foundation models.

[815] Physics-based generation of multilayer corneal OCT data via Gaussian modeling and MCML for AI-driven diagnostic and surgical guidance applications

Jinglun Yu, Yaning Wang, Rosalinda Xiong, Ziyi Huang, Kristina Irsch, Jin U. Kang

Main category: eess.IV

TL;DR: A Monte Carlo simulation framework generates synthetic corneal OCT images with pixel-level segmentation labels for training AI models in ophthalmology.

Details

Motivation: Training deep learning models for corneal OCT imaging is limited by the scarcity of large, well-annotated datasets, necessitating synthetic data generation.

Method: A configurable Monte Carlo simulation framework creates synthetic corneal B-scan OCT images using a five-layer corneal model with Gaussian surfaces, assigning optical properties from literature and simulating light transport with MCML while incorporating system features like confocal PSF and sensitivity roll-off.

Result: The framework produces over 10,000 high-resolution (1024x1024) image-label pairs that support customization of geometry, photon count, noise, and system parameters.

Conclusion: The synthetic dataset enables systematic training, validation, and benchmarking of AI models under controlled ground-truth conditions, providing a reproducible and scalable resource for diagnostic and surgical guidance applications in image-guided ophthalmology.

Abstract: Training deep learning models for corneal optical coherence tomography (OCT) imaging is limited by the availability of large, well-annotated datasets. We present a configurable Monte Carlo simulation framework that generates synthetic corneal B-scan optical OCT images with pixel-level five-layer segmentation labels derived directly from the simulation geometry. A five-layer corneal model with Gaussian surfaces captures curvature and thickness variability in healthy and keratoconic eyes. Each layer is assigned optical properties from the literature and light transport is simulated using Monte Carlo modeling of light transport in multi-layered tissues (MCML), while incorporating system features such as the confocal PSF and sensitivity roll-off. This approach produces over 10,000 high-resolution (1024x1024) image-label pairs and supports customization of geometry, photon count, noise, and system parameters. The resulting dataset enables systematic training, validation, and benchmarking of AI models under controlled, ground-truth conditions, providing a reproducible and scalable resource to support the development of diagnostic and surgical guidance applications in image-guided ophthalmology.

[816] Wide-field high-resolution microscopy via high-speed galvo scanning and real-time mosaicking

Ziyi Huang, Rosalinda Xiong, Yaning Wang, Jinglun Yu, Jin U. Kang

Main category: eess.IV

TL;DR: A framework for wide-field microscopic image mosaicking that handles geometric and brightness inconsistencies from galvanometric scanning, supporting both linear and sinusoidal scanning strategies.

Details

Motivation: Conventional galvanometric scanning, especially under sinusoidal driving, introduces nonuniform spatial sampling leading to geometric inconsistencies and brightness variations, which degrade image quality in wide-field microscopy.

Method: Combines translation-based geometric mosaicking model with ROI-based brightness correction and seam-aware feathering, using calibrated scan parameters and synchronized scan-camera control without image-content-based registration.

Result: Successfully reconstructed wide-field mosaicked images achieving 2.5×2.5 cm² FOV in ~6s per dataset, with improved brightness uniformity, increased CNR, reduced seam artifacts, and preserved 7.81μm lateral resolution.

Conclusion: The framework provides a practical and efficient solution for scan-based wide-field microscopic mosaicking that works with both linear and sinusoidal scanning strategies.

Abstract: Wide-field high-resolution microscopy requires fast scanning and accurate image mosaicking to cover large fields of view without compromising image quality. However, conventional galvanometric scanning, particularly under sinusoidal driving, can introduce nonuniform spatial sampling, leading to geometric inconsistencies and brightness variations across the scanned field. To address these challenges, we present an image mosaicking framework for wide-field microscopic imaging that is applicable to both linear and sinusoidal galvanometric scanning strategies. The proposed approach combines a translation-based geometric mosaicking model with region-of-interest (ROI) based brightness correction and seam-aware feathering to improve radiometric consistency across large fields of view. The method relies on calibrated scan parameters and synchronized scan–camera control, without requiring image-content-based registration. Using the proposed framework, wide-field mosaicked images were successfully reconstructed under both linear and sinusoidal scanning strategies, achieving a field of view of up to $2.5 \times 2.5~\mathrm{cm}^2$ with a total acquisition time of approximately $6~\mathrm{s}$ per dataset. Quantitative evaluation shows that both scanning strategies demonstrate improved image quality, including enhanced brightness uniformity, increased contrast-to-noise ratio (CNR), and reduced seam-related artifacts after image processing, while preserving a lateral resolution of $7.81~μ\mathrm{m}$. Overall, the presented framework provides a practical and efficient solution for scan-based wide-field microscopic mosaicking.

[817] Super-Resolution and Denoising of Corneal B-Scan OCT Imaging Using Diffusion Model Plug-and-Play Priors

Yaning Wang, Jinglun Yu, Wenhan Guo, Ziyi Huang, Rosalinda Xiong, Yu Sun, Jin U. Kang

Main category: eess.IV

TL;DR: Diffusion model-based super-resolution framework for OCT corneal imaging that achieves 4x resolution enhancement with effective denoising using plug-and-play priors.

Details

Motivation: High-speed OCT acquisitions degrade spatial resolution and increase speckle noise, making accurate interpretation challenging for clinical applications like surgical planning and diagnosis.

Method: Formulates reconstruction as Bayesian inverse problem using diffusion model plug-and-play priors, combining Markov chain Monte Carlo sampling with pretrained generative priors to enforce anatomical consistency.

Result: Superior performance compared to bicubic interpolation, U-Net baselines, and alternative diffusion priors, with state-of-the-art metrics (PSNR, SSIM) and improved delineation of corneal layers.

Conclusion: Diffusion-driven plug-and-play reconstruction enables high-fidelity OCT imaging for reliable clinical assessments and can be extended to other biomedical imaging modalities.

Abstract: Optical coherence tomography (OCT) is pivotal in corneal imaging for both surgical planning and diagnosis. However, high-speed acquisitions often degrade spatial resolution and increase speckle noise, posing challenges for accurate interpretation. We propose an advanced super-resolution framework leveraging diffusion model plug-and-play (PnP) priors to achieve 4x spatial resolution enhancement alongside effective denoising of OCT Bscan images. Our approach formulates reconstruction as a principled Bayesian inverse problem, combining Markov chain Monte Carlo sampling with pretrained generative priors to enforce anatomical consistency. We comprehensively validate the framework using \emph{in vivo} fisheye corneal datasets, to assess robustness and scalability under diverse clinical settings. Comparative experiments against bicubic interpolation, conventional supervised U-Net baselines, and alternative diffusion priors demonstrate that our method consistently yields more precise anatomical structures, improved delineation of corneal layers, and superior noise suppression. Quantitative results show state-of-the-art performance in peak signal-to-noise ratio, structural similarity index, and perceptual metrics. This work highlights the potential of diffusion-driven plug-and-play reconstruction to deliver high-fidelity, high-resolution OCT imaging, supporting more reliable clinical assessments and enabling advanced image-guided interventions. Our findings suggest the approach can be extended to other biomedical imaging modalities requiring robust super-resolution and denoising.

[818] Real-time topology-aware M-mode OCT segmentation for robotic deep anterior lamellar keratoplasty (DALK) guidance

Rosalinda Xiong, Jinglun Yu, Yaning Wang, Ziyi Huang, Jin U. Kang

Main category: eess.IV

TL;DR: Lightweight UNeXt-based M-mode OCT segmentation pipeline with anatomical topology regularization for real-time robotic DALK surgery guidance, achieving 80+ Hz throughput with improved boundary stability.

Details

Motivation: Robotic deep anterior lamellar keratoplasty requires accurate real-time depth feedback to approach Descemet's membrane without perforation. M-mode OCT provides depth traces but suffers from speckle noise, attenuation, and instrument shadowing, resulting in discontinuous layer interfaces that challenge consistent segmentation at deployment frame rates.

Method: Proposes a lightweight, topology-aware M-mode segmentation pipeline based on UNeXt architecture that incorporates anatomical topology regularization to stabilize boundary continuity and layer ordering under low signal-to-noise ratio conditions.

Result: Achieves end-to-end throughput exceeding 80 Hz on a single GPU for the complete preprocessing-inference-overlay pipeline, demonstrating practical real-time guidance. Shows improved qualitative boundary stability compared with topology-agnostic controls while maintaining deployable real-time performance.

Conclusion: The system provides sufficient temporal headroom to reject low-quality frames while maintaining stable effective depth update rates, making it suitable for real-time robotic surgical guidance in challenging OCT imaging conditions.

Abstract: Robotic deep anterior lamellar keratoplasty (DALK) requires accurate real time depth feedback to approach Descemet’s membrane (DM) without perforation. M-mode intraoperative optical coherence tomography (OCT) provides high temporal resolution depth traces, but speckle noise, attenuation, and instrument induced shadowing often result in discontinuous or ambiguous layer interfaces that challenge anatomically consistent segmentation at deployment frame rates. We present a lightweight, topology aware M-mode segmentation pipeline based on UNeXt that incorporates anatomical topology regularization to stabilize boundary continuity and layer ordering under low signal to noise ratio conditions. The proposed system achieves end to end throughput exceeding 80 Hz measured over the complete preprocessing inference overlay pipeline on a single GPU, demonstrating practical real time guidance beyond model only timing. This operating margin provides temporal headroom to reject low quality or dropout frames while maintaining a stable effective depth update rate. Evaluation on a standard rabbit eye M-mode dataset using an established baseline protocol shows improved qualitative boundary stability compared with topology agnostic controls, while preserving deployable real time performance.

[819] Joint Background-Anomaly-Noise Decomposition for Robust Hyperspectral Anomaly Detection via Constrained Convex Optimization

Koyo Sato, Shunsuke Ono

Main category: eess.IV

TL;DR: A robust hyperspectral anomaly detection method that handles mixed noise types (sparse, stripe) while separating background and anomalies through constrained convex optimization.

Details

Motivation: Real-world hyperspectral images often contain various noise types (sparse, stripe) from sensor failures or calibration errors, but most existing anomaly detection methods either ignore noise or assume Gaussian noise, leading to degraded performance in noisy scenarios.

Method: Formulates a constrained convex optimization problem to decompose HS images into background, anomaly, and three noise components. Uses a preconditioned primal-dual splitting method for efficient optimization.

Result: Achieves comparable accuracy to state-of-the-art methods on clean images and demonstrates significantly higher robustness when various mixed noise types are added to six real HS datasets.

Conclusion: The proposed method effectively handles real-world noise in hyperspectral anomaly detection, making it more robust and practical for applications where sensor noise is common.

Abstract: We propose a novel hyperspectral (HS) anomaly detection method that is robust to various types of noise. Most existing HS anomaly detection methods are designed without explicit consideration of noise or are based on the assumption of Gaussian noise. However, in real-world situations, observed HS images are often degraded by various types of noise, such as sparse noise and stripe noise, due to sensor failure or calibration errors, significantly affecting the detection performance. To address this problem, this article establishes a robust HS anomaly detection method with a mechanism that can properly remove mixed noise while separating background and anomaly parts. Specifically, we newly formulate a constrained convex optimization problem to decompose background and anomaly parts, and three types of noise from a given HS image. Then, we develop an efficient algorithm based on a preconditioned variant of a primal-dual splitting method to solve this problem. Experimental results using six real HS datasets demonstrate that the proposed method achieves detection accuracy comparable to state-of-the-art methods on original images and exhibits significantly higher robustness in scenarios where various types of mixed noise are added.

[820] ResSR: A Computationally Efficient Residual Approach to Super-Resolving Multispectral Images

Haley Duba-Sullivan, Emma J. Reid, Sophie Voisin, Charles A. Bouman, Gregery T. Buzzard

Main category: eess.IV

TL;DR: ResSR is an efficient model-based multispectral image super-resolution method that decouples spectral and spatial processing, achieving high-quality reconstruction without supervised training or spatially-coupled optimization.

Details

Motivation: Multispectral imaging sensors have wavelength-dependent resolution limitations that hinder downstream analysis. Existing MSI super-resolution methods achieve good quality but are computationally expensive due to spatially-coupled optimization or large learning-based models, limiting their use in large-scale or time-critical applications.

Method: ResSR decouples spectral and spatial processing into separate branches: spectral branch uses singular value decomposition plus spatially-decoupled approximate forward model for upsampling; spatial branch uses bicubic upsampling. A residual correction step combines these branches to recover accurate spectral and spatial features.

Result: ResSR achieves comparable or improved reconstruction quality relative to existing MSI-SR methods while being 2× to 10× faster, demonstrating computational efficiency without sacrificing performance.

Conclusion: ResSR provides an efficient, model-based approach for multispectral image super-resolution that eliminates the need for supervised training or spatially-coupled optimization while maintaining high reconstruction quality and significant speed improvements.

Abstract: Multispectral imaging (MSI) plays a critical role in material classification, environmental monitoring, and remote sensing. However, MSI sensors typically have wavelength-dependent resolution, which limits downstream analysis. MSI super-resolution (MSI-SR) methods address this limitation by reconstructing all bands at a common high spatial resolution. Existing methods can achieve high reconstruction quality but often rely on spatially-coupled optimization or large learning-based models, leading to significant computational cost and limiting their use in large-scale or time-critical settings. In this paper, we introduce ResSR, a computationally efficient, model-based MSI-SR method that achieves high-quality reconstruction without supervised training or spatially-coupled optimization. Notably, ResSR decouples spectral and spatial processing into separate branches, which are then combined in a residual correction step. The spectral branch uses singular value decomposition plus a spatially-decoupled approximate forward model to upsample the MSI, while the spatial branch uses bicubic upsampling. The residual correction step combines these branches to recover accurate spectral and spatial MSI features. ResSR achieves comparable or improved reconstruction quality relative to existing MSI-SR methods while being 2$\times$ to 10$\times$ faster. Code is available at https://github.com/hdsullivan/ResSR.

[821] Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment

Shiyun Chen, Li Lin, Pujin Cheng, ZhiCheng Jin, JianJian Chen, HaiDong Zhu, Kenneth K. Y. Wong, Xiaoying Tang

Main category: eess.IV

TL;DR: Diff4MMLiTS: A four-stage pipeline for multimodal liver tumor segmentation that addresses misaligned clinical data by using diffusion models to synthesize aligned multimodal CTs for training segmentation models.

Details

Motivation: Existing multimodal segmentation methods require well-registered multimodal data, which is unrealistic for real-world clinical images, especially for indistinct regions like liver tumors where registration is challenging.

Method: Four-stage pipeline: 1) Pre-registration of target organs in multimodal CTs, 2) Dilation of annotated mask and inpainting to get multimodal normal CTs without tumors, 3) Synthesis of strictly aligned multimodal CTs with tumors using latent diffusion model based on multimodal CT features and random tumor masks, 4) Training segmentation model on synthesized aligned data.

Result: Extensive experiments on public and internal datasets demonstrate superiority over other state-of-the-art multimodal segmentation methods.

Conclusion: Diff4MMLiTS eliminates the need for strictly aligned multimodal data for training segmentation models, addressing a key limitation in real-world clinical applications.

Abstract: Multimodal learning has been demonstrated to enhance performance across various clinical tasks, owing to the diverse perspectives offered by different modalities of data. However, existing multimodal segmentation methods rely on well-registered multimodal data, which is unrealistic for real-world clinical images, particularly for indistinct and diffuse regions such as liver tumors. In this paper, we introduce Diff4MMLiTS, a four-stage multimodal liver tumor segmentation pipeline: pre-registration of the target organs in multimodal CTs; dilation of the annotated modality’s mask and followed by its use in inpainting to obtain multimodal normal CTs without tumors; synthesis of strictly aligned multimodal CTs with tumors using the latent diffusion model based on multimodal CT features and randomly generated tumor masks; and finally, training the segmentation model, thus eliminating the need for strictly aligned multimodal data. Extensive experiments on public and internal datasets demonstrate the superiority of Diff4MMLiTS over other state-of-the-art multimodal segmentation methods.

[822] Understanding-informed Bias Mitigation for Fair CMR Segmentation

Tiarna Lee, Esther Puyol-Antón, Bram Ruijsink, Pier-Giorgio Masci, Louise Keehn, Phil Chowienczyk, Emily Haseler, Miaojing Shi, Andrew P. King

Main category: eess.IV

TL;DR: Investigates bias mitigation methods for ethnicity bias in AI-based cardiac MRI segmentation between Black and White subjects, finding oversampling effective and cropping reduces bias.

Details

Motivation: AI models for medical imaging often have biases from imbalanced training data, particularly ethnicity bias in cardiac MRI segmentation. Little is known about effectiveness of bias mitigation methods in this domain, especially for Black vs White subjects.

Method: Used oversampling, importance reweighing, and Group DRO bias mitigation techniques on CMR segmentation models. Also evaluated methods on cropped CMR images based on findings about root causes of bias. Tested on external clinical validation set.

Result: Oversampling significantly improved performance for underrepresented Black subjects without significantly reducing White subjects’ performance. Cropping increased performance for both ethnicities and reduced bias. Combining cropping with oversampling further reduced bias. External validation showed high segmentation performance with no statistically significant bias.

Conclusion: Bias in AI-based CMR segmentation can be effectively mitigated using oversampling and image cropping techniques. These methods improve performance for underrepresented groups while maintaining overall segmentation quality.

Abstract: Artificial intelligence (AI) is increasingly being used for medical imaging tasks. However, there can be biases in AI models, particularly when they are trained using imbalanced training datasets. One such example has been the strong ethnicity bias effect in cardiac magnetic resonance (CMR) image segmentation models. Although this phenomenon has been reported in a number of publications, little is known about the effectiveness of bias mitigation algorithms in this domain. We aim to investigate the impact of common bias mitigation methods to address bias between Black and White subjects in AI-based CMR segmentation models. Specifically, we use oversampling, importance reweighing and Group DRO as well as combinations of these techniques to mitigate the ethnicity bias. Second, motivated by recent findings on the root causes of AI-based CMR segmentation bias, we evaluate the same methods using models trained and evaluated on cropped CMR images. We find that bias can be mitigated using oversampling, significantly improving performance for the underrepresented Black subjects whilst not significantly reducing the majority White subjects’ performance. Using cropped images increases performance for both ethnicities and reduces the bias, whilst adding oversampling as a bias mitigation technique with cropped images reduces the bias further. When testing the models on an external clinical validation set, we find high segmentation performance and no statistically significant bias.

[823] GroundGazer: Camera-based indoor localization of mobile robots with millimeter accuracy at low cost

Sven Hinderer, Jakob Hüsken, Bohan Sun, Bin Yang

Main category: eess.IV

TL;DR: GroundGazer is a low-cost, high-accuracy indoor localization system for autonomous mobile robots using a monocular camera and chessboard floor pattern.

Details

Motivation: Existing high-accuracy indoor localization systems (laser trackers, total stations, motion capture) are very expensive, creating a need for affordable alternatives with millimeter-level accuracy for autonomous mobile robots.

Method: Uses a monocular (fisheye) camera mounted on the robot, a chessboard pattern on the floor, and an optional laser diode. The system analyzes the camera’s view of the chessboard floor to estimate position with millimeter accuracy and heading with sub-degree accuracy.

Result: Achieves millimeter-level position accuracy and sub-degree heading accuracy for autonomous mobile robots in indoor environments.

Conclusion: GroundGazer provides a simple, low-cost, portable, robust, and scalable solution for high-accuracy indoor localization that can be extended to 3D position and orientation estimation.

Abstract: Highly accurate indoor localization systems with mm positioning accuracy are currently very expensive. They include laser trackers, total stations, and motion capture systems relying on multiple high-end cameras. In this work, we introduce a high-accuracy, planar indoor localization system named GroundGazer (GG) for autonomous mobile robots (AMRs). GG estimates the AMR’s position with mm and its heading with sub-degree accuracy. The system requires only a monocular (fisheye) camera, a chessboard floor, and an optional laser diode. Our system is simple and low-cost, easy to set up, portable, robust, scalable to large areas and robot swarms, and extendable to 3D position and orientation estimation.

[824] Observer-Usable Information as a Task-specific Image Quality Metric

Changjie Lu, Sourya Sengupta, Hua Li, Mark A. Anastasio

Main category: eess.IV

TL;DR: V-information is introduced as a new objective, task-specific image quality metric that quantifies how much task-relevant information in an image can be exploited by sub-ideal observers, complementing conventional signal detection theory measures.

Details

Motivation: Current task-based image quality measures like task-specific information (TSI) assume ideal observers and don't quantify how much task-relevant information can actually be exploited by sub-ideal observers, limiting their practical utility.

Method: Introduces predictive V-information (V-info) as a relaxation of TSI that considers specified families of sub-ideal observers. Validates using a stylized magnetic resonance image restoration problem to quantify signal detection/discrimination performance.

Result: V-info correlates with area under ROC curve for binary tasks, works for multi-class tasks where ROC analysis is challenging, and shows greater sensitivity where conventional metrics saturate.

Conclusion: V-info represents a new objective image quality measure that complements conventional signal detection theory-based metrics, particularly useful for multi-class tasks and scenarios where traditional measures saturate.

Abstract: Objective, task-based measures of image quality (IQ) have been widely advocated for assessing and optimizing medical imaging technologies. Besides signal detection theory-based measures, information-theoretic quantities have been proposed to quantify task-based IQ. For example, task-specific information (TSI), defined as the mutual information between an image and a task variable, represents an optimal measure of how informative an image is for performing a specified task. However, like the ideal observer from signal detection theory, TSI does not quantify the amount of task-relevant information in an image that can be exploited by a sub-ideal observer. A recently proposed relaxation of TSI, termed predictive V-information (V-info), removes this limitation and can quantify the utility of an image with consideration of a specified family of sub-ideal observers. In this study, for the first time, we introduce and investigate V-info as an objective, task-specific IQ metric. To corroborate its usefulness, a stylized magnetic resonance image restoration problem is considered in which V-info is employed to quantify signal detection or discrimination performance. The presented results show that V-info correlates with area under the receiver operating characteristic (ROC) curve for binary tasks, while being readily applicable to multi-class (>2) tasks where ROC analysis is challenging. Notably, V-info exhibits greater sensitivity in scenarios where conventional metrics saturate. These findings demonstrate that V-info represents a new objective IQ measure that can complement conventional signal detection theory-based ones.

[825] Fetpype: An Open-Source Pipeline for Reproducible Fetal Brain MRI Analysis

Thomas Sanchez, Gerard Martí-Juan, David Meunier, Miguel Angel Gonzalez Ballester, Oscar Camara, Elisenda Eixarch, Gemma Piella, Meritxell Bach Cuadra, Guillaume Auzias

Main category: eess.IV

TL;DR: Fetpype is a standardized, modular framework for fetal brain MRI preprocessing and analysis that integrates motion correction, super-resolution reconstruction, tissue segmentation, and cortical surface extraction into a unified workflow.

Details

Motivation: Fetal MRI analysis is technically challenging due to fetal motion, low signal-to-noise ratio, and complex multi-step processing pipelines. Existing tools are fragmented, making integration into robust, reproducible end-to-end workflows difficult, which limits reproducibility and adoption in research and clinical contexts.

Method: Fetpype provides a standardized, modular framework that processes raw T2-weighted fetal MRI acquisitions through motion correction, super-resolution reconstruction, tissue segmentation, and cortical surface extraction in a unified workflow.

Result: Fetpype is publicly available on GitHub as an open-source tool that enables researchers to process fetal brain MRI data from raw acquisitions to derived volumetric and surface-based outputs within a reproducible framework.

Conclusion: Fetpype addresses the fragmentation in fetal MRI analysis by providing a standardized, reproducible framework that facilitates advanced fetal neuroimaging methods in both research and clinical applications.

Abstract: Fetal brain magnetic resonance imaging (MRI) is crucial for assessing neurodevelopment in utero. However, fetal MRI analysis remains technically challenging due to fetal motion, low signal-to-noise ratio, and the need for complex multi-step processing pipelines. These pipelines typically include motion correction, super-resolution reconstruction, tissue segmentation, and cortical surface extraction. While specialized tools exist for each individual processing step, integrating them into a robust, reproducible, and user-friendly end-to-end workflow remains difficult. This fragmentation limits reproducibility across studies and hinders the adoption of advanced fetal neuroimaging methods in both research and clinical contexts. Fetpype addresses this gap by providing a standardized, modular, and reproducible framework for fetal brain MRI preprocessing and analysis, enabling researchers to process raw T2-weighted acquisitions through to derived volumetric and surface-based outputs within a unified workflow. Fetpype is publicly available on GitHub at https://github.com/fetpype/fetpype.

[826] Comprehensive Machine Learning Benchmarking for Fringe Projection Profilometry with Photorealistic Synthetic Data

Anush Lakshman S, Adam Haroon, Beiwen Li

Main category: eess.IV

TL;DR: First open-source photorealistic synthetic dataset for fringe projection profilometry (FPP) enables benchmarking of ML approaches for single-shot 3D depth prediction from fringe images, revealing information deficit as key limitation.

Details

Motivation: Machine learning for FPP lacks large datasets and standardized benchmarks. Need synthetic dataset to enable systematic evaluation of ML approaches for single-shot depth prediction from fringe images.

Method: Created 15,600 fringe images and 300 depth reconstructions across 50 objects using NVIDIA Isaac Sim. Evaluated single-shot FPP models predicting 3D depth directly from individual fringe images without phase shifting. Conducted ablation studies on depth normalization strategies, background removal, loss functions, and four architectures.

Result: Individual depth normalization improved object reconstruction accuracy 9.1x over raw depth. Background fringes proved essential (not noise). Hybrid L1 loss optimal. UNet performed best among architectures, but errors far above classical FPP accuracy. Small performance gap between architectures indicates information deficit from single fringe images is main limitation.

Conclusion: Single fringe images lack sufficient information for accurate depth recovery without explicit phase cues. Work provides benchmark and evidence for hybrid approaches combining phase-based FPP with learned refinement.

Abstract: Machine learning approaches for fringe projection profilometry (FPP) are hindered by the lack of large, diverse datasets and standardized benchmarking protocols. This paper introduces the first open-source, photorealistic synthetic dataset for FPP, generated using NVIDIA Isaac Sim, comprising 15,600 fringe images and 300 depth reconstructions across 50 objects. We apply this dataset to single-shot FPP, where models predict 3D depth maps directly from individual fringe images without temporal phase shifting. Through systematic ablation studies, we identify optimal learning configurations for long-range (1.5-2.1 m) depth prediction. We compare three depth normalization strategies and show that individual normalization, which decouples object shape from absolute scale, yields a 9.1x improvement in object reconstruction accuracy over raw depth. We further show that removing background fringe patterns severely degrades performance across all normalizations, demonstrating that background fringes provide essential spatial phase reference rather than noise. We evaluate six loss functions and identify Hybrid L1 loss as optimal. Using the best configuration, we benchmark four architectures and find UNet achieves the strongest performance, though errors remain far above the sub-millimeter accuracy of classical FPP. The small performance gap between architectures indicates that the dominant limitation is information deficit rather than model design: single fringe images lack sufficient information for accurate depth recovery without explicit phase cues. This work provides a standardized benchmark and evidence motivating hybrid approaches combining phase-based FPP with learned refinement. The dataset is available at https://huggingface.co/datasets/aharoon/fpp-ml-bench and code at https://github.com/AnushLak/fpp-ml-bench.

Editor’s Picks

[1] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

[2] GRAM: Spatial general-purpose audio representations for real-world environments

[3] SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Today’s Research Highlights

Table of Contents

cs.CL

[1] The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

[2] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

[3] Test-Time Detoxification without Training or Learning Anything

[4] ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

[5] Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

[6] WideSeek: Advancing Wide Research via Multi-Agent Scaling

[7] Monotonicity as an Architectural Bias for Robust Language Models

[8] InfMem: Learning System-2 Memory Control for Long-Context Agent

[9] Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors

[10] Time-Critical Multimodal Medical Transportation: Organs, Patients, and Medical Supplies

[11] From Task Solving to Robust Real-World Adaptation in LLM Agents

[12] AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic

[13] When Efficient Communication Explains Convexity

[14] R2-Router: A New Paradigm for LLM Routing with Reasoning

[15] LatentMem: Customizing Latent Memory for Multi-Agent Systems

[16] CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

[17] AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation

[18] Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

[19] Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs

[20] Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework

[21] HALT: Hallucination Assessment via Log-probs as Time series

[22] Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

[23] Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

[24] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

[25] SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

[26] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

[27] AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

[28] Test-time Recursive Thinking: Self-Improvement without External Feedback

[29] Task–Specificity Score: Measuring How Much Instructions Really Matter for Supervision

[30] The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

[31] ChemPro: A Progressive Chemistry Benchmark for Large Language Models

[32] One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

[33] Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

[34] FASA: Frequency-aware Sparse Attention

[35] Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch

[36] ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

[37] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

[38] ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs

[39] POP: Prefill-Only Pruning for Efficient Large Model Inference

[40] MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

[41] Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

[42] PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

[43] Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain

[44] Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

[45] Verified Critical Step Optimization for LLM Agents

[46] FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

[47] A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

[48] Preferences for Idiomatic Language are Acquired Slowly – and Forgotten Quickly: A Case Study on Swedish

[49] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

[50] Learning to Reason Faithfully through Step-Level Faithfulness Maximization

[51] Can Large Language Models Generalize Procedures Across Representations?

[52] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue

[53] Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

[54] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

[55] ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning

[56] Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs

[57] $V_0$: A Generalist Value Model for Any Policy at State Zero

[58] CL-bench: A Benchmark for Context Learning

[59] Efficient Algorithms for Partial Constraint Satisfaction Problems over Control-flow Graphs

[60] Controlling Output Rankings in Generative Engines for LLM-based Search

[61] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

[62] BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish

[63] TRE: Encouraging Exploration in the Trust Region

[64] RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

[65] Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

[66] Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models

[67] Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

[68] OCRTurk: A Comprehensive OCR Benchmark for Turkish

[69] Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

[70] OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

[71] Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

[72] No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

[73] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling