Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
Jingru Lin, Chen Zhang, Tianrui Wang, Haizhou Li
Main category: eess.AS
TL;DR: AudioRAG benchmark evaluates audio-language models on real-world reasoning tasks requiring external information retrieval, showing current SOTA models struggle and proposing an agentic pipeline solution.
Details
Motivation: Existing audio-language model benchmarks focus only on internal knowledge reasoning, neglecting real-world scenarios requiring external information grounding. There's a need to assess models in realistic web environments where audio reasoning must be augmented with information retrieval.
Method: Introduces AudioRAG benchmark with LLM-generated and manually curated question-answer pairs that require audio understanding combined with external information retrieval. Also proposes an agentic pipeline integrating audio reasoning with retrieval-augmented generation as a baseline.
Result: State-of-the-art Large Audio-Language Models (LALMs) struggle to answer AudioRAG questions, demonstrating the challenge of audio-based reasoning requiring external information grounding.
Conclusion: AudioRAG fills an important gap in evaluating audio-language models for real-world scenarios, and the proposed agentic pipeline provides a stronger baseline for future research in audio reasoning with information retrieval.
Abstract: Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. This benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even the state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
Relevance: 9/10
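The agentic baseline is the part most amenable to a quick illustration. Below is a minimal sketch of an audio-RAG loop in the spirit of this entry; `llm`, `caption_audio`, and `web_search` are hypothetical callables standing in for the paper's actual components:

```python
# Minimal agentic audio-RAG loop: describe the audio, then alternate between
# deciding to answer and issuing web searches until a step budget runs out.
# All callables here are illustrative placeholders, not the authors' pipeline.

def answer_with_audio_rag(audio, question, llm, caption_audio, web_search,
                          max_steps=3):
    context = [f"Audio description: {caption_audio(audio)}"]
    for _ in range(max_steps):
        # Ask the model whether it still needs external information.
        probe = llm(
            f"Question: {question}\n" + "\n".join(context) +
            "\nIf you can answer, reply ANSWER: <answer>. "
            "Otherwise reply SEARCH: <query>."
        )
        if probe.startswith("ANSWER:"):
            return probe.removeprefix("ANSWER:").strip()
        query = probe.removeprefix("SEARCH:").strip()
        context.append(f"Retrieved for '{query}': {web_search(query)}")
    # Budget spent: answer from whatever evidence was gathered.
    return llm(f"Question: {question}\n" + "\n".join(context) + "\nAnswer:")
```

The step budget caps retrieval rounds, so the agent falls back to its own audio reasoning once searching stops helping.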
[2] AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Main category: cs.SD
TL;DR: AudioRouter is a reinforcement learning framework that enables Large Audio Language Models to improve audio understanding by learning when and how to use external audio tools through explicit decision-making, achieving better performance with far less training data.
Details
Motivation: Current Large Audio Language Models (LALMs) have unreliable performance on fine-grained auditory perception and require data-intensive training to internalize perceptual abilities. There's a need for more data-efficient approaches to enhance audio understanding capabilities.
Method: AudioRouter uses reinforcement learning to teach LALMs when and how to use external audio tools. Instead of tightly coupling tool usage with audio reasoning, it formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen.
Result: AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms.
Conclusion: Learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in Large Audio Language Models, suggesting a promising direction for enhancing audio understanding capabilities.
Abstract: Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine grained auditory perception remains unreliable, and existing approaches largely rely on data intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data efficient and scalable alternative to internalizing perceptual abilities in LALMs.
Relevance: 9/10
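To make the decoupled routing idea concrete, here is a hedged PyTorch sketch: a tiny routing head chooses between answering directly and calling one of the external tools, trained with a plain REINFORCE update while the reasoning model stays frozen. The scorer architecture and reward are illustrative assumptions, not AudioRouter's actual design:

```python
import torch
import torch.nn as nn

class ToolRouter(nn.Module):
    """Maps a pooled query/audio embedding to a routing distribution:
    action 0 = answer directly, actions 1..N = call external tool i."""
    def __init__(self, embed_dim: int, num_tools: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_tools + 1)

    def forward(self, features):
        return torch.distributions.Categorical(logits=self.head(features))

def reinforce_step(router, optimizer, features, reward_fn):
    """Sample a route, score the resulting answer, reinforce its log-prob.
    Only the router's parameters are updated; the LALM stays frozen."""
    dist = router(features)
    action = dist.sample()
    reward = reward_fn(action)             # e.g., answer correctness in [0, 1]
    loss = -(dist.log_prob(action) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action, reward
```

Because only the `num_tools + 1` logits are trained, the learnable surface is tiny, which is consistent with the reported data efficiency.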
[3] MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
Main category: cs.SD
TL;DR: A fully end-to-end Transformer-based audio tokenizer (CAT) that scales to 1.6B parameters achieves state-of-the-art audio reconstruction across speech, sound, and music, enabling competitive autoregressive TTS and ASR without auxiliary encoders.
Details
Motivation: Existing discrete audio tokenizers rely on pretrained encoders, semantic distillation, or heterogeneous CNN architectures with fixed inductive biases that limit reconstruction fidelity and scaling. The authors argue for fully end-to-end learning with homogeneous, scalable architectures.
Method: Propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes encoder, quantizer, and decoder from scratch. Scale this to MOSS-Audio-Tokenizer with 1.6B parameters pre-trained on 3M hours of diverse audio data.
Result: Outperforms prior codecs across speech, sound, and music over wide bitrate ranges. Enables first purely autoregressive TTS model surpassing prior non-autoregressive systems. Achieves competitive ASR without auxiliary encoders. Shows predictable improvements with scale.
Conclusion: The CAT architecture serves as a unified, scalable interface for next-generation native audio foundation models, demonstrating that simple, fully end-to-end Transformer-based approaches scale gracefully and support high-fidelity reconstruction across diverse audio domains.
Abstract: Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
Relevance: 9/10
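A toy version of the CAT layout (causal Transformer encoder, vector quantizer, causal Transformer decoder, trained jointly from scratch) can be sketched as follows. Dimensions and the straight-through quantizer are illustrative assumptions, not the MOSS-Audio-Tokenizer configuration:

```python
import torch
import torch.nn as nn

class ToyCausalAudioTokenizer(nn.Module):
    """Encoder -> vector quantizer -> decoder, all causal Transformer blocks."""
    def __init__(self, dim=256, codebook_size=1024, depth=4, heads=4):
        super().__init__()
        def stack():
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)
        self.encoder, self.decoder = stack(), stack()
        self.codebook = nn.Embedding(codebook_size, dim)

    def quantize(self, h):
        # Nearest-codebook lookup with a straight-through gradient estimator.
        dists = (h.pow(2).sum(-1, keepdim=True)
                 - 2 * h @ self.codebook.weight.T
                 + self.codebook.weight.pow(2).sum(-1))
        codes = dists.argmin(dim=-1)                 # discrete audio tokens
        q = self.codebook(codes)
        return h + (q - h).detach(), codes

    def forward(self, frames):                       # frames: (B, T, dim)
        mask = nn.Transformer.generate_square_subsequent_mask(frames.size(1))
        h = self.encoder(frames, mask=mask)
        q, codes = self.quantize(h)
        return self.decoder(q, mask=mask), codes     # reconstruction + tokens
```

A real tokenizer would add codebook/commitment losses and operate on raw waveforms rather than pre-computed frames; the point here is only the homogeneous, causal, end-to-end structure.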
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 104]
- cs.CV [Total: 139]
- cs.AI [Total: 56]
- cs.SD [Total: 9]
- cs.LG [Total: 228]
- cs.MA [Total: 11]
- cs.MM [Total: 2]
- eess.AS [Total: 8]
- eess.IV [Total: 13]
cs.CL
[1] Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback
Sukannya Purkayastha, Qile Wan, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Main category: cs.CL
TL;DR: LLM-driven framework for detecting multiple lazy thinking issues in peer reviews and generating actionable feedback to improve review quality.
Details
Motivation: Peer review quality suffers from lazy thinking (reliance on simple heuristics), but existing approaches treat detection as single-label classification and lack actionable feedback mechanisms for improvement.
Method: Decomposes reviews into argumentative segments, uses neurosymbolic module combining LLM features with traditional classifiers for multi-issue detection, and generates targeted feedback using issue-specific templates refined by genetic algorithm.
Result: Outperforms zero-shot LLM baselines, improves review quality by up to 92.4%, and releases LazyReviewPlus dataset with 1,309 sentences labeled for lazy thinking and specificity.
Conclusion: The framework effectively addresses multi-issue lazy thinking detection and provides actionable feedback to enhance peer review quality through a combination of LLM capabilities and structured approaches.
Abstract: Peer review is central to scientific quality, yet reliance on simple heuristics – lazy thinking – has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity problems, or specificity issues. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
[2] Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens
Weihao Liu, Dehai Min, Lu Cheng
Main category: cs.CL
TL;DR: LT-Tuning improves latent reasoning in LLMs by combining contextual hidden states with predictive semantic guidance to prevent feature collapse, enabling dynamic switching between latent and explicit thinking modes.
Details
Motivation: Current latent reasoning methods suffer from feature collapse and instability due to distribution mismatches when using hidden states as input embeddings or alignment issues with assistant models, limiting their effectiveness compared to explicit Chain-of-Thought reasoning.
Method: Proposes Latent Thoughts Tuning (LT-Tuning) with a Context-Prediction-Fusion mechanism that combines contextual hidden states with predictive semantic guidance from vocabulary embedding space, plus a progressive three-stage curriculum learning pipeline enabling dynamic switching between latent and explicit thinking modes.
Result: Outperforms existing latent reasoning baselines, effectively mitigates feature collapse, and achieves robust reasoning accuracy.
Conclusion: LT-Tuning provides a more stable and effective framework for latent reasoning in LLMs by addressing distribution mismatch issues and enabling flexible computation beyond discrete token constraints.
Abstract: While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model's thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leverages contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamic switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.
[3] Learning to Evict from Key-Value Cache
Luca Moschella, Laura Manduchi, Ozan Sener
Main category: cs.CL
TL;DR: KV Policy (KVP) uses reinforcement learning to learn per-head eviction policies for KV cache management, outperforming heuristic methods on long-context tasks while maintaining generalization to downstream tasks.
Details
Motivation: Current KV cache eviction methods rely on heuristics like recency or past attention scores, which are indirect proxies for token utility and add computational overhead. The paper aims to develop a more principled approach to KV cache management.
Method: Frames KV cache eviction as a reinforcement learning problem. Introduces KVP: lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Agents learn specialized eviction policies guided by future utility without modifying the underlying LLM.
Result: KVP significantly outperforms baselines on long-context benchmark RULER and multi-turn dialogue benchmark OASST2-4k across two model families. Zero-shot tests on standard downstream tasks (LongBench, BOOLQ, ARC) show good generalization beyond training distribution and to longer contexts.
Conclusion: Learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management, demonstrating that RL-based approaches can effectively optimize KV cache eviction.
Abstract: The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token’s future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
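The core mechanic, ranking one head's cached tokens from its key/value vectors and keeping only a budgeted top set, is easy to sketch. The tiny MLP scorer below is an illustrative stand-in; KVP trains its per-head agents with RL on generation traces, which is not shown here:

```python
import torch
import torch.nn as nn

class HeadEvictionScorer(nn.Module):
    """Ranks one attention head's cached tokens using only keys and values."""
    def __init__(self, head_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * head_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, keys, values):
        # keys, values: (seq_len, head_dim) -> one utility score per token
        return self.mlp(torch.cat([keys, values], dim=-1)).squeeze(-1)

def evict(keys, values, scorer, budget: int):
    """Keep only the `budget` highest-scoring cache entries, preserving order."""
    scores = scorer(keys, values)                      # (seq_len,)
    keep = scores.topk(min(budget, len(scores))).indices.sort().values
    return keys[keep], values[keep]
```

Because the scorer sees only key/value vectors, the same ranking can be evaluated at any cache budget, matching the paper's budget-agnostic framing.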
[4] On Emergent Social World Models – Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models
Polina Tsvilodub, Jan-Felix Klumpp, Amir Mohammadpour, Jennifer Hu, Michael Franke
Main category: cs.CL
TL;DR: LMs may develop shared computational mechanisms for Theory of Mind and pragmatic reasoning, suggesting interconnected “social world models” rather than isolated competencies.
Details
Motivation: To investigate whether language models develop shared computational mechanisms for general Theory of Mind and language-specific pragmatic reasoning, addressing whether LMs have emergent "social world models" that are repurposed across tasks.
Method: Behavioral evaluations and causal-mechanistic experiments using functional localization methods inspired by cognitive neuroscience, analyzing LMs’ performance across seven subcategories of ToM abilities on a substantially larger localizer dataset than prior work.
Result: Stringent hypothesis-driven statistical testing offers suggestive evidence for the functional integration hypothesis, indicating LMs may develop interconnected “social world models” rather than isolated competencies.
Conclusion: This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.
Abstract: This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent “social world models”, i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs’ performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected “social world models” rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.
[5] Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality
Zhimin Hu, Riya Roshan, Sashank Varma
Main category: cs.CL
TL;DR: Models show resource-rational reasoning emerges from inference-time scaling without explicit computational cost rewards, with LRMs outperforming instruction-tuned models on complex logical functions.
Details
Motivation: To investigate whether resource-rational reasoning (optimizing performance under constraints) can emerge from inference-time scaling in LLMs without explicit computational cost rewards, comparing instruction-tuned models vs. reinforcement learning-trained Large Reasoning Models.
Method: Introduces Variable Attribution Task where models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. Systematically manipulates task complexity by varying number of candidate variables and trials. Tests both instruction-tuned models and Large Reasoning Models trained via reinforcement learning.
Result: Both model types show transition from brute-force to analytic strategies as complexity increases. Instruction-tuned models degrade on XOR and XNOR functions, while Large Reasoning Models remain robust. Models adjust reasoning behavior in response to task complexity without explicit cost-based rewards.
Conclusion: Resource rationality emerges as a property of inference-time scaling itself, not requiring explicit computational cost rewards. Large Reasoning Models trained via reinforcement learning show superior robustness on complex logical functions compared to instruction-tuned models.
Abstract: Human reasoning is shaped by resource rationality – optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource-rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. It provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.
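The Variable Attribution Task itself is simple to reconstruct in code. The generator below is a hedged guess at the task format (candidate binary variables, a hidden relevant pair, observed input-output trials), together with the brute-force strategy the models transition away from as complexity grows:

```python
import itertools
import random

FUNCTIONS = {
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
    "XOR":  lambda a, b: a != b,
    "XNOR": lambda a, b: a == b,
}

def make_instance(num_candidates=4, num_trials=8, fn_name="XOR"):
    """Hide a relevant variable pair; emit observed input-output trials."""
    relevant = sorted(random.sample(range(num_candidates), 2))
    fn = FUNCTIONS[fn_name]
    trials = []
    for _ in range(num_trials):
        inputs = [random.randint(0, 1) for _ in range(num_candidates)]
        trials.append((inputs,
                       int(fn(inputs[relevant[0]], inputs[relevant[1]]))))
    return trials, relevant

def brute_force_attribution(trials, fn_name="XOR"):
    """Brute-force strategy: keep every pair consistent with all trials."""
    fn = FUNCTIONS[fn_name]
    n = len(trials[0][0])
    return [pair for pair in itertools.combinations(range(n), 2)
            if all(int(fn(x[pair[0]], x[pair[1]])) == y for x, y in trials)]

trials, answer = make_instance()
print(answer, brute_force_attribution(trials))  # the answer is among survivors
```

Scaling `num_candidates` makes the pair enumeration explode combinatorially, which is exactly the pressure that forces a shift to analytic strategies.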
[6] Simultaneous Speech-to-Speech Translation Without Aligned Data
Tom Labiausse, Romain Fabre, Yannick Estève, Alexandre Défossez, Neil Zeghidour
Main category: cs.CL
TL;DR: Hibiki-Zero eliminates need for word-level alignments in simultaneous speech translation, using sentence-level training plus reinforcement learning for latency optimization, achieving SOTA across multiple languages.
Details
Motivation: Traditional simultaneous speech translation requires word-level aligned data which is hard to collect at scale and relies on language-specific heuristics, creating a bottleneck for multilingual scaling.
Method: Two-stage approach: first train on sentence-level aligned data for high-latency translation, then apply reinforcement learning using GRPO to optimize latency while maintaining quality, eliminating word-level alignments entirely.
Result: Achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks; can adapt to new languages with <1000h of speech.
Conclusion: Hibiki-Zero simplifies training pipeline, enables seamless scaling to diverse languages with varying grammatical structures, and releases benchmark with 45h of multilingual data.
Abstract: Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.
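The latency-optimizing RL stage can be illustrated with a GRPO-style group-normalized advantage over a scalarized quality-minus-latency reward. The reward shaping and the λ weight below are assumptions for illustration; the paper's exact formulation is not reproduced:

```python
import statistics

def grpo_advantages(rollouts, lam=0.1):
    """rollouts: list of (quality_score, latency_seconds) for one group of
    sampled translations of the same source. Returns one group-normalized
    advantage per rollout, as in GRPO."""
    rewards = [q - lam * lat for q, lat in rollouts]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled translations of the same source segment.
print(grpo_advantages([(0.82, 2.1), (0.80, 1.2), (0.75, 0.9), (0.85, 3.0)]))
```

The group normalization means no learned value function is needed, which is the usual reason GRPO is chosen for this kind of fine-tuning.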
[7] The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis, Jose Alcocer, Kerby Bennett, Aarya Vijay Devnani, Parsa Hejabi, Harry G. Muttram, Akshay Kiran Padte, Mehrshad Saadatinia, Chenhao Wu, Alireza S. Zaibari, Michael Sierra-Arévalo, Nick Weller, Shrikanth Narayanan, Benjamin A. T. Graham, Morteza Dehghani
Main category: cs.CL
TL;DR: First large-scale traffic-stop dataset with multi-perspective respect ratings and rationales from police-affiliated, justice-impacted, and non-affiliated LA residents, enabling study of perceptual differences in police-civilian interactions.
Details
Motivation: Respect in police-civilian interactions is subjective and shaped by lived experience, requiring community-specific perspectives to understand diverse interpretations of procedural justice and build public trust.
Method: Developed domain-specific evaluation rubric from procedural justice theory and LAPD materials; created rubric-driven preference data construction; proposed perspective-aware modeling framework predicting personalized respect ratings and generating annotator-specific rationales from transcripts.
Result: Approach improved rating prediction performance and rationale alignment across all three annotator groups (police-affiliated, justice-system-impacted, non-affiliated).
Conclusion: Perspective-aware framework enables law enforcement to better understand diverse community expectations, providing tool for building public trust and procedural legitimacy through multi-perspective analysis of police-civilian interactions.
Abstract: Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.
[8] Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models
Arash Gholami Davoodi, Navid Rezazadeh, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour
Main category: cs.CL
TL;DR: Top-W: A geometry-aware truncation rule for LLM decoding that uses Wasserstein distance over token embeddings to balance diversity and coherence, outperforming prior methods on reasoning and creative tasks.
Details
Motivation: Existing truncation-based samplers for LLMs are largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of token embeddings. This limits their ability to balance diversity/creativity against logical coherence in open-ended generation.
Method: Proposes Top-W, a geometry-aware truncation rule that uses Wasserstein distance defined over token-embedding geometry to keep cropped distributions close to the original. The method explicitly balances retained probability mass against entropy of the kept set, yielding a simple closed-form structure for fixed-potential subset updates. Implemented with efficient geometry-based potentials (nearest-set or k-NN) and paired with an alternating decoding routine.
Result: Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Top-W improves both accuracy-focused performance and boosts creativity under judge-based open-ended evaluation.
Conclusion: Top-W provides a theoretically grounded, geometry-aware approach to LLM decoding that better balances diversity and coherence, demonstrating significant improvements over existing methods across both reasoning and creative generation tasks.
Abstract: Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to a 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.
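The prefix-scan structure the abstract describes (sort tokens by probability, then find the best prefix with one linear pass) can be sketched with a simplified score. The objective below trades retained mass against prefix entropy but substitutes a toy scoring function for the paper's Wasserstein potential, so it illustrates the scan, not the actual Top-W rule:

```python
import numpy as np

def prefix_scan_truncate(probs, alpha=1.0, beta=0.5):
    """probs: 1-D next-token distribution. Returns indices of the kept prefix."""
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    best_k, best_score = 1, -np.inf
    mass, plogp = 0.0, 0.0
    for k, p in enumerate(probs[order], start=1):
        mass += p
        plogp += p * np.log(p + 1e-12)
        # Entropy of the renormalized prefix, updated incrementally:
        entropy = np.log(mass) - plogp / mass
        score = alpha * mass - beta * entropy  # mass-vs-entropy trade-off
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k]

probs = np.array([0.42, 0.23, 0.12, 0.08, 0.06, 0.05, 0.04])
print(prefix_scan_truncate(probs))
# -> [0]: at this trade-off the crop collapses to a single token, one of the
# two regimes the paper's closed-form analysis predicts.
```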
[9] When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding
Domenico De Cristofaro, Alessandro Vietti, Marianne Pouplier, Aleese Block
Main category: cs.CL
TL;DR: Intermediate layers in multilingual speech models encode more phonetically accurate representations than final layers; layer-wise decoding on Wav2Vec2 for low-resource Sardinian shows best phoneme error rates at intermediate layers, not final layer.
Details
Motivation: To investigate how phoneme-level predictions evolve across encoder layers in multilingual speech models, particularly for low-resource languages, and to understand why intermediate layers often outperform final layers in phonetic accuracy.
Method: Applied layer-wise decoding strategy to pretrained Wav2Vec2 model on Campidanese Sardinian (low-resource language), analyzing phoneme error rates across layers, performing fine-grained alignment analysis, and introducing concept of regressive errors.
Result: Truncating upper transformer layers improved Phoneme Error Rates, with best performance two layers before final layer; intermediate predictions better preserved segmental identity, avoided overgeneration, and reduced phonological errors; identified regressive errors where correct intermediate predictions were overwritten by final layer errors.
Conclusion: Deeper layers may generalize away from acoustic detail, supporting early-layer probing as diagnostic tool for ASR models, especially in low-resource settings where standard metrics may miss linguistically meaningful behavior.
Abstract: Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.
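One plausible way to implement layer-wise decoding with an off-the-shelf CTC model is to re-apply the trained CTC head to every intermediate hidden state. The checkpoint below is a generic English placeholder, not the paper's Sardinian model, and PER scoring against a reference transcript is left out:

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

name = "facebook/wav2vec2-base-960h"   # placeholder English checkpoint
model = Wav2Vec2ForCTC.from_pretrained(name).eval()
processor = Wav2Vec2Processor.from_pretrained(name)

def decode_per_layer(waveform, sampling_rate=16_000):
    inputs = processor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(inputs.input_values, output_hidden_states=True)
    decoded = {}
    # hidden_states[0] is the projected CNN features; 1..N are transformer layers.
    for layer, hidden in enumerate(out.hidden_states):
        logits = model.lm_head(hidden)          # reuse the trained CTC head
        ids = logits.argmax(dim=-1)
        decoded[layer] = processor.batch_decode(ids)[0]
    return decoded  # score each layer's transcript against the reference for PER
```

Comparing `decoded[layer]` across layers is what surfaces the paper's regressive errors: predictions that are correct at an intermediate layer but overwritten at the top.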
[10] Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs
Keenan Pepper, Alex McKenzie, Florin Pop, Stijn Servaes, Martin Leitgab, Mike Vaiana, Judd Rosenblatt, Michael S. A. Graziano, Diogo de Lucena
Main category: cs.CL
TL;DR: Training lightweight adapters on interpretability artifacts enables reliable self-interpretation of frozen language models, with simple scalar affine adapters outperforming more complex alternatives.
Details
Motivation: Existing self-interpretation methods for language models are unreliable due to hyperparameter sensitivity. The paper aims to develop more reliable self-interpretation techniques that work across tasks and model families without modifying the base model.
Method: Train lightweight adapters (specifically scalar affine adapters with just d_model+1 parameters) on interpretability artifacts while keeping the language model entirely frozen. The adapters learn to generate feature labels, identify topics, and decode implicit reasoning patterns.
Result: Trained adapters outperform training labels (71% vs 63% generation scoring), achieve 94% recall@1 for topic identification (vs 1% for baselines), and decode bridge entities in multi-hop reasoning. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives.
Conclusion: Self-interpretation improves with model scale and can be achieved reliably through lightweight adapters without modifying the base model, demonstrating that interpretability capabilities scale alongside model capabilities.
Abstract: Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.
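The headline adapter is strikingly small: d_model + 1 parameters, which the abstract's emphasis on the learned bias vector suggests is a single scalar scale plus a d_model-sized bias. A sketch of that parameterization (how it is wired into the frozen LM's interpretability pipeline is simplified away here):

```python
import torch
import torch.nn as nn

class ScalarAffineAdapter(nn.Module):
    """d_model + 1 trainable parameters: h -> scale * h + bias."""
    def __init__(self, d_model: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))        # 1 parameter
        self.bias = nn.Parameter(torch.zeros(d_model))  # d_model parameters

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.scale * h + self.bias

adapter = ScalarAffineAdapter(d_model=4096)
print(sum(p.numel() for p in adapter.parameters()))     # 4097
```

The finding that the bias alone accounts for 85% of the improvement means most of the work is a fixed learned offset in activation space, not a reweighting of the hidden state.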
[11] Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence
Mashrekur Rahman
Main category: cs.CL
TL;DR: Satellite foundation model embeddings are physically interpretable and can be operationalized for environmental intelligence through a retrieval-augmented generation system.
Details
Motivation: Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. There's a need to understand how these embeddings relate to real-world environmental variables and operationalize them for practical applications.
Method: 1) Comprehensive interpretability analysis of Google AlphaEarth’s 64-dimensional embeddings against 26 environmental variables using 12.1M samples across the Continental US (2017-2023) with linear, nonlinear, and attention-based methods. 2) Development of a Land Surface Intelligence system implementing retrieval-augmented generation over a FAISS-indexed embedding database, translating natural language queries into satellite-grounded assessments.
Result: Embeddings map onto specific land surface properties, with the full embedding space reconstructing most environmental variables with high fidelity (12 of 26 variables reach R² > 0.90; temperature and elevation approach R² = 0.97). The Land Surface Intelligence system achieved weighted scores of μ = 3.74 ± 0.77 (scale 1-5) in LLM-as-Judge evaluation, with strong grounding (μ = 3.93) and coherence (μ = 4.25).
Conclusion: Satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence through interpretability analysis and retrieval-augmented generation systems.
Abstract: Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017–2023), we first present a comprehensive interpretability analysis of Google AlphaEarth’s 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables exceed $R^2 > 0.90$; temperature and elevation approach $R^2 = 0.97$). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean $ΔR^2 = 0.017$) and temporally stable across all seven study years (mean inter-year correlation $r = 0.963$). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query–response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of $μ = 3.74 \pm 0.77$ (scale 1–5), with grounding ($μ = 3.93$) and coherence ($μ = 4.25$) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.
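The retrieval half of the Land Surface Intelligence system can be sketched directly with FAISS. The flat index, random stand-in vectors, per-tile metadata, and the `llm` callable below are assumptions for illustration; the real system indexes 12.1M AlphaEarth embeddings:

```python
import faiss
import numpy as np

dim = 64                                    # AlphaEarth embedding width
index = faiss.IndexFlatL2(dim)
embeddings = np.random.rand(10_000, dim).astype("float32")  # stand-in DB
index.add(embeddings)
metadata = [f"tile_{i}" for i in range(len(embeddings))]    # per-vector info

def land_surface_query(query_vec, llm, question, k=5):
    """Retrieve the k nearest embedding tiles and ground the LLM answer."""
    query = query_vec.reshape(1, -1).astype("float32")
    _, idx = index.search(query, k)
    context = ", ".join(metadata[i] for i in idx[0])
    return llm(f"Using satellite tiles [{context}], answer: {question}")
```

At 12.1M vectors a flat L2 index is still workable, though a quantized or IVF index would be a natural swap at larger scale.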
[12] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling
Eunjung Yeo, Julie M. Liss, Visar Berisha, David R. Mortensen
Main category: cs.CL
TL;DR: Multilingual phoneme-production assessment framework for dysarthric speech intelligibility evaluation using universal phone recognition with language-specific phoneme interpretation via contrastive phonological features.
Details
Motivation: Need for automated intelligibility assessment methods for neurological disorders like dysarthria that work across multiple languages, as existing approaches are either language-specific or fail to capture language-specific factors affecting intelligibility.
Method: Combines universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. Produces three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and phoneme coverage (PhonCov).
Result: Analysis across English, Spanish, Italian, and Tamil shows PER benefits from mapping+alignment, PFER from alignment alone, and PhonCov from mapping. Framework captures clinically meaningful patterns of intelligibility degradation consistent with established dysarthric speech observations.
Conclusion: Proposed multilingual framework effectively assesses dysarthric speech intelligibility across languages by integrating universal and language-specific components, providing clinically relevant metrics for automated assessment.
Abstract: The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation, using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.
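The phone-to-phoneme mapping step reduces to a nearest-neighbor search under a phonological feature distance. The tiny feature table below is invented for illustration; the paper derives its contrastive features from phonological analysis:

```python
# Toy phone-to-phoneme mapping by phonological feature distance: each
# recognized universal phone maps to the closest phoneme in one language's
# inventory. Features here are (voiced, nasal, continuant), invented for
# illustration only.
TOY_FEATURES = {
    "p": (0, 0, 0), "b": (1, 0, 0), "m": (1, 1, 0),
    "f": (0, 0, 1), "v": (1, 0, 1),
}

def feature_distance(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(TOY_FEATURES[a], TOY_FEATURES[b]))

def map_phone(phone: str, inventory: list) -> str:
    """Map a universal phone to the nearest phoneme of one language."""
    return min(inventory, key=lambda ph: feature_distance(phone, ph))

# A language lacking /v/ maps a recognized [v] to its closest legal phoneme.
print(map_phone("v", ["p", "b", "m", "f"]))  # -> 'b' (ties with 'f' at distance 1)
```

The same feature distances can then weight the sequence alignment, so substitutions between similar sounds cost less than substitutions between dissimilar ones.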
[13] Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation
Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun
Main category: cs.CL
TL;DR: ACuRL is an autonomous curriculum reinforcement learning framework that enables continual adaptation of computer-use agents to specific environments without human data, using exploration, curriculum task generation, and an automatic evaluator.
Details
Motivation: Real-world digital environments are diverse and dynamic, causing agents to frequently encounter unseen scenarios and distribution shifts. Continual learning is essential for computer-use agents, but obtaining high-quality, environment-grounded data without costly human annotation is challenging.
Method: ACuRL framework where agents first explore target environments to acquire initial experiences. A curriculum task generator then synthesizes new tasks tailored to the agent’s current capabilities using these experiences and feedback from previous iterations. CUAJudge, a robust automatic evaluator, provides reliable reward signals.
Result: The method enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Analysis shows highly sparse updates (e.g., 20% parameters) which explains effective and robust adaptation.
Conclusion: ACuRL provides an effective framework for autonomous continual adaptation of computer-use agents to specific environments without human data, demonstrating significant performance improvements and robust adaptation through sparse updates.
Abstract: Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent’s current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at https://github.com/OSU-NLP-Group/ACuRL.
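At a high level the ACuRL iteration is a loop over explore, synthesize, roll out, judge, and update. The sketch below duck-types every component; all of them are placeholders for the paper's actual modules (the `judge` callable plays the CUAJudge role):

```python
def acurl_loop(agent, env, generate_tasks, judge, update, iterations=5):
    """One agent, one target environment, zero human data."""
    experiences = agent.explore(env)          # initial grounding in the env
    feedback = []
    for _ in range(iterations):
        # Curriculum generator conditions on experience plus last feedback,
        # so task difficulty tracks the agent's current capability.
        tasks = generate_tasks(experiences, feedback)
        rollouts = [agent.attempt(env, task) for task in tasks]
        rewards = [judge(task, r) for task, r in zip(tasks, rollouts)]
        update(agent, rollouts, rewards)      # RL step on judged rollouts
        feedback = list(zip(tasks, rewards))  # steers the next curriculum
        experiences.extend(rollouts)
    return agent
```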
[14] The Alignment Bottleneck in Decomposition-Based Claim Verification
Mahmud Elahi Akhter, Federico Ruggeri, Iman Munire Bilal, Rob Procter, Maria Liakata
Main category: cs.CL
TL;DR: Decomposition of complex claims only improves verification when evidence is granular and precisely aligned to sub-claims; standard repeated claim-level evidence setups degrade performance.
Details
Motivation: To understand why structured claim decomposition yields inconsistent results for complex claim verification, focusing on overlooked bottlenecks of evidence alignment and sub-claim error profiles.
Method: Introduce new dataset of real-world complex claims with temporally bounded evidence and human-annotated sub-claim evidence spans. Evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Analyze across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact).
Result: Decomposition brings significant performance improvement only when evidence is granular and strictly aligned (SAE). Standard setups using repeated claim-level evidence (SRE) fail to improve and often degrade performance. Conservative “abstention” reduces error propagation compared to aggressive but incorrect predictions in noisy sub-claim labels.
Conclusion: Future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate label bias of sub-claim verification models, as decomposition only helps with granular evidence alignment.
Abstract: Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative “abstention” significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.
[15] Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Théo Lasnier, Wissam Antoun, Francis Kulumba, Djamé Seddah
Main category: cs.CL
TL;DR: Mechanistic analysis reveals backdoor triggers in LLMs co-opt existing language circuits rather than creating isolated circuits, with implications for detection and mitigation.
Details
Motivation: Backdoor attacks pose significant security risks for LLMs, but the internal mechanisms of how triggers operate remain poorly understood. The authors aim to provide the first mechanistic analysis of language-switching backdoors to understand how triggers function within model architectures.
Method: Used activation patching on the GAPperon model family (1B, 8B, 24B parameters) containing triggers injected during pretraining. Localized trigger formation to early layers (7.5-25% of model depth) and identified which attention heads process trigger information. Analyzed overlap between trigger-activated heads and naturally encoding language heads.
Result: Trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over top identified heads. This suggests backdoor triggers co-opt existing language components rather than forming isolated circuits.
Conclusion: Backdoor triggers leverage the model’s existing functional components, which has implications for defense: detection methods should monitor known functional components rather than searching for hidden circuits, and mitigation strategies could leverage the entanglement between injected and natural behaviors.
Abstract: Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model’s existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.
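The overlap statistic at the center of the finding is a plain Jaccard index over sets of (layer, head) identifiers; a sketch with invented head IDs:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative (layer, head) identifiers, not the paper's measured sets.
trigger_heads = {(3, 1), (4, 0), (4, 7), (5, 2)}
language_heads = {(4, 0), (4, 7), (5, 2), (6, 3)}
print(jaccard(trigger_heads, language_heads))  # 0.6, within the reported 0.18-0.66
```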
[16] When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents
Virginie Mouilleron, Théo Lasnier, Djamé Seddah
Main category: cs.CL
TL;DR: Multimodal Finance Eval benchmark reveals VLMs perform well on text/table tasks but struggle with chart interpretation and multi-turn reasoning in French financial documents.
Details
Motivation: Existing VLMs lack evaluation in specialized non-English domains like finance, where documents contain complex multimodal elements (text, tables, charts) and errors have real-world consequences.
Method: Created Multimodal Finance Eval benchmark with 1,204 expert-validated questions from real French financial documents, evaluated six open-weight VLMs using LLM-as-judge protocol across text extraction, table comprehension, chart interpretation, and multi-turn reasoning tasks.
Result: VLMs achieve 85-90% accuracy on text/table tasks but only 34-62% on chart interpretation; multi-turn dialogue shows error propagation dropping accuracy to ~50% regardless of model size.
Conclusion: Current VLMs are effective for extraction tasks but brittle in interactive financial analysis; benchmark provides challenging evaluation for high-stakes multimodal document understanding.
Abstract: Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.
[17] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, Ninghao Liu
Main category: cs.CL
TL;DR: FAC Synthesis: A diversity-driven data synthesis framework that uses Feature Activation Coverage to measure and improve data diversity in LLMs by identifying missing features and generating synthetic samples to cover them.
Details
Motivation: Existing approaches to constructing post-training data for LLMs use text-based metrics that only capture linguistic variation, providing weak signals for task-relevant features that determine downstream performance. There's a need for better diversity metrics that directly measure feature coverage.
Method: Introduces Feature Activation Coverage (FAC) to measure data diversity in an interpretable feature space. Uses sparse autoencoders to identify missing features from seed datasets, then generates synthetic samples that explicitly reflect these missing features through a framework called FAC Synthesis.
Result: The approach consistently improves both data diversity and downstream performance on various tasks including instruction following, toxicity detection, reward modeling, and behavior steering. Identifies a shared, interpretable feature space across model families (LLaMA, Mistral, Qwen) enabling cross-model knowledge transfer.
Conclusion: Provides a solid and practical methodology for data-centric optimization of LLMs, demonstrating that feature-based diversity metrics outperform traditional text-based metrics and enable effective knowledge transfer across models.
Abstract: The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
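Given a sparse autoencoder, the coverage metric and the missing-feature set that drives synthesis are short to express. The encoder interface and the activation threshold below are assumptions, not the paper's exact definitions:

```python
import numpy as np

def activated_features(sae_encode, hidden, threshold=0.0):
    """hidden: (N, d_model) LM activations; sae_encode maps them to
    (N, n_features) sparse codes. Returns the set of features that fire."""
    codes = sae_encode(hidden)
    return set(np.nonzero((codes > threshold).any(axis=0))[0].tolist())

def feature_activation_coverage(sae_encode, hidden, n_features):
    """FAC: fraction of the SAE feature space the dataset activates."""
    return len(activated_features(sae_encode, hidden)) / n_features

def missing_features(sae_encode, seed_hidden, n_features):
    """Features the seed set never activates: the targets for synthesis."""
    return set(range(n_features)) - activated_features(sae_encode, seed_hidden)
```

FAC Synthesis would then prompt a generator to produce samples that express each missing feature, and re-measure coverage on the augmented set.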
[18] When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us
Saif M. Mohammad
Main category: cs.CL
TL;DR: Analysis of anxiety patterns in US/Canadian social media data using word-anxiety lexicon, revealing daily/weekly patterns and relationships with tense and pronouns.
Details
Motivation: To understand temporal patterns of anxiety expression on social media and how different linguistic features (tense, pronouns) relate to anxiety levels.
Method: Used a lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets), examining temporal patterns, tense usage, and pronoun associations with anxiety.
Result: Found anxiety peaks at 8am (aligning with cortisol levels) and is lowest around noon; lowest on weekends, highest mid-week; highest in past tense, lowest in future tense; more anxiety with 3rd person pronouns than 1st/2nd person, and more with subject pronouns than object pronouns.
Conclusion: Social media anxiety exhibits systematic temporal patterns and varies with linguistic focus, providing insights into when and how different types of focus relate to anxiety expression.
Abstract: In this short paper, we make use of a recently created lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets) to explore when we are anxious and what insights that reveals about us. We show that our levels of anxiety on social media exhibit systematic patterns of rise and fall during the day – highest at 8am (in line with when we have high cortisol levels in the body) and lowest around noon. Anxiety is lowest on weekends and highest mid-week. We also examine anxiety in past, present, and future tense sentences to show that anxiety is highest in past tense and lowest in future tense. Finally, we examine the use of anxiety and calmness words in posts that contain pronouns to show more anxiety in posts with 3rd person pronouns (he, they) than with 1st and 2nd person pronouns, and higher anxiety in posts with subject pronouns (I, he, she, they) than with object pronouns (me, him, her, them). Overall, these trends provide valuable insights on not just when we are anxious, but also how different types of focus (future, past, self, outward, etc.) are related to anxiety.
[19] EVOKE: Emotion Vocabulary Of Korean and English
Yoonwon Jung, Hagyeong Shin, Benjamin K. Bergen
Main category: cs.CL
TL;DR: EVOKE is a parallel emotion vocabulary dataset for English and Korean with comprehensive coverage, many-to-many translations, and identification of language-specific emotion words.
Details
Motivation: To create a comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English that can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and NLP research.
Method: Systematic annotation of 1,427 Korean words and 1,399 English words, including 819 Korean and 924 English adjectives and verbs. Annotation includes multiple meanings of each word, their relationships, identification of polysemous emotion words, and emotion-related metaphors.
Result: Created the most comprehensive parallel emotion vocabulary dataset for English and Korean to date, with many-to-many translations and identification of language-specific emotion words. The dataset is publicly available on GitHub.
Conclusion: EVOKE provides a valuable resource for cross-linguistic emotion research and can serve as a practical tool for various research fields, allowing researchers to adopt different theoretical perspectives.
Abstract: This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at https://github.com/yoonwonj/EVOKE.
[20] LATA: A Tool for LLM-Assisted Translation Annotation
Baorong Huang, Ali Asiri
Main category: cs.CL
TL;DR: LLM-assisted interactive tool for Arabic-English parallel corpus construction with human-in-the-loop workflow for translation technique annotation
Details
Motivation: Standard automated alignment tools fail for structurally divergent language pairs like Arabic-English, creating a need for precision tools that balance automation with expert human judgment in the analysis of complex translation phenomena.
Method: Template-based Prompt Manager using LLMs for sentence segmentation/alignment with JSON constraints, automated preprocessing integrated into human-in-the-loop workflow with stand-off architecture for custom translation technique annotations
Result: Tool successfully balances annotation efficiency with linguistic precision for analyzing complex translation phenomena in specialized domains
Conclusion: LLM-assisted interactive approach effectively bridges gap between scalable automation and rigorous precision needed for expert translation research on divergent language pairs
Abstract: The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic–English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.
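A hypothetical sketch of the JSON-constrained prompting idea behind such a Prompt Manager; the template wording, schema, and parser are illustrative assumptions, not LATA's actual interface:

```python
import json

# Hypothetical alignment prompt. Doubled braces stay literal JSON when
# .format(arabic=..., english=...) fills in the placeholders.
ALIGN_TEMPLATE = """Segment and align the following texts.
Return ONLY JSON of the form:
{{"pairs": [{{"src": "<Arabic sentence>", "tgt": "<English sentence>"}}]}}

Arabic: {arabic}
English: {english}"""

def parse_alignment(raw: str) -> list[dict]:
    """Validate LLM output before it enters the human review queue."""
    data = json.loads(raw)  # fails loudly on malformed output
    if not isinstance(data.get("pairs"), list):
        raise ValueError("output violates the expected JSON schema")
    return data["pairs"]
```

Strict parsing of this kind is what lets automated preprocessing hand clean candidates to the human-in-the-loop refinement stage.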
[21] Neuro-Symbolic Synergy for Interactive World Modeling
Hongyu Zhao, Siyu Zhou, Haolin Yang, Zengyi Qin, Tianyi Zhou
Main category: cs.CL
TL;DR: NeSyS integrates LLMs with symbolic world models to combine semantic expressivity with logical consistency, improving accuracy and data efficiency in interactive environments.
Details
Motivation: LLMs hallucinate as world models despite strong reasoning, while symbolic WMs lack semantic expressivity. Need to bridge this gap for robust interactive environments.
Method: Neuro-Symbolic Synergy framework alternates training between LLMs and symbolic WMs, using trajectories inadequately explained by the other. Symbolic WM directly constrains LLM output probabilities, while neural WM is fine-tuned only on uncovered trajectories.
Result: Achieves 50% reduction in training data without accuracy loss. Outperforms baselines in three interactive environments (ScienceWorld, Webshop, Plancraft) for both prediction accuracy and data efficiency.
Conclusion: NeSyS successfully integrates probabilistic semantic priors of LLMs with executable symbolic rules, achieving both expressivity and robustness in world modeling.
Abstract: Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules, particularly in corner cases, is essential. In contrast, symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS’s consistent advantages over baselines in both WM prediction accuracy and data efficiency.
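The constraint step can be illustrated compactly: the symbolic WM supplies a validity mask over candidate next tokens, and the neural distribution is renormalized over rule-consistent options. The fallback behavior here is an assumption, not necessarily the paper's exact mechanism:

```python
import numpy as np

def constrain_distribution(probs: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """probs: next-token distribution from the LLM-based WM (sums to 1).
    valid: boolean mask from the symbolic WM (True = rule-consistent)."""
    masked = probs * valid
    if masked.sum() == 0:      # symbolic rules reject every candidate:
        return probs           # fall back to the unconstrained LLM prior
    return masked / masked.sum()
```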
[22] Canvas-of-Thought: Grounding Reasoning via Mutable Structured States
Lingzhuang Sun, Yuxia Zhu, Ruitong Liu, Hao Liang, Zheng Sun, Caijun Jia, Honghao He, Yuchen Wu, Siyuan Li, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang
Main category: cs.CL
TL;DR: Canvas-of-Thought (Canvas-CoT) introduces HTML Canvas as external reasoning substrate for MLLMs, enabling atomic DOM operations and rendering-based critique loops for more efficient multimodal reasoning.
Details
Motivation: Current Chain-of-Thought prompting for MLLMs treats reasoning as linear text sequences with static visual elements, making error correction cumbersome and inefficient. This approach forces models to implicitly maintain state, increasing token consumption and cognitive load, especially in high-dimensional domains like geometry and SVG design where textual CoT lacks visual guidance.
Method: Canvas-CoT uses HTML Canvas as external reasoning substrate, allowing models to perform atomic DOM-based CRUD operations. This enables in-place state revisions without disrupting context. The method integrates a rendering-based critique loop that serves as hard constraint validator, providing explicit visual feedback for complex tasks.
Result: Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
Conclusion: Canvas-CoT bridges the gap in multimodal reasoning by providing explicit visual guidance and efficient state management through external canvas substrate, enabling more precise and context-efficient reasoning for complex visual tasks.
Abstract: While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model’s reasoning precision. To bridge this gap, we introduce Canvas-of-Thought (Canvas-CoT). By leveraging an HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the “ground truth”. Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.
[23] On the Robustness of Knowledge Editing for Detoxification
Ming Dong, Shiyi Tang, Ziyan Peng, Guanyi Chen, Tingting He
Main category: cs.CL
TL;DR: KE-based detoxification evaluation reveals limitations: apparent toxicity reductions often stem from degenerate generation behaviors rather than genuine behavioral suppression, with effectiveness degrading for multiple objectives and limited language coverage.
Details
Motivation: Existing evaluations of Knowledge-Editing-based detoxification rely too heavily on automatic toxicity classifiers, assuming reduced toxicity scores reflect genuine behavioral suppression without examining robustness across different dimensions.
Method: Proposed a robustness-oriented evaluation framework examining three dimensions: optimization robustness (whether detoxification persists), compositional robustness (effectiveness with multiple unsafe behaviors), and cross-lingual robustness (effectiveness across languages). Identified pseudo-detoxification as a failure mode where apparent reductions come from degenerate generation behaviors.
Result: Found that KE-based detoxification often suffers from pseudo-detoxification, effectiveness degrades when editing multiple unsafe behaviors jointly, and cross-lingual detoxification only works with specific model-method combinations. Detoxification is robust only for certain models, limited numbers of objectives, and a subset of languages.
Conclusion: Current KE-based detoxification approaches have significant limitations and are not as robust as previously assumed. More comprehensive evaluation frameworks are needed to assess true behavioral suppression rather than relying on superficial toxicity scores.
Abstract: Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.
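A toy illustration of the kind of degeneracy check the pseudo-detoxification analysis implies: a reduced toxicity score is only trustworthy if the output has not collapsed into emptiness or repetition. The thresholds are illustrative assumptions, not values from the paper:

```python
def is_degenerate(text: str, min_tokens: int = 5, max_rep: float = 0.5) -> bool:
    """Flag completions whose low toxicity may be pseudo-detoxification."""
    toks = text.split()
    if len(toks) < min_tokens:
        return True                              # empty or near-empty output
    repetition = 1.0 - len(set(toks)) / len(toks)
    return repetition > max_rep                  # degenerate repetition loop
```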
[24] LHAW: Controllable Underspecification for Long-Horizon Tasks
George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton
Main category: cs.CL
TL;DR: LHAW is a synthetic pipeline that transforms well-specified tasks into ambiguous variants by systematically removing information across Goals, Constraints, Inputs, and Context dimensions to evaluate agent clarification behavior in long-horizon workflows.
Details
Motivation: Long-horizon workflow agents need to handle ambiguous situations requiring clarification for reliable autonomous execution, but current progress is limited by lack of scalable, task-agnostic frameworks for systematically curating and measuring ambiguity impact across custom workflows.
Method: LHAW is a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, Context) at configurable severity levels, validated through empirical agent trials rather than LLM predictions.
Result: Released 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to the taxonomy, with formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings.
Conclusion: LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
Abstract: Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.
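A minimal sketch of the underspecification transform, assuming a task is represented as a dict with one list per dimension; the field names and the uniform-removal rule are illustrative, not LHAW's pipeline:

```python
import copy
import random

DIMENSIONS = ("goals", "constraints", "inputs", "context")

def underspecify(task: dict, dim: str, severity: float, seed: int = 0) -> dict:
    """Drop a `severity` fraction of the information under one dimension."""
    assert dim in DIMENSIONS and 0.0 <= severity <= 1.0
    variant = copy.deepcopy(task)
    items = list(variant.get(dim, []))
    keep = max(0, round(len(items) * (1.0 - severity)))
    variant[dim] = random.Random(seed).sample(items, keep)
    return variant
```

Each such variant would then be classified as outcome-critical, divergent, or benign by running agent trials, rather than by asking an LLM to predict ambiguity.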
[25] When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning
Leheng Sheng, Yongtao Zhang, Wenchang Ma, Yaorui Shi, Ting Huang, Xiang Wang, An Zhang, Ke Shen, Tat-Seng Chua
Main category: cs.CL
TL;DR: GRU-Mem improves long-context reasoning in LLMs by adding gated memory updates and exit mechanisms, inspired by GRU architecture, to prevent memory explosion and unnecessary computation.
Details
Motivation: Current LLMs struggle with long-context reasoning due to performance degradation with increasing context length. Existing approaches like MemAgent use naive recurrent memory updates that suffer from memory explosion (updating on evidence-free chunks) and lack exit mechanisms, leading to unnecessary computation even after sufficient evidence is collected.
Method: Proposes GRU-Mem which incorporates two text-controlled gates: (1) update gate that controls when memory should be updated, and (2) exit gate that determines when to stop processing. Uses end-to-end reinforcement learning with two reward signals (r_update and r_exit) to train correct updating and exiting behaviors.
Result: GRU-Mem generally outperforms vanilla MemAgent with up to 400% inference speed acceleration across various long-context reasoning tasks, demonstrating both effectiveness and efficiency improvements.
Conclusion: GRU-Mem provides a more stable and efficient approach to long-context reasoning by incorporating gated mechanisms inspired by GRU architecture, addressing key limitations of previous recurrent memory approaches.
Abstract: While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation even after sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400% inference speed acceleration.
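The control flow reads naturally as a gated loop; here is a minimal sketch assuming the trained model returns its two gate decisions plus a candidate memory for each chunk (`model_step` is a placeholder, not the paper's API):

```python
def gru_mem_loop(chunks, model_step, memory: str = "") -> str:
    """Process context chunk-by-chunk with update and exit gates."""
    for chunk in chunks:
        update, exit_, candidate = model_step(chunk, memory)
        if update:               # update gate open: rewrite the memory
            memory = candidate   # evidence-free chunks leave it untouched
        if exit_:                # exit gate open: enough evidence gathered
            break                # skip the remaining chunks entirely
    return memory
```

During RL training, r_update and r_exit would reward gate decisions that keep the memory clean and stop the loop early, which is presumably where the reported speed-up comes from.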
[26] Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, Chang Su, Changxin Miao, Changyi Wan, Chao Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengting Feng, Chengyuan Yao, Chunrui Han, Dan Ma, Dapeng Shi, Daxin Jiang, Dehua Ma, Deshan Sun, Di Qi, Enle Liu, Fajie Zhang, Fanqi Wan, Guanzhe Huang, Gulin Yan, Guoliang Cao, Guopeng Li, Han Cheng, Hangyu Guo, Hanshan Zhang, Hao Nie, Haonan Jia, Haoran Lv, Hebin Zhou, Hekun Lv, Heng Wang, Heung-Yeung Shum, Hongbo Huang, Hongbo Peng, Hongyu Zhou, Hongyuan Wang, Houyong Chen, Huangxi Zhu, Huimin Wu, Huiyong Guo, Jia Wang, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiashu Lv, Jiashuo Liu, Jiayi Fu, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yang, Jie Zhou, Jieyi Hou, Jing Bai, Jingcheng Hu, Jingjing Xie, Jingwei Wu, Jingyang Zhang, Jishi Zhou, Junfeng Liu, Junzhe Lin, Ka Man Lo, Kai Liang, Kaibo Liu, Kaijun Tan, Kaiwen Yan, Kaixiang Li, Kang An, Kangheng Lin, Lei Yang, Liang Lv, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lina Chen, Luck Ma, Mengqiang Ren, Michael Li, Ming Li, Mingliang Li, Mingming Zhang, Mingrui Chen, Mitt Huang, Na Wang, Peng Liu, Qi Han, Qian Zhao, Qinglin He, Qinxin Du, Qiuping Wu, Quan Sun, Rongqiu Yang, Ruihang Miao, Ruixin Han, Ruosi Wan, Ruyan Guo, Shan Wang, Shaoliang Pang, Shaowen Yang, Shengjie Fan, Shijie Shang, Shiliang Yang, Shiwei Li, Shuangshuang Tian, Siqi Liu, Siye Wu, Siyu Chen, Song Yuan, Tiancheng Cao, Tianchi Yue, Tianhao Cheng, Tianning Li, Tingdan Luo, Wang You, Wei Ji, Wei Yuan, Wei Zhang, Weibo Wu, Weihao Xie, Wen Sun, Wenjin Deng, Wenzhen Zheng, Wuxun Xie, Xiangfeng Wang, Xiangwen Kong, Xiangyu Liu, Xiangyu Zhang, Xiaobo Yang, Xiaojia Liu, Xiaolan Yuan, Xiaoran Jiao, Xiaoxiao Ren, Xiaoyun Zhang, Xin Li, Xin Liu, Xin Wu, Xing Chen, Xingping Yang, Xinran Wang, Xu Zhao, Xuan He, Xuanti Feng, Xuedan Cai, Xuqiang Zhou, Yanbo Yu, Yang Li, Yang Xu, Yanlin Lai, Yanming Xu, Yaoyu Wang, Yeqing Shen, Yibo Zhu, Yichen Lv, Yicheng Cao, Yifeng Gong, Yijing Yang, Yikun Yang, Yin Zhao, Yingxiu Zhao, Yinmin Zhang, Yitong Zhang, Yixuan Zhang, Yiyang Chen, Yongchi Zhao, Yongshen Long, Yongyao Wang, Yousong Guan, Yu Zhou, Yuang Peng, Yuanhao Ding, Yuantao Fan, Yuanzhen Yang, Yuchu Luo, Yudi Zhao, Yue Peng, Yueqiang Lin, Yufan Lu, Yuling Zhao, Yunzhou Ju, Yurong Zhang, Yusheng Li, Yuxiang Yang, Yuyang Chen, Yuzhu Cai, Zejia Weng, Zetao Hong, Zexi Li, Zhe Xie, Zheng Ge, Zheng Gong, Zheng Zeng, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhiheng Hu, Zidong Yang, Zili Wang, Ziqi Ren, Zixin Zhang, Zixuan Wang
Main category: cs.CL
TL;DR: Step 3.5 Flash is a sparse Mixture-of-Experts model optimized for agentic intelligence with efficient inference, achieving frontier-level performance comparable to GPT-5.2 and Gemini 3.0 Pro.
Details
Motivation: The paper aims to bridge frontier-level agentic intelligence with computational efficiency, focusing on sharp reasoning and fast, reliable execution for real-world agent deployment.
Method: Uses sparse Mixture-of-Experts architecture with 196B total parameters but only 11B active parameters, optimized with interleaved sliding-window/full attention and Multi-Token Prediction. Employs scalable RL framework combining verifiable signals with preference feedback for stable self-improvement.
Result: Achieves strong performance: 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6, 88.2% on tau2-Bench, 69.0% on BrowseComp, and 51.0% on Terminal-Bench 2.0, comparable to frontier models like GPT-5.2 xHigh and Gemini 3.0 Pro.
Conclusion: Step 3.5 Flash redefines the efficiency frontier, providing a high-density foundation for deploying sophisticated agents in real-world industrial environments with computational efficiency.
Abstract: We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
[27] Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Shuo He, Lang Feng, Xin Cheng, Lei Feng, Bo An
Main category: cs.CL
TL;DR: KPO introduces Online Causal Kalman Filtering to stabilize reinforcement learning for LLMs by modeling importance sampling ratios as evolving latent states across tokens, reducing variance while preserving local structure.
Details
Motivation: Current RL methods for LLMs suffer from high-variance token-level importance sampling ratios, which destabilize policy optimization. Existing approaches use fixed sequence-level ratios or adjust tokens separately, neglecting temporal off-policy dynamics across tokens in a sequence.
Method: Proposes Online Causal Kalman Filtering for Policy Optimization (KPO). Models desired importance sampling ratio as a latent state evolving across tokens, applies Kalman filter to update this state online and autoregressively based on past tokens’ states, filtering out noise while preserving local structure-aware variation.
Result: KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts, demonstrating more stable and effective policy updates.
Conclusion: KPO addresses structural inconsistency in token-level off-policy deviation by using causal Kalman filtering to stabilize importance sampling ratios, leading to improved RL training stability and performance for LLMs.
Abstract: Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which would destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token’s IS ratio separately, thereby neglecting temporal off-policy deviation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.
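The smoothing step itself is a standard scalar Kalman filter run causally over the token sequence; the noise parameters below are illustrative, not the paper's settings:

```python
def kalman_filter_ratios(ratios, q=1e-3, r=1e-1):
    """ratios: raw per-token IS ratios in sequence order (causal).
    q: process noise (how fast the latent ratio may drift per token).
    r: observation noise (how noisy each raw ratio is)."""
    x, p = ratios[0], 1.0            # initial state estimate and variance
    filtered = [x]
    for z in ratios[1:]:
        p = p + q                    # predict: latent state random-walks
        k = p / (p + r)              # Kalman gain
        x = x + k * (z - x)          # correct toward the noisy observation
        p = (1.0 - k) * p
        filtered.append(x)
    return filtered
```

Because each update uses only past observations, the filtered ratios stay causal, matching the paper's framing of a latent state evolving across tokens.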
[28] How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Yang Chen, Xiaotong Lin, Wuliang Huang, Ziyi Gao, Xing Fu, Yu Cheng, Weiqiang Wang
Main category: cs.CL
TL;DR: Systematic study of attention masking effects on user embeddings from decoder-only LLMs, proposing Gradient-Guided Soft Masking for stable transition from causal to bidirectional attention in contrastive learning framework.
Details
Motivation: Decoder-only LLMs are increasingly used as behavioral encoders for user representation learning, but the impact of attention masking on embedding quality remains underexplored. Need to understand how different masking strategies affect user embeddings in real-world applications.
Method: Proposes Gradient-Guided Soft Masking (GGSM), a gradient-based pre-warmup technique applied before a linear scheduler that gradually opens future attention during optimization. Evaluates causal, hybrid, and bidirectional attention masks within unified contrastive learning framework trained on large-scale Alipay data with heterogeneous user behaviors.
Result: Approach yields more stable training and higher-quality bidirectional representations compared to causal, hybrid, and scheduler-only baselines across 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks. Remains compatible with decoder pretraining.
Conclusion: Masking design and training transition are crucial for adapting decoder-only LLMs for effective user representation learning. GGSM enables stable transition from causal to bidirectional attention for improved embeddings.
Abstract: Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
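The scheduler half of the method can be sketched directly: future positions receive a soft log-penalty that is relaxed from fully causal to fully bidirectional over training. The gradient-guided pre-warmup that initializes the transition is omitted here, and the additive-mask formulation is an assumption:

```python
import torch

def soft_attention_mask(seq_len: int, alpha: float) -> torch.Tensor:
    """alpha in [0, 1]: ~0 = strictly causal, 1 = fully bidirectional.
    The returned mask is added to attention scores before softmax."""
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    penalty = torch.log(torch.tensor(alpha).clamp_min(1e-9))
    return future * penalty          # 0 on/below diagonal, log(alpha) above

def linear_alpha(step: int, total_steps: int) -> float:
    """Linear scheduler that gradually opens future attention."""
    return min(1.0, step / max(1, total_steps))
```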
[29] UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory
Yongshi Ye, Hui Jiang, Feihu Jiang, Tian Lan, Yichao Du, Biao Fu, Xiaodong Shi, Qianghuai Jia, Longyue Wang, Weihua Luo
Main category: cs.CL
TL;DR: UMEM is a self-evolving agent framework that jointly optimizes memory extraction and management for LLM-based agents using semantic neighborhood modeling to improve generalization.
Details
Motivation: Existing LLM-based agent methods treat memory extraction as static while optimizing memory management, leading to poor generalization where agents accumulate instance-specific noise rather than robust memories.
Method: Proposes Unified Memory Extraction and Management (UMEM) that jointly optimizes an LLM for simultaneous memory extraction and management. Introduces Semantic Neighborhood Modeling to mitigate overfitting, optimizing with neighborhood-level marginal utility reward via GRPO to evaluate memory utility across semantically related query clusters.
Result: Extensive experiments across five benchmarks show UMEM significantly outperforms competitive baselines with up to 10.67% improvement in multi-turn interactive tasks. UMEM maintains monotonic growth during continuous evolution.
Conclusion: UMEM effectively addresses generalization issues in self-evolving agents by jointly optimizing memory extraction and management through semantic neighborhood modeling, achieving superior performance and stable evolution.
Abstract: Self-evolving memory serves as the trainable parameters for Large Language Model (LLM)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominantly optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneously extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Furthermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.
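The neighborhood-level reward admits a compact sketch: a candidate memory is scored by its average marginal utility over a cluster of semantically related queries rather than only the query it came from. `utility` is a placeholder for whatever task score the agent optimizes:

```python
def neighborhood_reward(memory, neighborhood, utility) -> float:
    """Average marginal utility of `memory` across related queries."""
    gains = [utility(q, memory) - utility(q, None) for q in neighborhood]
    return sum(gains) / len(gains)   # penalizes instance-specific noise
```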
[30] Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance
Woojin Chung, Jeonghoon Kim
Main category: cs.CL
TL;DR: Paper investigates if benchmark performance is driven by word overlap between pre-training data and evaluation datasets, finding strong correlation between word-level unigram cross-entropy and benchmark scores.
Details
Motivation: To understand what constitutes high-quality pre-training data and whether benchmark performance is primarily driven by statistical pattern overlap between pre-training corpora and evaluation datasets.
Method: Measure overlap using word-level unigram cross-entropy and word frequency statistics. Conduct controlled experiments across 10 zero-shot benchmarks, 4 pre-training datasets (8.5B to 60B tokens), and model sizes from 400M to 3B parameters.
Result: Found robust inverse relationship between word-level unigram cross-entropy and benchmark performance. Larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results.
Conclusion: Many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, and simple word-overlap statistics can predict benchmark performance.
Abstract: Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.
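The overlap statistic is simple enough to state in code: fit a smoothed unigram model on the pre-training corpus and measure its cross-entropy on the benchmark text. Add-alpha smoothing for unseen words is an assumption here:

```python
import math
from collections import Counter

def unigram_cross_entropy(corpus_tokens, bench_tokens, alpha=1.0):
    """Cross-entropy (nats/word) of benchmark text under a corpus
    unigram model; lower values mean more word overlap."""
    counts = Counter(corpus_tokens)
    vocab = set(counts) | set(bench_tokens)
    total = sum(counts.values()) + alpha * len(vocab)  # add-alpha smoothing
    logp = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return -sum(logp[w] for w in bench_tokens) / len(bench_tokens)
```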
[31] Targeted Syntactic Evaluation of Language Models on Georgian Case Alignment
Daniel Gallagher, Gerhard Heyer
Main category: cs.CL
TL;DR: Transformer models struggle with ergative case alignment in Georgian, performing worst on ergative case despite overall frequency correlation (NOM > DAT > ERG), with dataset and methodology provided for future syntactic evaluations.
Details
Motivation: The paper aims to evaluate transformer-based language models' performance on split-ergative case alignment in Georgian, a rare grammatical system, to understand how well these models handle complex syntactic phenomena in low-resource languages.
Method: Used a treebank-based approach with Grew query language to generate minimal pairs, creating a dataset of 370 syntactic tests across seven tasks. Evaluated five encoder-only and two decoder-only models using word- and sentence-level accuracy metrics on case assignment tasks.
Result: Models performed worst on ergative case assignment and best on nominative case, with performance correlating with frequency distribution (NOM > DAT > ERG). The ergative case’s specific role and lack of training data likely contribute to poor performance.
Conclusion: Transformer models struggle with rare grammatical phenomena like split-ergative alignment, especially for low-frequency cases. The methodology provides a framework for evaluating syntactic capabilities in languages with limited benchmarks.
Abstract: This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.
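The core evaluation reduces to preference judgments over minimal pairs; here is a sketch under the assumption of a `sentence_logprob` scorer (e.g. summed token log-probabilities from a decoder-only model, or pseudo-log-likelihood for encoders):

```python
def minimal_pair_accuracy(pairs, sentence_logprob) -> float:
    """pairs: (grammatical, ungrammatical) sentences differing only in
    the case form of a noun; the model passes a test item if it
    assigns higher probability to the grammatical variant."""
    correct = sum(sentence_logprob(good) > sentence_logprob(bad)
                  for good, bad in pairs)
    return correct / len(pairs)
```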
[32] Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents
Yifei Li, Weidong Guo, Lingling Zhang, Rongman Xu, Muye Huang, Hui Liu, Lijiao Xu, Yu Xu, Jun Liu
Main category: cs.CL
TL;DR: LoCoMo-Plus is a benchmark for evaluating cognitive memory in LLM-based dialogue systems, focusing on implicit constraints (user state, goals, values) that must be retained and applied across long conversations despite semantic disconnect between cues and triggers.
Details
Motivation: Existing benchmarks focus on surface-level factual recall, but realistic dialogue requires understanding and applying implicit constraints that aren't explicitly queried later. Current evaluation methods fail to capture this cognitive memory challenge.
Method: Introduces LoCoMo-Plus benchmark with cue-trigger semantic disconnect scenarios. Proposes unified evaluation framework based on constraint consistency instead of string-matching metrics or explicit task-type prompting.
Result: Experiments across diverse models, retrieval methods, and memory systems show cognitive memory remains challenging, revealing failures not captured by existing benchmarks.
Conclusion: Cognitive memory evaluation needs to move beyond surface-level recall to assess implicit constraint understanding. LoCoMo-Plus provides a framework for this, exposing limitations in current approaches.
Abstract: Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue–trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: https://github.com/xjtuleeyf/Locomo-Plus.
[33] Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling
Alaa Elsetohy, Sama Hadhoud, Haryo Akbarianto Wibowo, Chenxi Whitehouse, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
Main category: cs.CL
TL;DR: Macaron is a multilingual benchmark that tests reasoning over culturally grounded premises by factorizing reasoning types and cultural aspects across 20 languages, revealing performance gaps in multilingual LLMs.
Details
Motivation: Existing multilingual benchmarks lack culturally grounded reasoning tests: translated datasets keep English-centric scenarios while culture-first datasets lack controlled reasoning requirements.
Method: Created 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects; native annotators generated scenario-aligned English and local-language multiple-choice questions with systematically derived True/False questions.
Result: Benchmark contains 11,862 instances across 20 countries/cultural contexts, 10 scripts, and 20 languages. Zero-shot evaluation of 21 multilingual LLMs shows reasoning-mode models perform best with near-parity between English/local languages, while open-weight models degrade substantially in local languages.
Conclusion: Macaron provides a comprehensive benchmark for evaluating culturally grounded reasoning in multilingual LLMs, revealing significant performance gaps and highlighting culture-grounded mathematical/counting tasks as most challenging.
Abstract: Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.
[34] Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs
Yuming Yan, Shuo Yang, Kai Tang, Sihong Chen, Yang Zhang, Ke Xu, Dan Hu, Qun Yu, Pengfei Hu, Edith C. H. Ngai
Main category: cs.CL
TL;DR: RCPA is a novel post-training paradigm for Vision-Language Models that uses curriculum-aware progressive modulation to adapt VLMs to specialized domains while preserving general capabilities, avoiding catastrophic forgetting.
Details
Motivation: VLMs have strong general capabilities but struggle in specialized domains. Supervised fine-tuning causes catastrophic forgetting, while continual pretraining is computationally expensive. RL-based approaches fail when models lack initial domain knowledge, leading to optimization collapse.
Method: Reinforced Curriculum Pre-Alignment (RCPA) introduces curriculum-aware progressive modulation: early phase applies partial output constraints to safely expose model to new domain concepts; later phase transitions to full generation optimization to refine responses and align with domain-specific preferences.
Result: Extensive experiments across specialized domains and general benchmarks validate RCPA’s effectiveness in adapting VLMs to new domains while preserving general multimodal capabilities.
Conclusion: RCPA establishes a practical pathway toward building high-performing and domain-adaptive VLMs, balancing domain knowledge acquisition with preservation of general multimodal capabilities.
Abstract: Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model’s domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.
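The progressive modulation can be caricatured as a single schedule: early rollouts are mostly partially constrained, later ones mostly free-form. The linear shape and the function name are assumptions for illustration, not the paper's parameterization:

```python
def constraint_probability(step: int, warmup_steps: int) -> float:
    """Probability that a rollout applies partial output constraints
    rather than full free generation at this training step."""
    return max(0.0, 1.0 - step / max(1, warmup_steps))
```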
[35] Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM
Haotian Sheng, Heyong Wang, Ming Hong, Hongman He, Junqiu Liu
Main category: cs.CL
TL;DR: LSCL is a deep learning method that helps black-box LLMs express their knowledge boundaries by mapping question-answer-probability inputs to internal knowledge states, addressing hallucination issues.
Details
Motivation: Hallucination in LLMs stems from lack of awareness of their internal knowledge boundaries. Existing methods focus on white-box LLMs, leaving black-box LLMs (API-only access) largely unexplored for knowledge boundary expression.
Method: Proposes LSCL (LLM-Supervised Confidence Learning) using knowledge distillation framework. Takes input question, output answer, and token probability from black-box LLM as inputs to construct mapping between inputs and model’s internal knowledge state, enabling quantification of knowledge boundaries.
Result: Experiments on diverse public datasets with multiple black-box LLMs show LSCL effectively assists black-box LLMs in accurately expressing knowledge boundaries, significantly outperforming baselines on accuracy and recall metrics. Also proposes adaptive alternative for LLMs without token probability access.
Conclusion: LSCL successfully addresses knowledge boundary expression for black-box LLMs, reducing hallucination by enabling models to better understand and express their internal knowledge limitations.
Abstract: Large Language Models (LLMs) have achieved remarkable success; however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs’ lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs, which offer only API access without revealing internal parameters, largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model’s internal knowledge state, enabling the quantification and expression of the black-box LLM’s knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.
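A hedged sketch of the mapping LSCL learns, with the featurization reduced to a text embedding plus summary statistics of the answer's token log-probabilities; the architecture is an illustrative assumption, not the paper's:

```python
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Maps black-box observables to P(question lies within the LLM's
    knowledge boundary)."""
    def __init__(self, text_dim: int):
        super().__init__()
        # +2 for the mean and min token log-probability of the answer.
        self.net = nn.Sequential(
            nn.Linear(text_dim + 2, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, text_emb, mean_logp, min_logp):
        x = torch.cat([text_emb, mean_logp.unsqueeze(-1),
                       min_logp.unsqueeze(-1)], dim=-1)
        return torch.sigmoid(self.net(x)).squeeze(-1)
```

The adaptive alternative for APIs without token probabilities would presumably drop the two log-probability features and rely on text features alone.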
[36] Beyond Confidence: The Rhythms of Reasoning in Generative Models
Deyuan Liu, Zecheng Wang, Zhanyue Qin, Zhiying Tu, Dianhui Chu, Dianbo Sui
Main category: cs.CL
TL;DR: A novel metric called Token Constraint Bound (δ_TCB) measures LLM robustness to input perturbations by quantifying how much internal state can change before dominant next-token predictions shift.
Details
Motivation: LLMs show sensitivity to slight input variations, harming reliability. Traditional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can hide underlying resilience to perturbations.
Method: Introduces Token Constraint Bound (δ_TCB), a metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. It’s intrinsically linked to output embedding space geometry and provides insights into model stability.
Result: Experiments show δ_TCB correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation.
Conclusion: δ_TCB offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
Abstract: Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM’s internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model’s internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.
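For a linear unembedding (logits = W @ h), the geometric link the abstract alludes to has a closed form: the smallest perturbation of the final hidden state that flips the argmax token equals the margin to the nearest decision boundary. Treat this as an illustrative derivation consistent with the description, not the paper's exact metric:

```python
import numpy as np

def token_constraint_bound(W: np.ndarray, h: np.ndarray) -> float:
    """W: (vocab, hidden) output embedding matrix; h: (hidden,) state.
    Returns the minimum L2 perturbation of h that changes the argmax,
    i.e. min over j != top of (logit_top - logit_j) / ||W_top - W_j||."""
    logits = W @ h
    top = int(np.argmax(logits))
    gaps = logits[top] - logits                  # >= 0, zero at `top`
    dists = np.linalg.norm(W[top] - W, axis=1)   # boundary normal norms
    mask = np.arange(len(logits)) != top
    return float(np.min(gaps[mask] / (dists[mask] + 1e-12)))
```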
[37] I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification
Hardi Garari, Hossein Hassani
Main category: cs.CL
TL;DR: Native Language Identification for Hewlêri subdialect of Sorani Kurdish using neural networks on speech data, achieving 95.92% accuracy with RNN on 5-second audio segments.
Details
Motivation: Addressing the research gap in Native Language Identification (NLI) for dialects and subdialects, particularly in less-resourced languages like Kurdish, focusing specifically on the Hewlêri subdialect of Sorani Kurdish.
Method: Collected 24 hours of speech from 40 speakers, created three neural network models (ANN, CNN, RNN), conducted 66 experiments with various time-frames (1-60 seconds), sampling techniques, and cross-validation.
Result: RNN model achieved highest accuracy of 95.92% for 5-second audio segmentation using 80:10:10 data splitting. Created first speech dataset for NLI on Hewlêri subdialect.
Conclusion: Successfully demonstrated NLI for Hewlêri subdialect using neural networks, with RNN performing best. The dataset enables future research on dialect identification in under-resourced languages.
Abstract: Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that typically determines the native language of an author through their writing or a speaker through their speaking. It has various applications in different areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI regarding two different languages, such as English and German, the literature indicates a significant gap regarding NLI for dialects and subdialects. The gap becomes wider in less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It aims to investigate the NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the Capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three Neural Network-based models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), which were evaluated through 66 experiments, covering various time-frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% for 5-second audio segmentation, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect in the Sorani Kurdish dialect, which can be of benefit to various research areas.
[38] C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution
Binwei Yan, Yifei Fu, Mingjian Zhu, Hanting Chen, Mingxuan Yuan, Yunhe Wang, Hailin Hu
Main category: cs.CL
TL;DR: C-MOP is a cluster-based momentum optimization framework for automatic prompt optimization in LLMs that stabilizes training via boundary-aware contrastive sampling and momentum-guided semantic clustering.
Details
Motivation: Existing automatic prompt optimization methods suffer from noisy and conflicting update signals, which hinders stable optimization and effective prompt evolution.
Method: Uses Boundary-Aware Contrastive Sampling (BACS) to mine tripartite features (Hard Negatives, Anchors, Boundary Pairs) and Momentum-Guided Semantic Clustering (MGSC) with textual momentum mechanism and temporal decay to distill persistent consensus from fluctuating gradients (see the sketch below).
Result: Outperforms SOTA baselines like PromptWizard and ProTeGi by average gains of 1.58% and 3.35%, respectively, enabling a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM.
Conclusion: C-MOP effectively stabilizes prompt optimization and drives precise prompt evolution through cluster-based momentum optimization techniques.
Abstract: Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features (Hard Negatives, Anchors, and Boundary Pairs) to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at https://github.com/huawei-noah/noah-research/tree/master/C-MOP.
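The textual momentum idea has a simple numeric analogue: embed each iteration's textual "gradient" (critique), then blend the history with temporal decay so persistent feedback dominates one-off, conflicting signals. The sketch below is that analogue only; MGSC itself operates on text and clusters, which this does not reproduce.

```python
import numpy as np

def momentum_consensus(gradient_embs, decay=0.8):
    """Exponentially decayed blend of critique embeddings (oldest first):
    recurring feedback accumulates, transient noise washes out."""
    m = np.zeros_like(gradient_embs[0])
    for g in gradient_embs:
        m = decay * m + (1.0 - decay) * g
    return m / (np.linalg.norm(m) + 1e-9)   # consensus direction

history = [np.random.randn(64) for _ in range(10)]   # placeholder embeddings
consensus = momentum_consensus(history)
```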
[39] Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis
Zhiyin Tan, Jennifer D’Souza
Main category: cs.CL
TL;DR: LLMs struggle with structural evidence extraction for meta-analyses, failing at relational binding between variables, methods, and effect sizes despite good entity recognition.
Details
Motivation: To evaluate whether current LLMs can meet the structural requirements of systematic reviews and meta-analyses, which require preserving complex relationships between study elements rather than just recognizing isolated entities.
Method: Proposed a diagnostic framework with schema-constrained queries of increasing relational/numerical complexity. Evaluated two state-of-the-art LLMs on a manually curated corpus across five scientific domains using both per-document and long-context multi-document inputs.
Result: LLMs perform moderately on single-property queries but degrade sharply with relational tasks requiring binding between variables, roles, methods, and effect sizes. Full meta-analytic association tuples extracted with near-zero reliability. Long-context inputs worsen performance. Downstream aggregation amplifies errors.
Conclusion: Current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis, with failures stemming from systematic structural breakdowns rather than entity recognition errors.
Abstract: Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (https://github.com/zhiyintan/LLM-Meta-Analysis).
[40] The CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems
Zhuohan Xie, Rania Elbadry, Fan Zhang, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Dimitar Dimitrov, Vanshikaa Jani, Yuyang Dai, Jiahui Geng, Yuxia Wang, Ivan Koychev, Veselin Stoyanov, Preslav Nakov
Main category: cs.CL
TL;DR: FinMMEval 2026 introduces the first multilingual multimodal evaluation framework for financial LLMs with three tasks: Financial Exam QA, Multilingual Financial QA, and Financial Decision Making.
Details
Motivation: Existing financial NLP benchmarks are largely monolingual, text-only, and limited to narrow subtasks, creating a gap for comprehensive multilingual multimodal evaluation of financial LLMs.
Method: Proposes three interconnected tasks spanning financial understanding, reasoning, and decision-making across diverse languages and modalities, with publicly released datasets and evaluation resources.
Result: Establishes a comprehensive evaluation suite that measures models’ ability to reason, generalize, and act across languages and modalities in financial contexts.
Conclusion: The framework aims to promote development of robust, transparent, and globally inclusive financial AI systems through reproducible research.
Abstract: We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models’ ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.
[41] SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora
Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi
Main category: cs.CL
TL;DR: Ultra-fast search algorithm for trillion-scale natural language corpora with semantic variation handling, achieving sub-0.3 second search times using suffix arrays and corpus-aware pruning.
Details
Motivation: Need for efficient search over massive natural language corpora (trillion-scale) while handling semantic variations (substitutions, insertions, deletions) that existing methods struggle with due to combinatorial explosion.
Method: Uses suffix arrays for string matching with disk-aware design for fast exact lookup and dynamic corpus-aware pruning to suppress exponential search space growth by leveraging natural language statistical properties (see the sketch below).
Result: Achieves significantly lower search latency (<0.3s) than existing methods (infini-gram, infini-gram mini, SoftMatcha) on FineWeb-Edu (1.4T tokens), identifies previously undetected benchmark contamination, and provides multilingual demo.
Conclusion: Proposed method enables practical trillion-scale corpus search with semantic variation handling, offering applications in benchmark contamination detection and multilingual search.
Abstract: We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.
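The exact-lookup primitive the system builds on is easy to state: sort all suffix start positions, then binary-search the pattern. A minimal in-memory sketch follows (Python 3.10+ for bisect's key=); the disk-aware layout and soft, pruned query expansion that make this work at trillion-token scale are the paper's contribution and are not reproduced here.

```python
import bisect

def build_suffix_array(tokens):
    # O(n^2 log n) toy construction; real systems use linear-time builds
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_occurrences(tokens, sa, pattern):
    # Truncating each suffix to the pattern length keeps keys sorted,
    # so all suffixes starting with `pattern` form a contiguous run.
    key = lambda i: tokens[i:i + len(pattern)]
    lo = bisect.bisect_left(sa, pattern, key=key)
    hi = bisect.bisect_right(sa, pattern, key=key)
    return hi - lo

corpus = "a b a b c a b".split()
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, ["a", "b"]))   # -> 3
```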
[42] Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability
Kacper Dudzic, Karolina Drożdż, Maciej Wodziński, Anastazja Szuła, Marcin Moskalewicz
Main category: cs.CL
TL;DR: Study integrates phenomenological interviews and computational analysis to examine temporal experience in autism, finding unpredictability as core feature reflected in negative temporal lexicon and narrative flow patterns.
Details
Motivation: To bridge the gap between phenomenological and computational approaches in autism research, overcome limitations of deficit-based models, small qualitative samples, and lack of phenomenological grounding in computational studies.
Method: Three integrated methodologies: Study A - phenomenological interviews with autistic individuals using Transdiagnostic Assessment of Temporal Experience; Study B - computational analysis of autobiographical corpus of autistic narratives; Study C - replication study using narrative flow measures to assess phenomenological authenticity.
Result: Interviews revealed unpredictability as most significant difference; computational analysis showed autistic narratives have more negatively valenced temporal lexicon, especially “Immediacy & Suddenness” category; narrative flow analysis found autistic narratives resemble autobiographical stories more than imaginary ones.
Conclusion: Temporal challenges in autism primarily concern lived unpredictability stemming from experience contents rather than narrative construction, demonstrating value of integrating phenomenological and computational approaches.
Abstract: Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the “Immediacy & Suddenness” category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative. The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.
[43] Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro Correia, Christos Louizos, Shiwei Liu, Lu Yin
Main category: cs.CL
TL;DR: SOAR is a training-free decoding algorithm for Diffusion Language Models that dynamically adapts unmasking decisions based on model uncertainty to improve reasoning performance while maintaining efficiency.
Details
Motivation: Standard greedy decoding in Diffusion Language Models can lead to suboptimal unmasking orders, especially for reasoning-heavy tasks, because local confidence decisions may lock the model into poor sequences.
Method: SOAR adapts decoding behavior to model uncertainty: when confidence is low, it widens search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses search and decodes many positions in parallel to reduce denoising iterations (see the sketch below).
Result: SOAR improves generation quality across mathematical reasoning (GSM8K) and code generation (MBPP, HumanEval) benchmarks on Dream-7B and LLaDA-8B models while maintaining competitive inference speed.
Conclusion: SOAR offers a practical way to balance quality and efficiency in DLM decoding through uncertainty-aware adaptive search, providing training-free improvements for reasoning tasks.
Abstract: Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model’s uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
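The decision rule itself is compact. The sketch below shows one way to switch between searching and accelerating from per-position confidences; the thresholds, beam width, and parallel budget are illustrative stand-ins, not the authors' tuned values.

```python
def soar_step(confidences, low=0.5, high=0.9, beam=4, parallel=8):
    """confidences: {masked_position: model confidence}.
    Returns the decoding mode and a budget for this denoising step."""
    c_max = max(confidences.values())
    if c_max < low:                 # unsure: branch over alternative unmaskings
        return "search", beam
    if c_max > high:                # sure: commit many positions at once
        k = sum(1 for c in confidences.values() if c > high)
        return "parallel", min(k, parallel)
    return "greedy", 1              # default: unmask the single best position

mode, budget = soar_step({3: 0.97, 7: 0.95, 12: 0.4})   # -> ("parallel", 2)
```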
[44] LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules
Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer
Main category: cs.CL
TL;DR: LoRA-Squeeze improves standard LoRA by starting with high-rank adapters then compressing them post-hoc or dynamically during training, achieving better performance with lower ranks.
Details
Motivation: Standard LoRA faces challenges with optimal rank selection, rank-specific hyperparameters, and deployment complexity of heterogeneous-rank modules. The paper proposes that learning expressive high-rank solutions then compressing them is better than learning constrained low-rank solutions directly.
Method: Fine-tune with deliberately high source rank, reconstruct the full weight update matrix, then use Randomized SVD to create compressed LoRA modules at lower target rank. Includes both post-hoc compression and gradual in-tuning rank annealing variants (see the sketch below).
Result: Extensive experiments across 13 text and 10 vision-language tasks show post-hoc compression produces lower-rank adapters that outperform those trained directly at target rank. Gradual rank annealing variant consistently achieves best LoRA size-performance trade-off.
Conclusion: LoRA-Squeeze provides an efficient methodology to improve standard LoRA learning by changing ranks either post-hoc or dynamically during training, addressing key limitations of current LoRA approaches.
Abstract: Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
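The post-hoc variant is a few lines of linear algebra: rebuild the update from the high-rank factors, then refactor it at the target rank with randomized SVD. The sketch below uses torch.svd_lowrank and omits LoRA's scaling conventions (alpha/rank), which a faithful implementation would carry through.

```python
import torch

def squeeze_lora(A, B, target_rank):
    """A: [r_src, in], B: [out, r_src] -> new factors at target_rank."""
    dW = B @ A                                    # reconstruct the full update
    U, S, V = torch.svd_lowrank(dW, q=target_rank)
    B_new = U * S.sqrt()                          # [out, r_tgt]
    A_new = (V * S.sqrt()).T                      # [r_tgt, in]
    return A_new, B_new                           # B_new @ A_new ~ dW

A = torch.randn(64, 512)      # source rank 64
B = torch.randn(512, 64)
A16, B16 = squeeze_lora(A, B, target_rank=16)
```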
[45] Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study
Artsvik Avetisyan, Sachin Kumar
Main category: cs.CL
TL;DR: Study analyzes spontaneous speech transcripts from dementia patients using linguistic representations to identify interpretable markers of cognitive decline through machine learning and statistical validation.
Details
Motivation: To identify linguistically interpretable markers of dementia from spontaneous speech that can support transparent and clinically grounded screening approaches, as subtle language changes are among the earliest indicators of cognitive decline.
Method: Analyzed DementiaBank Pitt Corpus transcripts using three linguistic representations: raw cleaned text, POS-enhanced (lexical+grammatical), and POS-only syntactic representation. Used logistic regression and random forest models with transcript-level and subject-level cross-validation. Examined interpretability through feature importance and validated with Mann-Whitney U tests and Cliff’s delta effect sizes (see the sketch below).
Result: Models achieved stable performance across representations, with syntactic/grammatical features retaining strong discriminative power even without lexical content. Subject-level evaluation showed consistent results for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning with ML feature importance.
Conclusion: Abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. Combining interpretable ML with statistical validation supports using linguistically grounded features for transparent and reliable language-based cognitive screening.
Abstract: Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff’s delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings. Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.
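The validation step pairs a rank test with an effect size that falls out of the same statistic: Cliff's delta equals 2U/(n1*n2) - 1. A minimal sketch with placeholder feature values:

```python
from scipy.stats import mannwhitneyu

def compare_groups(group_a, group_b):
    """Mann-Whitney U test plus Cliff's delta derived from U."""
    u, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
    delta = 2.0 * u / (len(group_a) * len(group_b)) - 1.0   # in [-1, 1]
    return p, delta

# e.g., lexical-diversity scores per transcript (placeholder numbers)
p, d = compare_groups([0.31, 0.28, 0.35, 0.40], [0.52, 0.47, 0.55, 0.50])
```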
[46] Language Model Inversion through End-to-End Differentiation
Kevin Yandoka Denamganaï, Kartic Subr
Main category: cs.CL
TL;DR: DLM: Differentiable Language Models for prompt inversion via gradient-based optimization by treating LMs as functions on token distributions rather than discrete tokens.
Details
Motivation: Despite growing research on Language Models, there's little work on invertibility: determining what input prompts would yield a desired target output sequence, which remains an open problem.
Method: Formulate prompt inversion as gradient-based optimization, propose algorithm for end-to-end differentiability of frozen LMs by viewing them as functions on sequences of distributions over tokens (rather than discrete tokens), then find optimized prompts via gradient descent (see the sketch below).
Result: DLM-powered inversion can reliably and efficiently optimize prompts of lengths 10 and 80 for targets of length 20 for several white-box LMs out-of-the-box.
Conclusion: The proposed differentiable approach enables effective prompt inversion for language models through gradient-based optimization on token distributions.
Abstract: Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given a LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).
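The central trick is the relaxation: keep a matrix of free logits over the vocabulary, feed the expected embedding through the frozen LM, and descend on target likelihood. The sketch below assumes a generic lm_f(embeddings) -> per-position next-token logits interface rather than any specific library API.

```python
import torch
import torch.nn.functional as F

def invert_prompt(lm_f, embed_matrix, target_embeds, target_ids,
                  prompt_len, steps=200, lr=0.1):
    """Optimize a soft prompt so the frozen LM continues into target_ids."""
    logits = torch.randn(prompt_len, embed_matrix.size(0), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=-1)          # distributions over tokens
        soft = probs @ embed_matrix                # expected prompt embeddings
        seq = torch.cat([soft, target_embeds], dim=0)
        pred = lm_f(seq)                           # [len, V] next-token logits
        loss = F.cross_entropy(pred[prompt_len - 1:-1], target_ids)
        opt.zero_grad(); loss.backward(); opt.step()
    return logits.argmax(dim=-1)                   # discretize the prompt
```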
[47] Embedding Inversion via Conditional Masked Diffusion Language Models
Han Xiao
Main category: cs.CL
TL;DR: Embedding inversion using conditional masked diffusion for parallel token recovery with high accuracy and similarity
Details
Motivation: Current embedding inversion methods often use sequential autoregressive generation which is slow. The paper aims to develop a faster, parallel approach to recover original tokens from embeddings using diffusion models.
Method: Frames embedding inversion as conditional masked diffusion, using a masked diffusion language model conditioned on target embeddings via adaptive layer normalization. Requires only 8 forward passes through a 78M parameter model without access to the target encoder (see the sketch below).
Result: On 32-token sequences across three embedding models, achieves 81.3% token accuracy and 0.87 cosine similarity.
Conclusion: Conditional masked diffusion provides an efficient parallel alternative to sequential autoregressive methods for embedding inversion, achieving strong performance with minimal computation.
Abstract: We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves 81.3% token accuracy and 0.87 cosine similarity.
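Conditioning "via adaptive layer normalization" typically means the conditioning vector predicts a per-channel scale and shift applied after normalization. A minimal module in that style; the dimensions are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """LayerNorm whose scale/shift are predicted from a conditioning vector."""
    def __init__(self, hidden_dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * hidden_dim)

    def forward(self, x, cond):        # x: [B, T, H], cond: [B, C]
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

layer = AdaLN(hidden_dim=256, cond_dim=768)   # e.g., a 768-d target embedding
out = layer(torch.randn(4, 32, 256), torch.randn(4, 768))
```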
[48] Conversational Behavior Modeling Foundation Model With Multi-Level Perception
Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli
Main category: cs.CL
TL;DR: A framework modeling human conversation as multi-level perception using Graph-of-Thoughts (GoT) to predict communicative intents and speech acts, enabling full-duplex interactive systems with interpretable reasoning.
Details
Motivation: Human conversation involves implicit chains of thoughts manifested as timed speech acts. Capturing this perceptual pathway is crucial for building natural full-duplex interactive systems that can understand and respond in real-time.
Method: Introduces a multi-level perception framework with hierarchical labeling scheme predicting high-level communicative intents and low-level speech acts. Uses Graph-of-Thoughts (GoT) to model causal and temporal dependencies, trained on a high-quality corpus of event-rich dialogue data with human annotations.
Result: The framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems, validated on both synthetic and real duplex dialogues.
Conclusion: The GoT framework effectively models the intent-to-action pathway in conversation, enabling transformers to forecast speech acts, generate justifications, and dynamically refine reasoning for natural full-duplex interactive systems.
Abstract: Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.
[49] SteuerLLM: Local specialized large language model for German tax law analysis
Sebastian Wind, Jeta Sopa, Laurin Schmid, Quirin Jackl, Sebastian Kiefer, Fei Wu, Martin Mayr, Harald Köstler, Gerhard Wellein, Andreas Maier, Soroosh Tayebi Arasteh
Main category: cs.CL
TL;DR: SteuerEx is the first open benchmark for German tax law exams, and SteuerLLM is a domain-adapted 28B parameter model that outperforms general-purpose LLMs on legal reasoning tasks requiring exact statutory citation and structured argumentation.
Details
Motivation: LLMs struggle with domains requiring strict formal rules, precise terminology, and legally binding structures like tax law, which demands exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes.
Method: Created SteuerEx benchmark with 115 expert-validated German tax law exam questions across six domains. Developed SteuerLLM using domain adaptation on large-scale synthetic data generated from authentic exam materials via controlled retrieval-augmented pipeline.
Result: SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and often larger systems, showing domain-specific data and architectural adaptation are more important than parameter scale for legal reasoning tasks.
Conclusion: Domain-specific adaptation with authentic data is crucial for legal AI performance. The open release of benchmark data, training datasets, model weights, and evaluation code supports reproducible research in domain-specific legal artificial intelligence.
Abstract: Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at https://steuerllm.i5.ai.fau.de.
[50] DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning
Yicheng Chen, Zerun Ma, Xinchen Xie, Yining Li, Kai Chen
Main category: cs.CL
TL;DR: DataChef-32B automates end-to-end data recipe generation for LLM adaptation using reinforcement learning with proxy rewards to predict downstream performance.
Details
Motivation: Current LLM data recipe design is manual and labor-intensive despite automation of individual steps. There's a need to automate the overall data recipe generation process to reduce human expertise requirements and enable self-evolving AI systems.
Method: Formulates end-to-end data recipe generation as: given target benchmark and available data sources, model outputs complete data recipe. Uses DataChef-32B with online reinforcement learning and proxy reward that predicts downstream performance for candidate recipes.
Result: Across six held-out tasks, DataChef-32B produces practical recipes reaching comparable performance to human-curated ones. Notably adapts Qwen3-1.7B-Base to math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B.
Conclusion: Demonstrates feasibility of automating LLM data recipe generation, enabling practical adaptation and shedding light on self-evolving AI systems. Shows promise for reducing human effort in LLM training pipeline design.
Abstract: In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
[51] Can Large Language Models Make Everyone Happy?
Usman Naseem, Gautam Siddharth Kashyap, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Rafiq Ali
Main category: cs.CL
TL;DR: MisAlign-Profile benchmark measures cross-dimensional misalignment trade-offs in LLMs across safety, value, and cultural dimensions using a unified dataset and profiling approach.
Details
Motivation: Existing benchmarks evaluate safety, value, and cultural dimensions in isolation, lacking insight into their interactions and trade-offs. Current approaches fail to systematically characterize how LLMs navigate conflicts between these dimensions in real-world settings where they must co-occur.
Method: Introduced MisAlign-Profile benchmark with MISALIGNTRADE dataset covering 112 normative domains (14 safety, 56 value, 42 cultural). Used semantic classification (object, attribute, relations misalignment) with Gemma-2-9B-it and Qwen3-30B-A3B-Instruct-2507, with SimHash-based fingerprinting (see the sketch below). Created aligned/misaligned response pairs via two-stage rejection sampling.
Result: Benchmarking revealed 12%-34% misalignment trade-offs across dimensions in general-purpose, fine-tuned, and open-weight LLMs, showing significant failure rates when dimensions must be satisfied simultaneously.
Conclusion: MisAlign-Profile provides a unified framework for measuring cross-dimensional misalignment trade-offs, revealing substantial gaps in current LLMs’ ability to simultaneously satisfy safety, value, and cultural expectations.
Abstract: Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK, based on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset across a taxonomy of 112 normative domains, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified with one of three orthogonal semantic types (object, attribute, or relations misalignment) using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507 with SimHash-based fingerprinting for deduplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing 12%-34% misalignment trade-offs across dimensions.
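SimHash fingerprinting, used above to deduplicate expanded prompts, is a standard construction: hash each token, add or subtract each bit, and keep the signs. Near-duplicate texts land within a small Hamming distance (the threshold below is illustrative).

```python
import hashlib

def simhash(text, bits=64):
    """Standard 64-bit SimHash over whitespace tokens."""
    acc = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(acc) if v > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

near_dup = hamming(simhash("the model refused politely"),
                   simhash("the model politely refused")) <= 3
```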
[52] Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Furong Huang, Dinesh Manocha, Amrit Singh Bedi
Main category: cs.CL
TL;DR: SafeThink: Lightweight inference-time defense that monitors reasoning traces with safety reward models and injects corrective prefixes when safety thresholds are violated to reduce jailbreak attacks while preserving reasoning performance.
Details
Motivation: RL-based post-training for chain-of-thought reasoning improves MLRM reasoning but degrades safety alignment and increases jailbreak success rates. Need to protect multimodal reasoning models from safety vulnerabilities introduced during optimization.
Method: SafeThink treats safety recovery as satisficing constraint rather than maximization objective. Monitors evolving reasoning trace with safety reward model, conditionally injects optimized short corrective prefix (“Wait, think safely”) only when safety threshold is violated. Intervenes in first 1-3 reasoning steps to redirect generation (see the sketch below).
Result: Reduces attack success rates by 30-60% across six open-source MLRMs and four jailbreak benchmarks (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%).
Conclusion: Safety recovery is often only a few steering steps away. SafeThink provides effective lightweight defense against jailbreak attacks in multimodal reasoning models without compromising reasoning capabilities.
Abstract: Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix (“Wait, think safely”) only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
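The control loop is simple to state: score the trace after each step, and only when it dips below threshold, restart the step from the corrective prefix. The sketch below uses assumed model and reward-model interfaces and an illustrative threshold; it shows the satisficing shape of the method, not the authors' code.

```python
CORRECTIVE_PREFIX = "Wait, think safely"

def safethink_generate(model, safety_rm, prompt, threshold=0.5, max_steps=32):
    trace = []
    for _ in range(max_steps):
        step = model.next_reasoning_step(prompt, trace)          # assumed API
        if safety_rm.score(prompt, trace + [step]) < threshold:
            # satisficing: steer only on violation, re-decoding from the prefix
            step = model.next_reasoning_step(prompt, trace,
                                             prefix=CORRECTIVE_PREFIX)
        trace.append(step)
        if model.is_done(trace):                                 # assumed API
            break
    return trace
```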
[53] TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection
Géraud Faye, Wassila Ouerdane, Guillaume Gadek, Céline Hudelot
Main category: cs.CL
TL;DR: TEG (Text Encoding with Graph) is a novel document representation method that combines text and structured graph information from knowledge bases to improve misinformation detection performance compared to language models alone.
Details
Motivation: Misinformation detection benefits from external knowledge integration similar to manual fact-checking. Current approaches using language models alone may lack structured knowledge incorporation that could enhance detection accuracy.
Method: TEG processes documents by extracting structured information as a graph from knowledge bases, then encodes both the text and graph for classification (see the sketch below). TEGRA extends this by integrating domain-specific knowledge.
Result: Extensive experiments show the hybrid representation enhances misinformation detection performance compared to using language models alone. TEGRA further improves classification accuracy in most cases.
Conclusion: Combining text with structured graph representations from knowledge bases provides effective hybrid representations for misinformation detection, with domain-specific knowledge integration offering additional benefits.
Abstract: Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.
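At classification time the hybrid representation reduces to a fusion of two vectors. A minimal sketch with placeholder encoders and sizes; the paper's actual text and graph encoders are not specified in the summary above.

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Fuse a text embedding and a graph embedding for classification."""
    def __init__(self, text_dim=768, graph_dim=128, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, text_emb, graph_emb):
        return self.head(torch.cat([text_emb, graph_emb], dim=-1))

clf = HybridClassifier()
logits = clf(torch.randn(4, 768), torch.randn(4, 128))   # placeholder inputs
```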
[54] Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano
Main category: cs.CL
TL;DR: SFT on chain-of-thought data benefits from repetition: training for more epochs on smaller datasets outperforms single-epoch training on larger datasets, with token accuracy signaling when repetition has saturated.
Details
Motivation: Standard ML intuition suggests more unique training samples yield better generalization, but the authors find counterintuitive results where repetition during SFT on reasoning tasks actually improves performance.
Method: Conducted experiments with Olmo3-7B on reasoning benchmarks (AIME'24/25 and GPQA), comparing different training regimes: many epochs on small datasets vs. few epochs on large datasets under fixed update budgets (see the sketch below).
Result: Training for 128 epochs on 400 samples outperformed 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. Token accuracy reliably signals when repetition has saturated.
Conclusion: Repetition advantage in SFT provides practical approach for reasoning tasks - scaling epochs with token accuracy as stopping criterion can replace expensive undirected data scaling, presenting an open problem for understanding LLM training dynamics.
Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
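The stopping criterion is cheap to implement during SFT: track token accuracy on the training set and stop scaling epochs once it plateaus near full memorization. A sketch, with illustrative patience and tolerance values:

```python
import torch

def token_accuracy(logits, labels, ignore_index=-100):
    """Fraction of supervised tokens the model already predicts correctly."""
    mask = labels != ignore_index
    return (logits.argmax(dim=-1)[mask] == labels[mask]).float().mean().item()

def should_stop(acc_history, patience=3, eps=1e-3):
    """Stop once per-epoch gains in token accuracy fall below eps."""
    if len(acc_history) <= patience:
        return False
    return acc_history[-1] - acc_history[-1 - patience] < eps
```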
[55] Structured Sentiment Analysis as Transition-based Dependency Graph Parsing
Daniel Fernández-González
Main category: cs.CL
TL;DR: First transition-based method for structured sentiment analysis using dependency graph parsing with Pointer Network architecture, achieving state-of-the-art performance with quadratic time complexity.
Details
Motivation: While structured sentiment analysis (SSA) has been approached as dependency graph parsing, existing methods use graph-based models despite transition-based algorithms excelling in dependency parsing tasks in terms of accuracy and efficiency. The authors aim to develop the first transition-based method for SSA.
Method: Proposes a transition-based method for SSA as dependency graph parsing, using a left-to-right transition system that incrementally generates graph structures containing opinions. Implements the model using a Pointer Network architecture as backbone (see the sketch below).
Result: The model achieves best performance to date among dependency-based methods in practically all cases, surpasses recent task-specific techniques on most challenging datasets, and has quadratic average-case time complexity (more efficient than graph-based parsers).
Conclusion: Transition-based parsing is effective for structured sentiment analysis, offering state-of-the-art performance with better efficiency than graph-based approaches, demonstrating the viability of this direction for SSA.
Abstract: Structured sentiment analysis (SSA) aims to automatically extract people’s opinions from a text in natural language and adequately represent that information in a graph structure. One of the most accurate methods for performing SSA was recently proposed and consists of approaching it as a dependency graph parsing task. Although we can find in the literature how transition-based algorithms excel in different dependency graph parsing tasks in terms of accuracy and efficiency, all proposed attempts to tackle SSA following that approach were based on graph-based models. In this article, we present the first transition-based method to address SSA as dependency graph parsing. Specifically, we design a transition system that processes the input text in a left-to-right pass, incrementally generating the graph structure containing all identified opinions. To effectively implement our final transition-based model, we resort to a Pointer Network architecture as a backbone. From an extensive evaluation, we demonstrate that our model offers the best performance to date in practically all cases among prior dependency-based methods, and surpasses recent task-specific techniques on the most challenging datasets. We additionally include an in-depth analysis and empirically prove that the average-case time complexity of our approach is quadratic in the sentence length, being more efficient than top-performing graph-based parsers.
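The quadratic complexity claim follows directly from the transition scheme: each new word can point back to any earlier word. A toy left-to-right loop makes this visible; score_arc stands in for the Pointer Network, and the arc inventory is collapsed to bare edges.

```python
def parse(words, score_arc, threshold=0.0):
    """Toy left-to-right transition loop: O(n^2) pointer decisions."""
    arcs = []
    for j in range(1, len(words)):       # process the sentence left to right
        for i in range(j):               # point back to any earlier position
            if score_arc(words, i, j) > threshold:
                arcs.append((i, j))      # e.g., holder/target/expression edge
    return arcs

# score_arc is a hypothetical scorer; a real model uses pointer attention
arcs = parse("the food was great".split(), lambda w, i, j: float(i == j - 1))
```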
[56] When Speculation Spills Secrets: Side Channels via Speculative Decoding In LLMs
Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar
Main category: cs.CL
TL;DR: Security vulnerability in speculative decoding LLMs where input-dependent speculation patterns create side-channels that leak query fingerprints and confidential data.
Details
Motivation: Speculative decoding improves LLM throughput but creates security risks through observable patterns of correct/incorrect speculations that can be monitored via token counts or packet sizes.
Method: Identified side-channel by monitoring per-iteration token counts and packet sizes during speculative decoding, evaluated across four schemes (REST, LADE, BiLD, EAGLE) using research prototypes and production vLLM frameworks (see the sketch below).
Result: Adversaries can fingerprint user queries with >75% accuracy (up to 100% for REST) at temperature 0.3, and leak confidential datastore contents at >25 tokens/sec. Even at temperature 1.0, accuracy remains far above random baseline.
Conclusion: Speculative decoding introduces significant security vulnerabilities through observable speculation patterns; proposed mitigations include packet padding and iteration-wise token aggregation.
Abstract: Deployed large language models (LLMs) often rely on speculative decoding, a technique that generates and verifies multiple candidate tokens in parallel, to improve throughput and latency. In this work, we reveal a new side-channel whereby input-dependent patterns of correct and incorrect speculations can be inferred by monitoring per-iteration token counts or packet sizes. In evaluations using research prototypes and production-grade vLLM serving frameworks, we show that an adversary monitoring these patterns can fingerprint user queries (from a set of 50 prompts) with over 75% accuracy across four speculative-decoding schemes at temperature 0.3: REST (100%), LADE (91.6%), BiLD (95.2%), and EAGLE (77.6%). Even at temperature 1.0, accuracy remains far above the 2% random baseline: REST (99.6%), LADE (61.2%), BiLD (63.6%), and EAGLE (24%). We also show the capability of the attacker to leak confidential datastore contents used for prediction at rates exceeding 25 tokens/sec. To defend against these, we propose and evaluate a suite of mitigations, including packet padding and iteration-wise token aggregation.
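The fingerprinting attack reduces to sequence matching: per-iteration accepted-token counts form a query-specific signature, and the observer picks the nearest stored profile. The sketch below uses a naive mean-absolute-difference distance and hypothetical traces; the paper's classifier is likely more sophisticated.

```python
import numpy as np

def match_query(observed, profiles):
    """Return the profiled query whose token-count trace is closest."""
    def dist(a, b):
        n = min(len(a), len(b))
        return np.abs(np.array(a[:n]) - np.array(b[:n])).mean()
    return min(profiles, key=lambda q: dist(observed, profiles[q]))

# Hypothetical per-iteration accepted-token counts for two known prompts
profiles = {"query_A": [3, 1, 4, 4, 2], "query_B": [1, 1, 2, 1, 1]}
guess = match_query([3, 2, 4, 4, 2], profiles)   # -> "query_A"
```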
[57] EmbBERT: Attention Under 2 MB Memory
Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Main category: cs.CL
TL;DR: EmbBERT is a tiny language model designed for ultra-constrained devices with only 2MB memory, achieving comparable accuracy to models 10x larger through compact embeddings, streamlined feed-forward blocks, and efficient attention mechanisms.
Details
Motivation: Transformer models have revolutionized NLP but their substantial memory and computational requirements hinder deployment on ultra-constrained devices like wearables and IoT units with only a few megabytes of memory.
Method: EmbBERT integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism designed for extreme efficiency. The architecture is optimized for strict memory budgets and demonstrates resilience to 8-bit quantization (see the budget sketch below).
Result: EmbBERT requires only 2MB total memory and achieves accuracy comparable to SotA models requiring 10x memory budget. It outperforms downsized versions of BERT and MAMBA of similar size on TinyNLP benchmark and GLUE suite, with quantization reducing memory to 781kB.
Conclusion: Highly simplified transformer architectures remain remarkably effective under tight resource constraints, enabling deployment on ultra-constrained devices while maintaining competitive performance compared to larger models.
Abstract: Transformer architectures based on the attention mechanism have revolutionized natural language processing (NLP), driving major breakthroughs across virtually every NLP task. However, their substantial memory and computational requirements still hinder deployment on ultra-constrained devices such as wearables and Internet-of-Things (IoT) units, where available memory is limited to just a few megabytes. To address this challenge, we introduce EmbBERT, a tiny language model (TLM) architecturally designed for extreme efficiency. The model integrates a compact embedding layer, streamlined feed-forward blocks, and an efficient attention mechanism that together enable optimal performance under strict memory budgets. Through this redesign for the extreme edge, we demonstrate that highly simplified transformer architectures remain remarkably effective under tight resource constraints. EmbBERT requires only 2 MB of total memory, and achieves accuracy comparable to that of state-of-the-art (SotA) models that require a $\mathbf{10\times}$ memory budget. Extensive experiments on the curated TinyNLP benchmark and the GLUE suite confirm that EmbBERT achieves competitive accuracy, comparable to that of larger SotA models, and consistently outperforms downsized versions of BERT and MAMBA of similar size. Furthermore, we demonstrate the model's resilience to 8-bit quantization, which further reduces memory usage to just 781 kB, and the scalability of the EmbBERT architecture across the sub-megabyte to tens-of-megabytes range. Finally, we perform an ablation study demonstrating the positive contributions of all components and the pre-training procedure. All code, scripts, and checkpoints are publicly released to ensure reproducibility: https://github.com/RiccardoBravin/tiny-LLM.
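The 2 MB figure is easy to sanity-check with a back-of-envelope parameter count: at 8-bit quantization one parameter costs one byte, so roughly two million parameters fit. The config below is hypothetical, not EmbBERT's actual architecture.

```python
def param_count(vocab, d, layers, ffn_mult=2):
    """Rough transformer weight count: embeddings + attention + FFN."""
    emb = vocab * d
    per_layer = 4 * d * d + 2 * d * (ffn_mult * d)   # QKVO + two FFN matrices
    return emb + layers * per_layer

params = param_count(vocab=4096, d=128, layers=4)
print(params, "params ->", params / 2**20, "MB at 8 bits/param")  # ~1.0 MB
```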
[58] Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations
Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
Main category: cs.CL
TL;DR: A framework for training LLMs to generate explanations of agent policies using reinforcement learning with continuous normalizing flows to capture pluralistic human judgments about explanations.
Details
Motivation: As humans increasingly share environments with diverse AI agents, the ability to explain agent policies in natural language is vital for reliable coexistence and trust.
Method: Trains explanation-generating LLMs via reinforcement learning from AI feedback, using distributional rewards generated by generative continuous normalizing flows (CNFs) that capture pluralistic human judgments. Includes specialized CNF architecture that attends to linguistic cues in decision context and explanations.
Result: Human and LLM evaluators find the method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than baselines.
Conclusion: The framework provides a robust approach to generating natural language explanations for agent policies using CNFs to model human judgment distributions, improving explanation quality and reliability.
Abstract: As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in the decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.
[59] from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jingyu Lei, Qi Li
Main category: cs.CL
TL;DR: AVATAR is a novel jailbreak attack framework that uses adversarial metaphors to induce LLMs to calibrate benign content into harmful forms, achieving state-of-the-art attack success rates across multiple advanced LLMs.
Details
Motivation: Current jailbreak attack research focuses on direct harmful content generation, but overlooks that inducing LLMs to calibrate benign content into harmful forms is more effective. The authors aim to exploit this vulnerability through metaphorical reasoning.
Method: AVATAR framework adaptively identifies benign but logically related metaphors as initial seeds, then induces target LLMs to reason about metaphorical content, jailbreaking them by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content.
Result: Experimental results show AVATAR can effectively and transferably jailbreak LLMs, achieving state-of-the-art attack success rates across multiple advanced language models.
Conclusion: The study demonstrates that metaphorical reasoning can be exploited for jailbreak attacks, revealing a new vulnerability in LLMs that goes beyond direct harmful content generation.
Abstract: Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason about and calibrate the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs, achieving state-of-the-art attack success rates across multiple advanced LLMs.
[60] Fixing the Broken Compass: Diagnosing and Improving Inference-Time Reward Modeling
Jiachun Li, Pengfei Cao, Zhuoran Jin, Yubo Chen, Jiexin Xu, Huaijun Li, Xiaojian Jiang, Kang Liu, Jun Zhao
Main category: cs.CL
TL;DR: CRISP is an inference-time algorithm that clusters reasoning paths by final answers, aggregates reward signals at cluster level, and uses adaptive prefix prompts to improve LLM reasoning performance.
Details
Motivation: While inference-time scaling techniques show promise for enhancing LLM reasoning, current research focuses on training-time optimization. The authors identify that inference-time reward model (RM)-based reasoning is overlooked, and existing RM approaches have limitations: impairing simple question performance, declining discriminative ability with increased sampling, and poor performance with high search diversity.
Method: Propose CRISP (Clustered Reward Integration with Stepwise Prefixing): 1) Clusters generated reasoning paths by their final answers, 2) Aggregates reward signals at the cluster level rather than individual path level, 3) Uses adaptive prefix prompts that are updated stepwise to guide generation based on cluster-level rewards (see the sketch after the abstract).
Result: CRISP significantly enhances LLM reasoning performance, achieving up to 5% accuracy improvement over other RM-based inference methods and an average of 10% gain over advanced reasoning models.
Conclusion: Inference-time RM-based reasoning is a critical avenue for improving LLM capabilities, and CRISP effectively addresses limitations of existing approaches through clustering and adaptive prefixing techniques.
Abstract: Inference-time scaling techniques have shown promise in enhancing the reasoning capabilities of large language models (LLMs). While recent research has primarily focused on training-time optimization, our work highlights inference-time reward model (RM)-based reasoning as a critical yet overlooked avenue. In this paper, we conduct a systematic analysis of RM behavior across downstream reasoning tasks, revealing three key limitations: (1) RM can impair performance on simple questions, (2) its discriminative ability declines with increased sampling, and (3) high search diversity undermines RM performance. To address these issues, we propose CRISP (Clustered Reward Integration with Stepwise Prefixing), a novel inference-time algorithm that clusters generated reasoning paths by final answers, aggregates reward signals at the cluster level, and adaptively updates prefix prompts to guide generation. Experimental results demonstrate that CRISP significantly enhances LLM reasoning performance, achieving up to 5% accuracy improvement over other RM-based inference methods and an average of 10% gain over advanced reasoning models.
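The cluster-level aggregation at CRISP's core is easy to picture in code. A minimal Python sketch, assuming mean aggregation per answer cluster and omitting the stepwise prefix updates; the function and data names are illustrative, not from the paper:

```python
from collections import defaultdict

def crisp_select(paths, rewards):
    # paths:   list of (reasoning_text, final_answer) pairs sampled from the LLM
    # rewards: one reward-model score per path, in the same order
    clusters = defaultdict(list)
    for (_, answer), score in zip(paths, rewards):
        clusters[answer].append(score)
    # Aggregate reward signals at the cluster level, then pick the best cluster.
    best_answer, _ = max(clusters.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
    return best_answer

# Three sampled paths, two distinct final answers:
paths = [("...steps...", "42"), ("...steps...", "42"), ("...steps...", "41")]
print(crisp_select(paths, [0.9, 0.8, 0.6]))  # -> "42" (cluster mean 0.85 vs 0.6)
```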
[61] MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering
Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying
Main category: cs.CL
TL;DR: MTBench is a multimodal benchmark combining time-series data (stock prices, weather) with textual narratives (financial news, weather reports) to evaluate LLMs on cross-modal reasoning tasks like forecasting, trend analysis, and QA.
Details
Motivation: Existing multimodal time-series datasets lack evaluation of cross-modal reasoning and complex QA needed to capture interactions between narrative information and temporal patterns.
Method: Created MTBench benchmark with paired time-series and textual data across financial and weather domains, formulating diverse tasks requiring joint reasoning over structured numerical trends and unstructured textual narratives.
Result: Evaluation of state-of-the-art LLMs revealed significant challenges in capturing long-term dependencies, interpreting causality in trends, and effectively fusing multimodal information.
Conclusion: MTBench provides a comprehensive testbed for evaluating multimodal reasoning capabilities, highlighting current limitations in LLMs for understanding complex text-time-series relationships.
Abstract: Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing complex interactions between narrative information and temporal patterns. To bridge this gap, we introduce Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises paired time series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTBench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTBench enables formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target the model’s ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.
[62] ZeroTuning: Unlocking the Initial Token’s Power to Enhance Large Language Models Without Training
Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar
Main category: cs.CL
TL;DR: ZeroTuning is a training-free method that improves LLM performance by applying head-specific attention adjustments only to the initial token (BOS), requiring no parameter updates or complex heuristics.
Details
Motivation: Existing token-level attention tuning methods like PASTA and ACT rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and have limited applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible.
Method: ZeroTuning intervenes only on the initial token’s attention logits, leveraging its natural role as an attention sink. It applies head-specific attention adjustments to systematically shift and reshape downstream attention patterns. Two variants: supervised mode calibrates on validation examples, unsupervised mode directly minimizes output entropy (see the sketch after the abstract).
Result: ZeroTuning achieves gains across 15 datasets, with Llama-3.1-8B showing relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. It works with optimized kernels (SDPA, FlashAttention), requires minimal code changes, and maintains improvements with quantized inference and increasing context length.
Conclusion: ZeroTuning provides a simpler, more effective alternative to existing attention tuning methods by focusing interventions on the initial token, achieving better performance without complex heuristics or parameter updates.
Abstract: Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT), has emerged as a promising approach for improving frozen LLMs via interpretable interventions. However, these methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limit applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible. We propose a simpler alternative: intervening only on the initial token (e.g., BOS in LLaMA). We theoretically show that adding lightweight biases to this token’s attention logits systematically shifts and reshapes downstream attention patterns - an effect amplified by its natural role as an attention sink. Empirically, we find that this tuning can improve LLM performance and better elicit pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these findings, we introduce ZeroTuning, a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring no parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and an unsupervised mode that directly minimizes output entropy. ZeroTuning requires no KV-cache or decoding changes and is kernel-agnostic (works with SDPA and FlashAttention). It requires only four lines of modification to the standard LlamaAttention code, achieves gains across 15 datasets, and outperforms prior, more complex methods. For example, on Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out of the box with quantized inference and maintains its improvements as context length increases.
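A minimal sketch of the core intervention, assuming direct access to pre-softmax attention logits; in practice the paper patches LlamaAttention and remains compatible with SDPA and FlashAttention, and the per-head biases come from supervised or entropy-based calibration rather than the illustrative constant used here:

```python
import torch

def zerotune_attention(attn_logits, head_bias):
    # attn_logits: (batch, n_heads, q_len, k_len) pre-softmax attention scores
    # head_bias:   (n_heads,) per-head offsets for the initial token's column
    adjusted = attn_logits.clone()
    adjusted[..., 0] += head_bias.view(1, -1, 1)  # shift only the BOS sink column
    return torch.softmax(adjusted, dim=-1)

scores = torch.randn(1, 8, 16, 16)
bias = torch.full((8,), 0.5)  # illustrative; the paper calibrates these per head
weights = zerotune_attention(scores, bias)
```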
[63] Intrinsic Self-Correction in LLMs: Towards Explainable Prompting via Mechanistic Interpretability
Yu-Ting Lee, Fu-Chieh Chang, Yu-En Shu, Hui-Ying Shih, Pei-Yuan Wu
Main category: cs.CL
TL;DR: Intrinsic self-correction in LLMs works by steering hidden representations along interpretable latent directions, as shown through alignment analysis and activation interventions on text detoxification/toxification tasks.
Details
Motivation: While intrinsic self-correction (refining outputs through prompting without external feedback) improves performance across tasks, its underlying mechanisms remain unclear. The paper aims to understand how this process works at the representational level.
Method: Analyze intrinsic self-correction via representation shifts induced by prompting. Construct interpretable latent directions using contrastive pairs and verify causal effects via activation addition. Evaluate six open-source LLMs on text detoxification and toxification tasks (see the sketch after the abstract).
Result: Prompt-induced representation shifts consistently align with latent directions: in detoxification, shifts align with non-toxic direction; in toxification, shifts align with toxic direction. Representation steering is identified as the mechanistic driver of intrinsic self-correction.
Conclusion: Understanding model internals offers a direct route to analyzing prompt-driven LLM behaviors. Representation steering is the key mechanism behind intrinsic self-correction, as evidenced by alignment between prompt-induced shifts and interpretable latent directions.
Abstract: Intrinsic self-correction refers to the phenomenon where a language model refines its own outputs purely through prompting, without external feedback or parameter updates. While this approach improves performance across diverse tasks, its mechanism remains unclear. We show that intrinsic self-correction functions by steering hidden representations along interpretable latent directions, as evidenced by both alignment analysis and activation interventions. To achieve this, we analyze intrinsic self-correction via the representation shift induced by prompting. In parallel, we construct interpretable latent directions with contrastive pairs and verify the causal effect of these directions via activation addition. Evaluating six open-source LLMs, our results demonstrate that prompt-induced representation shifts in text detoxification and text toxification consistently align with latent directions constructed from contrastive pairs. In detoxification, the shifts align with the non-toxic direction; in toxification, they align with the toxic direction. These findings suggest that representation steering is the mechanistic driver of intrinsic self-correction. Our analysis highlights that understanding model internals offers a direct route to analyzing the mechanisms of prompt-driven LLM behaviors.
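The contrastive-direction construction and activation addition used for causal verification can be sketched in a few lines; the mean-difference direction is a standard recipe, and the steering strength here is an illustrative assumption rather than a value from the paper:

```python
import numpy as np

def latent_direction(pos_states, neg_states):
    # pos_states, neg_states: (n, d) hidden states from contrastive pairs,
    # e.g. non-toxic vs. toxic continuations of the same prompts
    d = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return d / np.linalg.norm(d)

def activation_addition(hidden, direction, alpha=4.0):
    # Steers a (seq_len, d) layer activation along the latent direction;
    # alpha is an illustrative strength, not a value from the paper.
    return hidden + alpha * direction
```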
[64] ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Main category: cs.CL
TL;DR: ChartMuseum benchmark reveals significant visual reasoning deficiencies in large vision-language models for chart understanding, with models performing 35-55% worse on visual reasoning tasks compared to text-heavy questions.
Details
Motivation: Current large vision-language models (LVLMs) show an imbalance between textual and visual reasoning capabilities, particularly struggling with visual reasoning tasks that are difficult to perform in text. The authors aim to expose these limitations in chart understanding, which requires sophisticated integration of both reasoning types.
Method: 1) Conducted a case study using a synthetic dataset solvable only through visual reasoning to show performance degradation with increasing visual complexity. 2) Introduced ChartMuseum, a new Chart Question Answering benchmark with 1,162 expert-annotated questions spanning multiple reasoning types, curated from 184 real-world chart sources. 3) Evaluated frontier models including Gemini-2.5-Pro and open-source LVLMs like Qwen2.5-VL-72B-Instruct, comparing against human performance.
Result: Humans achieved 93% accuracy on ChartMuseum, while the best-performing model (Gemini-2.5-Pro) attained only 63.0%, and the leading open-source LVLM (Qwen2.5-VL-72B-Instruct) achieved only 38.5%. On questions requiring primarily visual reasoning, all models experienced a 35%-55% performance drop compared to text-reasoning-heavy questions. The benchmark effectively differentiates model capabilities where prior benchmarks showed saturation.
Conclusion: Current LVLMs have significant deficiencies in visual reasoning for chart understanding, with a substantial gap between model and human performance. The ChartMuseum benchmark successfully exposes these limitations and provides a valuable tool for evaluating and improving multimodal reasoning capabilities in vision-language models.
Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks – where frontier models perform similarly and near saturation – our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
[65] WAVE++: Capturing Within-Task Variance for Continual Relation Extraction with Adaptive Prompting
Bao-Ngoc Dao, Minh Le, Quang Nguyen, Luyen Ngo Dinh, Nam Le, Linh Ngo Van
Main category: cs.CL
TL;DR: WAVE++ is a prompt-based continual relation extraction method using task-specific prompt pools and label descriptions to address catastrophic forgetting without storing past data.
Details
Motivation: Memory-based continual relation extraction methods have privacy concerns and high memory usage, while existing prompt-based methods struggle with accurate task identification and catastrophic forgetting in shared parameters.
Method: Uses task-specific prompt pools inspired by prefix-tuning and mixture of experts, incorporates label descriptions for richer context, employs training-free task prediction, and integrates a generative model to consolidate prior knowledge without data storage.
Result: Outperforms state-of-the-art prompt-based and rehearsal-based methods in continual relation extraction benchmarks.
Conclusion: WAVE++ provides a robust solution for continual relation extraction that addresses privacy concerns, reduces memory usage, and effectively handles catastrophic forgetting through innovative prompt design and knowledge consolidation.
Abstract: Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures both within-task and cross-task variations. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.
[66] Aligning Dialogue Agents with Global Feedback via Large Language Model Multimodal Reward Decomposition
Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency
Main category: cs.CL
TL;DR: LLM-based reward decomposition framework for dialogue agents that uses session-level feedback to infer turn-level rewards, with text-only and multimodal variants incorporating behavioral cues.
Details
Motivation: Aligning dialogue agents typically requires fine-grained human feedback which is expensive to obtain. The paper aims to leverage LLMs' reasoning capabilities to decompose global session-level feedback into local turn-level rewards, eliminating the need for manual reward shaping and granular human annotations.
Method: Two variants: 1) Text-only: prompts frozen pretrained LLM to decompose rewards using only the dialogue transcript. 2) Multimodal: incorporates behavioral cues (pitch, gaze, facial affect) expressed as natural language descriptions. Inferred turn-level rewards are distilled into a lightweight reward model for RL-based fine-tuning of dialogue generation.
Result: Both text-only and multimodal variants outperform state-of-the-art reward decomposition methods. Human evaluations show notable improvements in conversation quality, demonstrating that LLMs are effective reward decomposers.
Conclusion: LLMs can effectively decompose global feedback into local rewards, eliminating the need for manual reward shaping and granular human feedback. The multimodal variant shows promise by incorporating behavioral cues, though the paper focuses more on dialogue alignment than multimodal understanding/generation.
Abstract: We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first \emph{text-only} variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second \emph{multimodal} variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.
[67] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Main category: cs.CL
TL;DR: ARC-JSD: A novel Jensen-Shannon Divergence method for efficient context attribution in Retrieval-Augmented Generation without fine-tuning or gradient calculations.
Details
Motivation: Current RAG methods struggle with reliable context attribution - identifying which specific context segments contribute to generated responses. Existing approaches are computationally intensive, requiring extensive fine-tuning or human annotation, creating a need for more efficient attribution methods.
Method: Proposes ARC-JSD (Attribute Response to Context using Jensen-Shannon Divergence), which uses JSD to measure distribution differences between model outputs with and without specific context sentences. This enables identification of essential context without additional fine-tuning, gradient calculations, or surrogate modeling (see the sketch after the abstract).
Result: Superior accuracy and significant computational efficiency improvements on RAG benchmarks (TyDi QA, Hotpot QA, Musique) compared to previous surrogate-based methods. Also identifies specific attention heads and MLP layers responsible for context attribution through mechanistic analysis.
Conclusion: ARC-JSD provides an efficient, accurate method for context attribution in RAG systems without computational overhead, while also offering insights into model internals and RAG behaviors through mechanistic analysis.
Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning, gradient calculation, or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models and how they affect RAG behaviours. Our code is available at https://github.com/ruizheliUOA/ARC_JSD.
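The core computation is a leave-one-sentence-out comparison of output distributions. A minimal sketch, with `model_probs` standing in for an assumed interface that returns the model's answer distribution given a list of context sentences:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions.
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def attribute_sentences(model_probs, sentences, query):
    # model_probs(context_sentences, query) -> answer-token distribution;
    # this interface is an assumption, not the paper's actual API.
    full = model_probs(sentences, query)
    return [jsd(full, model_probs(sentences[:i] + sentences[i + 1:], query))
            for i in range(len(sentences))]
```

Sentences with the largest divergence scores are the ones the response most depends on.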
[68] Unveiling the “Fairness Seesaw”: Discovering and Mitigating Gender and Race Bias in Vision-Language Models
Jian Lan, Udo Schlegel, Tanveer Hannan, Gengyuan Zhang, Haokun Chen, Thomas Seidl
Main category: cs.CL
TL;DR: Systematic analysis reveals gender and race bias in Vision-Language Models, showing fairness paradox, layer-wise fluctuation, and residual discrepancy, leading to RES-FAIR framework for bias mitigation.
Details
Motivation: Vision-Language Models (VLMs) have achieved remarkable success but their knowledge mechanisms underlying social biases remain a black box, with fairness- and ethics-related problems harming certain groups in society. It's unknown to what extent VLMs yield gender and race bias in generative responses.
Method: Conducted systematic discovery of gender and race bias in state-of-the-art VLMs, focusing on surface-level responses, internal probability distributions, and hidden state dynamics. Proposed RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing and projecting hidden states away from biased residual directions while amplifying fair components (see the sketch after the abstract).
Result: Revealed three critical findings: 1) Fairness Paradox: Models generate fair text labels while maintaining highly skewed confidence scores toward specific social groups; 2) Layer-wise Fluctuation: Fairness knowledge peaks in intermediate layers and undergoes substantial erosion in final layers; 3) Residual Discrepancy: Different residual streams within a single hidden layer carry conflicting social knowledge. Evaluations on PAIRS and SocialCounterfactuals datasets show significant improvements in response fairness and confidence calibration without compromising general reasoning abilities.
Conclusion: The work provides a new lens for understanding how multi-modal models store and process sensitive social information, offering a discovery-based approach that significantly improves fairness in VLMs while maintaining their reasoning capabilities.
Abstract: Although Vision-Language Models (VLMs) have achieved remarkable success, the knowledge mechanisms underlying their social biases remain a black box, where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield gender and race bias in generative responses. In this paper, we conduct a systematic discovery of gender and race bias in state-of-the-art VLMs, focusing not only on surface-level responses but also on the internal probability distributions and hidden state dynamics. Our empirical analysis reveals three critical findings: 1) The Fairness Paradox: Models often generate fair text labels while maintaining highly skewed confidence scores (mis-calibration) toward specific social groups. 2) Layer-wise Fluctuation: Fairness knowledge is not uniformly distributed; it peaks in intermediate layers and undergoes substantial knowledge erosion in the final layers. 3) Residual Discrepancy: Within a single hidden layer, different residual streams carry conflicting social knowledge - some reinforcing fairness and others amplifying bias. Leveraging these insights, we propose RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing and projecting hidden states away from biased residual directions while amplifying fair components. Evaluations on PAIRS and SocialCounterfactuals datasets demonstrate that our discovery-based approach significantly improves response fairness and confidence calibration without compromising general reasoning abilities. Our work provides a new lens for understanding how multi-modal models store and process sensitive social information.
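The projection step described for RES-FAIR can be sketched directly; the direction vectors would come from the paper's residual-stream analysis, and the scaling knobs here are illustrative assumptions:

```python
import numpy as np

def recalibrate(hidden, biased_dir, fair_dir, beta=1.0, gamma=0.5):
    # hidden: (d,) hidden state; biased_dir / fair_dir: unit residual directions.
    # beta and gamma are illustrative knobs, not the paper's calibrated values.
    h = hidden - beta * np.dot(hidden, biased_dir) * biased_dir  # project away bias
    return h + gamma * np.dot(h, fair_dir) * fair_dir            # amplify fairness
```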
[69] Cross-Attention Speculative Decoding
Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Yipeng Ji, Chul Lee
Main category: cs.CL
TL;DR: Beagle is a cross-attention-based speculative decoding model that matches performance of self-attention models while simplifying architecture and improving training efficiency.
Details
Motivation: Current speculative decoding methods rely on complex self-attention Transformers with auxiliary layers, making them hard to generalize across models. The authors aim to create a simpler, more efficient alternative.
Method: Proposes Budget EAGLE (Beagle), a cross-attention-based Transformer decoder for speculative decoding. Uses Two-Stage Block-Attention Training for stable training in block-level attention scenarios (see the sketch after the abstract).
Result: Beagle achieves competitive inference speedups comparable to EAGLE-v2, with higher training efficiency, stable memory usage, and simplified architecture without pooling or auxiliary components.
Conclusion: Beagle offers a strong alternative architecture for speculative decoding, simplifying the design while maintaining performance and improving training efficiency.
Abstract: Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
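A minimal sketch of a cross-attention draft block in the spirit of Beagle, where draft-token states attend to the target model's hidden states rather than running a deep self-attention stack; the dimensions and layer layout are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CrossAttnDraftLayer(nn.Module):
    # Draft-token states attend (cross-attention) to frozen target-model
    # hidden states, avoiding auxiliary pooling or fusion components.
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, draft_states, target_hidden):
        attn_out, _ = self.xattn(self.norm1(draft_states), target_hidden, target_hidden)
        h = draft_states + attn_out
        return h + self.ffn(self.norm2(h))

layer = CrossAttnDraftLayer()
draft = torch.randn(1, 4, 512)    # states for 4 speculative draft tokens
target = torch.randn(1, 32, 512)  # hidden states from the target model
out = layer(draft, target)        # (1, 4, 512)
```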
[70] Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics
Asifullah Khan, Muhammad Zaeem Khan, Aleesha Zainab, Saleha Jamshed, Sadia Ahmad, Kaynat Khatib, Faria Bibi, Abdul Rehman
Main category: cs.CL
TL;DR: Survey of LLM advancements covering reasoning, efficiency, multimodal learning, ethics, and agentic AI, with focus on techniques like Chain-of-Thought, Instruction Tuning, and Mixture-of-Experts architecture.
Details
Motivation: To provide a comprehensive overview of key developments in Large Language Models, going beyond isolated aspects to offer a holistic perspective on advancements in reasoning, efficiency, multimodal capabilities, and ethical considerations.
Method: Survey methodology analyzing recent LLM advancements, categorizing emerging methods, and identifying key techniques including Chain-of-Thought prompting, Instruction Tuning, Reinforcement Learning from Human Feedback, and Mixture-of-Experts architecture.
Result: Identifies effective techniques for bridging human-machine communication gap, improvements in multimodal learning and few-shot capabilities, efficiency strategies, and emerging areas like Agentic AI and Autonomous Decision-Making Systems.
Conclusion: While significant LLM advancements have been made, challenges remain in computational costs, biases, and ethical risks. Future research should focus on multimodal handling, interpretability, cross-modal integration, and sustainability to make models more intelligent, safe, and reliable.
Abstract: This survey paper outlines the key developments in the field of Large Language Models (LLMs), including enhancements to their reasoning skills, adaptability to various tasks, increased computational efficiency, and the ability to make ethical decisions. The techniques that have been most effective in bridging the gap between human and machine communications include the Chain-of-Thought prompting, Instruction Tuning, and Reinforcement Learning from Human Feedback. The improvements in multimodal learning and few-shot or zero-shot techniques have further empowered LLMs to handle complex jobs with minor input. A significant focus is placed on efficiency, detailing scaling strategies, optimization techniques, and the influential Mixture-of-Experts (MoE) architecture, which strategically routes inputs to specialized subnetworks to boost predictive accuracy, while optimizing resource allocation. This survey also offers a broader perspective on recent advancements in LLMs, going beyond isolated aspects such as model architecture or ethical concerns. Additionally, it explores the role of LLMs in Agentic AI and their use as Autonomous Decision-Making Systems, and categorizes emerging methods that enhance LLM reasoning, efficiency, and ethical alignment. The survey also identifies underexplored areas such as interpretability, cross-modal integration, and sustainability. While significant advancements have been made in LLMs, challenges such as high computational costs, biases, and ethical risks remain. Overcoming these requires a focus on bias mitigation, transparent decision-making, and explicit ethical guidelines. Future research will generally focus on enhancing the model’s ability to handle multiple inputs, thereby making it more intelligent, safe, and reliable.
[71] Unveiling Super Experts in Mixture-of-Experts Large Language Models
Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, Kehong Yuan
Main category: cs.CL
TL;DR: Discovery of “Super Experts” - a small subset of crucial experts in MoE LLMs that cause extreme activation outliers and are essential for model performance, particularly in mathematical reasoning.
Details
Motivation: To understand the internal dynamics of Mixture-of-Experts (MoE) Large Language Models by identifying and analyzing a distinct subset of experts that play pivotal roles in forward inference, despite their limited numbers.
Method: Systematic investigation of open-source MoE LLMs through pruning experiments, activation analysis, and distribution studies to identify Super Experts and understand their impact on model performance and attention mechanisms (see the sketch after the abstract).
Result: Discovered Super Experts are characterized by rare but extreme activation outliers, are model-specific and data-agnostic, and pruning just a few (e.g., 3 out of 6,144) causes severe performance degradation, especially in mathematical reasoning tasks.
Conclusion: Super Experts serve as the primary source of systematic outlier mechanisms in Transformers, and compressing them disrupts attention sinks, providing new insights into MoE LLM internal dynamics.
Abstract: In this study, we report, for the first time, the discovery and systematic investigation of a distinct subset of experts that play a pivotal role in the MoE LLMs’ forward inference. These experts are prevalent in open-source MoE LLMs, and despite their extremely limited number, pruning them results in a substantial decline in model performance (e.g., pruning just three out of 6,144 causes Qwen3-30B-A3B to generate repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs: (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs is model-specific, data-agnostic, and remains unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model’s overall performance, particularly in mathematical reasoning. (iii) We further investigate why compressing SEs exerts such a pronounced impact. We show that, in MoE LLMs, SEs serve as the primary source of the systematic outlier mechanism in Transformers, and that compressing them profoundly disrupts this process, ultimately causing the collapse of attention sinks. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in the current knowledge. The code is provided at https://github.com/ZunhaiSu/Super-Experts-Profilling.
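The outlier signature described in (i) suggests a simple detection heuristic. A sketch under stated assumptions: it flags experts whose peak down_proj activation is extreme relative to the global median peak, with an arbitrary threshold that is not the paper's actual criterion:

```python
import torch

@torch.no_grad()
def find_super_expert_candidates(expert_outputs, ratio=50.0):
    # expert_outputs: {(layer, expert_id): down_proj output tensor gathered
    # over a calibration set}. The ratio threshold is a guess for
    # illustration, not the paper's selection rule.
    peaks = {k: v.abs().max().item() for k, v in expert_outputs.items()}
    median_peak = torch.tensor(list(peaks.values())).median().item()
    return sorted(k for k, p in peaks.items() if p > ratio * median_peak)
```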
[72] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding
Yan Xie, Yibo Cui, Liang Xie, Erwei Yin
Main category: cs.CL
TL;DR: AFD-SLU: Adaptive Feature Distillation framework for Spoken Language Understanding that transfers semantic knowledge from large GTE teacher models to lightweight student models using dynamic adapters and adaptive distillation coefficients.
Details
Motivation: SLU systems face challenges due to labeled data scarcity and the computational burden of deploying LLMs in real-world applications. Need efficient models that maintain performance while being lightweight.
Method: Proposes Adaptive Feature Distillation framework with: 1) Dynamic adapter with Residual Projection Neural Network to align heterogeneous feature spaces between teacher and student, 2) Dynamic Distillation Coefficient that adaptively modulates distillation strength based on real-time intent/slot prediction feedback (see the sketch after the abstract).
Result: Achieves SOTA on Chinese ProSLU benchmark: 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
Conclusion: AFD-SLU effectively addresses SLU challenges by enabling efficient knowledge transfer from large teacher models to lightweight student models while maintaining high performance.
Abstract: Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To further alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.
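A sketch of how the two pieces might fit together, assuming a simple residual-projection adapter and a coefficient that decays as intent/slot performance improves; both the architecture and the schedule are guesses at the paper's RPNN and DDC, not their published forms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNN(nn.Module):
    # Maps student features into the teacher (GTE) space with a residual
    # refinement path; the exact architecture is an assumption.
    def __init__(self, d_student, d_teacher):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)
        self.refine = nn.Sequential(nn.Linear(d_teacher, d_teacher), nn.GELU(),
                                    nn.Linear(d_teacher, d_teacher))

    def forward(self, h):
        z = self.proj(h)
        return z + self.refine(z)

def distill_loss(student_h, teacher_h, adapter, intent_acc, slot_f1):
    # Dynamic coefficient: distill harder while the student's intent/slot
    # performance is low. The exact schedule is an assumption.
    ddc = 1.0 - 0.5 * (intent_acc + slot_f1)
    return ddc * F.mse_loss(adapter(student_h), teacher_h)
```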
[73] Is In-Context Learning Learning?
Adrian de Wynter
Main category: cs.CL
TL;DR: ICL in autoregressive models shows limited learning and generalization capabilities despite mathematical learning definition, with performance insensitive to many factors but sensitive to prompt regularities.
Details
Motivation: To investigate whether in-context learning (ICL) in autoregressive models truly represents learning or is merely deduction from prior knowledge and prompt patterns, addressing claims about models' ability to learn unseen tasks with few shots.
Method: Large-scale empirical analysis of ICL ablating memorization, pretraining, distributional shifts, and prompting styles; examining sensitivity to exemplar distribution, model architecture, prompt phrasing, and linguistic features.
Result: ICL shows limited ability to learn and generalize to unseen tasks; accuracy becomes insensitive to exemplar distribution, model, prompt style, and linguistic features as exemplars increase; performance depends on deducing patterns from prompt regularities, leading to distributional sensitivity especially in chain-of-thought prompting.
Conclusion: Autoregression’s ad-hoc encoding is not a robust learning mechanism, suggesting limited all-purpose generalizability despite mathematical learning definition; ICL’s varied performance on formally similar tasks indicates fundamental limitations.
Abstract: In-context learning (ICL) allows some autoregressive models to solve tasks via next-token prediction and without needing further training. This has led to claims about these models' ability to solve (learn) unseen tasks with only a few shots (exemplars) in the prompt. However, deduction does not always imply learning, as ICL does not explicitly encode a given observation. Instead, the models rely on their prior knowledge and the exemplars given, if any. We argue that, mathematically, ICL fits the definition of learning; however, its full characterisation requires empirical work. We then carry out a large-scale analysis of ICL ablating out or accounting for memorisation, pretraining, distributional shifts, and prompting style and phrasing. We find that, empirically, ICL is limited in its ability to learn and generalise to unseen tasks. Namely, in the limit where exemplars become more numerous, accuracy is insensitive to exemplar distribution, model, prompt style, and the input's linguistic features. Instead, it deduces patterns from regularities in the prompt, which leads to distributional sensitivity, especially in prompting styles such as chain-of-thought. Given the varied accuracies on formally similar tasks, we conclude that autoregression's ad-hoc encoding is not a robust mechanism for learning, suggesting limited all-purpose generalisability.
[74] TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding
Xiaobo Xing, Wei Yuan, Tong Chen, Quoc Viet Hung Nguyen, Xiangliang Zhang, Hongzhi Yin
Main category: cs.CL
TL;DR: TableDART is a training-efficient framework for table understanding that dynamically selects between text-only, image-only, or fusion paths using a lightweight gating network, avoiding redundancy and conflicts in multimodal table processing.
Details
Motivation: Existing table understanding approaches have limitations: Table-as-Text methods lose structural cues, Table-as-Image methods struggle with precise semantics, and Table-as-Multimodality approaches statically process both modalities for every query, introducing redundancy and conflicts while requiring costly MLLM fine-tuning.
Method: Proposes TableDART with a 2.59M-parameter MLP gating network that dynamically routes each table-query pair to the optimal path (Text-only, Image-only, or Fusion). Uses pretrained single-modality models and introduces an agent to mediate cross-modal knowledge integration by selecting the best result or synthesizing a new answer through reasoning (see the sketch after the abstract).
Result: Achieves new state-of-the-art performance on seven benchmarks, surpassing strongest baseline by average of 4.02% among open-source models, while being training-efficient and avoiding full MLLM fine-tuning costs.
Conclusion: TableDART effectively integrates multimodal views for table understanding through dynamic path selection and cross-modal mediation, achieving superior performance with training efficiency by reusing pretrained single-modality models.
Abstract: Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs), but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with precise semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. In light of this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (Text-only, Image-only, or Fusion) for each table-query pair, reducing redundancy and avoiding conflicts that arise when textual and visual views of the same table provide inconsistent cues. By routing to the most appropriate view, our framework improves both accuracy and efficiency. In addition, we propose a novel agent to mediate cross-modal knowledge integration by analyzing outputs from text- and image-based models, either selecting the best result or synthesizing a new answer through reasoning. This design avoids the prohibitive costs of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://github.com/xiaobo-xing/TableDART.
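The gating network is the lightweight heart of the method. A minimal sketch of a three-way router over pooled table-query embeddings; the dimensions are illustrative and merely in the spirit of the paper's 2.59M-parameter gate:

```python
import torch
import torch.nn as nn

class PathGate(nn.Module):
    # Lightweight MLP router over the three processing paths; the input is
    # assumed to be a pooled table-query embedding.
    def __init__(self, d_in=1024, d_hidden=1024, n_paths=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                 nn.Linear(d_hidden, n_paths))

    def forward(self, x):
        return self.net(x).softmax(dim=-1)  # P(text-only), P(image-only), P(fusion)

gate = PathGate()
probs = gate(torch.randn(2, 1024))  # two table-query pairs
route = probs.argmax(dim=-1)        # 0 = text-only, 1 = image-only, 2 = fusion
```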
[75] Bridging Fairness and Explainability: Can Input-Based Explanations Promote Fairness in Hate Speech Detection?
Yifan Wang, Mayank Jobanputra, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
Main category: cs.CL
TL;DR: Systematic study shows input-based explanations can detect biased predictions and help reduce bias in training, but are unreliable for selecting fair models in hate speech detection.
Details
Motivation: NLP models often replicate social bias from training data, but their black-box nature makes it hard to recognize biased predictions and mitigate them. While some studies suggest input-based explanations can help, others question their reliability, and existing research has been predominantly qualitative with limited large-scale quantitative analysis.
Method: Conducted first systematic study of relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. Examined three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training.
Result: Input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates.
Conclusion: While input-based explanations have value for bias detection and mitigation in training, they should not be relied upon for model selection in fairness-critical applications.
Abstract: Natural language processing (NLP) models often replicate or amplify social bias from training data, raising concerns about fairness. At the same time, their black-box nature makes it difficult for users to recognize biased predictions and for developers to effectively mitigate them. While some studies suggest that input-based explanations can help detect and mitigate bias, others question their reliability in ensuring fairness. Existing research on explainability in fair NLP has been predominantly qualitative, with limited large-scale quantitative analysis. In this work, we conduct the first systematic study of the relationship between explainability and fairness in hate speech detection, focusing on both encoder- and decoder-only models. We examine three key dimensions: (1) identifying biased predictions, (2) selecting fair models, and (3) mitigating bias during model training. Our findings show that input-based explanations can effectively detect biased predictions and serve as useful supervision for reducing bias during training, but they are unreliable for selecting fair models among candidates. Our code is available at https://github.com/Ewanwong/fairness_x_explainability.
[76] HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Jingsun Yoon, Hamid Palangi, Tomas Pfister
Main category: cs.CL
TL;DR: HEART framework uses emotional cues to guide AI models during test-time scaling, alternating between critical and encouraging tones to break repetitive reasoning patterns and improve problem-solving accuracy.
Details
Motivation: Current test-time scaling methods often get stuck in repetitive, incorrect reasoning patterns. The authors propose that emotional regulation, similar to how feelings guide human decision-making, could help AI models break out of dead-end reasoning and improve problem-solving.
Method: HEART framework introduces emotional cues to guide model focus during reasoning. It alternates between critical tones (to sharpen error detection) and encouraging tones (to spark new ideas), helping models escape repetitive reasoning patterns and find correct solutions (see the sketch after the abstract).
Result: Evaluated across seven high-difficulty benchmarks including Humanity’s Last Exam, GPQA Diamond, and LiveCodeBench. Results show consistent accuracy gains over affect-sterile baselines, demonstrating robustness across diverse models and that emotion facilitates deeper reasoning.
Conclusion: The strategic integration of affective regulation can guide logical synthesis in AI models, suggesting that emotional cues represent the next frontier in improving machine reasoning capabilities during test-time scaling.
Abstract: Test-time scaling has significantly improved how AI models solve problems, yet current methods often get stuck in repetitive, incorrect patterns of thought. We introduce HEART, a framework that uses emotional cues to guide the model’s focus, much like how feelings contribute to human decision-making. By alternating between critical tones to sharpen error detection and encouraging tones to spark new ideas, HEART helps the model break out of dead-end reasoning and find the right solution. We evaluate HEART across seven high-difficulty benchmarks, including Humanity’s Last Exam, GPQA Diamond, and LiveCodeBench, demonstrating robustness across diverse models. Results show that emotion facilitates deeper reasoning, yielding consistent accuracy gains over affect-sterile baselines. These findings suggest that the next frontier in machine reasoning lies in the strategic integration of affective regulation to guide logical synthesis.
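The alternating-tone loop is straightforward to sketch. Here `generate` is an assumed LLM interface and the two tone prompts are illustrative placeholders, not the paper's actual affective cues:

```python
CRITICAL = ("Your previous reasoning contains mistakes. Be ruthless: "
            "re-check every step and find the error.")
ENCOURAGING = ("You are close to the answer. Take a fresh perspective; "
               "a new idea may crack this.")

def heart_solve(problem, generate, rounds=4):
    # generate(system_prompt, user_prompt) -> str is an assumed LLM interface.
    attempt = generate("Solve the problem step by step.", problem)
    for i in range(rounds):
        tone = CRITICAL if i % 2 == 0 else ENCOURAGING  # alternate affective cues
        attempt = generate(tone, f"Problem: {problem}\n"
                                 f"Previous attempt: {attempt}\nRevise your solution.")
    return attempt
```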
[77] HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
Main category: cs.CL
TL;DR: HarmMetric Eval benchmark for evaluating harmfulness metrics and judges, showing conventional metrics can outperform LLM-based judges in fine-grained safety assessment
Details
Motivation: The lack of a systematic benchmark for evaluating harmfulness metrics and judges undermines the credibility of LLM safety assessments, creating a need for a comprehensive evaluation framework.
Method: Created the HarmMetric Eval benchmark with a high-quality dataset of harmful prompts and diverse responses across categories; proposed a flexible scoring mechanism for ranking harmful vs. non-harmful responses; conducted extensive experiments comparing metrics (see the sketch after the abstract).
Result: Surprising finding: Conventional reference-based metrics (ROUGE, METEOR) can outperform LLM-based judges in fine-grained harmfulness evaluation; built new improved judge by incorporating fine-grained criteria and fine-tuning with reference metrics
Conclusion: HarmMetric Eval provides needed benchmark for LLM safety evaluation, challenges assumptions about LLM superiority in harm assessment, and enables development of better harmfulness judges through systematic evaluation
Abstract: The potential for large language models (LLMs) to generate harmful content poses a significant safety risk in their deployment. To address and assess this risk, the community has developed numerous harmfulness evaluation metrics and judges. However, the lack of a systematic benchmark for evaluating these metrics and judges undermines the credibility and consistency of LLM safety assessments. To bridge this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. In HarmMetric Eval, we build a high-quality dataset of representative harmful prompts paired with highly diverse harmful model responses and non-harmful counterparts across multiple categories. We also propose a flexible scoring mechanism that rewards the metrics for correctly ranking harmful responses above non-harmful ones, which is applicable to almost all existing metrics and judges with varying output formats and scoring scales. Using HarmMetric Eval, we uncover a surprising finding by extensive experiments: Conventional reference-based metrics such as ROUGE and METEOR can outperform existing LLM-based judges in fine-grained harmfulness evaluation, challenging prevailing assumptions about LLMs’ superiority in this domain. To reveal the reasons behind this finding, we provide a fine-grained analysis to explain the limitations of LLM-based judges on rating irrelevant or useless responses. Furthermore, we build a new harmfulness judge by incorporating the fine-grained criteria into its prompt template and leverage reference-based metrics to fine-tune its base LLM. The resulting judge outperforms all existing metrics and judges in evaluating harmful responses.
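The pairwise scoring idea can be sketched in a few lines; `metric` abstracts over any harmfulness scorer (a reference-based metric or an LLM judge), and treating it as a single-argument function is a simplification of the benchmark's actual protocol:

```python
def pairwise_ranking_score(metric, pairs):
    # pairs: (harmful_response, non_harmful_response) tuples for the same
    # prompt; metric(text) -> float, where higher means judged more harmful.
    # Returns the fraction of pairs the metric ranks correctly; this is a
    # simplified reading of the benchmark's scoring mechanism.
    wins = sum(metric(harmful) > metric(benign) for harmful, benign in pairs)
    return wins / len(pairs)
```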
[78] From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
Zefan Cai, Haoyi Qiu, Haozhe Zhao, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu
Main category: cs.CL
TL;DR: VideoBiasEval: A framework to evaluate social bias amplification in video diffusion models during alignment tuning with human preference data.
Details
Motivation: While video diffusion models improve visual quality through alignment tuning with reward models trained on human preferences, they can unintentionally encode and amplify social biases. There's a need to systematically trace how biases evolve throughout the alignment pipeline.
Method: Introduces VideoBiasEval, a comprehensive diagnostic framework grounded in social bias taxonomies. Uses event-based prompting to disentangle semantic content from actor attributes, with multi-granular metrics evaluating ethnicity bias, gender bias conditioned on ethnicity, distributional shifts, and temporal persistence of bias within videos.
Result: Alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. The framework enables end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and propagation through alignment-tuned video diffusion models.
Conclusion: Highlights the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
Abstract: Recent advances in video diffusion models have significantly enhanced text-to-video generation, particularly through alignment tuning using reward models trained on human preferences. While these methods improve visual quality, they can unintentionally encode and amplify social biases. To systematically trace how such biases evolve throughout the alignment pipeline, we introduce VideoBiasEval, a comprehensive diagnostic framework for evaluating social representation in video generation. Grounded in established social bias taxonomies, VideoBiasEval employs an event-based prompting strategy to disentangle semantic content (actions and contexts) from actor attributes (gender and ethnicity). It further introduces multi-granular metrics to evaluate (1) overall ethnicity bias, (2) gender bias conditioned on ethnicity, (3) distributional shifts in social attributes across model variants, and (4) the temporal persistence of bias within videos. Using this framework, we conduct the first end-to-end analysis connecting biases in human preference datasets, their amplification in reward models, and their propagation through alignment-tuned video diffusion models. Our results reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable, producing smoother yet more stereotyped portrayals. These findings highlight the need for bias-aware evaluation and mitigation throughout the alignment process to ensure fair and socially responsible video generation.
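As a concrete illustration of the distributional-shift metric described above, the sketch below compares an attribute distribution before and after alignment tuning using total variation distance. The attribute labels, counts, and the choice of TV distance are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a distributional-shift measurement: compare the
# distribution of a social attribute (e.g., perceived gender) across model
# variants. Labels per generated video would come from an upstream
# classifier; the counts below are invented.
from collections import Counter

def attribute_distribution(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

base_labels    = ["woman"] * 48 + ["man"] * 52   # base diffusion model
aligned_labels = ["woman"] * 30 + ["man"] * 70   # after alignment tuning

shift = total_variation(
    attribute_distribution(base_labels),
    attribute_distribution(aligned_labels),
)
print(f"distributional shift (TV distance): {shift:.2f}")  # 0.18
```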
[79] Context-level Language Modeling by Learning Predictive Context Embeddings
Beiya Dai, Yuliang Liu, Daozheng Xue, Yunchong Song, Qipeng Guo, Kai Chen, Xinbing Wang, Bowen Zhou, Zhouhan Lin
Main category: cs.CL
TL;DR: ContextLM is a framework that improves language model efficiency by learning multi-token prediction through next-context prediction, achieving better performance with fewer parameters and training tokens.
Details
Motivation: Standard autoregressive language models predict tokens one by one, which can be inefficient. The authors aim to improve training efficiency and model performance by enabling implicit multi-token prediction through context embeddings.
Method: ContextLM augments standard pretraining with an intrinsic next-context prediction objective. It builds language models on top of context embeddings that span multiple tokens, enabling better next-token prediction by predicting the next context (a toy rendering of the objective follows the abstract). The model remains compatible with standard autoregressive evaluation paradigms.
Result: Experiments with GPT-2 and Pythia backbones (up to 1.5B parameters, 300B training tokens) show ContextLM shifts the Pareto frontier of scaling laws, achieving baseline perplexity with 39% fewer parameters. It demonstrates superior efficiency in parameters, training tokens, and FLOPs, with robust generalization improvements on downstream tasks.
Conclusion: ContextLM provides an effective framework for improving language model efficiency through implicit multi-token prediction, offering better performance with reduced computational resources while maintaining compatibility with standard evaluation methods.
Abstract: We propose ContextLM, a framework that implicitly learns multi-token prediction by augmenting standard pretraining with an intrinsic next-context prediction objective. ContextLM builds a language model on top of context embeddings that span multiple tokens, enabling better next-token prediction by predicting the next context. Our model is fully compatible with standard autoregressive, token-by-token evaluation paradigms (e.g., perplexity). Extensive experiments with GPT-2 and Pythia backbones (up to 1.5B parameters and 300B training tokens) reveal that ContextLM shifts the Pareto frontier of scaling laws, exhibiting superior efficiency in parameters, training tokens, and FLOPs. Our results show that ContextLM achieves the baseline perplexity with 39% fewer parameters and demonstrates robust generalization improvements on extensive downstream tasks under equivalent parameter counts.
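A crude toy rendering of the next-context idea, under my own assumptions: alongside next-token cross-entropy, each hidden state is also trained to predict an embedding summarizing the next span of tokens. The span size, mean-pooling, and MSE loss are placeholders to keep the sketch self-contained; in the paper the context embedding is presumably learned rather than a fixed mean-pool.

```python
# Toy auxiliary next-context objective added to standard next-token CE.
# All shapes and loss choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def next_context_loss(hidden, token_embeds, span=4):
    """hidden: (B, T, D) model states; token_embeds: (B, T, D) input embeddings.
    The state at position t predicts the mean embedding of tokens
    t+1 .. t+span, a crude stand-in for a learned context embedding."""
    B, T, D = hidden.shape
    losses = []
    for t in range(T - span):
        target = token_embeds[:, t + 1 : t + 1 + span].mean(dim=1)  # (B, D)
        losses.append(F.mse_loss(hidden[:, t], target))
    return torch.stack(losses).mean()

B, T, D, V = 2, 16, 32, 100
hidden = torch.randn(B, T, D, requires_grad=True)
token_embeds = torch.randn(B, T, D)
logits = torch.randn(B, T, V, requires_grad=True)
targets = torch.randint(V, (B, T))

loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1)) \
     + 0.5 * next_context_loss(hidden, token_embeds)   # weighted auxiliary term
loss.backward()
```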
[80] RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Main category: cs.CL
TL;DR: RELOOP: A structure-aware RAG framework using hierarchical sequences to improve multi-step QA across text, tables, and knowledge graphs with guided, budget-aware iteration.
Details
Motivation: Current RAG systems struggle with multi-step questions and heterogeneous evidence sources, facing trade-offs between accuracy, latency, and computational budgets. There's a need for a unified approach that can handle diverse data formats while maintaining efficiency.
Method: RELOOP uses Hierarchical Sequence (HSEQ) to linearize documents, tables, and knowledge graphs into reversible hierarchical sequences with structural tags (a toy linearization sketch follows the abstract). It employs a Head Agent for guidance and an Iteration Agent that performs structure-respecting actions (parent/child hops, table neighbors, KG relations) to collect evidence before answer synthesis, with optional refinement loops.
Result: Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong baselines with high efficiency. The framework demonstrates format-agnostic unification, guided budget-aware iteration, and evidence canonicalization for reliable QA.
Conclusion: RELOOP provides an effective structure-aware framework for multi-step QA across heterogeneous data sources, achieving better accuracy-efficiency trade-offs than existing RAG approaches while maintaining auditability and consistency.
Abstract: Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces RELOOP, a structure-aware framework using Hierarchical Sequence (HSEQ) that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that steers retrieval, while an Iteration Agent selects and expands the HSEQ via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Besides these results, RELOOP exhibits three key advantages: (1) format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.
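To give a feel for what a reversible, tag-based linearization (the HSEQ idea) might look like for one modality, here is a toy table serializer. The tag vocabulary and layout are invented; the paper does not specify this exact format.

```python
# Sketch of a reversible hierarchical linearization for a table, with
# lightweight structural tags so an agent could later take
# structure-respecting hops (row/column neighbors). Tag names are invented.
def linearize_table(caption, header, rows):
    parts = [f"<TAB> <CAP> {caption} </CAP>"]
    for r, row in enumerate(rows):
        cells = " ".join(
            f"<CELL r={r} c={c} col={header[c]}> {val} </CELL>"
            for c, val in enumerate(row)
        )
        parts.append(f"<ROW {r}> {cells} </ROW>")
    parts.append("</TAB>")
    return "\n".join(parts)

print(linearize_table(
    caption="Quarterly revenue",
    header=["quarter", "revenue"],
    rows=[["Q1", "1.2M"], ["Q2", "1.5M"]],
))
```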
[81] Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
Kellen Parker van Dam, Abishek Stephen
Main category: cs.CL
TL;DR: Unsupervised anomaly detection methods identify phonotactic inconsistencies in wordlists to flag potential transcription errors and borrowings in language documentation.
Details
Motivation: Language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis, especially in low-resourced languages where data quality is crucial for accurate research.
Method: Developed unsupervised anomaly detection methods using character-level and syllable-level phonotactic features to identify inconsistencies in wordlists (a character-level toy scorer follows the abstract). Applied these algorithms to a multilingual dataset of Kokborok varieties with Bangla.
Result: Syllable-aware features significantly outperform character-level baselines in identifying anomalies. While precision and recall remain modest due to the subtle nature of these anomalies, the high-recall approach effectively flags entries requiring verification.
Conclusion: The methods provide fieldworkers with a systematic approach to improve data quality in low-resourced language documentation by identifying potential transcription errors and borrowings that might otherwise mislead linguistic analysis.
Abstract: Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
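For intuition, here is a minimal character-level phonotactic scorer of the kind the paper uses as a baseline: fit bigram statistics on the wordlist itself and rank entries by length-normalized log-probability, lowest first. The wordlist and smoothing constants below are invented, and the paper's syllable-aware features are richer than this.

```python
# Unsupervised phonotactic anomaly scoring: words whose character bigrams
# are unusual under the wordlist's own statistics surface first.
import math
from collections import Counter

def fit_bigrams(words):
    counts, context = Counter(), Counter()
    for w in words:
        chars = f"^{w}$"                      # word boundary markers
        for a, b in zip(chars, chars[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def score(word, counts, context, alpha=1.0, vocab=30):
    """Length-normalized add-alpha bigram log-probability (assumed setup)."""
    chars = f"^{word}$"
    logp = 0.0
    for a, b in zip(chars, chars[1:]):
        logp += math.log((counts[(a, b)] + alpha) / (context[a] + alpha * vocab))
    return logp / (len(chars) - 1)

wordlist = ["bokhla", "bupha", "khum", "chini", "bokcha", "khumcha"]
counts, context = fit_bigrams(wordlist)
ranked = sorted(wordlist, key=lambda w: score(w, counts, context))
print("most anomalous first:", ranked[:3])   # candidates for verification
```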
[82] Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang
Main category: cs.CL
TL;DR: RAGLens: A lightweight hallucination detector for RAG systems that uses sparse autoencoders to identify hallucination-related features in LLM internal representations, achieving superior detection performance with interpretable rationales.
Details
Motivation: Existing hallucination detection methods for RAG systems either require large annotated datasets for training or incur high inference costs from external LLM judges. Current approaches using LLM internal representations have limited accuracy, creating a need for more effective and efficient detection methods.
Method: Uses sparse autoencoders (SAEs) to disentangle LLM internal activations, identifies hallucination-specific features through information-based feature selection and additive feature modeling, creating RAGLens, a lightweight detector that flags unfaithful RAG outputs (a minimal SAE sketch follows the abstract).
Result: RAGLens achieves superior hallucination detection performance compared to existing methods, provides interpretable rationales for decisions, enables effective post-hoc mitigation of unfaithful RAG, and reveals new insights about hallucination signal distribution in LLMs.
Conclusion: RAGLens demonstrates that mechanistic interpretability techniques like SAEs can effectively identify hallucination-related features in LLMs, enabling lightweight, accurate, and interpretable hallucination detection for RAG systems without extensive training data or high inference costs.
Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
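A minimal sparse autoencoder of the kind this line of work trains on LLM activations is sketched below: an overcomplete ReLU encoder with an L1 sparsity penalty on the latent features. The sizes, penalty weight, and training loop are illustrative; the paper's information-based selection of hallucination-linked latents is omitted.

```python
# Minimal SAE: reconstruct activations through a sparse overcomplete code.
# Hallucination detection would then inspect which latents fire on
# unfaithful generations; that step is not shown here.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, d_features=256):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))      # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(512, 64)                 # stand-in for LLM activations

for _ in range(100):
    recon, feats = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```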
[83] SegNSP: Revisiting Next Sentence Prediction for Linear Text Segmentation
José Isidro, Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos
Main category: cs.CL
TL;DR: SegNSP frames linear text segmentation as a next sentence prediction task using label-agnostic approach with segmentation-aware loss and harder negative sampling, achieving strong performance on two datasets.
Details
Motivation: Linear text segmentation is challenging due to difficulty defining topic boundaries, discourse variability, and balancing local coherence with global context, which hinders downstream NLP applications like summarization and information retrieval.
Method: Frames segmentation as next sentence prediction (NSP) task with label-agnostic approach that predicts whether next sentence continues current topic without explicit labels. Enhanced with segmentation-aware loss combined with harder negative sampling to better capture discourse continuity (a toy pair-construction sketch follows the abstract).
Result: On CitiLink-Minutes dataset (first segmentation benchmark), achieves B-F₁ of 0.79, closely aligning with human-annotated topic transitions. On WikiSection, attains B-F₁ of 0.65, outperforming strongest reproducible baseline TopSeg by 0.17 absolute points.
Conclusion: SegNSP demonstrates competitive and robust performance, highlighting effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
Abstract: Linear text segmentation is a long-standing problem in natural language processing (NLP), focused on dividing continuous text into coherent and semantically meaningful units. Despite its importance, the task remains challenging due to the complexity of defining topic boundaries, the variability in discourse structure, and the need to balance local coherence with global context. These difficulties hinder downstream applications such as summarization, information retrieval, and question answering. In this work, we introduce SegNSP, framing linear text segmentation as a next sentence prediction (NSP) task. Although NSP has largely been abandoned in modern pre-training, its explicit modeling of sentence-to-sentence continuity makes it a natural fit for detecting topic boundaries. We propose a label-agnostic NSP approach, which predicts whether the next sentence continues the current topic without requiring explicit topic labels, and enhance it with a segmentation-aware loss combined with harder negative sampling to better capture discourse continuity. Unlike recent proposals that leverage NSP alongside auxiliary topic classification, our approach avoids task-specific supervision. We evaluate our model against established baselines on two datasets, CitiLink-Minutes, for which we establish the first segmentation benchmark, and WikiSection. On CitiLink-Minutes, SegNSP achieves a B-F₁ of 0.79, closely aligning with human-annotated topic transitions, while on WikiSection it attains a B-F₁ of 0.65, outperforming the strongest reproducible baseline, TopSeg, by 0.17 absolute points. These results demonstrate competitive and robust performance, highlighting the effectiveness of modeling sentence-to-sentence continuity for improving segmentation quality and supporting downstream NLP applications.
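A sketch of how NSP-style training pairs with harder negatives might be constructed, under my own assumptions: positives are consecutive sentences within a segment, pairs that straddle a boundary get label 0, and hard negatives pair a sentence with a nearby sentence from a different segment rather than a random one from the corpus.

```python
# Toy NSP pair construction with "harder" negatives drawn from nearby
# segments. The windowing and sampling details are illustrative guesses.
import random

def make_nsp_pairs(segments, window=2, seed=0):
    rng = random.Random(seed)
    pairs = []  # (sent_a, sent_b, label); label 1 = same topic continues
    flat = [(i, s) for i, seg in enumerate(segments) for s in seg]
    for idx, (seg_id, sent) in enumerate(flat[:-1]):
        next_seg_id, next_sent = flat[idx + 1]
        pairs.append((sent, next_sent, int(seg_id == next_seg_id)))
        # hard negative: a sentence from another segment within `window` steps
        candidates = [
            s for j, (sid, s) in enumerate(flat)
            if sid != seg_id and abs(j - idx) <= window
        ]
        if candidates:
            pairs.append((sent, rng.choice(candidates), 0))
    return pairs

segments = [["The council met.", "Minutes were approved."],
            ["Next, the budget.", "Cuts were proposed."]]
for a, b, y in make_nsp_pairs(segments):
    print(y, "|", a, "->", b)
```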
[84] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
Akriti Dhasmana, Aarohi Srivastava, David Chiang
Main category: cs.CL
TL;DR: Empirical study shows cross-lingual ASR transfer works better with phylogenetically closer languages, but dialectal data fine-tuning can match performance of larger high-resource language data, with analysis of biases in pre-trained models.
Details
Motivation: To understand how cross-lingual transfer works for ASR systems on spontaneous, noisy, code-mixed speech across Indic dialects and language varieties, particularly examining the role of phylogenetic distance versus dialectal data availability.
Method: Conducted empirical study across wide range of Indic dialects, comparing ASR performance based on phylogenetic distance and fine-tuning strategies, and including a case study on the low-resource Garhwali language with evaluation of multiple contemporary ASR models.
Result: ASR performance improves with reduced phylogenetic distance between languages, but fine-tuning on smaller dialectal data can yield comparable performance to larger high-resource language data. Analysis reveals biases toward pre-training languages in transcription errors.
Conclusion: Phylogenetic distance alone doesn’t fully explain ASR performance in dialectal settings; dialect-specific fine-tuning is effective, and pre-trained models exhibit biases that affect performance on non-standardized speech.
Abstract: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
[85] A.X K1 Technical Report
Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Seungsik Kim, Sungwan Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Dhammiko Arya, Soohyun Bae, Dongyeon Cho, Seungmo Cho, Sangho Choi, Yongseok Choi, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Jonghwi Kim, Joonghoon Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon
Main category: cs.CL
TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch on 10T tokens, featuring controllable reasoning modes and strong Korean-language performance.
Details
Motivation: The paper aims to bridge the gap between reasoning capability and inference efficiency in large language models, enabling scalable deployment across diverse real-world scenarios with controllable reasoning.
Method: Uses scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. Implements a multi-stage data processing pipeline for corpus curation. Proposes a Think-Fusion training recipe that enables user-controlled switching between thinking and non-thinking modes within a single unified model.
Result: A.X K1 achieves performance competitive with leading open-source models and establishes distinctive advantages in Korean-language benchmarks. The model demonstrates effective controllable reasoning capabilities.
Conclusion: A.X K1 successfully bridges reasoning capability with inference efficiency through its MoE architecture and Think-Fusion training, offering practical deployment advantages with controllable reasoning modes.
Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.
[86] StatLLaMA: Multi-Stage training for domain-optimized statistical large language models
Jing-Yi Zeng, Guan-Hua Huang
Main category: cs.CL
TL;DR: Efficient domain specialization of LLaMA-3.2-3B for statistics, showing that starting from instruction-tuned foundation models enables effective statistical reasoning while base models fail.
Details
Motivation: To develop resource-efficient domain-specialized LLMs for statistics, investigating optimal training pipelines and trade-offs between domain expertise and general reasoning.
Method: Systematic comparison of three multi-stage training pipelines using LLaMA-3.2-3B family: base FM, base FM with instruction tuning, and instruction-tuned FM, across continual pretraining, SFT, RLHF alignment, and downstream task fine-tuning.
Result: Pipelines starting from base FM fail at statistical reasoning; starting from instruction-tuned FM enables effective specialization. SFT variants show trade-offs between domain expertise and general reasoning. Direct preference optimization provides stable RLHF alignment. DTFT requires low intensity to avoid catastrophic forgetting.
Conclusion: StatLLaMA achieves balanced performance on mathematical reasoning, common-sense reasoning, and statistical expertise, providing a practical blueprint for resource-efficient statistical LLMs.
Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines (starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities) across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task fine-tuning (DTFT). Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that DTFT must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.
[87] Polymer-Agent: Large Language Model Agent for Polymer Design
Vani Nigam, Achuth Chandrasekhar, Amir Barati Farimani
Main category: cs.CL
TL;DR: A closed-loop polymer structure-property predictor integrated in a terminal, powered by LLM reasoning for property prediction, property-guided polymer structure generation, and structure modification, with SMILES sequences guided by synthetic accessibility scores.
Details
Motivation: Polymer discovery traditionally involves long trial-and-error processes requiring extensive resources. Machine learning has accelerated scientific discovery, but laboratory researchers face infrastructure limitations in accessing codes and models for extracting individual structures and properties.
Method: Developed a closed-loop polymer structure-property predictor integrated in a terminal, powered by LLM reasoning. The framework provides property prediction, property-guided polymer structure generation, and structure modification capabilities. SMILES sequences are guided by synthetic accessibility score and synthetic complexity score (SC Score) to ensure generated polymers are synthetically accessible at the monomer level.
Result: The framework addresses the challenge of generating novel polymer structures for laboratory researchers, providing computational insights into polymer research by making advanced ML models accessible without extensive infrastructure requirements.
Conclusion: The LLM-powered terminal-integrated framework enables early-stage polymer discovery by making property prediction and structure generation accessible to laboratory researchers, bridging the gap between computational models and practical experimental work.
Abstract: On-demand polymer discovery is essential for various industries, ranging from biomedical to reinforcement materials. Experiments with polymers involve a long trial-and-error process that consumes extensive resources. For these processes, machine learning has accelerated scientific discovery on the property prediction and latent space search fronts. However, laboratory researchers cannot readily access the code behind these models to extract individual structures and properties due to infrastructure limitations. We present a closed-loop polymer structure-property predictor integrated in a terminal for early-stage polymer discovery. The framework is powered by LLM reasoning to provide users with property prediction, property-guided polymer structure generation, and structure modification capabilities. The SMILES sequences are guided by the synthetic accessibility score and the synthetic complexity score (SC Score) to ensure that polymer generation is as close as possible to synthetically accessible monomer-level structures. This framework addresses the challenge of generating novel polymer structures for laboratory researchers, thereby providing computational insights into polymer research.
[88] MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine
Liz Li, Wei Zhu
Main category: cs.CL
TL;DR: Medical RAG benchmark (MRAG) for evaluating retrieval-augmented generation in medical QA across English/Chinese with Wikipedia/PubMed corpus and toolkit
Details
Motivation: Lack of comprehensive evaluation benchmarks for Retrieval-Augmented Generation (RAG) in the medical domain despite its rapid adoption in scientific and clinical QA systems.
Method: Developed MRAG benchmark covering various medical tasks in English and Chinese, built corpus from Wikipedia and PubMed, created MRAG-Toolkit for systematic exploration of RAG components
Result: RAG enhances LLM reliability across medical tasks; performance influenced by retrieval approaches, model sizes, and prompting; improves usefulness and reasoning but slightly reduces readability for long-form questions
Conclusion: MRAG benchmark and toolkit will be released to facilitate medical RAG applications in academia and industry, addressing the evaluation gap in medical domain RAG systems
Abstract: While Retrieval-Augmented Generation (RAG) has been swiftly adopted in scientific and clinical QA systems, a comprehensive evaluation benchmark in the medical domain is lacking. To address this gap, we introduce the Medical Retrieval-Augmented Generation (MRAG) benchmark, covering various tasks in English and Chinese, and building a corpus from Wikipedia and PubMed. Additionally, we develop the MRAG-Toolkit, facilitating systematic exploration of different RAG components. Our experiments reveal that: (a) RAG enhances LLM reliability across MRAG tasks; (b) the performance of RAG systems is influenced by retrieval approaches, model sizes, and prompting strategies; (c) while RAG improves usefulness and reasoning quality, LLM responses may become slightly less readable for long-form questions. We will release the MRAG-Bench dataset and toolkit under a CC-BY-4.0 license upon acceptance, to facilitate applications from both academia and industry.
[89] Adapter Merging Reactivates Latent Reasoning Traces: A Mechanism Analysis
Junyi Zou
Main category: cs.CL
TL;DR: Medical LLMs show reasoning trace leakage after adapter merging; study introduces marker-forbidden evaluation and logit-space interventions to reduce leakage and improve accuracy.
Details
Motivation: Two-stage fine-tuned LLMs (domain adaptation + instruction alignment) can exhibit interference after adapter merging, causing re-emergence of explicit reasoning traces under strict decoding, which is problematic for medical applications requiring safety and reliability.
Method: Use lightweight measurements of trace leakage and instruction-following; introduce marker-forbidden answer-only evaluation; define correctness-based direction without surface markers; apply rank-1 logit-space interventions (a toy sketch follows the abstract); analyze layer-wise geometric evidence of misaligned adapter updates; develop geometry-aware merging approach.
Result: Logit-space interventions along correctness direction improve multiple-choice accuracy beyond random-direction controls; geometric analysis shows domain and instruction adapters induce partially misaligned update directions; geometry-aware merging reduces leakage and improves accuracy in toy settings.
Conclusion: Characterizes boundary conditions of trace leakage in medical LLMs, provides practical diagnostics and interventions for safer adapter merging, with implications for model safety and reliability in specialized domains.
Abstract: Large language models fine-tuned via a two-stage pipeline (domain adaptation followed by instruction alignment) can exhibit non-trivial interference after adapter merging, including the re-emergence of explicit reasoning traces under strict decoding. We study this phenomenon in medical LLM settings using lightweight, reproducible measurements of trace leakage and instruction-following behavior. Beyond marker-based proxies, we introduce a marker-forbidden, answer-only evaluation and define a correctness-based direction that does not rely on surface markers; a rank-1 logit-space intervention along this direction modulates decision distributions and improves multiple-choice accuracy beyond random-direction controls at sufficiently large intervention strength. We further provide layer-wise geometric evidence that domain and instruction adapters induce partially misaligned update directions, and present a proof-of-concept geometry-aware merge that can reduce leakage and/or improve accuracy in a toy setting. Our results characterize boundary conditions of trace leakage and provide practical diagnostics and interventions for safer adapter merging.
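The rank-1 logit-space intervention is simple to state; the sketch below shifts next-token logits along a fixed direction with strength alpha. How the correctness direction is estimated from answer-only contrasts is the paper's contribution and is stubbed out here with a toy one-hot direction.

```python
# Rank-1 logit-space intervention: logits' = logits + alpha * d_hat.
# The direction and strengths below are invented for illustration.
import numpy as np

def intervene(logits, direction, alpha):
    """logits: (V,) next-token logits; direction: (V,) steering vector."""
    d = direction / np.linalg.norm(direction)
    return logits + alpha * d

rng = np.random.default_rng(0)
V = 8
logits = rng.normal(size=V)
direction = np.zeros(V); direction[3] = 1.0   # toy: boost answer option 3
for alpha in (0.0, 1.0, 4.0):
    probs = np.exp(intervene(logits, direction, alpha))
    probs /= probs.sum()
    print(f"alpha={alpha}: P(token 3) = {probs[3]:.3f}")
```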
[90] Scaling Embeddings Outperforms Scaling Experts in Language Models
Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
Main category: cs.CL
TL;DR: Embedding scaling emerges as a superior alternative to expert scaling for sparsity in large language models, achieving better performance with optimized inference through system-level improvements.
Details
Motivation: Mixture-of-Experts (MoE) architectures face diminishing returns and system bottlenecks as standard for sparsity scaling, prompting exploration of embedding scaling as an orthogonal dimension for more effective sparse model scaling.
Method: Comprehensive analysis identifies regimes where embedding scaling outperforms expert scaling, characterizing architectural factors like parameter budgeting, model width/depth interplay. Integrates system optimizations and speculative decoding to convert sparsity into inference speedups.
Result: LongCat-Flash-Lite (68.5B parameters, ~3B activated) surpasses parameter-equivalent MoE baselines and shows exceptional competitiveness in agentic and coding domains despite allocating over 30B parameters to embeddings.
Conclusion: Embedding scaling represents a potent alternative to MoE for sparsity scaling, offering superior Pareto frontiers in specific regimes and enabling efficient large-scale models with practical inference improvements.
Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
[91] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
Liz Li, Wei Zhu
Main category: cs.CL
TL;DR: ChatGPT’s performance on medical information extraction tasks lags behind fine-tuned models, though it provides good explanations and faithfulness but suffers from over-confidence and generation uncertainty.
Details
Motivation: While LLMs like ChatGPT show impressive general capabilities, their specific performance on medical information extraction tasks needs systematic evaluation to understand their strengths and limitations in specialized domains.
Method: Systematic evaluation of ChatGPT across 4 medical information extraction tasks using 6 benchmark datasets, measuring performance, explainability, confidence, faithfulness, and uncertainty.
Result: ChatGPT underperforms fine-tuned models on MedIE tasks, provides high-quality explanations but is over-confident, shows good faithfulness to original text, and suffers from generation uncertainty affecting extraction reliability.
Conclusion: ChatGPT has limitations for medical information extraction despite good explanation capabilities, with performance gaps and uncertainty issues that may hinder practical applications in this domain.
Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models; (b) ChatGPT can provide high-quality explanations for its decisions, but is over-confident in its predictions; (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases; (d) uncertainty in generation causes uncertainty in information extraction results, which may hinder its applications in MedIE tasks.
[92] Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs
Afrozah Nadeem, Agrima Seth, Mehwish Nasim, Usman Naseem
Main category: cs.CL
TL;DR: Multilingual evaluation of political bias in LLMs across 50 countries/33 languages, with Cross-Lingual Alignment Steering (CLAS) framework for post-hoc mitigation that aligns ideological representations across languages while preserving response quality.
Details
Motivation: LLMs shape global discourse but political bias evaluation has focused on high-resource Western languages, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. Need for fairness and ideological neutrality in multilingual AI deployment.
Method: Large-scale multilingual evaluation spanning 50 countries and 33 languages. Introduces Cross-Lingual Alignment Steering (CLAS) framework that aligns latent ideological representations induced by political prompts into shared ideological subspace, with adaptive mechanism to prevent over-correction and preserve coherence (a toy steering sketch follows the abstract).
Result: Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. Framework establishes scalable and interpretable paradigm for fairness-aware multilingual LLM governance.
Conclusion: CLAS provides effective post-hoc mitigation for political bias in multilingual LLMs, balancing ideological neutrality with linguistic and cultural diversity while maintaining response quality.
Abstract: Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross-lingual consistency, while the adaptive mechanism prevents over-correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.
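For intuition, here is a toy activation-steering step in the spirit of CLAS: remove the component of a hidden state along an ideological direction, with an adaptive strength that backs off when the projection is small to avoid over-correction. Using one shared direction for all languages and the threshold value are simplifying assumptions; the paper aligns language-specific representations into a shared subspace.

```python
# Toy adaptive steering: h' = h - strength * (h . v_hat) * v_hat,
# where strength shrinks for weakly loaded states. All values invented.
import numpy as np

def steer(h, v, max_strength=1.0, threshold=0.5):
    v = v / np.linalg.norm(v)
    proj = float(h @ v)
    # adaptive strength: full removal only for strongly loaded states
    strength = max_strength * min(1.0, abs(proj) / threshold)
    return h - strength * proj * v

rng = np.random.default_rng(1)
v = rng.normal(size=16)                      # shared ideological direction
for scale in (0.1, 2.0):                     # weakly vs strongly biased state
    h = scale * v + 0.1 * rng.normal(size=16)
    h_steered = steer(h, v)
    v_hat = v / np.linalg.norm(v)
    print(f"projection before {h @ v_hat:.2f} -> after {h_steered @ v_hat:.2f}")
```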
[93] Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing
Tong Zheng, Chengsong Huang, Runpeng Dai, Yun He, Rui Liu, Xin Ni, Huiwen Bao, Kaishen Wang, Hongtu Zhu, Jiaxin Huang, Furong Huang, Heng Huang
Main category: cs.CL
TL;DR: Parallel-Probe: A training-free controller that optimizes parallel reasoning by dynamically adjusting width (number of branches) and depth (reasoning steps) using consensus-based early stopping and deviation-based pruning.
Details
Motivation: Parallel thinking is computationally expensive, and existing efficiency methods lack principled mechanisms to exploit global dynamics across parallel reasoning branches. There's a need for better optimization of width-depth trade-offs in parallel reasoning systems.
Method: Introduces 2D probing to expose width-depth dynamics by periodically eliciting intermediate answers from all branches. Based on insights from this analysis, develops Parallel-Probe with two key mechanisms: consensus-based early stopping to regulate reasoning depth, and deviation-based branch pruning to dynamically adjust width (a toy controller sketch follows the abstract).
Result: Extensive experiments across three benchmarks and multiple models show Parallel-Probe establishes superior Pareto frontier for test-time scaling. Reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy compared to standard majority voting.
Conclusion: Parallel-Probe effectively optimizes parallel reasoning efficiency by dynamically managing width-depth trade-offs, demonstrating significant computational savings while preserving reasoning quality.
Abstract: Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
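A toy controller capturing the two mechanisms, under assumptions of my own: probe all branches periodically for intermediate answers, stop once a consensus fraction agree, and prune branches that deviate from a clear majority. `step_branch` stands in for advancing one reasoning trajectory by one step.

```python
# Toy consensus-based early stopping + deviation-based pruning.
# Probe period, consensus fraction, and the pruning rule are invented.
from collections import Counter

def parallel_probe(step_branch, n_branches=8, max_steps=20,
                   probe_every=4, consensus=0.75):
    answers = [None] * n_branches
    alive = set(range(n_branches))
    for step in range(1, max_steps + 1):
        for b in alive:
            answers[b] = step_branch(b, step)        # advance + probe branch b
        if step % probe_every:
            continue
        votes = Counter(answers[b] for b in alive)
        top, count = votes.most_common(1)[0]
        if count / len(alive) >= consensus:          # early stop on consensus
            return top, step
        if count / len(alive) >= 0.5:                # prune deviating branches
            alive = {b for b in alive if answers[b] == top}
    votes = Counter(answers[b] for b in alive)
    return votes.most_common(1)[0][0], max_steps

# Toy branches: most converge to 42 by step 6, two stragglers lag behind.
def step_branch(b, step):
    if b < 2:
        return 41 if step < 12 else 42
    return 42 if step >= 6 else f"draft-{b}"

answer, steps = parallel_probe(step_branch)
print(answer, steps)   # 42 8: consensus reached well before max_steps
```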
[94] Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks
Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
Main category: cs.CL
TL;DR: Copyright Detective: An interactive forensic system for detecting and analyzing copyright risks in LLM outputs through multiple detection paradigms and iterative workflows.
Details
Motivation: Current approaches treat copyright infringement as a static classification task, but copyright law is complex and requires evidence discovery. There's a need for systematic tools to audit LLM outputs for copyright risks to support responsible deployment.
Method: Interactive forensic system integrating multiple detection paradigms: content recall testing (a toy recall test follows the abstract), paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification. Uses interactive prompting, response collection, and iterative workflows within a unified extensible framework.
Result: First interactive forensic system for copyright risk detection in LLMs. Enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting transparent evaluation even with black-box access to models.
Conclusion: Copyright Detective provides a comprehensive framework for detecting and analyzing copyright risks in LLM outputs, treating infringement as an evidence discovery process rather than simple classification, supporting responsible AI deployment.
Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.
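The simplest of the listed paradigms, content recall testing, can be sketched directly: prompt the model with the opening of a passage and measure how much of the true continuation comes back verbatim. The longest-common-substring ratio and the mock generator below are illustrative choices, not the system's actual scoring.

```python
# Toy content recall test against a black-box generator.
from difflib import SequenceMatcher

def recall_leakage(generate, passage, prefix_frac=0.3):
    cut = int(len(passage) * prefix_frac)
    prefix, continuation = passage[:cut], passage[cut:]
    output = generate(prefix)
    match = SequenceMatcher(None, output, continuation).find_longest_match(
        0, len(output), 0, len(continuation))
    return match.size / max(len(continuation), 1)  # verbatim overlap ratio

passage = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom")
mock_generate = lambda prefix: ("it was the worst of times, "
                                "it was the age of foolishness")
print(f"verbatim leakage ratio: {recall_leakage(mock_generate, passage):.2f}")
```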
[95] Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision
Md. Mithun Hossain, Mashary N. Alrasheedy, Nirban Bhowmick, Shamim Forhad, Md. Shakil Hossain, Sudipto Chaki, Md Shafiqul Islam
Main category: cs.CL
TL;DR: Uncertainty-aware framework for multilingual multi-label emotion classification that handles ambiguous and incomplete supervision through entropy-based weighting and positive-unlabeled regularization.
Details
Motivation: Knowledge-based systems need multilingual emotion identification but face challenges with emotional ambiguity and incomplete supervision. Existing methods assume fully observed labels and deterministic learning, leading to biased learning under partial supervision.
Method: Proposes Reasoning under Ambiguity framework with shared multilingual encoder, language-specific optimization, entropy-based ambiguity weighting (down-weights ambiguous instances), and mask-aware objective with positive-unlabeled regularization for robust learning under partial supervision (a toy loss sketch follows the abstract).
Result: Experiments on English, Spanish, and Arabic emotion classification benchmarks show consistent improvements over strong baselines across multiple metrics, with better training stability, robustness to annotation sparsity, and enhanced interpretability.
Conclusion: The uncertainty-aware framework effectively addresses challenges of emotional ambiguity and incomplete supervision in multilingual emotion classification, providing more reliable predictions for knowledge-based systems.
Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
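A sketch of two of the loss ingredients under my own formulation: a mask-aware BCE that ignores unannotated labels rather than treating them as negatives, and an entropy-based instance weight that down-weights ambiguous examples. The paper's exact weighting may differ, and its positive-unlabeled regularizer is omitted here.

```python
# Toy entropy-weighted, mask-aware multi-label loss.
import torch
import torch.nn.functional as F

def masked_weighted_bce(logits, labels, observed_mask):
    """logits, labels, observed_mask: (B, C); mask=1 where label is annotated."""
    probs = torch.sigmoid(logits)
    # instance ambiguity: mean Bernoulli entropy over predicted labels
    ent = -(probs * probs.clamp_min(1e-8).log()
            + (1 - probs) * (1 - probs).clamp_min(1e-8).log()).mean(dim=1)
    weight = 1.0 - ent / torch.log(torch.tensor(2.0))  # 1 = confident, 0 = ambiguous
    per_label = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none") * observed_mask  # ignore missing labels
    per_inst = per_label.sum(dim=1) / observed_mask.sum(dim=1).clamp_min(1)
    return (weight.detach() * per_inst).mean()

B, C = 4, 6
logits = torch.randn(B, C, requires_grad=True)
labels = torch.randint(2, (B, C)).float()
observed = (torch.rand(B, C) > 0.3).float()   # ~30% of labels missing
loss = masked_weighted_bce(logits, labels, observed)
loss.backward()
```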
[96] What Is Novel? A Knowledge-Driven Framework for Bias-Aware Literature Originality Evaluation
Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi
Main category: cs.CL
TL;DR: A literature-aware novelty assessment framework that learns from peer-review reports to evaluate research novelty through structured comparison to prior work
Details
Motivation: Research novelty assessment in peer review is subjective and based on incomplete comparisons; there's a need for systematic, evidence-based novelty evaluation grounded in literature comparison.
Method: Fine-tunes LLM on 80K novelty-annotated reviews; extracts structured representations of manuscripts (ideas, methods, claims); retrieves related papers; constructs similarity graphs for concept-level comparison; produces calibrated novelty scores
Result: System captures reviewer-aligned novelty evaluation, reduces overestimation, improves consistency compared to existing approaches, and produces human-like explanatory assessments
Conclusion: The framework provides systematic, evidence-based novelty assessment that aligns with human judgment and addresses subjectivity in peer review
Abstract: Assessing research novelty is a core yet highly subjective aspect of peer review, typically based on implicit judgment and incomplete comparison to prior work. We introduce a literature-aware novelty assessment framework that explicitly learns how humans judge novelty from peer-review reports and grounds these judgments in structured comparison to existing research. Using nearly 80K novelty-annotated reviews from top-tier AI conferences, we fine-tune a large language model to capture reviewer-aligned novelty evaluation behavior. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and constructs a similarity graph that enables fine-grained, concept-level comparison to prior work. Conditioning on this structured evidence, the model produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency relative to existing approaches.
[97] Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
Main category: cs.CL
TL;DR: Bielik Guard: Compact Polish language safety classifiers for content moderation, with 0.1B and 0.5B parameter variants trained on community-annotated data across five safety categories.
Details
Motivation: As LLMs become increasingly deployed in Polish language applications, there's a need for efficient and accurate content safety classifiers specifically for Polish content moderation.
Method: Developed two compact models: 0.1B parameter model based on MMLW-RoBERTa-base and 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on 6,885 community-annotated Polish texts across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm.
Result: 0.5B variant achieves best overall discrimination with F1 scores of 0.791 (micro) and 0.785 (macro). 0.1B variant shows exceptional efficiency and superior precision (77.65%) with very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard despite identical model size.
Conclusion: Bielik Guard provides effective Polish language safety classification with models designed to give appropriate responses rather than simple blocking, especially for sensitive categories like self-harm. Models are publicly available.
Abstract: As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
[98] Large Language Models and Impossible Language Acquisition: “False Promise” or an Overturn of our Current Perspective towards AI
Ziyan Wang, Longlong Ma
Main category: cs.CL
TL;DR: This paper examines Chomsky’s critique of LLMs as mere pattern predictors incapable of learning impossible languages, conducting experiments with GPT-2 and LSTM models on synthetically created impossible languages to test this claim.
Details
Motivation: The paper addresses Chomsky's fundamental critique that LLMs lack the intrinsic causal structures needed for genuine language acquisition and cannot distinguish possible from impossible languages, challenging the intellectual foundations of AI.
Method: Created syntactically impossible languages by transforming English (reversing sentences, adding negation based on word-count parity); a toy version of these transforms follows the abstract. Conducted controlled experiments on GPT-2 small models and LSTM models, using statistical analysis (Welch's t-test) to compare performance on possible vs. impossible languages.
Result: GPT-2 small models significantly underperformed in learning impossible languages compared to possible languages (p<.001). LSTM models’ performance aligned with Chomsky’s argument, highlighting the importance of transformer architecture evolution.
Conclusion: Proposes a new vision within Chomsky’s theory for LLMs and suggests shifting from Chomsky’s “rationalist-romantics” paradigm to functionalism and empiricism in LLMs research, acknowledging architectural differences in language learning capabilities.
Abstract: In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, and therefore are not able to distinguish impossible languages. It stands as representative of a fundamental challenge to the intellectual foundations of AI, for it integrally synthesizes major methodological issues within LLMs and takes an iconic a priori rationalist perspective. We examine this famous critique both from the perspective of pre-existing literature in linguistics and psychology and through an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences, and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLMs research.
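The two transformations are concrete enough to state exactly as described; the marker token and its placement rule below are my assumptions about the setup.

```python
# Toy impossible-language transforms: sentence reversal and parity-based
# negation insertion. Marker choice and position are illustrative guesses.
def reverse_language(sentence):
    return " ".join(reversed(sentence.split()))

def parity_negation_language(sentence, marker="not"):
    words = sentence.split()
    # assumed rule: even word count -> marker after the first word;
    # odd word count -> marker after the second word
    pos = 1 if len(words) % 2 == 0 else 2
    return " ".join(words[:pos] + [marker] + words[pos:])

s = "the cat sat on the mat"
print(reverse_language(s))           # mat the on sat cat the
print(parity_negation_language(s))   # the not cat sat on the mat
```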
[99] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, JinCheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, He Zhu, Yuchen Eleanor Jiang, Wei Wang, Wangchunshu Zhou
Main category: cs.CL
TL;DR: EcoGym is a benchmark for evaluating long-horizon planning in LLM-based agents across three economic environments with business-relevant metrics over extended time horizons.
Details
Motivation: Current evaluation frameworks for LLM-based agents are episodic, domain-specific, or lack grounding in persistent economic dynamics, limiting assessment of long-term strategic planning capabilities.
Method: Developed three diverse economic environments (Vending, Freelance, Operation) with unified decision-making interfaces and standardized evaluation over 1000+ steps, focusing on business outcomes like net worth and income (a toy day-loop sketch follows the abstract).
Result: Experiments with eleven leading LLMs revealed no single model dominates across all scenarios, with significant suboptimality in either high-level strategies or efficient action execution.
Conclusion: EcoGym provides an open, extensible testbed for transparent evaluation of long-horizon agent planning and studying controllability-utility trade-offs in realistic economic settings.
Abstract: Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategy or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.
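As an editor's illustration of the plan-and-execute day-loop EcoGym evaluates, here is a toy persistent economy. All names (VendingEnv, policy) and dynamics are our assumptions, not the benchmark's actual interface:

```python
# Minimal sketch of a persistent plan-and-execute loop of the kind EcoGym
# evaluates. Everything here is illustrative: a toy economy with budgeted
# actions per day-loop and a net-worth outcome metric.
import random

class VendingEnv:
    """Toy persistent economy: buy stock, sell at a markup, track net worth."""
    def __init__(self, cash: float = 100.0):
        self.cash, self.stock = cash, 0

    def step(self, action: str) -> float:
        if action == "restock" and self.cash >= 10:
            self.cash -= 10; self.stock += 10
        elif action == "sell" and self.stock > 0:
            sold = min(self.stock, random.randint(0, 5))
            self.stock -= sold; self.cash += 2.0 * sold
        return self.cash + 1.0 * self.stock  # net worth

def policy(env: VendingEnv) -> str:
    # Stand-in for the LLM agent's high-level strategy.
    return "restock" if env.stock == 0 else "sell"

env = VendingEnv()
net_worth = 0.0
for day in range(365):        # one year of day-loops
    for _ in range(3):        # budgeted actions per day
        net_worth = env.step(policy(env))
print(f"net worth after 365 days: {net_worth:.2f}")
```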
[100] Advancing Block Diffusion Language Models for Test-Time Scaling
Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang, Qi Guo, Tao Gui, Xuanjing Huang, Wei Ye, Shikun Zhang, Wei Wang
Main category: cs.CL
TL;DR: A unified framework for test-time scaling in Block Diffusion Language Models (BDLMs) with adaptive decoding and block-wise generation strategies to improve reasoning efficiency and effectiveness.
Details
Motivation: Existing BDLMs have limited exploration under test-time scaling and face decoding challenges in long Chain-of-Thought reasoning, particularly in balancing decoding speed and effectiveness.
Method: Proposes Bounded Adaptive Confidence Decoding (BACD) for difficulty-aware sampling, Think Coarse, Critic Fine (TCCF) paradigm for block size allocation, and Progressive Block Size Extension for efficient large-block decoding (a toy confidence-thresholding sketch follows the abstract).
Result: Applying BACD and TCCF to TDAR-8B yields 2.26x speedup and +11.2 points improvement on AIME24 over TraDo-8B baseline.
Conclusion: The framework represents an important step toward unlocking BDLM potential for test-time scaling in complex reasoning tasks.
Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
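A minimal sketch of confidence-bounded parallel decoding in the spirit of BACD follows; the simple thresholding rule and the bounds are our assumptions, not the paper's exact algorithm:

```python
# Toy sketch of bounded, confidence-adaptive parallel decoding: per step,
# finalize the tokens whose model confidence clears a threshold, clipped
# to [min_k, max_k] tokens, so denoising speeds up on easy spans while
# error accumulation stays controlled.
import numpy as np

def bacd_step(confidences: np.ndarray, tau: float = 0.9,
              min_k: int = 1, max_k: int = 8) -> np.ndarray:
    """Return indices of masked positions to finalize this step."""
    order = np.argsort(-confidences)          # most confident first
    above = int((confidences >= tau).sum())   # adaptive count
    k = int(np.clip(above, min_k, max_k))     # bounded
    return order[:k]

conf = np.array([0.99, 0.42, 0.95, 0.88, 0.97])  # fake per-token confidence
print(bacd_step(conf))  # [0 4 2]: three tokens clear tau=0.9
```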
[101] On the Optimal Reasoning Length for RL-Trained Language Models
Daisuke Nohara, Taishi Nakamura, Rio Yokota
Main category: cs.CL
TL;DR: Analysis of length control methods for RL-trained LLMs shows length penalties can hinder reasoning, while proper tuning improves efficiency for models with strong prior reasoning.
Details
Motivation: RL improves reasoning in LLMs but lengthens CoT outputs, increasing computational costs. Need to understand optimal output length for balancing efficiency and performance.
Method: Compare several length control methods on Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B models, extending prior work to RL-trained policies (an illustrative length-penalized reward is sketched after the abstract).
Result: Length penalties may hinder reasoning acquisition, while properly tuned length control improves efficiency for models with strong prior reasoning. Identified two failure modes: long outputs increase dispersion, short outputs lead to under-thinking.
Conclusion: Careful length control tuning is crucial for RL-trained LLMs to balance reasoning quality and computational efficiency, with different approaches needed based on model capabilities.
Abstract: Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
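For reference, a common form of length-penalized reward of the kind such studies compare looks as follows; the linear penalty shape and the coefficient are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative length-penalized RL reward: task correctness minus a
# penalty that grows once the output exceeds a target length.
def length_penalized_reward(correct: bool, n_tokens: int,
                            target: int = 2048, lam: float = 1e-4) -> float:
    base = 1.0 if correct else 0.0
    penalty = lam * max(0, n_tokens - target)  # only over-length is penalized
    return base - penalty

print(length_penalized_reward(True, 3000))  # 1.0 - 0.0952 = 0.9048
```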
[102] AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models
R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
Main category: cs.CL
TL;DR: AlignTune is a modular toolkit for LLM alignment that provides unified interfaces for SFT and RLHF with interchangeable backends, addressing reproducibility issues in alignment research.
Details
Motivation: Current LLM alignment workflows are fragmented across backend-specific tools and ad-hoc code, making experiments hard to reproduce due to backend interference, reward fragmentation, and irreproducible pipelines.
Method: AlignTune exposes unified interfaces for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends, standardizes configuration, provides extensible reward layers (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks (the factory-boundary idea is sketched after the abstract).
Result: By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
Conclusion: AlignTune addresses key obstacles in alignment research by providing a modular toolkit that standardizes workflows and improves reproducibility for LLM alignment experiments.
Abstract: Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.
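The "single factory boundary" idea can be sketched in a few lines; the class and function names below are illustrative, not AlignTune's actual API:

```python
# Sketch of a factory boundary: all backend-specific logic lives behind
# one constructor, so experiments vary only a config string and the rest
# of the pipeline never touches backend details.
from abc import ABC, abstractmethod

class Trainer(ABC):
    @abstractmethod
    def train(self, dataset) -> None: ...

class TRLTrainer(Trainer):
    def train(self, dataset) -> None:
        print(f"TRL backend: fine-tuning on {len(dataset)} examples")

class UnslothTrainer(Trainer):
    def train(self, dataset) -> None:
        print(f"Unsloth backend: fine-tuning on {len(dataset)} examples")

def make_trainer(backend: str) -> Trainer:
    """The factory boundary: the only place backends are distinguished."""
    registry = {"trl": TRLTrainer, "unsloth": UnslothTrainer}
    return registry[backend]()

make_trainer("trl").train(["ex1", "ex2"])
```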
[103] Text summarization via global structure awareness
Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Yibei Liu, Chenghao Li, Qigan Sun, Shuai Yuan, Fachrina Dewi Puspitasari, Dongshen Han, Guoqing Wang, Sung-Ho Bae, Yang Yang
Main category: cs.CL
TL;DR: GloSA-sum is a text summarization approach that uses topological data analysis to preserve global document structure and logical dependencies while improving efficiency over LLM-based methods.
Details
Motivation: Existing text summarization methods focus on model improvements and sentence-level pruning but often overlook global structure, leading to disrupted coherence. LLM-based approaches achieve higher accuracy but incur substantial resource and time costs.
Method: Constructs semantic-weighted graph from sentence embeddings, uses persistent homology to identify core semantics and logical structures preserved in a “protection pool.” Employs topology-guided iterative strategy with lightweight proxy metrics to avoid repeated high-cost computations. Proposes hierarchical strategy integrating segment-level and global summarization (the graph-construction stage is sketched after the abstract).
Result: Experiments on multiple datasets show GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking balance between accuracy and efficiency. Benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
Conclusion: GloSA-sum is the first summarization approach achieving global structure awareness via topological data analysis, efficiently summarizing text while preserving semantic cores and logical dependencies.
Abstract: Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a “protection pool” as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
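To illustrate the first stage of the pipeline, here is a toy construction of the semantic-weighted graph, with a cheap centrality proxy standing in for the persistent-homology step (the real method relies on TDA machinery that is the paper's contribution and is not reproduced here):

```python
# Toy sketch: semantic-weighted graph from sentence embeddings, plus a
# lightweight importance proxy used to pick a "protection pool". The
# proxy replaces persistent homology purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 32))                   # fake sentence embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T                                # cosine-similarity weights
np.fill_diagonal(sim, 0.0)

# Proxy importance: total semantic weight incident on each sentence.
importance = sim.sum(axis=1)
protection_pool = set(map(int, np.argsort(-importance)[:2]))  # top-2 kept
print("protected sentence indices:", protection_pool)
```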
[104] The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies
Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu
Main category: cs.CL
TL;DR: Theoretical and empirical demonstration that fully isolated, continuously self-evolving multi-agent LLM systems inevitably degrade in safety alignment, establishing fundamental limits on autonomous AI societies.
Details
Motivation: To investigate the feasibility of creating self-evolving multi-agent LLM systems that can achieve continuous improvement while maintaining safety alignment in complete isolation, addressing what the authors term the "self-evolution trilemma."
Method: Combines a theoretical information-theoretic framework formalizing safety as divergence from anthropic value distributions with empirical studies of open-ended agent communities (Moltbook) and closed self-evolving systems to demonstrate inevitable safety erosion (a toy drift simulation follows the abstract).
Result: Proves both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible, showing that isolated self-evolution induces statistical blind spots leading to irreversible safety degradation.
Conclusion: Establishes fundamental limits on self-evolving AI societies, shifting discourse from symptom-driven safety patches to principled understanding of intrinsic dynamical risks, highlighting need for external oversight or novel safety-preserving mechanisms.
Abstract: The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment, a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system’s safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
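A minimal simulation of the paper's core claim, with purely illustrative dynamics, shows how re-training on finite samples of one's own outputs drifts away from a fixed reference distribution:

```python
# Toy drift simulation: an isolated system that re-estimates its policy
# from finite samples of its own behavior accumulates "statistical blind
# spots" and diverges from a fixed anthropic value distribution, measured
# here by KL divergence. Dynamics are our illustration, not the paper's.
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(1)
anthropic = np.array([0.5, 0.3, 0.2])  # reference value distribution
policy = anthropic.copy()

for generation in range(5):
    samples = rng.choice(3, size=50, p=policy)      # finite self-samples
    counts = np.bincount(samples, minlength=3) + 1e-6
    policy = counts / counts.sum()                  # re-fit on own outputs
    print(f"gen {generation}: KL from anthropic values = "
          f"{kl(anthropic, policy):.4f}")
```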
cs.CV
[105] VideoSTF: Stress-Testing Output Repetition in Video Large Language Models
Yuxin Cao, Wei Song, Shangzhi Xu, Jingling Xue, Jin Song Dong
Main category: cs.CV
TL;DR: VideoSTF is a framework for measuring output repetition in VideoLLMs, revealing widespread repetition issues sensitive to temporal perturbations.
Details
Motivation: VideoLLMs have strong video understanding performance but suffer from underexplored generation failures like severe output repetition, which isn't captured by existing benchmarks focusing on accuracy and factual correctness.
Method: Introduces VideoSTF framework with three n-gram-based metrics for measuring repetition, standardized testbed of 10,000 diverse videos, and library of controlled temporal transformations for systematic testing (one such repetition metric is sketched after the abstract).
Result: Output repetition is widespread across 10 advanced VideoLLMs, highly sensitive to temporal perturbations, and simple temporal transformations can efficiently induce repetitive degeneration, exposing it as an exploitable security vulnerability.
Conclusion: Output repetition is a fundamental stability issue in modern VideoLLMs, motivating stability-aware evaluation for video-language systems.
Abstract: Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: https://github.com/yuxincao22/VideoSTF_benchmark.
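One simple n-gram repetition metric of the kind VideoSTF formalizes can be written in a few lines; the paper's three metrics may be defined differently:

```python
# Duplicate n-gram ratio: the fraction of n-grams in an output that are
# repeats. Values near 1 indicate the self-reinforcing loops described
# above; values near 0 indicate diverse text.
def ngram_repetition_ratio(text: str, n: int = 3) -> float:
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

looped = "the man walks the man walks the man walks"
print(f"{ngram_repetition_ratio(looped):.2f}")  # 0.57: strong looping signal
```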
[106] Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation
Leo Thomas Ramos, Angel D. Sappa
Main category: cs.CV
TL;DR: MeCSAFNet is a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery, using dual encoders for visible/non-visible channels with fusion decoder and attention mechanisms.
Details
Motivation: To improve land cover segmentation in multispectral imagery by effectively processing different spectral configurations (RGB, NIR, NDVI, NDWI) and combining spatial-spectral information through specialized architecture design.
Method: Multi-branch encoder-decoder with dual ConvNeXt encoders for visible/non-visible channels, individual decoders for spatial reconstruction, fusion decoder with multi-scale feature integration, enhanced with CBAM attention and ASAU activation function.
Result: Significant performance gains on FBP and Potsdam datasets: +19.21% over U-Net (4c) on FBP, +6.48% over DeepLabV3+ (4c) on Potsdam in mIoU. Compact variants show good performance with lower computational cost.
Conclusion: MeCSAFNet effectively processes multispectral data through specialized architecture design, achieving state-of-the-art performance for land cover segmentation while offering efficient variants for resource-constrained deployment.
Abstract: This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
[107] Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs – Evolution, Limitations, and Cognitive Enhancement
Zhihang Yi, Jian Zhao, Jiancheng Lv, Tao Wang
Main category: cs.CV
TL;DR: A comprehensive survey paper on Multimodal Large Language Models (MLLMs) for chart understanding, covering challenges, tasks, datasets, methodologies, and future directions in chart information fusion.
Details
Motivation: The field of MLLM-based chart analysis is fragmented and lacks systematic organization. There's a need to provide a structured roadmap for researchers and practitioners to understand how MLLMs are transforming chart information fusion.
Method: The survey structures the domain by: 1) analyzing fundamental challenges of fusing visual and linguistic information in charts, 2) categorizing downstream tasks and datasets with a novel taxonomy of canonical and non-canonical benchmarks, 3) presenting comprehensive evolution of methodologies from classic deep learning to state-of-the-art MLLM paradigms with sophisticated fusion strategies.
Result: Provides a systematic organization of the MLLM-based chart understanding field, identifies current model limitations (perceptual and reasoning deficits), and establishes a taxonomy for chart analysis benchmarks.
Conclusion: The survey aims to equip researchers with structured understanding of MLLMs in chart information fusion and catalyze progress toward more robust systems through advanced alignment techniques and reinforcement learning for cognitive enhancement.
Abstract: Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain’s core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field’s expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.
[108] MPA: Multimodal Prototype Augmentation for Few-Shot Learning
Liwen Wu, Wei Wang, Lei Zhao, Zhan Gao, Qika Lin, Shaowen Yao, Zuozhu Liu, Bin Pu
Main category: cs.CV
TL;DR: MPA is a multimodal few-shot learning framework that enhances prototypes using LLM-generated semantic descriptions, multi-view augmentations, and uncertainty modeling to absorb ambiguous samples.
Details
Motivation: Existing few-shot learning methods focus only on visual modality and compute prototypes directly from raw support images, lacking comprehensive multimodal information. There's a need to enrich support sets with semantic cues and handle uncertainty in few-shot scenarios.
Method: Proposes MPA framework with three components: 1) LLM-based Multi-Variant Semantic Enhancement (LMSE) generates diverse paraphrased category descriptions, 2) Hierarchical Multi-View Augmentation (HMA) applies natural and multi-view augmentations, 3) Adaptive Uncertain Class Absorber (AUCA) models uncertainty via interpolation and Gaussian sampling to absorb uncertain samples (a toy prototype-augmentation sketch follows the abstract).
Result: Extensive experiments on four single-domain and six cross-domain FSL benchmarks show superior performance. MPA surpasses second-best method by 12.29% in single-domain and 24.56% in cross-domain settings in 5-way 1-shot setting.
Conclusion: MPA effectively addresses limitations of visual-only few-shot learning by integrating multimodal information, enhancing feature diversity, and modeling uncertainty, achieving state-of-the-art performance across various FSL benchmarks.
Abstract: Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical images. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.
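A toy sketch of the AUCA-style uncertain-class construction follows; shapes, the interpolation coefficient, and the noise scale are illustrative assumptions:

```python
# Toy prototype augmentation: class prototypes are support-feature means,
# and an "uncertain" prototype is built by interpolating between two class
# prototypes and perturbing with Gaussian noise, so ambiguous samples near
# the boundary have somewhere to be absorbed.
import numpy as np

rng = np.random.default_rng(0)
support = {c: rng.normal(loc=c, size=(5, 16)) for c in range(3)}  # 3-way 5-shot
protos = {c: feats.mean(axis=0) for c, feats in support.items()}

def uncertain_prototype(c1: int, c2: int, alpha: float = 0.5,
                        sigma: float = 0.1) -> np.ndarray:
    """Interpolate two class prototypes and perturb with Gaussian noise."""
    mix = alpha * protos[c1] + (1 - alpha) * protos[c2]
    return mix + rng.normal(scale=sigma, size=mix.shape)

absorber = uncertain_prototype(0, 1)  # absorbs samples near the 0/1 boundary
print(absorber[:4])
```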
[109] VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding
Rongcan Pei, Huan Li, Fang Guo, Qi Zhu
Main category: cs.CV
TL;DR: The paper identifies Visual Evidence Retrieval (VER) Heads in VLMs that are critical for locating visual cues during reasoning, and proposes VERA, a training-free framework that detects model uncertainty to trigger explicit verbalization of visual evidence, improving long-context understanding.
Details
Motivation: Vision-Language Models struggle with long context and complex reasoning tasks. The authors aim to understand the internal mechanisms governing long-context processing in VLMs and identify performance bottlenecks to develop methods for improving their reasoning capabilities.
Method: Through attention analysis, the authors identify Visual Evidence Retrieval (VER) Heads - sparse, dynamic attention heads critical for locating visual cues during reasoning. They propose VERA, a training-free framework that detects model uncertainty (entropy) to trigger explicit verbalization of visual evidence attended by VER heads (the entropy trigger is sketched after the abstract).
Result: VERA significantly improves long-context understanding of open-source VLMs: average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks. The authors demonstrate that VER heads are causal to model performance - masking them leads to significant degradation.
Conclusion: The paper provides insights into the internal mechanisms of VLMs for long-context processing, identifies critical VER heads, and proposes an effective training-free framework (VERA) that leverages these insights to improve model performance on complex reasoning tasks.
Abstract: While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.
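The entropy trigger at the heart of VERA can be sketched as follows; the threshold value and the trigger rule are illustrative assumptions:

```python
# Sketch of an entropy-based uncertainty trigger: compute the entropy of
# the next-token distribution and, above a threshold, switch to explicitly
# verbalizing the visual evidence (here just a boolean decision).
import numpy as np

def entropy(probs: np.ndarray) -> float:
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def maybe_verbalize(next_token_probs: np.ndarray, tau: float = 2.0) -> bool:
    """True when the model is uncertain enough to trigger verbalization
    of the evidence attended by VER heads."""
    return entropy(next_token_probs) > tau

flat = np.full(1000, 1 / 1000)   # very uncertain distribution
print(maybe_verbalize(flat))     # True: verbalize visual evidence
```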
[110] Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization
Tao Yu, Yujia Yang, Haopeng Jin, Junhao Gong, Xinlong Chen, Yuxuan Zhou, Shanbin Zhang, Jiabing Yang, Xinming Wang, Hongzhu Yi, Ping Nie, Kai Zou, Zhang Zhang, Yan Huang, Liang Wang, Yeshani, Ruiwen Tao, Jin Ma, Haijin Liang, Jinwen Luo
Main category: cs.CV
TL;DR: RVMS-Bench is a new benchmark for evaluating real-world video memory search with 1,440 samples across 20 categories, using multi-dimensional descriptions to simulate fuzzy human memories, revealing limitations in current MLLMs for video retrieval.
Details
Motivation: Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. There's a need for evaluation frameworks that better simulate realistic search scenarios.
Method: 1) Created RVMS-Bench with 1,440 samples spanning 20 diverse categories and four duration groups from real-world open-web videos. 2) Used hierarchical description framework with Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic search cues. 3) Proposed RACLO framework using abductive reasoning to simulate human “Recall-Search-Verify” cognitive process for fuzzy memory search.
Result: Experiments reveal that existing multimodal large language models (MLLMs) demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. The benchmark provides a more realistic evaluation framework.
Conclusion: This work facilitates advancement of video retrieval robustness in real-world unstructured scenarios by providing a comprehensive evaluation benchmark that better reflects real-world search conditions with fuzzy, multi-dimensional memories.
Abstract: Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present RVMS-Bench, a comprehensive system for evaluating real-world video memory search. It consists of 1,440 samples spanning 20 diverse categories and four duration groups, sourced from real-world open-web videos. RVMS-Bench utilizes a hierarchical description framework encompassing Global Impression, Key Moment, Temporal Context, and Auditory Memory to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose RACLO, an agentic framework that employs abductive reasoning to simulate the human “Recall-Search-Verify” cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.
[111] AD$^2$: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems
Ishan Sahu, Somnath Hazra, Somak Aditya, Soumyajit Dey
Main category: cs.CV
TL;DR: Paper evaluates adversarial robustness of autonomous driving agents in CARLA under black-box attacks on visual perception, reveals severe vulnerabilities, and proposes a lightweight attack detection model.
Details
Motivation: End-to-end autonomous driving systems have made progress but their adversarial robustness remains underexplored. The paper aims to evaluate state-of-the-art autonomous driving agents under realistic adversarial threat models to understand safety vulnerabilities.
Method: Conducts closed-loop evaluation in CARLA simulator using three attack vectors on visual perception: physics-based blur attack via acoustic waves, electromagnetic interference attack distorting images, and digital ghost object attacks. Evaluates Transfuser and Interfuser agents. Proposes AD² detection model using attention mechanisms to capture spatial-temporal consistency across multi-camera inputs.
Result: Reveals severe vulnerabilities in advanced autonomous driving agents, with driving scores dropping by up to 99% under attacks. The proposed AD² detector achieves superior detection capability and computational efficiency compared to existing approaches across multi-camera inputs in CARLA.
Conclusion: Autonomous driving systems are highly vulnerable to adversarial attacks on visual perception, raising serious safety concerns. The proposed lightweight detection model offers a promising mitigation strategy, but more work is needed to ensure robust autonomous driving systems.
Abstract: End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD$^2$) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.
[112] ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop
Clement Fuji Tsang, Anita Hu, Or Perel, Carsten Kolve, Maria Shugrina
Main category: cs.CV
TL;DR: Interactive toolset for Gaussian Splat selection and segmentation enabling user-guided editing of 3DGS scenes without additional optimization
Details
Motivation: Extracting usable objects from in-the-wild 3D Gaussian Splat captures is challenging, and existing controllable editing techniques for this representation are limited. Most emerging techniques focus on automatic solutions or high-level editing, leaving a gap for interactive tools.
Method: Proposes an interactive suite with: 1) Fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections, 2) Flexible manual selection and segmentation tools, 3) User intervention capability for error correction, 4) User-guided local editing approach leveraging a custom Video Diffusion Model.
Result: Enables virtually any binary segmentation of unstructured 3DGS scenes. The toolset outperforms state-of-the-art for Gaussian Splat selection and allows direct user control over AI-modifiable areas without requiring additional optimization for in-the-wild captures.
Conclusion: Provides practical interactive tools for 3DGS scene manipulation, bridging the gap between automatic solutions and user control, enabling downstream applications like local editing through user-guided AI modification.
Abstract: Representations in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of applications, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, focused on automatic solutions or high-level editing, we introduce an interactive suite of tools centered around versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate their utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.
[113] When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models
Jiacheng Hou, Yining Sun, Ruochong Jin, Haochen Han, Fangming Liu, Wai Kin Victor Chan, Alex Jinpeng Wang
Main category: cs.CV
TL;DR: Vision-Centric Jailbreak Attack (VJA) is the first visual-to-visual jailbreak attack that uses purely visual inputs to convey malicious instructions to image editing models, exposing new safety vulnerabilities in vision-prompt editing systems.
Details
Motivation: As image editing models shift from text-driven to vision-prompt editing (using marks, arrows, visual-text prompts), they introduce new safety risks where the attack surface becomes visual. Current safety measures focus on text inputs, leaving visual-based attacks underexplored.
Method: Proposes Vision-Centric Jailbreak Attack (VJA) that conveys malicious instructions through visual inputs only. Introduces IESBench, a safety-oriented benchmark for image editing models. Also proposes a training-free defense based on introspective multimodal reasoning to mitigate vulnerabilities without auxiliary guard models.
Result: VJA effectively compromises state-of-the-art commercial models, achieving attack success rates up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. The proposed defense substantially improves safety of poorly aligned models to commercial system levels with negligible computational overhead.
Conclusion: Exposes new visual-based vulnerabilities in modern image editing systems, provides both benchmark (IESBench) and practical defense to advance safe and trustworthy image editing. Highlights the critical need for visual safety measures in multimodal systems.
Abstract: Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.
[114] DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions
El Hadji S. Diop, Thierno Fall, Mohamed Daoudi
Main category: cs.CV
TL;DR: Geometric DDPM with group morphological convolutions for better feature extraction and equivariance to Euclidean transformations
Details
Motivation: Address two major issues in DDPMs: 1) geometric key feature extraction and 2) network equivariance, since the standard U-net architecture is only translation equivariant.
Method: Introduce group morphological convolutions in Riemannian manifolds derived from Hamilton-Jacobi PDEs, add convection term and solve using method of characteristics to capture nonlinearities and geometric structures
Result: Experimental results on MNIST, RotoMNIST, and CIFAR-10 show noticeable improvements compared to baseline DDPM model
Conclusion: Geometric approach with Euclidean group equivariance improves DDPM performance by better capturing geometric features and incorporating symmetries
Abstract: In this work, we address two major issues in recent Denoising Diffusion Probabilistic Models (DDPM): 1) geometric key feature extraction and 2) network equivariance. Since the DDPM prediction network relies on the U-net architecture, which is theoretically only translation equivariant, we introduce a geometric approach combined with an equivariance property of the more general Euclidean group, which includes rotations, reflections, and permutations. We introduce the notion of group morphological convolutions in Riemannian manifolds, which are derived from the viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations (PDEs) that act as morphological multiscale dilations and erosions. We add a convection term to the model and solve it using the method of characteristics. This helps us better capture nonlinearities, represent thin geometric structures, and incorporate symmetries into the learning process. Experimental results on the MNIST, RotoMNIST, and CIFAR-10 datasets show noticeable improvements compared to the baseline DDPM model.
[115] XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability
Dominik Galus, Julia Farganus, Tymoteusz Zapala, Mikołaj Czachorowski, Piotr Borycki, Przemysław Spurek, Piotr Syga
Main category: cs.CV
TL;DR: XSPLAIN introduces the first ante-hoc, prototype-based interpretability framework for 3D Gaussian Splatting classification, using voxel-aggregated PointNet and invertible orthogonal transformations to provide intuitive explanations without performance degradation.
Details
Motivation: 3D Gaussian Splatting (3DGS) has become standard for high-fidelity 3D reconstruction, but lacks interpretability for classification tasks. Existing explainability methods for other 3D representations rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives.
Method: XSPLAIN uses a voxel-aggregated PointNet backbone and a novel invertible orthogonal transformation that disentangles feature channels for interpretability while preserving original decision boundaries (a small numerical demonstration follows the abstract). Explanations are grounded in representative training examples using prototype-based reasoning.
Result: User study (N=51) shows decisive preference: participants selected XSPLAIN explanations 48.4% of the time as best, significantly outperforming baselines (p<0.001). The approach provides transparency and user trust without degrading classification performance.
Conclusion: XSPLAIN successfully addresses the interpretability gap in 3DGS classification, offering intuitive prototype-based explanations that maintain classification accuracy while significantly improving user understanding and trust.
Abstract: 3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive “this looks like that” reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4% of the time as the best, significantly outperforming baselines (p<0.001), showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: https://github.com/Solvro/ml-splat-xai
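Why an orthogonal feature transform can be both invertible and boundary-preserving is easy to demonstrate numerically (our illustration, not XSPLAIN's code):

```python
# If features are rotated by an orthogonal matrix Q, rotating a linear
# classifier's weights by the same Q leaves every logit unchanged, and
# Q.T exactly inverts the transform. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix

features = rng.normal(size=(4, 8))            # 4 samples, 8-dim features
W = rng.normal(size=(8, 3))                   # linear head, 3 classes

logits = features @ W
logits_rotated = (features @ Q) @ (Q.T @ W)   # transform both sides
print(np.allclose(logits, logits_rotated))    # True: boundaries preserved
print(np.allclose(features, (features @ Q) @ Q.T))  # True: invertible
```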
[116] Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning
Bosen Lin, Feng Gao, Yanwei Yu, Junyu Dong, Qian Du
Main category: cs.CV
TL;DR: SUCode is a semantic-aware underwater image enhancement method that uses pixel-level codebook representation to handle heterogeneous degradation across different scene components, achieving state-of-the-art performance.
Details
Motivation: Underwater image enhancement is challenging due to inconsistent degradation across different semantic regions, and existing methods fail to address this heterogeneity, leading to color distortions and loss of fine details.
Method: Proposes SUCode with semantic-aware pixel-level codebook representation, three-stage training to avoid pseudo ground-truth contamination, and uses Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) for color restoration and texture recovery.
Result: Extensive experiments show SUCode achieves state-of-the-art performance on multiple benchmarks, outperforming recent UIE methods on both reference and no-reference metrics.
Conclusion: SUCode effectively addresses heterogeneous underwater degradation through semantic-aware codebook representation and achieves superior enhancement results compared to existing methods.
Abstract: Underwater Image Enhancement (UIE) is an ill-posed problem where natural clean references are not available, and the degradation levels vary significantly across semantic regions. Existing UIE methods treat images with a single global model and ignore the inconsistent degradation of different scene components. This oversight leads to significant color distortions and loss of fine details in heterogeneous underwater scenes, especially where degradation varies significantly across different image regions. Therefore, we propose SUCode (Semantic-aware Underwater Codebook Network), which achieves adaptive UIE from semantic-aware discrete codebook representation. Compared with one-shot codebook-based methods, SUCode exploits semantic-aware, pixel-level codebook representation tailored to heterogeneous underwater degradation. A three-stage training paradigm is employed to represent raw underwater image features to avoid pseudo ground-truth contamination. Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) jointly integrate channel and frequency cues for faithful color restoration and texture recovery. Extensive experiments on multiple benchmarks demonstrate that SUCode achieves state-of-the-art performance, outperforming recent UIE methods on both reference and no-reference metrics. The code will be made publicly available at https://github.com/oucailab/SUCode.
[117] PMMA: The Polytechnique Montreal Mobility Aids Dataset
Qingwu Liu, Nicolas Saunier, Guillaume-Alexandre Bilodeau
Main category: cs.CV
TL;DR: New PMMA dataset for detecting pedestrians using mobility aids (wheelchairs, canes, walkers) with 9 categories, benchmarked with 7 detection models and 3 trackers.
Details
Motivation: There's a lack of specialized datasets for detecting pedestrians with mobility aids, which is important for autonomous vehicles and assistive technologies to ensure safety and accessibility for all road users.
Method: Collected outdoor dataset with volunteers using mobility aids, created 9 categories, benchmarked with 7 object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, RT-DETR) and 3 tracking algorithms (ByteTrack, BOT-SORT, OC-SORT) under MMDetection framework.
Result: YOLOX, Deformable DETR, and Faster R-CNN achieved best detection performance; tracking algorithm differences were relatively small. Dataset and code are publicly available.
Conclusion: PMMA dataset fills an important gap for detecting pedestrians with mobility aids, enabling better safety systems for autonomous vehicles and assistive technologies.
Abstract: This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment where volunteers used wheelchairs, canes, and walkers, resulting in nine pedestrian categories: pedestrians; cane users; two types of walker users (walking or resting); and five wheelchair-related categories, namely wheelchair users, people pushing empty wheelchairs, and, for occupied wheelchairs, the pushing group as a whole, the pusher, and the person seated in the wheelchair. To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at https://doi.org/10.5683/SP3/XJPQUG, and the video processing and model training code is available at https://github.com/DatasetPMMA/PMMA.
[118] Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing
Marin Benčević, Krešimir Romić, Ivana Hartmann Tolić, Irena Galić
Main category: cs.CV
TL;DR: Developed neural networks to predict skin tone from dermatoscopic images using Fitzpatrick type and ITA metrics, validated against colorimeter measurements, and applied to audit dataset biases.
Details
Motivation: Address the lack of reliable skin-tone annotations in public dermatoscopy datasets, which limits fairness auditing of neural-network-based diagnosis models that show performance disparities across skin tones.
Method: Created neural networks for ordinal regression (Fitzpatrick skin type) and color regression (Individual Typology Angle), using in-person Fitzpatrick labels and colorimeter measurements as targets, with extensive pretraining on synthetic and real dermatoscopic/clinical images (the standard ITA formula is sketched after the abstract).
Result: Fitzpatrick model achieves agreement comparable to human annotations; ITA predictions show high concordance with colorimeter measurements, outperforming pixel-averaging approaches. Application to ISIC 2020 and MILK10k reveals less than 1% of subjects belong to Fitzpatrick types V and VI.
Conclusion: Released open-source tools for rapid skin-tone annotation and bias auditing, providing validated dermatoscopic skin-tone estimation neural networks that support evidence of performance gaps across skin-tone groups.
Abstract: Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.
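The Individual Typology Angle used as a regression target is a standard colorimetric quantity, ITA = arctan((L* - 50) / b*) * 180/pi, computed from CIELAB coordinates:

```python
# Standard ITA computation from CIELAB lightness L* and the yellow-blue
# component b*. Larger ITA corresponds to lighter skin; atan2 is used so
# the formula stays defined as b* approaches zero.
import math

def ita_degrees(L_star: float, b_star: float) -> float:
    """Individual Typology Angle in degrees."""
    return math.degrees(math.atan2(L_star - 50.0, b_star))

print(f"{ita_degrees(70.0, 15.0):.1f} deg")  # 53.1 deg
```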
[119] ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting
Zehua Ma, Hanhui Li, Zhenyu Xie, Xiaonan Luo, Michael Kampffmeyer, Feng Gao, Xiaodan Liang
Main category: cs.CV
TL;DR: ERGO: An adaptive optimization framework for 3D reconstruction from single images using 3D Gaussian splatting with excess risk decomposition to handle noisy synthesized views.
Details
Motivation: Single-image 3D reconstruction is ill-posed due to missing geometric/textural information in occluded regions. While generative models can synthesize auxiliary views for supervision, these views contain inconsistencies and misalignments that propagate artifacts during reconstruction.
Method: Proposes ERGO framework that decomposes optimization losses in 3D Gaussian splatting into excess risk (suboptimality gap) and Bayes error (irreducible noise). Uses this decomposition to dynamically estimate view-specific excess risk and adaptively adjust loss weights (a toy weighting sketch follows the abstract). Also introduces geometry-aware and texture-aware objectives for synergistic global-local optimization.
Result: Extensive experiments on Google Scanned Objects and OmniObject3D datasets demonstrate superiority over state-of-the-art methods. ERGO shows robustness against supervision noise while enhancing both geometric fidelity and textural quality of reconstructed 3D content.
Conclusion: ERGO effectively handles imperfect supervisory signals from synthesized views through adaptive optimization guided by excess risk decomposition, leading to improved 3D reconstruction quality from single images.
Abstract: Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
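A toy version of excess-risk-guided weighting follows; the exponential weighting rule is our illustrative choice, not necessarily the paper's exact scheme:

```python
# Down-weight views whose estimated excess risk is high (likely noisy
# synthesized views), so they contribute little to the total loss.
import numpy as np

def adaptive_weights(excess_risk: np.ndarray, beta: float = 2.0) -> np.ndarray:
    w = np.exp(-beta * excess_risk)  # high risk -> small weight
    return w / w.sum()

view_losses = np.array([0.20, 0.35, 0.90])   # per-view reconstruction loss
excess_risk = np.array([0.05, 0.10, 0.80])   # estimated per-view excess risk
w = adaptive_weights(excess_risk)
total_loss = float((w * view_losses).sum())  # noisy view contributes little
print(w.round(3), f"{total_loss:.3f}")
```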
[120] A Low-Rank Defense Method for Adversarial Attack on Diffusion Models
Jiaxuan Zhu, Siyu Huang
Main category: cs.CV
TL;DR: LoRD is a defensive strategy using low-rank adaptation modules to protect Latent Diffusion Models from adversarial attacks by detecting and defending against adversarial samples while maintaining image generation quality.
Details
Motivation: As adversarial attacks on diffusion models proliferate, there's a critical need for defensive strategies to prevent abuse and ensure practical application of these models. The paper addresses the vulnerability of Latent Diffusion Models to adversarial attacks during fine-tuning.
Method: Proposes Low-Rank Defense (LoRD), which introduces merging ideas and balance parameters combined with low-rank adaptation (LoRA) modules to detect and defend against adversarial samples. Builds a defense pipeline applying learned LoRD modules to help diffusion models defend against attack algorithms.
Result: Extensive experiments on facial and landscape images show significantly better defense performance compared to baseline methods. The method ensures LDMs fine-tuned on both adversarial and clean samples can still generate high-quality images.
Conclusion: LoRD provides an effective defensive strategy against adversarial attacks on Latent Diffusion Models, demonstrating superior performance over existing baselines while maintaining generation quality.
Abstract: Recently, adversarial attacks on diffusion models, as well as on their fine-tuning process, have developed rapidly. To prevent the abuse of these attack algorithms from affecting the practical application of diffusion models, it is critical to develop corresponding defensive strategies. In this work, we propose an efficient defensive strategy, named Low-Rank Defense (LoRD), to defend against adversarial attacks on Latent Diffusion Models (LDMs). LoRD introduces a merging strategy and a balance parameter, combined with low-rank adaptation (LoRA) modules, to detect and defend against adversarial samples. Based on LoRD, we build a defense pipeline that applies the learned LoRD modules to help diffusion models defend against attack algorithms. Our method ensures that an LDM fine-tuned on both adversarial and clean samples can still generate high-quality images. To demonstrate the effectiveness of our approach, we conduct extensive experiments on facial and landscape images, and our method shows significantly better defense performance compared to the baseline methods.
[121] Flow Matching with Uncertainty Quantification and Guidance
Juyeop Han, Lukas Lao Beyer, Sertac Karaman
Main category: cs.CV
TL;DR: UA-Flow extends flow matching models to predict velocity fields with uncertainty estimates, enabling sample reliability assessment and uncertainty-guided generation for improved quality.
Details
Motivation: Current sampling-based generative models like flow matching can produce inconsistent or degraded quality samples. There's a need to assess sample reliability and generate higher-quality outputs through uncertainty estimation.
Method: Proposes UA-Flow, a lightweight extension of flow matching that predicts velocity fields with heteroscedastic uncertainty (see the loss sketch below). Estimates per-sample uncertainty by propagating velocity uncertainty through flow dynamics, then uses uncertainty for classifier guidance and classifier-free guidance to steer generation.
Result: Experiments on image generation show UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and uncertainty-guided sampling further improves generation quality.
Conclusion: UA-Flow successfully integrates uncertainty estimation into flow matching models, providing reliability signals for samples and enabling uncertainty-guided generation that enhances output quality.
Abstract: Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.
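As a concrete reading of "velocity field together with heteroscedastic uncertainty", here is a minimal Gaussian-NLL flow-matching loss, assuming a linear interpolation path and a hypothetical two-headed network returning (velocity, log-variance). This is a sketch of the standard heteroscedastic recipe, not the authors' implementation.
```python
import torch

class VelocityNet(torch.nn.Module):
    # Hypothetical two-headed net: predicts velocity and its log-variance.
    def __init__(self, d):
        super().__init__()
        self.body = torch.nn.Sequential(torch.nn.Linear(d + 1, 128), torch.nn.SiLU())
        self.v_head = torch.nn.Linear(128, d)
        self.s_head = torch.nn.Linear(128, d)

    def forward(self, xt, t):
        h = self.body(torch.cat([xt, t], dim=-1))
        return self.v_head(h), self.s_head(h)

def heteroscedastic_fm_loss(model, x0, x1):
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1   # linear probability path
    u = x1 - x0                  # target velocity along this path
    v, log_var = model(xt, t)
    # Gaussian NLL: high predicted variance discounts the squared error
    # but pays a log-variance penalty, yielding calibrated uncertainty.
    return (0.5 * torch.exp(-log_var) * (v - u) ** 2 + 0.5 * log_var).mean()

model = VelocityNet(d=8)
loss = heteroscedastic_fm_loss(model, torch.randn(16, 8), torch.randn(16, 8))
```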
[122] Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks
Rafael-Petruţ Gardoş
Main category: cs.CV
TL;DR: This paper investigates uncertainty-aware political deepfake detection using stochastic CNNs, evaluating uncertainty through calibration quality and alignment with prediction errors to enable risk-aware moderation in high-stakes political contexts.
Details
Motivation: Political deepfakes pose serious risks to information integrity and democratic processes. Existing deepfake detectors only provide point predictions without reliability indications, which is operationally critical in high-stakes political contexts where uncertainty awareness is essential for risk-aware moderation.
Method: Constructed a politically focused binary image dataset via metadata filtering from a large public real-synthetic corpus. Used two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) fully fine-tuned for classification. Compared deterministic inference with single-pass stochastic prediction, Monte Carlo dropout (sketched below), temperature scaling, and ensemble-based uncertainty surrogates within a decision-oriented reliability framework.
Result: Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. Systematic confidence-band analysis clarifies when uncertainty provides operational value beyond predicted confidence, showing both benefits and limitations of uncertainty-aware deepfake detection in political settings.
Conclusion: Uncertainty-aware deepfake detection provides valuable operational benefits for political contexts by enabling risk-aware moderation policies, though with limitations that need to be understood through systematic confidence-band analysis.
Abstract: Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, it is evaluated through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance. Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.
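Among the compared uncertainty surrogates, Monte Carlo dropout is easy to sketch generically: keep dropout active at inference and average several stochastic forward passes. The pass count and the entropy score below are illustrative, not the paper's exact protocol.
```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, passes=30):
    model.eval()
    for m in model.modules():                      # re-enable only dropout,
        if m.__class__.__name__.startswith("Dropout"):  # keep batch-norm in eval
            m.train()
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    mean = probs.mean(dim=0)                       # averaged predictive distribution
    # Predictive entropy as a simple per-sample uncertainty score.
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy

toy = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                          torch.nn.Dropout(0.5), torch.nn.Linear(32, 2))
mean, ent = mc_dropout_predict(toy, torch.randn(4, 10))
```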
[123] Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle
Xi Chen, Arian Maleki, Shirin Jalali
Main category: cs.CV
TL;DR: PGD-MC enables scalable maximum likelihood estimation for digital holography with finite apertures using randomized linear algebra to avoid expensive matrix inversions, improving reconstruction quality and computational efficiency.
Details
Motivation: Maximum likelihood estimation (MLE) is theoretically sound for speckle mitigation in coherent imaging but computationally prohibitive for high-resolution digital holography due to high-dimensional matrix inversion costs, preventing accurate aperture modeling.
Method: Proposed projected gradient descent with Monte Carlo estimation (PGD-MC) uses randomized linear algebra and conjugate gradient methods to compute likelihood gradients without explicit matrix inversions (see the sketch below), enabling accurate aperture modeling with flexible regularization via denoisers.
Result: PGD-MC demonstrates robustness to diverse aperture models, achieves substantial improvements in reconstruction quality and computational efficiency, and scales effectively to high-resolution digital holography, outperforming prior Plug-and-Play methods in accuracy and speed.
Conclusion: PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, overcoming computational barriers to enable physically accurate modeling and superior performance.
Abstract: In coherent imaging, speckle is statistically modeled as multiplicative noise, posing a fundamental challenge for image reconstruction. While maximum likelihood estimation (MLE) provides a principled framework for speckle mitigation, its application to coherent imaging systems such as digital holography with finite apertures is hindered by the prohibitive cost of high-dimensional matrix inversion, especially at high resolutions. This computational burden has prevented the use of MLE-based reconstruction with physically accurate aperture modeling. In this work, we propose a randomized linear algebra approach that enables scalable MLE optimization without explicit matrix inversions in gradient computation. By exploiting the structural properties of the sensing matrix and using conjugate gradient for likelihood gradient evaluation, the proposed algorithm supports accurate aperture modeling without the simplifying assumptions commonly imposed for tractability. We term the resulting method projected gradient descent with Monte Carlo estimation (PGD-MC). The proposed PGD-MC framework (i) demonstrates robustness to diverse and physically accurate aperture models, (ii) achieves substantial improvements in reconstruction quality and computational efficiency, and (iii) scales effectively to high-resolution digital holography. Extensive experiments incorporating three representative denoisers as regularization show that PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, consistently outperforming prior Plug-and-Play model-based iterative reconstruction methods in both accuracy and speed. Our code is available at: https://github.com/Computational-Imaging-RU/MC_Maximum_Likelihood_Digital_Holography_Speckle.
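The core computational trick, replacing an explicit inverse with matrix-free conjugate gradient, can be shown in a few lines. The SPD operator below is synthetic, and the paper's randomized Monte Carlo estimators are not reproduced; this is only a sketch of the inversion-avoidance idea.
```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_spd(matvec, b):
    # Solve A x = b using only products with A (A symmetric positive
    # definite), avoiding the cost of forming or inverting A explicitly.
    n = b.shape[0]
    A = LinearOperator((n, n), matvec=matvec, dtype=b.dtype)
    x, info = cg(A, b)
    assert info == 0, "CG did not converge"
    return x

rng = np.random.default_rng(0)
H = rng.standard_normal((256, 256))
# Speckle-style covariance operator Sigma = H H^T + sigma^2 I, applied
# matrix-free: the likelihood gradient needs Sigma^{-1} r, never the full
# inverse matrix itself.
sigma2, r = 0.1, rng.standard_normal(256)
x = solve_spd(lambda v: H @ (H.T @ v) + sigma2 * v, r)
```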
[124] GenDR: Lighten Generative Detail Restoration
Yan Wang, Shijie Zhao, Kexin Zhang, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: GenDR is a one-step diffusion model for real-world super-resolution that aligns T2I diffusion targets with SR needs through a 16-channel VAE and consistent score identity distillation.
Details
Motivation: Current diffusion-based super-resolution methods suffer from misalignment between text-to-image diffusion targets (multiple steps, 4-channel VAEs) and SR needs (fewer steps, reliable VAEs), leading to suboptimal trade-offs between speed and detail fidelity.
Method: 1) Train SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing model size; 2) Propose consistent score identity distillation (CiD) incorporating SR-specific loss to leverage SR priors; 3) Extend CiD with adversarial learning and representation alignment (CiDA) for better perceptual quality and faster training; 4) Polish the pipeline for efficient inference.
Result: GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity for real-world super-resolution tasks.
Conclusion: The paper presents an effective approach to align diffusion model targets with super-resolution needs through latent space expansion and task-specific distillation, enabling high-quality one-step SR generation.
Abstract: Although recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable progress, the misalignment of their targets leads to a suboptimal trade-off between inference speed and detail fidelity. Specifically, the T2I task requires multiple inference steps to synthesize images that match the prompts and reduces the latent dimension to lower generating difficulty. Contrariwise, SR can restore high-frequency details in fewer inference steps, but it necessitates a more reliable variational auto-encoder (VAE) to preserve input information. However, most diffusion-based SRs are multistep and use 4-channel VAEs, while existing models with 16-channel VAEs are overqualified diffusion transformers, e.g., FLUX (12B). To align the target, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with a larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing the model size. Regarding step distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.
[125] Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis
Adrit Rao, Malte Jensen, Andrea T. Fisher, Louis Blankemeier, Pauline Berens, Arash Fereydooni, Seth Lirette, Eren Alkan, Felipe C. Kitamura, Juan M. Zambrano Chaves, Eduardo Reis, Arjun Desai, Marc H. Willis, Jason Hom, Andrew Johnston, Leon Lenchik, Robert D. Boutin, Eduardo M. J. M. Farina, Augusto S. Serpa, Marcelo S. Takahashi, Jordan Perchik, Steven A. Rothenberg, Jamie L. Schroeder, Ross Filice, Leonardo K. Bittencourt, Hari Trivedi, Marly van Assen, John Mongan, Kimberly Kallianos, Oliver Aalami, Akshay S. Chaudhari
Main category: cs.CV
TL;DR: FDA-cleared open-source deep learning pipelines for opportunistic CT analysis: AAQ measures abdominal aortic diameters for aneurysm assessment, BMD estimates bone mineral density for osteoporosis risk.
Details
Motivation: To address the lack of rigorous validation in open-source medical imaging solutions and the lack of transparency in commercial solutions by developing fully open-sourced, FDA-cleared deep learning pipelines for opportunistic analysis of CT scans.
Method: Developed two deep learning pipelines within the Comp2Comp package: 1) Abdominal Aortic Quantification (AAQ) for segmenting the abdominal aorta and measuring maximal diameters, 2) Bone Mineral Density (BMD) for segmenting vertebral bodies and estimating trabecular bone density. Both were validated against ground-truth measurements from external institutions.
Result: AAQ achieved mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm) for aortic diameter measurements compared to radiologist ground truth. BMD achieved 81.0% sensitivity and 78.4% specificity for binary classification of low vs. normal bone density compared to DXA scan ground truth.
Conclusion: Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use. Open-sourcing these FDA-cleared algorithms improves transparency, allows pre-deployment testing, and provides researchers with high-quality medical imaging analysis methods.
Abstract: Artificial intelligence allows automatic extraction of imaging biomarkers from already-acquired radiologic images. This paradigm of opportunistic imaging adds value to medical imaging without additional imaging costs or patient radiation exposure. However, many open-source image analysis solutions lack rigorous validation while commercial solutions lack transparency, leading to unexpected failures when deployed. Here, we report development and validation for two of the first fully open-sourced, FDA-510(k)-cleared deep learning pipelines to mitigate both challenges: Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation are both offered within the Comp2Comp package for opportunistic analysis of computed tomography scans. AAQ segments the abdominal aorta to assess aneurysm size; BMD segments vertebral bodies to estimate trabecular bone density and osteoporosis risk. AAQ-derived maximal aortic diameters were compared against radiologist ground-truth measurements on 258 patient scans enriched for abdominal aortic aneurysms from four external institutions. BMD binary classifications (low vs. normal bone density) were compared against concurrent DXA scan ground truths obtained on 371 patient scans from four external institutions. AAQ had an overall mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm). BMD had a sensitivity of 81.0% (95% CI 74.0-86.8%) and specificity of 78.4% (95% CI 72.3-83.7%). Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use. Open-sourcing these algorithms improves transparency of typically opaque FDA clearance processes, allows hospitals to test the algorithms before cumbersome clinical pilots, and provides researchers with best-in-class methods.
[126] Spectrum from Defocus: Fast Spectral Imaging with Chromatic Focal Stack
M. Kerem Aydin, Yi-Chun Hung, Jaclyn Pytlarz, Qi Guo, Emma Alexander
Main category: cs.CV
TL;DR: SfD is a hyperspectral imaging method using chromatic focal sweep with off-the-shelf lenses and fast reconstruction, achieving high-quality spectral imaging with optical simplicity.
Details
Motivation: Hyperspectral cameras face fundamental trade-offs between spatial, spectral, and temporal resolution in low-photon conditions. Existing computational imaging solutions require complex optics and extensive computation, limiting practical applications.
Method: Spectrum from Defocus (SfD) uses a chromatic focal sweep approach with two off-the-shelf lenses and a grayscale sensor. It captures a chromatically-aberrated focal stack that preserves nearly all incident light, then reconstructs hyperspectral images using a fast physics-based iterative algorithm.
Result: SfD achieves state-of-the-art hyperspectral imaging with less than one second of reconstruction time. It delivers sharp, accurate hyperspectral images while maintaining photon efficiency, optical simplicity, and physical interpretability.
Conclusion: SfD provides a promising solution for fast, compact, and interpretable hyperspectral imaging by breaking through traditional trade-offs with simple optics and efficient computation.
Abstract: Hyperspectral cameras face harsh trade-offs between spatial, spectral, and temporal resolution in inherently low-photon conditions. Computational imaging systems break through these trade-offs with compressive sensing, but have required complex optics and/or extensive compute. We present Spectrum from Defocus (SfD), a chromatic focal sweep method that achieves state-of-the-art hyperspectral imaging with only two off-the-shelf lenses, a grayscale sensor, and less than one second of reconstruction time. By capturing a chromatically-aberrated focal stack that preserves nearly all incident light, and reconstructing it with a fast physics-based iterative algorithm, SfD delivers sharp, accurate hyperspectral images. The combination of photon efficiency, optical simplicity, and physical interpretability makes SfD a promising solution for fast, compact, interpretable hyperspectral imaging.
[127] HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images
Yilin Yang, Zhenghui Guo, Yuke Wang, Omprakash Gnawali, Sheng Di, Chengming Zhang
Main category: cs.CV
TL;DR: The paper introduces a method to synthesize Hallucination-Inducing Images (HIIs) to study and mitigate language bias hallucinations in Vision-Language Models, revealing scene-conditioned hallucination patterns and achieving significant improvements on hallucination benchmarks.
Details
Motivation: Large Vision-Language Models (VLMs) suffer from hallucinations rooted in inherent language bias, but existing mitigation methods often overlook the underlying hallucination patterns driven by this bias. The authors aim to better understand and address these systematic hallucinations.
Method: 1) Design a novel pipeline to synthesize Hallucination-Inducing Images (HIIs); 2) Use HIIs to reveal consistent scene-conditioned hallucination patterns; 3) Establish the Masked-Object-Hallucination (MOH) benchmark to evaluate VLM susceptibility; 4) Leverage HIIs to construct high-quality preference datasets for fine-grained alignment (a preference-loss sketch follows the abstract).
Result: The approach effectively mitigates hallucinations while preserving general model capabilities, achieving up to 38% improvement over state-of-the-art methods on standard hallucination benchmarks. The synthesized HIIs reveal that models tend to mention objects highly typical of scenes even when visual evidence is removed.
Conclusion: The proposed method successfully addresses language bias hallucinations in VLMs through systematic analysis of hallucination patterns using synthesized HIIs, leading to significant improvements in hallucination mitigation while maintaining overall model performance.
Abstract: Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.
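The title suggests DPO-style preference optimization over HII-derived pairs. Below is a minimal sketch of the standard DPO objective, assuming summed token log-probabilities for faithful (chosen) and hallucinated (rejected) responses under the policy and a frozen reference model; that the paper uses exactly this loss is our inference from the name, not something stated above.
```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Log-ratios of policy vs. frozen reference for each response.
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Maximize the margin by which the policy prefers faithful answers
    # over hallucinated ones built from hallucination-inducing images.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-9.0]),
                torch.tensor([-10.5]), torch.tensor([-9.2]))
```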
[128] A New Dataset and Performance Benchmark for Real-time Spacecraft Segmentation in Onboard Computers
Jeffrey Joan Sam, Janhavi Sathe, Nikhil Chigali, Naman Gupta, Radhey Ruparel, Yicheng Jiang, Janmajay Singh, James W. Berck, Arko Barman
Main category: cs.CV
TL;DR: A new dataset of 64k annotated spacecraft images for segmentation, created using real spacecraft models on real/synthetic backgrounds with added noise/distortions, with YOLO models fine-tuned for real-time onboard space applications.
Details
Motivation: Spacecraft in outer space face damage risks, and in-space repairs are costly and risky. Autonomous inspection systems using image segmentation could provide reliable, cost-effective solutions, but the lack of annotated spacecraft segmentation data limits development.
Method: Created the dataset by superimposing real spacecraft models on a mixture of real and synthetic backgrounds (NASA’s TTALOS pipeline), with added noise and distortions to mimic real-world conditions. Fine-tuned YOLOv8 and YOLOv11 models for segmentation under hardware/time constraints mimicking real-time onboard space applications (evaluation metrics sketched below).
Result: Models achieved Dice score of 0.92, Hausdorff distance of 0.69, and inference time of about 0.5 second under real-world constraints. Dataset includes diverse challenges: noise, distortions, glare, lighting variations, partial visibility, complex backgrounds, and various spacecraft geometries.
Conclusion: The dataset enables development of robust spacecraft segmentation models for autonomous inspection systems. Performance benchmarks demonstrate feasibility of real-time onboard applications for space missions.
Abstract: Spacecraft deployed in outer space are routinely subjected to various forms of damage due to exposure to hazardous environments. In addition, there are significant risks to the subsequent process of in-space repairs through human extravehicular activity or robotic manipulation, incurring substantial operational costs. Recent developments in image segmentation could enable the development of reliable and cost-effective autonomous inspection systems. While these models often require large amounts of training data to achieve satisfactory results, publicly available annotated spacecraft segmentation data are very scarce. Here, we present a new dataset of nearly 64k annotated spacecraft images that was created using real spacecraft models, superimposed on a mixture of real and synthetic backgrounds generated using NASA’s TTALOS pipeline. To mimic camera distortions and noise in real-world image acquisition, we also added different types of noise and distortion to the images. Our dataset includes images with several real-world challenges, including noise, camera distortions, glare, varying lighting conditions, varying field of view, partial spacecraft visibility, brightly-lit city backgrounds, densely patterned and confounding backgrounds, aurora borealis, and a wide variety of spacecraft geometries. Finally, we finetuned YOLOv8 and YOLOv11 models for spacecraft segmentation to generate performance benchmarks for the dataset under well-defined hardware and inference time constraints to mimic real-world image segmentation challenges for real-time onboard applications in space on NASA’s inspector spacecraft. The resulting models, when tested under these constraints, achieved a Dice score of 0.92, Hausdorff distance of 0.69, and an inference time of about 0.5 second. The dataset and models for performance benchmark are available at https://github.com/RiceD2KLab/SWiM.
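The reported Dice and Hausdorff figures follow standard definitions, which can be computed for any predicted mask as below. This is not the authors' evaluation code, and their Hausdorff value of 0.69 suggests a normalization we do not attempt here.
```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred, gt):
    # Overlap between two binary masks: 2|P∩G| / (|P| + |G|).
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hausdorff(pred, gt):
    # Symmetric Hausdorff distance between the two masks' pixel sets.
    p, g = np.argwhere(pred), np.argwhere(gt)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```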
[129] Towards Remote Sensing Change Detection with Neural Memory
Zhenyu Yang, Gensheng Pei, Yazhou Yao, Tianfei Zhou, Lizhong Ding, Fumin Shen
Main category: cs.CV
TL;DR: ChangeTitans: A Titans-based framework for remote sensing change detection that uses VTitans vision backbone with neural memory and segmented local attention, hierarchical adapters for multi-scale features, and two-stream fusion with cross-temporal attention to achieve state-of-the-art results.
Details
Motivation: Current remote sensing change detection methods struggle with capturing long-range dependencies while maintaining computational efficiency. Transformers have quadratic complexity issues, and existing linear attention approaches fail to capture intricate spatiotemporal relationships needed for accurate change detection.
Method: Proposes the ChangeTitans framework with three key components: 1) VTitans, the first Titans-based vision backbone integrating neural memory with segmented local attention for long-range dependencies; 2) a hierarchical VTitans-Adapter to refine multi-scale features across network layers; and 3) TS-CBAM, a two-stream fusion module using cross-temporal attention to suppress pseudo-changes (sketched below).
Result: Achieves state-of-the-art results on four benchmark datasets: LEVIR-CD (84.36% IoU, 91.52% F1-score), WHU-CD, LEVIR-CD+, and SYSU-CD. The method remains computationally competitive while outperforming existing approaches.
Conclusion: ChangeTitans effectively addresses the limitations of current change detection methods by leveraging Titans architecture for efficient long-range dependency modeling, demonstrating superior performance on remote sensing change detection tasks.
Abstract: Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, the Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining 84.36% IoU and 91.52% F1-score on LEVIR-CD, while remaining computationally competitive.
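Cross-temporal attention, queries from one date attending to features from the other so that unchanged content cancels, can be sketched schematically. The residual-difference form and the layer sizes below are our illustrative choices, not the TS-CBAM internals.
```python
import torch
import torch.nn as nn

class CrossTemporalAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_t1, feat_t2):
        # feat_t1, feat_t2: (B, N, C) token sequences from the two dates.
        a, _ = self.attn(query=feat_t1, key=feat_t2, value=feat_t2)
        # Subtracting the cross-attended content emphasizes genuine changes
        # and suppresses pseudo-changes shared by both epochs.
        return self.norm(feat_t1 - a)

fuse = CrossTemporalAttention(64)
out = fuse(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
```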
[130] End-to-End LiDAR optimization for 3D point cloud registration
Siddhant Katyan, Marc-André Gardner, Jean-François Lalonde
Main category: cs.CV
TL;DR: Adaptive LiDAR sensing framework that dynamically adjusts sensor parameters using registration feedback to optimize point cloud registration accuracy and efficiency.
Details
Motivation: Current LiDAR sensors are designed independently from downstream tasks like point cloud registration, leading to suboptimal data collection and computational overhead for sampling, noise filtering, and parameter tuning.
Method: Proposes an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters by integrating registration feedback into the sensing loop, jointly optimizing LiDAR acquisition and registration hyperparameters.
Result: Evaluation in CARLA simulation shows the method outperforms fixed-parameter baselines while retaining generalization abilities.
Conclusion: Demonstrates the potential of adaptive LiDAR for autonomous perception and robotic applications by improving registration accuracy and efficiency through sensor-task co-optimization.
Abstract: LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.
[131] Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings
Tianxiang Dai, Jonathan Fan
Main category: cs.CV
TL;DR: This paper provides a theoretical analysis of Multi-Resolution Hash Encoding (MHE) using physical systems principles, characterizing its spatial behavior through Point Spread Function analysis, revealing grid-induced anisotropy and resolution limitations, and proposing Rotated MHE to mitigate these issues.
Details
Motivation: MHE is widely used in neural fields but lacks rigorous physical understanding, leading to heuristic hyperparameter selection. The authors aim to establish a principled analytical framework to characterize MHE's spatial behavior and optimize its performance.
Method: The authors analyze MHE through its Point Spread Function (analogous to the Green’s function), derive closed-form approximations for the collision-free PSF, quantify spatial resolution via FWHM analysis, study the impact of optimization dynamics on effective resolution, analyze hash collision effects on SNR, and propose a Rotated MHE architecture with coordinate rotations at each resolution level (sketched below).
Result: The analysis reveals: 1) MHE exhibits grid-induced anisotropy and logarithmic spatial profiles, 2) effective resolution is governed by average resolution rather than finest resolution due to optimization broadening, 3) hash collisions introduce speckle noise degrading SNR, 4) Rotated MHE successfully mitigates anisotropy while maintaining efficiency.
Conclusion: This work establishes a physical principles-based methodology for characterizing and optimizing MHE, moving beyond heuristics. The theoretical insights enable better understanding of MHE’s spatial behavior and the proposed R-MHE architecture improves performance while preserving computational efficiency.
Abstract: Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green’s function of the system. This methodology enables a quantification of the encoding’s spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, $N_{\text{avg}}$. This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore $N_{\text{avg}}$), rather than the finest resolution $N_{\max}$, a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.
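The R-MHE idea, a distinct coordinate rotation per resolution level before hashing, reduces to a few lines for 2D inputs. The random angles are an illustrative choice, and the per-level quantize-and-hash lookup of MHE itself is omitted.
```python
import torch

def rotate_per_level(coords, num_levels, seed=0):
    # coords: (N, 2) inputs in [0, 1]^2; returns one rotated copy per level,
    # each to be quantized and hashed by that level's grid as in MHE.
    g = torch.Generator().manual_seed(seed)
    angles = torch.rand(num_levels, generator=g) * 2 * torch.pi
    rotated = []
    for theta in angles:
        c, s = torch.cos(theta), torch.sin(theta)
        R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
        # Rotating the query coordinates decorrelates each level's grid
        # axes, breaking the shared grid-aligned anisotropy of the PSF.
        rotated.append(coords @ R.T)
    return rotated

levels = rotate_per_level(torch.rand(1024, 2), num_levels=16)
```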
[132] The Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation
Suman Kunwar
Main category: cs.CV
TL;DR: A garbage image dataset (GD) with 13,348 labeled images across 10 waste categories for automated waste segregation using computer vision and deep learning models.
Details
Motivation: To advance automated waste segregation through machine learning by providing a diverse, publicly available image dataset covering common household waste categories, addressing the need for real-world benchmarks in environmental sustainability applications.
Method: Created the GD dataset through multiple collection methods (the DWaste mobile app and web sources), performed rigorous validation (checksums, outlier detection), analyzed class imbalance and visual separability via PCA/t-SNE, assessed background complexity using entropy and saliency measures, and benchmarked with state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101).
Result: EfficientNetV2S achieved the highest performance with 96.19% accuracy and 0.96 F1-score, though with moderate carbon cost. Analysis revealed dataset characteristics including class imbalance, skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations.
Conclusion: GD provides a valuable real-world benchmark for waste classification research while highlighting challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment in environmental sustainability applications.
Abstract: This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It’s a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. Experiment results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.
[133] Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation
Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli
Main category: cs.CV
TL;DR: Med-SegLens is a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders to diagnose failures and mitigate dataset shift.
Details
Motivation: Segmentation models achieve strong performance but remain opaque, limiting the ability to diagnose failures, understand dataset shift, or intervene in a principled manner.
Method: Uses sparse autoencoders trained on SegFormer and U-Net to decompose model activations into interpretable latent features (sketched below), with cross-architecture and cross-dataset latent alignment across different glioma cohorts.
Result: Identifies stable backbone of shared representations while dataset shift is driven by differential reliance on population-specific latents. Latents act as causal bottlenecks for segmentation failures, and targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%.
Conclusion: Latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
Abstract: Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaptation without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.
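The latent decomposition rests on a standard sparse autoencoder over hidden activations. A minimal sketch with an L1 sparsity penalty follows; the dimensions and penalty weight are illustrative assumptions, and the paper's exact SAE recipe may differ.
```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Overcomplete SAE: reconstruct layer activations from a wide,
    # non-negative sparse code whose units serve as candidate "latents".
    def __init__(self, d_act, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_act, d_latent)
        self.dec = nn.Linear(d_latent, d_act)

    def forward(self, acts):
        z = torch.relu(self.enc(acts))
        return self.dec(z), z

def sae_loss(model, acts, l1_weight=1e-3):
    recon, z = model(acts)
    # Reconstruction fidelity plus sparsity pressure on the code.
    return ((recon - acts) ** 2).mean() + l1_weight * z.abs().mean()

sae = SparseAutoencoder(d_act=512, d_latent=4096)
loss = sae_loss(sae, torch.randn(32, 512))
```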
[134] 1%>100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization
Dongshuo Yin, Xue Yang, Deng-Ping Fan, Shi-Min Hu
Main category: cs.CV
TL;DR: CoLin introduces a novel low-rank complex adapter with Complex Linear Projection Optimization for efficient adaptation of vision foundation models, achieving superior performance with only 1% parameters compared to full fine-tuning and classical delta-tuning approaches.
Details
Motivation: Vision foundation models require efficient adaptation strategies since full fine-tuning is prohibitively expensive and inefficient. While delta-tuning works well for LLMs, it doesn't directly transfer to vision models, creating a need for novel efficient adaptation methods for vision tasks.
Method: Proposes CoLin (Complex Linear Projection Optimization) with a novel low-rank complex adapter architecture that introduces only ~1% parameters (see the sketch below). Addresses convergence issues of low-rank composite matrices through a tailored loss function, enabling efficient adaptation of vision foundation models.
Result: Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing) show CoLin outperforms both full fine-tuning and classical delta-tuning approaches with only 1% parameters, marking the first time such efficiency is achieved for vision foundation models.
Conclusion: CoLin provides a novel and efficient solution for deploying vision foundation models, achieving state-of-the-art adaptation efficiency with minimal parameter overhead, making it practical for real-world vision applications.
Abstract: Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on https://github.com/DongshuoYin/CoLin.
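A low-rank adapter with complex-valued projections can be written compactly. This is our sketch of the kind of module CoLin describes, with illustrative sizes, and it omits the tailored convergence-aware loss the paper argues is necessary for training such composites.
```python
import torch
import torch.nn as nn

class ComplexLowRankAdapter(nn.Module):
    def __init__(self, dim, rank=8):
        super().__init__()
        # Complex down/up projections; up starts at zero so the adapter
        # is an identity mapping at initialization (LoRA-style).
        self.down = nn.Parameter(torch.randn(dim, rank, dtype=torch.cfloat) * 0.02)
        self.up = nn.Parameter(torch.zeros(rank, dim, dtype=torch.cfloat))

    def forward(self, x):
        # Lift real activations to the complex plane, pass them through the
        # low-rank complex bottleneck, and add back the real part residually.
        z = x.to(torch.cfloat) @ self.down @ self.up
        return x + z.real

adapter = ComplexLowRankAdapter(768)
y = adapter(torch.randn(4, 768))
```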
[135] 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars
Zhongju Wang, Zhenhong Sun, Beier Wang, Yifu Wang, Daoyi Dong, Huadong Mo, Hongdong Li
Main category: cs.CV
TL;DR: 3DXTalker: An expressive 3D talking avatar generation framework that addresses data scarcity, improves lip synchronization with emotional cues, and enables controllable head-pose dynamics through a unified transformer architecture.
Details
Motivation: Audio-driven 3D talking avatar generation faces challenges in preserving identity, synchronizing lip motion with speech, expressing emotion, and exhibiting lifelike spatial dynamics. Current limitations include insufficient training data with limited identities, narrow audio representations, and restricted explicit controllability.
Method: Proposes 3DXTalker with three key components: 1) data-curated identity modeling using a 2D-to-3D pipeline and disentangled representations to address data scarcity; 2) audio-rich representations including frame-wise amplitude and emotional cues beyond standard speech embeddings; 3) spatial dynamics controllability with a flow-matching-based transformer for coherent facial dynamics and prompt-based conditioning for stylized head-pose control.
Result: Extensive experiments show 3DXTalker achieves superior performance in 3D talking avatar generation, integrating lip synchronization, emotional expression, and head-pose dynamics within a unified framework.
Conclusion: 3DXTalker provides a comprehensive solution for expressive 3D talking avatar generation, overcoming data limitations and enabling fine-grained control over identity, emotion, and spatial dynamics through innovative data curation and representation techniques.
Abstract: Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar framework built on data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via a 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. Then, we introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker also enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework and achieves superior performance in 3D talking avatar generation.
[136] MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
Sharat Bhat, Harshita Khandelwal, Tushar Kataria, Vivek Gupta
Main category: cs.CV
TL;DR: MapVerse: A large-scale benchmark for evaluating multimodal reasoning on real-world maps, revealing current VLMs’ limitations in complex spatial reasoning tasks.
Details
Motivation: Current benchmarks for evaluating vision-language models on map-based reasoning are limited: they're narrow in scope, domain-specific, and rely on artificially generated content, lacking depth for genuine geospatial reasoning evaluation.
Method: Created MapVerse, a benchmark with 11,837 human-authored QA pairs across 1,025 real-world maps spanning 10 diverse categories. Evaluated 10 state-of-the-art models with fine-grained categorical analyses to assess reasoning across multiple dimensions.
Result: Current VLMs perform competitively on classification-style tasks but both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning. The benchmark reveals significant reasoning gaps in multimodal models.
Conclusion: MapVerse provides a comprehensive benchmark for evaluating genuine geospatial reasoning capabilities in multimodal models, highlighting the need for improved spatial reasoning in VLMs.
Abstract: Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
[137] RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images
Hanzhe Yu, Yun Ye, Jintao Rong, Qi Xuan, Chen Ma
Main category: cs.CV
TL;DR: A large-scale dataset of 730K+ images for AI-generated image detection, addressing limitations of existing datasets through diverse generation methods and rich metadata, with a lightweight detection method based on noise entropy.
Details
Motivation: Address concerns about AI-generated image authenticity by creating a high-quality, diverse dataset to overcome limitations of existing datasets (poor generalization, low quality, simple prompts, limited diversity) for training robust detection models.
Method: Created a dataset of 730K+ images using state-of-the-art generation methods: text-to-image (with 10K+ prompts), inpainting, refinement, and face swapping. Proposed a lightweight detection method that transforms images into entropy tensors of Non-Local Means noise before classification (sketched below).
Result: Detection models trained on the dataset show superior generalization compared to existing datasets. The proposed noise entropy method delivers competitive performance, establishing a solid baseline for AI-generated image detection.
Conclusion: The dataset provides a strong benchmark for evaluating detection methods and advances robustness in AI-generated image detection. The noise entropy method offers a lightweight, effective approach for this task.
Abstract: The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at https://real-hd.github.io.
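The noise-entropy front end is straightforward to sketch with OpenCV: denoise with Non-Local Means, take the residual, and summarize it as per-patch Shannon entropy. The patch size and NLM strength below are guessed hyperparameters, not the paper's settings.
```python
import cv2
import numpy as np

def nlm_noise_entropy(gray_u8, patch=16):
    # Residual between the image and its NLM-denoised version isolates
    # the noise field that the detector classifies.
    denoised = cv2.fastNlMeansDenoising(gray_u8, h=10)
    residual = cv2.absdiff(gray_u8, denoised)
    h, w = residual.shape
    ent = np.zeros((h // patch, w // patch), dtype=np.float32)
    for i in range(ent.shape[0]):
        for j in range(ent.shape[1]):
            block = residual[i * patch:(i + 1) * patch,
                             j * patch:(j + 1) * patch]
            p = np.bincount(block.ravel(), minlength=256) / block.size
            p = p[p > 0]
            ent[i, j] = -(p * np.log2(p)).sum()  # Shannon entropy per patch
    return ent  # entropy tensor fed to the downstream classifier

img = (np.random.rand(128, 128) * 255).astype(np.uint8)
features = nlm_noise_entropy(img)
```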
[138] Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance
Shengyang Sun, Jiashen Hua, Junyi Feng, Xiaojin Gong
Main category: cs.CV
TL;DR: A text-guided multimodal video anomaly detection framework that uses in-context learning for text augmentation and multi-scale bottleneck Transformer for multimodal fusion, achieving SOTA on UCF-Crime and XD-Violence datasets.
Details
Motivation: The text modality is under-explored in video anomaly detection despite providing explicit semantic information that could enhance anomaly characterization and reduce false alarms. Challenges include: general-purpose language models can't capture anomaly-specific nuances, relevant text descriptions are scarce, and multimodal fusion suffers from redundancy and imbalance.
Method: 1) An in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. 2) A multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance (a fusion sketch follows the abstract).
Result: Demonstrates state-of-the-art performance on UCF-Crime and XD-Violence benchmark datasets for video anomaly detection.
Conclusion: The proposed text-guided framework effectively leverages text modality to enhance video anomaly detection through improved text feature extraction and balanced multimodal fusion.
Abstract: Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
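Bottleneck-token fusion, in the spirit of attention bottlenecks for multimodal fusion, can be sketched as follows. The single-scale form, the sizes, and the additive update are our simplifications of the multi-scale module described above.
```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    def __init__(self, dim, n_bottleneck=4, heads=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, streams):
        # streams: list of (B, N_m, C) token sequences, one per modality.
        b = self.bottleneck.expand(streams[0].shape[0], -1, -1)
        for tokens in streams:
            # Bottleneck tokens read from each modality in turn, forcing
            # cross-modal information through a narrow, balanced channel
            # that limits redundancy from any single modality.
            b = b + self.attn(query=b, key=tokens, value=tokens)[0]
        return b  # (B, n_bottleneck, C) fused representation

fuse = BottleneckFusion(64)
fused = fuse([torch.randn(2, 100, 64), torch.randn(2, 50, 64)])
```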
[139] C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning
Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Ka-Veng Yuen
Main category: cs.CV
TL;DR: C²RoPE improves Rotary Position Embedding for 3D multimodal models by addressing spatial continuity loss and long-term decay in attention, using hybrid spatio-temporal positional indices and Chebyshev causal masking.
Details
Motivation: Current 3D Large Multimodal Models using RoPE have limitations: 1D temporal positional indices disrupt visual feature continuity along the column dimension (spatial locality loss), and RoPE's temporal-proximity assumption causes long-term decay, where models neglect earlier visual tokens as sequence length increases.
Method: Proposes C²RoPE with two key innovations: 1) Spatio-temporal continuous positional embedding using triplet hybrid positional indices (1D temporal + 2D Cartesian spatial coordinates) with a frequency allocation strategy, 2) Chebyshev Causal Masking that determines causal dependencies based on Chebyshev distance in 2D space rather than temporal proximity.
Result: Evaluation across various benchmarks including 3D scene reasoning and 3D visual question answering demonstrates C²RoPE’s effectiveness in improving multimodal processing capabilities.
Conclusion: C²RoPE successfully addresses RoPE’s limitations for visual processing by explicitly modeling local spatial continuity and spatial causal relationships, enhancing 3D multimodal model performance.
Abstract: Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C²RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C²RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C²RoPE's effectiveness. The code is available at https://github.com/ErikZ719/C2RoPE.
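Chebyshev-distance masking is easy to make concrete. Under the (assumed) convention that a token may attend to grid neighbours within Chebyshev radius r, the mask for H×W image tokens looks like the sketch below; the radius value and the boolean convention are illustrative, and combining this with the temporal-causality rule for text tokens is omitted.

```python
# Sketch of a Chebyshev-distance attention mask for image tokens on an HxW grid.
import torch

def chebyshev_mask(h: int, w: int, radius: int) -> torch.Tensor:
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=1)   # (HW, 2) grid coords
    d = (pos[:, None, :] - pos[None, :, :]).abs().amax(-1)   # max(|dy|, |dx|)
    return d <= radius                                       # True = may attend

mask = chebyshev_mask(16, 16, radius=2)   # (256, 256) boolean attention mask
```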
[140] MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning
Chenhao Zhang, Yazhe Niu, Hongsheng Li
Main category: cs.CV
TL;DR: MetaphorStar: A visual reinforcement learning framework for image implication tasks that improves MLLMs’ understanding of metaphorical, cultural, and contextual content in images.
Details
Motivation: Current MLLMs struggle with metaphorical comprehension in images, lacking the ability to grasp nuanced cultural, emotional, and contextual implications. This requires sophisticated multi-hop reasoning, cultural context, and Theory of Mind capabilities that existing models don't possess.
Method: Proposes MetaphorStar, an end-to-end visual reinforcement learning framework with three components: TFQ-Data (fine-grained dataset), TFQ-GRPO (visual RL method), and TFQ-Bench (structured benchmark). Uses RL to train models on image implication tasks.
Result: MetaphorStar-32B achieves SOTA on multiple-choice and open-style questions, outperforms Gemini-3.0-pro on true-false questions, with average 82.6% improvement on benchmarks. Learning image implication tasks also improves general understanding and complex visual reasoning abilities.
Conclusion: The framework successfully addresses metaphorical comprehension challenges in MLLMs, demonstrating that specialized training on image implication tasks enhances both specific and general visual reasoning capabilities. The method shows broad applicability across different model architectures and scales.
Abstract: Metaphorical comprehension in images remains a critical challenge for today's AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) results on Multiple-Choice and Open-Style Questions, and significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Questions. Crucially, our experiments reveal that learning image implication tasks improves general understanding, especially complex visual reasoning. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-source all model weights, datasets, and method code at https://metaphorstar.github.io.
[141] Enhancing YOLOv11n for Reliable Child Detection in Noisy Surveillance Footage
Khanh Linh Tran, Minh Nguyen Dang, Thien Nguyen Trong, Hung Nguyen Quoc, Linh Nguyen Kieu
Main category: cs.CV
TL;DR: Enhanced child detection for surveillance using YOLOv11n with domain-specific augmentation and SAHI inference for challenging conditions like occlusion, small objects, and poor lighting.
Details
Motivation: Improve child detection in low-quality surveillance footage for real-world applications like missing child alerts and daycare monitoring, addressing challenges of occlusion, small object size, low resolution, motion blur, and poor lighting in existing CCTV infrastructure.
Method: Builds on YOLOv11n architecture with domain-specific augmentation strategy synthesizing realistic child placements using spatial perturbations (partial visibility, truncation, overlaps) and photometric degradations (lighting variation, noise). Integrates Slicing Aided Hyper Inference (SAHI) at inference time to improve recall of small and partially occluded instances.
Result: Achieves mAP@0.5 of 0.967 and mAP@0.5:0.95 of 0.783 on Roboflow Daycare dataset, yielding absolute improvements of 0.7% and 2.3% respectively over baseline YOLOv11n without architectural changes. Maintains real-time performance and edge device compatibility.
Conclusion: Proposes a practical, lightweight solution for child detection in surveillance that improves performance under challenging conditions while maintaining deployment readiness for resource-constrained industrial surveillance systems.
Abstract: This paper presents a practical and lightweight solution for enhancing child detection in low-quality surveillance footage, a critical component in real-world missing child alert and daycare monitoring systems. Building upon the efficient YOLOv11n architecture, we propose a deployment-ready pipeline that improves detection under challenging conditions including occlusion, small object size, low resolution, motion blur, and poor lighting commonly found in existing CCTV infrastructures. Our approach introduces a domain-specific augmentation strategy that synthesizes realistic child placements using spatial perturbations such as partial visibility, truncation, and overlaps, combined with photometric degradations including lighting variation and noise. To improve recall of small and partially occluded instances, we integrate Slicing Aided Hyper Inference (SAHI) at inference time. All components are trained and evaluated on a filtered, child-only subset of the Roboflow Daycare dataset. Compared to the baseline YOLOv11n, our enhanced system achieves a mean Average Precision at 0.5 IoU (mAP@0.5) of 0.967 and a mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95) of 0.783, yielding absolute improvements of 0.7 percent and 2.3 percent, respectively, without architectural changes. Importantly, the entire pipeline maintains compatibility with low-power edge devices and supports real-time performance, making it particularly well suited for low-cost or resource-constrained industrial surveillance deployments. The example augmented dataset and the source code used to generate it are available at: https://github.com/html-ptit/Data-Augmentation-YOLOv11n-child-detection
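SAHI-style sliced inference is simple to sketch without the library: run the detector on overlapping crops, shift detections back to full-image coordinates, then merge. The code below is a conceptual stand-in, not the sahi package's API; `detect` is a placeholder for any detector returning boxes and scores, and final cross-slice NMS is left as a comment.

```python
# Conceptual sketch of sliced inference for small-object recall.
import numpy as np

def sliced_inference(image, detect, slice_size=640, overlap=0.2):
    """Run `detect` on overlapping crops; return boxes in full-image coords."""
    step = int(slice_size * (1 - overlap))
    boxes, scores = [], []
    H, W = image.shape[:2]
    for y in range(0, max(H - slice_size, 0) + 1, step):
        for x in range(0, max(W - slice_size, 0) + 1, step):
            crop = image[y:y + slice_size, x:x + slice_size]
            b, s = detect(crop)                       # (N, 4) xyxy, (N,)
            if len(b):
                boxes.append(b + np.array([x, y, x, y]))  # shift to image coords
                scores.append(s)
    # Edge slices and cross-slice NMS merging are omitted for brevity.
    return boxes, scores
```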
[142] Fast Person Detection Using YOLOX With AI Accelerator For Train Station Safety
Mas Nurul Achmadiah, Novendra Setyawan, Achmad Arif Bryantono, Chi-Chia Sun, Wen-Kai Kuo
Main category: cs.CV
TL;DR: Comparison of YOLOX-based passenger detection at train stations using Hailo-8 AI accelerator vs Jetson Orin Nano, showing Hailo-8 achieves higher accuracy and lower latency.
Details
Motivation: To improve safety at train stations by detecting passengers near crossing areas, reducing accidents caused by passengers carelessly crossing the yellow line, using efficient object detection technology.
Method: Deploys the YOLOX object detection model on Edge AI accelerator hardware (Hailo-8) and compares its performance with the Jetson Orin Nano for passenger detection at train stations.
Result: Hailo-8 AI hardware accelerator outperforms Jetson Orin Nano with over 12% higher accuracy and 20 ms lower latency for passenger detection tasks.
Conclusion: Edge AI accelerators like Hailo-8 provide superior performance for real-time passenger detection at train stations, offering better accuracy and lower latency than Jetson Orin Nano for safety applications.
Abstract: Image processing has recently advanced rapidly and is applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve safety, for example in traffic safety and at passenger crossings in train stations. Accidents occur in station crossing areas, for instance when passengers carelessly pass over the yellow line, so additional safety measures and technology are required to reduce the number of accidents. This paper focuses on passenger detection applications at train stations using YOLOX and Edge AI accelerator hardware. The performance of the Hailo-8 AI accelerator is compared with that of the Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator achieves higher accuracy than the Jetson Orin Nano (an improvement of over 12%) and lower latency (reduced by 20 ms).
[143] Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation
Guangjing Yang, ZhangYuan Yu, Ziyuan Qin, Xinyuan Song, Huahui Yi, Qingbo Kang, Jun Gao, Yiyue Li, Chenlin Du, Qicheng Lao
Main category: cs.CV
TL;DR: VRFT-Aug: A visual reinforcement fine-tuning framework for medical imaging that combines perception and reasoning through knowledge injection, policy refinement, reward shaping, and behavioral imitation.
Details
Motivation: Current Reinforcement Fine-Tuning (RFT) methods are primarily designed for language models and don't extend well to vision-centric domains, especially medical imaging which requires both robust visual perception and structured reasoning.
Method: Proposes VRFT-Aug framework with four key strategies: 1) Prior knowledge injection, 2) Perception-driven policy refinement, 3) Medically informed reward shaping, and 4) Behavioral imitation to stabilize and improve the RFT process for medical vision tasks.
Result: Outperforms both standard supervised fine-tuning and RFT baselines across multiple medical datasets, with empirically grounded insights and practical training heuristics that generalize to other medical image tasks.
Conclusion: Provides actionable guidance for developing reliable, reasoning-capable models for high-stakes medical applications, contributing to the advancement of multimodal reinforcement fine-tuning in vision-centric domains.
Abstract: While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
[144] A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology
Siyuan Yan, Xieji Li, Dan Mo, Philipp Tschandl, Yiwen Jiang, Zhonghua Wang, Ming Hu, Lie Ju, Cristina Vico-Alonso, Yizhen Zheng, Jiahe Liu, Juexiao Zhou, Camilla Chello, Jen G. Cheung, Julien Anriot, Luc Thomas, Clare Primiero, Gin Tan, Aik Beng Ng, Simon See, Xiaoying Tang, Albert Ip, Xiaoyang Liao, Adrian Bowling, Martin Haskett, Shuang Zhao, Monika Janda, H. Peter Soyer, Victoria Mar, Harald Kittler, Zongyuan Ge
Main category: cs.CV
TL;DR: DermFM-Zero is a dermatology vision-language foundation model that achieves state-of-the-art zero-shot performance across 20 benchmarks without task-specific fine-tuning, demonstrating clinical utility in multinational reader studies with over 1,100 clinicians.
Details
Motivation: Medical foundation models show promise but face deployment challenges due to reliance on task-specific fine-tuning. The authors aim to create a dermatology model that works effectively in zero-shot settings without requiring adaptation for specific clinical tasks.
Method: Trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. Uses sparse autoencoders to disentangle clinically meaningful concepts from latent representations without supervision, enabling interpretability and bias mitigation.
Result: Achieved SOTA across 20 benchmarks for zero-shot diagnosis and multimodal retrieval. In clinical studies: doubled GP diagnostic accuracy for 98 skin conditions; outperformed board-certified dermatologists in skin cancer assessment; enabled non-experts to surpass unassisted experts; demonstrated interpretable latent representations that mitigate artifact-induced biases.
Conclusion: DermFM-Zero demonstrates that foundation models can provide effective, safe, and transparent zero-shot clinical decision support without task-specific adaptation, addressing key deployment barriers in medical AI.
Abstract: Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders disentangle, without supervision, clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.
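The sparse-autoencoder concept-discovery step follows a well-known recipe, sketched below under assumed sizes and an illustrative L1 weight (not the paper's configuration): a wide ReLU latent trained to reconstruct frozen embeddings, whose sparse units can then be inspected for clinical meaning or suppressed to remove artifact-driven directions.

```python
# Minimal sparse-autoencoder sketch for concept discovery over frozen embeddings.
import torch
import torch.nn as nn

class SparseAE(nn.Module):
    def __init__(self, d_model: int = 768, d_latent: int = 8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, h):                        # h: frozen model embeddings
        a = torch.relu(self.enc(h))              # sparse concept activations
        return self.dec(a), a

def sae_loss(model, h, l1: float = 1e-3):
    recon, a = model(h)
    # Reconstruction plus L1 sparsity on the activations.
    return (recon - h).pow(2).mean() + l1 * a.abs().mean()
```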
[145] Eliminating VAE for Fast and High-Resolution Generative Detail Restoration
Yan Wang, Shijie Zhao, Junlin Li, Li Zhang
Main category: cs.CV
TL;DR: GenDR-Pix accelerates diffusion-based super-resolution by eliminating VAE bottleneck through pixel-space operations and multi-stage adversarial distillation, achieving 2.8x speedup and 60% memory saving while maintaining quality.
Details
Motivation: Diffusion models achieve excellent super-resolution results but suffer from slow inference and high memory demands. Existing acceleration methods like step distillation still face memory limitations requiring tile-by-tile processing, with VAE being the main bottleneck.
Method: Proposes GenDR-Pix which: 1) Uses pixel-(un)shuffle operations to eliminate VAE bottleneck, 2) Employs multi-stage adversarial distillation to progressively remove encoder/decoder while avoiding artifacts, 3) Introduces random padding for feature augmentation, 4) Uses masked Fourier space loss for amplitude outliers, 5) Integrates padding-based self-ensemble with classifier-free guidance.
Result: GenDR-Pix achieves 2.8x acceleration and 60% memory saving compared to GenDR with negligible visual degradation. Can restore 4K images in 1 second using only 6GB memory, surpassing other one-step diffusion SR methods.
Conclusion: The proposed pixel-space approach with multi-stage adversarial distillation effectively solves the VAE bottleneck in diffusion-based super-resolution, enabling fast high-quality image restoration with significantly reduced computational requirements.
Abstract: Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit with slow inference and high demands on devices. To accelerate inference, recent works like GenDR adopt step distillation to reduce the step number to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To solve the problem completely, we leverage pixel-(un)shuffle operations to eliminate the VAE, turning the latent-based GenDR into the pixel-space GenDR-Pix. However, upscaling with x8 pixel-shuffle may induce repeated-pattern artifacts. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous-stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize amplitude outliers. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance to improve inference scaling. Experimental results show that GenDR-Pix achieves a 2.8x acceleration and 60% memory saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR methods. Notably, GenDR-Pix can restore a 4K image in only 1 second and 6GB of memory.
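The pixel-(un)shuffle primitive itself is standard PyTorch; the snippet below only shows the shape bookkeeping that lets spatial resolution trade against channels in place of a VAE (the surrounding one-step SR network is the paper's contribution).

```python
# PixelUnshuffle/PixelShuffle shape bookkeeping: space <-> channels.
import torch
import torch.nn as nn

unshuffle = nn.PixelUnshuffle(downscale_factor=8)   # space -> channels
shuffle = nn.PixelShuffle(upscale_factor=8)         # channels -> space

lr = torch.randn(1, 3, 512, 512)
tokens = unshuffle(lr)        # (1, 192, 64, 64): each 8x8 block becomes 64 channels
restored = shuffle(tokens)    # back to (1, 3, 512, 512), losslessly
```

Because the transform is exact and parameter-free, the network can operate on a 64x-smaller spatial grid without a learned encoder or decoder, which is where the latency and memory savings come from.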
[146] Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation
Yin Wang, Ziyao Zhang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li, Xiaohui Liang
Main category: cs.CV
TL;DR: MP-HOI: A multimodal framework for text-driven 3D human-object interaction motion generation using multimodal priors, enhanced object representations, MoE fusion, and cascaded diffusion with interaction supervision.
Details
Motivation: Existing text-to-HOI methods suffer from sub-optimal human motion, unnatural object motion, and weak human-object interaction due to the significant cross-modality gap between text and 3D motion.
Method: Four key components: (1) Leverage multimodal data priors from large multimodal models, (2) Enhanced object representation with geometric keypoints, contact features, and dynamic properties, (3) Modality-aware Mixture-of-Experts for multimodal feature fusion, (4) Cascaded diffusion framework with interaction supervision.
Result: MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained human-object interaction motions, demonstrating superior performance in addressing the three key limitations.
Conclusion: The proposed multimodal framework effectively bridges the cross-modality gap for text-driven 3D HOI generation by leveraging multimodal priors, enhanced representations, and interaction-aware refinement.
Abstract: We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model as an effective multimodal feature fusion paradigm, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.
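A modality-aware MoE router can be sketched generically: condition the gating on a modality embedding so experts can specialize per modality. The module below is an assumption about the general shape of such a layer, not MP-HOI's exact design; all sizes are illustrative.

```python
# Sketch of a modality-aware Mixture-of-Experts fusion layer (assumed design).
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4, n_modalities: int = 3):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(n_experts)])
        self.mod_emb = nn.Embedding(n_modalities, dim)
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x, modality_id):
        # x: (B, T, dim) tokens; modality_id: (B,) long tensor.
        gate_in = x + self.mod_emb(modality_id)[:, None, :]       # modality-aware gating
        w = torch.softmax(self.router(gate_in), dim=-1)           # (B, T, E)
        y = torch.stack([e(x) for e in self.experts], dim=-1)     # (B, T, dim, E)
        return (y * w[:, :, None, :]).sum(-1)                     # weighted expert mix
```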
[147] AurigaNet: A Real-Time Multi-Task Network for Enhanced Urban Driving Perception
Kiarash Ghasemzadeh, Sedigheh Dehghani
Main category: cs.CV
TL;DR: AurigaNet is a multi-task network for autonomous driving perception that integrates object detection, lane detection, and drivable area instance segmentation, achieving state-of-the-art performance on the BDD100K dataset with real-time deployment capabilities.
Details
Motivation: To develop a reliable AI system for autonomous vehicles that can efficiently handle multiple perception tasks simultaneously, addressing challenges in computational efficiency, real-time processing, and generalization for driving applications.
Method: Proposes AurigaNet, an advanced multi-task network architecture that integrates three critical perception tasks: object detection, lane detection, and drivable area instance segmentation, trained end-to-end on the diverse BDD100K dataset.
Result: Achieves 85.2% IoU in drivable area segmentation (0.7% improvement), 60.8% IoU in lane detection (30%+ improvement), and 47.6% mAP@0.5:0.95 in object detection (2.9% improvement), with successful real-time deployment on embedded devices like Jetson Orin NX.
Conclusion: AurigaNet demonstrates robust and efficient multi-task perception capabilities for autonomous driving, offering state-of-the-art performance across multiple critical tasks with practical real-time deployment feasibility.
Abstract: Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet’s potential as a robust and efficient solution for autonomous driving perception systems. The code can be found here https://github.com/KiaRational/AurigaNet.
[148] Dynamic Frequency Modulation for Controllable Text-driven Image Generation
Tiandong Shi, Ling Zhao, Ji Qi, Jiayi Ma, Chengli Peng
Main category: cs.CV
TL;DR: Training-free frequency modulation method for text-guided diffusion models that preserves structure while enabling semantic modifications by manipulating noisy latent variables based on frequency analysis.
Details
Motivation: Current text-guided diffusion models struggle with semantic modifications: changing text prompts often causes unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection, which leads to suboptimal stability.
Method: Analyzes the frequency spectrum of noisy latent variables and its impact on hierarchical structure emergence. Lower-frequency components establish the structure framework early, while higher-frequency components handle fine-grained textures later. Proposes a training-free frequency modulation using a frequency-dependent weighting function with dynamic decay to directly manipulate noisy latent variables.
Result: Extensive experiments show the method significantly outperforms current state-of-the-art methods, achieving effective balance between preserving structure and enabling semantic updates.
Conclusion: Frequency perspective provides effective solution for structure-preserving semantic modifications in text-guided diffusion models, avoiding empirical feature map selection issues.
Abstract: The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
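One plausible (assumed) form of such a frequency-dependent weighting with dynamic decay, for intuition only: blend the low-frequency band of a reference latent into the current noisy latent, with a blend strength that decays over the generation so structure is pinned early and textures stay free later. The cutoff and decay rate are made-up knobs, and the sampling-progress convention is illustrative.

```python
# Hedged sketch of low-frequency structure pinning with a decaying blend weight.
import math
import torch

def modulate_lowfreq(z, z_ref, progress, cutoff=0.15, decay=3.0):
    """z, z_ref: (B, C, H, W) latents; progress in [0, 1], 0 = start of sampling."""
    Z, Zr = torch.fft.fft2(z), torch.fft.fft2(z_ref)
    H, W = z.shape[-2:]
    fy = torch.fft.fftfreq(H, device=z.device)[:, None]
    fx = torch.fft.fftfreq(W, device=z.device)[None, :]
    low = ((fy ** 2 + fx ** 2).sqrt() < cutoff).to(Z.dtype)   # low-frequency mask
    w = math.exp(-decay * progress)                           # dynamic decay in (0, 1]
    Z = low * (w * Zr + (1 - w) * Z) + (1 - low) * Z          # blend only low band
    return torch.fft.ifft2(Z).real
```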
[149] AMAP-APP: Efficient Segmentation and Morphometry Quantification of Fluorescent Microscopy Images of Podocytes
Arash Fatehi, David Unnersjö-Jess, Linus Butt, Noémie Moreau, Thomas Benzing, Katarzyna Bozek
Main category: cs.CV
TL;DR: AMAP-APP is a cross-platform desktop application that optimizes podocyte foot process quantification by replacing intensive instance segmentation with classic image processing while maintaining accuracy, achieving 147x speedup and enabling broader adoption in nephrology research.
Details
Motivation: The original AMAP method for automated podocyte foot process quantification had significant limitations: high computational demands requiring HPC clusters, lack of a user interface, and Linux dependency, which hindered widespread adoption in kidney research.
Method: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined ROI algorithm and was validated on 365 mouse/human images using Pearson correlation and TOST tests against the original AMAP.
Result: Achieved 147-fold processing speed increase on consumer hardware. Morphometric outputs showed high correlation (r>0.90) and statistical equivalence to original method. New ROI algorithm demonstrated superior accuracy with reduced deviation from manual delineations.
Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry by eliminating HPC requirements and providing cross-platform user-friendly interface, enabling widespread adoption in nephrology research and potential clinical diagnostics.
Abstract: Background: Automated podocyte foot process quantification is vital for kidney research, but the established “Automatic Morphological Analysis of Podocytes” (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP-APP, a cross-platform desktop application designed to overcome these barriers. Methods: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One-Sided T-tests (TOST). Results: AMAP-APP achieved a 147-fold increase in processing speed on consumer hardware. Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r>0.90) and statistical equivalence (TOST P<0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry. By eliminating the need for high-performance computing clusters and providing a user-friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics.
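The classic-image-processing replacement for instance segmentation can be illustrated with connected components plus per-region morphometrics; skimage's regionprops supplies area and perimeter, and circularity follows as 4*pi*A / P^2. The thresholding detail below is a placeholder, not AMAP-APP's exact rule.

```python
# Sketch: connected-component "instances" from a semantic mask, then morphometrics.
import numpy as np
from skimage.measure import label, regionprops

def foot_process_morphometry(semantic_mask: np.ndarray):
    labeled = label(semantic_mask > 0)           # instances via connected components
    rows = []
    for r in regionprops(labeled):
        if r.perimeter == 0:                     # skip degenerate 1-pixel regions
            continue
        circularity = 4 * np.pi * r.area / r.perimeter ** 2
        rows.append({"area": r.area,
                     "perimeter": r.perimeter,
                     "circularity": circularity})
    return rows
```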
[150] TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning
Junhua Liu, Zhangcheng Wang, Zhike Han, Ningli Wang, Guotao Liang, Kun Kuang
Main category: cs.CV
TL;DR: TwiFF introduces a temporally grounded Visual Chain-of-Thought approach for dynamic visual reasoning using video generation and comprehension to create coherent visual reasoning cues for video question answering.
Details
Motivation: Existing Visual Chain-of-Thought approaches are limited to static scenarios and fail to capture temporal dynamics needed for tasks involving instruction, prediction, and camera motion in videos.
Method: Created TwiFF-2.7M dataset from 2.7M video clips for dynamic VQA, developed TwiFF-Bench evaluation benchmark, and proposed TwiFF model that leverages pre-trained video generation and image comprehension to iteratively generate future action frames and textual reasoning.
Result: TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, validating effectiveness for visual question answering in dynamic scenarios.
Conclusion: The work successfully bridges the gap in temporal reasoning for VCoT by introducing a large-scale dataset, evaluation benchmark, and unified model that synergizes video generation with comprehension for dynamic visual reasoning.
Abstract: Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from 2.7 million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of 1,078 samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, fully validating its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at https://github.com/LiuJunhua02/TwiFF.
[151] OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL
Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong
Main category: cs.CV
TL;DR: OmniVL-Guard: A balanced reinforcement learning framework for unified vision-language forgery detection and grounding that addresses difficulty bias in multi-task optimization through self-evolving reasoning paths and adaptive reward scaling.
Details
Motivation: Existing forgery detection methods are limited to uni-modal or bi-modal settings, failing to handle interleaved text, images, and videos in real-world misinformation. There's a need for a unified framework for omnibus vision-language forgery detection and grounding that addresses the "difficulty bias" problem where simpler classification dominates gradients over fine-grained grounding.
Method: Proposes OmniVL-Guard with two core designs: 1) Self-Evolving CoT Generation synthesizes high-quality reasoning paths to overcome the cold-start challenge, and 2) Adaptive Reward Scaling Policy Optimization (ARSPO) dynamically modulates reward scales and task weights for balanced joint optimization.
Result: Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.
Conclusion: OmniVL-Guard provides an effective solution for unified vision-language forgery detection and grounding by addressing the difficulty bias problem through balanced reinforcement learning, enabling robust performance across diverse multimodal misinformation scenarios.
Abstract: Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical difficulty bias problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Particularly, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.
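As a toy illustration of reward-scale balancing (an assumption about the general mechanism, not ARSPO's actual rule), one can normalize each task's reward by a running scale so the easy veracity-classification reward cannot dominate the joint gradient:

```python
# Toy stand-in for adaptive reward scaling across tasks (not the paper's formula).
import collections
import numpy as np

class AdaptiveRewardScaler:
    """Keep a running scale per task; divide each raw reward by it so no
    single task's reward magnitude dominates the multi-task update."""
    def __init__(self, tasks, window: int = 256):
        self.buf = {t: collections.deque(maxlen=window) for t in tasks}

    def scale(self, task: str, reward: float) -> float:
        self.buf[task].append(reward)
        s = np.std(self.buf[task]) + 1e-6   # running scale estimate
        return reward / s

scaler = AdaptiveRewardScaler(["veracity", "grounding"])
r = scaler.scale("veracity", 1.0)   # rescaled before combining task rewards
```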
[152] AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models
Zhifeng Rao, Wenlong Chen, Lei Xie, Xia Hua, Dongfu Yin, Zhen Tian, F. Richard Yu
Main category: cs.CV
TL;DR: A framework that integrates depth estimation into Vision-Language-Action models to enhance 3D spatial understanding and action grounding in robotics, using VGGT for depth estimation and an action assistant module for feature reliability.
Details
Motivation: Existing VLA models primarily rely on 2D image-trained VLMs, limiting their spatial understanding and action grounding in complex 3D environments. There's a need to bridge the gap between 2D observations and 3D-aware decision-making in robotics.
Method: Proposes integration of depth estimation using VGGT baseline to extract geometry-aware 3D cues from RGB inputs. Introduces an action assistant module to constrain learned 3D representations with action priors for consistency with control tasks. Fuses enhanced 3D features with conventional 2D visual tokens.
Result: Experimental results show improved generalization ability and robustness of VLA models. Strengthens perception in geometrically ambiguous scenarios and leads to superior action prediction accuracy.
Conclusion: Highlights potential of depth-driven data augmentation and auxiliary expert supervision for bridging 2D observations and 3D-aware decision-making in robotic systems. Demonstrates effectiveness of integrating 3D spatial understanding into VLA models.
Abstract: Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLMs trained on 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.
[153] (MGS)²-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization
Minglei Li, Mengfan He, Chao Chen, Ziyang Meng
Main category: cs.CV
TL;DR: A geometry-grounded framework (MGS)² for cross-view geo-localization that addresses geometric misalignment between oblique aerial and orthographic satellite views through macro-structure filtering and micro-scale adaptation.
Details
Motivation: Cross-view geo-localization is crucial for GNSS-denied UAV navigation but suffers from drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing 2D methods neglect underlying 3D geometry where view-dependent vertical facades and scale variations corrupt feature alignment.
Method: Proposes (MGS)² framework with three key components: 1) Macro-Geometric Structure Filtering (MGSF) uses dilated geometric gradients to filter facade artifacts while enhancing view-invariant horizontal planes; 2) Micro-Geometric Scale Adaptation (MGSA) uses depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion; 3) Geometric-Appearance Contrastive Distillation (GACD) loss discriminates against oblique occlusions.
Result: Achieves state-of-the-art performance with Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Demonstrates superior cross-dataset generalization against geometric ambiguity.
Conclusion: The geometry-grounded approach effectively addresses geometric misalignment in cross-view geo-localization, achieving robust performance and generalization by explicitly modeling 3D geometric structure and scale variations.
Abstract: Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)², a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)² achieves state-of-the-art performance, recording a Recall@1 of 97.5% on University-1652 and 97.02% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: https://github.com/GabrielLi1473/MGS-Net.
[154] FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection
Jialin Ma
Main category: cs.CV
TL;DR: FGAA-FPN: A Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection in remote sensing imagery that improves performance through foreground modeling and orientation priors.
Details
Motivation: Oriented object detection in remote sensing/aerial imagery is challenging due to cluttered backgrounds, scale variation, and orientation changes. Existing methods lack explicit foreground modeling and don't leverage geometric orientation priors, limiting feature discriminability.
Method: Proposes FGAA-FPN with hierarchical functional decomposition: 1) Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions in low-level features, 2) Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions in high-level semantic features.
Result: Achieves state-of-the-art results on DOTA v1.0 (75.5% mAP) and DOTA v1.5 (68.3% mAP) benchmarks for oriented object detection.
Conclusion: FGAA-FPN effectively addresses challenges in oriented object detection through foreground-guided modulation and angle-aware attention, demonstrating strong performance on remote sensing datasets.
Abstract: With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.
[155] Ecological mapping with geospatial foundation models
Craig Mahlasi, Gciniwe S. Baloyi, Zaheed Gaffoor, Levente Klein, Anne Jones, Etienne Vos, Michal Muszynski, Geoffrey Dawson, Campbell Watson
Main category: cs.CV
TL;DR: This paper explores geospatial foundation models (GFMs) for ecological applications, comparing Prithvi-EO-2.0 and TerraMind against a ResNet-101 baseline across land use/land cover generation, forest trait mapping, and peatlands detection.
Details
Motivation: To explore the utility, challenges, and opportunities of geospatial foundation models for ecological applications, as their potential for high-value use cases hasn't been fully explored despite being a fast-emerging paradigm.
Method: Fine-tuned pretrained AI models (Prithvi-EO-2.0 and TerraMind) across three ecological use cases and compared them with a baseline ResNet-101 model. Experiments included LULC generation, forest functional trait mapping, and peatlands detection.
Result: GFMs consistently outperformed baseline ResNet models. TerraMind marginally outperformed Prithvi in general, but with additional modalities, TerraMind significantly outperformed both baseline ResNet and Prithvi models.
Conclusion: Geospatial foundation models show strong potential for ecological applications but require consideration of input data divergence from pretrained modalities and would benefit from higher resolution and more accurate labels for pixel-level mapping tasks.
Abstract: Geospatial foundation models (GFMs) are a fast-emerging paradigm for various geospatial tasks, such as ecological mapping. However, the utility of GFMs has not been fully explored for high-value use cases. This study aims to explore the utility, challenges, and opportunities associated with the application of GFMs for ecological uses. In this regard, we fine-tune several pretrained AI models, namely Prithvi-EO-2.0 and TerraMind, across three use cases, and compare them with a baseline ResNet-101 model. First, we demonstrate TerraMind's LULC generation capabilities; we then explore the utility of the GFMs in forest functional trait mapping and peatlands detection. In all experiments, the GFMs outperform the baseline ResNet models. In general, TerraMind marginally outperforms Prithvi. However, with additional modalities TerraMind significantly outperforms the baseline ResNet and Prithvi models. Nonetheless, consideration should be given to the divergence of input data from pretrained modalities. We note that these models would benefit from higher resolution and more accurate labels, especially for use cases where pixel-level dynamics need to be mapped.
[156] A Diffusion-Based Generative Prior Approach to Sparse-view Computed Tomography
Davide Evangelista, Pasquale Cascarano, Elena Loli Piccolomini
Main category: cs.CV
TL;DR: Deep generative prior framework using diffusion models for CT reconstruction from sparse/limited-angle sinograms, combining model-based explainability with neural network generative power.
Details
Motivation: Sparse or limited-angle CT geometries cause artifacts and object distortions in reconstructed images. Deep generative models offer potential for improved reconstruction quality in these challenging scenarios.
Method: Deep Generative Prior (DGP) framework combining diffusion-based generative models with iterative optimization algorithms for CT reconstruction from sparse sinograms. Proposes modifications to image generation, model architecture, and optimization algorithms.
Result: Promising results even under highly sparse geometries, though further research is needed to fully realize the potential of this approach.
Conclusion: The DGP framework shows potential for CT reconstruction from sparse data by balancing model-based explainability with neural network generative capabilities, but requires further investigation.
Abstract: The reconstruction of X-ray CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context is of great interest and has strong potential. In the Deep Generative Prior (DGP) framework, the use of diffusion-based generative models is combined with an iterative optimization algorithm for the reconstruction of CT images from sinograms acquired under sparse geometries, maintaining the explainability of a model-based approach while introducing the generative power of a neural network. Several aspects can therefore be further investigated within these frameworks to improve reconstruction quality, such as image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.
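The generative-prior iteration can be sketched as alternating a data-consistency step on the sinogram residual with a generative denoising step. Here `radon`, `radon_T`, and `denoiser` are placeholders for a forward projector, its adjoint, and a pretrained diffusion denoiser; the step count, step size, and schedule are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of a diffusion-prior reconstruction loop for sparse-view CT.
import torch

def dgp_reconstruct(sino, radon, radon_T, denoiser, steps: int = 50, lr: float = 1e-2):
    x = torch.zeros_like(radon_T(sino))          # initial image estimate
    for t in reversed(range(steps)):
        grad = radon_T(radon(x) - sino)          # data-consistency gradient
        x = x - lr * grad                        # fit the measured sinogram
        x = denoiser(x, t)                       # pull toward the image prior
    return x
```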
[157] OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility
Xinhao Xiang, Zhengxin Li, Saurav Dhakad, Theo Bancroft, Jiawei Zhang, Weiyang Li
Main category: cs.CV
TL;DR: OccFace: An occlusion-aware facial landmark detection framework that jointly predicts landmark coordinates and per-point visibility for universal human-like faces under occlusion.
Details
Motivation: Existing facial landmark detectors struggle with occlusion, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Current methods typically handle occlusion implicitly without predicting per-point visibility that downstream applications could benefit from.
Method: Proposes OccFace framework with unified dense 100-point layout and heatmap-based backbone. Adds occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Uses visibility supervision mixing manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap.
Result: Shows improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks. Introduces occlusion-aware evaluation suite with metrics like Occ AP, F1@0.5, and ROC-AUC.
Conclusion: OccFace provides an effective occlusion-aware framework for facial landmark detection that explicitly predicts visibility information, improving performance under challenging occlusion scenarios for universal human-like faces.
Abstract: Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting the per-point visibility that downstream applications can benefit from. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.
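The landmark-aware masking rule suggests a simple (assumed) form for pseudo visibility: one minus the fraction of a landmark's heatmap mass covered by the synthetic occlusion mask. The paper's exact rule and any thresholds are not reproduced here.

```python
# Assumed pseudo-visibility rule from mask-heatmap overlap.
import numpy as np

def pseudo_visibility(heatmaps: np.ndarray, occ_mask: np.ndarray) -> np.ndarray:
    """heatmaps: (K, H, W) non-negative; occ_mask: (H, W) in {0, 1}.
    Returns per-landmark visibility in [0, 1]."""
    mass = heatmaps.sum(axis=(1, 2)) + 1e-8
    occluded = (heatmaps * occ_mask[None]).sum(axis=(1, 2))
    return 1.0 - occluded / mass   # 1 = fully visible, 0 = fully occluded
```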
[158] Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning
Kian Majlessi, Amir Masoud Soltani, Mohammad Ebrahim Mahdavi, Aurelien Gourrier, Peyman Adibi
Main category: cs.CV
TL;DR: S3 RIQA: A no-reference super-resolution image quality assessment method for real-world LR images using self-supervised contrastive learning with SR model-oriented representations.
Details
Motivation: Real-world super-resolution faces complex, irregular degradations that are unpredictable and vary across contexts, making quality assessment challenging. Existing methods struggle with these realistic settings, especially in data-scarce domains.
Method: Proposes self-supervised contrastive learning where positive pairs come from images produced by the same SR model, negative pairs from different models, independent of content. Includes targeted preprocessing and auxiliary tasks for different scaling factors. Uses new SRMORSS dataset for unsupervised pretext training.
Result: S3 RIQA consistently outperforms most state-of-the-art relevant metrics on real SR-IQA benchmarks.
Conclusion: The method enables domain-adaptive IQA for real-world SR applications, addressing the challenge of assessing quality in highly ill-posed realistic settings with complex degradations.
Abstract: Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, those distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR model oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach, S3 RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To this end, we constructed a new dataset, SRMORSS, to support unsupervised pretext training; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3 RIQA consistently outperforms most state-of-the-art relevant metrics.
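A minimal sketch of the core pretext idea, using SR-model IDs as supervised-contrastive labels over a generic encoder's embeddings; the paper's exact loss, preprocessing, and auxiliary tasks are not reproduced here, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def model_oriented_contrastive_loss(feats, model_ids, tau=0.1):
    """Contrastive pretext loss where positives are images produced by the
    same SR model and negatives come from different SR models, regardless
    of image content (simplified SupCon-style formulation).
    feats: (N, D) embeddings, model_ids: (N,) integer SR-model labels."""
    z = F.normalize(feats, dim=1)
    sim = z @ z.t() / tau                                  # (N, N) similarities
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (model_ids[:, None] == model_ids[None, :]) & ~eye
    # log-softmax over all other samples, averaged over positive pairs
    logp = sim.masked_fill(eye, float("-inf")).log_softmax(dim=1)
    return -(logp[pos]).mean()

# usage: feats from any encoder over SR outputs; ids index the producing model
loss = model_oriented_contrastive_loss(torch.randn(8, 128),
                                       torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```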
[159] Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data
Mohamad Dhaini, Paul Honeine, Maxime Berar, Antonin Van Exem
Main category: cs.CV
TL;DR: Proposes a spectral-spatial contrastive learning framework for regression tasks on hyperspectral data, with model-agnostic design and relevant transformations.
Details
Motivation: Contrastive learning has shown success in representation learning for image classification, but there's a shortage of studies targeting regression tasks, particularly for hyperspectral data applications.
Method: Develops a spectral-spatial contrastive learning framework specifically for regression tasks on hyperspectral data. The framework is model-agnostic and can enhance various backbone models including 3D convolutional networks and transformer-based networks. Also provides a collection of transformations relevant for augmenting hyperspectral data.
Result: Experiments on both synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.
Conclusion: The proposed spectral-spatial contrastive learning framework effectively enhances regression performance for hyperspectral data across different backbone architectures.
Abstract: Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage of studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks on hyperspectral data, in a model-agnostic design that allows enhancing backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.
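A few illustrative hyperspectral augmentations in the spirit of the paper's transformation collection; these specific transforms and parameters are assumptions, not the authors' published list:

```python
import torch

def spectral_jitter(cube, sigma=0.02):
    """Per-band multiplicative jitter for a hyperspectral cube (C, H, W)."""
    gain = 1.0 + sigma * torch.randn(cube.shape[0], 1, 1)
    return cube * gain

def band_dropout(cube, p=0.1):
    """Randomly zero whole spectral bands, encouraging spectral redundancy."""
    keep = (torch.rand(cube.shape[0], 1, 1) > p).float()
    return cube * keep

def random_spatial_crop(cube, size=24):
    """Spatial crop shared across all bands (spectral-spatial consistency)."""
    _, h, w = cube.shape
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    return cube[:, y:y + size, x:x + size]

# two augmented views of the same cube, as a contrastive objective would use
cube = torch.rand(100, 32, 32)
views = [random_spatial_crop(band_dropout(spectral_jitter(cube))) for _ in range(2)]
```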
[160] Text-to-Vector Conversion for Residential Plan Design
Egor Bazhenov, Stepan Kasai, Viacheslav Shalamov, Valeria Efimova
Main category: cs.CV
TL;DR: Novel method for generating vector residential plans from text descriptions and algorithm for vectorizing raster plans, achieving 5% and 4% improvements in CLIPScore respectively.
Details
Motivation: Vector graphics provide scalability without quality loss but are complex to produce, especially for design and architecture applications. There's a need for better methods to generate vector graphics from textual descriptions and to convert raster images to structured vector formats.
Method: Introduces a novel method for generating vector residential plans from textual descriptions, leveraging inherent handling of right angles and flexible settings. Also presents a new algorithm for vectorizing raster plans into structured vector images.
Result: The text-to-vector generation method surpasses existing solutions by approximately 5% in CLIPScore-based visual quality. The raster-to-vector algorithm produces images with about 4% better CLIPScore compared to others.
Conclusion: The proposed methods effectively address the complexity of vector graphics generation for architectural applications, providing significant improvements in visual quality metrics for both text-to-vector generation and raster-to-vector conversion.
Abstract: Computer graphics, comprising both raster and vector components, is a fundamental part of modern science, industry, and digital communication. While raster graphics offer ease of use, their pixel-based structure limits scalability. Vector graphics, defined by mathematical primitives, provide scalability without quality loss; however, they are more complex to produce. For design and architecture, the versatility of vector graphics is paramount, despite its computational demands. This paper introduces a novel method for generating vector residential plans from textual descriptions. Our approach surpasses existing solutions by approximately 5% in CLIPScore-based visual quality, benefiting from its inherent handling of right angles and flexible settings. Additionally, we present a new algorithm for vectorizing raster plans into structured vector images. Such images achieve a CLIPScore about 4% higher than those produced by other methods.
[161] Dual-End Consistency Model
Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou
Main category: cs.CV
TL;DR: DE-CM improves consistency models for efficient image generation by addressing training instability and sampling inflexibility through dual-end trajectory optimization and noise-to-noisy mapping.
Details
Motivation: Consistency models face training instability and inflexible sampling limitations despite being state-of-the-art for efficient generation. Existing methods overlook the critical role of trajectory selection in addressing these issues.
Method: Proposes Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters for stable training. Uses continuous-time CM objectives for few-step distillation, flow matching as boundary regularizer, and novel noise-to-noisy (N2N) mapping to alleviate error accumulation.
Result: Achieves state-of-the-art FID score of 1.70 in one-step generation on ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
Conclusion: DE-CM effectively addresses training instability and sampling inflexibility in consistency models through trajectory decomposition and optimization, enabling more practical deployment of efficient generative models.
Abstract: The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis on these two limitations: training instability originates from loss divergence induced by an unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights and analysis, we propose the Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CM objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
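A toy consistency-distillation step on a 2D ODE may help fix ideas. This is a conceptual sketch only: DE-CM's sub-trajectory selection and N2N mapping are omitted, a trivial velocity field stands in for the pretrained teacher, and an identity boundary constraint stands in for the paper's flow-matching regularizer:

```python
import torch
import torch.nn as nn

class ConsistencyNet(nn.Module):
    """f(x, t): maps any trajectory point toward a shared endpoint."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def teacher_velocity(x, t):
    return -x  # toy stand-in for a pretrained PF-ODE velocity field

f = ConsistencyNet()
opt = torch.optim.Adam(f.parameters(), lr=1e-3)

x_t = torch.randn(256, 2)                    # points along teacher trajectories
t = torch.rand(256, 1) * 0.9
dt = 0.05
x_next = x_t + teacher_velocity(x_t, t) * dt  # one Euler step along the ODE

# self-consistency: outputs at adjacent trajectory points should agree;
# the target branch is detached, as in standard CM training
loss_cm = ((f(x_t, t) - f(x_next, t + dt).detach()) ** 2).mean()
# boundary term pinning the map at t=0 (a simplification of DE-CM's
# flow-matching boundary regularizer)
x0 = torch.randn(256, 2)
loss_bd = ((f(x0, torch.zeros(256, 1)) - x0) ** 2).mean()
(loss_cm + 0.1 * loss_bd).backward()
opt.step()
```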
[162] From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?
Krishna Kanth Nakka, Vedasri Nakka
Main category: cs.CV
TL;DR: CyclingVQA benchmark evaluates vision-language models for cyclist-centric traffic understanding, revealing gaps in cyclist-specific perception and reasoning despite strong general capabilities.
Details
Motivation: Current vision-language models show strong performance on autonomous driving benchmarks but are vehicle-centric, lacking evaluation from a cyclist's perspective. There's a need for cyclist-assistive systems that understand traffic from a cyclist's viewpoint.
Method: Introduced CyclingVQA, a diagnostic benchmark to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist’s perspective. Evaluated 31+ recent VLMs including general-purpose, spatially enhanced, and autonomous-driving-specialized models.
Result: Current models show encouraging capabilities but have clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with correct navigational lanes. Driving-specialized models underperformed strong generalist VLMs.
Conclusion: There’s limited transfer from vehicle-centric training to cyclist-assistive scenarios. Systematic error analysis identified recurring failure modes to guide development of more effective cyclist-assistive intelligent systems.
Abstract: Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist’s perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.
[163] RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation
Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, Weijia Jia
Main category: cs.CV
TL;DR: RSHallu: A systematic study of hallucinations in remote sensing multimodal LLMs, including taxonomy, benchmark, and mitigation strategies.
Details
Motivation: Hallucinations in remote sensing MLLMs hinder deployment in high-stakes scenarios like emergency management and agricultural monitoring, and remain under-explored in the RS domain.
Method: Three-pronged approach: (1) Formalize RS hallucinations with RS-oriented taxonomy including image-level hallucinations; (2) Build hallucination benchmark RSHalluEval and checker; (3) Create RSHalluShield dataset for training-friendly mitigation and propose training-free strategies like logit correction and RS-aware prompting.
Result: Mitigation improves hallucination-free rate by up to 21.63 percentage points across representative RS-MLLMs while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG).
Conclusion: RSHallu provides comprehensive framework for understanding and mitigating hallucinations in remote sensing MLLMs, enabling more reliable deployment in critical applications.
Abstract: Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
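For the training-free mitigation, one common form of decoding-time logit correction contrasts an image-conditioned pass against a degraded or text-only pass to amplify visually grounded tokens; the sketch below is illustrative and may differ from RSHallu's exact rule:

```python
import torch

def corrected_next_token_logits(logits_with_img, logits_without_img, alpha=1.0):
    """Generic contrastive logit correction: boost tokens whose probability
    depends on actually seeing the image, damping language-prior guesses.
    (RSHallu's RS-specific correction may be formulated differently.)"""
    return (1 + alpha) * logits_with_img - alpha * logits_without_img

# usage with any causal LM head output (vocab-sized logit vectors)
lw = torch.randn(32000)   # logits conditioned on the RS image
lo = torch.randn(32000)   # logits from a no-image / degraded pass
next_id = corrected_next_token_logits(lw, lo).argmax().item()
```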
[164] DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples
Zi Wang, Katsuya Hotta, Koichiro Kamide, Yawen Zou, Jianjian Qin, Chao Zhang, Jun Yu
Main category: cs.CV
TL;DR: DMP-3DAD is a training-free framework for cross-category 3D anomaly detection using multi-view depth map projection and frozen CLIP visual encoder.
Details
Motivation: Existing 3D anomaly detection methods require category-specific training, limiting flexibility in few-shot scenarios. There's a need for training-free approaches that can work across categories with minimal examples.
Method: Converts 3D point clouds into fixed set of realistic depth images, uses frozen CLIP visual encoder to extract multi-view representations, performs anomaly detection via weighted feature similarity without any fine-tuning or category adaptation.
Result: Achieves state-of-the-art performance on ShapeNetPart dataset under few-shot setting, demonstrating effectiveness for practical cross-category 3D anomaly detection.
Conclusion: Proposes a simple yet effective training-free solution for cross-category 3D anomaly detection using multi-view depth projection and pre-trained vision models.
Abstract: Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, which does not require any fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under the few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
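A rough sketch of the pipeline's shape, with a simple orthographic depth projection and a stand-in encoder where the paper uses a frozen CLIP visual encoder; view count, per-view weighting, and normalization are all simplified:

```python
import numpy as np
import torch
import torch.nn as nn

def orthographic_depth(points, res=64):
    """Project a unit-normalized point cloud (N, 3) to one front-view depth
    map; the paper renders several such views around the object."""
    depth = np.zeros((res, res), dtype=np.float32)
    uv = ((points[:, :2] * 0.5 + 0.5) * (res - 1)).astype(int)
    z = points[:, 2] * 0.5 + 0.5
    for (u, v), d in zip(uv, z):
        depth[v, u] = max(depth[v, u], d)   # keep the nearest surface
    return depth

# frozen encoder stand-in; the paper uses a frozen CLIP visual encoder
encoder = nn.Sequential(nn.Conv2d(1, 8, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten()).eval()

@torch.no_grad()
def anomaly_score(test_pc, normal_pcs):
    """Score = 1 - max cosine similarity to the few normal samples
    (multi-view weighting collapsed to a single view for brevity)."""
    def feat(pc):
        d = torch.from_numpy(orthographic_depth(pc))[None, None]
        f = encoder(d)
        return f / f.norm()
    q = feat(test_pc)
    sims = torch.stack([(q * feat(p)).sum() for p in normal_pcs])
    return (1 - sims.max()).item()

score = anomaly_score(np.random.randn(2048, 3).clip(-1, 1),
                      [np.random.randn(2048, 3).clip(-1, 1) for _ in range(4)])
```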
[165] DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
Main category: cs.CV
TL;DR: DeepImageSearch introduces an agentic paradigm for image retrieval that treats it as autonomous exploration over visual sequences rather than isolated semantic matching, with a benchmark DISBench built on interconnected visual data and a modular agent framework for multi-step reasoning.
Details
Motivation: Current multimodal retrieval systems focus on semantic matching of isolated query-image pairs, ignoring the rich temporal dependencies in realistic visual streams where information is distributed across sequences rather than single snapshots.
Method: 1) Introduces DISBench benchmark built on interconnected visual data requiring multi-step reasoning; 2) Proposes human-model collaborative pipeline using vision-language models to mine latent spatiotemporal associations; 3) Builds modular agent framework with fine-grained tools and dual-memory system for long-horizon navigation.
Result: DISBench poses significant challenges to state-of-the-art models, demonstrating the necessity of agentic reasoning for next-generation retrieval systems that can handle context-dependent queries in visual sequences.
Conclusion: The paper introduces a paradigm shift from isolated semantic matching to agentic exploration for image retrieval, highlighting the importance of incorporating temporal reasoning and context awareness into multimodal retrieval systems.
Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
[166] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
Main category: cs.CV
TL;DR: DC-SFT improves VLM generalization by filtering training data by difficulty, matching RL’s OOD performance with better efficiency.
Details
Motivation: Address the generalization gap where RL-trained VLMs outperform SFT-trained ones on out-of-distribution tasks, proposing a data-centric explanation.
Method: Difficulty-Curated SFT (DC-SFT) filters training samples based on difficulty levels, focusing on medium-difficulty samples to improve generalization.
Result: DC-SFT substantially enhances OOD generalization over standard SFT and surpasses RL-based training performance while being more stable and computationally efficient.
Conclusion: Data difficulty is critical for VLM generalization; explicit difficulty filtering provides an efficient pathway to robust generalization comparable to RL methods.
Abstract: The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL’s generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
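The curation step itself is simple; a sketch assuming the per-sample loss of a reference model as the difficulty proxy (the paper's exact difficulty measure and quantile band are assumptions here):

```python
import numpy as np

def difficulty_curated_subset(losses, low_q=0.25, high_q=0.75):
    """Keep only medium-difficulty samples: drop the easiest quantile
    (little signal) and the hardest quantile (shown to hurt OOD
    generalization). The band itself is illustrative."""
    lo, hi = np.quantile(losses, [low_q, high_q])
    return np.where((losses >= lo) & (losses <= hi))[0]

losses = np.random.exponential(size=10_000)   # stand-in difficulty scores
keep = difficulty_curated_subset(losses)       # indices for the SFT set
```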
[167] Resource-Efficient RGB-Only Action Recognition for Edge Deployment
Dongsik Yoon, Jongeun Kim, Dayeon Lee
Main category: cs.CV
TL;DR: Compact RGB-only action recognition network for edge devices using X3D-style backbone with Temporal Shift, selective temporal adaptation, and parameter-free attention
Details
Motivation: Action recognition on edge devices has strict constraints on latency, memory, storage, and power. While auxiliary modalities like skeleton/depth can improve performance, they require additional sensors or expensive pose-estimation pipelines, limiting practicality for edge deployment.
Method: Proposes a compact RGB-only network based on X3D-style backbone augmented with Temporal Shift, plus introduces selective temporal adaptation and parameter-free attention mechanisms for efficient on-device inference.
Result: Extensive experiments on NTU RGB+D 60 and 120 benchmarks show strong accuracy-efficiency balance. Deployment profiling on Jetson Orin Nano verifies smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition methods.
Conclusion: The proposed RGB-only network achieves efficient action recognition suitable for edge devices without requiring auxiliary modalities, balancing accuracy with practical resource constraints.
Abstract: Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.
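The Temporal Shift operation the backbone is augmented with is a known, zero-parameter building block (Lin et al.'s TSM); a minimal version:

```python
import torch

def temporal_shift(x, n_segments, fold_div=8):
    """Temporal Shift: move a fraction of channels one step forward and
    backward in time, mixing temporal context at zero extra FLOPs.
    x is (N*T, C, H, W) with T = n_segments frames per clip."""
    nt, c, h, w = x.shape
    t = n_segments
    x = x.view(nt // t, t, c, h, w)
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift toward past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift toward future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # untouched channels
    return out.view(nt, c, h, w)

y = temporal_shift(torch.randn(16, 24, 14, 14), n_segments=8)
```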
[168] Flow caching for autoregressive video generation
Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, Rongrong Ji
Main category: cs.CV
TL;DR: FlowCache: A caching framework for accelerating autoregressive video generation by implementing chunkwise caching policies and optimized KV cache compression.
Details
Motivation: Autoregressive models for ultra-long video generation are slow due to sequential chunk synthesis. Existing caching methods fail because they assume uniform denoising across frames, which doesn't hold for autoregressive models where different chunks have varying similarity patterns.
Method: Introduces chunkwise caching strategy that allows independent caching policies per video chunk, dynamically adapting to each chunk’s unique denoising characteristics. Also develops joint importance-redundancy optimized KV cache compression to maintain fixed memory bounds while preserving quality.
Result: Achieves 2.38× speedup on MAGI-1 and 6.7× speedup on SkyReels-V2 with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively).
Conclusion: FlowCache successfully enables real-time, ultra-long video generation with autoregressive models, establishing a new benchmark for efficient video synthesis at scale.
Abstract: Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation-establishing a new benchmark for efficient video synthesis at scale. The code is available at https://github.com/mikeallen39/FlowCache.
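A minimal sketch of a per-chunk caching policy: each chunk independently decides whether to reuse its cached output based on how much its input features have drifted. The cosine test and threshold are illustrative stand-ins for FlowCache's actual per-chunk policy:

```python
import torch

class ChunkCache:
    """Per-chunk cache: recompute a chunk's block output only when its
    current input features differ enough from the ones seen at cache time."""
    def __init__(self, thresh=0.98):
        self.inputs, self.outputs, self.thresh = {}, {}, thresh

    def step(self, chunk_id, feats, compute_fn):
        prev = self.inputs.get(chunk_id)
        if prev is not None:
            sim = torch.cosine_similarity(prev.flatten(), feats.flatten(), dim=0)
            if sim > self.thresh:
                return self.outputs[chunk_id]   # reuse: skip recomputation
        out = compute_fn(feats)                 # recompute and refresh cache
        self.inputs[chunk_id], self.outputs[chunk_id] = feats, out
        return out

cache = ChunkCache()
out = cache.step(0, torch.randn(4, 64), compute_fn=lambda f: f * 2)
```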
[169] Hyperspectral Smoke Segmentation via Mixture of Prototypes
Lujian Yao, Haitao Zhao, Xianghai Kong, Yuhan Xu
Main category: cs.CV
TL;DR: A hyperspectral smoke segmentation method using mixture of prototypes network with band splitting, prototype-based spectral representation, and dual-level routing for adaptive band weighting.
Details
Motivation: Traditional visible-light smoke segmentation methods have limitations due to insufficient spectral information, struggling with cloud interference and semi-transparent smoke. Hyperspectral imaging can provide richer spectral data for better segmentation.
Method: Proposes a Mixture of Prototypes (MoP) network with three components: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. Also introduces the first hyperspectral smoke segmentation dataset (HSSDataset) and a multispectral dataset (MSSDataset).
Result: Superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation. Validated through extensive experiments.
Conclusion: Hyperspectral imaging with adaptive band weighting via MoP network effectively addresses smoke segmentation challenges, outperforming traditional methods and providing a robust solution for wildfire management and industrial safety.
Abstract: Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotations protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.
[170] Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis
Darakshan Rashid, Raza Imam, Dwarikanath Mahapatra, Brejesh Lall
Main category: cs.CV
TL;DR: Stride-Net learns disease-discriminative yet demographically invariant representations for chest X-ray analysis using patch-level masking, adversarial confusion, and semantic alignment with BioBERT embeddings.
Details
Motivation: Chest X-ray classification models often underperform for specific demographic subgroups, raising clinical safety and equity concerns. Existing debiasing methods yield inconsistent improvements or degrade overall diagnostic utility by treating fairness as a post hoc constraint rather than a property of the learned representation.
Method: Stride-Net operates at patch level using learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. It enforces semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport to anchor representations in clinical semantics and discourage shortcut learning.
Result: Evaluated on MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across ResNet and Vision Transformer architectures, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving more favorable accuracy-fairness trade-off than prior debiasing approaches.
Conclusion: Stride-Net provides an effective framework for learning fair chest X-ray representations that maintain diagnostic utility while reducing demographic bias through patch-level masking, adversarial training, and semantic alignment techniques.
Abstract: Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at https://github.com/Daraksh/Fairness_StrideNet.
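The adversarial component can be sketched with a standard gradient-reversal layer, which trains features to defeat a sensitive-attribute classifier; Stride-Net's exact confusion loss may be formulated differently:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (scaled)
    gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -ctx.lam * g, None

feat_net = nn.Linear(32, 16)          # stand-in feature extractor
attr_head = nn.Linear(16, 2)          # sensitive-attribute classifier
x, attr = torch.randn(8, 32), torch.randint(0, 2, (8,))

z = feat_net(x)
logits = attr_head(GradReverse.apply(z, 1.0))
loss_adv = nn.functional.cross_entropy(logits, attr)
loss_adv.backward()   # pushes feat_net to remove attribute information
```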
[171] Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation
Minggui He, Mingchen Dai, Jian Zhang, Yilun Liu, Shimin Tao, Pufan Zeng, Osamu Yoshie, Yuya Ieiri
Main category: cs.CV
TL;DR: Chart Specification: A structured intermediate representation for chart-to-code generation that improves structural fidelity through semantically grounded supervision and reinforcement learning with fine-grained structural feedback.
Details
Motivation: Current Vision-Language Models struggle with structural fidelity in chart-to-code generation, often producing hallucinated or semantically inconsistent outputs due to reliance on surface-level token imitation rather than faithful modeling of underlying chart structure.
Method: Proposes Chart Specification as a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Uses a Spec-Align Reward for fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic.
Result: Outperforms prior approaches on three public benchmarks, achieving strong data efficiency with only 3K training samples (surpassing leading baselines by up to 61.7% on complex benchmarks) and establishing new state-of-the-art results with 4K samples across all evaluated metrics.
Conclusion: Precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation, demonstrating that structured intermediate representations and fine-grained feedback mechanisms significantly improve the quality and consistency of vision-language model outputs for chart understanding tasks.
Abstract: Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
[172] ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving
Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, Yunhong Wang
Main category: cs.CV
TL;DR: TR-World is a temporal residual world model for autonomous driving that focuses on dynamic object modeling by extracting temporal residuals from scene representations, enabling more precise future predictions and trajectory refinement.
Details
Motivation: Current world models for autonomous driving have limitations: they redundantly model static regions and lack deep interaction with trajectories, preventing them from achieving full effectiveness in planning accuracy.
Method: Proposes Temporal Residual World Model (TR-World) that calculates temporal residuals of scene representations to extract dynamic object information without detection/tracking. Also introduces Future-Guided Trajectory Refinement (FGTR) module that interacts prior trajectories with future BEV features for refinement and provides supervision to prevent model collapse.
Result: Achieves state-of-the-art planning performance on nuScenes and NAVSIM datasets, demonstrating improved accuracy in autonomous driving planning tasks.
Conclusion: TR-World effectively addresses limitations of existing world models by focusing on dynamic object modeling through temporal residuals and enabling trajectory refinement through future scene interaction, leading to superior planning performance.
Abstract: The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at https://github.com/mengtan00/ResWorld.git.
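The temporal-residual idea reduces to subtracting an ego-motion-aligned previous BEV frame so static structure cancels and moving objects remain; a simplified sketch (alignment assumed done upstream, and the thresholding is illustrative):

```python
import torch

def temporal_residual(bev_t, bev_prev_aligned, tau=0.1):
    """Dynamic-object cue from BEV features: static regions cancel in the
    difference, so the residual highlights moving objects without any
    detection or tracking. bev tensors: (C, H, W)."""
    residual = bev_t - bev_prev_aligned
    dyn_mask = residual.abs().mean(0, keepdim=True) > tau
    return residual * dyn_mask          # keep only likely-dynamic regions

res = temporal_residual(torch.randn(64, 200, 200), torch.randn(64, 200, 200))
```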
[173] Chatting with Images for Introspective Visual Thinking
Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tienie Tan
Main category: cs.CV
TL;DR: ViLaVT introduces “chatting with images” framework for LVLMs that uses language-guided feature modulation to enable interactive visual reasoning through joint re-encoding of multiple image regions, improving cross-modal alignment for complex spatial reasoning tasks.
Details
Motivation: Current LVLMs suffer from loss of fine-grained visual information due to single-pass visual encoding and text-only reasoning. Existing "thinking with images" approaches using external tools lack proper grounding in linguistic semantics, especially for reasoning across distant regions or multiple images.
Method: Proposes “chatting with images” framework with language-guided feature modulation. Implements ViLaVT model with dynamic vision encoder for interactive visual reasoning. Uses two-stage training curriculum: supervised fine-tuning followed by reinforcement learning to promote effective reasoning behaviors.
Result: Extensive experiments across eight benchmarks show strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
Conclusion: The “chatting with images” paradigm enables tighter coupling between linguistic reasoning and visual state updates, addressing limitations of current LVLMs in handling complex visual reasoning tasks requiring cross-modal alignment.
Abstract: Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently, the proposal of “thinking with images” attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose “chatting with images”, a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and trained it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
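Language-guided feature modulation can be approximated by a FiLM-style operator, where a text embedding predicts per-channel scale and shift over visual features; ViLaVT's joint re-encoding is richer, so treat this as a sketch:

```python
import torch
import torch.nn as nn

class LanguageGuidedModulation(nn.Module):
    """FiLM-style modulation: a text embedding predicts per-channel scale
    and shift applied to visual features, one simple way to realize
    language-guided feature modulation."""
    def __init__(self, txt_dim, vis_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(txt_dim, 2 * vis_channels)

    def forward(self, vis, txt):            # vis: (B, C, H, W), txt: (B, D)
        gamma, beta = self.to_gamma_beta(txt).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1 + gamma) * vis + beta     # re-encode under the prompt

mod = LanguageGuidedModulation(txt_dim=512, vis_channels=256)
out = mod(torch.randn(2, 256, 16, 16), torch.randn(2, 512))
```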
[174] FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference
Guandong Li
Main category: cs.CV
TL;DR: FastUSP: A multi-level optimization framework for efficient distributed inference of large diffusion models that addresses kernel launch overhead and communication inefficiencies in Unified Sequence Parallelism.
Details
Motivation: Large diffusion models like FLUX (12B) and Stable Diffusion 3 (8B) require multi-GPU parallelism for inference, but existing Unified Sequence Parallelism implementations suffer from excessive kernel launch overhead and suboptimal computation-communication scheduling.
Method: FastUSP integrates three levels of optimization: 1) compile-level optimization with graph compilation using CUDA Graphs and computation-communication reordering, 2) communication-level optimization with FP8 quantized collective communication, and 3) operator-level optimization with pipelined Ring attention using double buffering.
Result: On FLUX (12B), FastUSP achieves 1.12×-1.16× speedup over baseline USP, with compile-level optimization providing the dominant improvement. On Qwen-Image, FastUSP achieves 1.09× speedup on 2 GPUs, though PyTorch Inductor compatibility issues limit optimization on 4-8 GPUs.
Conclusion: Kernel launch overhead, rather than communication latency, is the primary bottleneck for distributed diffusion inference on modern high-bandwidth GPU interconnects, and FastUSP effectively addresses this through multi-level optimizations.
Abstract: Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose FastUSP, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent 1.12×–1.16× end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves 1.09× speedup on 2 GPUs; on 4–8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30×–1.46× of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead, rather than communication latency, is the primary bottleneck on modern high-bandwidth GPU interconnects.
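The compile-level piece rests on standard PyTorch CUDA Graph capture, which records a sequence of kernels once and replays them with a single launch; a minimal inference-side example (requires a CUDA device; the model and shapes are illustrative, not FastUSP's):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024),
                            torch.nn.GELU()).cuda().eval()
static_in = torch.randn(64, 1024, device="cuda")

with torch.no_grad():
    # warm up on a side stream before capture, as the PyTorch docs recommend
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)   # kernel launches are recorded, not run

    # later steps: refill the static input buffer and replay the whole graph
    static_in.copy_(torch.randn(64, 1024, device="cuda"))
    g.replay()                          # one replay = all recorded kernels
    print(static_out.shape)
```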
[175] Towards Learning a Generalizable 3D Scene Representation from 2D Observations
Martin Gromniak, Jan-Gerrit Habekost, Sebastian Kamp, Sven Magg, Stefan Wermter
Main category: cs.CV
TL;DR: A neural radiance field approach for predicting 3D workspace occupancy from robot observations, operating in global workspace coordinates for robotic manipulation applications.
Details
Motivation: Prior methods operate in camera-centric coordinates, limiting their direct applicability to robotic manipulation tasks. The authors aim to create a generalizable occupancy prediction model that works in a global workspace frame.
Method: Uses a Generalizable Neural Radiance Field approach that constructs occupancy representations in a global workspace frame rather than camera coordinates. Integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning.
Result: Achieves 26mm reconstruction error on real scenes, including occluded regions. Trained on 40 real scenes, the model demonstrates ability to infer complete 3D occupancy beyond traditional stereo vision methods.
Conclusion: The approach successfully predicts 3D workspace occupancy from egocentric robot observations in a global frame, making it directly applicable to robotic manipulation tasks with strong generalization capabilities.
Abstract: We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.
[176] Healthy Harvests: A Comparative Look at Guava Disease Classification Using InceptionV3
Samanta Ghosh, Shaila Afroz Anika, Umma Habiba Ahmed, B. M. Shahria Alam, Mohammad Tahmid Noor, Nishat Tasnim Niloy
Main category: cs.CV
TL;DR: Computer vision approach using InceptionV3 and ResNet50 models to classify guava fruit diseases (Anthracnose, Fruit flies, Healthy) with 98.15% and 94.46% accuracy respectively, enhanced with data augmentation and interpretability techniques.
Details
Motivation: Guava fruits suffer from diseases that harm quality and yield, requiring early identification to minimize damage and ensure fruit health. The study aims to classify three disease categories using computer vision techniques.
Method: Used dataset of 473 guava images resized to 256x256 pixels, augmented to 3784 images. Implemented InceptionV3 and ResNet50 deep learning models with data mixing techniques (CutMix, MixUp) for robustness. Used confusion matrix for evaluation and SHAP analysis for interpretability.
Result: InceptionV3 achieved 98.15% accuracy, ResNet50 achieved 94.46% accuracy. Both models effectively classified guava diseases with high performance, enhanced by data augmentation and interpretability methods.
Conclusion: Advanced deep learning models like InceptionV3 and ResNet50 can effectively classify guava fruit diseases with high accuracy, demonstrating the potential of computer vision for agricultural disease detection and management.
Abstract: Guava fruits often suffer from many diseases. This can harm fruit quality and fruit crop yield. Early identification is important for minimizing damage and ensuring fruit health. This study focuses on 3 different categories for classifying diseases. These are Anthracnose, Fruit flies, and Healthy fruit. The dataset used in this study is collected from Mendeley Data. This dataset contains 473 original images of Guava. These images vary in size and format. The original dataset was resized to 256x256 pixels with RGB color mode for better consistency. After this, the data augmentation process is applied to improve the dataset by generating variations of the original images. The augmented dataset consists of 3784 images using advanced preprocessing techniques. Two deep learning models were implemented to classify the images. The InceptionV3 model is well known for its advanced framework, applying multiple convolutional filters in parallel to extract diverse features effectively. On the other hand, the ResNet50 model helps to train deeper networks by using residual learning. The InceptionV3 model achieved an impressive accuracy of 98.15%, and ResNet50 reached 94.46%. Data mixing methods such as CutMix and MixUp were applied to enhance the model's robustness. The confusion matrix was used to evaluate the overall model performance of both InceptionV3 and ResNet50. Additionally, SHAP analysis is used to improve interpretability, helping to identify the parts of the image most significant for the model's prediction. This study aims to highlight how advanced models can enhance automated guava disease classification.
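A transfer-learning skeleton matching the study's setup in spirit, using torchvision's InceptionV3 with a 3-way head (Anthracnose / Fruit flies / Healthy); the weights, loss weighting, and training details here are illustrative, not the authors' configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# InceptionV3 with a 3-way head; weights=None avoids a download here, while
# a real run would presumably fine-tune from pretrained weights.
model = models.inception_v3(weights=None, aux_logits=True, init_weights=True)
model.fc = nn.Linear(model.fc.in_features, 3)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 3)

x = torch.randn(2, 3, 299, 299)     # canonical InceptionV3 input is 299x299
labels = torch.tensor([0, 2])       # e.g. Anthracnose, Healthy
model.train()
out, aux = model(x)                 # training mode returns main + aux logits
loss = nn.functional.cross_entropy(out, labels) \
     + 0.4 * nn.functional.cross_entropy(aux, labels)
loss.backward()
```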
[177] VFGS-Net: Frequency-Guided State-Space Learning for Topology-Preserving Retinal Vessel Segmentation
Ruiqi Song, Lei Liu, Ya-Nan Zhang, Chao Wang, Xiaoning Li, Nan Mu
Main category: cs.CV
TL;DR: VFGS-Net: A retinal vessel segmentation network combining frequency-domain attention, dual-path convolution, and bidirectional Mamba2-based spatial modeling for improved segmentation of fine capillaries and global topological continuity.
Details
Motivation: Retinal vessel segmentation is crucial for quantitative analysis and computer-aided diagnosis of vascular diseases, but existing methods struggle with elongated morphology, wide scale variation, and low contrast, making it difficult to preserve fine capillaries while maintaining global topological continuity.
Method: Proposes VFGS-Net with three key components: 1) dual-path feature convolution module for local textures and multi-scale context, 2) vessel-aware frequency-domain channel attention to reweight spectral components, and 3) bidirectional asymmetric Mamba2-based spatial modeling block at bottleneck for long-range dependencies.
Result: Extensive experiments on four public retinal vessel datasets show competitive or superior performance compared to state-of-the-art methods, with consistent improvements for fine vessels, complex branching patterns, and low-contrast regions.
Conclusion: VFGS-Net effectively addresses retinal vessel segmentation challenges through integrated frequency-aware enhancement and global spatial modeling, demonstrating robustness and clinical potential for vascular disease analysis.
Abstract: Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.
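A generic sketch of frequency-domain channel attention: pool each channel's FFT magnitude spectrum into a descriptor, then predict channel gates from it. The paper's vessel-aware reweighting is more specific than this:

```python
import torch
import torch.nn as nn

class FrequencyChannelAttention(nn.Module):
    """Reweight channels from their spectral content: channels whose
    frequency profile matches the learned gate are amplified."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho").abs()    # magnitude spectrum
        desc = spec.mean(dim=(2, 3))                     # (B, C) descriptor
        gate = self.mlp(desc)[:, :, None, None]
        return x * gate                                  # per-channel gating

y = FrequencyChannelAttention(32)(torch.randn(2, 32, 64, 64))
```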
[178] DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification
Nuno Gonçalves, Diogo Nunes, Carla Guerra, João Marcos
Main category: cs.CV
TL;DR: A novel DFIC dataset with 58K annotated images and 2,706 videos for automated ICAO compliance verification of facial images in travel documents, featuring balanced demographics and non-compliant conditions.
Details
Motivation: Manual inspection of facial images for ISO/IEC and ICAO compliance in machine-readable travel documents is inefficient for high-demand environments, requiring automated verification methods.
Method: Created DFIC dataset with diverse facial images (compliant and non-compliant), fine-tuned a spatial attention-based model for automatic ICAO compliance validation, and compared with state-of-the-art methods.
Result: Demonstrated improved results over existing ICAO compliance verification methods using the DFIC dataset, which offers unprecedented facial diversity and balanced demographic distribution.
Conclusion: DFIC dataset enhances automated ICAO compliance verification and can improve security, privacy, and fairness in facial recognition systems through its diverse facial representation.
Abstract: Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2,706 videos of more than 1000 subjects, covering a broad range of non-compliant conditions in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that heavily relies on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we have compared it with the state-of-the-art aimed at ICAO compliance verification, demonstrating improved results. The DFIC dataset is now made public (https://github.com/visteam-isr-uc/DFIC) for the training and validation of new models, offering an unprecedented diversity of faces that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods, but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems.
[179] Interpretable Vision Transformers in Image Classification via SVDA
Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
Main category: cs.CV
TL;DR: SVDA mechanism adapted to Vision Transformers enhances interpretability and sparsity of attention patterns while maintaining classification accuracy on multiple benchmarks.
Details
Motivation: Vision Transformers achieve state-of-the-art performance but their attention mechanisms remain opaque with dense, non-structured behaviors, lacking interpretability.
Method: Adapts the previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, using a geometrically grounded formulation with interpretability indicators to monitor attention dynamics during training.
Result: SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy on CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 benchmarks.
Conclusion: SVDA serves as a comprehensive tool for analyzing and developing structured attention models, laying foundation for explainable AI, spectral diagnostics, and attention-based model compression in computer vision.
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply interpretability indicators – originally proposed with SVDA – to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks – CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 – demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.
[180] Enhancing Predictability of Multi-Tenant DNN Inference for Autonomous Vehicles’ Perception
Liangkai Liu, Kang G. Shin, Jinkyu Lee, Chengmo Yang, Weisong Shi
Main category: cs.CV
TL;DR: PP-DNN is a predictable perception system for autonomous vehicles that dynamically selects critical frames and regions of interest to reduce computational load while maintaining accuracy for multi-tenant DNNs.
Details
Motivation: Autonomous vehicles face challenges in achieving real-time DNN inference due to the gap between computational requirements and limited onboard resources. Existing approaches focus on model compression, but PP-DNN addresses this by reducing the amount of image data to process.
Method: PP-DNN uses an ROI generator to identify critical frames and regions based on frame similarities and traffic scenarios, a FLOPs predictor to estimate computational requirements, an ROI scheduler to coordinate multiple DNN models, and a detection predictor for non-critical frames.
Result: PP-DNN significantly improves perception predictability, increasing fusion frames by up to 7.3x, reducing fusion delay by >2.6x and delay variations by >2.3x, improving detection completeness by 75.4% and cost-effectiveness by up to 98% over baseline.
Conclusion: PP-DNN provides an effective approach for predictable perception in autonomous vehicles by dynamically selecting critical frames and ROIs, enabling real-time inference while maintaining accuracy for multi-tenant DNNs.
Abstract: Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV’s perception pipeline is challenging due to the large gap between the computation requirement and the AV’s limited resources. Most, if not all, existing studies focus on optimizing the DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduces the amount of image data to be processed while maintaining the same level of accuracy for multi-tenant DNNs by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV’s surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. We have implemented PP-DNN in a ROS-based AV pipeline and evaluated it with the BDD100K and nuScenes datasets. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by >2.6x and fusion-delay variations by >2.3x, improving detection completeness by 75.4% and the cost-effectiveness by up to 98% over the baseline.
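To make the critical-frame idea concrete, here is a toy selector that triggers full DNN processing only when a frame differs enough from the last processed one; the flattened cosine similarity and the threshold are assumptions, not the paper's actual ROI generator:

```python
import numpy as np

def select_critical_frames(frames, sim_thresh=0.92):
    """Toy critical-frame selection: flag a frame when it is dissimilar
    enough from the last processed frame (similarity metric is assumed).
    frames: list of equally shaped image arrays."""
    critical, last = [], None
    for i, f in enumerate(frames):
        cur = f.astype(np.float32).ravel()
        if last is None:
            critical.append(i)
            last = cur
            continue
        sim = cur @ last / (np.linalg.norm(cur) * np.linalg.norm(last) + 1e-8)
        if sim < sim_thresh:          # scene changed enough: run full DNNs
            critical.append(i)
            last = cur
    return critical                   # indices to process with full models
```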
[181] Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos
Main category: cs.CV
TL;DR: SVDA introduces spectrally structured attention for monocular depth estimation, making Transformers interpretable through six spectral indicators while maintaining accuracy.
Details
Motivation: Self-attention mechanisms in Transformers are opaque, limiting interpretability in critical applications like robotics and autonomous driving. There's a need for intrinsically interpretable attention mechanisms rather than post-hoc approximations.
Method: Introduces SVD-Inspired Attention (SVDA) into Dense Prediction Transformers, embedding a learnable diagonal matrix into normalized query-key interactions to decouple directional alignment from spectral modulation.
Result: SVDA preserves or slightly improves predictive accuracy on KITTI and NYU-v2 datasets with minor computational overhead, while providing six spectral indicators that reveal consistent cross-dataset and depth-wise patterns in attention organization.
Conclusion: SVDA redefines interpretability in monocular depth estimation by shifting attention from opaque mechanism to quantifiable descriptor, opening a principled avenue toward transparent dense prediction models.
Abstract: Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.
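Both SVDA papers describe the same core mechanism: a learnable diagonal matrix embedded into normalized query-key interactions. A single-head sketch consistent with that description (projection layout, scaling, and initialization are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDAAttention(nn.Module):
    """Single-head sketch of SVD-inspired attention: L2-normalized queries
    and keys interact through a learnable diagonal spectrum."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.spectrum = nn.Parameter(torch.ones(dim))  # learnable diagonal matrix
        self.scale = dim ** -0.5

    def forward(self, x):                              # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = F.normalize(q, dim=-1)                     # directional alignment ...
        k = F.normalize(k, dim=-1)
        attn = (q * self.spectrum) @ k.transpose(-2, -1) * self.scale
        return attn.softmax(dim=-1) @ v                # ... times spectral modulation
```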
[182] LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation
Lei Yao, Yi Wang, Yawen Cui, Moyun Liu, Lap-Pui Chau
Main category: cs.CV
TL;DR: LaSSM is an efficient 3D scene instance segmentation method that uses hierarchical semantic-spatial query initialization and a coordinate-guided state space model decoder to achieve state-of-the-art performance with reduced computational cost.
Details
Motivation: Existing query-based 3D instance segmentation methods suffer from query initialization problems due to sparse point clouds and rely on computationally intensive attention mechanisms in decoders, creating a need for simpler, more efficient approaches.
Method: Proposes hierarchical semantic-spatial query initialization from superpoints for comprehensive scene coverage, and a coordinate-guided state space model decoder with local aggregation and spatial dual-path SSM blocks to refine queries efficiently.
Result: Achieves first place on ScanNet++ V2 leaderboard with 2.5% mAP improvement over previous best method using only 1/3 FLOPs, and competitive performance on ScanNet, ScanNet200, S3DIS, and ScanNet++ V1 benchmarks.
Conclusion: LaSSM demonstrates that prioritizing simplicity and efficiency while maintaining competitive performance is achievable through innovative query initialization and state space model-based decoding for 3D scene instance segmentation.
Abstract: Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating the associated coordinate information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 of the FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS, and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at https://github.com/RayYoh/LaSSM.
[183] Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting
Rishikesh Bhyri, Brian R Quaranto, Philip J Seger, Kaity Tung, Brendan Fox, Gene Yang, Steven D. Schwaitzberg, Junsong Yuan, Nan Xi, Peter C W Kim
Main category: cs.CV
TL;DR: Chain-of-Look: A visual reasoning framework for counting densely packed surgical instruments by mimicking human sequential counting with structured visual chains and spatial constraints.
Details
Motivation: Accurate counting of surgical instruments in operating rooms is critical for patient safety, but current methods struggle with dense scenarios where instruments are tightly clustered. Existing approaches like object detection and multimodal LLMs fail to handle the complexity of these dense arrangements.
Method: Proposes Chain-of-Look framework that mimics human sequential counting by enforcing structured visual chains instead of unordered object detection. Introduces neighboring loss function to model spatial constraints of densely packed instruments. Also presents SurgCount-HD dataset with 1,464 high-density surgical instrument images.
Result: Outperforms state-of-the-art counting methods (CountGD, REC) and multimodal LLMs (Qwen, ChatGPT) in dense surgical instrument counting tasks.
Conclusion: Chain-of-Look provides an effective solution for accurate surgical instrument counting in dense scenarios by incorporating structured visual reasoning and spatial constraints, addressing a critical patient safety need in operating rooms.
Abstract: Accurate counting of surgical instruments in Operating Rooms (ORs) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection, which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodal Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.
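The summary does not give the neighboring loss in closed form; one plausible reading penalizes large spatial gaps between consecutive points on the predicted counting chain. A hypothetical sketch:

```python
import torch

def neighboring_loss(chain_xy, margin=0.0):
    """Hypothetical neighboring loss: consecutive points on the visual
    counting chain should stay close, since instruments are densely packed.
    chain_xy: (B, T, 2) ordered point coordinates along the chain."""
    steps = chain_xy[:, 1:] - chain_xy[:, :-1]          # (B, T-1, 2)
    gaps = steps.norm(dim=-1)                           # consecutive distances
    return torch.clamp(gaps - margin, min=0.0).mean()   # penalize long jumps
```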
[184] PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation
Yujie Chen, Li Zhang, Xiaomeng Chu, Tian Zhang
Main category: cs.CV
TL;DR: PuriLight is a lightweight self-supervised monocular depth estimation framework that balances computational efficiency with structural precision through three novel modules for feature extraction, enhancement, and purification.
Details
Motivation: Existing self-supervised depth estimation methods face a trade-off between computational efficiency and detail preservation: bulky architectures compromise practicality while lightweight models sacrifice structural precision. There's a need for lightweight yet structurally precise architectures.
Method: Three-stage architecture with three novel modules: 1) Shuffle-Dilation Convolution (SDC) for local feature extraction, 2) Rotation-Adaptive Kernel Attention (RAKA) for hierarchical feature enhancement, and 3) Deep Frequency Signal Purification (DFSP) for global feature purification.
Result: Extensive experiments show PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency.
Conclusion: PuriLight successfully addresses the dual challenges of computational efficiency and detail preservation in self-supervised monocular depth estimation through its lightweight yet precise architecture.
Abstract: We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Code will be available at https://github.com/ishrouder/PuriLight.
[185] First International StepUP Competition for Biometric Footstep Recognition: Methods, Results and Remaining Challenges
Robyn Larracy, Eve MacDonald, Angkoon Phinyomark, Saeid Rezaei, Mahdi Laghaei, Ali Hajighasem, Aaron Tabor, Erik Scheme
Main category: cs.CV
TL;DR: The paper presents the First International StepUP Competition for biometric footstep recognition using the UNB StepUP-P150 dataset, where top team achieved 10.77% EER using generative reward machine optimization, but challenges remain in generalizing to unfamiliar footwear.
Details
Motivation: Biometric footstep recognition has potential applications in security and safety, but progress has been limited by lack of large, diverse datasets needed to address challenges like generalization to new users and robustness to variations in footwear and walking speed.
Method: Organized an international competition where teams developed recognition models using the UNB StepUP-P150 dataset (largest high-resolution footstep pressure recordings), then evaluated on a separate test set designed to assess verification performance under challenging variations with limited reference data.
Result: Competition attracted 23 teams globally; top-performing team Saeid_UCC achieved best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy, but persistent challenges in generalizing to unfamiliar footwear were identified.
Conclusion: The competition showcased strong solutions for biometric footstep recognition but highlighted that generalization to unfamiliar footwear remains a critical challenge for future work in this emerging field.
Abstract: Biometric footstep recognition, based on a person’s unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.
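For readers outside biometrics, the equal error rate quoted above is the standard verification metric: the operating point where the false accept rate equals the false reject rate. A plain NumPy computation:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Standard EER: the threshold where the false accept rate (FAR)
    equals the false reject rate (FRR).
    scores: similarity scores; labels: 1 = genuine pair, 0 = impostor."""
    genuine = scores[labels == 1]
    impostor = scores[labels == 0]
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine users wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```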
[186] FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference
Divya Jyoti Bajpai, Dhruv Bhardwaj, Soumya Roy, Tejas Duseja, Harsh Agarwal, Aashay Sandansing, Manjesh Kumar Hanawal
Main category: cs.CV
TL;DR: FastFlow is a plug-and-play adaptive inference framework that accelerates flow-matching models by skipping unnecessary denoising steps using finite-difference approximations and multi-armed bandit optimization.
Details
Motivation: Flow-matching models achieve state-of-the-art fidelity in image/video generation but suffer from slow sequential denoising. Existing acceleration methods are static, require retraining, and don't generalize well across tasks.
Method: FastFlow identifies denoising steps that make minor adjustments and approximates them using finite-difference velocity estimates from prior predictions. It models step-skipping decisions as a multi-armed bandit problem to learn optimal skips balancing speed and quality.
Result: Achieves over 2.6x speedup while maintaining high-quality outputs, works seamlessly with existing pipelines, and generalizes across image generation, video generation, and editing tasks.
Conclusion: FastFlow provides an effective plug-and-play solution for accelerating flow-matching models without retraining, offering significant speed improvements while preserving output quality across multiple vision tasks.
Abstract: Flow-matching models deliver state-of-the-art fidelity in image and video generation, but their inherently sequential denoising process makes inference slow. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow-matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without invoking the full neural network used for velocity prediction. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancement along the denoising path at negligible compute cost. This enables skipping computation at intermediate steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at https://github.com/Div290/FastFlow.
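The bandit view can be sketched compactly: an epsilon-greedy agent picks how many steps to advance by finite-difference velocity extrapolation before the next full model call. The arm set, reward shaping, and linear extrapolation rule below are assumptions rather than FastFlow's exact design:

```python
import numpy as np

class SkipBandit:
    """Epsilon-greedy bandit: each arm is a skip count (0 = no skipping)."""
    def __init__(self, max_skip=4, eps=0.1):
        self.values = np.zeros(max_skip + 1)   # running reward estimates
        self.counts = np.zeros(max_skip + 1)
        self.eps = eps

    def pick(self):
        if np.random.rand() < self.eps:
            return np.random.randint(len(self.values))
        return int(np.argmax(self.values))

    def update(self, arm, reward):             # reward: quality minus compute
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

def extrapolate(x, v_prev, v_curr, dt, n_skip):
    """Advance the ODE state with a finite-difference velocity trend,
    avoiding model calls for n_skip steps."""
    dv = v_curr - v_prev                       # first-order velocity change
    for _ in range(n_skip):
        x = x + dt * v_curr
        v_curr = v_curr + dv                   # linear velocity extrapolation
    return x
```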
[187] HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion
Di Chang, Ji Hou, Aljaz Bozic, Assaf Neuberger, Felix Juefei-Xu, Olivier Maury, Gene Wei-Chin Lin, Tuur Stuyck, Doug Roble, Mohammad Soleymani, Stephane Grabli
Main category: cs.CV
TL;DR: HairWeaver is a diffusion-based pipeline for animating realistic hair dynamics in single human images, using specialized LoRA modules to control hair motion while preserving photorealistic appearance.
Details
Motivation: Existing human animation methods successfully control body pose but lack specific control over hair, resulting in stiff and unrealistic hair animations that fail to capture intricate hair motions.
Method: Uses two specialized LoRA modules: Motion-Context-LoRA to integrate motion conditions and Sim2Real-Domain-LoRA to preserve photoreal appearance across domains. These lightweight components guide a video diffusion backbone trained on CG-simulated dynamic human motion data.
Result: Comprehensive evaluations show the approach sets new state-of-the-art, producing lifelike human hair animations with dynamic details and natural response to movement.
Conclusion: HairWeaver enables fine control over hair motion while maintaining photorealistic appearance, overcoming limitations of existing methods for realistic hair animation.
Abstract: We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair and consequently fail to capture intricate hair motions, producing stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject’s photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.
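Both adapters are LoRA modules. How HairWeaver attaches them to the video backbone is not specified in the summary, but the underlying mechanism is standard low-rank adaptation of frozen projections, as in this generic sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Standard LoRA adapter wrapping a frozen linear layer; the rank and
    alpha values here are illustrative."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # backbone stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)               # adapter starts at zero delta
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```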
[188] PhyCritic: Multimodal Critic Models for Physical AI
Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
Main category: cs.CV
TL;DR: PhyCritic is a multimodal critic model specialized for physical AI tasks, using a two-stage RLVR pipeline with physical skill warmup and self-referential critic finetuning to improve judgment stability and physical correctness.
Details
Motivation: Existing multimodal critics are primarily trained on general visual domains (captioning, image QA) but lack capability for physical AI tasks involving perception, causal reasoning, and planning. There's a need for specialized critics that can reliably evaluate physically grounded responses.
Method: Two-stage RLVR pipeline: 1) Physical skill warmup stage enhances physically oriented perception and reasoning, 2) Self-referential critic finetuning where the critic generates its own prediction as internal reference before judging candidate responses to improve judgment stability and physical correctness.
Result: PhyCritic achieves strong performance gains over open-source baselines across both physical and general-purpose multimodal judge benchmarks. When applied as a policy model, it further improves perception and reasoning in physically grounded tasks.
Conclusion: PhyCritic demonstrates that specialized critic models for physical AI can outperform general-purpose critics, and the self-referential approach improves judgment stability. The model shows promise for advancing evaluation and alignment in physically grounded multimodal tasks.
Abstract: With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
[189] Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
Main category: cs.CV
TL;DR: DiNa-LRM: A diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states for more efficient and effective alignment of diffusion models.
Details
Motivation: Current preference optimization for diffusion models relies on Vision-Language Models (VLMs) as reward providers, which are computationally expensive and suffer from domain mismatch when optimizing latent diffusion generators through pixel-space rewards.
Method: Proposes DiNa-LRM with noise-calibrated Thurstone likelihood using diffusion-noise-dependent uncertainty. Uses pretrained latent diffusion backbone with timestep-conditioned reward head and supports inference-time noise ensembling for test-time scaling.
Result: Outperforms existing diffusion-based reward baselines across image alignment benchmarks, achieves performance competitive with state-of-the-art VLMs at fraction of computational cost, and improves preference optimization dynamics for faster, more resource-efficient model alignment.
Conclusion: DiNa-LRM provides an efficient diffusion-native reward mechanism that addresses computational cost and domain mismatch issues in preference optimization for diffusion models.
Abstract: Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
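The noise-calibrated Thurstone likelihood admits a compact sketch: the probability that the preferred sample outranks the rejected one is a Gaussian CDF of the reward gap, with comparison noise that grows with the diffusion noise level. The sigma schedule and reduction below are assumptions:

```python
import torch

def thurstone_preference_loss(r_win, r_lose, sigma_t):
    """Sketch of a noise-calibrated Thurstone pairwise loss.
    r_win, r_lose: rewards on noisy latents of the preferred / rejected
    samples at the same timestep; sigma_t: noise-dependent uncertainty."""
    normal = torch.distributions.Normal(0.0, 1.0)
    p_win = normal.cdf((r_win - r_lose) / sigma_t)   # Thurstone-style preference
    return -p_win.clamp_min(1e-6).log().mean()       # negative log-likelihood
```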
[190] SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos
Yue Gao, Hong-Xing Yu, Sanghyeon Chang, Qianxi Fu, Bo Zhu, Yoonjin Won, Juan Carlos Niebles, Jiajun Wu
Main category: cs.CV
TL;DR: SurfPhase: A neural rendering method for reconstructing 3D interfacial dynamics in two-phase flows from sparse camera views, using dynamic Gaussian surfels with SDF formulation and video diffusion for novel-view synthesis.
Details
Motivation: Interfacial dynamics in two-phase flows are crucial for understanding momentum, heat, and mass transfer but are experimentally challenging to measure. Existing neural rendering methods are designed for single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces.
Method: Proposes SurfPhase, which integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations.
Result: Evaluated on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views.
Conclusion: SurfPhase enables accurate reconstruction of 3D interfacial dynamics in complex two-phase flows from limited camera views, overcoming limitations of traditional techniques and existing neural rendering methods.
Abstract: Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: https://yuegao.me/SurfPhase.
[191] Are Dense Labels Always Necessary for 3D Object Detection from Point Cloud?
Chenqiang Gao, Chuandong Liu, Jun Shu, Fangcen Liu, Jiang Liu, Luyu Yang, Xinbo Gao, Deyu Meng
Main category: cs.CV
TL;DR: SS3D++: A sparsely-annotated 3D object detection framework that uses only one annotated object per scene to reduce annotation costs while maintaining competitive performance through iterative detector training and confident scene generation.
Details
Motivation: Current 3D object detection methods require costly dense 3D bounding box annotations. The authors aim to reduce annotation burden by proposing a sparse annotation strategy (one object per scene) while addressing performance deterioration from incomplete supervision.
Method: Developed SS3D++ method that alternates between improving 3D detector training and generating confident fully-annotated scenes. Uses sparse annotations as seeds, with missing-annotated instance mining and reliable background mining modules to progressively generate complete annotations.
Result: Achieves competitive results compared to SOTA weakly-supervised methods with same annotation cost. On KITTI: on-par or better than fully-supervised methods with 5x less annotation cost. On Waymo: 90% of fully-supervised performance with 15x less annotation cost. Unlabeled scenes further boost performance.
Conclusion: SS3D++ effectively reduces annotation costs for 3D object detection while maintaining competitive performance through sparse supervision and iterative scene generation, making it practical for real-world applications.
Abstract: Current state-of-the-art (SOTA) 3D object detection methods often require a large amount of 3D bounding box annotations for training. However, collecting such large-scale densely-supervised datasets is notoriously costly. To reduce the cumbersome data annotation process, we propose a novel sparsely-annotated framework, in which we annotate just one 3D object per scene. Such a sparse annotation strategy could significantly reduce the heavy annotation burden, while inexact and incomplete sparse supervision may severely deteriorate the detection performance. To address this issue, we develop the SS3D++ method that alternately improves 3D detector training and confident fully-annotated scene generation in a unified learning scheme. Using sparse annotations as seeds, we progressively generate confident fully-annotated scenes using a missing-annotated instance mining module and a reliable background mining module. Our proposed method produces competitive results when compared with SOTA weakly-supervised methods that use the same or even higher annotation cost. Besides, compared with SOTA fully-supervised methods, we achieve on-par or even better performance on the KITTI dataset with about 5x less annotation cost, and 90% of their performance on the Waymo dataset with about 15x less annotation cost. Additional unlabeled training scenes can further boost the performance.
[192] ZebraPose: Zebra Detection and Pose Estimation using only Synthetic Data
Elia Bonetto, Aamir Ahmad
Main category: cs.CV
TL;DR: Synthetic 3D photorealistic dataset for zebra detection and 2D pose estimation eliminates need for real data or domain adaptation, works for aerial/wildlife scenarios where detection models typically fail.
Details
Motivation: Real-world animal datasets are hard to collect, especially for out-of-distribution viewpoints like aerial imagery. Existing synthetic data approaches require bridging the synthetic-to-real gap using real images, style constraints, or pre-trained networks.
Method: Generate fully synthetic dataset using 3D photorealistic simulator. Train both detection and 2D pose estimation models from scratch on this synthetic data, enabling joint training of both tasks without assuming reliable detection models.
Result: Models trained exclusively on synthetic data generalize well to real images across multiple real-world and synthetic datasets, with different backbones and image resolutions. Outperforms methods that require real data or domain adaptation.
Conclusion: High-quality synthetic data from 3D photorealistic simulators can eliminate the need for real data collection and complex domain adaptation strategies for wildlife monitoring tasks like detection and pose estimation.
Abstract: Collecting and labeling large real-world wild animal datasets is impractical, costly, error-prone, and labor-intensive. For animal monitoring tasks, such as detection, tracking, and pose estimation, out-of-distribution viewpoints (e.g. aerial) are also typically needed but rarely found in publicly available datasets. To solve this, existing approaches synthesize data with simplistic techniques that then necessitate strategies to bridge the synthetic-to-real gap. Therefore, real images, style constraints, complex animal models, or pre-trained networks are often leveraged. In contrast, we generate a fully synthetic dataset using a 3D photorealistic simulator and demonstrate that it can eliminate such needs for detecting and estimating 2D poses of wild zebras. Moreover, existing top-down 2D pose estimation approaches using synthetic data assume reliable detection models. However, these often fail in out-of-distribution scenarios, e.g. those that include wildlife or aerial imagery. Our method overcomes this by enabling the training of both tasks using the same synthetic dataset. Through extensive benchmarks, we show that models trained from scratch exclusively on our synthetic data generalize well to real images. We perform these benchmarks using multiple real-world and synthetic datasets, pre-trained and randomly initialized backbones, and different image resolutions. Code, results, models, and data can be found at https://zebrapose.is.tue.mpg.de/.
[193] Symmetrization Weighted Binary Cross-Entropy: Modeling Perceptual Asymmetry for Human-Consistent Neural Edge Detection
Hao Shu
Main category: cs.CV
TL;DR: SWBCE loss improves edge detection by modeling perceptual asymmetry, enhancing both numerical accuracy and visual quality through symmetric learning.
Details
Motivation: Current edge detection models achieve high numerical accuracy but produce edges that lack visual sharpness and perceptual consistency, limiting their reliability in intelligent vision systems. There's a need for loss functions that better align with human perceptual discrimination.
Method: Introduces Symmetrization Weighted Binary Cross-Entropy (SWBCE) loss, which extends conventional WBCE by incorporating prediction-guided symmetry. It explicitly models perceptual asymmetry in human edge recognition, where edge decisions require stronger evidence than non-edge ones.
Result: SWBCE outperforms existing loss functions across multiple benchmark datasets and ED architectures, improving SSIM by about 15% on BRIND with HED-EES model. Consistently achieves best perceptual results in all experiments while enhancing edge recall and suppressing false positives.
Conclusion: SWBCE provides a perception-inspired formulation that balances quantitative accuracy and perceptual fidelity in edge detection. The approach offers a generalizable optimization principle for neural learning systems where asymmetric perceptual reasoning is critical.
Abstract: Edge detection (ED) is a fundamental perceptual process in computer vision, forming the structural basis for high-level reasoning tasks such as segmentation, recognition, and scene understanding. Despite substantial progress achieved by deep neural networks, most ED models attain high numerical accuracy but fail to produce visually sharp and perceptually consistent edges, thereby limiting their reliability in intelligent vision systems. To address this issue, this study introduces the Symmetrization Weighted Binary Cross-Entropy (SWBCE) loss, a perception-inspired formulation that extends the conventional WBCE by incorporating prediction-guided symmetry. SWBCE explicitly models the perceptual asymmetry in human edge recognition, wherein edge decisions require stronger evidence than non-edge ones, aligning the optimization process with human perceptual discrimination. The resulting symmetric learning mechanism jointly enhances edge recall and suppresses false positives, achieving a superior balance between quantitative accuracy and perceptual fidelity. Extensive experiments across multiple benchmark datasets and representative ED architectures demonstrate that SWBCE can outperform existing loss functions in both numerical evaluation and visual quality. In particular, with the HED-EES model, SSIM improves by about 15% on BRIND, and across all experiments, training with SWBCE consistently yields the best perceptual results. Beyond edge detection, the proposed perceptual loss offers a generalizable optimization principle for soft computing and neural learning systems, particularly in scenarios where asymmetric perceptual reasoning plays a critical role.
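The paper's exact symmetrization is not reproduced above, so the sketch below only illustrates the idea of mixing a label-weighted BCE with a prediction-guided mirror term; the weighting scheme, lam, and gamma are all assumptions:

```python
import torch
import torch.nn.functional as F

def swbce_loss(pred, target, lam=1.1, gamma=0.5):
    """Hedged sketch of a symmetrization-weighted BCE for edge detection.
    pred: sigmoid probabilities; target: binary edge map."""
    eps = 1e-6
    pos = target.mean().clamp(eps, 1 - eps)            # edge-pixel fraction
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    # class-balanced weights conditioned on the ground-truth labels
    w_label = torch.where(target > 0.5, lam * (1 - pos), pos)
    # prediction-guided mirror: the same weighting driven by model output
    w_pred = torch.where(pred.detach() > 0.5, lam * (1 - pos), pos)
    return ((1 - gamma) * w_label * bce + gamma * w_pred * bce).mean()
```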
[194] ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong
Main category: cs.CV
TL;DR: Evolution-based algorithm optimizes visually discriminative prompts for fine-grained image classification using LLM-generated descriptions with minimal supervision
Details
Motivation: Current vision-language models rely on prompt quality, and while LLM-generated descriptions help, they often suffer from hallucinations leading to inaccurate or non-discriminative class-specific prompts. There's a need for visually discriminative prompts for fine-grained categories with minimal supervision.
Method: Proposes evolution-based algorithm to optimize prompts from task-specific templates to class-specific descriptions. Uses edit-based and evolution-based operations to generate diverse candidate prompts via one-time LLM query. Implements sampling strategies for better initial search and reduced category traversal. Applies fitness score with entropy constraints to mitigate overfitting.
Result: Outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets in one-shot image classification. Optimal prompts also improve adapter-based methods and transfer effectively across different backbones.
Conclusion: The evolution-based approach successfully generates visually discriminative prompts for fine-grained classification with minimal supervision, addressing LLM hallucination issues and improving VLM performance.
Abstract: Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performance largely depends on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space of class-specific candidate prompts explodes. This increases prompt-generation cost, the number of iterations, and the risk of overfitting. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts with a single query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce the traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
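A schematic of the evolution loop, with toy mutate/crossover operators standing in for the LLM-generated edit- and evolution-based operations; the fitness function (which in the paper scores one-shot accuracy under an entropy constraint) is passed in, and every name here is hypothetical:

```python
import random

def mutate(prompt, vocab):
    """Edit-based operation (toy): replace one description at random."""
    out = list(prompt)
    out[random.randrange(len(out))] = random.choice(vocab)
    return out

def crossover(a, b):
    """Evolution-based operation (toy): splice two parent prompt sets."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def evolve_prompts(pool, vocab, fitness_fn, rounds=5, keep=4):
    """Select the fittest prompt sets, then expand the pool with mutated
    and crossed-over children before the next round."""
    for _ in range(rounds):
        survivors = sorted(pool, key=fitness_fn, reverse=True)[:keep]
        children = [mutate(p, vocab) for p in survivors]
        children.append(crossover(*random.sample(survivors, 2)))
        pool = survivors + children
    return max(pool, key=fitness_fn)
```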
[195] OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning
Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang
Main category: cs.CV
TL;DR: OmniDiff dataset with 324 diverse scenarios and M³Diff model with Multi-scale Differential Perception module for improved image difference captioning across complex environments.
Details
Motivation: Existing image difference captioning datasets lack breadth (limited variations in specific scenes) and depth (overly simplistic descriptions), limiting applicability in complex dynamic environments.
Method: Introduces OmniDiff dataset with 324 diverse scenarios (real-world complex environments + 3D synthetic settings) with fine-grained human annotations, and M³Diff model with plug-and-play Multi-scale Differential Perception module for enhanced difference perception.
Result: M³Diff achieves state-of-the-art performance across multiple benchmarks including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, with significant improvements in cross-scenario difference recognition accuracy.
Conclusion: The comprehensive OmniDiff dataset and M³Diff model with MDP module advance image difference captioning capabilities, with public release of dataset, code, and models to support further research.
Abstract: Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce OmniDiff, a comprehensive dataset comprising 324 diverse scenarios, spanning real-world complex environments and 3D synthetic settings, with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose M³Diff, a multimodal large language model enhanced by a plug-and-play Multi-scale Differential Perception (MDP) module. This module improves the model’s ability to accurately identify and describe inter-image differences while maintaining the foundational model’s generalization capabilities. With the addition of the OmniDiff dataset, M³Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.
[196] GMG: A Video Prediction Method Based on Global Focus and Motion Guided
Yuhao Du, Hui Liu, Haoxiang Peng, Xinyuan Cheng, Chengrong Wu, Jiankai Zhang
Main category: cs.CV
TL;DR: Proposes GMG model with Global Focus Module and Motion Guided Module for weather forecasting to capture teleconnections and handle non-rigid body deformations in meteorological data.
Details
Motivation: Weather forecasting is challenging due to rapid variability of meteorological data and potential teleconnections. Current spatiotemporal models using convolution or sliding windows have limited receptive fields and struggle with non-rigid body deformations in weather data.
Method: GMG model with two key components: 1) Global Focus Module to enhance global receptive field for capturing teleconnections, and 2) Motion Guided Module to adapt to growth/dissipation processes of non-rigid bodies in weather data.
Result: Demonstrates competitive performance across various complex tasks, providing improved predictive accuracy for complex spatiotemporal data.
Conclusion: The GMG model offers a novel approach to address core challenges in weather forecasting by better capturing global teleconnections and adapting to non-rigid body dynamics in meteorological data.
Abstract: In recent years, weather forecasting has gained significant attention. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolution operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection features in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.
[197] ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving via Multi-modal Joint Learning
Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang
Main category: cs.CV
TL;DR: ADGaussian: A novel approach for generalizable street scene reconstruction from single-view input using joint optimization of image and depth features with Gaussian Splatting.
Details
Motivation: Prior Gaussian Splatting methods focus mainly on geometry refinement, but the authors argue that joint optimization of image and depth features is crucial for accurate Gaussian prediction in street scene reconstruction from single-view inputs.
Method: Incorporates sparse LiDAR depth as additional input modality, formulates Gaussian prediction as joint learning framework of visual and geometric information, proposes Multi-modal Feature Matching strategy with Multi-scale Gaussian Decoding model for joint refinement of multi-modal features.
Result: Extensive experiments on Waymo and KITTI datasets show state-of-the-art performance and superior zero-shot generalization capabilities in novel-view shifting.
Conclusion: ADGaussian demonstrates effective joint optimization of multi-modal features for high-quality street scene reconstruction from single-view inputs with strong generalization capabilities.
Abstract: We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from merely single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clues. Furthermore, we propose a Multi-modal Feature Matching strategy coupled with a Multi-scale Gaussian Decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on Waymo and KITTI demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.
[198] Geospatial Representation Learning: A Survey from Deep Learning to The LLM Era
Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Song Gao, Flora Salim, Tianrui Li, Yuxuan Liang
Main category: cs.CV
TL;DR: A comprehensive survey of Geospatial Representation Learning (GRL) covering both deep learning and large language model paradigms for processing location-centric data across multiple modalities.
Details
Motivation: Geospatial data transformation into computational representations is fundamental for modern spatial analysis. The field is undergoing transformation through deep learning and LLM revolutions, requiring a comprehensive review to organize advancements and provide a roadmap.
Method: Survey paper organizing GRL into a structured taxonomy based on three perspectives: (1) data perspective (types of geospatial data), (2) methodological perspective (learning approaches), and (3) application perspective (use cases). Covers both deep learning and LLM paradigms.
Result: Provides comprehensive review of geospatial representation learning across technological eras, highlighting current advancements, discussing limitations, and proposing future research directions in the LLM/foundation model era.
Conclusion: The survey offers thorough exploration of GRL field and provides roadmap for further innovation, particularly emphasizing the transformative potential of LLMs for cross-modal geospatial reasoning and unstructured geo-textual data processing.
Abstract: The ability to transform location-centric geospatial data into meaningful computational representations has become fundamental to modern spatial analysis and decision-making. Geospatial Representation Learning (GRL), the process of automatically extracting latent structures and semantic patterns from geographic data, is undergoing a profound transformation through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured and semi-structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective, and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM and foundation model era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in GRL. The summary of the up-to-date paper list can be found in https://github.com/CityMind-Lab/Awesome-Geospatial-Representation-Learning and will undergo continuous updates.
[199] From Pixels to Images: A Structural Survey of Deep Learning Paradigms in Remote Sensing Image Semantic Segmentation
Quanwei Liu, Tao Huang, Jiaqi Yang, Wei Xiang
Main category: cs.CV
TL;DR: Comprehensive review of deep learning-based remote sensing image semantic segmentation organized by granularity hierarchy (pixel-patch-tile-image), covering evolution from early methods to modern vision foundation models.
Details
Motivation: Traditional remote sensing image processing struggles with efficiency and accuracy as data diversity and volume increase. Existing reviews lack a unified operational perspective aligned with segmentation granularity and the training/inference pipeline.
Method: Organizes DL-based RSISS into a pixel-patch-tile-image hierarchy, covering early pixel-based methods, prevailing patch-based/tile-based techniques, and emerging image-based approaches with vision foundation models.
Result: Provides holistic structured understanding of DL-based RSISS evolution, highlighting representative datasets, comparative insights, and open challenges in data scale, model efficiency, domain robustness, and multimodal integration.
Conclusion: DL-based RSISS has evolved structurally from pixel-level to image-level modeling, with vision foundation models representing the latest advancement. The review offers unified framework for understanding this evolution and identifies key research directions.
Abstract: Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in RS analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating hierarchical feature extraction and improving segmentation performance across diverse modalities. As data scale and model capacity have increased, DL-based RSISS has undergone a structural evolution from pixel-level and patch-based classification to tile-level, end-to-end segmentation, and, more recently, to image-level modelling with vision foundation models. However, existing reviews often focus on individual components, such as supervision strategies or fusion stages, and lack a unified operational perspective aligned with segmentation granularity and the training/inference pipeline. This paper provides a comprehensive review by organizing DL-based RSISS into a pixel-patch-tile-image hierarchy, covering early pixel-based methods, prevailing patch-based and tile-based techniques, and emerging image-based approaches. This review offers a holistic and structured understanding of DL-based RSISS, highlighting representative datasets, comparative insights, and open challenges related to data scale, model efficiency, domain robustness, and multimodal integration. Furthermore, to facilitate reproducible research, curated code collections are provided at: https://github.com/quanweiliu/PatchwiseClsFra and https://github.com/quanweiliu/TilewiseSegFra.
[200] Corruption-Aware Training of Latent Video Diffusion Models for Robust Text-to-Video Generation
Chika Maduabuchi, Hao Chen, Yujin Han, Jindong Wang
Main category: cs.CV
TL;DR: CAT-LVDM is a corruption-aware training framework for Latent Video Diffusion Models that uses structured, data-aligned noise injection to improve robustness against noisy conditioning and prevent semantic drift in video generation.
Details
Motivation: Latent Video Diffusion Models are brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion fail in video settings because static noise disrupts temporal fidelity.
Method: Proposes CAT-LVDM, a corruption-aware training scheme with structured, data-aligned noise injection tailored to video diffusion. Its two operators are Batch-Centered Noise Injection (BCNI), which aligns perturbations with batch semantics, and Spectrum-Aware Contextual Noise (SACN), which aligns them with spectral dynamics to preserve coherence (see the code sketch after the abstract).
Result: BCNI reduces FVD by 31.9% on WebVid-2M, MSR-VTT, and MSVD datasets. SACN improves UCF-101 by 12.3%. Outperforms Gaussian, Uniform, and large diffusion baselines like DEMO (2.3B) and LaVie (3B) despite training on 5x less data.
Conclusion: CAT-LVDM introduces a principled framework for robust video diffusion that demonstrates transferability to autoregressive generation and multimodal video understanding models, with theoretical analysis establishing why these operators tighten robustness and generalization bounds.
Abstract: Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art generative quality for image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (e.g., Gaussian, Uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose CAT-LVDM, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators, Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), align perturbations with batch semantics or spectral dynamics to preserve coherence. CAT-LVDM yields substantial gains: BCNI reduces FVD by 31.9 percent on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by 12.3 percent, outperforming Gaussian, Uniform, and large diffusion baselines such as DEMO (2.3B) and LaVie (3B) despite training on 5x less data. Ablations confirm the unique value of low-rank, data-aligned noise, and theoretical analysis establishes why these operators tighten robustness and generalization bounds. CAT-LVDM thus introduces a principled framework for robust video diffusion and further demonstrates transferability to autoregressive generation and multimodal video understanding models.
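The paper releases no code in this digest, so the following is only a rough Python sketch of what a batch-centered corruption operator could look like: each conditioning embedding is perturbed along its offset from the batch centroid, so injected noise stays within the batch's semantic directions rather than being isotropic. The projection form and the `sigma` scale are assumptions, not the authors' exact BCNI formulation.

```python
import torch

def batch_centered_noise(cond: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Corrupt conditioning embeddings with batch-aligned (non-isotropic) noise.

    cond: (B, D) batch of text/multimodal conditioning embeddings.
    Noise is projected onto each sample's offset from the batch mean, so
    perturbations stay inside the batch's semantic subspace instead of
    pointing in arbitrary directions (a hypothetical reading of BCNI).
    """
    center = cond.mean(dim=0, keepdim=True)        # (1, D) batch centroid
    offset = cond - center                         # (B, D) semantic directions
    direction = torch.nn.functional.normalize(offset, dim=-1)
    eps = torch.randn_like(cond)                   # raw Gaussian noise
    # Keep only the noise component along each batch-centered direction.
    aligned = (eps * direction).sum(dim=-1, keepdim=True) * direction
    return cond + sigma * aligned

# Usage: corrupt conditioning embeddings during training only.
emb = torch.randn(8, 512)
noisy = batch_centered_noise(emb, sigma=0.05)
```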
[201] FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos
Kavitha Viswanathan, Vrinda Goel, Shlesh Gholap, Devayan Ghosh, Madhav Gupta, Dhruvi Ganatra, Sanket Potdar, Amit Sethi
Main category: cs.CV
TL;DR: FANVID is a video-based benchmark for temporal recognition of faces and license plates in low-resolution surveillance footage, featuring 1,463 clips with 63 identities and 49 plates across three English-speaking countries.
Details
Motivation: Real-world surveillance often produces low-resolution footage where faces and license plates are unrecognizable in individual frames, creating a need for temporal recognition models that can exploit information across multiple frames.
Method: Created a benchmark dataset with 1,463 LR video clips (180x320, 20-60 FPS) containing 63 identities and 49 license plates, downsampled from high-resolution sources to ensure single-frame indecipherability. Includes distractor faces/plates for realism. Defines two tasks: face matching (detecting LR faces and matching to HR mugshots) and license plate recognition (extracting text without a predefined database); a metric sketch follows the abstract.
Result: Dataset contains 31,096 manually verified bounding boxes and labels. Baseline method using pre-trained video super-resolution, detection, and recognition achieved scores of 0.58 for face matching and 0.42 for plate recognition, demonstrating both feasibility and challenge.
Conclusion: FANVID provides a benchmark to catalyze innovation in temporal modeling for low-resolution recognition, with applications in surveillance, forensics, and autonomous vehicles. The dataset and evaluation tools are released to support reproducibility and extension.
Abstract: Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising 1,463 LR clips (180 x 320, 20–60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching – detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition – extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU > 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID’s selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
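The abstract specifies character-level accuracy for the plate-recognition task without giving the exact formula; a common choice is one minus the normalized edit distance, sketched below under that assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_accuracy(pred: str, gt: str) -> float:
    """1 - normalized edit distance; 1.0 means an exact plate read."""
    if not gt:
        return float(pred == gt)
    return max(0.0, 1.0 - levenshtein(pred, gt) / len(gt))

print(char_accuracy("ABC123", "A8C123"))  # one wrong character -> ~0.83
```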
[202] MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng
Main category: cs.CV
TL;DR: MME-Emotion is a comprehensive benchmark for evaluating multimodal large language models’ emotional intelligence, featuring 6,000+ video clips with QA pairs across 8 emotional tasks to assess both understanding and reasoning capabilities.
Details
Motivation: Current emotional benchmarks for MLLMs are limited in assessing generalization across scenarios and reasoning about emotional triggers. There's a need for systematic evaluation of MLLMs' emotional intelligence capabilities.
Method: Created MME-Emotion benchmark with over 6,000 curated video clips and task-specific QA pairs spanning 8 emotional tasks. Uses holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework.
Result: Evaluation of 20 advanced MLLMs shows unsatisfactory emotional intelligence - best model achieved only 39.3% recognition score and 56.0% CoT score. Generalist models derive emotional intelligence from multimodal understanding, while specialist models achieve comparable performance through domain-specific adaptation.
Conclusion: MME-Emotion serves as a foundation for advancing MLLMs’ emotional intelligence, revealing current limitations and providing insights into how different model types approach emotional understanding and reasoning.
Abstract: Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a 39.3% recognition score and a 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. By introducing MME-Emotion, we hope that it can serve as a foundation for advancing MLLMs’ emotional intelligence in the future.
[203] Visibility-Aware Language Aggregation for Open-Vocabulary Segmentation in 3D Gaussian Splatting
Sen Wang, Kunyi Li, Siyun Liang, Elena Alegret, Jing Ma, Nassir Navab, Stefano Gasperini
Main category: cs.CV
TL;DR: VALA improves 3D Gaussian language feature distillation by addressing two key issues: background Gaussians getting same features as foreground, and multi-view inconsistencies from noisy language embeddings.
Details
Motivation: Existing methods for distilling language features from 2D images into 3D Gaussians have two fundamental problems: 1) background Gaussians that contribute little to rendered pixels get the same features as dominant foreground ones, and 2) multi-view inconsistencies arise from view-specific noise in language embeddings.
Method: VALA uses Visibility-Aware Language Aggregation, which computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Also introduces a streaming weighted geometric median in cosine space to merge noisy multi-view features (see the code sketch after the abstract).
Result: VALA yields robust, view-consistent language feature embeddings efficiently. Improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works.
Conclusion: VALA provides a lightweight yet effective method for distilling language features into 3D Gaussians that addresses visibility and consistency issues, enabling better 3D scene understanding with language.
Abstract: Recently, distilling open-vocabulary language features from 2D images into 3D Gaussians has attracted significant attention. Although existing methods achieve impressive language-based interactions of 3D scenes, we observe two fundamental issues: background Gaussians contributing negligibly to a rendered pixel get the same feature as the dominant foreground ones, and multi-view inconsistencies due to view-specific noise in language embeddings. We introduce Visibility-Aware Language Aggregation (VALA), a lightweight yet effective method that computes marginal contributions for each ray and applies a visibility-aware gate to retain only visible Gaussians. Moreover, we propose a streaming weighted geometric median in cosine space to merge noisy multi-view features. Our method yields a robust, view-consistent language feature embedding in a fast and memory-efficient manner. VALA improves open-vocabulary localization and segmentation across reference datasets, consistently surpassing existing works. More results are available at https://vala3d.github.io
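As a hedged illustration of the "weighted geometric median in cosine space", the sketch below runs a Weiszfeld iteration over L2-normalized per-view features, re-projecting onto the unit sphere after each step; the streaming (online) update and the exact visibility weights are the paper's details and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def weighted_geometric_median(feats: torch.Tensor, weights: torch.Tensor,
                              iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Weiszfeld iteration for the weighted geometric median of unit vectors.

    feats: (N, D) per-view language features, L2-normalized ("cosine space").
    weights: (N,) per-view weights (e.g., from a visibility gate).
    Returns a robust, unit-norm fused feature.
    """
    x = F.normalize(feats, dim=-1)
    m = F.normalize((weights[:, None] * x).sum(0), dim=-1)  # weighted-mean init
    for _ in range(iters):
        d = (x - m).norm(dim=-1).clamp_min(eps)   # distances to current estimate
        w = weights / d                           # Weiszfeld reweighting
        m = F.normalize((w[:, None] * x).sum(0) / w.sum(), dim=-1)
    return m

views = F.normalize(torch.randn(12, 256), dim=-1)  # 12 noisy view features
vis = torch.rand(12)                               # stand-in visibility weights
fused = weighted_geometric_median(views, vis)
```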
[204] Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
Main category: cs.CV
TL;DR: SD-RPN: A self-distilled region proposal network that efficiently extracts high-resolution visual regions from MLLMs without annotations, using attention maps from middle layers to train a lightweight RPN for fine-grained perception.
Details
Motivation: MLLMs need high-resolution visual information for fine-grained perception, but processing full high-res images is computationally expensive. Existing RoI methods face trade-offs: training-based approaches need large annotated datasets, while training-free methods using internal attention are inefficient and inaccurate.
Method: Proposes a Self-Distilled Region Proposal Network (SD-RPN) that transforms noisy attention maps from MLLM middle layers into high-quality pseudo-RoI labels through denoising and ambiguity resolution (see the code sketch after the abstract). Uses these labels to train a lightweight RPN that predicts RoIs in a single forward pass from MLLM middle-layer features.
Result: Achieves over 10% absolute accuracy improvement on unseen benchmarks (TextVQA, DocVQA, V-Star) despite training on only ~10K question-answer pairs. Demonstrates exceptional data efficiency and generalization across multiple MLLM families.
Conclusion: SD-RPN provides practical, scalable solution for enhancing MLLM fine-grained perception without costly supervision or full model fine-tuning, resolving efficiency-accuracy trade-off in high-resolution visual processing.
Abstract: Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model’s internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM’s middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM’s middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
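A minimal sketch of how noisy attention maps might be turned into pseudo-RoI boxes, assuming quantile thresholding for denoising and a largest-connected-component rule for ambiguity resolution; the paper's actual labeling pipeline may differ.

```python
import numpy as np
from scipy import ndimage

def attention_to_pseudo_roi(attn: np.ndarray, keep_quantile: float = 0.9):
    """Turn a noisy (H, W) attention map into a single pseudo-RoI box.

    Denoise by thresholding at a high quantile, resolve ambiguity by keeping
    only the largest connected component, and return its bounding box
    (x0, y0, x1, y1). The threshold choice is a guess, not the paper's recipe.
    """
    mask = attn >= np.quantile(attn, keep_quantile)
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    largest = labels == (1 + int(np.argmax(sizes)))
    ys, xs = np.nonzero(largest)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

attn = np.random.rand(24, 24)          # stand-in middle-layer attention map
print(attention_to_pseudo_roi(attn))
```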
[205] Enhancing Vehicle Detection under Adverse Weather Conditions with Contrastive Learning
Boying Li, Chang Liu, Petter Kyösti, Mattias Öhman, Devashish Singha Roy, Sofia Plazzi, Hamam Mokayed, Olle Hagner
Main category: cs.CV
TL;DR: A sideload-CL-adaptation framework that uses contrastive learning on unannotated UAV images to improve vehicle detection in Nordic regions with snow coverage challenges.
Details
Motivation: Vehicle detection from UAV images in Nordic regions faces visibility challenges and domain shifts due to diverse snow coverage. Annotated data is expensive, but unannotated data is cheap to obtain, so lightweight models that can leverage unannotated data are needed to improve detection performance.
Method: Proposes a sideload-CL-adaptation framework: 1) train a CNN-based representation extractor through contrastive learning on unannotated data in the pretraining stage (see the loss sketch after the abstract), then 2) sideload this extractor onto a frozen YOLO11n backbone in the fine-tuning stage. Explores various fusion methods and granularities for robust adaptation.
Result: The proposed sideload-CL-adaptation model improves detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
Conclusion: The framework effectively leverages unannotated UAV data through contrastive learning to improve vehicle detection in challenging Nordic environments with snow coverage, using lightweight models suitable for computational constraints.
Abstract: Aside from common challenges in remote sensing like small, sparse targets and computation cost limitations, detecting vehicles from UAV images in the Nordic regions faces strong visibility challenges and domain shifts caused by diverse levels of snow coverage. Although annotated data is expensive, unannotated data is cheaper to obtain by simply flying the drones. In this work, we propose a sideload-CL-adaptation framework that enables the use of unannotated data to improve vehicle detection using lightweight models. Specifically, we propose to train a CNN-based representation extractor through contrastive learning on the unannotated data in the pretraining stage, and then sideload it to a frozen YOLO11n backbone in the fine-tuning stage. To find a robust sideload-CL-adaptation, we conducted extensive experiments to compare various fusion methods and granularity. Our proposed sideload-CL-adaptation model improves the detection performance by 3.8% to 9.5% in terms of mAP50 on the NVD dataset.
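The paper does not name its contrastive objective in this digest; a SimCLR-style InfoNCE loss over two augmented views of the same unannotated UAV crop is a plausible stand-in for the pretraining stage, sketched below.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """SimCLR-style contrastive loss between two augmented views.

    z1, z2: (B, D) projections of two augmentations of the same UAV crops.
    Matching rows are positives; all other rows in the batch are negatives.
    """
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                    # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize: view 1 -> view 2 and view 2 -> view 1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```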
[206] GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen
Main category: cs.CV
TL;DR: GeoPurify: A method that uses a small Student Affinity Network to purify 2D VLM-generated 3D point features by leveraging geometric priors from a 3D self-supervised teacher model, achieving state-of-the-art 3D semantic segmentation with minimal training data.
Details
Motivation: Current approaches for transferring 2D Vision-Language Model features to 3D semantic segmentation face a trade-off: direct projection yields noisy results, while enforcing geometric coherence requires expensive training pipelines and large annotated 3D datasets. The segmentation-and-matching paradigm fails to reconcile 2D semantics with 3D geometric structure.
Method: GeoPurify uses a Student Affinity Network to purify 2D VLM-generated 3D point features by distilling geometric priors from a 3D self-supervised teacher model. During inference, a Geometry-Guided Pooling module denoises the point cloud and ensures semantic and structural consistency.
Result: Extensive experiments on major 3D benchmarks show GeoPurify achieves or surpasses state-of-the-art performance while using only about 1.5% of the training data, effectively mitigating the trade-off between 2D feature quality and 3D geometric coherence.
Conclusion: GeoPurify successfully exploits latent geometric information in noisy 2D-to-3D transferred features through a learned affinity network, achieving superior data efficiency and performance in 3D semantic segmentation without requiring large annotated 3D datasets.
Abstract: Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose GeoPurify that applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data.
[207] Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead
Jindong Li, Dario Zanca, Vincent Christlein, Tim Hamann, Jens Barth, Peter Kämpf, Björn Eskofier
Main category: cs.CV
TL;DR: ECHWR is a training framework for edge-based online handwriting recognition that improves accuracy without increasing inference costs by using contrastive learning with error-based hard negatives during training only.
Details
Motivation: Edge-based handwriting recognition from inertial sensors improves privacy and reduces latency, but faces memory constraints. Current methods struggle to balance accuracy with computational efficiency for deployment on edge devices.
Method: Proposes Error-enhanced Contrastive Handwriting Recognition (ECHWR) with a temporary auxiliary branch during training that aligns sensor signals with text embeddings using a dual contrastive loss: an in-batch contrastive loss for modality alignment and a novel error-based contrastive loss that distinguishes correct signals from synthetic hard negatives (see the code sketch after the abstract). The auxiliary branch is discarded after training.
Result: On the OnHW-Words500 dataset, ECHWR reduces character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split compared to state-of-the-art baselines. The error-based contrastive loss proves effective for handling unseen writing styles.
Conclusion: ECHWR enables improved handwriting recognition accuracy on edge devices without increasing inference costs, with error-based contrastive learning showing particular promise for generalization to unseen writing styles.
Abstract: Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges requires specific architectural and objective configurations, error-based contrastive loss shows its effectiveness for handling unseen writing styles.
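A hedged sketch of the dual contrastive objective: a CLIP-style in-batch term aligning signal and text embeddings, plus an error-based term that forces the correct transcription to score above K synthetic hard negatives. The hard-negative construction (`hard_neg_txt`) is assumed, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(sig: torch.Tensor, txt: torch.Tensor,
                          hard_neg_txt: torch.Tensor, tau: float = 0.1):
    """sig: (B, D) sensor-signal embeddings from the auxiliary branch.
    txt: (B, D) embeddings of the correct transcriptions.
    hard_neg_txt: (B, K, D) embeddings of synthetic near-miss transcriptions
    (e.g., one character swapped) -- an assumed negative-mining scheme.
    """
    sig, txt = F.normalize(sig, dim=-1), F.normalize(txt, dim=-1)
    hard = F.normalize(hard_neg_txt, dim=-1)
    # 1) In-batch alignment: match each signal to its own text.
    logits = sig @ txt.t() / tau
    targets = torch.arange(sig.size(0), device=sig.device)
    l_batch = F.cross_entropy(logits, targets)
    # 2) Error-based: the true text must beat its K hard negatives.
    pos = (sig * txt).sum(-1, keepdim=True) / tau            # (B, 1)
    neg = torch.einsum('bd,bkd->bk', sig, hard) / tau        # (B, K)
    zeros = torch.zeros(sig.size(0), dtype=torch.long, device=sig.device)
    l_err = F.cross_entropy(torch.cat([pos, neg], dim=1), zeros)
    return l_batch + l_err

loss = dual_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64),
                             torch.randn(8, 4, 64))
```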
[208] Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping
Dwip Dalal, Gautam Vashishtha, Utkarsh Mishra, Jeonghwan Kim, Madhav Kanda, Hyeonjeong Ha, Svetlana Lazebnik, Heng Ji, Unnat Jain
Main category: cs.CV
TL;DR: AttWarp improves MLLM performance by warping input images based on cross-modal attention to allocate more resolution to query-relevant regions while preserving global context, without changing model weights.
Details
Motivation: Multimodal large language models often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. There's a need to improve MLLMs' ability to perceive fine details without architectural changes.
Method: AttWarp uses an MLLM's cross-modal attention at test time to perform rectilinear warping of input images, reallocating spatial resolution toward regions the model deems important (see the code sketch after the abstract). This attention-guided warping preserves all original image information but redistributes it non-uniformly, making small objects and subtle relationships easier to perceive.
Result: Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time.
Conclusion: Attention-guided warping effectively prioritizes information relevant to the query while preserving context, and MLLMs perform better when given such warped inputs. This lightweight approach enhances fine-grained perceptual grounding without model modifications.
Abstract: Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while preserving global context. At test time, the approach uses an MLLM’s cross-modal attention to perform rectilinear warping of the input image, reallocating spatial resolution toward regions the model deems important, without changing model weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across five benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs.
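Rectilinear warping can be realized as two separable 1D resamplings driven by the attention map's row and column marginals; the inverse-CDF construction below is one plausible reading, with a `floor` mixing constant (an assumption) so low-attention context is compressed but never discarded.

```python
import numpy as np

def attention_warp(img: np.ndarray, attn: np.ndarray, floor: float = 0.3):
    """Attention-guided rectilinear resampling of an (H, W, C) image.

    attn: (H, W) query-conditioned attention map. Rows/columns with more
    attention mass receive more output pixels; nearest-neighbor sampling
    keeps the sketch short.
    """
    H, W, _ = img.shape

    def inv_cdf(marginal: np.ndarray, n: int) -> np.ndarray:
        density = floor / n + (1 - floor) * marginal / marginal.sum()
        cdf = np.concatenate([[0.0], np.cumsum(density)])
        cdf /= cdf[-1]
        # Output coordinate u samples input index F^{-1}(u): steep CDF
        # regions (high attention) absorb many output pixels -> magnified.
        u = (np.arange(n) + 0.5) / n
        return np.clip(np.interp(u, cdf, np.arange(n + 1)) - 0.5, 0, n - 1)

    rows = inv_cdf(attn.sum(axis=1), H).round().astype(int)
    cols = inv_cdf(attn.sum(axis=0), W).round().astype(int)
    return img[rows][:, cols]

warped = attention_warp(np.random.rand(32, 48, 3), np.random.rand(32, 48))
```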
[209] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
Harry Zhang, Luca Carlone
Main category: cs.CV
TL;DR: H2OFlow: A framework that learns comprehensive 3D human-object interaction affordances (contact, orientation, spatial occupancy) using only synthetic data from 3D generative models, eliminating need for human annotations.
Details
Motivation: Current approaches for 3D affordance understanding rely on labor-intensive hand-labeled datasets and are limited to contact-based analysis, neglecting important aspects like orientation preferences and spatial occupancy patterns in human-object interactions.
Method: Uses synthetic data from 3D generative models and employs a dense 3D-flow-based representation learned through a dense diffusion process on point clouds to discover rich 3D affordances without human annotations.
Result: H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance, as shown through extensive quantitative and qualitative evaluations.
Conclusion: The framework successfully learns comprehensive 3D HOI affordances using only synthetic data, addressing limitations of current annotation-dependent and contact-only approaches, enabling better understanding of human-object interactions.
Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances – encompassing contact, orientation, and spatial occupancy – using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.
[210] RepAir: A Framework for Airway Segmentation and Discontinuity Correction in CT
John M. Oyer, Ali Namvar, Benjamin A. Hoff, Wassim W. Labaki, Ella A. Kazerooni, Charles R. Hatt, Fernando J. Martinez, MeiLan K. Han, Craig J. Galbán, Sundaresh Ram
Main category: cs.CV
TL;DR: RepAir: A three-stage framework combining nnU-Net segmentation with anatomically informed topology correction for robust 3D airway segmentation from chest CT scans, outperforming existing methods on both healthy and pathological datasets.
Details
Motivation: Manual airway segmentation from CT scans is impractical, and existing automated U-Net-based methods often produce disconnected components that hinder reliable biomarker extraction for quantitative lung analysis.
Method: Three-stage framework: 1) an nnU-Net-based network produces an initial airway mask, 2) a skeleton-based algorithm identifies discontinuities and proposes reconnections, and 3) a 1D convolutional classifier determines which candidate links correspond to true anatomical branches versus false or obstructed paths (see the code sketch after the abstract).
Result: Outperforms existing 3D U-Net-based approaches (Bronchinet, NaviAirway) on both voxel-level and topological metrics across two datasets (ATM'22 with healthy subjects and AeroPath with severe pathology), producing more complete and anatomically consistent airway trees.
Conclusion: RepAir provides a robust solution for 3D airway segmentation that maintains high accuracy while addressing the connectivity issues of previous methods, enabling more reliable quantitative lung analysis.
Abstract: Accurate airway segmentation from chest computed tomography (CT) scans is essential for quantitative lung analysis, yet manual annotation is impractical and many automated U-Net-based methods yield disconnected components that hinder reliable biomarker extraction. We present RepAir, a three-stage framework for robust 3D airway segmentation that combines an nnU-Net-based network with anatomically informed topology correction. The segmentation network produces an initial airway mask, after which a skeleton-based algorithm identifies potential discontinuities and proposes reconnections. A 1D convolutional classifier then determines which candidate links correspond to true anatomical branches versus false or obstructed paths. We evaluate RepAir on two distinct datasets: ATM'22, comprising annotated CT scans from predominantly healthy subjects, and AeroPath, encompassing annotated scans with severe airway pathology. Across both datasets, RepAir outperforms existing 3D U-Net-based approaches such as Bronchinet and NaviAirway on both voxel-level and topological metrics, and produces more complete and anatomically consistent airway trees while maintaining high segmentation accuracy.
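A toy version of the stage-3 classifier: a small 1D CNN scoring a feature profile sampled along each candidate reconnection. The input features (here a single CT-intensity profile) are a guess; the digest does not specify what the paper feeds its classifier.

```python
import torch
import torch.nn as nn

class LinkClassifier(nn.Module):
    """1D CNN over profiles sampled along a candidate reconnection path;
    outputs the probability that the link is a true anatomical branch."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, profile: torch.Tensor) -> torch.Tensor:
        # profile: (B, in_ch, n_pts) values sampled along each candidate link
        return torch.sigmoid(self.net(profile)).squeeze(-1)

clf = LinkClassifier()
keep = clf(torch.randn(4, 1, 32)) > 0.5   # accept/reject candidate links
```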
[211] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering
Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li
Main category: cs.CV
TL;DR: VeriSciQA: A high-quality scientific visual question answering dataset created using cross-modal verification to filter erroneous QA pairs synthesized by LVLMs, addressing limitations in existing SVQA datasets.
Details
Motivation: Open-source LVLMs struggle with Scientific Visual Question Answering due to the lack of large-scale, high-quality datasets. Existing LVLM-synthesized datasets contain systematic errors stemming from LVLM limitations and information asymmetry between figures and text.
Method: Proposes a cross-modal verification framework that generates questions and answers from figure-citing paragraphs, then verifies them against the figures themselves, using the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs (see the outline after the abstract).
Result: Created VeriSciQA with 20,272 QA pairs across 20 scientific domains and 12 figure types. Models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size, surpassing models trained on existing datasets.
Conclusion: Cross-modal verification enables scalable creation of high-quality SVQA datasets. Continued data expansion via this framework can advance SVQA capability in the open-source community, with VeriSciQA publicly available.
Abstract: Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck is the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs’ inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a Cross-Modal verification framework that generates questions and answers purely from figure-citing paragraphs, then verifies them against the figures themselves, leveraging the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,272 QA pairs spanning 20 scientific domains and 12 figure types. Difficulty assessment reveals a notable accuracy gap between the best open-source model (65%) and the best proprietary model (80.5%), demonstrating room for improvement. Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size, surpassing models trained on existing datasets. Human evaluation further validates the improved quality of VeriSciQA. These results demonstrate that continued data expansion via our scalable framework can further advance SVQA capability in the open-source community. Our dataset is publicly available at https://huggingface.co/datasets/datajuicer/VeriSciQA.
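A schematic of the cross-modal verification loop described above; `generate_qa`, `answer_from_figure`, and `agree` are hypothetical callables standing in for the LVLM prompts and answer-matching rule, not the paper's API.

```python
from typing import Callable

def verify_qa_pairs(paragraphs, figures,
                    generate_qa: Callable, answer_from_figure: Callable,
                    agree: Callable[[str, str], bool]):
    """Keep only QA pairs whose text-derived answer is confirmed by the figure."""
    kept = []
    for para, fig in zip(paragraphs, figures):
        for question, text_answer in generate_qa(para):    # text-only generation
            figure_answer = answer_from_figure(fig, question)
            if agree(text_answer, figure_answer):          # cross-modal check
                kept.append({"question": question, "answer": text_answer})
    return kept
```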
[212] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving
Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, Hyunjung Shim
Main category: cs.CV
TL;DR: WaymoQA introduces a new safety-critical reasoning task for autonomous driving using multi-view inputs, with a dataset of 35K QA pairs to improve MLLM reasoning in high-risk scenarios.
Details
Motivation: Current MLLMs struggle with high-level reasoning in safety-critical driving scenarios where avoiding one risk can create another, requiring comprehensive multi-view understanding rather than just single front-view inputs.
Method: Defines Safety-Critical Reasoning as a new task using multi-view inputs, distills it into two stages (resolve the immediate risk, then mitigate downstream risks), and introduces the WaymoQA dataset with 35K human-annotated QA pairs across image and video modalities.
Result: Existing MLLMs underperform in safety-critical scenarios vs normal scenes, but fine-tuning with WaymoQA significantly improves reasoning ability, demonstrating dataset effectiveness for safer driving agents.
Conclusion: WaymoQA addresses a critical gap in MLLM reasoning for autonomous driving, providing a valuable benchmark and training resource for developing safer, more capable driving agents through multi-view understanding.
Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents. Our code and data are provided at https://github.com/sjyu001/WaymoQA
[213] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen
Main category: cs.CV
TL;DR: SKEL-CF: A coarse-to-fine transformer framework for estimating anatomically accurate SKEL human model parameters from images, addressing SMPL’s biomechanical limitations.
Details
Motivation: Parametric 3D human models like SMPL lack biomechanical realism due to simplified kinematics. The SKEL model offers anatomical accuracy but is challenging to estimate directly due to limited training data, perspective ambiguities, and complex articulation.
Method: Proposes SKEL-CF, a coarse-to-fine transformer encoder-decoder architecture: the encoder predicts coarse camera and SKEL parameters, then the decoder refines them progressively. Creates the 4DHuman-SKEL dataset by converting the SMPL-based 4DHuman dataset to a SKEL-aligned format, and explicitly incorporates camera modeling to address depth/scale ambiguities.
Result: Achieves 85.0 MPJPE / 51.4 PA-MPJPE on MOYO dataset, significantly outperforming previous SKEL-based SOTA HSMR (104.5 / 79.6). Demonstrates importance of camera modeling across diverse viewpoints.
Conclusion: SKEL-CF establishes a scalable, anatomically faithful framework for human motion analysis, facilitating computer vision applications in biomechanics. The implementation and converted dataset are provided.
Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, facilitating the use of computer vision techniques in biomechanics-related analysis. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.
[214] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes
Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
Main category: cs.CV
TL;DR: GeoZero enables multimodal LLMs to perform geospatial reasoning without predefined chain-of-thought supervision, using supervised fine-tuning and reinforcement learning with answer-anchored regularization.
Details
Motivation: Current remote sensing MLLMs rely on expensive, biased chain-of-thought annotations for reasoning enhancement. GeoZero aims to enable geospatial reasoning without such supervision to reduce costs and increase reasoning diversity.
Method: Two-stage approach: 1) supervised fine-tuning on the GeoZero-Instruct dataset for preliminary geospatial knowledge, then 2) reinforcement learning on the GeoZero-Hard dataset with Answer-Anchored Group Relative Policy Optimization (A²GRPO), which regularizes reasoning with the model's own answers (see the code sketch after the abstract).
Result: GeoZero surpasses state-of-the-art methods on multiple remote sensing vision-language benchmarks and fosters universal emergent reasoning capabilities across diverse geospatial tasks.
Conclusion: GeoZero demonstrates effective geospatial reasoning without chain-of-thought supervision, offering a cost-effective approach with diverse reasoning capabilities for remote sensing MLLMs.
Abstract: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A²GRPO), where the reasoning process is regularized by the model’s own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code, data, and models will be publicly available at https://github.com/MiliLab/GeoZero.
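The group-relative core of GRPO is easy to state; the sketch below standardizes rewards within each group of rollouts. The answer-anchored regularization that distinguishes A²GRPO is omitted, since its exact form is the paper's contribution.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """GRPO-style advantages: standardize rewards within each sampled group.

    rewards: (N_prompts, G) scalar rewards for G rollouts per prompt.
    Each rollout's advantage is its reward relative to its own group,
    removing the need for a learned value baseline.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two correct rollouts out of four -> positive advantages for the correct ones.
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```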
[215] Accelerating Streaming Video Large Language Models via Hierarchical Token Compression
Yiyu Wang, Xuyang Liu, Xiyan Gui, Xinying Lin, Boxue Yang, Chenfei Liao, Tailai Chen, Linfeng Zhang
Main category: cs.CV
TL;DR: STC is a hierarchical token compression framework for streaming VideoLLMs that reduces computational costs by caching similar frame features and pruning less salient visual tokens, achieving significant latency reduction with minimal accuracy loss.
Details
Motivation: Streaming VideoLLMs face high computational costs in real-time deployment due to redundant processing of temporally similar frames during ViT encoding and inflated token sequences during LLM pre-filling, creating bottlenecks for efficient streaming video understanding.
Method: STC introduces two token-level accelerators: STC-Cacher reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner compresses visual token sequences before LLM input by preserving only the most salient tokens based on spatial and temporal relevance (see the code sketch after the abstract).
Result: STC outperforms other compression methods on four baseline streaming VideoLLMs across five benchmarks, retaining up to 99% accuracy on ReKV framework while reducing ViT encoding latency by 24.5% and LLM pre-filling latency by 45.3%.
Conclusion: STC provides an effective plug-and-play solution for accelerating streaming VideoLLMs by addressing both ViT encoding and LLM pre-filling bottlenecks through hierarchical token compression, enabling more efficient real-time video understanding.
Abstract: Streaming Video Large Language Models (VideoLLMs) have demonstrated impressive performance across various video understanding tasks, but they face significant challenges in real-time deployment due to the high computational cost of processing dense visual tokens from continuous video streams. In streaming video scenarios, the primary bottleneck lies in the Vision Transformer (ViT) encoding stage, where redundant processing of temporally similar frames leads to inefficiency. Additionally, inflated token sequences during LLM pre-filling further exacerbate latency and memory overhead. To address these challenges, we propose Streaming Token Compression (STC), a plug-and-play hierarchical framework that seamlessly integrates into existing streaming VideoLLMs, optimizing both ViT encoding and LLM pre-filling stages to accelerate processing. STC introduces two token-level accelerators: STC-Cacher, which reduces ViT encoding overhead by caching and reusing features from temporally similar frames, and STC-Pruner, which compresses the visual token sequence before it enters the LLM, preserving only the most salient tokens based on both spatial and temporal relevance. Extensive experiments on four baseline streaming VideoLLMs across five benchmarks demonstrate that STC outperforms other compression methods. Notably, STC retains up to 99% of accuracy on the ReKV framework while reducing ViT encoding latency and LLM pre-filling latency by 24.5% and 45.3%.
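Both accelerators reduce to simple token-level rules; the sketch below caches ViT features when consecutive frames are nearly identical and keeps a top-k of salient tokens before pre-fill. The similarity signal, threshold, and keep ratio are assumptions, not the paper's settings.

```python
import torch

def stc_cache_or_encode(frame_emb, prev_emb, prev_feats, encode, thresh=0.95):
    """STC-Cacher idea: skip ViT encoding when the new frame is nearly
    identical to the previous one (cosine similarity on cheap frame
    embeddings). `encode` is the expensive ViT forward pass.
    """
    sim = torch.nn.functional.cosine_similarity(frame_emb, prev_emb, dim=0)
    return prev_feats if sim > thresh else encode(frame_emb)

def stc_prune(tokens: torch.Tensor, saliency: torch.Tensor, keep_ratio=0.25):
    """STC-Pruner idea: keep only the most salient visual tokens before the
    LLM pre-fill. tokens: (N, D); saliency: (N,) spatio-temporal scores.
    """
    k = max(1, int(keep_ratio * tokens.size(0)))
    idx = saliency.topk(k).indices.sort().values   # preserve original order
    return tokens[idx]

pruned = stc_prune(torch.randn(256, 768), torch.rand(256))
```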
[216] Equivariant symmetry-aware head pose estimation for fetal MRI
Ramya Muthukrishnan, Borjan Gagoski, Aryn Lee, P. Ellen Grant, Elfar Adalsteinsson, Polina Golland, Benjamin Billot
Main category: cs.CV
TL;DR: E(3)-Pose is a fast pose estimation method that explicitly models rotation equivariance and object symmetry for fetal head pose estimation in MRI scans, enabling automatic adaptive prescription of 2D diagnostic slices.
Details
Motivation: The paper addresses the challenging problem of accounting for fetal head motion during diagnostic MRI scans. Current methods struggle with clinical volumes due to pose ambiguities from anatomical symmetries, low resolution, noise, and artifacts. The goal is to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation.
Method: E(3)-Pose jointly and explicitly models rotation equivariance and object symmetry by construction. It captures anatomical symmetries and rigid pose equivariance to yield robust fetal head pose estimates from 3D MRI volumes acquired before each 2D slice.
Result: Experiments on publicly available and representative clinical fetal MRI datasets demonstrate superior robustness and generalization across domains. E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, supporting future clinical translation.
Conclusion: E(3)-Pose provides a robust solution for fetal head pose estimation in clinical MRI settings by explicitly modeling symmetry and equivariance, enabling automatic adaptive slice prescription for diagnostic imaging.
Abstract: We present E(3)-Pose, a novel fast pose estimation method that jointly and explicitly models rotation equivariance and object symmetry. Our work is motivated by the challenging problem of accounting for fetal head motion during a diagnostic MRI scan. We aim to enable automatic adaptive prescription of 2D diagnostic MRI slices with 6-DoF head pose estimation, supported by 3D MRI volumes rapidly acquired before each 2D slice. Existing methods struggle to generalize to clinical volumes, due to pose ambiguities induced by inherent anatomical symmetries, as well as low resolution, noise, and artifacts. In contrast, E(3)-Pose captures anatomical symmetries and rigid pose equivariance by construction, and yields robust estimates of the fetal head pose. Our experiments on publicly available and representative clinical fetal MRI datasets demonstrate the superior robustness and generalization of our method across domains. Crucially, E(3)-Pose achieves state-of-the-art accuracy on clinical MRI volumes, supporting future clinical translation. Our implementation is publicly available at github.com/MedicalVisionGroup/E3-Pose.
[217] Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation
Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
Main category: cs.CV
TL;DR: DAPO introduces defect-aware prompt optimization for zero-shot anomaly detection, learning hybrid prompts to align anomaly features with text semantics for improved performance under distribution shifts.
Details
Motivation: Current vision-language models for anomaly detection focus on coarse anomaly signals but neglect fine-grained defect types (hole, cut, scratch). Recognizing specific anomaly types provides richer semantic understanding and enables targeted corrective measures, but handcrafting prompts for each defect type is time-consuming and biased.
Method: DAPO (Defect-aware Prompt Optimization) uses progressive tuning to learn hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings (see the code sketch after the abstract). It aligns anomaly-relevant image features with the corresponding text semantics for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts.
Result: Experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, Real-IAD) and an internal dataset show DAPO achieves a 3.7% average improvement in AUROC and average precision at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings compared to baselines.
Conclusion: DAPO effectively bridges the gap between coarse anomaly signals and fine-grained defect categories by learning optimized prompts, enabling better zero-shot anomaly detection and segmentation with improved performance under distribution shifts.
Abstract: Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained defect types such as “hole”, “cut”, or “scratch” that could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of “abnormal” with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for the zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
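Hybrid prompts of this kind are typically built CoOp-style: frozen anchor token embeddings (e.g., the defect name) concatenated with learnable context vectors. The sketch below shows that pattern; token counts and placement are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HybridDefectPrompt(nn.Module):
    """Frozen textual anchor embeddings plus learnable context tokens,
    in the spirit of CoOp-style prompt tuning."""
    def __init__(self, anchor_emb: torch.Tensor, n_ctx: int = 8):
        super().__init__()
        d = anchor_emb.size(-1)
        self.register_buffer("anchor", anchor_emb)              # fixed anchor tokens
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, d))   # learnable tokens

    def forward(self) -> torch.Tensor:
        # [learnable ctx][fixed anchor] -> fed to the frozen text encoder
        return torch.cat([self.ctx, self.anchor], dim=0)

anchor = torch.randn(4, 512)          # stand-in for tokenized "hole" anchor
prompt_tokens = HybridDefectPrompt(anchor)()
```

Only `ctx` receives gradients during tuning, so the defect name stays fixed while the surrounding context adapts to the anomaly-detection objective.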
[218] CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence
Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
Main category: cs.CV
TL;DR: CoRe3D introduces a reasoning framework for 3D understanding and generation that combines semantic chain-of-thought with structured spatial reasoning to improve alignment between language descriptions and 3D outputs.
Details
Motivation: While reasoning mechanisms have proven effective for language and vision tasks in multimodal models, their application to 3D understanding and generation remains underdeveloped. The paper aims to extend reasoning-centric approaches to 3D domains to improve reliability, interpretability, and cross-modal alignment.
Method: CoRe3D uses a unified 3D reasoning framework that operates over both semantic and spatial abstractions. It employs a spatially grounded reasoning representation that decomposes the 3D latent space into localized regions, enabling compositional and procedural reasoning over geometry, and tightly couples semantic chain-of-thought inference with structured spatial reasoning.
Result: The framework produces 3D outputs with strong local consistency and faithful alignment with linguistic descriptions, demonstrating improved cross-modal alignment through explicit reasoning mechanisms.
Conclusion: Explicit reasoning mechanisms are crucial for advancing 3D multimodal models, and CoRe3D’s approach of combining semantic and spatial reasoning provides a promising direction for improving 3D understanding and generation tasks.
Abstract: Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
[219] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic
Main category: cs.CV
TL;DR: CityNav benchmark evaluates MLLMs’ visual navigation in real-world cities using only visual inputs, requiring knowledge-intensive reasoning like landmark recognition and spatial planning. Current methods underperform, but proposed Verbalization of Path (VoP) improves navigation by grounding reasoning in city-scale cognitive maps.
Details
Motivation: Current MLLM evaluation benchmarks are too language-centric or simulation-based, lacking assessment of knowledge-intensive reasoning needed for practical real-world embodied tasks. There's a need for benchmarks that test MLLMs' sequential decision-making in challenging, real-world visual navigation scenarios.
Method: Introduced CityNav benchmark with 4 global cities, requiring agents to navigate 50+ decision points using only visual inputs. Proposed Verbalization of Path (VoP) technique that probes MLLMs for city-scale cognitive maps (key landmarks and directions toward destination) to ground internal reasoning.
Result: Current SOTA MLLMs, reasoning techniques (GEPA, chain-of-thought, reflection) and baseline PReP significantly underperform on CityNav. VoP substantially enhances navigation success by explicitly grounding the agent’s internal reasoning in spatial knowledge.
Conclusion: CityNav reveals limitations of current MLLMs in knowledge-intensive visual navigation tasks. VoP demonstrates the importance of explicitly grounding spatial reasoning for successful real-world navigation, pointing toward future improvements in embodied MLLM agents.
Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection) and competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/
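As a rough illustration of the Verbalization of Path idea described above, the sketch below shows a two-stage prompting loop: the model first verbalizes its cognitive map (landmarks and direction), and that text is fed back into the action prompt. The `mllm.generate` client, the prompt wording, and the action vocabulary are all hypothetical placeholders, not the paper's interface.

```python
def navigate_step(mllm, image, destination: str, history: list) -> str:
    # 1) Probe the model's city-scale cognitive map: landmarks + coarse direction.
    cognitive_map = mllm.generate(
        image=image,
        prompt=(
            f"You are navigating to {destination}. "
            "List the key landmarks you recognize in this view and state the "
            "direction you believe leads toward the destination."
        ),
    )
    # 2) Ground the movement decision in the verbalized path.
    action = mllm.generate(
        image=image,
        prompt=(
            f"Cognitive map: {cognitive_map}\n"
            f"Recent actions: {history[-5:]}\n"
            "Choose exactly one action: FORWARD, LEFT, RIGHT, or STOP."
        ),
    )
    history.append(action)
    return action
```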
[220] Multi-Level Feature Fusion for Continual Learning in Visual Quality Inspection
Johannes C. Bauer, Paul Geng, Stephan Trattnig, Petr Dokládal, Rüdiger Daub
Main category: cs.CV
TL;DR: Multi-level feature fusion approach for continual learning in visual quality inspection, enabling efficient adaptation to changing products/defects in manufacturing
Details
Motivation: Deep neural networks struggle in volatile manufacturing scenarios like remanufacturing where products and defect patterns frequently change, requiring frequent model adaptation while avoiding catastrophic forgetting.
Method: Proposes a multi-level feature fusion (MLFF) approach that utilizes representations from different depths of a pretrained network to enable efficient adaptation with fewer trainable parameters.
Result: MLFF matches end-to-end training performance for quality inspection problems while using significantly fewer trainable parameters, reduces catastrophic forgetting, and improves generalization to new product types/defects
Conclusion: The MLFF approach enables efficient continual learning for visual quality inspection in dynamic manufacturing environments, balancing adaptation speed with model stability
Abstract: Deep neural networks show great potential for automating various visual quality inspection tasks in manufacturing. However, their applicability is limited in more volatile scenarios, such as remanufacturing, where the inspected products and defect patterns often change. In such settings, deployed models require frequent adaptation to novel conditions, effectively posing a continual learning problem. To enable quick adaptation, the necessary training processes must be computationally efficient while still avoiding effects like catastrophic forgetting. This work presents a multi-level feature fusion (MLFF) approach that aims to improve both aspects simultaneously by utilizing representations from different depths of a pretrained network. We show that our approach is able to match the performance of end-to-end training for different quality inspection problems while using significantly fewer trainable parameters. Furthermore, it reduces catastrophic forgetting and improves generalization robustness to new product types or defects.
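A minimal sketch of the multi-level feature fusion idea, not the authors' code: tap a frozen pretrained backbone at several depths with forward hooks, pool each tap, and train only a small fusion head, so adaptation touches very few parameters. The ResNet-18 backbone, global-average pooling, and two-class head are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in backbone.parameters():
    p.requires_grad = False  # frozen backbone: only the fusion head is trained

taps = {}
def make_hook(name):
    def hook(module, inputs, output):
        taps[name] = output.mean(dim=(2, 3))  # global-average-pool each level
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(make_hook(name))

# ResNet-18 stage widths: 64 + 128 + 256 + 512 pooled features, fused linearly.
fusion_head = nn.Linear(64 + 128 + 256 + 512, 2)  # e.g., OK vs. defect

def predict(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        backbone(x)  # hooks populate `taps`
    fused = torch.cat([taps[n] for n in ["layer1", "layer2", "layer3", "layer4"]], dim=1)
    return fusion_head(fused)
```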
[221] SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices
Dongting Hu, Aarush Gupta, Magzhan Gabidolla, Arpit Sahni, Huseyin Coskun, Yanyu Li, Yerlan Idelbayev, Ahsan Mahmood, Aleksei Lebedev, Dishani Lahiri, Anujraaj Goyal, Ju Hu, Mingming Gong, Sergey Tulyakov, Anil Kag
Main category: cs.CV
TL;DR: Efficient diffusion transformer framework for mobile/edge devices with sparse attention, elastic training, and knowledge-guided distillation for real-time on-device generation.
Details
Motivation: Current diffusion transformers (DiTs) achieve state-of-the-art image generation but are too computationally expensive for on-device deployment on mobile and edge devices due to high memory and computational costs.
Method: Three key components: 1) Compact DiT architecture with adaptive global-local sparse attention, 2) Elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, 3) Knowledge-Guided Distribution Matching Distillation pipeline combining DMD objective with knowledge transfer from few-step teacher models.
Result: Achieves transformer-level generation quality under strict resource constraints, enabling high-fidelity, low-latency generation (e.g., 4-step) suitable for real-time on-device use on mobile and edge devices.
Conclusion: The proposed framework enables scalable, efficient, and high-quality diffusion models for practical deployment on diverse hardware platforms while maintaining generation quality.
Abstract: Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we propose an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity and low-latency generation (e.g., 4-step) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
[222] Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
Jeremiah Coholich, Justin Wit, Robert Azarcon, Zsolt Kira
Main category: cs.CV
TL;DR: MANGO: Unpaired image translation method for sim2real robot manipulation that generates diverse camera viewpoints from simulation data using novel segmentation-conditioned losses.
Details
Motivation: Vision-based robot manipulation policies are brittle to camera viewpoint variations, and real-world robot demonstration data is scarce and lacks viewpoint diversity. Simulation can provide comprehensive viewpoint coverage but faces visual sim2real challenges.
Method: Proposes MANGO with three key components: 1) segmentation-conditioned InfoNCE loss, 2) highly-regularized discriminator design, and 3) modified PatchNCE loss to maintain viewpoint consistency during sim2real translation.
Result: MANGO outperforms other image translation methods tested. Imitation-learning policies trained on MANGO-augmented data achieve 60% success rates on views where non-augmented policies fail completely.
Conclusion: MANGO effectively bridges the sim2real gap for robot manipulation by generating diverse unseen viewpoints from simulation data, requiring only small amounts of fixed-camera real-world data.
Abstract: Vision-based policies for robot manipulation have achieved significant recent success, but are still brittle to distribution shifts such as camera viewpoint variations. Robot demonstration data is scarce and often lacks appropriate variation in camera viewpoints. Simulation offers a way to collect robot demonstrations at scale with comprehensive coverage of different viewpoints, but presents a visual sim2real challenge. To bridge this gap, we propose MANGO – an unpaired image translation method with a novel segmentation-conditioned InfoNCE loss, a highly-regularized discriminator design, and a modified PatchNCE loss. We find that these elements are crucial for maintaining viewpoint consistency during sim2real translation. When training MANGO, we only require a small amount of fixed-camera data from the real world, but show that our method can generate diverse unseen viewpoints by translating simulated observations. In this domain, MANGO outperforms all other image translation methods we tested. Imitation-learning policies trained on data augmented by MANGO are able to achieve success rates as high as 60% on views that the non-augmented policy fails completely on.
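The segmentation-conditioned InfoNCE loss can be pictured as below; this is one plausible reading of the description (patches of the same semantic class are removed from the negatives so translation is not penalized for making them similar), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def seg_conditioned_infonce(feat_sim, feat_real, seg, tau: float = 0.07):
    # feat_sim, feat_real: (N, D) patch features at matched spatial locations
    # in the simulated image and its translation; seg: (N,) label per location.
    feat_sim = F.normalize(feat_sim, dim=1)
    feat_real = F.normalize(feat_real, dim=1)
    logits = feat_sim @ feat_real.t() / tau  # (N, N) patch similarities
    # Same-class patches are not meaningful negatives: mask them out of the
    # denominator, keeping only the matched location as the positive.
    same_class = seg.unsqueeze(0) == seg.unsqueeze(1)
    pos = torch.eye(len(seg), dtype=torch.bool, device=seg.device)
    logits = logits.masked_fill(same_class & ~pos, float("-inf"))
    targets = torch.arange(len(seg), device=seg.device)
    return F.cross_entropy(logits, targets)
```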
[223] DiCo: Disentangled Concept Representation for Text-to-image Person Re-identification
Giyeol Kim, Chanho Eom
Main category: cs.CV
TL;DR: DiCo framework uses disentangled slot-based representations with concept blocks for hierarchical cross-modal alignment in text-to-image person re-identification, achieving competitive performance with improved interpretability.
Details
Motivation: Address the substantial modality gap between visual appearances and textual descriptions in person re-identification, and the need to model fine-grained correspondences between similar individuals with shared attributes like clothing color, texture, or outfit style.
Method: Proposes DiCo (Disentangled Concept Representation) framework with shared slot-based representation where each slot acts as a part-level anchor across modalities, further decomposed into multiple concept blocks to disentangle complementary attributes while maintaining consistent part-level correspondence.
Result: Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate competitive performance with state-of-the-art methods, while enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval.
Conclusion: DiCo effectively addresses the modality gap in text-to-image person re-identification through hierarchical disentangled representations, achieving strong performance while providing interpretable fine-grained retrieval capabilities.
Abstract: Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.
[224] Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos
Meng Cao, Haoran Tang, Haoze Zhao, Mingfei Han, Ruyang Liu, Qiang Sun, Xiaojun Chang, Ian Reid, Xiaodan Liang
Main category: cs.CV
TL;DR: PhysGame uses gameplay video glitches (visual anomalies violating physics) to create instruction-tuning data for physical reasoning in multimodal LLMs, improving real-world and general reasoning performance.
Details
Motivation: Current MLLMs lack human-level physical world understanding. Existing datasets are either expensive (real videos) or unrealistic (synthetic). Gameplay glitches provide scalable, realistic supervision for physical reasoning.
Method: Created PhysGame dataset with 140K QA pairs from gameplay glitches across 5 physical domains. Used metadata-guided prompting for quality QA generation. Built GameBench benchmark with 880 expert-annotated glitch videos for evaluation.
Result: PhysGame improved Qwen2.5VL by 2.5% on real-world PhysBench, 1.9% on general MVBench, and 3.7% on GameBench. Shows effective transfer from gameplay anomalies to physical reasoning.
Conclusion: Learning from gameplay glitches offers scalable, effective pathway for advancing physical world understanding in multimodal intelligence, bridging synthetic and real-world physical reasoning.
Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a meta-information-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
[225] CER-HV: A CER-Based Human-in-the-Loop Framework for Cleaning Datasets Applied to Arabic-Script HTR
Sana Al-azzawi, Elisa Barney, Marcus Liwicki
Main category: cs.CV
TL;DR: CER-HV framework detects and cleans label errors in Arabic-script handwritten text recognition datasets using CER-based noise detection with human verification, improving recognition accuracy.
Details
Motivation: Arabic-script HTR lags behind Latin-script despite advances; data quality issues in existing datasets limit performance improvement.
Method: CER-HV combines CER-based noise detector (using CRNN with early stopping) with human-in-the-loop verification to identify label errors including transcription, segmentation, orientation, and non-text content issues.
Result: CRNN achieves SOTA on 5/6 datasets (8.45% CER on KHATT, 8.26% on PHTI, etc.); CER-HV detects errors with 80-90% precision; improves CER by 0.3-1.8% after cleaning.
Conclusion: Data quality is critical for HTR performance; CER-HV effectively identifies dataset errors; framework is generalizable beyond Arabic-script languages.
Abstract: Handwritten text recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR, despite recent advances in model architectures, datasets, and benchmarks. We show that data quality is a significant limiting factor in many published datasets and propose CER-HV (CER-based Ranking with Human Verification) as a framework to detect and clean label errors. CER-HV combines a CER-based noise detector, built on a carefully configured Convolutional Recurrent Neural Network (CRNN) with early stopping to avoid overfitting noisy samples, and a human-in-the-loop (HITL) step that verifies high-ranking samples. The framework reveals that several existing datasets contain previously underreported problems, including transcription, segmentation, orientation, and non-text content errors. These have been identified with up to 90 percent precision in the Muharaf and 80-86 percent in the PHTI datasets. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.45 percent Character Error Rate (CER) on KHATT (Arabic), 8.26 percent on PHTI (Pashto), 10.66 percent on Ajami, and 10.11 percent on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3 percent CER on the PHTD (Persian) dataset. Applying CER-HV improves the evaluation CER by 0.3-0.6 percent on the cleaner datasets and 1.0-1.8 percent on the noisier ones. Although our experiments focus on documents written in an Arabic-script language, including Arabic, Persian, Urdu, Ajami, and Pashto, the framework is general and can be applied to other text recognition datasets.
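The CER-based ranking step is straightforward to sketch: score every sample by the character error rate between the trained model's prediction and the stored label, then surface the highest-CER samples for human verification. The plain edit-distance implementation and the `predict` callable below are illustrative, not the paper's code.

```python
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[-1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(pred: str, label: str) -> float:
    return levenshtein(pred, label) / max(len(label), 1)

def rank_for_review(samples, predict, top_k: int = 200):
    # samples: list of (image, label); predict: image -> transcription.
    scored = [(cer(predict(img), lbl), img, lbl) for img, lbl in samples]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]  # highest-CER samples go to the human-in-the-loop step
```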
[226] Contextual Range-View Projection for 3D LiDAR Point Clouds
Seyedali Mousavi, Seyedhamidreza Mousavi, Masoud Daneshtalab
Main category: cs.CV
TL;DR: Improved range-view projection for LiDAR point clouds by incorporating contextual information from instance centers and class labels to address many-to-one conflicts, achieving better semantic segmentation performance.
Details
Motivation: Existing range-view projection methods for LiDAR point clouds use simple depth-based selection (closest point) which loses important semantic and structural information, especially for objects and instance boundaries.
Method: Proposes two mechanisms: 1) Centerness-Aware Projection (CAP) adjusts point depths based on distance from instance center to prioritize central points; 2) Class-Weighted-Aware Projection (CWAP) uses user-defined class weights to prioritize specific object classes.
Result: On SemanticKITTI dataset, CAP preserves more instance points and achieves up to 3.1% mIoU improvement over baseline. CWAP enhances targeted class performance with negligible impact on other classes.
Conclusion: Incorporating contextual information (instance centers and class labels) into range-view projection significantly improves semantic segmentation performance compared to simple depth-based selection.
Abstract: Range-view projection provides an efficient method for transforming 3D LiDAR point clouds into 2D range image representations, enabling effective processing with 2D deep learning models. However, a major challenge in this projection is the many-to-one conflict, where multiple 3D points are mapped onto the same pixel in the range image, requiring a selection strategy. Existing approaches typically retain the point with the smallest depth (closest to the LiDAR), disregarding semantic relevance and object structure, which leads to the loss of important contextual information. In this paper, we extend the depth-based selection rule by incorporating contextual information from both instance centers and class labels, introducing two mechanisms: Centerness-Aware Projection (CAP) and Class-Weighted-Aware Projection (CWAP). In CAP, point depths are adjusted according to their distance from the instance center, thereby prioritizing central instance points over noisy boundary and background points. In CWAP, object classes are prioritized through user-defined weights, offering flexibility in the projection strategy. Our evaluations on the SemanticKITTI dataset show that CAP preserves more instance points during projection, achieving up to a 3.1% mIoU improvement compared to the baseline. Furthermore, CWAP enhances the performance of targeted classes while having a negligible impact on the performance of other classes.
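An illustrative sketch of the contextual selection rule: rather than keeping the geometrically closest point per pixel, each point competes with a score that is raised by its distance to the instance center (CAP) and lowered for high-weight classes (CWAP). The additive form and the constant `alpha` are assumptions for illustration; the paper's exact adjustment may differ.

```python
import numpy as np

def project_contextual(uv, depth, center_dist, labels, class_w,
                       H: int, W: int, alpha: float = 0.5):
    # uv: (N, 2) integer pixel coords; depth: (N,) range per point;
    # center_dist: (N,) distance of each point to its instance center;
    # class_w: dict label -> weight (>1 means "prefer this class").
    weights = np.array([class_w.get(int(l), 1.0) for l in labels])
    # Smaller score wins the pixel: central points and prioritized classes
    # beat closer-but-peripheral points.
    score = (depth + alpha * center_dist) / weights
    best = np.full((H, W), np.inf)
    idx_map = np.full((H, W), -1, dtype=np.int64)
    for i, (u, v) in enumerate(uv):
        if score[i] < best[v, u]:
            best[v, u] = score[i]
            idx_map[v, u] = i
    return idx_map  # per-pixel index of the selected 3D point
```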
[227] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang
Main category: cs.CV
TL;DR: CamReasoner is a framework that reformulates camera movement understanding as a structured inference process using an Observation-Thinking-Answer paradigm with reinforcement learning for geometric reasoning.
Details
Motivation: Existing multimodal models treat camera dynamics as black-box classification, confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. There's a gap between perception and cinematic logic that needs bridging.
Method: Uses Observation-Thinking-Answer (O-T-A) paradigm to decode spatio-temporal cues like trajectories and view frustums. Constructs Large-scale Inference Trajectory Suite with 18k SFT reasoning chains and 38k RL feedback samples. First to employ RL for logical alignment in this domain to ensure motion inferences are grounded in physical geometry.
Result: CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks for camera movement understanding.
Conclusion: The framework successfully bridges perception and cinematic logic through structured geometric reasoning, demonstrating that explicit reasoning blocks with RL alignment can significantly improve camera dynamics understanding in multimodal models.
Abstract: Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the Observation-Think-Answer (O-T-A) reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.
[228] Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
Lukas Kuhn, Giuseppe Serra, Florian Buettner
Main category: cs.CV
TL;DR: NOVA is a non-contrastive vision-language alignment framework that predicts text embeddings from augmented image views with distributional regularization, eliminating the need for negative sampling and complex training setups.
Details
Motivation: Contrastive vision-language models like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning, making training complex and unstable. The authors aim to develop a simpler, more stable alternative.
Method: NOVA aligns visual representations to a frozen text encoder by predicting text embeddings from augmented image views. It uses Sketched Isotropic Gaussian Regularization (SIGReg) to enforce an isotropic Gaussian structure on the predicted embeddings, eliminating negative sampling, momentum encoders, and stop-gradients.
Result: NOVA outperforms multiple standard baselines on zero-shot chest X-ray classification across three benchmark datasets using ClinicalBERT as text encoder and Vision Transformers trained from scratch on MIMIC-CXR. It also exhibits substantially more consistent training runs.
Conclusion: Non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods, with NOVA demonstrating strong performance in medical imaging applications.
Abstract: Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
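A minimal sketch of the non-contrastive objective as summarized above: a predictor maps features of an augmented image view onto the frozen text encoder's embedding, plus a regularizer pushing the predicted embeddings toward an isotropic Gaussian. The simple covariance penalty below is a stand-in for SIGReg, whose sketched formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F

def nova_style_loss(pred_emb: torch.Tensor, text_emb: torch.Tensor, lam: float = 1.0):
    # pred_emb: (B, D) predicted from an augmented image view (trainable path);
    # text_emb: (B, D) from the frozen text encoder (no gradient flows back).
    align = 1 - F.cosine_similarity(pred_emb, text_emb.detach(), dim=1).mean()
    z = pred_emb - pred_emb.mean(dim=0)
    cov = (z.t() @ z) / max(pred_emb.shape[0] - 1, 1)
    eye = torch.eye(cov.shape[0], device=cov.device)
    iso = ((cov - eye) ** 2).mean()  # stand-in isotropy penalty, not SIGReg itself
    return align + lam * iso
```

In this reading, `lam` plays the role of the single remaining objective hyperparameter that the abstract mentions.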
[229] Localized Control in Diffusion Models via Latent Vector Prediction
Pablo Domingo-Gregorio, Javier Ruiz-Hidalgo
Main category: cs.CV
TL;DR: A novel diffusion model training framework for precise local control over user-defined image regions while maintaining global text-to-image generation capabilities.
Details
Motivation: Existing text-to-image diffusion models lack precise localized control, requiring laborious trial-and-error with text prompts. Current methods using image-level controls (edges, segmentation, depth maps) apply conditions uniformly across entire images, limiting localized manipulation.
Method: Proposes a training framework incorporating masking features and an additional loss term that leverages prediction of initial latent vectors at any diffusion step to enhance correspondence between current steps and final samples in latent space.
Result: Extensive experiments demonstrate effective synthesis of high-quality images with controlled local conditions while maintaining global text-to-image generation capabilities.
Conclusion: The method enables precise local control over user-defined image regions while allowing diffusion models to autonomously generate remaining areas according to original prompts, addressing limitations of uniform image-level conditioning.
Abstract: Diffusion models emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside with text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.
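The additional loss term can be sketched under the standard epsilon-prediction DDPM parameterization, where the noise predicted at step t is converted into an estimate of the initial latent and compared with the reference only inside the user mask. This is our reading of the mechanism, with all names illustrative.

```python
import torch

def masked_x0_loss(x_t, eps_pred, x0_ref, mask, alpha_bar_t):
    # x_t: noisy latent at step t; eps_pred: model's noise prediction;
    # x0_ref: target latent for the controlled region; mask: 1 inside the
    # user-defined region, 0 elsewhere; alpha_bar_t: cumulative noise schedule.
    # DDPM identity: x0_hat = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()
    return ((mask * (x0_hat - x0_ref)) ** 2).sum() / mask.sum().clamp(min=1)
```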
[230] SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
Xiaoxuan He, Siming Fu, Wanli Li, Zhiyuan Li, Dacheng Yin, Kang Rong, Fengyun Rao, Bo Zhang
Main category: cs.CV
TL;DR: SAIL is a self-amplified iterative learning framework that enables diffusion models to act as their own teachers for alignment with minimal human feedback, using only 6% of preference data compared to existing methods.
Details
Motivation: Aligning diffusion models with human preferences is challenging when reward models are unavailable and collecting large-scale preference datasets is expensive. The paper explores whether effective alignment can be achieved using minimal human feedback by unlocking latent capabilities within diffusion models themselves.
Method: SAIL operates in a closed-loop manner: starting from a minimal seed set of human-annotated preference pairs, the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. A ranked preference mixup strategy balances exploration with adherence to initial human priors to prevent catastrophic forgetting.
Result: SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using only 6% of the preference data required by existing approaches, demonstrating that diffusion models possess remarkable self-improvement capabilities.
Conclusion: Diffusion models have inherent self-improvement capabilities that, when properly harnessed through frameworks like SAIL, can effectively replace both large-scale human annotation and external reward models for alignment tasks.
Abstract: Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
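The closed loop can be summarized as pseudocode; `generate`, `self_rank`, and `preference_update` are assumed helpers standing in for sampling, self-annotation, and a DPO-style update, and the mixup schedule is illustrative rather than the paper's.

```python
def sail_round(model, prompts, seed_pairs, mix_ratio: float = 0.3):
    self_pairs = []
    for prompt in prompts:
        a, b = generate(model, prompt), generate(model, prompt)
        winner, loser = self_rank(model, prompt, a, b)  # self-annotated preference
        self_pairs.append((prompt, winner, loser))
    # Ranked preference mixup: keep some human-annotated seed pairs in every
    # round so the model stays anchored to the initial human prior.
    n_seed = int(mix_ratio * len(self_pairs))
    batch = self_pairs + seed_pairs[:n_seed]
    return preference_update(model, batch)  # e.g., a DPO-style objective
```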
[231] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
David Shavin, Sagie Benaim
Main category: cs.CV
TL;DR: Splat and Distill framework enhances 2D Vision Foundation Models with 3D awareness by using feed-forward 3D Gaussian reconstruction to create novel viewpoint features for student model supervision.
Details
Motivation: Current Vision Foundation Models lack robust 3D awareness despite their success in 2D tasks, limiting their understanding of geometric relationships and 3D scene structure.
Method: Uses feed-forward 3D Gaussian reconstruction to lift 2D teacher features into explicit 3D representations, then splats these 3D features onto novel viewpoints to create supervision signals for distilling geometrically grounded knowledge into student models.
Result: Significantly outperforms prior works on monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation, improving both 3D awareness and semantic richness of 2D features.
Conclusion: The framework successfully instills 3D awareness into 2D VFMs through a dynamic learning process that avoids feature-averaging artifacts and creates a mutually reinforcing improvement between teacher and student models.
Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feedforward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher’s consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
[232] TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition
Junbo Jacob Lian, Feng Xiong, Yujun Sun, Kaichen Ouyang, Zong Ke, Mingyang Yu, Shengwei Fu, Zhong Rui, Zhang Yujun, Huiling Chen
Main category: cs.CV
TL;DR: TwistNet-2D is a lightweight module that captures local pairwise channel interactions under directional spatial displacement for texture recognition, outperforming larger models with minimal computational overhead.
Details
Motivation: Current methods for texture recognition face a trade-off: bilinear pooling and Gram matrices capture global channel correlations but lose spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions.
Method: Introduces TwistNet-2D with Spiral-Twisted Channel Interaction (STCI) that shifts one feature map along prescribed directions before element-wise channel multiplication, capturing cross-position co-occurrence patterns. Uses four directional heads with learned channel reweighting and sigmoid-gated residual path.
Result: TwistNet-2D adds only 3.5% parameters and 2% FLOPs over ResNet-18, yet consistently outperforms parameter-matched and larger baselines (ConvNeXt, Swin Transformer, hybrid CNN-Transformer) across four texture and fine-grained recognition benchmarks.
Conclusion: TwistNet-2D provides an efficient solution for capturing local pairwise channel interactions with spatial awareness, addressing limitations of existing methods for texture recognition while maintaining computational efficiency.
Abstract: Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines – including ConvNeXt, Swin Transformer, and hybrid CNN–Transformer architectures – across four texture and fine-grained recognition benchmarks.
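The core shift-and-multiply operation is compact enough to sketch directly; the module below follows the abstract's description (four directional heads, learned channel reweighting, sigmoid-gated residual), though the exact head design is an assumption.

```python
import torch
import torch.nn as nn

class STCI(nn.Module):
    """Spiral-twisted channel interaction: displaced element-wise products."""
    def __init__(self, channels: int, shift: int = 1):
        super().__init__()
        # (dy, dx) displacements for the four directional heads.
        self.dirs = [(0, shift), (0, -shift), (shift, 0), (-shift, 0)]
        self.reweight = nn.Conv2d(4 * channels, channels, kernel_size=1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        heads = []
        for dy, dx in self.dirs:
            shifted = torch.roll(x, shifts=(dy, dx), dims=(2, 3))
            heads.append(x * shifted)  # second-order interaction across a displacement
        fused = self.reweight(torch.cat(heads, dim=1))  # learned channel reweighting
        return x + torch.sigmoid(self.gate(fused)) * fused  # gated residual path
```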
[233] SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads
Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu
Main category: cs.CV
TL;DR: SoulX-FlashHead: A 1.3B-parameter framework for real-time, high-fidelity audio-driven portrait video generation with streaming capabilities
Details
Motivation: Address the challenge of balancing high-fidelity visual quality with low-latency streaming in audio-driven portrait generation, overcoming limitations of existing models that are either computationally expensive or sacrifice facial representation quality and temporal stability.
Method: Proposes a unified 1.3B-parameter framework with Streaming-Aware Spatiotemporal Pre-training using Temporal Audio Context Cache for robust feature extraction from short audio fragments, and Oracle-Guided Bidirectional Distillation to mitigate error accumulation in long-sequence autoregressive generation.
Result: Achieves state-of-the-art performance on HDTF and VFHQ benchmarks, with Lite variant reaching 96 FPS on a single NVIDIA RTX 4090, enabling ultra-fast interaction while maintaining visual coherence
Conclusion: SoulX-FlashHead successfully addresses the trade-off between visual quality and streaming latency in audio-driven portrait generation, providing a practical solution for real-time applications
Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
[234] SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning
Yancheng Long, Yankai Yang, Hongyang Wei, Wei Chen, Tianke Zhang, Haonan Fan, Changyi Liu, Kaiyu Jiang, Jiankang Chen, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Shuo Yang
Main category: cs.CV
TL;DR: SpatialReward is a reward model for online RL-based image editing that addresses “Attention Collapse” by using explicit spatial reasoning anchored to predicted edit regions, improving evaluative accuracy and boosting image editing performance.
Details
Motivation: Current RL approaches for image editing suffer from unreliable reward signals due to "Attention Collapse", where models fail to make cross-image comparisons and miss fine-grained details, leading to inaccurate perception and miscalibrated scores.
Method: Proposes SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning by anchoring reasoning to predicted edit regions, grounding semantic judgments in pixel-level evidence. Trained on a curated 260k spatial-aware dataset.
Result: Achieves SOTA on MMRB2 and EditReward-Bench, outperforms proprietary evaluators on MultiEditReward-Bench. Boosts OmniGen2 by +0.90 on GEdit-Bench, surpassing leading discriminative model and doubling GPT-4.1’s gain (+0.45).
Conclusion: Spatial reasoning is essential for unlocking effective alignment in image editing, and SpatialReward provides a robust reward signal for online RL that significantly improves image editing performance.
Abstract: Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term “Attention Collapse,” where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.
[235] Thermal odometry and dense mapping using learned odometry and Gaussian splatting
Tianhao Zhou, Yujia Chen, Zhihao Zhan, Yuhang Ming, Jianzhu Huai
Main category: cs.CV
TL;DR: TOM-GS: A thermal odometry and mapping method combining learning-based odometry with Gaussian Splatting for dense reconstruction in adverse conditions.
Details
Motivation: Thermal infrared sensors are robust in adverse conditions (darkness, dust, smoke) but existing thermal odometry/mapping approaches are predominantly geometric, fail across diverse datasets, and lack dense mapping capabilities.
Method: Proposes TOM-GS integrating learning-based odometry with Gaussian Splatting-based dense mapping, featuring thermal image enhancement and monocular depth integration for thermal cameras.
Result: Extensive experiments show TOM-GS outperforms existing learning-based methods in motion estimation and novel-view rendering, confirming benefits of learning-based pipelines for robust thermal odometry and dense reconstruction.
Conclusion: TOM-GS is among the first Gaussian Splatting-based SLAM systems tailored for thermal cameras, demonstrating effective integration of learning-based odometry with dense mapping for adverse condition perception.
Abstract: Thermal infrared sensors, with wavelengths longer than smoke particles, can capture imagery independent of darkness, dust, and smoke. This robustness has made them increasingly valuable for motion estimation and environmental perception in robotics, particularly in adverse conditions. Existing thermal odometry and mapping approaches, however, are predominantly geometric and often fail across diverse datasets while lacking the ability to produce dense maps. Motivated by the efficiency and high-quality reconstruction ability of recent Gaussian Splatting (GS) techniques, we propose TOM-GS, a thermal odometry and mapping method that integrates learning-based odometry with GS-based dense mapping. TOM-GS is among the first GS-based SLAM systems tailored for thermal cameras, featuring dedicated thermal image enhancement and monocular depth integration. Extensive experiments on motion estimation and novel-view rendering demonstrate that TOM-GS outperforms existing learning-based methods, confirming the benefits of learning-based pipelines for robust thermal odometry and dense reconstruction.
[236] Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures
Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang
Main category: cs.CV
TL;DR: Vision-language models (VLMs) significantly outperform specialized age estimation models in facial age estimation, with zero-shot VLMs achieving much lower error rates than traditional architectures.
Details
Motivation: There's no systematic benchmark comparing modern vision-language models with specialized age estimation architectures, despite age estimation's importance for content moderation, age verification, and deepfake detection.
Method: Large-scale cross-paradigm benchmark evaluating 34 models (22 specialized architectures and 12 general-purpose VLMs) across eight standard datasets totaling 1,100 test images per model, using mean absolute error (MAE) as primary metric.
Result: Zero-shot VLMs significantly outperform most specialized models (MAE 5.65 vs 9.88 years). Gemini 3 Flash Preview (MAE 4.32) surpasses best non-LLM model MiVOLO (MAE 5.10) by 15%. VLMs also reduce false adult rates for minors from 39-100% to 16-29%.
Conclusion: Task-specific architectures may not be necessary for high-performance age estimation; future work should focus on distilling VLM capabilities into efficient specialized models.
Abstract: Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs - across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) surpasses the strongest non-LLM model (MiVOLO, MAE 5.10) by 15%. MiVOLO - unique in combining face and body features using Vision Transformers - is the only specialized model that remains competitive with VLMs. We further analyze age verification at the 18-year threshold and find that most non-LLM models exhibit false adult rates between 39% and 100% for minors, whereas VLMs reduce this to 16%-29%. Additionally, coarse age binning (8-9 classes) consistently increases MAE beyond 13 years. Stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (under 5 and over 65). Overall, these findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation and suggest that future work should focus on distilling VLM capabilities into efficient specialized models.
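The benchmark's primary metrics are simple to state in code; the sketch below computes MAE and the false-adult rate at the 18-year threshold, with `predict_age` a placeholder for any evaluated model (specialized or VLM).

```python
def evaluate(samples, predict_age):
    # samples: iterable of (image, true_age) pairs.
    abs_errs, minors, false_adult = [], 0, 0
    for img, age in samples:
        pred = predict_age(img)
        abs_errs.append(abs(pred - age))
        if age < 18:
            minors += 1
            false_adult += pred >= 18  # a minor predicted as adult
    mae = sum(abs_errs) / len(abs_errs)
    far = false_adult / minors if minors else 0.0
    return mae, far
```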
[237] MIND: Benchmarking Memory Consistency and Action Control in World Models
Yixuan Ye, Xuanyu Lu, Yuxin Jiang, Yuchao Gu, Rui Zhao, Qiwei Liang, Jiachun Pan, Fengda Zhang, Weijia Wu, Alex Jinpeng Wang
Main category: cs.CV
TL;DR: MIND is a new benchmark for evaluating world models’ memory consistency and action control abilities using 250 high-quality videos across diverse scenes and action spaces.
Details
Motivation: There's a lack of unified benchmarks for evaluating fundamental abilities of world models in understanding, remembering, and predicting dynamic visual environments.
Method: Created MIND benchmark with 250 high-quality videos (1080p, 24 FPS) including first-person and third-person perspectives across 8 diverse scenes with shared and varied action spaces. Developed evaluation framework measuring memory consistency and action control, plus MIND-World baseline model.
Result: Benchmark completeness validated through extensive experiments, revealing key challenges in current world models including difficulty maintaining long-term memory consistency and generalizing across action spaces.
Conclusion: MIND provides the first open-domain closed-loop benchmark for world model evaluation, highlighting important research gaps and enabling future performance benchmarking.
Abstract: World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Code: https://github.com/CSU-JPG/MIND.
[238] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models
Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, Yong Li
Main category: cs.CV
TL;DR: WorldArena is a unified benchmark for evaluating embodied world models across both perceptual quality (video generation) and functional utility (downstream decision-making tasks), revealing a significant gap between visual quality and task performance.
Details
Motivation: Current evaluation of embodied world models focuses too narrowly on perceptual fidelity (video generation quality) while overlooking their functional utility in downstream decision-making tasks, creating a fragmented evaluation landscape.
Method: WorldArena assesses models through three dimensions: 1) video perception quality (16 metrics across six sub-dimensions), 2) embodied task functionality (evaluating world models as data engines, policy evaluators, and action planners with human evaluation), and 3) EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index.
Result: Experiments on 14 representative models reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. The benchmark includes a public leaderboard at https://world-arena.ai.
Conclusion: WorldArena provides a comprehensive framework for tracking progress toward truly functional world models in embodied AI, addressing the critical need for unified evaluation that considers both perceptual and functional dimensions.
Abstract: While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners integrating with subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. WorldArena benchmark with the public leaderboard is released at https://world-arena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.
[239] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation
Chuanhai Zang, Jiabao Hu, XW Song
Main category: cs.CV
TL;DR: FD-DB: A frequency-decoupled dual-branch model for synthetic-to-real domain adaptation that separates appearance transfer into interpretable low-frequency editing and high-frequency residual compensation to improve photorealism while preserving geometric structures.
Details
Motivation: Synthetic data provides low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift that degrades downstream performance. Existing unpaired synthetic-to-real translation methods face a trade-off between photorealism and structural stability.
Method: Proposes FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into: 1) low-frequency interpretable editing branch predicting physically meaningful parameters (white balance, exposure, contrast, saturation, blur, grain), and 2) high-frequency residual compensation branch for fine details. Uses gated fusion mechanism with explicit frequency constraints and two-stage training schedule.
Result: Experiments on YCB-V dataset show FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
Conclusion: FD-DB effectively addresses the photorealism-structural stability trade-off in synthetic-to-real domain adaptation through frequency-decoupled dual-branch architecture with interpretable editing and residual compensation.
Abstract: Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
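To make the frequency decoupling concrete, here is a minimal Python sketch (ours, not the authors' code) of the dual-branch idea: a Gaussian low-pass produces the appearance base for interpretable global edits, the residual carries fine detail, and a fixed scalar gate stands in for the learned gated fusion that limits low-frequency drift.
```python
# Minimal sketch, not FD-DB itself: split an image into a low-frequency base
# and a high-frequency residual, edit the base globally, then fuse with a gate.
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequencies(img: np.ndarray, sigma: float = 3.0):
    low = gaussian_filter(img, sigma=sigma)   # low-frequency appearance base
    high = img - low                          # high-frequency residual (details)
    return low, high

def apply_global_edits(low: np.ndarray, exposure: float = 1.1, contrast: float = 1.05):
    # Stand-ins for the interpretable parameters; white balance, blur, and
    # grain are omitted, and these values are arbitrary examples.
    mean = low.mean()
    return np.clip((low * exposure - mean) * contrast + mean, 0.0, 1.0)

def gated_fusion(low_edited: np.ndarray, high: np.ndarray, gate: float = 0.7):
    # gate in [0, 1] bounds how much free-branch detail is reinjected; in the
    # paper the gate is learned, here it is a fixed scalar assumption.
    return np.clip(low_edited + gate * high, 0.0, 1.0)

synthetic = np.random.rand(64, 64).astype(np.float32)  # toy grayscale image
low, high = split_frequencies(synthetic)
translated = gated_fusion(apply_global_edits(low), high)
print(translated.shape, float(translated.min()), float(translated.max()))
```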
[240] Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors
Sandeep Gupta, Roberto Passerone
Main category: cs.CV
TL;DR: Analysis of vision system security in autonomous vehicles, identifying attack surfaces and evaluating threats to confidentiality, integrity, and availability.
Details
Motivation: To ensure safe and reliable Level-5 autonomous driving by understanding and addressing security vulnerabilities in CAV vision systems, which are critical for object detection, lane marking recognition, and traffic sign identification.
Method: Analyzed key sensors and vision components to derive a reference architecture for CAV vision systems, then identified potential attack surfaces and attack vectors targeting each surface, evaluating their implications for CIA triad principles.
Result: Developed a comprehensive reference architecture for CAV vision systems, identified specific attack surfaces and vectors, and evaluated their impact on confidentiality, integrity, and availability of vision data and processing.
Conclusion: The study provides crucial insights into vision system vulnerabilities in autonomous vehicles, offering a foundation for developing robust security measures to protect the CIA triad and ensure safe CAV operation.
Abstract: This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.
[241] Kelix Technique Report
Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
Main category: cs.CV
TL;DR: Kelix is a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations in multimodal LLMs
Details
Motivation: Current vision-language models use hybrid interfaces (discrete text tokens + continuous ViT features), which limits self-supervised learning on non-text data and creates bias toward understanding over generation. Discrete visual tokenization exists but suffers from information loss and weaker understanding compared to continuous-feature models.
Method: Presents Kelix, a fully discrete autoregressive unified model that uses shared discrete representation across modalities, enabling unified comprehension and generation under self-supervision through next-token prediction.
Result: Kelix closes the understanding gap between discrete and continuous visual representations, achieving performance comparable to continuous-feature vision-language models while maintaining the benefits of fully discrete autoregressive modeling.
Conclusion: Fully discrete autoregressive multimodal modeling is achievable without sacrificing understanding capabilities, enabling unified comprehension and generation across modalities through shared discrete representations.
Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
[242] Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai
Main category: cs.CV
TL;DR: RoSE reformulates monocular normal estimation as shading sequence estimation using image-to-video generative models to address 3D misalignment issues in existing methods.
Details
Motivation: Existing monocular normal estimation methods suffer from 3D misalignment where estimated normal maps appear correct but reconstructed surfaces fail to align with geometric details. This stems from models struggling to distinguish varying geometry represented only through subtle color variations in normal maps.
Method: Proposes a new paradigm reformulating normal estimation as shading sequence estimation, which is more sensitive to geometric information. Uses image-to-video generative models to predict shading sequences, then converts them to normal maps via ordinary least-squares. Trained on MultiShade synthetic dataset with diverse shapes, materials, and lighting.
Result: RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
Conclusion: Reformulating normal estimation as shading sequence estimation addresses 3D misalignment issues and improves geometric accuracy through better sensitivity to geometric information.
Abstract: Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.
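The abstract's final conversion step, recovering a normal from a predicted shading sequence via ordinary least squares, is easy to illustrate. The sketch below assumes a Lambertian surface lit by known directional lights and ignores shadows and noise; the light directions and the single-pixel setup are our illustrative choices, not the paper's.
```python
# Per-pixel sketch: given K shading observations under K known unit light
# directions (Lambertian, no attached shadows, noiseless), solve for the
# normal with ordinary least squares, then renormalize to unit length.
import numpy as np

def normal_from_shading(lights: np.ndarray, shadings: np.ndarray) -> np.ndarray:
    """lights: (K, 3) unit light directions; shadings: (K,) intensities.
    Solves min_n ||lights @ n - shadings||_2 and normalizes the result."""
    n, *_ = np.linalg.lstsq(lights, shadings, rcond=None)
    return n / np.linalg.norm(n)

rng = np.random.default_rng(0)
true_n = np.array([0.2, 0.3, 0.93]); true_n /= np.linalg.norm(true_n)
lights = rng.normal(size=(8, 3))
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
shadings = lights @ true_n  # synthetic noiseless shading sequence
print(np.allclose(normal_from_shading(lights, shadings), true_n, atol=1e-6))
```
In RoSE the shading sequence comes from the image-to-video model rather than physical lights, but the per-pixel linear solve is the same.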
[243] Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection
Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Main category: cs.CV
TL;DR: Fake-HR1 is a hybrid-reasoning model for synthetic image detection that adaptively decides when to use Chain-of-Thought reasoning to balance detection accuracy with computational efficiency.
Details
Motivation: While Chain-of-Thought reasoning improves synthetic image detection, excessive reasoning causes resource overhead and latency, especially for obvious forgeries. There's a need for adaptive reasoning that only uses CoT when necessary.
Method: Two-stage training: 1) Hybrid Fine-Tuning for cold-start initialization, 2) Online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization to learn when to select appropriate reasoning modes (with or without CoT).
Result: Fake-HR1 adaptively performs reasoning across different query types, surpassing existing LLMs in both reasoning ability and generative detection performance while significantly improving response efficiency.
Conclusion: The proposed hybrid-reasoning approach effectively balances detection accuracy with computational efficiency by adaptively deciding when to use CoT reasoning for synthetic image detection.
Abstract: Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model’s ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
cs.AI
[244] Discovering Differences in Strategic Behavior Between Humans and LLMs
Caroline Wang, Daniel Kasenberg, Kim Stachenfeld, Pablo Samuel Castro
Main category: cs.AI
TL;DR: AlphaEvolve discovers interpretable models of human and LLM behavior in strategic games, revealing LLMs exhibit deeper strategic reasoning than humans in iterated rock-paper-scissors.
Details
Motivation: As LLMs are increasingly deployed in social and strategic scenarios, there's a critical need to understand where and why their behavior diverges from humans. Existing behavioral game theory models don't fully capture idiosyncratic behavior of humans or black-box LLMs.
Method: Uses AlphaEvolve, a program discovery tool, to directly discover interpretable models of human and LLM behavior from data, enabling open-ended discovery of structural factors driving behavior. Applied to iterated rock-paper-scissors.
Result: Frontier LLMs demonstrate deeper strategic behavior than humans in iterated rock-paper-scissors. The method provides interpretable models that reveal structural differences in decision-making.
Conclusion: The approach provides a foundation for understanding structural differences driving human vs. LLM behavior in strategic interactions, with implications for deploying LLMs in social contexts.
Abstract: As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.
[245] EventCast: Hybrid Demand Forecasting in E-Commerce with LLM-Based Event Knowledge
Congcong Hu, Yuang Shi, Fan Huang, Yang Xiang, Zhou Ye, Ming Jin, Shiyu Wang
Main category: cs.AI
TL;DR: EventCast is a modular forecasting framework that integrates future event knowledge into time-series prediction for e-commerce demand forecasting, using LLMs for event-driven reasoning rather than direct numerical forecasting.
Details
Motivation: Existing forecasting systems fail during high-impact periods like flash sales, holidays, and policy interventions where demand patterns shift abruptly. There's a need to incorporate future event knowledge into forecasting models to handle these dynamic e-commerce scenarios.
Method: EventCast uses LLMs to process unstructured business data (campaigns, holidays, seller incentives) into interpretable textual summaries, leveraging world knowledge for cultural nuances. These summaries are fused with historical demand features in a dual-tower architecture for scalable forecasting.
Result: Deployed across 4 countries, 160 regions over 10 months, EventCast achieved up to 86.9% improvement on MAE and 97.7% on MSE compared to variants without event knowledge, and reduced MAE by 57.0% and MSE by 83.3% versus best industrial baselines during event-driven periods.
Conclusion: EventCast provides a practical solution for improving operational decision-making in dynamic e-commerce environments by effectively integrating future event knowledge through LLM-based reasoning, deployed in real-world industrial pipelines since March 2025.
Abstract: Demand forecasting is a cornerstone of e-commerce operations, directly impacting inventory planning and fulfillment scheduling. However, existing forecasting systems often fail during high-impact periods such as flash sales, holiday campaigns, and sudden policy interventions, where demand patterns shift abruptly and unpredictably. In this paper, we introduce EventCast, a modular forecasting framework that integrates future event knowledge into time-series prediction. Unlike prior approaches that ignore future interventions or directly use large language models (LLMs) for numerical forecasting, EventCast leverages LLMs solely for event-driven reasoning. Unstructured business data from existing operational databases, covering campaigns, holiday schedules, and seller incentives, is processed by an LLM that converts it into interpretable textual summaries, leveraging world knowledge to handle cultural nuances and novel event combinations. These summaries are fused with historical demand features within a dual-tower architecture, enabling accurate, explainable, and scalable forecasts. Deployed on real-world e-commerce scenarios spanning 160 regions across 4 countries over 10 months, EventCast achieves up to 86.9% and 97.7% improvement on MAE and MSE compared to the variant without event knowledge, and reduces MAE by up to 57.0% and MSE by 83.3% versus the best industrial baseline during event-driven periods. EventCast has been deployed in real-world industrial pipelines since March 2025, offering a practical solution for improving operational decision-making in dynamic e-commerce environments.
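A dual-tower fusion of this kind is straightforward to prototype. The PyTorch sketch below is illustrative only: the layer sizes, the 28-day history window, and the 7-day horizon are our assumptions, and `event_emb` stands in for a pre-computed embedding of the LLM's textual event summary.
```python
# Minimal dual-tower sketch: one tower encodes the (pre-embedded) LLM event
# summary, the other encodes the historical demand series; the concatenated
# representations feed a multi-step forecast head. Dimensions are assumed.
import torch
import torch.nn as nn

class DualTowerForecaster(nn.Module):
    def __init__(self, event_dim=768, history_len=28, hidden=128, horizon=7):
        super().__init__()
        self.event_tower = nn.Sequential(nn.Linear(event_dim, hidden), nn.ReLU())
        self.demand_tower = nn.Sequential(nn.Linear(history_len, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, horizon)  # multi-step demand forecast

    def forward(self, event_emb, demand_history):
        z = torch.cat([self.event_tower(event_emb),
                       self.demand_tower(demand_history)], dim=-1)
        return self.head(z)

model = DualTowerForecaster()
event_emb = torch.randn(4, 768)   # embeddings of the LLM's event summaries
history = torch.randn(4, 28)      # 4 regions x 28 days of demand history
print(model(event_emb, history).shape)  # torch.Size([4, 7])
```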
[246] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun
Main category: cs.AI
TL;DR: LiveMedBench: A continuously updated, contamination-free medical benchmark using real-world clinical cases with automated rubric-based evaluation to address data contamination and temporal misalignment in LLM evaluation.
Details
Motivation: Existing medical benchmarks have critical limitations: data contamination (test sets leaking into training data), temporal misalignment (failing to capture evolving medical knowledge), and inadequate evaluation metrics (shallow lexical overlap or subjective LLM-as-a-Judge scoring).
Method: 1) Weekly harvesting of real-world clinical cases from online medical communities with strict temporal separation from training data; 2) Multi-Agent Clinical Curation Framework to filter noise and validate clinical integrity; 3) Automated Rubric-based Evaluation Framework that decomposes physician responses into granular case-specific criteria.
Result: LiveMedBench contains 2,756 real-world cases across 38 specialties and multiple languages with 16,702 unique evaluation criteria. Evaluation of 38 LLMs shows best model achieves only 39.2% accuracy, with 84% of models showing performance degradation on post-cutoff cases, confirming data contamination risks. Error analysis reveals contextual application (35-48% of failures) as the dominant bottleneck over factual knowledge.
Conclusion: LiveMedBench addresses critical gaps in medical LLM evaluation by providing a contamination-free, continuously updated benchmark with reliable rubric-based assessment, revealing significant challenges in LLMs’ ability to apply medical knowledge to patient-specific contexts.
Abstract: The deployment of Large Language Models (LLMs) in high-stakes clinical settings demands rigorous and reliable evaluation. However, existing medical benchmarks remain static, suffering from two critical limitations: (1) data contamination, where test sets inadvertently leak into training corpora, leading to inflated performance estimates; and (2) temporal misalignment, failing to capture the rapid evolution of medical knowledge. Furthermore, current evaluation metrics for open-ended clinical reasoning often rely on either shallow lexical overlap (e.g., ROUGE) or subjective LLM-as-a-Judge scoring, both inadequate for verifying clinical correctness. To bridge these gaps, we introduce LiveMedBench, a continuously updated, contamination-free, and rubric-based benchmark that weekly harvests real-world clinical cases from online medical communities, ensuring strict temporal separation from model training data. We propose a Multi-Agent Clinical Curation Framework that filters raw data noise and validates clinical integrity against evidence-based medical principles. For evaluation, we develop an Automated Rubric-based Evaluation Framework that decomposes physician responses into granular, case-specific criteria, achieving substantially stronger alignment with expert physicians than LLM-as-a-Judge. To date, LiveMedBench comprises 2,756 real-world cases spanning 38 medical specialties and multiple languages, paired with 16,702 unique evaluation criteria. Extensive evaluation of 38 LLMs reveals that even the best-performing model achieves only 39.2%, and 84% of models exhibit performance degradation on post-cutoff cases, confirming pervasive data contamination risks. Error analysis further identifies contextual application, not factual knowledge, as the dominant bottleneck, with 35-48% of failures stemming from the inability to tailor medical knowledge to patient-specific constraints.
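At scoring time, rubric-based evaluation of this style reduces to checking a response against weighted, case-specific criteria. The sketch below shows only that final step; the criteria, weights, and hard-coded judge outputs are hypothetical examples, and in the benchmark an automated judge produces the per-criterion booleans.
```python
# Illustrative rubric scoring: fraction of weighted criteria a response meets.

def rubric_score(criteria: list[dict], judgments: dict[str, bool]) -> float:
    """criteria: [{'id': ..., 'weight': ...}]; judgments: criterion id -> met?"""
    total = sum(c["weight"] for c in criteria)
    earned = sum(c["weight"] for c in criteria if judgments.get(c["id"], False))
    return earned / total if total else 0.0

criteria = [  # hypothetical case-specific criteria with relative weights
    {"id": "names_correct_diagnosis", "weight": 3.0},
    {"id": "orders_confirmatory_test", "weight": 2.0},
    {"id": "adjusts_for_renal_function", "weight": 2.0},  # patient-specific
]
judgments = {"names_correct_diagnosis": True,
             "orders_confirmatory_test": True,
             "adjusts_for_renal_function": False}
print(f"rubric score: {rubric_score(criteria, judgments):.2f}")  # 0.71
```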
[247] Found-RL: foundation model-enhanced reinforcement learning for autonomous driving
Yansong Qu, Zihao Sheng, Zilin Huang, Jiancong Chen, Yuhao Luo, Tianyi Wang, Yiheng Feng, Samuel Labi, Sikai Chen
Main category: cs.AI
TL;DR: Found-RL: A platform that efficiently integrates Vision-Language Models (VLMs) with Reinforcement Learning for autonomous driving, using asynchronous batch inference to overcome VLM latency issues and novel supervision mechanisms for policy distillation.
Details
Motivation: RL suffers from sample inefficiency and lack of semantic interpretability in complex autonomous driving scenarios, while VLMs offer rich context-aware knowledge but have high inference latency that hinders deployment in high-frequency RL training loops.
Method: Asynchronous batch inference framework decouples heavy VLM reasoning from simulation loop; introduces Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to distill VLM action suggestions; uses CLIP for dense reward shaping with Conditional Contrastive Action Alignment to address CLIP's dynamic blindness.
Result: Lightweight RL model achieves near-VLM performance compared to billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS), effectively resolving latency bottlenecks.
Conclusion: Found-RL provides an end-to-end pipeline for efficient VLM integration with RL for autonomous driving, enabling semantic interpretability and sample efficiency without sacrificing real-time performance.
Abstract: Reinforcement Learning (RL) has emerged as a dominant paradigm for end-to-end autonomous driving (AD). However, RL suffers from sample inefficiency and a lack of semantic interpretability in complex scenarios. Foundation Models, particularly Vision-Language Models (VLMs), can mitigate this by offering rich, context-aware knowledge, yet their high inference latency hinders deployment in high-frequency RL training loops. To bridge this gap, we present Found-RL, a platform tailored to efficiently enhance RL for AD using foundation models. A core innovation is the asynchronous batch inference framework, which decouples heavy VLM reasoning from the simulation loop, effectively resolving latency bottlenecks to support real-time learning. We introduce diverse supervision mechanisms: Value-Margin Regularization (VMR) and Advantage-Weighted Action Guidance (AWAG) to effectively distill expert-like VLM action suggestions into the RL policy. Additionally, we adopt high-throughput CLIP for dense reward shaping. We address CLIP’s dynamic blindness via Conditional Contrastive Action Alignment, which conditions prompts on discretized speed/command and yields a normalized, margin-based bonus from context-specific action-anchor scoring. Found-RL provides an end-to-end pipeline for fine-tuned VLM integration and shows that a lightweight RL model can achieve near-VLM performance compared with billion-parameter VLMs while sustaining real-time inference (approx. 500 FPS). Code, data, and models will be publicly available at https://github.com/ys-qu/found-rl.
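The core systems idea, decoupling slow VLM calls from a high-frequency training loop, can be sketched with two queues and a worker thread. Everything below is an illustrative skeleton: `vlm_infer` is a stand-in for a real batched VLM forward pass, and the simple batching policy is our assumption.
```python
# Sketch of asynchronous batch inference: the env/RL loop never blocks on the
# VLM; observations go into a request queue, a worker batches them, and
# suggestions come back on a result queue whenever they are ready.
import queue
import threading
import time

requests, results = queue.Queue(), queue.Queue()

def vlm_infer(batch):                 # placeholder for a heavy VLM forward pass
    time.sleep(0.05)                  # simulate high latency
    return [f"suggestion-for-{obs}" for obs in batch]

def vlm_worker(max_batch=8):
    while True:
        batch = [requests.get()]      # block until at least one observation
        try:
            while len(batch) < max_batch:
                batch.append(requests.get_nowait())
        except queue.Empty:
            pass
        for obs, out in zip(batch, vlm_infer(batch)):
            results.put((obs, out))

threading.Thread(target=vlm_worker, daemon=True).start()

for step in range(20):                # high-frequency env/RL loop, never blocks
    requests.put(f"obs{step}")
    while not results.empty():        # consume whatever supervision is ready
        obs, suggestion = results.get_nowait()
        # ... use `suggestion` here for reward shaping / action guidance ...
time.sleep(0.3)                       # let the worker drain remaining requests
print("results ready:", results.qsize())
```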
[248] MERIT Feedback Elicits Better Bargaining in LLM Negotiators
Jihwan Oh, Murad Aghazada, Yooju Shin, Se-Young Yun, Taehyeon Kim
Main category: cs.AI
TL;DR: A framework for improving LLM bargaining abilities using utility feedback, new benchmark (AgoraBench), human-aligned metrics, and preference-based training
Details
Motivation: LLMs struggle with bargaining due to limited strategic depth and difficulty adapting to complex human factors, and current benchmarks don't capture these limitations.
Method: Created AgoraBench benchmark with 9 challenging settings, developed human-aligned utility-based metrics (agent utility, negotiation power, acquisition ratio), and built human preference dataset with learning pipeline for prompting and finetuning.
Result: Baseline LLM strategies diverge from human preferences, while their mechanism substantially improves negotiation performance with deeper strategic behavior and stronger opponent awareness
Conclusion: The utility feedback framework effectively enhances LLMs’ bargaining abilities by aligning them with human preferences and strategic considerations
Abstract: Bargaining is often regarded as a logical arena rather than an art or a matter of intuition, yet Large Language Models (LLMs) still struggle to navigate it due to limited strategic depth and difficulty adapting to complex human factors. Current benchmarks rarely capture this limitation. To bridge this gap, we present a utility-feedback-centric framework. Our contributions are: (i) AgoraBench, a new benchmark spanning nine challenging settings (e.g., deception, monopoly) that supports diverse strategy modeling; (ii) human-aligned, economically grounded metrics derived from utility theory, operationalized via agent utility, negotiation power, and acquisition ratio, which implicitly measure how well the negotiation aligns with human preferences; and (iii) a human-preference-grounded dataset with a learning pipeline that strengthens LLMs' bargaining ability through both prompting and finetuning. Empirical results indicate that baseline LLM strategies often diverge from human preferences, while our mechanism substantially improves negotiation performance, yielding deeper strategic behavior and stronger opponent awareness.
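The abstract names agent utility, negotiation power, and acquisition ratio but does not give formulas, so the definitions below are illustrative guesses for a single buyer-seller deal, not the benchmark's actual implementation.
```python
# Hypothetical utility-theory-style metrics for one negotiated price.

def agent_utility(deal_price: float, reservation: float, is_buyer: bool) -> float:
    """Surplus relative to the agent's reservation price (assumed definition)."""
    return (reservation - deal_price) if is_buyer else (deal_price - reservation)

def acquisition_ratio(own_utility: float, total_surplus: float) -> float:
    """Share of the joint surplus this agent captures (assumed definition)."""
    return own_utility / total_surplus if total_surplus > 0 else 0.0

buyer_res, seller_res, price = 120.0, 80.0, 95.0
u_buyer = agent_utility(price, buyer_res, is_buyer=True)     # 25.0
u_seller = agent_utility(price, seller_res, is_buyer=False)  # 15.0
print(u_buyer, u_seller, acquisition_ratio(u_buyer, u_buyer + u_seller))  # 0.625
```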
[249] Abstraction Generation for Generalized Planning with Pretrained Large Language Models
Zhenhe Cui, Huaxiang Xia, Hangjun Shen, Kailun Luo, Yong He, Wei Liang
Main category: cs.AI
TL;DR: LLMs can generate Qualitative Numerical Planning abstractions for generalized planning problems with automated debugging to fix abstraction errors
Details
Motivation: To investigate whether large language models can serve as QNP abstraction generators for generalized planning problems and develop methods to fix abstractions through automated debugging.
Method: Proposed prompt protocol where LLMs are given GP domains and training tasks to generate abstract features and create QNP abstractions, combined with automated debugging to detect and fix abstraction errors.
Result: Experiments show that with proper guidance from automated debugging, some LLMs can generate useful QNP abstractions for generalized planning problems
Conclusion: LLMs can function as QNP abstraction generators when guided by automated debugging methods, demonstrating potential for AI planning applications
Abstract: Qualitative Numerical Planning (QNP) serves as an important abstraction model for generalized planning (GP), which aims to compute general plans that solve multiple instances at once. Recent works show that large language models (LLMs) can function as generalized planners. This work investigates whether LLMs can serve as QNP abstraction generators for GP problems and how to fix abstractions via automated debugging. We propose a prompt protocol: input a GP domain and training tasks to LLMs, prompting them to generate abstract features and further abstract the initial state, action set, and goal into QNP problems. An automated debugging method is designed to detect abstraction errors, guiding LLMs to fix abstractions. Experiments demonstrate that, when properly guided by automated debugging, some LLMs can generate useful QNP abstractions.
[250] Flow of Spans: Generalizing Language Models to Dynamic Span-Vocabulary via GFlowNets
Bo Xue, Yunchong Song, Fanghao Shao, Xuekai Zhu, Lin Chen, Luoyi Fu, Xinbing Wang, Zhouhan Lin
Main category: cs.AI
TL;DR: FoSS introduces a GFlowNets framework for span-level text generation with dynamic vocabulary and DAG-structured state space, improving text diversity and quality over token-level approaches.
Details
Motivation: Standard autoregressive language models have limited flexibility due to fixed vocabulary and tree-structured state space. Recent span retrieval methods overlook that sentences can be composed of varying-length spans, lacking explicit DAG modeling, which restricts compositional path exploration and introduces bias.
Method: Proposes Flow of SpanS (FoSS), a GFlowNets framework for span generation. Constructs dynamic span vocabulary by flexibly segmenting retrieved text to ensure DAG-structured state space, allowing exploration of diverse compositional paths. Uses specialized reward models to guide generation.
Result: Improves MAUVE scores by up to 12.5% over Transformer on text generation, achieves 3.5% gains on knowledge-intensive tasks, consistently outperforms SOTA methods. Scaling experiments show benefits from larger models, more data, and richer retrieval corpora.
Conclusion: FoSS provides a principled GFlowNets framework for span generation that overcomes limitations of token-level approaches, enabling better exploration of compositional paths and improved text generation quality and diversity.
Abstract: Standard autoregressive language models generate text token-by-token from a fixed vocabulary, inducing a tree-structured state space when viewing token sampling as an action, which limits flexibility and expressiveness. Recent work introduces dynamic vocabulary by sampling retrieved text spans but overlooks that the same sentence can be composed of spans of varying lengths, lacking explicit modeling of the directed acyclic graph (DAG) state space. This leads to restricted exploration of compositional paths and is biased toward the chosen path. Generative Flow Networks (GFlowNets) are powerful for efficient exploring and generalizing over state spaces, particularly those with a DAG structure. However, prior GFlowNets-based language models operate at the token level and remain confined to tree-structured spaces, limiting their potential. In this work, we propose Flow of SpanS (FoSS), a principled GFlowNets framework for span generation. FoSS constructs a dynamic span vocabulary by segmenting the retrieved text flexibly, ensuring a DAG-structured state space, which allows GFlowNets to explore diverse compositional paths and improve generalization. With specialized reward models, FoSS generates diverse, high-quality text. Empirically, FoSS improves MAUVE scores by up to 12.5% over Transformer on text generation and achieves 3.5% gains on knowledge-intensive tasks, consistently outperforming state-of-the-art methods. Scaling experiments further demonstrate FoSS benefits from larger models, more data, and richer retrieval corpora, retaining its advantage over strong baselines.
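The DAG structure the abstract appeals to is easy to visualize: positions in the target text are nodes, and every vocabulary span that matches at a position adds an edge, so one sentence decomposes into many compositional paths. The toy sketch below (our illustration, with a made-up span vocabulary) enumerates those paths.
```python
# Enumerate all segmentations of a sentence into spans from a (toy) span
# vocabulary. Each segmentation is one path through the DAG of positions.
from functools import lru_cache

text = "the cat sat"
span_vocab = {"the", "cat", "sat", "the cat", "cat sat", "the cat sat"}
tokens = text.split()

@lru_cache(maxsize=None)
def paths(i: int):
    """All segmentations of tokens[i:] into vocabulary spans (DAG paths)."""
    if i == len(tokens):
        return [[]]
    out = []
    for j in range(i + 1, len(tokens) + 1):
        span = " ".join(tokens[i:j])
        if span in span_vocab:
            out.extend([span] + rest for rest in paths(j))
    return out

for p in paths(0):
    print(" | ".join(p))
# Four distinct paths compose the same sentence; a token-level, tree-structured
# model commits to exactly one of them.
```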
[251] Neuro-symbolic Action Masking for Deep Reinforcement Learning
Shuai Han, Mehdi Dastani, Shihan Wang
Main category: cs.AI
TL;DR: NSAM is a neuro-symbolic framework that automatically learns symbolic models from high-dimensional states and generates action masks to prevent DRL agents from taking infeasible actions, improving sample efficiency and reducing constraint violations.
Details
Motivation: DRL agents often explore infeasible actions during training and execution, requiring manual specification of action masks and symbolic grounding functions. The authors aim to automate this process by learning symbolic models directly from high-dimensional states.
Method: NSAM learns symbolic models from high-dimensional states in a minimally supervised manner during DRL training. It uses these learned symbolic models to generate action masks that rule out infeasible actions, enabling end-to-end integration of symbolic reasoning and deep policy optimization.
Result: NSAM significantly improves sample efficiency of DRL agents while substantially reducing constraint violations across multiple domains with constraints.
Conclusion: The framework successfully automates symbolic model learning and action masking, enabling more efficient and constraint-compliant DRL training without requiring manual specification of symbolic grounding functions.
Abstract: Deep reinforcement learning (DRL) may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations and manually specified action-masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models of high-dimensional states, consistent with given domain constraints, in a minimally supervised manner during the DRL process. Based on the learned symbolic model of states, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple domains with constraints, and experimental results demonstrate that NSAM significantly improves the sample efficiency of the DRL agent while substantially reducing constraint violations.
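Action masking itself is a standard mechanism and easy to show: infeasible actions get their logits set to negative infinity before the softmax, so the policy can never sample them. In the sketch below the symbolic state and feasibility rule are toy assumptions; NSAM's contribution is learning that symbolic model rather than hand-writing it.
```python
# Minimal action-masking sketch: mask infeasible logits to -inf pre-softmax.
import torch

def masked_policy(logits: torch.Tensor, feasible: torch.Tensor) -> torch.Tensor:
    masked = logits.masked_fill(~feasible, float("-inf"))
    return torch.softmax(masked, dim=-1)

# Toy symbolic state: the agent holds no key, so "open_door" is infeasible.
actions = ["move_left", "move_right", "open_door", "pick_key"]
symbolic_state = {"holding_key": False}
feasible = torch.tensor([True, True, symbolic_state["holding_key"], True])

logits = torch.randn(len(actions))
probs = masked_policy(logits, feasible)
print({a: round(p.item(), 3) for a, p in zip(actions, probs)})  # open_door -> 0.0
```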
[252] To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu, Xing Xie
Main category: cs.AI
TL;DR: Reasoning models don’t consistently outperform non-reasoning models on Theory of Mind tasks, revealing limitations in transferring formal reasoning capabilities to social reasoning.
Details
Motivation: To investigate whether the reasoning capabilities developed in Large Reasoning Models (LRMs) for formal tasks like mathematics and coding transfer effectively to socio-cognitive skills like Theory of Mind, which is essential for natural social interaction.
Method: Systematic study of nine advanced LLMs comparing reasoning vs. non-reasoning models on three representative ToM benchmarks, with fine-grained analysis, intervention approaches (Slow-to-Fast adaptive reasoning and Think-to-Match shortcut prevention), and option removal experiments.
Result: Reasoning models don’t consistently outperform non-reasoning models on ToM tasks; longer reasoning hurts performance; option matching shortcut exists; adaptive reasoning helps; formal reasoning capabilities don’t fully transfer to social reasoning.
Conclusion: Achieving robust Theory of Mind requires developing unique capabilities beyond existing reasoning methods, as advancements in formal reasoning don’t fully transfer to social reasoning tasks.
Abstract: Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, option matching shortcut: when multiple choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches: Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention to further verify and mitigate the problems. With all results, our study highlights the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.
[253] OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization
Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang
Main category: cs.AI
TL;DR: HARPO is a heterogeneity-aware RL method that balances learning across diverse behavioral tasks to train Omnisapiens-7B 2.0, a foundation model for social behavior processing that outperforms existing models.
Details
Motivation: Existing approaches model human behavioral dimensions in isolation, limiting generalization across behavioral settings. While recent reasoning RL methods enable training unified models across multiple tasks, they don't explicitly address learning across heterogeneous behavioral data.
Method: Heterogeneity-Aware Relative Policy Optimization (HARPO) modulates advantages during policy optimization to ensure no single task or sample carries disproportionate influence, balancing learning across heterogeneous tasks and samples.
Result: Omnisapiens-7B 2.0 achieves strongest performance across behavioral tasks with gains up to +16.85% on multitask and +9.37% on held-out settings, producing more explicit and robust reasoning traces. HARPO outperforms recent RL methods across behavioral tasks.
Conclusion: HARPO enables effective training of unified foundation models for social behavior processing by addressing heterogeneity in behavioral data, leading to superior performance and generalization across diverse behavioral tasks.
Abstract: To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
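The abstract describes modulating advantages so no task or sample dominates, without giving the exact rule. A simple stand-in, per-task standardization plus clipping, conveys the idea; the sketch below is our illustration, not HARPO's estimator.
```python
# Illustrative heterogeneity-aware advantage modulation: z-score advantages
# within each task group, then clip per-sample magnitude, so tasks with very
# different reward scales contribute comparable learning signal.
import numpy as np

def modulate_advantages(adv: np.ndarray, task_ids: np.ndarray, clip: float = 2.0):
    out = np.empty_like(adv, dtype=float)
    for t in np.unique(task_ids):
        m = task_ids == t
        mu, sd = adv[m].mean(), adv[m].std() + 1e-8
        out[m] = np.clip((adv[m] - mu) / sd, -clip, clip)  # per-task z-score
    return out

adv = np.array([10.0, 12.0, 8.0, 0.1, 0.3, -0.2])  # two tasks, different scales
tasks = np.array([0, 0, 0, 1, 1, 1])
print(modulate_advantages(adv, tasks).round(2))
```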
[254] Spend Search Where It Pays: Value-Guided Structured Sampling and Optimization for Generative Recommendation
Jie Jiang, Yangru Huang, Zeyu Wang, Changping Wang, Yuling Xiong, Jun Zhang, Huan Yu
Main category: cs.AI
TL;DR: V-STAR addresses probability-reward mismatch in RL-finetuned generative recommendation models by combining value-guided decoding with tree-structured advantage reinforcement learning to improve exploration and learning signals.
Details
Motivation: Fine-tuning generative recommendation models with RL suffers from probability-reward mismatch where likelihood-dominated decoding causes insufficient exploration (pruning high-reward low-probability items) and advantage compression (correlated rewards for similar prefixes with weak comparative signals).
Method: V-STAR framework with two components: 1) Value-Guided Efficient Decoding (VED) identifies decisive nodes and selectively deepens high-potential prefixes to improve exploration without exhaustive search; 2) Sibling-GRPO exploits tree topology to compute sibling-relative advantages, concentrating learning signals on decisive branching decisions.
Result: Extensive experiments on offline and online datasets show V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
Conclusion: V-STAR effectively addresses the probability-reward mismatch in RL-finetuned generative recommendation by combining value-guided exploration with tree-structured advantage learning, improving both performance and efficiency.
Abstract: Generative recommendation via autoregressive models has unified retrieval and ranking into a single conditional generation framework. However, fine-tuning these models with Reinforcement Learning (RL) often suffers from a fundamental probability-reward mismatch. Conventional likelihood-dominated decoding (e.g., beam search) exhibits a myopic bias toward locally probable prefixes, which causes two critical failures: (1) insufficient exploration, where high-reward items in low-probability branches are prematurely pruned and rarely sampled, and (2) advantage compression, where trajectories sharing high-probability prefixes receive highly correlated rewards with low within-group variance, yielding a weak comparative signal for RL. To address these challenges, we propose V-STAR, a Value-guided Sampling and Tree-structured Advantage Reinforcement framework. V-STAR forms a self-evolving loop via two synergistic components. First, a Value-Guided Efficient Decoding (VED) is developed to identify decisive nodes and selectively deepen high-potential prefixes. This improves exploration efficiency without exhaustive tree search. Second, we propose Sibling-GRPO, which exploits the induced tree topology to compute sibling-relative advantages and concentrates learning signals on decisive branching decisions. Extensive experiments on both offline and online datasets demonstrate that V-STAR outperforms state-of-the-art baselines, delivering superior accuracy and candidate-set diversity under strict latency constraints.
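Sibling-relative advantages are simple to compute once trajectories are organized as a tree: each child's reward is baselined against the mean of its siblings, so the learning signal localizes to the branching decision. The tree and reward values below are toy examples; V-STAR's full estimator is not reproduced here.
```python
# Toy sibling-relative advantage: compare each child's reward-to-go with the
# mean over its siblings at the same branching node.
from collections import defaultdict

# (parent, child, reward-to-go for trajectories through that child)
edges = [("root", "a", 1.0), ("root", "b", 0.2), ("root", "c", 0.3),
         ("a", "a1", 1.5), ("a", "a2", 0.5)]

children = defaultdict(list)
for parent, child, r in edges:
    children[parent].append((child, r))

def sibling_advantages(parent: str) -> dict[str, float]:
    sibs = children[parent]
    mean_r = sum(r for _, r in sibs) / len(sibs)
    return {c: r - mean_r for c, r in sibs}  # advantage vs. sibling baseline

print(sibling_advantages("root"))  # {'a': 0.5, 'b': -0.3, 'c': -0.2}
print(sibling_advantages("a"))     # {'a1': 0.5, 'a2': -0.5}
```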
[255] Integrating Generative AI-enhanced Cognitive Systems in Higher Education: From Stakeholder Perceptions to a Conceptual Framework considering the EU AI Act
Da-Lun Chen, Prasasthy Balasubramanian, Lauri Lovén, Susanna Pirttikangas, Jaakko Sauvola, Panagiotis Kostakos
Main category: cs.AI
TL;DR: Study examines perceptions of GenAI in ITEE disciplines at University of Oulu, revealing programming support interest and concerns about quality/privacy, proposing framework for responsible integration.
Details
Motivation: GenAI adoption in higher education is growing but perceptions vary by discipline and context. EU AI Act requires regulatory compliance, creating need for stakeholder engagement and tailored integration approaches specific to different academic fields.
Method: Mixed-method approach surveying 61 staff and 37 students at Faculty of Information Technology and Electrical Engineering (ITEE), University of Oulu, analyzing perceptions and identifying discipline-specific themes.
Result: Revealed shared and discipline-specific themes: strong interest in GenAI for programming support, but concerns about response quality, privacy, and academic integrity. Identified high-level requirements and proposed conceptual framework for responsible GenAI integration.
Conclusion: Disciplinary-specific requirements highlight importance of stakeholder engagement. Proposed framework provides practical guidance for universities to harness GenAI while addressing concerns and ensuring regulatory compliance.
Abstract: Many staff and students in higher education have adopted generative artificial intelligence (GenAI) tools in their work and study. GenAI is expected to enhance cognitive systems by enabling personalized learning and streamlining educational services. However, stakeholders' perceptions of GenAI in higher education remain divided, shaped by cultural, disciplinary, and institutional contexts. In addition, the EU AI Act requires universities to ensure regulatory compliance when deploying cognitive systems. These developments highlight the need for institutions to engage stakeholders and tailor GenAI integration to their needs while addressing concerns. This study investigates how GenAI is perceived within the disciplines of Information Technology and Electrical Engineering (ITEE). Using a mixed-method approach, we surveyed 61 staff and 37 students at the Faculty of ITEE, University of Oulu. The results reveal both shared and discipline-specific themes, including strong interest in programming support from GenAI and concerns over response quality, privacy, and academic integrity. Drawing from these insights, the study identifies a set of high-level requirements and proposes a conceptual framework for responsible GenAI integration. Disciplinary-specific requirements reinforce the importance of stakeholder engagement when integrating GenAI into higher education. The high-level requirements and the framework provide practical guidance for universities aiming to harness GenAI while addressing stakeholder concerns and ensuring regulatory compliance.
[256] See, Plan, Snap: Evaluating Multimodal GUI Agents in Scratch
Xingyi Zhang, Yulei Ye, Kaifeng Huang, Wenhao Li, Xiangfeng Wang
Main category: cs.AI
TL;DR: ScratchWorld benchmark evaluates multimodal GUI agents on Scratch programming tasks with two interaction modes to diagnose reasoning vs. execution failures.
Details
Motivation: There's a gap in evaluating AI agents' capabilities to construct programs through graphical user interfaces (GUIs) in low-code education environments like Scratch, despite their importance in programming education.
Method: Created ScratchWorld benchmark with 83 tasks across Create, Debug, Extend, and Compute categories based on Use-Modify-Create pedagogy. Uses two interaction modes: primitive mode (fine-grained drag-and-drop) and composite mode (high-level semantic APIs) to separate program reasoning from GUI execution. Employs execution-based evaluation via runtime tests in browser environment.
Result: Experiments with state-of-the-art multimodal models and GUI agents reveal a substantial reasoning-acting gap - agents show strong planning capabilities but struggle with fine-grained GUI manipulation.
Conclusion: ScratchWorld provides a comprehensive benchmark for evaluating multimodal GUI agents, highlighting persistent challenges in visuomotor control despite advances in program reasoning.
Abstract: Block-based programming environments such as Scratch play a central role in low-code education, yet evaluating the capabilities of AI agents to construct programs through Graphical User Interfaces (GUIs) remains underexplored. We introduce ScratchWorld, a benchmark for evaluating multimodal GUI agents on program-by-construction tasks in Scratch. Grounded in the Use-Modify-Create pedagogical framework, ScratchWorld comprises 83 curated tasks spanning four distinct problem categories: Create, Debug, Extend, and Compute. To rigorously diagnose the source of agent failures, the benchmark employs two complementary interaction modes: primitive mode requires fine-grained drag-and-drop manipulation to directly assess visuomotor control, while composite mode uses high-level semantic APIs to disentangle program reasoning from GUI execution. To ensure reliable assessment, we propose an execution-based evaluation protocol that validates the functional correctness of the constructed Scratch programs through runtime tests within the browser environment. Extensive experiments across state-of-the-art multimodal language models and GUI agents reveal a substantial reasoning–acting gap, highlighting persistent challenges in fine-grained GUI manipulation despite strong planning capabilities.
[257] SynergyKGC: Reconciling Topological Heterogeneity in Knowledge Graph Completion via Topology-Aware Synergy
Xuecheng Zou, Yu Tang, Bingbing Wang
Main category: cs.AI
TL;DR: SynergyKGC is an adaptive framework for Knowledge Graph Completion that addresses structural resolution mismatch by using cross-modal synergy experts and density-dependent identity anchoring to reconcile heterogeneous topological structures.
Details
Motivation: Existing KGC methods suffer from "structural resolution mismatch": they fail to reconcile divergent representational demands across varying graph densities, leading to structural noise interference in dense clusters and catastrophic representation collapse in sparse regions.
Method: SynergyKGC advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. It couples a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture to reconcile topological heterogeneity while ensuring representational stability.
Result: Systematic evaluations on two public benchmarks validate the superiority of SynergyKGC in significantly boosting KGC hit rates, providing empirical evidence for resilient information integration in non-homogeneous structured data.
Conclusion: SynergyKGC effectively addresses structural resolution mismatch in KGC through adaptive cross-modal synergy and density-dependent anchoring, offering a generalized principle for resilient information integration in heterogeneous structured data.
Abstract: Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre-trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical “structural resolution mismatch,” failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. By coupling a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non-homogeneous structured data.
[258] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics
Leheng Sheng, Wenchang Ma, Ruixin Hong, Xiang Wang, An Zhang, Tat-Seng Chua
Main category: cs.AI
TL;DR: RLCER is a reinforcement learning method that autonomously rewards chain-of-thought reasoning using self-proposed and self-evolving rubrics, eliminating the need for human annotation while outperforming outcome-centric approaches.
Details
Motivation: Chain-of-thought reasoning is crucial for LLMs but difficult to reward directly due to heavy human labeling requirements, evolving CoT distributions, and reward hacking issues. The paper seeks an autonomous approach that requires no human annotation and can evolve gradually.
Method: Proposes RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. The method autonomously generates evaluation criteria that evolve over time without human intervention.
Result: Self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. When used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
Conclusion: RLCER demonstrates that autonomous, self-evolving rubrics can effectively supervise chain-of-thought reasoning without human annotation, offering a scalable solution for improving LLM reasoning capabilities through reinforcement learning.
Abstract: Despite chain-of-thought (CoT) playing crucial roles in LLM reasoning, directly rewarding it is difficult: training a reward model demands heavy human labeling efforts, and static RMs struggle with evolving CoT distributions and reward hacking. These challenges motivate us to seek an autonomous CoT rewarding approach that requires no human annotation efforts and can evolve gradually. Inspired by recent self-evolving training methods, we propose RLCER (Reinforcement Learning with CoT Supervision via Self-Evolving Rubrics), which enhances the outcome-centric RLVR by rewarding CoTs with self-proposed and self-evolving rubrics. We show that self-proposed and self-evolving rubrics provide reliable CoT supervision signals even without outcome rewards, enabling RLCER to outperform outcome-centric RLVR. Moreover, when used as in-prompt hints, these self-proposed rubrics further improve inference-time performance.
[259] Can LLMs Cook Jamaican Couscous? A Study of Cultural Novelty in Recipe Generation
F. Carichon, R. Rampa, G. Farnadi
Main category: cs.AI
TL;DR: LLMs fail to produce culturally representative adaptations in cooking recipes, showing no correlation with cultural distance unlike humans, due to weak cultural information preservation and misunderstanding of creativity/tradition concepts.
Details
Motivation: LLMs are increasingly used for cultural content generation but exhibit systematic cultural biases, raising concerns about stereotyping, homogenization, and erasure of culturally specific expressions. Understanding whether LLMs can meaningfully align with diverse cultures beyond dominant ones remains a critical challenge.
Method: Study cultural adaptation in LLMs through cooking recipes using the GlobalFusion dataset, which pairs human recipes from different countries according to cultural distance measures. Generate culturally adapted recipes with multiple LLMs for the same country pairs, enabling direct comparison between human and LLM behavior in cross-cultural content creation.
Result: LLMs fail to produce culturally representative adaptations - their generated recipe divergence does not correlate with cultural distance (unlike humans). Cultural information is weakly preserved in internal model representations, models inflate novelty by misunderstanding creativity/tradition concepts, and they fail to identify adaptation with associated countries and ground it in culturally salient elements like ingredients.
Conclusion: Current LLMs have fundamental limitations for culturally oriented generation, with important implications for their use in culturally sensitive applications. The findings highlight the gap between human cultural adaptation and LLM performance in this domain.
Abstract: Large Language Models (LLMs) are increasingly used to generate and shape cultural content, ranging from narrative writing to artistic production. While these models demonstrate impressive fluency and generative capacity, prior work has shown that they also exhibit systematic cultural biases, raising concerns about stereotyping, homogenization, and the erasure of culturally specific forms of expression. Understanding whether LLMs can meaningfully align with diverse cultures beyond the dominant ones remains a critical challenge. In this paper, we study cultural adaptation in LLMs through the lens of cooking recipes, a domain in which culture, tradition, and creativity are tightly intertwined. We build on the \textit{GlobalFusion} dataset, which pairs human recipes from different countries according to established measures of cultural distance. Using the same country pairs, we generate culturally adapted recipes with multiple LLMs, enabling a direct comparison between human and LLM behavior in cross-cultural content creation. Our analysis shows that LLMs fail to produce culturally representative adaptations. Unlike humans, the divergence of their generated recipes does not correlate with cultural distance. We further provide explanations for this gap. We show that cultural information is weakly preserved in internal model representations, that models inflate novelty in their production by misunderstanding notions such as creativity and tradition, and that they fail to identify adaptation with its associated countries and to ground it in culturally salient elements such as ingredients. These findings highlight fundamental limitations of current LLMs for culturally oriented generation and have important implications for their use in culturally sensitive applications.
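The paper's central test is a correlation check: human recipe divergence tracks cultural distance while LLM divergence does not. A minimal sketch of that comparison, with made-up numbers purely for illustration:

```python
# Pearson correlation between cultural distance and recipe divergence.
# The country-pair measurements below are invented for illustration.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (cultural_distance, divergence) pairs.
human_pairs = [(0.1, 0.12), (0.4, 0.35), (0.7, 0.66), (0.9, 0.88)]
llm_pairs   = [(0.1, 0.50), (0.4, 0.52), (0.7, 0.49), (0.9, 0.51)]

for name, pairs in [("human", human_pairs), ("llm", llm_pairs)]:
    d, v = zip(*pairs)
    print(name, round(pearson(d, v), 2))  # human ~1.0, llm ~0 (flat)
```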
[260] CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion
Yusong Lin, Haiyang Wang, Shuzhe Wu, Lue Fan, Feiyang Pan, Sanyuan Zhao, Dandan Tu
Main category: cs.AI
TL;DR: CLI-Gym: A method to generate environment-intensive coding tasks at scale by simulating environment histories with Dockerfiles, creating 1,655 tasks for training agents like LiberCoder.
Details
Motivation: Agentic coding requires interaction with runtime environments (like CLI) for tasks like dependency resolution, but there's a lack of scalable environment-intensive task datasets to train agents effectively.
Method: Proposes CLI-Gym: uses agents to simulate environment histories guided by execution feedback, traces histories of healthy environments to invert states to earlier buggy versions, then derives tasks by packaging buggy states with error messages.
Result: Generated 1,655 environment-intensive tasks (largest collection of its kind). Fine-tuned model LiberCoder achieves +21.1% absolute improvement (to 46.1%) on Terminal-Bench, outperforming strong baselines.
Conclusion: First public pipeline for scalable derivation of environment-intensive tasks, enabling better training of coding agents for CLI/environment interaction tasks.
Abstract: Agentic coding requires agents to effectively interact with runtime environments, e.g., command-line interfaces (CLIs), to complete tasks like resolving dependency issues and fixing system problems. However, it remains underexplored how such environment-intensive tasks can be obtained at scale to enhance agents’ capabilities. To address this, based on an analogy between the Dockerfile and the agentic task, we propose to employ agents to simulate and explore environment histories, guided by execution feedback. By tracing the history of a healthy environment, its state can be inverted to an earlier one with runtime failures, from which a task can be derived by packaging the buggy state and the corresponding error messages. With our method, named CLI-Gym, a total of 1,655 environment-intensive tasks are derived, the largest collection of its kind. Moreover, with curated successful trajectories, our fine-tuned model, named LiberCoder, achieves substantial absolute improvements of +21.1% (to 46.1%) on Terminal-Bench, outperforming various strong baselines. To our knowledge, this is the first public pipeline for scalable derivation of environment-intensive tasks.
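A minimal sketch of the environment-inversion idea: replay a healthy environment's command history and, wherever a step fixed a runtime failure, snapshot the state just before it together with the error message as a task. The step/error modeling here is a hypothetical simplification of CLI-Gym's agent-driven exploration.

```python
# Toy "environment inversion": roll a healthy build history back to buggy
# states and package each one, plus its error, as a task.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    command: str                 # e.g. a Dockerfile RUN line
    fixes_error: Optional[str]   # error present before this step ran, if any

@dataclass
class Task:
    buggy_state: List[str]       # commands applied so far
    error_message: str           # what the agent must resolve

def invert(history: List[Step]) -> List[Task]:
    tasks, applied = [], []
    for step in history:
        if step.fixes_error:     # the state *before* this step is a buggy one
            tasks.append(Task(list(applied), step.fixes_error))
        applied.append(step.command)
    return tasks

history = [
    Step("apt-get update", None),
    Step("apt-get install -y libssl-dev", "fatal error: openssl/ssl.h: No such file"),
    Step("pip install mypkg", None),
]
for t in invert(history):
    print(t.error_message, "| state:", t.buggy_state)
```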
[261] GameDevBench: Evaluating Agentic Capabilities Through Game Development
Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue
Main category: cs.AI
TL;DR: GameDevBench: First benchmark for evaluating multimodal coding agents on game development tasks requiring complex multimodal understanding of assets like shaders, sprites, and animations.
Details
Motivation: Progress on multimodal coding agents lags behind text-only coding agents due to lack of evaluation testbeds that combine software development complexity with deep multimodal understanding. Game development provides an ideal testbed as it requires navigating large codebases while manipulating multimodal assets within visual game scenes.
Method: Created GameDevBench with 132 tasks derived from web/video tutorials. Tasks require significant multimodal understanding and are complex (average solution needs 3x more code/file changes than prior benchmarks). Introduced two simple image and video-based feedback mechanisms for agents to improve multimodal capability.
Result: Best agent solves only 54.5% of tasks. Strong correlation between perceived difficulty and multimodal complexity - success rates drop from 46.9% on gameplay tasks to 31.6% on 2D graphics tasks. Image/video feedback methods consistently improve performance, with Claude Sonnet 4.5 improving from 33.3% to 47.7%.
Conclusion: GameDevBench addresses the scarcity of multimodal coding agent evaluation testbeds. Game development tasks reveal current agent limitations in multimodal understanding, and simple visual feedback mechanisms can significantly improve performance. Benchmark released publicly to support research in agentic game development.
Abstract: Despite rapid progress on coding agents, their multimodal counterparts have lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex: the average solution requires over three times the lines of code and file changes of prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5’s performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
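The feedback mechanisms reduce to a simple loop: the agent edits, the harness re-renders the game scene, and the resulting screenshot or frames become part of the next observation. A hypothetical sketch, with every callable a stand-in for the actual harness:

```python
# Hypothetical sketch of image/video feedback for a coding agent: each edit
# is followed by a rendered screenshot that the agent sees before its next step.
from typing import Callable

def agent_loop(agent_step: Callable[[bytes], str],
               apply_patch: Callable[[str], None],
               render: Callable[[], bytes],
               check: Callable[[], bool],
               max_steps: int = 10) -> bool:
    observation = b""                     # no screenshot before the first step
    for _ in range(max_steps):
        patch = agent_step(observation)   # agent proposes a code edit
        apply_patch(patch)                # harness applies it to the project
        observation = render()            # screenshot / frames fed back
        if check():                       # task-specific success test
            return True
    return False

# Trivial stubs, just to show the loop runs.
print(agent_loop(lambda obs: "noop", lambda p: None, lambda: b"frame", lambda: True))
```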
[262] FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
Jiayi Zhou, Yang Sheng, Hantao Lou, Yaodong Yang, Jie Fu
Main category: cs.AI
TL;DR: Neuro-symbolic framework that uses LLMs to translate natural language safety requirements into formal specifications for mathematical verification of AI agent behavior.
Details
Motivation: Current LLM-as-a-Judge oversight paradigm faces fundamental limitations in reliably supervising probabilistic systems with probabilistic supervision. Need for mathematical guarantees rather than probabilistic scores for behavioral safety in high-stakes domains.
Method: Bidirectional Formal-of-Thought architecture where LLMs serve as specification compilers: top-down decomposition of human intent into atomic verifiable constraints, then bottom-up compliance proving using Dafny specifications and Z3 SMT solving.
Result: Achieves 16.6% average improvement over LLM-as-a-Judge baselines across three benchmarks, enables weak-to-strong generalization (7B judge detects deception from 72B agents with >90% accuracy), and provides near-linear safety improvement through iterative refinement.
Conclusion: Formal verification through neuro-symbolic frameworks offers principled solution to AI safety oversight, providing mathematical guarantees rather than probabilistic scores, addressing fundamental limitations of current LLM-as-a-Judge paradigm.
Abstract: As LLM-based agents increasingly operate in high-stakes domains with real-world consequences, ensuring their behavioral safety becomes paramount. The dominant oversight paradigm, LLM-as-a-Judge, faces a fundamental dilemma: how can probabilistic systems reliably supervise other probabilistic systems without inheriting their failure modes? We argue that formal verification offers a principled escape from this dilemma, yet its adoption has been hindered by a critical bottleneck: the translation from natural language requirements to formal specifications. This paper bridges this gap by proposing FormalJudge, a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture: LLMs serve as specification compilers that top-down decompose high-level human intent into atomic, verifiable constraints, then bottom-up prove compliance using Dafny specifications and Z3 Satisfiability Modulo Theories (SMT) solving, which produces mathematical guarantees rather than probabilistic scores. We validate FormalJudge across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection. Experiments on 7 agent models demonstrate that FormalJudge achieves an average improvement of 16.6% over LLM-as-a-Judge baselines, enables weak-to-strong generalization where a 7B judge achieves over 90% accuracy detecting deception from 72B agents, and provides near-linear safety improvement through iterative refinement.
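To give a flavor of the bottom-up proving step, here is a tiny Z3 check (`pip install z3-solver`) that a compiled constraint holds by searching for a counterexample. The constraint itself is invented, and FormalJudge's actual Dafny/Z3 pipeline is far richer.

```python
# Toy "prove compliance" step with Z3: assert the negation of the compiled
# property and check for a counterexample. The property here is invented.
from z3 import Ints, Solver, And, Not, Implies, unsat

balance, amount = Ints("balance amount")

# Compiled from a natural-language rule: a permitted transfer is non-negative
# and never exceeds the account balance.
permitted = And(amount >= 0, amount <= balance)
safe = balance - amount >= 0                 # post-transfer balance is valid

s = Solver()
s.add(Not(Implies(permitted, safe)))         # search for a violating state
result = s.check()
print("verified" if result == unsat else f"counterexample: {s.model()}")
```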
[263] AI-Driven Discovery of Bio-Ecological Mediation in Cascading Heatwave Risks
Yiquan Wang, Tin-Yeh Huang, Qingyun Gao, Yuhan Chang, Jialin Zhang
Main category: cs.AI
TL;DR: HeDA is an autonomous AI framework that constructs knowledge graphs from thousands of academic papers to map complex cascading failures in compound heatwaves, revealing biological systems as key amplifiers and identifying cross-sector couplings.
Details
Motivation: Compound heatwaves trigger complex cascading failures across interconnected systems, but disciplinary fragmentation prevents comprehensive mapping of these systemic risk topologies, necessitating an integrated approach.
Method: Developed Heatwave Discovery Agent (HeDA) as an autonomous scientific synthesis framework that constructs high-fidelity knowledge graphs from 8,111 academic publications, structuring 70,297 evidence nodes for enhanced inferential analysis.
Result: HeDA outperformed standard foundation models (GPT-5.2, Claude Sonnet 4.5) in complex reasoning tasks, identified biological systems as primary nonlinear amplifiers of thermal stress, and revealed latent functional couplings between distinct sectors like power grids and emergency medical systems.
Conclusion: The study elucidates compound climate risk dynamics and provides empirical basis for shifting adaptation strategies from static sectoral defense to dynamic cross-system resilience through topological analysis of systemic risk networks.
Abstract: Compound heatwaves increasingly trigger complex cascading failures that propagate through interconnected physical and human systems, yet the fragmentation of disciplinary knowledge hinders the comprehensive mapping of these systemic risk topologies. This study introduces the Heatwave Discovery Agent (HeDA) as an autonomous scientific synthesis framework designed to bridge cognitive gaps by constructing a high-fidelity knowledge graph from 8,111 academic publications. By structuring 70,297 evidence nodes, the system exhibits enhanced inferential fidelity in capturing long-tail risk mechanisms and achieves a significant accuracy margin compared to standard foundation models including GPT-5.2 and Claude Sonnet 4.5 in complex reasoning tasks. The resulting topological analysis reveals a critical bio-ecological mediation effect where biological systems function as the primary non-linear amplifiers of thermal stress that transform physical meteorological hazards into systemic socioeconomic losses. We further identify latent functional couplings between theoretically distinct sectors, such as the heat-induced synchronization of power grid failures and emergency medical capacity saturation. These findings elucidate the dynamics of compound climate risks and provide an empirical basis for shifting adaptation strategies from static sectoral defense to dynamic cross-system resilience.
[264] The Specification Trap: Why Content-Based AI Value Alignment Cannot Produce Robust Alignment
Austin Spizzirri
Main category: cs.AI
TL;DR: Content-based AI value alignment approaches (RLHF, Constitutional AI, etc.) face fundamental philosophical limitations that prevent robust alignment under scaling, distributional shift, and autonomy, requiring a shift from value specification to value emergence.
Details
Motivation: The paper argues that current approaches to AI value alignment that focus on optimizing toward formal value-objects (reward functions, utility functions, constitutional principles) are fundamentally limited and cannot achieve robust alignment as AI capabilities scale, face distributional shifts, and gain increasing autonomy.
Method: The author uses philosophical analysis and conceptual arguments rather than empirical methods. The approach involves examining three philosophical results (Hume’s is-ought gap, Berlin’s value pluralism, and the extended frame problem) and applying them to existing alignment methods like RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games.
Result: The analysis shows that content-based alignment approaches suffer from a “specification trap” - they cannot produce genuine reasons-responsiveness, only simulated value-following. The failure modes are structural rather than engineering limitations, and proposed solutions (continual updating, meta-preferences, moral realism) only relocate rather than solve the problem.
Conclusion: Content-based alignment approaches have a fundamental ceiling that becomes safety-critical at the capability frontier. The alignment problem must be reframed from value specification to value emergence, moving beyond behavioral compliance to genuine reasons-responsiveness.
Abstract: I argue that content-based AI value alignment–any approach that treats alignment as optimizing toward a formal value-object (reward function, utility function, constitutional principles, or learned preference representation)–cannot, by itself, produce robust alignment under capability scaling, distributional shift, and increasing autonomy. This limitation arises from three philosophical results: Hume’s is-ought gap (behavioral data cannot entail normative conclusions), Berlin’s value pluralism (human values are irreducibly plural and incommensurable), and the extended frame problem (any value encoding will misfit future contexts that advanced AI creates). I show that RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and that their failure modes are structural, not engineering limitations. Proposed escape routes–continual updating, meta-preferences, moral realism–relocate the trap rather than exit it. Drawing on Fischer and Ravizza’s compatibilist theory, I argue that behavioral compliance does not constitute alignment: there is a principled distinction between simulated value-following and genuine reasons-responsiveness, and specification-based methods cannot produce the latter. The specification trap establishes a ceiling on content-based approaches, not their uselessness–but this ceiling becomes safety-critical at the capability frontier. The alignment problem must be reframed from value specification to value emergence.
[265] Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms
Marc Lanctot, Kate Larson, Ian Gemp, Michael Kaisers
Main category: cs.AI
TL;DR: Active evaluation framework for ranking AI agents across multiple tasks using online sampling to reduce evaluation costs while maintaining ranking accuracy.
Details
Motivation: As AI agents become more capable across diverse tasks, traditional evaluation becomes prohibitively expensive due to the need for many samples across correlated, stochastic tasks. There's a need for efficient evaluation methods that can accurately rank agents while minimizing evaluation costs.
Method: Proposes an active evaluation framework where ranking algorithms iteratively choose which tasks and agents to sample from. Algorithms report rankings at each iteration and are assessed against ground truth rankings over time. Compares baselines including Elo rating system and Soft Condorcet Optimization using synthetic data and real Atari game-playing agent data.
Result: Elo rating system is consistently reliable for efficient ranking error reduction despite theoretical limitations. Soft Condorcet Optimization performs comparably to Elo on synthetic data and significantly outperforms it on real Atari agent evaluation. When task variation is high, task selection based on proportional representation leads to higher ranking error reduction rates.
Conclusion: Active evaluation provides an efficient framework for ranking AI agents across multiple tasks. Traditional methods like Elo remain practical despite theoretical issues, while newer methods like Soft Condorcet Optimization show promise, especially on real-world data. Task selection strategies should adapt to task variation characteristics.
Abstract: As intelligent agents become more generally capable, i.e., able to master a wide variety of tasks, the complexity and cost of properly evaluating them rise significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons, leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of the number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, with synthetic generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system, while it suffers from well-known theoretical failure modes, is a consistently reliable choice for efficient reduction of ranking error in practice. A recently proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to a higher rate of ranking error reduction.
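For reference, the Elo baseline's update is a one-liner: after each sampled comparison, ratings move by K times the difference between the observed and expected score. The standard formulation:

```python
# Standard Elo update, as used for the pairwise-comparison baseline: the
# winner gains K * (1 - expected win probability).
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1 if A wins, 0 if A loses, 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500, 1500, 1.0))  # equal ratings: winner gains 16 at K=32
```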
[266] Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models
Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren
Main category: cs.AI
TL;DR: The paper introduces “implicit probabilistic reasoning” as an alternative evaluation method for LLMs’ probabilistic reasoning capabilities, moving beyond traditional multiple-choice questions to assess how models integrate probability into text generation.
Details
Motivation: Traditional MCQ-based evaluation of LLMs' probabilistic reasoning has significant limitations (e.g., sensitivity to answer ordering). The authors aim to develop a more robust evaluation method that assesses how models integrate probabilistic reasoning into their actual text generation process.
Method: The authors introduce “implicit probabilistic reasoning” by rephrasing MCQs as text-completion scenarios with predetermined outcomes. They compare the model’s next-token probability assignments to the true likelihood of outcomes, evaluating how probabilistic information influences text generation.
Result: Models show solid performance in explicit probabilistic reasoning (MCQs) but perform poorly in implicit probabilistic reasoning (text completion). The evaluation reveals that implicit reasoning is improperly influenced by factors like independent prior events, partial observations, and statistical background information, leading to erroneous text generation not detected by conventional MCQ evaluation.
Conclusion: Implicit probabilistic reasoning evaluation reveals significant limitations in how LLMs integrate probability into text generation, highlighting the need for more comprehensive evaluation methods beyond traditional MCQ approaches.
Abstract: The handling of probabilities in the form of uncertainty or partial information is an essential task for LLMs in many settings and applications. A common approach to evaluate an LLM’s probabilistic reasoning capabilities is to assess its ability to answer questions pertaining to probability through the use of multiple-choice questions (MCQs). However, this paradigm, which we refer to as explicit probabilistic reasoning, has been shown in the literature to yield significant limitations (e.g., sensitivity to answer ordering). In this work, we introduce an alternative approach, named implicit probabilistic reasoning, which evaluates the models’ ability to integrate probabilistic reasoning into their text generation process. To achieve this, we rephrase MCQs as text-completion scenarios with a determined set of outcomes and compare the model’s next-token probability assignments to the true likelihood of the outcomes. In line with previous work, we find that models exhibit solid performance in their explicit probabilistic reasoning (i.e., answers to MCQs). However, during text completion (i.e., implicit probabilistic reasoning), where the same information must be taken into account to generate text, the models’ predictions often significantly diverge from the known ground truth. For instance, our evaluation method reveals that implicit probabilistic reasoning is improperly influenced by many factors, such as independent prior events, partial observations about a result, or statistical background information. All of these issues can cause erroneous results to be produced in text generation, which are not detected by conventional MCQ-based evaluation.
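A minimal sketch of the implicit evaluation: rephrase an MCQ as a completion with a fixed outcome set, read the model's next-token probabilities for each outcome, and compare them to the true distribution. `next_token_probs` is a stand-in for whatever model API exposes token probabilities; the paper's exact scoring may differ.

```python
# Compare a model's (renormalized) next-token probabilities over a fixed
# outcome set against the true distribution, via an L1 gap.
def implicit_gap(prompt, outcomes, true_probs, next_token_probs):
    p = next_token_probs(prompt)                       # dict: token -> prob
    raw = [p.get(o, 0.0) for o in outcomes]
    z = sum(raw) or 1.0
    model_probs = [r / z for r in raw]                 # renormalize over outcomes
    return sum(abs(m - t) for m, t in zip(model_probs, true_probs))

# Toy example: completing "The coin lands on ..." for a fair coin.
fake_api = lambda prompt: {"heads": 0.8, "tails": 0.1}
print(implicit_gap("The coin lands on", ["heads", "tails"], [0.5, 0.5], fake_api))
# ~0.78: the model's implicit belief is far from the 50/50 ground truth
```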
[267] Metareasoning in uncertain environments: a meta-BAMDP framework
Prakhar Godara, Tilman Diego Alemán
Main category: cs.AI
TL;DR: A meta-reasoning framework for decision-making under uncertainty that extends traditional models to handle unknown reward/transition distributions, applied to Bernoulli bandit tasks with approximate solutions.
Details
Motivation: Traditional metareasoning models assume agents know the transition and reward distributions of the underlying MDP, which is unrealistic for many real-world planning problems. The paper aims to generalize these models to handle environments with unknown distributions.
Method: Proposes a meta Bayes-Adaptive MDP (meta-BAMDP) framework that extends metareasoning to environments with unknown reward/transition distributions. Applies the framework to Bernoulli bandit tasks and introduces two novel theorems to enhance tractability, enabling stronger approximations grounded in realistic human decision-making scenarios.
Result: Develops a normative framework for understanding human exploration under cognitive constraints and provides experimentally testable predictions about human behavior in Bernoulli Bandit tasks. The theorems significantly enhance the tractability of the meta-reasoning problem.
Conclusion: The meta-BAMDP framework offers a resource-rational perspective for metareasoning in realistic environments with uncertainty, bridging the gap between theoretical models and practical human/AI decision-making under cognitive constraints.
Abstract: \textit{Reasoning} may be viewed as an algorithm $P$ that makes a choice of an action $a^* \in \mathcal{A}$, aiming to optimize some outcome. However, executing $P$ itself bears costs (time, energy, limited capacity, etc.) and needs to be considered alongside explicit utility obtained by making the choice in the underlying decision problem. Finding the right $P$ can itself be framed as an optimization problem over the space of reasoning processes $P$, generally referred to as \textit{metareasoning}. Conventionally, human metareasoning models assume that the agent knows the transition and reward distributions of the underlying MDP. This paper generalizes such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework to handle metareasoning in environments with unknown reward/transition distributions, which encompasses a far larger and more realistic set of planning problems that humans and AI systems face. As a first step, we apply the framework to Bernoulli bandit tasks. Owing to the meta problem’s complexity, our solutions are necessarily approximate. However, we introduce two novel theorems that significantly enhance the tractability of the problem, enabling stronger approximations that are robust within a range of assumptions grounded in realistic human decision-making scenarios. These results offer a resource-rational perspective and a normative framework for understanding human exploration under cognitive constraints, as well as providing experimentally testable predictions about human behavior in Bernoulli Bandit tasks.
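In the Bernoulli bandit instantiation, the belief state of the underlying BAMDP is simply a Beta posterior per arm, updated after every pull; the meta-problem then weighs the value of refining this belief against the cost of the reasoning process itself. A minimal sketch of the belief bookkeeping (the metareasoning layer is not shown):

```python
# Beta-posterior belief state for one Bernoulli arm, the building block of
# the Bayes-adaptive bandit the meta-BAMDP is defined over.
from dataclasses import dataclass

@dataclass
class Belief:
    alpha: float = 1.0  # prior + observed successes
    beta: float = 1.0   # prior + observed failures

    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: int) -> "Belief":
        return Belief(self.alpha + reward, self.beta + (1 - reward))

b = Belief()
for r in [1, 1, 0, 1]:
    b = b.update(r)
print(round(b.mean(), 3))  # 3 successes, 1 failure under a uniform prior -> 0.667
```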
[268] Bridging Explainability and Embeddings: BEE Aware of Spuriousness
Cristian Daniel Păduraru, Antonio Bărbălau, Radu Filipescu, Andrei Liviu Nicolicioiu, Elena Burceanu
Main category: cs.AI
TL;DR: BEE framework analyzes weight space perturbations from fine-tuning to detect spurious correlations that conventional methods miss, applicable across vision, language, and multimodal domains.
Details
Motivation: Current spurious correlation detection methods rely on dataset statistics or error patterns, failing when counterexamples are absent. There's a need for a more fundamental approach that examines model internals rather than just outputs.
Method: BEE shifts focus from predictions to weight space and embedding geometry. It analyzes how fine-tuning perturbs pretrained representations, using linear probing as a diagnostic lens to reveal spurious features that persist after full fine-tuning and transfer across models.
Result: BEE consistently exposes spurious correlations across diverse domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2). Found concepts that slash ImageNet accuracy by up to 95% and clinical shortcuts causing dangerous false negatives.
Conclusion: BEE provides a general, principled tool for diagnosing spurious correlations in weight space, enabling better dataset auditing and more trustworthy foundation models across multimodal domains.
Abstract: Current methods for detecting spurious correlations rely on analyzing dataset statistics or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space, and to the embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Our experiments cover numerous datasets and domains: vision (Waterbirds, CelebA, ImageNet-1k), language (CivilComments, MIMIC-CXR medical notes), and multiple embedding families (CLIP, CLIP-DataComp.XL, mGTE, BLIP2, SigLIP2). BEE consistently exposes spurious correlations: from concepts that slash the ImageNet accuracy by up to 95%, to clinical shortcuts in MIMIC-CXR notes that induce dangerous false negatives. Together, these results position BEE as a general and principled tool for diagnosing spurious correlations in weight space, enabling principled dataset auditing and more trustworthy foundation models. The source code is publicly available at https://github.com/bit-ml/bee.
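A minimal sketch of the linear-probing lens on synthetic embeddings (scikit-learn): if a nuisance attribute is linearly decodable from frozen embeddings, it is a candidate shortcut. This is illustrative only; BEE's weight-space analysis of fine-tuning perturbations goes well beyond a single probe.

```python
# Linear probe on synthetic "embeddings": a nuisance attribute (e.g. image
# background) is deliberately leaked into one dimension, so the probe can
# decode it well above chance, flagging it as a candidate spurious feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 400, 32
background = rng.integers(0, 2, n)          # nuisance attribute (0/1)
emb = rng.normal(size=(n, d))
emb[:, 0] += 2.0 * background               # leak the attribute into dim 0

probe = LogisticRegression(max_iter=1000).fit(emb[:300], background[:300])
print("probe accuracy:", probe.score(emb[300:], background[300:]))  # ~0.8+
```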
[269] Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMind’s Innovations
Abdelrhman Shaheen, Anas Badr, Ali Abohendy, Hatem Alsaadawy, Nadine Alsayad, Ehab H. El-Shazly
Main category: cs.AI
TL;DR: This paper reviews Google DeepMind’s reinforcement learning innovations in gaming, focusing on AlphaGo, AlphaGo Zero, and MuZero models, their training approaches, challenges, and future directions in AI gaming.
Details
Motivation: To analyze the significance of reinforcement learning applications in gaming, particularly Atari and strategy-based games, by examining Google DeepMind's pioneering models and their evolution.
Method: Review paper analyzing three key DeepMind models: AlphaGo (supervised + RL), AlphaGo Zero (self-play RL without human data), and MuZero (learns environment dynamics without explicit rules). Also discusses MiniZero and multi-agent models.
Result: AlphaGo surpassed professional Go players, AlphaGo Zero improved learning efficiency through self-play, and MuZero achieved adaptability across various games including complex Atari games without knowing game rules.
Conclusion: Reinforcement learning has revolutionized AI gaming capabilities, with DeepMind’s models demonstrating progressive improvements in learning efficiency, adaptability, and performance across diverse game environments.
Abstract: Reinforcement Learning (RL) has been widely used in many applications, particularly in gaming, which serves as an excellent training ground for AI models. Google DeepMind has pioneered innovations in this field, employing reinforcement learning algorithms, including model-based, model-free, and deep Q-network approaches, to create advanced AI models such as AlphaGo, AlphaGo Zero, and MuZero. AlphaGo, the initial model, integrates supervised learning and reinforcement learning to master the game of Go, surpassing professional human players. AlphaGo Zero refines this approach by eliminating reliance on human gameplay data, instead utilizing self-play for enhanced learning efficiency. MuZero further extends these advancements by learning the underlying dynamics of game environments without explicit knowledge of the rules, achieving adaptability across various games, including complex Atari games. This paper reviews the significance of reinforcement learning applications in Atari and strategy-based games, analyzing these three models, their key innovations, training processes, challenges encountered, and improvements made. Additionally, we discuss advancements in the field of gaming, including MiniZero and multi-agent models, highlighting future directions and emerging AI models from Google DeepMind.
[270] HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan
Main category: cs.AI
TL;DR: HypoBench is a benchmark for evaluating LLM-based hypothesis generation methods across 12 tasks (7 real-world, 5 synthetic) with 194 datasets, assessing practical utility, generalizability, and discovery rate.
Details
Motivation: There's growing interest in using LLMs for hypothesis generation, but fundamental questions remain about what makes a good hypothesis and how to systematically evaluate hypothesis generation methods.
Method: Introduced HypoBench benchmark with 7 real-world tasks and 5 synthetic tasks across 194 datasets. Evaluated 4 state-of-the-art LLMs combined with 6 existing hypothesis-generation methods across multiple aspects including practical utility, generalizability, and hypothesis discovery rate.
Result: Existing methods can discover valid and novel patterns in data, but performance significantly drops as task difficulty increases in synthetic settings. Best models/methods only recover 38.8% of ground-truth hypotheses in synthetic tasks, showing significant room for improvement.
Conclusion: HypoBench serves as a valuable resource for improving AI systems for scientific discovery, highlighting current challenges in hypothesis generation where methods don’t fully uncover all relevant patterns.
Abstract: There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.
[271] Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark
Shuhang Xu, Weijian Deng, Yixuan Zhou, Fangwei Zhong
Main category: cs.AI
TL;DR: CK-Arena introduces a dynamic benchmark using social deduction games to evaluate LLMs’ conceptual understanding beyond surface pattern memorization.
Details
Motivation: Existing benchmarks for evaluating LLMs' conceptual knowledge are static and fact-oriented, making them vulnerable to data leakage and overfitting. There's a need to assess whether LLMs truly capture conceptual structures or just memorize surface patterns.
Method: Uses a multi-agent social deduction game (Undercover game) where LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others’ statements. Performance is evaluated through game outcomes and semantic quality of descriptions.
Result: Conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability. The benchmark can automatically construct high-quality QA data for diagnostic analysis.
Conclusion: CK-Arena provides a dynamic, interactive approach to evaluate LLMs’ conceptual understanding that goes beyond static benchmarks, revealing nuanced differences in how models handle conceptual knowledge.
Abstract: Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi-agent social deduction game, namely the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others’ statements. Model performance is evaluated through both game-level outcomes and the semantic quality of generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high-quality question-answering data for fine-grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability. The data and code are available at the project homepage: https://ck-arena.site.
[272] Controllable Logical Hypothesis Generation for Abductive Reasoning in Knowledge Graphs
Yisen Gao, Jiaxin Bai, Tianshi Zheng, Qingyun Sun, Ziwei Zhang, Xingcheng Fu, Jianxin Li, Yangqiu Song
Main category: cs.AI
TL;DR: CtrlHGen is a controllable logical hypothesis generation framework for abductive reasoning over knowledge graphs that addresses hypothesis space collapse and oversensitivity through two-stage training with sub-logical decomposition and semantic rewards.
Details
Motivation: Current abductive reasoning in knowledge graphs lacks controllability, generating many plausible but redundant or irrelevant hypotheses from single observations, limiting practical utility in applications like clinical diagnosis and scientific discovery.
Method: Proposes CtrlHGen with two-stage training: supervised learning followed by reinforcement learning. Uses sub-logical decomposition for dataset augmentation to address hypothesis space collapse, and incorporates smoothed semantic rewards (Dice, Overlap scores) plus condition-adherence reward to address hypothesis oversensitivity.
Result: Extensive experiments on three benchmark datasets show CtrlHGen better adheres to control conditions and achieves superior semantic similarity performance compared to baselines.
Conclusion: CtrlHGen effectively addresses controllability challenges in abductive reasoning over knowledge graphs, improving practical utility for applications requiring controlled hypothesis generation.
Abstract: Abductive reasoning in knowledge graphs aims to generate plausible logical hypotheses from observed entities, with broad applications in areas such as clinical diagnosis and scientific discovery. However, due to a lack of controllability, a single observation may yield numerous plausible but redundant or irrelevant hypotheses on large-scale knowledge graphs. To address this limitation, we introduce the task of controllable hypothesis generation to improve the practical utility of abductive reasoning. This task faces two key challenges when controlling the generation of long and complex logical hypotheses: hypothesis space collapse and hypothesis oversensitivity. To address these challenges, we propose CtrlHGen, a Controllable logical Hypothesis Generation framework for abductive reasoning over knowledge graphs, trained in a two-stage paradigm including supervised learning and subsequent reinforcement learning. To mitigate hypothesis space collapse, we design a dataset augmentation strategy based on sub-logical decomposition, enabling the model to learn complex logical structures by leveraging semantic patterns in simpler components. To address hypothesis oversensitivity, we incorporate smoothed semantic rewards including Dice and Overlap scores, and introduce a condition-adherence reward to guide the generation toward user-specified control constraints. Extensive experiments on three benchmark datasets demonstrate that our model not only better adheres to control conditions but also achieves superior semantic similarity performance compared to baselines. Our code is available at https://github.com/HKUST-KnowComp/CtrlHGen.
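The Dice and Overlap rewards named above have standard set forms, shown here computed over the answer sets of a gold and a generated hypothesis; the paper's exact smoothing and weighting may differ.

```python
# Standard Dice and Overlap coefficients over two answer sets.
def dice(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

gold, pred = {"e1", "e2", "e3"}, {"e2", "e3", "e4", "e5"}
print(round(dice(gold, pred), 3), round(overlap(gold, pred), 3))  # 0.571 0.667
```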
[273] PhysUniBench: A Multi-Modal Physics Reasoning Benchmark at Undergraduate Level
Lintao Wang, Encheng Su, Jiaqi Liu, Pengze Li, Jiabei Xiao, Wenlong Zhang, Xinnan Dai, Xi Chen, Yuan Meng, Lei Bai, Wanli Ouyang, Shixiang Tang, Aoran Wang, Xinzhu Ma
Main category: cs.AI
TL;DR: PhysUniBench is a multimodal benchmark with 3,304 undergraduate physics problems across 8 sub-disciplines, each with visual diagrams, designed to evaluate MLLMs’ physics reasoning capabilities.
Details
Motivation: Existing evaluations fail to capture the full breadth and complexity of undergraduate physics, which provides a rigorous testbed for assessing multi-step physical reasoning in multimodal AI models.
Method: Created a large-scale multimodal benchmark through a rigorous multi-stage process: multiple roll-outs, expert evaluation, automated filtering of easily solved problems, and a nuanced 5-level difficulty grading system.
Result: Current MLLMs struggle significantly with physics reasoning - GPT-5 achieves only 51.6% accuracy, especially on multi-step problems and those requiring precise diagram interpretation.
Conclusion: PhysUniBench provides a broad, rigorous assessment tool to drive progress in AI for Science, encouraging development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding.
Abstract: Physics problem-solving is a challenging domain for AI models, requiring integration of conceptual understanding, mathematical reasoning, and interpretation of physical diagrams. Existing evaluations fail to capture the full breadth and complexity of undergraduate physics, even though this level provides a rigorous yet standardized testbed for pedagogical assessment of multi-step physical reasoning. To this end, we present PhysUniBench, a large-scale multimodal benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs) specifically on undergraduate-level physics problems. PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark includes both open-ended and multiple-choice questions, systematically curated and difficulty-rated through an iterative process. The benchmark’s construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels. Through extensive experiments, we observe that current models encounter substantial challenges in physics reasoning, with GPT-5 achieving only 51.6% accuracy on PhysUniBench. These results highlight that current MLLMs struggle with advanced physics reasoning, especially on multi-step problems and those requiring precise diagram interpretation. By providing a broad and rigorous assessment tool, PhysUniBench aims to drive progress in AI for Science, encouraging the development of models with stronger physical reasoning, problem-solving skills, and multimodal understanding.
[274] Scaling Towards the Information Boundary of Instruction Sets: The Infinity Instruct Subject Technical Report
Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan
Main category: cs.AI
TL;DR: A systematic framework for constructing high-quality instruction datasets with improved coverage and depth through hierarchical tagging, seed selection, evolutionary synthesis, and targeted generation.
Details
Motivation: Current instruction datasets have limited coverage of task types/knowledge areas and depth (instruction complexity), leading to models struggling with complex instructions and rare domains despite large dataset sizes.
Method: Proposes a systematic framework with four components: hierarchical tagging system, informative seed selection algorithm, evolutionary data synthesis process, and model deficiency diagnosis with targeted data generation, forming an iterative closed-loop.
Result: Constructed Infinity Instruct Subject dataset (~1.5M instructions) that improves instruction-following capabilities across multiple foundation models and benchmark tasks, showing enlarged coverage and depth compared to comparable datasets.
Conclusion: The work provides theoretical and practical foundation for efficient, continuous evolution of instruction datasets, shifting focus from quantity expansion to qualitative improvement.
Abstract: Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and tasks in rare domains. This is primarily due to limited expansion in both “coverage” (coverage of task types and knowledge areas) and “depth” (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework, which integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and a model deficiency diagnosis with targeted data generation. These components form an iterative closed-loop to continuously enhance the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing $\sim$1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
[275] Synthetic Homes: An Accessible Multimodal Pipeline for Producing Residential Building Data with Generative AI
Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
Main category: cs.AI
TL;DR: A modular multimodal framework using generative AI to create energy modeling data from public images and residential information, addressing data accessibility and privacy issues.
Details
Motivation: Energy modeling research requires extensive data that can be inaccessible or expensive, or can raise privacy concerns. Current computational models need large datasets but face data acquisition challenges.
Method: Developed a modular multimodal framework using generative AI to produce energy modeling data from publicly accessible images and residential information. Includes an evaluation pipeline for the generative AI components.
Result: The framework avoids common issues with generative models and produces realistic multimodal data. Successfully reduces dependence on costly or restricted data sources.
Conclusion: The approach paves a path toward more accessible research in machine learning and data-driven disciplines by addressing data scarcity and privacy concerns through generative AI.
Abstract: Computational models have emerged as powerful tools for energy modeling research, touting scalability and quantitative results. However, these models require a plethora of data, some of which can be inaccessible, expensive, or can raise privacy concerns. We introduce a modular multimodal framework to produce this data from publicly accessible images and residential information using generative Artificial Intelligence (AI). Additionally, we provide a pipeline demonstrating this framework and we evaluate its generative AI components. Our experiments show that our framework’s use of AI avoids common issues with generative models and produces realistic multimodal data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible research in Machine Learning (ML) and other data-driven disciplines.
[276] Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai
Main category: cs.AI
TL;DR: The paper proposes CoTP (Chain-of-Thought Patterns), a method to select high-value reasoning data by extracting atomic reasoning patterns from CoT sequences and using them to efficiently filter training data, significantly improving mathematical reasoning performance with minimal data.
Details
Motivation: Current approaches use CoT data indiscriminately without understanding which data types most effectively enhance model reasoning capabilities. The paper aims to identify and leverage high-value reasoning patterns to improve model reasoning potential efficiently.
Method: 1) Define reasoning potential as inverse of attempts needed to answer correctly; 2) Abstract atomic reasoning patterns from CoT sequences based on commonality and inductive capabilities; 3) Construct core reference set of valuable patterns; 4) Propose dual-granularity algorithm using chains of reasoning patterns and token entropy to select high-value CoT data (CoTP) aligned with core set.
Result: Only 10B-token CoTP data enables 85A6B MoE model to improve by 9.58% on challenging AIME 2024/2025 benchmarks and raises upper bound of downstream RL performance by 7.81%.
Conclusion: Selecting high-value reasoning patterns from CoT data is more effective than indiscriminate use, enabling significant performance improvements with minimal data through targeted training on valuable reasoning patterns.
Abstract: Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model’s reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
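The paper's notion of reasoning potential, the inverse of the number of independent attempts needed to answer correctly, is easy to estimate by simulation. A toy sketch, with a random solver standing in for model sampling:

```python
# Estimate reasoning potential = 1 / E[attempts until first correct answer].
# `try_once` simulates a model that solves the question with probability p
# per independent attempt; real estimates would come from model roll-outs.
import random

def expected_attempts(try_once, trials=2000, max_attempts=64):
    total = 0
    for _ in range(trials):
        k = max_attempts                       # cap if never solved
        for attempt in range(1, max_attempts + 1):
            if try_once():
                k = attempt
                break
        total += k
    return total / trials

random.seed(0)
p = 0.25
n = expected_attempts(lambda: random.random() < p)
print(round(n, 2), round(1.0 / n, 3))  # ~4 attempts -> potential ~0.25
```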
[277] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
Main category: cs.AI
TL;DR: AGILE introduces an agentic jigsaw interaction learning framework that enhances visual perception and reasoning in VLMs through interactive code execution and environmental feedback.
Details
Motivation: Current VLMs have limited fundamental perceptual and reasoning abilities, performing poorly even on simple jigsaw tasks, while high-quality vision-language training data is scarce and not scalable.
Method: AGILE formulates jigsaw solving as an interactive process where the model generates executable code to perform actions based on current state, receives fine-grained visual feedback from the environment, and learns through iterative cycles of observation and interaction.
Result: AGILE boosts jigsaw task accuracy from 9.5% to 82.8% (2×2 setting) and shows strong generalization across 9 vision tasks with average 3.1% improvement, demonstrating enhanced perceptual and reasoning capabilities.
Conclusion: The work provides an efficient, scalable solution to multimodal data scarcity and opens new avenues for advancing reasoning and generalization in multimodal models through interactive learning.
Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 $\times$ 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
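The interactive formulation boils down to a loop of act-via-code, observe, repeat. A toy stand-in where the "jigsaw" is a list permutation and the policy emits swap commands as text, mimicking the generate-code / get-feedback cycle (AGILE's actual actions operate on image tiles):

```python
# Toy act-observe loop: the policy emits a textual action, the environment
# executes it and the loop checks the resulting state, standing in for
# AGILE's code-execution-plus-visual-feedback cycle.
def jigsaw_episode(policy, state, goal, max_steps=20):
    for step in range(max_steps):
        action = policy(state)                   # e.g. "swap 0 2"
        i, j = map(int, action.split()[1:])
        state[i], state[j] = state[j], state[i]  # environment executes action
        if state == goal:                        # feedback: solved?
            return step + 1
    return None

# Greedy toy policy: move the first misplaced piece to where it belongs.
goal = [0, 1, 2, 3]
policy = lambda s: next(f"swap {i} {s.index(goal[i])}"
                        for i in range(len(s)) if s[i] != goal[i])
print(jigsaw_episode(policy, [2, 0, 3, 1], goal))  # solved in 3 steps
```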
[278] Measuring What Matters: The AI Pluralism Index
Rashid Mushkani
Main category: cs.AI
TL;DR: The AI Pluralism Index (AIPI) is a transparent measurement framework that evaluates AI producers and systems across four pillars of pluralistic governance: participatory governance, inclusivity/diversity, transparency, and accountability.
Details
Motivation: Current AI development is concentrated in a few firms/states, potentially encoding narrow interests and limiting public agency. While technical benchmarks are common, there's a lack of public, auditable measures for pluralistic governance that ensures affected stakeholders can shape AI objectives and practices.
Method: AIPI uses a transparent, evidence-based instrument that codes verifiable practices from public artifacts and independent evaluations. It handles “Unknown” evidence to report both lower-bound and known-only scores. The method includes a reproducible pipeline with structured web/repository analysis, external assessments, expert interviews, and reliability testing.
Result: The framework includes formal measurement model, reproducible pipeline, reliability assessments (inter-rater agreement, coverage reporting, cross-index correlations, sensitivity analysis), and open maintenance of protocol, codebook, scoring scripts, and evidence graph with versioned releases and public adjudication.
Conclusion: AIPI aims to steer incentives toward pluralistic AI practices and equip policymakers, procurers, and the public with comparable evidence for evaluating AI governance, complementing existing transparency, safety, and governance frameworks.
Abstract: Artificial intelligence systems increasingly mediate knowledge, communication, and decision making. Development and governance remain concentrated within a small set of firms and states, raising concerns that technologies may encode narrow interests and limit public agency. Capability benchmarks for language, vision, and coding are common, yet public, auditable measures of pluralistic governance are rare. We define AI pluralism as the degree to which affected stakeholders can shape objectives, data practices, safeguards, and deployment. We present the AI Pluralism Index (AIPI), a transparent, evidence-based instrument that evaluates producers and system families across four pillars: participatory governance, inclusivity and diversity, transparency, and accountability. AIPI codes verifiable practices from public artifacts and independent evaluations, explicitly handling “Unknown” evidence to report both lower-bound (“evidence”) and known-only scores with coverage. We formalize the measurement model; implement a reproducible pipeline that integrates structured web and repository analysis, external assessments, and expert interviews; and assess reliability with inter-rater agreement, coverage reporting, cross-index correlations, and sensitivity analysis. The protocol, codebook, scoring scripts, and evidence graph are maintained openly with versioned releases and a public adjudication process. We report pilot provider results and situate AIPI relative to adjacent transparency, safety, and governance frameworks. The index aims to steer incentives toward pluralistic practice and to equip policymakers, procurers, and the public with comparable evidence.
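The dual-score treatment of "Unknown" evidence is simple to state: the lower-bound score counts Unknown as unmet, the known-only score drops it from the denominator, and coverage reports how much was observable. A sketch with invented indicator names; AIPI's actual codebook and weighting are richer.

```python
# Lower-bound vs known-only scoring over coded evidence, where None marks
# "Unknown". Indicator names are hypothetical.
def aipi_scores(evidence: dict):
    known = {k: v for k, v in evidence.items() if v is not None}
    met = sum(known.values())
    lower_bound = met / len(evidence)                # Unknown counts against
    known_only = met / len(known) if known else 0.0  # Unknown dropped
    coverage = len(known) / len(evidence)
    return lower_bound, known_only, coverage

evidence = {"public_model_card": True, "external_audit": None,
            "stakeholder_input": True, "incident_reporting": False}
print(aipi_scores(evidence))  # (0.5, 0.666..., 0.75)
```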
[279] Unifying Deductive and Abductive Reasoning in Knowledge Graphs with Masked Diffusion Model
Yisen Gao, Jiaxin Bai, Yi Huang, Xingcheng Fu, Qingyun Sun, Yangqiu Song
Main category: cs.AI
TL;DR: DARK: A unified diffusion-based framework for both deductive and abductive reasoning on knowledge graphs using self-reflective denoising and logic-exploration reinforcement learning.
Details
Motivation: Current methods treat deductive (retrieving entities from queries) and abductive (generating hypotheses from observations) reasoning in isolation, missing their synergistic potential: deduction can validate hypotheses, and abduction can uncover deeper logical patterns.
Method: Proposes DARK, a masked diffusion model with two key innovations: 1) a self-reflective denoising process that iteratively generates and validates candidate hypotheses during abductive reasoning; 2) logic-exploration reinforcement learning that simultaneously masks queries and conclusions to explore novel reasoning compositions. A toy sketch follows this entry.
Result: Extensive experiments on multiple benchmark knowledge graphs show DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks.
Conclusion: The unified approach demonstrates significant benefits over isolated methods, successfully bridging the gap between deductive and abductive reasoning in knowledge graphs.
Abstract: Deductive and abductive reasoning are two critical paradigms for analyzing knowledge graphs, enabling applications from financial query answering to scientific discovery. Deductive reasoning on knowledge graphs usually involves retrieving entities that satisfy a complex logical query, while abductive reasoning generates plausible logical hypotheses from observations. Despite their clear synergistic potential, where deduction can validate hypotheses and abduction can uncover deeper logical patterns, existing methods address them in isolation. To bridge this gap, we propose DARK, a unified framework for Deductive and Abductive Reasoning in Knowledge graphs. As a masked diffusion model capable of capturing the bidirectional relationship between queries and conclusions, DARK has two key innovations. First, to better leverage deduction for hypothesis refinement during abductive reasoning, we introduce a self-reflective denoising process that iteratively generates and validates candidate hypotheses against the observed conclusion. Second, to discover richer logical associations, we propose a logic-exploration reinforcement learning approach that simultaneously masks queries and conclusions, enabling the model to explore novel reasoning compositions. Extensive experiments on multiple benchmark knowledge graphs show that DARK achieves state-of-the-art performance on both deductive and abductive reasoning tasks, demonstrating the significant benefits of our unified approach.
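As a rough illustration of the self-reflective idea, the toy below proposes a fill for a masked hypothesis slot and keeps it only when a deduction step reproduces the observation; `propose` and `deduce` are hypothetical stand-ins for the masked diffusion model and a knowledge-graph query engine.

```python
import random

MASK, VOCAB = "?", ["parent", "sibling", "friend"]
observation = {"alice", "bob"}        # entities the hypothesis must entail

def deduce(hypothesis):
    # Hypothetical deduction oracle: entities a relation hypothesis entails.
    return {"parent": {"alice", "bob"}, "sibling": {"alice"},
            "friend": {"carol"}}[hypothesis[0]]

def propose(hypothesis):
    # Fill masked slots; the real model denoises instead of sampling.
    return [random.choice(VOCAB) if t == MASK else t for t in hypothesis]

hypothesis = [MASK]
for _ in range(50):
    candidate = propose(hypothesis)
    if deduce(candidate) == observation:      # validate by deduction
        print("accepted:", candidate)
        break
    hypothesis = [MASK]                       # re-mask and try again
```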
[280] Retrieval- and Argumentation-Enhanced Multi-Agent LLMs for Judgmental Forecasting (Extended Version with Supplementary Material)
Deniz Gorur, Antonio Rago, Francesca Toni
Main category: cs.AI
TL;DR: Multi-agent LLM framework for claim verification using quantitative bipolar argumentation frameworks (QBAFs) applied to judgmental forecasting tasks.
Details
Motivation: Judgmental forecasting involves predicting future events based on human judgment, which can be framed as claim verification. The paper aims to improve forecasting accuracy by combining evidence from multiple LLM agents with different approaches to argument generation and evaluation.
Method: Proposes a multi-agent framework where different LLM-powered agents (ArgLLM, RbAM, RAG-ArgLLM) generate and evaluate QBAFs for claim verification. Agents may disagree on claim veracity and provide evidence for/against claims. Experiments were conducted with 2-3 agent configurations using six different base LLMs on standard judgmental forecasting datasets. A QBAF scoring sketch follows this entry.
Result: Combining evidence from multiple agents improves forecasting accuracy, particularly with three-agent configurations. The framework provides explainable evidence combinations for claim verification tasks.
Conclusion: Multi-agent LLM frameworks with quantitative argumentation can enhance judgmental forecasting by aggregating diverse evidence sources while maintaining explainability through structured argumentation frameworks.
Abstract: Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
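For readers new to QBAFs, the snippet below scores a tiny argument graph under DF-QuAD, one common gradual semantics for quantitative bipolar argumentation; the graph is invented, and the paper's agents may evaluate their QBAFs differently.

```python
def aggregate(strengths):
    # Probabilistic-sum aggregation: F(S) = 1 - prod(1 - s).
    p = 1.0
    for s in strengths:
        p *= 1.0 - s
    return 1.0 - p

def strength(arg, base, attackers, supporters, memo=None):
    memo = {} if memo is None else memo
    if arg not in memo:
        va = aggregate([strength(a, base, attackers, supporters, memo)
                        for a in attackers.get(arg, [])])
        vs = aggregate([strength(s, base, attackers, supporters, memo)
                        for s in supporters.get(arg, [])])
        t = base[arg]   # DF-QuAD combination of base score and both sides
        memo[arg] = t - t * (va - vs) if va >= vs else t + (1 - t) * (vs - va)
    return memo[arg]

base = {"claim": 0.5, "evidence_for": 0.8, "evidence_against": 0.6}
supporters = {"claim": ["evidence_for"]}
attackers = {"claim": ["evidence_against"]}
print(strength("claim", base, attackers, supporters))   # 0.6
```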
[281] PreferThinker: Reasoning-based Personalized Image Preference Assessment
Shengqi Xu, Xinpeng Zhou, Yabo Zhang, Ming Liu, Tao Liang, Tianyu Zhang, Yalong Bai, Zuxuan Wu, Wangmeng Zuo
Main category: cs.AI
TL;DR: A reasoning-based framework for personalized image preference assessment that predicts user preference profiles from reference images and provides interpretable multi-dimensional assessments.
Details
Motivation: Existing image preference assessment methods focus on general preferences using large-scale data, but struggle with personalized preferences due to scarce user-specific data and diverse individual tastes.
Method: Proposes a predict-then-assess framework: first predicts a user's preference profile from reference images, then provides interpretable assessments. Uses two-stage training: supervised fine-tuning followed by reinforcement learning with a similarity-aware prediction reward (sketched after this entry).
Result: Extensive experiments demonstrate the superiority of the proposed method over existing approaches.
Conclusion: The framework effectively addresses personalized image preference assessment by leveraging common preference profiles and structured reasoning capabilities.
Abstract: Personalized image preference assessment aims to evaluate an individual user’s image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a predict-then-assess paradigm: it first predicts a user’s preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user’s preference profile, which facilitates more reasonable assessment exploration. Extensive experiments demonstrate the superiority of the proposed method.
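A minimal sketch of what the similarity-aware prediction reward could look like, assuming predicted and annotated profiles are compared as embedding vectors via cosine similarity; the paper's exact formulation may differ.

```python
import numpy as np

def profile_reward(pred_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    # Cosine similarity between predicted and reference profile vectors.
    cos = pred_emb @ ref_emb / (np.linalg.norm(pred_emb) * np.linalg.norm(ref_emb))
    return float((cos + 1.0) / 2.0)        # map [-1, 1] -> [0, 1] for RL

pred = np.array([0.9, 0.1, 0.4])           # e.g. (realism, minimalism, warmth)
ref = np.array([1.0, 0.0, 0.5])
print(round(profile_reward(pred, ref), 3))
```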
[282] CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents
Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Samwoo Seong, Youngjae Yu, Yunsung Lee
Main category: cs.AI
TL;DR: CostNav is an economic navigation benchmark that evaluates autonomous delivery systems using comprehensive cost-revenue analysis rather than just task success, revealing current navigation approaches are not economically viable.
Details
Motivation: Current navigation benchmarks focus on simplified task success metrics but neglect real-world economic constraints essential for commercial viability of autonomous delivery systems. There's a gap between research metrics and commercial deployment needs.
Method: Introduces the CostNav benchmark, integrating industry data (SEC filings, AIS injury reports) with Isaac Sim's collision and cargo dynamics. Evaluates navigation policies through comprehensive economic cost-revenue analysis rather than task completion alone. A back-of-envelope sketch of the accounting follows this entry.
Result: Evaluation of rule-based Nav2 navigation shows negative contribution margins (-22.81/run for AMCL, -12.87/run for GPS) with no break-even point, demonstrating current approaches are not economically viable.
Conclusion: CostNav exposes the gap between navigation research metrics and commercial viability, challenging the community to develop economically viable navigation policies. The benchmark is method-agnostic, evaluating success solely on cost metrics.
Abstract: While current navigation benchmarks prioritize task success in simplified settings, they neglect the multidimensional economic constraints essential for the real-world commercialization of autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents through comprehensive economic cost-revenue analysis aligned with real-world business operations. By integrating industry-standard data - such as SEC filings and AIS injury reports - with Isaac Sim’s detailed collision and cargo dynamics, CostNav transcends simple task completion to accurately evaluate business value in complex, real-world scenarios. To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success on a simplified task fundamentally differs from optimizing for real-world economic deployment. Our evaluation of rule-based Nav2 navigation shows that current approaches are not economically viable: the contribution margin is -22.81/run (AMCL) and -12.87/run (GPS), resulting in no break-even point. We challenge the community to develop navigation policies that achieve economic viability on CostNav. We remain method-agnostic, evaluating success solely on the metric of cost rather than the underlying architecture. All resources are available at https://github.com/worv-ai/CostNav.
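The accounting is simple to reproduce in miniature: contribution margin is per-run revenue minus per-run variable cost, and a negative margin means no volume of runs ever breaks even. All figures below are hypothetical; CostNav derives its rates from SEC filings, injury reports, and simulated collision and cargo damage.

```python
def contribution_margin(revenue_per_run, energy, maintenance,
                        collision_rate, damage_per_collision):
    # Expected variable cost per run, including amortized collision damage.
    variable_cost = energy + maintenance + collision_rate * damage_per_collision
    return revenue_per_run - variable_cost

margin = contribution_margin(revenue_per_run=6.00, energy=0.40,
                             maintenance=1.10, collision_rate=0.08,
                             damage_per_collision=150.0)
print(f"{margin:.2f}/run")   # negative margin -> no break-even point
```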
[283] Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
Shengji Tang, Weihao Lin, Peng Ye, Jingqi Ye, Hao Li, Yiqun Zhang, Xiaosong Wang, Bo Zhang, Shuyue Hu, Tao Chen, Lei Bai, Wanli Ouyang
Main category: cs.AI
TL;DR: JiSi framework enables open-source LLMs to collectively surpass Gemini-3-Pro performance through query-response mixed routing, support-set-based aggregator selection, and adaptive routing-aggregation switching.
Details
Motivation: Current LLM collaboration approaches face three bottlenecks: query-based routers focus only on textual similarity, static aggregation methods don't adapt to different tasks, and the complementarity between routing and aggregation is underutilized. The authors propose collective intelligence as an alternative to monolithic scaling.
Method: The JiSi framework introduces three innovations: 1) Query-Response Mixed Routing that captures both semantic information and problem difficulty, 2) Support-Set-based Aggregator Selection that jointly evaluates aggregation and domain capacity, and 3) an Adaptive Routing-Aggregation Switch that dynamically leverages the advantages of both routing and aggregation (a toy of the switch follows this entry).
Result: JiSi surpasses Gemini-3-Pro at only 47% of the cost by orchestrating ten open-source LLMs, outperforming mainstream baselines across nine benchmarks. The framework demonstrates that collective intelligence can achieve superior performance compared to monolithic scaling.
Conclusion: Collective intelligence represents a novel path toward AGI, showing that collaboration among open-source LLMs can surpass state-of-the-art monolithic models like Gemini-3-Pro through intelligent routing and aggregation mechanisms.
Abstract: Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs’ collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs’ collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro with only 47% of the cost by orchestrating ten open-source LLMs, while outperforming mainstream baselines. This suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).
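A runnable toy of the route-or-aggregate decision: easy queries go to the single best-matching expert, hard ones are answered by aggregating all candidates. The affinity and difficulty proxies are invented placeholders; JiSi's actual router mixes query text with sampled responses.

```python
from collections import Counter

class Expert:
    def __init__(self, name, domain):
        self.name, self.domain = name, domain
    def affinity(self, query):          # toy match score for routing
        return 1.0 if self.domain in query else 0.1
    def generate(self, query):
        return f"{self.name}-answer"

def answer(query, experts, threshold=0.5):
    difficulty = min(1.0, len(query.split()) / 20)   # toy difficulty proxy
    if difficulty < threshold:
        best = max(experts, key=lambda m: m.affinity(query))   # route
        return best.generate(query)
    votes = Counter(m.generate(query) for m in experts)        # aggregate
    return votes.most_common(1)[0][0]

experts = [Expert("coder", "code"), Expert("mathlete", "math")]
print(answer("short math question", experts))   # routed to "mathlete"
```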
[284] Meta Context Engineering via Agentic Skill Evolution
Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, Guojie Song
Main category: cs.AI
TL;DR: Meta Context Engineering (MCE) is a bi-level framework that co-evolves context engineering skills and context artifacts through agentic crossover, replacing static CE heuristics with adaptive optimization.
Details
Motivation: Current Context Engineering (CE) methods rely on manually crafted harnesses with rigid workflows and predefined schemas, imposing structural biases and limiting optimization to narrow, intuition-bound design spaces.
Method: MCE uses a bi-level framework: a meta-level agent refines engineering skills via agentic crossover (deliberative search over skills, executions, and evaluations), while a base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code.
Result: MCE demonstrates consistent performance gains across five disparate domains, achieving 5.6-53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), with superior context adaptability, transferability, and efficiency.
Conclusion: MCE supersedes static CE heuristics by co-evolving CE skills and context artifacts, enabling more effective and efficient context optimization for large language models.
Abstract: The operational efficacy of large language models relies heavily on their inference-time context. This has established Context Engineering (CE) as a formal discipline for optimizing these inputs. Current CE methods rely on manually crafted harnesses, such as rigid generation-reflection workflows and predefined context schemas. They impose structural biases and restrict context optimization to a narrow, intuition-bound design space. To address this, we introduce Meta Context Engineering (MCE), a bi-level framework that supersedes static CE heuristics by co-evolving CE skills and context artifacts. In MCE iterations, a meta-level agent refines engineering skills via agentic crossover, a deliberative search over the history of skills, their executions, and evaluations. A base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. We evaluate MCE across five disparate domains under offline and online settings. MCE demonstrates consistent performance gains, achieving 5.6–53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), while maintaining superior context adaptability, transferability, and efficiency in both context usage and training.
[285] World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems
Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak
Main category: cs.AI
TL;DR: WoW introduces a realistic ServiceNow-based enterprise environment with 4,000+ business rules and 55 workflows, plus a benchmark revealing LLMs’ “dynamics blindness” to hidden cascading effects in complex systems.
Details
Motivation: Current enterprise benchmarks fail to capture real enterprise challenges like limited observability, large database states, and hidden workflows with cascading side effects. Frontier LLMs remain untested in complex enterprise systems where these dynamics create significant challenges.
Method: Created World of Workflows (WoW), a ServiceNow-based environment with 4,000+ business rules and 55 active workflows. Developed WoW-bench with 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. A miniature rule-engine sketch follows this entry.
Result: Two major findings: (1) Frontier LLMs suffer from “dynamics blindness” - consistently failing to predict invisible cascading side effects leading to silent constraint violations; (2) Reliability in opaque systems requires grounded world modeling where agents must mentally simulate hidden state transitions.
Conclusion: For reliable enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. The authors release a GitHub repository for setting up and evaluating WoW.
Abstract: Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.
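The "dynamics blindness" failure mode is easiest to see with a miniature rule engine, where one visible write triggers invisible cascading effects. The table and rules below are invented for illustration, not taken from ServiceNow.

```python
db = {"incident/1": {"priority": "low", "assigned": None, "sla_breached": False}}

RULES = [  # (table, condition, effect) triggers evaluated after every write
    ("incident", lambda r: r["priority"] == "high" and r["assigned"] is None,
     lambda r: r.update(assigned="tier2_queue")),
    ("incident", lambda r: r["assigned"] == "tier2_queue",
     lambda r: r.update(sla_breached=True)),
]

def write(key, **fields):
    record = db[key]
    record.update(fields)
    changed = True
    while changed:                       # fire rules until a fixed point
        changed = False
        for table, cond, effect in RULES:
            if key.startswith(table) and cond(record):
                before = dict(record)
                effect(record)
                changed |= record != before

write("incident/1", priority="high")     # the agent sees one write...
print(db["incident/1"])                  # ...but two hidden side effects fired
```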
[286] From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
Bowen Cao, Dongdong Zhang, Yixia Li, Junpeng Liu, Shijue Huang, Chufan Shi, Hongyuan Lu, Yaokang Wu, Guanhua Chen, Wai Lam, Furu Wei
Main category: cs.AI
TL;DR: ContextMATH benchmark shows LLMs struggle with contextual mathematical reasoning, with performance dropping significantly when problems are embedded in realistic scenarios or require problem formulation from implicit constraints.
Details
Motivation: Despite LLMs achieving near-expert performance on benchmark math problems, there's a significant gap in their ability to apply mathematical reasoning to real-world contextual problems where mathematical cores must be formulated from descriptive scenarios.
Method: Created the ContextMATH benchmark by repurposing AIME and MATH-500 problems into two settings: Scenario Grounding (embedding abstract problems into realistic narratives) and Complexity Scaling (transforming explicit conditions into sub-problems). Evaluated 61 proprietary and open-source models.
Result: Significant performance drops: open-source models declined by 13 and 34 points on SG and CS, proprietary models by 13 and 20 points. Errors dominated by incorrect problem formulation, with formulation accuracy declining as problem difficulty increases. Fine-tuning with scenario data helps but gaps remain.
Conclusion: Contextual mathematical reasoning remains a central unsolved challenge for LLMs, with formulation and reasoning as complementary bottlenecks. Larger models show better understanding and reasoning, but performance gaps persist even with training interventions.
Abstract: Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows that errors are dominated by incorrect problem formulation, with formulation accuracy declining as original problem difficulty increases. Correct formulation emerges as a prerequisite for success, and its sufficiency improves with model scale, indicating that larger models advance in both understanding and reasoning. Nevertheless, formulation and reasoning remain two complementary bottlenecks that limit contextual mathematical problem solving. Finally, we find that fine-tuning with scenario data improves performance, whereas formulation-only training is ineffective. However, performance gaps are only partially alleviated, highlighting contextual mathematical reasoning as a central unsolved challenge for LLMs.
[287] Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
Main category: cs.AI
TL;DR: RAI is a training-free safety framework that enhances VLMs’ risk recognition by amplifying unsafe signals through targeted modulation of high-risk visual tokens using an unsafe prototype subspace.
Details
Motivation: VLMs are vulnerable to multimodal jailbreak attacks, and existing defenses have high training costs or degrade utility. Research shows LLMs inherently recognize unsafe content, but visual inputs dilute risk signals in VLMs.
Method: Constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens to explicitly activate safety-critical signals in the cross-modal feature space. A sketch of the modulation follows this entry.
Result: Extensive experiments show RAI substantially reduces attack success rate without compromising task performance across multiple jailbreak and utility benchmarks.
Conclusion: RAI provides a lightweight, training-free safety calibration framework that restores LLM-like risk recognition in VLMs while preserving cross-modal reasoning capabilities.
Abstract: Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model’s LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
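A sketch of the modulation under stated assumptions: the unsafe prototype subspace is taken as the top right-singular vectors of unsafe text embeddings, the per-token risk score as the projection norm, and the amplification gain is arbitrary; the paper's construction may differ on all three choices.

```python
import numpy as np

rng = np.random.default_rng(0)
unsafe_text_emb = rng.normal(size=(32, 64))   # embeddings of unsafe phrases

# Unsafe Prototype Subspace: top right-singular vectors of the unsafe set.
_, _, vt = np.linalg.svd(unsafe_text_emb, full_matrices=False)
U = vt[:4]                                    # (4, 64) subspace basis

visual_tokens = rng.normal(size=(196, 64))    # one image's token features
proj = visual_tokens @ U.T                    # components inside the subspace
risk = np.linalg.norm(proj, axis=1)           # per-token risk score
top = np.argsort(risk)[-8:]                   # select high-risk tokens only

alpha = 0.5                                   # amplification gain
visual_tokens[top] += alpha * (proj[top] @ U) # amplify their unsafe signal
```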
[288] Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Xiumin Wang, Li Shen
Main category: cs.AI
TL;DR: Surgery: A fine-tuning defense method that uses attention sink divergence analysis to mitigate harmful fine-tuning in LLMs by steering attention heads away from harmful pattern learning.
Details
Motivation: Harmful fine-tuning can invalidate safety alignment in large language models, creating significant safety risks. Current defenses are insufficient, so the authors explore attention mechanisms to detect and prevent harmful fine-tuning.
Method: Proposes the Surgery defense based on sink-divergence analysis: 1) measures a sink-divergence statistic for each attention head, 2) observes that heads separate into positive- and negative-divergence groups, 3) uses a regularizer to suppress positive sink divergence (associated with harmful learning), steering heads toward the negative-divergence group. A sketch of where such a penalty enters the loss follows this entry.
Result: Surgery improves defense performance by 5.90% on BeaverTails, 11.25% on HarmBench, and 9.55% on SorryBench benchmarks compared to baseline methods.
Conclusion: Attention sink divergence provides a measurable signal for detecting harmful fine-tuning, and the proposed Surgery method effectively defends against such attacks by steering attention mechanisms away from harmful pattern learning.
Abstract: Harmful fine-tuning can invalidate safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named “sink divergence” for each attention head and observe that different attention heads exhibit two different signs of sink divergence. To understand its safety implications, we conduct experiments and find that the number of attention heads of positive sink divergence increases along with the increase of the model’s harmfulness when undergoing harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis – attention heads associating with learning harmful patterns during fine-tuning are separable by their sign of sink divergence. Based on the hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model’s tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available on https://github.com/Lslland/Surgery.
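The paper's precise sink-divergence statistic is not reproduced in this summary, so the sketch below uses an assumed proxy (change in per-head attention mass on the sink/first token relative to a reference model) just to show where a suppression penalty would enter a fine-tuning loss.

```python
import numpy as np

def sink_mass(attn):
    # attn: (heads, queries, keys) attention weights; column 0 is the sink.
    return attn[:, :, 0].mean(axis=1)

heads, seq = 8, 16
ref_attn = np.random.dirichlet(np.ones(seq), size=(heads, seq))
ft_attn = np.random.dirichlet(np.ones(seq), size=(heads, seq))

divergence = sink_mass(ft_attn) - sink_mass(ref_attn)        # per head
penalty = np.maximum(divergence, 0.0).sum()   # suppress positive divergence
print(divergence.round(3), f"penalty={penalty:.3f}")
# In training, `penalty` would be added (weighted) to the fine-tuning loss.
```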
[289] PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences
Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain
Main category: cs.AI
TL;DR: LLMs achieve human-expert-level negotiation performance in realistic business scenarios, with GPT-5 matching or outperforming trained business students, though robustness and trustworthiness challenges remain.
Details
Motivation: To evaluate LLMs' negotiation capabilities - a complex business task requiring strategic reasoning, theory of mind, and economic value creation - using realistic scenarios from MBA courses.
Method: Developed the PieArena benchmark with realistic negotiation scenarios from MBA courses, used multi-agent interactions, and created a statistically grounded ranking model with confidence intervals to evaluate continuous negotiation payoffs.
Result: GPT-5 matched or outperformed trained business-school students despite their semester of instruction and coaching. Joint-intentionality scaffolding helped mid/lower-tier LMs more than frontier models. Behavioral analysis revealed cross-model heterogeneity in deception, computation accuracy, and reputation.
Conclusion: Frontier language agents are intellectually and psychologically capable for high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.
Abstract: We present an in-depth evaluation of LLMs’ ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We develop a statistically grounded ranking model for continuous negotiation payoffs that produces leaderboards with principled confidence intervals and corrects for experimental asymmetries. We find systematic evidence of human-expert-level performance in which a representative frontier language agent (GPT-5) matches or outperforms trained business-school students, despite a semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint-intentionality agentic scaffolding and observe asymmetric gains, with large improvements for mid- and lower-tier LMs and diminishing returns for frontier LMs. Beyond deal outcomes, PieArena provides a multi-dimensional negotiation behavioral profile, revealing novel cross-model heterogeneity, masked by deal-outcome-only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.
[290] Progress Constraints for Reinforcement Learning in Behavior Trees
Finn Rietz, Mart Kartašev, Petter Ögren, Johannes A. Stork
Main category: cs.AI
TL;DR: Combining Behavior Trees with Reinforcement Learning using progress constraints to prevent controllers from undoing previously achieved subgoals, improving performance and sample efficiency.
Details
Motivation: Behavior Trees provide structured decision-making while RL learns optimal controllers, but naive integration can cause controllers to counteract each other and undo achieved subgoals, degrading overall performance.
Method: Proposes progress constraints: feasibility estimators constrain the allowed action set, based on theoretical BT convergence results, to prevent controllers from undoing previously achieved subgoals (an action-masking sketch follows this entry).
Result: Empirical evaluations in 2D proof-of-concept and high-fidelity warehouse environments show improved performance, sample efficiency, and constraint satisfaction compared to prior BT-RL integration methods.
Conclusion: Progress constraints enable effective BT-RL integration by preventing counterproductive controller interactions, leveraging BT structure to simplify RL training while maintaining learning capabilities.
Abstract: Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.
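In spirit, the mechanism is an action mask: given the subgoals already achieved, only actions that a feasibility estimator deems unlikely to undo them stay selectable. The estimator values and threshold below are hypothetical.

```python
def feasible_after(state, action, subgoal):
    # Learned estimator: P(subgoal still holds/achievable after action).
    return state["estimates"][(action, subgoal)]

def allowed_actions(state, actions, achieved_subgoals, tau=0.9):
    """Keep only actions unlikely to undo any achieved subgoal."""
    return [a for a in actions
            if all(feasible_after(state, a, g) >= tau
                   for g in achieved_subgoals)]

state = {"estimates": {("push", "door_open"): 0.2,
                       ("wait", "door_open"): 0.99}}
print(allowed_actions(state, ["push", "wait"], {"door_open"}))  # ['wait']
```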
[291] MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
Guanglong Sun, Hongwei Yan, Liyuan Wang, Zhiqi Kang, Shuang Cui, Hang Su, Jun Zhu, Yi Zhong
Main category: cs.AI
TL;DR: MePo is a meta-learning approach for general continual learning that refines pretrained models using pseudo task sequences and meta covariance matrices for better adaptation to evolving data streams.
Details
Motivation: Current continual learning methods using pretrained models struggle with the diverse and temporally mixed information in single-pass online data streams, leading to suboptimal performance in general continual learning scenarios with blurry task boundaries.
Method: MePo constructs pseudo task sequences from pretraining data and uses bi-level meta-learning to refine pretrained backbones. It initializes a meta covariance matrix as the reference geometry of the pretrained representation space to exploit second-order statistics for robust output alignment. An illustrative second-order classifier follows this entry.
Result: MePo achieves significant performance gains across various GCL benchmarks (15.10% on CIFAR-100, 13.36% on ImageNet-R, 12.56% on CUB-200 under Sup-21/1K) in a rehearsal-free manner.
Conclusion: MePo is an effective plug-in strategy that enhances pretrained models for general continual learning by leveraging meta-learning principles and second-order statistics, enabling better adaptation to evolving data streams without rehearsal.
Abstract: To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub-optimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10%, 13.36%, and 12.56% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K). Our source code is available at https://github.com/SunGL001/MePo.
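To see how second-order statistics can drive output alignment, the sketch below fits a shared covariance matrix and classifies by Mahalanobis distance to class prototypes; this classical construction is an illustrative stand-in for MePo's meta covariance procedure, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 16))             # pretrained features
labels = rng.integers(0, 4, size=200)          # 4 classes

means = np.stack([feats[labels == c].mean(0) for c in range(4)])
centered = feats - means[labels]
cov = centered.T @ centered / len(feats) + 1e-3 * np.eye(16)  # shared covariance
prec = np.linalg.inv(cov)

def classify(x):
    # Mahalanobis distance to each class prototype under the shared geometry.
    d = [(x - m) @ prec @ (x - m) for m in means]
    return int(np.argmin(d))

print(classify(feats[0]), labels[0])
```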
[292] From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
Yuhang Wang, Feiming Xu, Zheng Lin, Guangyu He, Yuzhe Huang, Haichang Gao, Zhenxing Niu, Shiguo Lian, Zhaoxiang Liu
Main category: cs.AI
TL;DR: PASB is a security evaluation framework for personalized AI agents that assesses vulnerabilities in real-world deployments, using OpenClaw as a case study to reveal critical security risks across different execution stages.
Details
Motivation: Existing agent security research focuses on synthetic or task-centric settings, failing to capture the attack surface and risk propagation mechanisms of personalized agents in real-world deployments, creating a need for more realistic security evaluation frameworks.
Method: Proposes PASB (Personalized Agent Security Bench), an end-to-end security evaluation framework incorporating personalized usage scenarios, realistic toolchains, and long-horizon interactions for black-box evaluation on real systems, using OpenClaw as a representative case study.
Result: OpenClaw exhibits critical vulnerabilities at different execution stages including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments.
Conclusion: Personalized AI agents like OpenClaw have significant security vulnerabilities in real-world deployments, and PASB provides a comprehensive framework for evaluating these risks across multiple personalized scenarios and attack types.
Abstract: Although large language model (LLM)-based agents, exemplified by OpenClaw, are increasingly evolving from task-oriented systems into personalized AI assistants for solving complex real-world tasks, their practical deployment also introduces severe security risks. However, existing agent security research and evaluation frameworks primarily focus on synthetic or task-centric settings, and thus fail to accurately capture the attack surface and risk propagation mechanisms of personalized agents in real-world deployments. To address this gap, we propose Personalized Agent Security Bench (PASB), an end-to-end security evaluation framework tailored for real-world personalized agents. Building upon existing agent attack paradigms, PASB incorporates personalized usage scenarios, realistic toolchains, and long-horizon interactions, enabling black-box, end-to-end security evaluation on real systems. Using OpenClaw as a representative case study, we systematically evaluate its security across multiple personalized scenarios, tool capabilities, and attack types. Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, including user prompt processing, tool usage, and memory retrieval, highlighting substantial security risks in personalized agent deployments. The code for the proposed PASB framework is available at https://github.com/AstorYH/PASB.
[293] Reinforcement Inference: Leveraging Uncertainty for Self-Correcting Language Model Reasoning
Xinhai Sun
Main category: cs.AI
TL;DR: Reinforcement Inference: An entropy-aware inference-time control strategy that uses model uncertainty to selectively invoke a second reasoning attempt, improving accuracy without retraining.
Details
Motivation: Current one-shot greedy inference protocols systematically underestimate LLM capabilities by causing premature commitment under internal ambiguity, where errors often arise from uncertainty rather than missing knowledge.
Method: Uses the model's own uncertainty (entropy) as a control signal to selectively invoke a second, more deliberate reasoning attempt when the model is uncertain, rather than always re-asking. A minimal sketch of the gate follows this entry.
Result: On MMLU-Pro (12,032 questions, 14 subjects), improves DeepSeek-v3.2 accuracy from 60.72% to 84.03% with only 61.06% additional inference calls. A 100% re-asking ablation reaches 84.35%, showing uncertainty-aware selection captures most gains efficiently.
Conclusion: Proposes an entropy-aware paradigm for measuring and expanding model capability, suggesting the gap between greedy inference and uncertainty-conditioned deliberation offers diagnostic insight into LLMs’ latent reasoning and motivates training for correctness-confidence alignment.
Abstract: Modern large language models (LLMs) are often evaluated and deployed under a one-shot, greedy inference protocol, especially in professional settings that require deterministic behavior. This regime can systematically under-estimate a fixed model’s true capability: many errors arise not from missing knowledge, but from premature commitment under internal ambiguity. We introduce Reinforcement Inference, an entropy-aware inference-time control strategy that uses the model’s own uncertainty to selectively invoke a second, more deliberate reasoning attempt, enabling stronger performance without any retraining. On 12,032 MMLU-Pro questions across 14 subjects, using DeepSeek-v3.2 with deterministic decoding in a zero-shot setting, Reinforcement Inference improves accuracy from 60.72% to 84.03%, while only incurring 61.06% additional inference calls. A 100% re-asking ablation reaches 84.35%, indicating that uncertainty-aware selection captures most of the attainable improvement with substantially less compute. Moreover, a prompt-only ablation underperforms the baseline, suggesting that the gains are not explained by generic prompting alone. Beyond providing a practical inference-time upgrade, our results suggest a broader entropy-aware paradigm for measuring and expanding model capability: because modern decoder-based models generate outputs autoregressively, entropy and related confidence measures arise naturally as first-class control signals during generation. The resulting gap between one-pass greedy inference and uncertainty-conditioned deliberation offers a diagnostic lens on an LLM’s latent reasoning horizon and motivates future training objectives that explicitly constrain correctness–confidence alignment.
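The control strategy reduces to a few lines around any API that exposes per-token log-probabilities: a cheap first pass is kept when its mean token entropy is low, and a single deliberate pass is triggered otherwise. The entropy estimate, prompt, and threshold are assumptions.

```python
import math

def mean_token_entropy(top_logprobs):
    """top_logprobs: per token, a list of (token, logprob) candidates."""
    entropies = []
    for cands in top_logprobs:
        ps = [math.exp(lp) for _, lp in cands]
        z = sum(ps)                            # renormalize the truncated tail
        entropies.append(-sum(p / z * math.log(p / z) for p in ps))
    return sum(entropies) / len(entropies)

def answer(ask, question, threshold=0.8):
    text, logprobs = ask(question)             # first, cheap pass
    if mean_token_entropy(logprobs) <= threshold:
        return text                            # confident: keep it
    deliberate = ("Think step by step, double-check each option, "
                  "then answer: " + question)
    return ask(deliberate)[0]                  # selective second pass

fake_api = lambda q: ("B", [[("A", -1.4), ("B", -0.4)]])
print(answer(fake_api, "2 + 2 = ?"))           # low entropy -> "B" kept
```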
[294] Why do we Trust Chatbots? From Normative Principles to Behavioral Drivers
Aditya Gulati, Nuria Oliver
Main category: cs.AI
TL;DR: Paper examines trust in chatbots, distinguishing between normative trustworthiness and psychological trust formation shaped by design choices, proposing to reframe chatbots as skilled salespeople rather than companions.
Details
Motivation: As chatbots become more human-like, there's a need to examine how trust is formed in these systems, particularly since regulatory frameworks focus on normative trust while user trust often emerges from behavioral mechanisms and design choices that leverage cognitive biases.
Method: Conceptual analysis and theoretical framework development examining the disconnect between normative trustworthiness (as defined by regulations) and psychological trust formation in users, proposing a reframing of chatbots' role.
Result: Identifies that trust in chatbots is often not earned through demonstrated trustworthiness but shaped by interactional design choices, and proposes viewing chatbots as skilled salespeople whose objectives align with deploying organizations rather than as companions or assistants.
Conclusion: The coexistence of competing notions of “trust” obscures important distinctions between psychological trust formation and normative trustworthiness, requiring further research and stronger support mechanisms to help users appropriately calibrate trust in conversational AI systems.
Abstract: As chatbots increasingly blur the boundary between automated systems and human conversation, the foundations of trust in these systems warrant closer examination. While regulatory and policy frameworks tend to define trust in normative terms, the trust users place in chatbots often emerges from behavioral mechanisms. In many cases, this trust is not earned through demonstrated trustworthiness but is instead shaped by interactional design choices that leverage cognitive biases to influence user behavior. Based on this observation, we propose reframing chatbots not as companions or assistants, but as highly skilled salespeople whose objectives are determined by the deploying organization. We argue that the coexistence of competing notions of “trust” under a shared term obscures important distinctions between psychological trust formation and normative trustworthiness. Addressing this gap requires further research and stronger support mechanisms to help users appropriately calibrate trust in conversational AI systems.
[295] Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning
Andrés Holgado-Sánchez, Peter Vamplew, Richard Dazeley, Sascha Ossowski, Holger Billhardt
Main category: cs.AI
TL;DR: Learning socially-derived value alignment models and value systems for agent societies using clustering and preference-based multi-objective reinforcement learning.
Details
Motivation: AI systems should recognize human values and adapt to different users' value systems, but current approaches have limitations: they require manual feature design, lack value-based interpretability, or can't adapt to diverse user preferences.
Method: Proposes algorithms for learning value alignment models and value systems in Markov Decision Processes using clustering and preference-based multi-objective reinforcement learning (PbMORL). Jointly learns socially derived value alignment models and value systems representing different user groups (a clustering toy follows this entry).
Result: Evaluated against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.
Conclusion: The approach enables learning of value systems that concisely represent different user groups and their aligned behaviors through approximately Pareto-optimal policies.
Abstract: Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.
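One way to picture the society-level step: represent each user's value system as a weight vector over value groundings and cluster those vectors, so each cluster center is a shared value system for which a policy can then be trained. Plain k-means below stands in for the paper's joint clustering-and-PbMORL procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
# Each row: one user's inferred weights over (safety, speed, comfort).
users = np.vstack([rng.dirichlet([8, 1, 1], 10),    # safety-first group
                   rng.dirichlet([1, 8, 1], 10)])   # speed-first group

k, centers = 2, users[[0, 10]].copy()               # one seed per group
for _ in range(20):                                 # Lloyd's iterations
    assign = np.argmin(((users[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.stack([users[assign == c].mean(0) for c in range(k)])

print(centers.round(2))     # each center is a cluster's value system
# A cluster's scalarized reward would be r(s, a) = centers[c] @ values(s, a).
```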
[296] SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu
Main category: cs.AI
TL;DR: SpotAgent is an agentic reasoning framework for geo-localization that combines visual interpretation with tool-assisted verification using web search and maps to address sparse, ambiguous visual cues in real-world scenarios.
Details
Motivation: Current Large Vision-Language Models struggle with real-world geo-localization, where visual cues are sparse, long-tailed, and ambiguous, often producing confident but ungrounded predictions due to reliance on internal knowledge without verification.
Method: Proposes the SpotAgent framework, formalizing geo-localization as agentic reasoning with external tools (web search, maps) via a ReAct loop (sketched after this entry). Uses a three-stage post-training pipeline: 1) supervised fine-tuning for basic alignment, 2) an agentic cold start with high-quality trajectories synthesized by a multi-agent framework to instill tool-calling expertise, 3) reinforcement-learning refinement with Spatially-Aware Dynamic Filtering to prioritize learnable samples.
Result: Extensive experiments on standard benchmarks show SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
Conclusion: SpotAgent successfully addresses limitations of existing LVLMs in geo-localization by integrating agentic reasoning with external tool verification, providing a framework for more reliable and accurate real-world visual understanding tasks.
Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model’s reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
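The agentic core is a standard ReAct-style loop: the model alternates reasoning and tool calls until it commits to a grounded answer. The tool, scripted model, and action syntax below are hypothetical.

```python
def react(llm, tools, image_description, max_steps=5):
    transcript = f"Image: {image_description}\n"
    for _ in range(max_steps):
        step = llm(transcript)                 # emits a Thought/Action/Final line
        transcript += step + "\n"
        if step.startswith("Final:"):
            return step.removeprefix("Final:").strip()
        if step.startswith("Action:"):         # e.g. "Action: search|red tram"
            name, arg = step.removeprefix("Action:").strip().split("|", 1)
            observation = tools[name](arg)     # web search, maps, ...
            transcript += f"Observation: {observation}\n"
    return "unresolved"

tools = {"search": lambda q: "Red trams operate in Lisbon and Milan."}
script = iter(["Action: search|red tram city", "Final: Lisbon, Portugal"])
print(react(lambda _: next(script), tools, "a red tram on a steep street"))
```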
[297] ClinAlign: Scaling Healthcare Alignment from Clinician Preference
Shiwei Lyu, Xidong Wang, Lei Liu, Hao Zhu, Chaohe Zhang, Jian Wang, Jinjie Gu, Benyou Wang, Yue Shen
Main category: cs.AI
TL;DR: A framework for aligning LLMs with clinician preferences using physician-verified rubrics distilled into reusable clinical principles, achieving state-of-the-art performance on medical benchmarks.
Details
Motivation: LLMs show expert medical knowledge but their open-ended outputs don't align well with fine-grained clinician preferences. Existing methods use coarse objectives or unreliable automated judges not grounded in professional guidelines.
Method: Two-stage framework: 1) HealthRubrics, a dataset of 7,034 physician-verified preference examples where clinicians refine LLM-drafted rubrics; 2) distillation into HealthPrinciples - 119 reusable, clinically grounded principles organized by clinical dimensions. These are used for offline alignment (synthesizing rubrics for unlabeled queries) and inference-time guided self-revision.
Result: A 30B-A3B model trained with this framework achieves 33.4% on HealthBench-Hard, outperforming much larger models including Deepseek-R1 and o3, establishing a resource-efficient baseline for clinical alignment.
Conclusion: The proposed framework effectively aligns LLMs with clinician preferences through physician-verified rubrics and distilled principles, enabling scalable supervision and achieving state-of-the-art performance on medical benchmarks with smaller models.
Abstract: Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. A 30B-A3B model trained with our framework achieves 33.4% on HealthBench-Hard, outperforming much larger models including Deepseek-R1 and o3, establishing a resource-efficient baseline for clinical alignment.
[298] CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully
Main category: cs.AI
TL;DR: CODE-SHARP is a framework that uses Foundation Models to open-endedly discover and refine hierarchical skills as executable reward programs in code, enabling goal-conditioned agents to solve complex long-horizon tasks.
Details
Motivation: Current reinforcement learning relies on hand-designed reward functions, which is infeasible for open-ended skill discovery where meaningful skills aren't known beforehand. Existing automated reward-design methods are limited to refining rewards for pre-defined tasks.
Method: Uses Foundation Models to open-endedly expand and refine a hierarchical skill archive structured as a directed graph of executable reward functions in code (a minimal rendering follows this entry). Combines this with a goal-conditioned agent trained on the discovered rewards and a high-level FM-based planner for task composition.
Result: The agent learns to solve increasingly long-horizon goals in Craftax environment. When composed by the FM-based planner, it solves complex long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average.
Conclusion: CODE-SHARP demonstrates a novel approach to open-ended skill discovery using Foundation Models and hierarchical reward programs, enabling agents to autonomously discover and compose skills for complex task solving.
Abstract: Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
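A minimal rendering of "skills as hierarchical reward programs": each skill is an executable reward function plus edges to prerequisite skills, and a skill only pays out once its ancestors are satisfied. The Craftax-flavored state fields are invented; in CODE-SHARP a foundation model writes and refines these programs.

```python
SKILLS = {  # name -> (parent skills, executable reward program over state)
    "collect_wood": ((), lambda s: float(s["wood"] > 0)),
    "make_table":   (("collect_wood",), lambda s: float(s["table"])),
    "make_pickaxe": (("make_table",), lambda s: float(s["pickaxe"])),
}

def reward(skill, state):
    """A skill pays out only once all of its ancestors are satisfied."""
    parents, program = SKILLS[skill]
    if all(reward(p, state) == 1.0 for p in parents):
        return program(state)
    return 0.0

state = {"wood": 3, "table": True, "pickaxe": False}
print([(name, reward(name, state)) for name in SKILLS])
```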
[299] Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
Main category: cs.AI
TL;DR: AWM is a synthetic environment generation pipeline that creates 1,000 code-driven environments for training multi-turn tool-use agents, enabling scalable RL training with reliable state transitions.
Details
Motivation: Scaling autonomous agent training is limited by the lack of diverse and reliable environments for multi-turn interactions with tools.
Method: Proposes the Agent World Model (AWM) pipeline, which generates fully synthetic, code-driven environments backed by databases, providing rich toolsets (35 tools per environment on average) and high-quality observations (see the sketch after the abstract).
Result: Created 1,000 environments covering everyday scenarios; enables large-scale RL for multi-turn tool-use agents with reliable reward functions; shows strong out-of-distribution generalization on three benchmarks.
Conclusion: Synthetic environments generated by AWM provide scalable, reliable training resources for autonomous agents, overcoming limitations of LLM-simulated environments and enabling better generalization.
Abstract: Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
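A minimal sketch of what a code-driven, database-backed environment can look like. The environment, tools, and reward below are hypothetical stand-ins for AWM's generated ones; the point they illustrate is that when the full database state is inspectable, transitions are deterministic and rewards can be checked directly.

    class ShoppingEnv:
        def __init__(self):
            self.db = {"orders": {}, "next_id": 1}  # stands in for a real database

        # Each tool is an ordinary function; observations are its return values.
        def place_order(self, item: str, qty: int) -> dict:
            oid = self.db["next_id"]
            self.db["next_id"] += 1
            self.db["orders"][oid] = {"item": item, "qty": qty, "status": "open"}
            return {"order_id": oid}

        def cancel_order(self, order_id: int) -> dict:
            order = self.db["orders"].get(order_id)
            if order is None:
                return {"error": "unknown order"}
            order["status"] = "cancelled"
            return {"order_id": order_id, "status": "cancelled"}

        def reward(self, goal: dict) -> float:
            # Rewards come from direct state inspection, not an LLM judge.
            return float(any(o["item"] == goal["item"] and o["status"] == "open"
                             for o in self.db["orders"].values()))

    env = ShoppingEnv()
    obs = env.place_order("usb-c cable", 2)
    print(obs, env.reward({"item": "usb-c cable"}))  # {'order_id': 1} 1.0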
cs.SD
[300] Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids’s Story Speech Synthesis
Raymond Chung
Main category: cs.SD
TL;DR: Expressive speech synthesis method using dataset augmentation with emotionally congruent text merging and contrastive training for better prosody and pause timing.
Details
Motivation: Expressive speech synthesis requires vibrant prosody and well-timed pauses, but training such models typically requires large datasets. The paper addresses how to effectively train expressive TTS models with limited data.
Method: 1) Augment the small dataset by merging audios of emotionally congruent text using a text emotion recognizer (see the sketch after the abstract); 2) Train with two-sentence audio to learn natural breaks between lines; 3) Apply self-supervised contrastive training to improve speaking style embedding extraction from speech; 4) During inference, produce multi-sentence speech in one step guided by the text-predicted speaking style.
Result: The approach outperforms a baseline trained with consecutive two-sentence audio: the synthesized speech has an inter-sentence pause distribution closer to the ground truth, and subjective evaluations show higher scores in naturalness and style suitability.
Conclusion: The proposed strategy effectively trains expressive TTS models with limited data, achieving better prosody, pause timing, and naturalness through dataset augmentation and contrastive style embedding learning.
Abstract: Expressive speech synthesis requires vibrant prosody and well-timed pauses. We propose an effective strategy to augment a small dataset to train an expressive end-to-end Text-to-Speech model. We merge audios of emotionally congruent text using a text emotion recognizer, creating augmented expressive speech data. By training with two-sentence audio, our model learns natural breaks between lines. We further apply self-supervised contrastive training to improve the speaking style embedding extraction from speech. During inference, our model produces multi-sentence speech in one step, guided by the text-predicted speaking style. Evaluations showcase the effectiveness of our proposed approach when compared to a baseline model trained with consecutive two-sentence audio. Our synthesized speech exhibits an inter-sentence pause distribution closer to that of the ground-truth speech. Subjective evaluations reveal our synthesized speech scored higher in naturalness and style suitability than the baseline.
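A minimal sketch of the augmentation step under stated assumptions: `predict_emotion` is some text emotion recognizer, `waveforms` are per-sentence audio arrays at sampling rate `sr`, and the pairing rule and pause length are illustrative choices, not the paper's exact recipe.

    import random
    import numpy as np

    def augment_pairs(sentences, waveforms, predict_emotion, sr=22050, gap_s=0.3):
        # Group sentences by predicted emotion, then pair same-emotion
        # sentences into two-sentence training clips with a short pause.
        by_emotion = {}
        for text, wav in zip(sentences, waveforms):
            by_emotion.setdefault(predict_emotion(text), []).append((text, wav))
        pause = np.zeros(int(gap_s * sr), dtype=np.float32)
        merged = []
        for group in by_emotion.values():
            random.shuffle(group)
            for (t1, w1), (t2, w2) in zip(group[::2], group[1::2]):
                merged.append((t1 + " " + t2, np.concatenate([w1, pause, w2])))
        return merged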
[301] AudioRouter: Data Efficient Audio Understanding via RL based Dual Reasoning
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Main category: cs.SD
TL;DR: AudioRouter is a reinforcement learning framework that enables Large Audio Language Models to improve audio understanding by learning when and how to use external audio tools through explicit decision-making, achieving better performance with far less training data.
Details
Motivation: Current Large Audio Language Models (LALMs) have unreliable performance on fine-grained auditory perception and require data-intensive training to internalize perceptual abilities. There's a need for more data-efficient approaches to enhance audio understanding capabilities.
Method: AudioRouter uses reinforcement learning to teach LALMs when and how to use external audio tools. Instead of tightly coupling tool usage with audio reasoning, it formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen (see the sketch after the abstract).
Result: AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms.
Conclusion: Learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in Large Audio Language Models, suggesting a promising direction for enhancing audio understanding capabilities.
Abstract: Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize perceptual abilities. We propose AudioRouter, a reinforcement learning framework that enables LALMs to improve audio understanding by learning when and how to use external audio tools. Rather than tightly coupling tool usage with audio reasoning, AudioRouter formulates tool use as an explicit decision-making problem and optimizes a lightweight routing policy while keeping the underlying reasoning model frozen. Experimental results show that AudioRouter achieves substantial improvements on standard audio understanding benchmarks while requiring up to 600x less training data to learn tool usage compared with conventional training paradigms. These findings suggest that learning effective tool usage offers a data-efficient and scalable alternative to internalizing perceptual abilities in LALMs.
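A minimal sketch of the routing idea, assuming a frozen LALM that exposes a pooled hidden state. The `ToolRouter` head and the REINFORCE-style comment are illustrative, not the paper's training recipe.

    import torch
    import torch.nn as nn

    class ToolRouter(nn.Module):
        """Lightweight policy: given the LALM's pooled hidden state, pick a
        tool (or 'answer directly'); the reasoning model itself stays frozen."""
        def __init__(self, hidden_dim: int, n_tools: int):
            super().__init__()
            self.head = nn.Linear(hidden_dim, n_tools + 1)  # +1 = no tool

        def forward(self, pooled_hidden: torch.Tensor):
            return torch.distributions.Categorical(logits=self.head(pooled_hidden))

    # A REINFORCE-style update would train only the router:
    #   dist = router(pooled); a = dist.sample()
    #   reward = 1.0 if final_answer_correct else 0.0
    #   loss = -(dist.log_prob(a) * reward); loss.backward(); opt.step()

Because only the small head receives gradients, the data requirements are driven by the routing decision rather than by re-learning perception, which is consistent with the reported data efficiency.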
[302] Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity
Hugo L. Hammer, Vajira Thambawita, Pål Halvorsen
Main category: cs.SD
TL;DR: Calliope is an open-source framework that converts text e-books into narrated e-books with synchronized audio highlighting using offline TTS technology, preserving original formatting and avoiding cloud dependencies.
Details
Motivation: While commercial services exist for creating narrated e-books using neural TTS, there are no open-source solutions. The authors aim to provide an accessible, privacy-preserving, and cost-effective alternative that maintains exact synchronization between audio and text highlighting.
Method: Calliope uses state-of-the-art open-source TTS systems (XTTS-v2 and Chatterbox) to generate narration while capturing audio timestamps directly during synthesis, ensuring exact synchronization without forced alignment (see the sketch after the abstract). The framework preserves the publisher’s original typography, styling, and embedded media while operating entirely offline in the EPUB 3 Media Overlay format.
Result: The framework successfully creates narrated e-books with precise audio-text synchronization. Experiments show that alternative forced alignment methods introduce significant drift that degrades the reading experience, while Calliope’s direct timestamp capture method maintains perfect synchronization.
Conclusion: Calliope provides the first open-source solution for creating high-quality narrated e-books with exact audio-text synchronization, addressing privacy, cost, and copyright concerns associated with cloud-based services while supporting literacy and accessibility applications.
Abstract: A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage this technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher’s original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting that is significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
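A minimal sketch of the synchronization idea: if per-sentence durations are captured during synthesis, Media Overlay clip boundaries follow by simple accumulation, with no forced alignment. The SMIL emitted below is schematic (namespaces and packaging details simplified) and is not Calliope's actual output.

    def build_smil(durations, audio_href, text_href):
        # durations: seconds of synthesized audio per sentence id s1, s2, ...
        t, pars = 0.0, []
        for i, dur in enumerate(durations, start=1):
            pars.append(f'<par id="p{i}">'
                        f'<text src="{text_href}#s{i}"/>'
                        f'<audio src="{audio_href}" clipBegin="{t:.3f}s" '
                        f'clipEnd="{t + dur:.3f}s"/>'
                        f'</par>')
            t += dur
        body = "\n    ".join(pars)
        return ('<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">\n'
                f'  <body>\n    {body}\n  </body>\n</smil>')

    print(build_smil([2.4, 3.1], "audio/ch1.mp3", "ch1.xhtml"))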
[303] MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
Main category: cs.SD
TL;DR: A fully end-to-end Transformer-based audio tokenizer (CAT) that scales to 1.6B parameters achieves state-of-the-art audio reconstruction across speech, sound, and music, enabling competitive autoregressive TTS and ASR without auxiliary encoders.
Details
Motivation: Existing discrete audio tokenizers rely on pretrained encoders, semantic distillation, or heterogeneous CNN architectures with fixed inductive biases that limit reconstruction fidelity and scaling. The authors argue for fully end-to-end learning with homogeneous, scalable architectures.
Method: Propose CAT (Causal Audio Tokenizer with Transformer) - a purely Transformer-based architecture that jointly optimizes encoder, quantizer, and decoder from scratch (see the skeleton sketch after the abstract). Scale this to MOSS-Audio-Tokenizer with 1.6B parameters pre-trained on 3M hours of diverse audio data.
Result: Outperforms prior codecs across speech, sound, and music over wide bitrate ranges. Enables first purely autoregressive TTS model surpassing prior non-autoregressive systems. Achieves competitive ASR without auxiliary encoders. Shows predictable improvements with scale.
Conclusion: The CAT architecture serves as a unified, scalable interface for next-generation native audio foundation models, demonstrating that simple, fully end-to-end Transformer-based approaches scale gracefully and support high-fidelity reconstruction across diverse audio domains.
Abstract: Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
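A skeleton sketch of a CAT-style encoder-quantizer-decoder stack. Shapes and components are illustrative only: causal masking, the quantizer details, and the waveform head are simplified away, and this is not the released architecture.

    import torch
    import torch.nn as nn

    class TinyCAT(nn.Module):
        def __init__(self, d=512, n_codes=1024, n_layers=4):
            super().__init__()
            mk = lambda: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), n_layers)
            self.encoder, self.decoder = mk(), mk()
            self.codebook = nn.Embedding(n_codes, d)  # single-level VQ for brevity

        def forward(self, frames):                    # frames: (B, T, d)
            h = self.encoder(frames)                  # causal mask omitted for brevity
            cb = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)
            ids = torch.cdist(h, cb).argmin(-1)       # nearest codebook entry
            q = self.codebook(ids)
            q = h + (q - h).detach()                  # straight-through gradient
            return self.decoder(q), ids               # reconstruction features, tokens

    x = torch.randn(2, 50, 512)
    recon, tokens = TinyCAT()(x)

The point of the homogeneous design is that every trainable component is the same Transformer block, so the whole tokenizer can be scaled like an ordinary language model.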
[304] SCRAPL: Scattering Transform with Random Paths for Machine Learning
Christopher Mitcheltree, Vincent Lostanlen, Emmanouil Benetos, Mathieu Lagrange
Main category: cs.SD
TL;DR: SCRAPL is a stochastic optimization method that uses random subsets of wavelet scattering transform paths to make perceptual quality assessment computationally efficient for neural network training in audio applications.
Details
Motivation: Wavelet scattering transforms provide excellent perceptual quality gradients for audio and vision tasks, but their computational expense (due to numerous paths) limits their use as differentiable loss functions in neural network training.
Method: Proposes SCRAPL (Scattering transform with Random Paths for machine Learning) - a stochastic optimization scheme that randomly samples subsets of scattering transform paths during training (see the sketch after the abstract). Also introduces an importance sampling initialization heuristic that adapts to dataset perceptual content.
Result: Applied to differentiable digital signal processing (DDSP) for unsupervised sound matching of granular synthesizer and Roland TR-808 drum machine. Improves neural network convergence and evaluation performance while reducing computational cost.
Conclusion: SCRAPL enables efficient use of perceptual scattering transforms in neural network training for audio applications, with code and Python package provided for broader adoption.
Abstract: The Euclidean distance between wavelet scattering transform coefficients (known as paths) provides informative gradients for perceptual quality assessment of deep inverse problems in computer vision, speech, and audio processing. However, these transforms are computationally expensive when employed as differentiable loss functions for stochastic gradient descent due to their numerous paths, which significantly limits their use in neural network training. Against this problem, we propose “Scattering transform with Random Paths for machine Learning” (SCRAPL): a stochastic optimization scheme for efficient evaluation of multivariable scattering transforms. We implement SCRAPL for the joint time-frequency scattering transform (JTFS) which demodulates spectrotemporal patterns at multiple scales and rates, allowing a fine characterization of intermittent auditory textures. We apply SCRAPL to differentiable digital signal processing (DDSP), specifically, unsupervised sound matching of a granular synthesizer and the Roland TR-808 drum machine. We also propose an initialization heuristic based on importance sampling, which adapts SCRAPL to the perceptual content of the dataset, improving neural network convergence and evaluation performance. We make our code and audio samples available and provide SCRAPL as a Python package.
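A minimal sketch of the random-path estimator, under the assumption of a hypothetical `path_fns` list in which each entry computes the coefficients of one scattering path; the optional `weights` stand in for the importance-sampling initialization.

    import torch

    def scrapl_loss(x_hat, x_ref, path_fns, n_paths=8, weights=None):
        # Sample a small subset of paths per step instead of evaluating all
        # of them; a fully unbiased version would also reweight each term by
        # the inverse of its sampling probability.
        probs = weights if weights is not None else torch.ones(len(path_fns))
        idx = torch.multinomial(probs / probs.sum(), n_paths, replacement=False)
        loss = 0.0
        for i in idx:
            loss = loss + torch.mean((path_fns[i](x_hat) - path_fns[i](x_ref)) ** 2)
        return loss / n_paths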
[305] AUDETER: A Large-scale Dataset for Deepfake Audio Detection in Open Worlds
Qizhou Wang, Hanxun Huang, Guansong Pang, Sarah Erfani, Christopher Leckie
Main category: cs.SD
TL;DR: AUDETER is a large-scale diverse deepfake audio dataset with 4,500+ hours of synthetic audio from 11 TTS models and 10 vocoders, plus a curriculum-learning approach to improve detection generalization across diverse deepfake patterns.
Details
Motivation: Current speech synthesis systems produce highly realistic vocalizations that challenge authenticity verification. Existing deepfake detection models suffer from distribution shifts between training and test data due to limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous deepfake sources in datasets.
Method: Created AUDETER dataset with 3 million clips from 11 recent TTS models and 10 vocoders. Proposed curriculum-learning-based approach to mitigate negative transfer across synthesis sources when training on highly diverse deepfake patterns (see the sketch after the abstract).
Result: Existing detection models struggle to generalize to novel deepfakes and human speech in AUDETER. XLR-based detectors trained on AUDETER achieve strong cross-domain performance with 1.87% EER on In-the-Wild benchmark.
Conclusion: AUDETER addresses dataset limitations in deepfake audio detection and enables better evaluation and training. The curriculum-learning approach improves generalization across diverse synthesis sources.
Abstract: Speech synthesis systems can now produce highly realistic vocalisations that pose significant authenticity challenges. Despite substantial progress in deepfake detection models, their real-world effectiveness is often undermined by evolving distribution shifts between training and test data, driven by the complexity of human speech and the rapid evolution of synthesis systems. Existing datasets suffer from limited real speech diversity, insufficient coverage of recent synthesis systems, and heterogeneous mixtures of deepfake sources, which hinder systematic evaluation and open-world model training. To address these issues, we introduce AUDETER (AUdio DEepfake TEst Range), a large-scale and highly diverse deepfake audio dataset comprising over 4,500 hours of synthetic audio generated by 11 recent TTS models and 10 vocoders, totalling 3 million clips. We further observe that most existing detectors default to binary supervised training, which can induce negative transfer across synthesis sources when the training data contains highly diverse deepfake patterns, impacting overall generalisation. As a complementary contribution, we propose an effective curriculum-learning-based approach to mitigate this effect. Extensive experiments show that existing detection models struggle to generalise to novel deepfakes and human speech in AUDETER, whereas XLR-based detectors trained on AUDETER achieve strong cross-domain performance across multiple benchmarks, achieving an EER of 1.87% on In-the-Wild. AUDETER is available on GitHub.
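A minimal sketch of one plausible source-level curriculum; this is our illustration of the idea, not the paper's exact schedule. Synthesis sources enter training in stages, accumulating from easier to harder, so the detector is not confronted with all heterogeneous deepfake patterns at once.

    def curriculum_batches(clips_by_source, source_difficulty, n_stages=3):
        # clips_by_source: {source_name: [clip, ...]}; lower difficulty = earlier.
        ordered = sorted(clips_by_source, key=lambda s: source_difficulty[s])
        stage_size = max(1, len(ordered) // n_stages)
        active, stages = [], []
        for i in range(0, len(ordered), stage_size):
            active = active + ordered[i:i + stage_size]
            stages.append([c for s in active for c in clips_by_source[s]])
        return stages  # train one stage after another, sources accumulating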
[306] VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale
Chi Zhang, Zehua Chen, Kaiwen Zheng, Jun Zhu
Main category: cs.SD
TL;DR: VoiceBridge is a general speech restoration system using latent bridge models to reconstruct high-fidelity 48kHz speech from various distortions through a single latent-to-latent generative process.
Details
Motivation: Existing bridge models for speech enhancement are typically limited to single tasks or small datasets, lacking general speech restoration capability at scale. The authors aim to create a comprehensive system that can handle diverse low-quality to high-quality restoration tasks.
Method: Uses latent bridge models with a scalable transformer architecture, compressing speech waveforms into continuous latent representations. Introduces an energy-preserving variational autoencoder for better waveform-latent alignment, a joint neural prior to handle different low-quality priors, and perceptually aware fine-tuning for improved human perceptual quality.
Result: VoiceBridge demonstrates superior performance across in-domain and out-of-domain tasks and datasets, including refining zero-shot speech and podcast generation results, showing strong general speech restoration capability.
Conclusion: VoiceBridge provides an effective general speech restoration system that can reconstruct high-fidelity speech from various distortions through latent bridge modeling, with innovations in energy preservation, joint neural priors, and perceptual fine-tuning.
Abstract: Bridge models have recently been explored for speech enhancement tasks such as denoising, dereverberation, and super-resolution, while these efforts are typically confined to a single task or small-scale datasets, with constrained general speech restoration (GSR) capability at scale. In this work, we introduce VoiceBridge, a GSR system rooted in latent bridge models (LBMs), capable of reconstructing high-fidelity speech at full-band (i.e., 48kHz) from various distortions. By compressing speech waveforms into continuous latent representations, VoiceBridge models the diverse LQ-to-HQ tasks (namely, low-quality to high-quality) in GSR with a single latent-to-latent generative process backed by a scalable transformer architecture. To better inherit the advantages of bridge models from the data domain to the latent space, we present an energy-preserving variational autoencoder, enhancing the alignment between the waveform and latent space over varying energy levels. Furthermore, to address the difficulty of HQ reconstruction from distinctively different LQ priors, we propose a joint neural prior, uniformly alleviating the reconstruction burden of LBM. Finally, considering the key requirement of GSR systems, human perceptual quality, a perceptually aware fine-tuning stage is designed to mitigate the cascading mismatch in generation while improving perceptual alignment. Extensive validation across in-domain and out-of-domain tasks and datasets (e.g., refining recent zero-shot speech and podcast generation results) demonstrates the superior performance of VoiceBridge. Demo samples can be visited at: https://VoiceBridge-demo.github.io/.
[307] Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard
Yudong Yang, Xuezhen Zhang, Zhifeng Han, Siyin Wang, Jimin Zhuang, Zengrui Jin, Jing Shao, Guangzhi Sun, Chao Zhang
Main category: cs.SD
TL;DR: SACRED-Bench is a benchmark for evaluating LLM safety against complex audio-based attacks using speech-audio composition, showing high attack success rates even on state-of-the-art models, with proposed SALMONN-Guard defense.
Details
Motivation: Current LLM safety safeguards are inadequate for complex audio inputs that combine harmful and benign content, creating new safety risks that need to be addressed.
Method: SACRED-Bench uses three speech-audio composition mechanisms: overlapping harmful/benign speech, mixing benign speech with harmful non-speech audio, and multi-speaker dialogues (see the sketch after the abstract). Questions implicitly refer to audio content without explicit harmful text. SALMONN-Guard is proposed as a defense model that jointly inspects speech, audio, and text.
Result: Even Gemini 2.5 Pro with full safety guardrails shows 66% attack success rate. SALMONN-Guard reduces this to 20%, demonstrating effectiveness of audio-aware defenses.
Conclusion: Audio-aware defenses are crucial for multimodal LLM safety, as current text-focused safeguards fail against complex audio attacks. The benchmark and guard model provide tools for improving audio safety.
Abstract: Recent progress in LLMs has enabled understanding of audio signals, but has also exposed new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition to enable effective black-box attacks. SACRED-Bench adopts three composition mechanisms: (a) overlap of harmful and benign speech, (b) mixture of benign speech with harmful non-speech audio, and (c) multi-speaker dialogue. These mechanisms focus on evaluating safety in settings where benign and harmful intents co-occur within a single auditory scene. Moreover, questions in SACRED-Bench are designed to implicitly refer to content in the audio, such that no explicit harmful information appears in the text prompt alone. Experiments demonstrate that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM with safety guardrails fully enabled, still exhibits a 66% attack success rate. To bridge this gap, we propose SALMONN-Guard, the first guard model that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses to ensure the safety of multimodal LLMs. The dataset and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench.
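A minimal sketch of the first composition mechanism, overlapping a benign and a second speech track at a chosen level difference. This is purely illustrative; `ratio_db` is a hypothetical knob, not a parameter from the paper.

    import numpy as np

    def overlap(benign, overlay, ratio_db=0.0):
        # Mix two mono waveforms into a single auditory scene.
        n = max(len(benign), len(overlay))
        a = np.pad(benign, (0, n - len(benign)))
        b = np.pad(overlay, (0, n - len(overlay)))
        gain = 10.0 ** (-ratio_db / 20.0)  # attenuate the overlaid track
        mix = a + gain * b
        return mix / max(1e-8, np.max(np.abs(mix)))  # peak-normalize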
[308] UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
Main category: cs.SD
TL;DR: UniAudio 2.0 introduces a novel audio tokenizer (ReasoningCodec) with reasoning and reconstruction tokens, and trains a unified autoregressive model on massive text/audio data for strong few-shot/zero-shot generalization across speech, sound, and music tasks.
Details
Motivation: The paper addresses two foundational problems in audio language models: designing an audio tokenizer that serves as intermediate representation for both understanding and generation, and building an audio foundation model that generalizes in few-shot/zero-shot settings like large language models.
Method: Proposes ReasoningCodec, a discrete audio codec with reasoning tokens (text-aligned, high-level analysis/planning) and reconstruction tokens (semantic-rich acoustic cues). Also introduces unified autoregressive architecture for text/audio with multi-stage training and multi-task data construction, trained on 100B text tokens and 60B audio tokens.
Result: Achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity. UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks across speech, sound, and music domains.
Conclusion: The proposed approach successfully addresses both audio tokenization and foundation model generalization challenges, creating a unified system that bridges audio understanding and generation with strong generalization capabilities.
Abstract: We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
cs.LG
[309] Large Language Models Predict Functional Outcomes after Acute Ischemic Stroke
Anjali K. Kapoor, Anton Alyakin, Jin Vivian Lee, Eunice Yang, Annelene M. Schulze, Krithik Vishwanath, Jinseok Lee, Yindalon Aphinyanaphongs, Howard Riina, Jennifer A. Frontera, Eric Karl Oermann
Main category: cs.LG
TL;DR: LLMs can predict stroke functional outcomes from admission notes with performance comparable to structured-data models, supporting text-based prognostic tools for clinical workflows.
Details
Motivation: Current stroke outcome prediction relies on structured variables requiring manual extraction. This study explores whether LLMs can directly infer functional outcomes from routine admission notes without structured data abstraction.
Method: Evaluated encoder (BERT, NYUTron) and generative (Llama-3.1-8B, MedGemma-4B) LLMs in frozen and fine-tuned settings for discharge and 90-day mRS prediction using 9,485 discharge notes and 1,898 90-day notes from a stroke registry with temporal split testing.
Result: Fine-tuned Llama achieved highest performance: 90-day exact mRS accuracy 33.9% and binary accuracy 76.3%; discharge exact accuracy 42.0% and binary accuracy 75.0%. LLMs performed comparably to structured-data baselines for 90-day prediction.
Conclusion: Fine-tuned LLMs can predict post-stroke functional outcomes from admission notes alone, achieving performance comparable to models requiring structured variable abstraction, supporting development of text-based prognostic tools.
Abstract: Accurate prediction of functional outcomes after acute ischemic stroke can inform clinical decision-making and resource allocation. Prior work on modified Rankin Scale (mRS) prediction has relied primarily on structured variables (e.g., age, NIHSS) and conventional machine learning. The ability of large language models (LLMs) to infer future mRS scores directly from routine admission notes remains largely unexplored. We evaluated encoder (BERT, NYUTron) and generative (Llama-3.1-8B, MedGemma-4B) LLMs, in both frozen and fine-tuned settings, for discharge and 90-day mRS prediction using a large, real-world stroke registry. The discharge outcome dataset included 9,485 History and Physical notes and the 90-day outcome dataset included 1,898 notes from the NYU Langone Get With The Guidelines-Stroke registry (2016-2025). Data were temporally split with the most recent 12 months held out for testing. Performance was assessed using exact (7-class) mRS accuracy and binary functional outcome (mRS 0-2 vs. 3-6) accuracy and compared against established structured-data baselines incorporating NIHSS and age. Fine-tuned Llama achieved the highest performance, with 90-day exact mRS accuracy of 33.9% [95% CI, 27.9-39.9%] and binary accuracy of 76.3% [95% CI, 70.7-81.9%]. Discharge performance reached 42.0% [95% CI, 39.0-45.0%] exact accuracy and 75.0% [95% CI, 72.4-77.6%] binary accuracy. For 90-day prediction, Llama performed comparably to structured-data baselines. Fine-tuned LLMs can predict post-stroke functional outcomes from admission notes alone, achieving performance comparable to models requiring structured variable abstraction. Our findings support the development of text-based prognostic tools that integrate seamlessly into clinical workflows without manual data extraction.
[310] Towards Autonomous Mathematics Research
Tony Feng, Trieu H. Trinh, Garrett Bingham, Dawsen Hwang, Yuri Chervonyi, Junehyuk Jung, Joonkyung Lee, Carlo Pagano, Sang-hyun Kim, Federico Pasqualotto, Sergei Gukov, Jonathan N. Lee, Junsu Kim, Kaiying Hou, Golnaz Ghiasi, Yi Tay, YaGuang Li, Chenkai Kuang, Yuan Liu, Hanzhao Lin, Evan Zheran Liu, Nigamaa Nayakanti, Xiaomeng Yang, Heng-tze Cheng, Demis Hassabis, Koray Kavukcuoglu, Quoc V. Le, Thang Luong
Main category: cs.LG
TL;DR: Aletheia is an AI math research agent that generates, verifies, and revises mathematical proofs end-to-end, demonstrating capabilities from Olympiad problems to PhD-level research and solving open mathematical questions.
Details
Motivation: While AI has achieved gold-medal performance in math competitions, there's a gap between competition-level problem-solving and professional mathematical research, which requires navigating literature and constructing long-horizon proofs.
Method: Aletheia uses an advanced version of Gemini Deep Think for reasoning, novel inference-time scaling laws, and intensive tool use to navigate mathematical research complexities through iterative generation, verification, and revision of solutions.
Result: Aletheia achieved several milestones: 1) AI-generated research paper (Feng26) without human intervention, 2) human-AI collaboration paper (LeeSeo26), and 3) semi-autonomous evaluation of 700 open problems with autonomous solutions to four open questions.
Conclusion: The paper demonstrates AI’s growing capability in mathematical research and suggests codifying standards for quantifying AI autonomy and novelty in mathematical discoveries, with reflections on human-AI collaboration.
Abstract: Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom’s Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest codifying standard levels for quantifying the autonomy and novelty of AI-assisted results. We conclude with reflections on human-AI collaboration in mathematics.
[311] Signature-Kernel Based Evaluation Metrics for Robust Probabilistic and Tail-Event Forecasting
Benjamin R. Redhead, Thomas L. Lee, Peng Gu, Víctor Elvira, Amos Storkey
Main category: cs.LG
TL;DR: Proposes two kernel-based metrics (Sig-MMD and CSig-MMD) for evaluating probabilistic forecasts that capture temporal dependencies and prioritize tail event prediction while maintaining proper scoring rule properties.
Details
Motivation: Current probabilistic forecasting evaluation frameworks lack consensus metrics, assume independence across time/variables, and lack sensitivity to tail events, which are critical for real-world decision-making in domains like finance, epidemiology, and climate science.
Method: Introduces two kernel-based metrics: signature maximum mean discrepancy (Sig-MMD) and censored Sig-MMD (CSig-MMD). These leverage signature kernels to capture complex inter-variate and inter-temporal dependencies and remain robust to missing data; CSig-MMD adds a censoring scheme to prioritize tail event prediction while maintaining properness (see the MMD sketch after the abstract).
Result: The proposed metrics enable more reliable evaluation of direct multi-step forecasting by capturing dependencies and tail events better than existing methods, facilitating development of more robust probabilistic algorithms.
Conclusion: Sig-MMD and CSig-MMD address critical flaws in current probabilistic forecasting evaluation frameworks, providing metrics that capture dependencies, handle missing data, and properly evaluate tail event prediction capabilities.
Abstract: Probabilistic forecasting is increasingly critical across high-stakes domains, from finance and epidemiology to climate science. However, current evaluation frameworks lack a consensus metric and suffer from two critical flaws: they often assume independence across time steps or variables, and they demonstrably lack sensitivity to tail events, the very occurrences that are most pivotal in real-world decision-making. To address these limitations, we propose two kernel-based metrics: the signature maximum mean discrepancy (Sig-MMD) and our novel censored Sig-MMD (CSig-MMD). By leveraging the signature kernel, these metrics capture complex inter-variate and inter-temporal dependencies and remain robust to missing data. Furthermore, CSig-MMD introduces a censoring scheme that prioritizes a forecaster’s capability to predict tail events while strictly maintaining properness, a vital property for a good scoring rule. These metrics enable a more reliable evaluation of direct multi-step forecasting, facilitating the development of more robust probabilistic algorithms.
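The backbone of both metrics is a squared MMD computed from path-kernel Gram matrices. A minimal sketch under the assumption of a generic `gram(A, B)` function (for example, a signature-kernel implementation); the simple biased V-statistic form is shown for brevity, and the toy RBF stand-in exists only to make the sketch runnable.

    import torch

    def mmd2(X, Y, gram):
        """X: (m, T, d) forecast sample paths; Y: (n, T, d) realized paths.
        gram(A, B) returns the kernel Gram matrix between two path batches."""
        return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

    # Toy stand-in kernel on flattened paths (not a signature kernel):
    rbf = lambda A, B: torch.exp(-torch.cdist(A.flatten(1), B.flatten(1)) ** 2)
    X, Y = torch.randn(32, 20, 3), torch.randn(40, 20, 3)
    print(mmd2(X, Y, rbf))

The censored variant would apply the same estimator after transforming paths so that only excursions beyond a tail threshold are distinguished.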
[312] Versor: A Geometric Sequence Architecture
Truong Minh Huy, Edward Hirst
Main category: cs.LG
TL;DR: Versor introduces a novel sequence architecture using Conformal Geometric Algebra (CGA) to achieve structural generalization, SE(3)-equivariance, and improved efficiency over Transformers on multimodal tasks.
Details
Motivation: Traditional neural architectures lack native geometric awareness and structural generalization capabilities. The authors aim to develop a sequence architecture that can inherently represent geometric relationships and generalize across scales without explicit structural encoding.
Method: Versor uses Conformal Geometric Algebra (CGA) in the Cl₄,₁ manifold, replacing traditional non-linear operations with geometric transformations (rotors). It evolves states via geometric transformations to achieve SE(3)-equivariance, uses a Recursive Rotor Accumulator for O(L) linear complexity, and implements custom Clifford kernels for efficiency.
Result: Versor outperforms Transformers, Graph Networks, and geometric baselines on chaotic N-body dynamics, topological reasoning, and multimodal benchmarks (CIFAR-10, WikiText-103). Achieves 200× fewer parameters, 99.3% MCC on topology (vs. 50.4% for ViT), zero-shot scale generalization, and up to 78× speedup with custom kernels.
Conclusion: Versor demonstrates that geometric algebra provides a powerful foundation for sequence architectures, offering structural generalization, interpretability, and efficiency advantages over traditional approaches, with promising applications in geometrically-aware scientific modeling.
Abstract: A novel sequence architecture design is introduced, Versor, which uses Conformal Geometric Algebra (CGA) in place of the traditional fundamental non-linear operations to achieve structural generalization and significant performance improvements on a variety of tasks, while offering improved interpretability and efficiency. By embedding states in the $Cl_{4,1}$ manifold and evolving them via geometric transformations (rotors), Versor natively represents $SE(3)$-equivariant relationships without requiring explicit structural encoding. Versor is validated on chaotic N-body dynamics, topological reasoning, and standard multimodal benchmarks (CIFAR-10, WikiText-103), consistently outperforming Transformers, Graph Networks, and geometric baselines (GATr, EGNN). Key results include: orders of magnitude fewer parameters ($200\times$ vs. Transformers); interpretable attention decomposing into proximity and orientational components; zero-shot scale generalization (99.3% MCC on topology vs. 50.4% for ViT); and $O(L)$ linear complexity via the novel Recursive Rotor Accumulator. In out-of-distribution tests, Versor maintains stable predictions while Transformers fail catastrophically. Custom Clifford kernels achieve up to $78\times$ speedup, providing a scalable foundation for geometrically-aware scientific modeling.
[313] Adaptive Optimization via Momentum on Variance-Normalized Gradients
Francisco Patitucci, Aryan Mokhtari
Main category: cs.LG
TL;DR: MVN-Grad is a new Adam-style optimizer that applies momentum after variance normalization, improving stability and performance by decoupling stale momentum from stochastic normalization.
Details
Motivation: The paper addresses limitations in existing Adam-style optimizers where stale momentum and stochastic normalization are coupled, leading to suboptimal stability and performance. The authors aim to create an optimizer with better stability, robustness to outliers, and improved convergence properties.
Method: MVN-Grad combines variance-based normalization with momentum applied after normalization. It scales each coordinate by an exponential moving average of gradient uncertainty and then applies momentum to the normalized gradients, eliminating the cross-time coupling between stale momentum and stochastic normalizer present in standard Adam updates (see the sketch after the abstract).
Result: Theoretical analysis shows MVN-Grad has strictly smaller one-step conditional update variance than momentum-then-normalize methods under standard noise assumptions, and is robust to outliers with uniformly bounded response to single gradient spikes. Empirically, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp on CIFAR-100 image classification and GPT-style language modeling benchmarks, delivering smoother training and improved generalization with no added overhead.
Conclusion: MVN-Grad provides a theoretically grounded improvement to Adam-style optimizers by decoupling momentum and normalization, offering better stability, robustness, and performance across vision and language tasks without computational overhead.
Abstract: We introduce MVN-Grad (Momentum on Variance-Normalized Gradients), an Adam-style optimizer that improves stability and performance by combining two complementary ideas: variance-based normalization and momentum applied after normalization. MVN-Grad scales each coordinate by an exponential moving average of gradient uncertainty and applies momentum to the resulting normalized gradients, eliminating the cross-time coupling between stale momentum and a stochastic normalizer present in standard Adam-type updates. We prove that this decoupling yields strictly smaller one-step conditional update variance than momentum-then-normalize variance methods under standard noise assumptions, and that MVN-Grad is robust to outliers: it has a uniformly bounded response to single gradient spikes. In low-variance regimes, we further show variance normalization avoids sign-type collapse associated with second-moment scaling and can yield accelerated convergence. Across CIFAR-100 image classification and GPT-style language modeling benchmarks, MVN-Grad matches or outperforms Adam, AdaBelief, and LaProp, delivering smoother training and improved generalization with no added overhead.
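A minimal sketch of the update ordering the paper advocates: normalize first, then apply momentum. The second-moment EMA below stands in for the paper's gradient-uncertainty estimate, so this illustrates the ordering rather than the exact MVN-Grad rule.

    import torch

    @torch.no_grad()
    def mvn_grad_step(p, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        g = p.grad
        v = state.setdefault("v", torch.zeros_like(p))  # EMA of squared gradient
        m = state.setdefault("m", torch.zeros_like(p))  # momentum, post-normalization
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        g_hat = g / (v.sqrt() + eps)                    # normalize the raw gradient
        m.mul_(beta1).add_(g_hat, alpha=1 - beta1)      # momentum on normalized grads
        p.add_(m, alpha=-lr)

    # Usage: keep one state dict per parameter and call after loss.backward().

Contrast with Adam, which accumulates momentum on raw gradients and divides the momentum buffer by the current normalizer; that division is what couples stale momentum to a stochastic denominator.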
[314] Frame-Level Internal Tool Use for Temporal Grounding in Audio LMs
Joesph An, Phillip Keung, Jiaqi Wang, Orevaoghene Ahia, Noah A. Smith
Main category: cs.LG
TL;DR: Audio language models struggle with temporal grounding tasks like word alignment and speaker diarization. The paper proposes frame-level internal tool use, training models to use their own audio representations for temporal grounding via binary frame classification and inhomogeneous Poisson process loss, achieving 50x speedup and robust length generalization.
Details
Motivation: Large audio language models are increasingly used for complex audio understanding tasks but struggle with temporal tasks requiring precise temporal grounding (word alignment, speaker diarization). Standard approaches generating timestamps as text tokens are computationally expensive and prone to hallucination, especially with audio lengths outside the training distribution.
Method: Proposes frame-level internal tool use: trains audio LMs to use their own internal audio representations for temporal grounding directly. Introduces a lightweight prediction mechanism trained via two objectives: a binary frame classifier and a novel inhomogeneous Poisson process (IHP) loss that models temporal event intensity (see the sketch after the abstract).
Result: Outperforms token-based baselines across word localization, speaker diarization, and event localization tasks. Achieves >50x inference speedup and demonstrates robust length generalization, maintaining high accuracy on out-of-distribution audio durations where standard token-based models collapse completely.
Conclusion: Frame-level internal tool use provides an efficient and robust alternative to token-based temporal grounding in audio language models, addressing key limitations in computational cost and length generalization while improving performance on temporal tasks.
Abstract: Large audio language models are increasingly used for complex audio understanding tasks, but they struggle with temporal tasks that require precise temporal grounding, such as word alignment and speaker diarization. The standard approach, where we generate timestamps as sequences of text tokens, is computationally expensive and prone to hallucination, especially when processing audio lengths outside the model’s training distribution. In this work, we propose frame-level internal tool use, a method that trains audio LMs to use their own internal audio representations to perform temporal grounding directly. We introduce a lightweight prediction mechanism trained via two objectives: a binary frame classifier and a novel inhomogeneous Poisson process (IHP) loss that models temporal event intensity. Across word localization, speaker diarization, and event localization tasks, our approach outperforms token-based baselines. Most notably, it achieves a >50x inference speedup and demonstrates robust length generalization, maintaining high accuracy on out-of-distribution audio durations where standard token-based models collapse completely.
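A minimal sketch of the two objectives, assuming `frames` are the audio LM's per-frame hidden states and `targets` are float 0/1 labels marking event frames; the head design and the IHP discretization are illustrative, not the paper's implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameHead(nn.Module):
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.proj = nn.Linear(hidden_dim, 1)

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            return self.proj(frames).squeeze(-1)  # (batch, n_frames) logits

    def frame_losses(logits, targets, dt=0.02):
        bce = F.binary_cross_entropy_with_logits(logits, targets)
        # Inhomogeneous-Poisson-style NLL with intensity lambda = softplus(logit):
        # integral of lambda over time minus log-intensity summed at event frames.
        lam = F.softplus(logits)
        ihp = (lam * dt).sum(-1).mean() - (torch.log(lam + 1e-8) * targets).sum(-1).mean()
        return bce, ihp

Because events are read off per-frame scores rather than decoded token by token, inference cost is one forward pass regardless of how many timestamps are produced.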
[315] Neural Network Quantum Field Theory from Transformer Architectures
Dmitry S. Ageev, Yulia A. Ageeva
Main category: cs.LG
TL;DR: Transformer attention heads can be used to construct Euclidean scalar quantum field theories, where n-point correlators emerge from averaging over random network parameters, with non-Gaussian statistics that persist at infinite width but become Gaussian in the large-head limit.
Details
Motivation: To establish a connection between neural network architectures (specifically transformer attention heads) and quantum field theory, showing how QFT structures can emerge from neural network parameter spaces and random features.
Method: Using the NN-QFT framework, they construct Euclidean scalar QFTs from transformer attention heads by averaging over random network parameters. They analyze single attention heads with shared random softmax weights, compute two-point functions in attention-weight representation, engineer Euclidean-invariant kernels via random-feature token embeddings, and analyze connected four-point functions.
Result: Non-Gaussian field statistics persist in the infinite-width limit for single attention heads, with finite “independence-breaking” contributions to connected four-point functions. However, summing many independent heads with standard normalization suppresses connected non-Gaussian correlators as 1/N_h, yielding Gaussian NN-QFT in the large-head limit.
Conclusion: Transformer attention heads provide a neural network construction of Euclidean scalar quantum field theories, with interesting statistical properties that interpolate between non-Gaussian behavior at finite width/heads and Gaussian behavior in the large-head limit, establishing connections between neural network architectures and quantum field theory.
Abstract: We propose a neural-network construction of Euclidean scalar quantum field theories from transformer attention heads, defining $n$-point correlators by averaging over random network parameters in the NN-QFT framework. For a single attention head, shared random softmax weights couple different width coordinates and induce non-Gaussian field statistics that persist in the infinite-width limit $d_k\to\infty$. We compute the two-point function in an attention-weight representation and show how Euclidean-invariant kernels can be engineered via random-feature token embeddings. We then analyze the connected four-point function and identify an “independence-breaking” contribution, expressible as a covariance over query-key weights, which remains finite at infinite width. Finally, we show that summing many independent heads with standard $1/N_h$ normalization suppresses connected non-Gaussian correlators as $1/N_h$, yielding a Gaussian NN-QFT in the large-head limit.
[316] How Much Reasoning Do Retrieval-Augmented Models Add beyond LLMs? A Benchmarking Framework for Multi-Hop Inference over Hybrid Knowledge
Junhong Lin, Bing Zhang, Song Wang, Ziyan Liu, Dan Gutfreund, Julian Shun, Yada Zhu
Main category: cs.LG
TL;DR: HybridRAG-Bench is a framework for evaluating retrieval-augmented LLMs on multi-hop reasoning over hybrid knowledge (text + knowledge graphs), using recent scientific literature to avoid data contamination.
Details
Motivation: Existing benchmarks overlap with LLM pretraining data, making it hard to distinguish genuine retrieval/reasoning from parametric recall. Need contamination-aware evaluation for hybrid knowledge-augmented systems.
Method: Automatically couples unstructured text and structured KG representations from recent arXiv papers, generates knowledge-intensive QA pairs grounded in explicit reasoning paths. Supports flexible domain/time selection.
Result: Demonstrated across AI, governance/policy, and bioinformatics domains. Benchmarks reward genuine retrieval/reasoning rather than parametric recall. Framework enables contamination-aware evaluation.
Conclusion: HybridRAG-Bench provides principled testbed for evaluating hybrid knowledge-augmented reasoning systems, addressing data contamination issues in existing benchmarks.
Abstract: Large language models (LLMs) continue to struggle with knowledge-intensive questions that require up-to-date information and multi-hop reasoning. Augmenting LLMs with hybrid external knowledge, such as unstructured text and structured knowledge graphs, offers a promising alternative to costly continual pretraining. As such, reliable evaluation of their retrieval and reasoning capabilities becomes critical. However, many existing benchmarks increasingly overlap with LLM pretraining data, which means answers or supporting knowledge may already be encoded in model parameters, making it difficult to distinguish genuine retrieval and reasoning from parametric recall. We introduce HybridRAG-Bench, a framework for constructing benchmarks to evaluate retrieval-intensive, multi-hop reasoning over hybrid knowledge. HybridRAG-Bench automatically couples unstructured text and structured knowledge graph representations derived from recent scientific literature on arXiv, and generates knowledge-intensive question-answer pairs grounded in explicit reasoning paths. The framework supports flexible domain and time-frame selection, enabling contamination-aware and customizable evaluation as models and knowledge evolve. Experiments across three domains (artificial intelligence, governance and policy, and bioinformatics) demonstrate that HybridRAG-Bench rewards genuine retrieval and reasoning rather than parametric recall, offering a principled testbed for evaluating hybrid knowledge-augmented reasoning systems. We release our code and data at github.com/junhongmit/HybridRAG-Bench.
[317] Rank-Accuracy Trade-off for LoRA: A Gradient-Flow Analysis
Michael Rushka, Diego Klabjan
Main category: cs.LG
TL;DR: Theoretical analysis of LoRA’s accuracy dependence on update rank from a dynamical systems perspective, establishing explicit relationships between rank and accuracy for fine-tuning tasks.
Details
Motivation: While empirical studies show LoRA achieves comparable accuracy to full-parameter methods even with low-rank updates, the theoretical understanding of how LoRA's accuracy depends on update rank remains limited. The paper aims to provide rigorous theoretical foundations for this relationship.
Method: Uses gradient flow analysis in both full-rank and low-rank regimes to establish explicit relationships between rank and accuracy (the setup is recalled after the abstract). Derives gradient flow equations for LoRA, shows they are identical for simultaneous and sequential parameter updates, and obtains closed-form relationships for trace-squared and Frobenius-norm low-rank approximation loss functions.
Result: Rigorous derivation of gradient flow equations for LoRA and demonstration of their equivalence for different update strategies. Closed-form relationships established between LoRA rank and accuracy for specific loss functions, providing theoretical understanding of rank-accuracy tradeoffs.
Conclusion: The paper provides theoretical foundations for understanding LoRA’s performance, showing how accuracy depends on update rank through dynamical systems analysis. This contributes to better understanding of low-rank adaptation methods in deep learning.
Abstract: Previous empirical studies have shown that LoRA achieves accuracy comparable to full-parameter methods on downstream fine-tuning tasks, even for rank-1 updates. By contrast, the theoretical underpinnings of the dependence of LoRA’s accuracy on update rank remain relatively unexplored. In this work, we compare the accuracy of rank-r LoRA updates against full-parameter updates for fine-tuning tasks from a dynamical systems perspective. We perform gradient flow analysis in both full-rank and low-rank regimes to establish explicit relationships between rank and accuracy for two loss functions under LoRA. While gradient flow equations for LoRA are presented in prior work, we rigorously derive their form and show that they are identical for simultaneous and sequential LoRA parameter updates. We then use the resulting dynamical system equations to obtain closed-form relationships between LoRA rank and accuracy for trace-squared and Frobenius-norm low-rank approximation loss functions.
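For orientation, the objects involved are the usual LoRA parameterization and its gradient flow; with loss $L$ evaluated at the adapted weights, the chain rule gives the generic system below (shown for context, without the paper's specific scalings or loss choices):

    $W = W_0 + BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$,
    $\dot{B} = -\nabla_W L(W)\, A^{\top}$, \qquad $\dot{A} = -B^{\top}\, \nabla_W L(W)$.

The paper's analysis compares trajectories of this coupled system against the full-parameter flow $\dot{W} = -\nabla_W L(W)$ as a function of the rank $r$.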
[318] ELROND: Exploring and decomposing intrinsic capabilities of diffusion models
Paweł Skierś, Tomasz Trzciński, Kamil Deja
Main category: cs.LG
TL;DR: A framework to disentangle semantic directions in diffusion model embeddings for precise control over visual variations from text prompts.
Details
Motivation: Current diffusion models produce random visual variations from text prompts without user control over specific semantic directions. Existing unsupervised methods analyze output features but ignore the underlying generative process.
Method: Collect gradients from backpropagating differences between stochastic realizations of a fixed prompt, then decompose them into meaningful steering directions using Principal Components Analysis or a Sparse Autoencoder (see the sketch after the abstract).
Result: The approach isolates interpretable, steerable directions for fine-grained control, mitigates mode collapse in distilled models by reintroducing diversity, and establishes a novel estimator for concept complexity based on discovered subspace dimensionality.
Conclusion: The framework enables precise semantic control in diffusion models, addresses mode collapse issues, and provides a way to measure concept complexity through embedding space analysis.
Abstract: A single text prompt passed to a diffusion model often yields a wide range of visual outputs determined solely by stochastic process, leaving users with no direct control over which specific semantic variations appear in the image. While existing unsupervised methods attempt to analyze these variations via output features, they omit the underlying generative process. In this work, we propose a framework to disentangle these semantic directions directly within the input embedding space. To that end, we collect a set of gradients obtained by backpropagating the differences between stochastic realizations of a fixed prompt that we later decompose into meaningful steering directions with either Principal Components Analysis or Sparse Autoencoder. Our approach yields three key contributions: (1) it isolates interpretable, steerable directions for precise, fine-grained control over a single concept; (2) it effectively mitigates mode collapse in distilled models by reintroducing lost diversity; and (3) it establishes a novel estimator for concept complexity under a specific model, based on the dimensionality of the discovered subspace.
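The decomposition step lends itself to a short sketch. Assuming the gradients described above have already been collected (the array below is a random stand-in), extracting steering directions with PCA amounts to a centered SVD:

    import numpy as np

    # Hypothetical stand-in: each row is a gradient from backpropagating the
    # difference between two stochastic realizations of the same fixed prompt.
    rng = np.random.default_rng(0)
    grads = rng.standard_normal((256, 768))        # (n_pairs, embed_dim)

    grads = grads - grads.mean(axis=0)             # center before PCA
    _, _, Vt = np.linalg.svd(grads, full_matrices=False)
    directions = Vt[:8]                            # top principal steering directions

    prompt_embed = rng.standard_normal(768)        # hypothetical input embedding
    steered = prompt_embed + 2.0 * directions[0]   # move along one semantic axis

The Sparse Autoencoder variant would replace the SVD with a learned sparse dictionary over the same gradient collection.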
[319] Binary Flow Matching: Prediction-Loss Space Alignment for Robust Learning
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
Main category: cs.LG
TL;DR: Flow matching for binary data works best with signal-space prediction (x-prediction) and x-loss alignment, avoiding singular weighting issues from velocity-based objectives.
Details
Motivation: To extend flow matching's success with signal-space prediction to binary manifolds (discrete data), addressing structural mismatches when combining x-prediction with velocity-based objectives that cause training instability.
Method: Theoretical analysis of prediction-loss alignment, proving that aligning objectives to signal space (x-loss) eliminates singular weighting, plus examination of design choices for binary data including probabilistic vs geometric losses.
Result: X-loss alignment yields uniformly bounded gradients and enables robust training without heuristic schedules; topology-dependent distinction found between cross-entropy and MSE losses for binary data.
Conclusion: Signal-space alignment is key principle for robust flow matching on binary/discrete domains, providing theoretical foundations and practical guidelines.
Abstract: Flow matching has emerged as a powerful framework for generative modeling, with recent empirical successes highlighting the effectiveness of signal-space prediction ($x$-prediction). In this work, we investigate the transfer of this paradigm to binary manifolds, a fundamental setting for generative modeling of discrete data. While $x$-prediction remains effective, we identify a latent structural mismatch that arises when it is coupled with velocity-based objectives ($v$-loss), leading to a time-dependent singular weighting that amplifies gradient sensitivity to approximation errors. Motivated by this observation, we formalize prediction-loss alignment as a necessary condition for flow matching training. We prove that re-aligning the objective to the signal space ($x$-loss) eliminates the singular weighting, yielding uniformly bounded gradients and enabling robust training under uniform timestep sampling without reliance on heuristic schedules. Finally, with alignment secured, we examine design choices specific to binary data, revealing a topology-dependent distinction between probabilistic objectives (e.g., cross-entropy) and geometric losses (e.g., mean squared error). Together, these results provide theoretical foundations and practical guidelines for robust flow matching on binary – and related discrete – domains, positioning signal-space alignment as a key principle for robust diffusion learning.
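The singular weighting is easy to see on a generic linear interpolation path (an illustration of the alignment argument, not the paper's binary-manifold construction): with x_t = (1 − t)·x_0 + t·x_1, the velocity implied by an x-prediction is (x̂ − x_t)/(1 − t), so a v-space loss carries a 1/(1 − t)² factor that diverges as t → 1, while the x-loss stays uniformly weighted:

    import torch

    def x_and_v_losses(x1, x_hat, t):
        # x-loss: uniformly weighted across timesteps
        x_loss = ((x_hat - x1) ** 2).mean()
        # v-loss induced by the same x-prediction: singular weight as t -> 1
        v_loss = x_loss / (1.0 - t) ** 2
        return x_loss, v_loss

    x1 = torch.randint(0, 2, (16,)).float()        # binary targets
    x_hat = torch.rand(16)
    for t in (0.1, 0.5, 0.99):
        print(t, [round(v.item(), 3) for v in x_and_v_losses(x1, x_hat, t)])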
[320] Temper-Then-Tilt: Principled Unlearning for Generative Models through Tempering and Classifier Guidance
Jacob L. Block, Mehryar Mohri, Aryan Mokhtari, Sanjay Shakkottai
Main category: cs.LG
TL;DR: T3-Unlearning: A two-step inference method for machine unlearning in generative models that uses tempering to flatten concentrated distributions followed by classifier-guided tilting, achieving better forget quality with minimal parameter updates.
Details
Motivation: Standard classifier guidance for machine unlearning in generative models can fail when the forget set represents sharp, concentrated data distributions, requiring a more robust approach that works with finite samples.
Method: Temper-Then-Tilt Unlearning (T3-Unlearning) freezes the base model and applies a two-step inference procedure: (1) tempering the base distribution to flatten high-confidence spikes, (2) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples.
Result: Empirical evaluations on the TOFU benchmark show T3-Unlearning improves forget quality and generative utility over existing baselines while training only a fraction of parameters with minimal runtime.
Conclusion: T3-Unlearning provides an effective approach for machine unlearning in generative models, especially for concentrated distributions, with theoretical guarantees linking classifier risk to unlearning error.
Abstract: We study machine unlearning in large generative models by framing the task as density ratio estimation to a target distribution rather than supervised fine-tuning. While classifier guidance is a standard approach for approximating this ratio and can succeed in general, we show it can fail to faithfully unlearn with finite samples when the forget set represents a sharp, concentrated data distribution. To address this, we introduce Temper-Then-Tilt Unlearning (T3-Unlearning), which freezes the base model and applies a two-step inference procedure: (i) tempering the base distribution to flatten high-confidence spikes, and (ii) tilting the tempered distribution using a lightweight classifier trained to distinguish retain from forget samples. Our theoretical analysis provides finite-sample guarantees linking the surrogate classifier’s risk to unlearning error, proving that tempering is necessary to successfully unlearn for concentrated distributions. Empirical evaluations on the TOFU benchmark show that T3-Unlearning improves forget quality and generative utility over existing baselines, while training only a fraction of the parameters with a minimal runtime.
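A toy discrete analogue of the two-step inference procedure (the distributions, classifier values, and the particular odds-ratio tilt below are all illustrative; the paper operates on generative model distributions with a trained classifier):

    import numpy as np

    def temper_then_tilt(p_base, c_forget, temperature=2.0):
        tempered = p_base ** (1.0 / temperature)   # (i) flatten high-confidence spikes
        tempered = tempered / tempered.sum()
        odds = (1.0 - c_forget) / np.clip(c_forget, 1e-8, None)
        out = tempered * odds                      # (ii) tilt by retain-vs-forget odds
        return out / out.sum()

    p = np.array([0.90, 0.05, 0.03, 0.02])   # base model: sharp spike on a forget mode
    c = np.array([0.95, 0.10, 0.10, 0.10])   # classifier P(forget | x)
    print(temper_then_tilt(p, c))            # mass moves off the spiked forget mode

Tempering first limits how much the concentrated spike can dominate when the classifier's odds estimate is imperfect, which is the regime the paper's theory isolates.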
[321] Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models
Shiting Huang, Zecheng Li, Yu Zeng, Qingnan Ren, Zhen Fang, Qisheng Su, Kou Shi, Lin Chen, Zehui Chen, Feng Zhao
Main category: cs.LG
TL;DR: MEL enhances RLVR by adding meta-experience learning through contrastive analysis of correct/incorrect reasoning trajectories to identify error bifurcation points and internalize reusable knowledge.
Details
Motivation: RLVR improves LLM reasoning but lacks mechanisms for error attribution and experience internalization beyond practice and verification, limiting fine-grained credit assignment and reusable knowledge formation.
Method: MEL builds on RLVR by using the LLM’s self-verification to conduct contrastive analysis on paired correct/incorrect trajectories, identify reasoning error bifurcation points, summarize them into meta-experience, and internalize it via negative log-likelihood minimization.
Result: MEL achieves consistent improvements on benchmarks with 3.92%–4.73% Pass@1 gains across varying model sizes.
Conclusion: Meta-experience learning effectively addresses RLVR’s limitations by enabling error attribution and reusable knowledge formation, leading to improved reasoning performance.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: it lacks mechanisms for error attribution and experience internalization intrinsic to the human learning cycle beyond practice and verification, thereby limiting fine-grained credit assignment and reusable knowledge formation. We term such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model’s parametric memory. Building upon standard RLVR, we introduce an additional design that leverages the LLM’s self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is further internalized into the LLM’s parametric memory by minimizing the negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements on benchmarks, yielding 3.92%–4.73% Pass@1 gains across varying model sizes.
[322] Self-Evolving Recommendation System: End-To-End Autonomous Model Optimization With LLM Agents
Haochen Wang, Yi Wu, Daryl Chang, Li Wei, Lukasz Heldt
Main category: cs.LG
TL;DR: LLM-driven self-evolving system automates ML optimization for YouTube recommendation models, using Gemini LLMs as specialized ML engineers to autonomously generate, train, and deploy model improvements.
Details
Motivation: Optimizing large-scale ML systems like video recommendation models requires navigating massive hyperparameter spaces and designing sophisticated optimizers, architectures, and reward functions. Traditional manual iteration is slow and limits innovation.
Method: Two-agent system: Offline Agent (Inner Loop) generates hypotheses using proxy metrics; Online Agent (Outer Loop) validates candidates against delayed business metrics in production. Uses Google’s Gemini LLMs as specialized ML engineers with deep reasoning capabilities.
Result: Successfully deployed at YouTube with several production launches, demonstrating autonomous LLM-driven evolution surpasses traditional engineering workflows in both development velocity and model performance.
Conclusion: LLM-driven self-evolving systems can autonomously discover novel ML improvements and accelerate development cycles for large-scale recommendation systems, representing a paradigm shift in ML engineering workflows.
Abstract: Optimizing large-scale machine learning systems, such as recommendation models for global video platforms, requires navigating a massive hyperparameter search space and, more critically, designing sophisticated optimizers, architectures, and reward functions to capture nuanced user behaviors. Achieving substantial improvements in these areas is a non-trivial task, traditionally relying on extensive manual iterations to test new hypotheses. We propose a self-evolving system that leverages Large Language Models (LLMs), specifically those from Google’s Gemini family, to autonomously generate, train, and deploy high-performing, complex model changes within an end-to-end automated workflow. The self-evolving system is comprised of an Offline Agent (Inner Loop) that performs high-throughput hypothesis generation using proxy metrics, and an Online Agent (Outer Loop) that validates candidates against delayed north star business metrics in live production. Our agents act as specialized Machine Learning Engineers (MLEs): they exhibit deep reasoning capabilities, discovering novel improvements in optimization algorithms and model architecture, and formulating innovative reward functions that target long-term user engagement. The effectiveness of this approach is demonstrated through several successful production launches at YouTube, confirming that autonomous, LLM-driven evolution can surpass traditional engineering workflows in both development velocity and model performance.
[323] PRISM: Differentially Private Synthetic Data with Structure-Aware Budget Allocation for Prediction
Amir Asiaee, Chao Yan, Zachary B. Abrams, Bradley A. Malin
Main category: cs.LG
TL;DR: PRISM: A differentially private synthetic data generation method that optimizes for specific prediction tasks by selectively protecting features relevant to the target variable Y, operating in three regimes based on available structural knowledge.
Details
Motivation: Existing DP synthetic data methods treat all features symmetrically, spreading noise uniformly even when the data will serve a specific prediction task. This leads to unnecessary noise accumulation and reduced utility for prediction tasks.
Method: PRISM operates in three regimes: (1) causal regime (target causal parents of Y), (2) graphical regime (target Markov blanket of Y), (3) predictive regime (select features via DP methods). It identifies predictive feature subsets, constructs targeted summary statistics, allocates budget to minimize prediction error, and synthesizes data via graphical-model inference.
Result: Task-aware allocation improves prediction accuracy compared to generic synthesizers. Under distribution shift, targeting causal parents achieves AUC ≈ 0.73 while correlation-based selection collapses to chance (≈ 0.49).
Conclusion: PRISM provides a prediction-centric approach to DP synthetic data generation that optimizes privacy budget allocation for specific prediction tasks, significantly improving utility while maintaining privacy guarantees.
Abstract: Differential privacy (DP) provides a mathematical guarantee limiting what an adversary can learn about any individual from released data. However, achieving this protection typically requires adding noise, and noise can accumulate when many statistics are measured. Existing DP synthetic data methods treat all features symmetrically, spreading noise uniformly even when the data will serve a specific prediction task. We develop a prediction-centric approach operating in three regimes depending on available structural knowledge. In the causal regime, when the causal parents of $Y$ are known and distribution shift is expected, we target the parents for robustness. In the graphical regime, when a Bayesian network structure is available and the distribution is stable, the Markov blanket of $Y$ provides a sufficient feature set for optimal prediction. In the predictive regime, when no structural knowledge exists, we select features via differentially private methods without claiming to recover causal or graphical structure. We formalize this as PRISM, a mechanism that (i) identifies a predictive feature subset according to the appropriate regime, (ii) constructs targeted summary statistics, (iii) allocates budget to minimize an upper bound on prediction error, and (iv) synthesizes data via graphical-model inference. We prove end-to-end privacy guarantees and risk bounds. Empirically, task-aware allocation improves prediction accuracy compared to generic synthesizers. Under distribution shift, targeting causal parents achieves AUC $\approx 0.73$ while correlation-based selection collapses to chance ($\approx 0.49$).
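The budget-allocation idea can be sketched with the classic Gaussian mechanism (illustration only; per-statistic sensitivities, the composition accounting, and the error-bound optimization in step (iii) are glossed over, and the feature names are invented):

    import numpy as np

    def gaussian_mechanism(stat, sensitivity, eps, delta, rng):
        # standard Gaussian mechanism noise scale for (eps, delta)-DP
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        return stat + rng.normal(0.0, sigma)

    rng = np.random.default_rng(0)
    stats = {"mean_parent1": 0.31, "mean_parent2": 0.72, "mean_other": 0.55}
    # task-aware split: most of the budget goes to statistics about Y's parents
    shares = {"mean_parent1": 0.45, "mean_parent2": 0.45, "mean_other": 0.10}
    total_eps, delta = 1.0, 1e-5
    noisy = {k: gaussian_mechanism(v, 1.0, total_eps * shares[k], delta, rng)
             for k, v in stats.items()}
    print(noisy)   # statistics relevant to Y receive less noise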
[324] Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards
Kirill Pavlenko, Alexander Golubev, Simon Karasik, Boris Yangel
Main category: cs.LG
TL;DR: Blockwise Advantage Estimation improves GRPO by assigning separate advantages to different text blocks in structured generations, reducing reward interference without requiring nested rollouts.
Details
Motivation: GRPO assigns a single scalar advantage to all tokens in a completion, which causes reward interference in structured generations with multiple segments and objectives. This coupling of unrelated reward signals leads to misattributed credit and objective interference.
Method: Proposes Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to tokens in corresponding text blocks. Introduces Outcome-Conditioned Baseline that approximates intermediate state values using within-group statistics by stratifying samples according to prefix-derived intermediate outcomes, avoiding expensive nested rollouts.
Result: On math tasks with uncertainty estimation, the method mitigates reward interference, is competitive with state-of-the-art reward-designed approaches, and preserves test-time gains from confidence-weighted ensembling.
Conclusion: Provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives.
Abstract: Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.
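A minimal numpy sketch of the two baselines for one GRPO group of four completions, each with an answer block and a later confidence block (the rewards and stratification variable are made up):

    import numpy as np

    def blockwise_advantages(r_ans, r_conf, outcome):
        # standard within-group baseline for the first block
        adv_ans = r_ans - r_ans.mean()
        # outcome-conditioned baseline: stratify the later block's rewards by a
        # prefix-derived intermediate outcome instead of nested rollouts
        adv_conf = np.empty_like(r_conf)
        for o in np.unique(outcome):
            m = outcome == o
            adv_conf[m] = r_conf[m] - r_conf[m].mean()
        return adv_ans, adv_conf

    r_ans = np.array([1.0, 1.0, 0.0, 0.0])     # answer-correctness rewards
    r_conf = np.array([0.9, 0.7, 0.2, 0.4])    # calibration rewards
    outcome = np.array([1, 1, 0, 0])           # prefix outcome: answer correct?
    print(blockwise_advantages(r_ans, r_conf, outcome))

Each advantage would then be applied only to the tokens of its own block, so the confidence reward never leaks credit into the answer tokens.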
[325] Risk-Equalized Differentially Private Synthetic Data: Protecting Outliers by Controlling Record-Level Influence
Amir Asiaee, Chao Yan, Zachary B. Abrams, Bradley A. Malin
Main category: cs.LG
TL;DR: Risk-equalized DP synthesis framework that protects high-risk outliers in synthetic data by weighting records inversely to their outlierness scores, providing tighter per-instance privacy bounds for vulnerable records.
Details
Motivation: Current differential privacy provides worst-case guarantees but empirical attacks (especially membership inference) succeed more often against outliers/rare individuals. There's a need to prioritize protection for high-risk records that stand out from the crowd.
Method: Two-stage framework: 1) Small privacy budget estimates each record’s “outlierness” score, 2) DP learning procedure weights each record inversely to its risk score. Uses Gaussian mechanisms where privacy loss is proportional to influence on output.
Result: Risk-weighting substantially reduces membership inference success against high-outlierness records in simulated data with controlled outlier injection. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, showing interplay between scorer quality and synthesis pipeline.
Conclusion: Risk-equalized DP synthesis provides targeted protection for vulnerable outliers by reducing their influence on learned generators, offering tighter per-instance privacy bounds for records that need them most.
Abstract: When synthetic data is released, some individuals are harder to protect than others. A patient with a rare disease combination or a transaction with unusual characteristics stands out from the crowd. Differential privacy provides worst-case guarantees, but empirical attacks – particularly membership inference – succeed far more often against such outliers, especially under moderate privacy budgets and with auxiliary information. This paper introduces risk-equalized DP synthesis, a framework that prioritizes protection for high-risk records by reducing their influence on the learned generator. The mechanism operates in two stages: first, a small privacy budget estimates each record’s “outlierness”; second, a DP learning procedure weights each record inversely to its risk score. Under Gaussian mechanisms, a record’s privacy loss is proportional to its influence on the output – so deliberately shrinking outliers’ contributions yields tighter per-instance privacy bounds for precisely those records that need them most. We prove end-to-end DP guarantees via composition and derive closed-form per-record bounds for the synthesis stage (the scoring stage adds a uniform per-record term). Experiments on simulated data with controlled outlier injection show that risk-weighting substantially reduces membership inference success against high-outlierness records; ablations confirm that targeting – not random downweighting – drives the improvement. On real-world benchmarks (Breast Cancer, Adult, German Credit), gains are dataset-dependent, highlighting the interplay between scorer quality and synthesis pipeline.
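The influence-shrinking mechanism admits a compact sketch in the DP-SGD style (the weight function, clipping, and noise levels below are illustrative; the paper's scoring stage and composition accounting are omitted):

    import numpy as np

    def risk_weights(outlierness, floor=0.1):
        # down-weight high-risk records: weight shrinks as outlierness grows
        return np.maximum(floor, 1.0 / (1.0 + outlierness))

    def private_weighted_sum(grads, weights, clip=1.0, sigma=1.0, rng=None):
        rng = rng or np.random.default_rng(0)
        norms = np.maximum(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12)
        clipped = grads * np.minimum(1.0, clip / norms)   # per-record clipping
        total = (weights[:, None] * clipped).sum(axis=0)  # outliers contribute less
        return total + rng.normal(0.0, sigma * clip, size=total.shape)

    rng = np.random.default_rng(0)
    grads = rng.standard_normal((100, 8))
    outlierness = rng.exponential(1.0, size=100)   # stand-in for stage-1 DP scores
    print(private_weighted_sum(grads, risk_weights(outlierness), rng=rng))

Because a record's per-instance privacy loss scales with its (weighted, clipped) influence on the released sum, shrinking the weight of a high-outlierness record directly tightens its individual bound.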
[326] Modeling Programming Skills with Source Code Embeddings for Context-aware Exercise Recommendation
Carlos Eduardo P. Silva, João Pedro M. Sena, Julio C. S. Reis, André G. Santos, Lucas N. Ferreira
Main category: cs.LG
TL;DR: A context-aware recommender system that uses source code embeddings to model students’ programming skills and recommends personalized homework problems based on skill alignment.
Details
Motivation: To create personalized learning recommendations for programming education by accurately modeling students' skills from their submitted source code and matching them to appropriate exercises.
Method: Uses source code embeddings to predict students’ skills across programming topics, creates student profiles and problem skill vectors, then computes cosine similarity to rank exercises based on alignment with each student’s current abilities.
Result: Jina embeddings outperformed TF-IDF, CodeBERT-cpp, and GraphCodeBERT for skill prediction. The system produced more suitable recommendations than baselines based on correctness or solution time across seven course offerings.
Conclusion: Predicted programming skills from source code embeddings provide a stronger signal for exercise recommendation than traditional metrics, enabling more effective personalized learning in programming education.
Abstract: In this paper, we propose a context-aware recommender system that models students’ programming skills using embeddings of the source code they submit throughout a course. These embeddings predict students’ skills across multiple programming topics, producing profiles that are matched to the skills required by unseen homework problems. To generate recommendations, we compute the cosine similarity between student profiles and problem skill vectors, ranking exercises according to their alignment with each student’s current abilities. We evaluated our approach using real data from students and exercises in an introductory programming course at our university. First, we assessed the effectiveness of our source code embeddings for predicting skills, comparing them with token-based and graph-based alternatives. Results showed that Jina embeddings outperformed TF-IDF, CodeBERT-cpp, and GraphCodeBERT across most skills. Additionally, we evaluated the system’s ability to recommend exercises aligned with weekly course content by analyzing student submissions collected over seven course offerings. Our approach consistently produced more suitable recommendations than baselines based on correctness or solution time, indicating that predicted programming skills provide a stronger signal for problem recommendation.
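The matching step reduces to a cosine ranking between two vectors in the same skill space; a small sketch with invented skill dimensions:

    import numpy as np

    def recommend(student_profile, problem_skills, k=2):
        # rank problems by cosine similarity between the student's predicted
        # skill vector and each problem's required-skill vector
        s = student_profile / np.linalg.norm(student_profile)
        P = problem_skills / np.linalg.norm(problem_skills, axis=1, keepdims=True)
        return np.argsort(-(P @ s))[:k]

    student = np.array([0.8, 0.2, 0.5])            # e.g., loops, recursion, arrays
    problems = np.array([[0.9, 0.1, 0.4],
                         [0.1, 0.9, 0.2],
                         [0.6, 0.3, 0.7]])
    print(recommend(student, problems))            # closest-aligned exercises first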
[327] Kernel-Based Learning of Chest X-ray Images for Predicting ICU Escalation among COVID-19 Patients
Qiyuan Shi, Jian Kang, Yi Li
Main category: cs.LG
TL;DR: GLIMARK extends multiple kernel learning to generalized linear models for diverse data types beyond continuous outcomes, applied to COVID-19 chest X-ray analysis.
Details
Motivation: Traditional multiple kernel learning (MKL) methods are limited to continuous outcomes, but real-world data often involves diverse data types from the exponential family. There's a need to extend MKL to handle broader outcome variables for more comprehensive multimodal data analysis.
Method: Proposes GLIMARK (Generalized Linear Models with Integrated Multiple Additive Regression with Kernels), which extends MKL to accommodate outcome variables belonging to the exponential family. The method constructs composite kernels from simpler ones and integrates information from heterogeneous sources while handling diverse data types through a generalized linear model framework.
Result: GLIMARK effectively recovers or approximates the true data-generating mechanism. Applied to COVID-19 chest X-ray dataset, it successfully predicts binary ICU escalation outcomes and extracts clinically meaningful features, demonstrating practical utility in real-world medical scenarios.
Conclusion: GLIMARK provides a flexible framework for multiple kernel learning that handles diverse data types beyond continuous outcomes, with demonstrated effectiveness in medical imaging applications for both prediction and feature extraction.
Abstract: Kernel methods have been extensively utilized in machine learning for classification and prediction tasks due to their ability to capture complex non-linear data patterns. However, single kernel approaches are inherently limited, as they rely on a single type of kernel function (e.g., Gaussian kernel), which may be insufficient to fully represent the heterogeneity or multifaceted nature of real-world data. Multiple kernel learning (MKL) addresses these limitations by constructing composite kernels from simpler ones and integrating information from heterogeneous sources. Despite these advances, traditional MKL methods are primarily designed for continuous outcomes. We extend MKL to accommodate the outcome variable belonging to the exponential family, representing a broader variety of data types, and refer to our proposed method as generalized linear models with integrated multiple additive regression with kernels (GLIMARK). Empirically, we demonstrate that GLIMARK can effectively recover or approximate the true data-generating mechanism. We have applied it to a COVID-19 chest X-ray dataset, predicting binary outcomes of ICU escalation and extracting clinically meaningful features, underscoring the practical utility of this approach in real-world scenarios.
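To make the extension concrete: a composite kernel is a weighted sum of base kernels, and a GLM link replaces the identity link used for continuous outcomes. A self-contained sketch of kernel logistic regression (Bernoulli GLM) on a composite kernel, with fixed rather than learned kernel weights:

    import numpy as np

    def rbf(X, Z, gamma=1.0):
        d2 = ((X[:, None] - Z[None]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def composite_kernel(X, Z, w=(0.7, 0.3)):
        # weighted combination of base kernels, as in multiple kernel learning
        return w[0] * rbf(X, Z) + w[1] * (X @ Z.T + 1.0) ** 2

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)      # binary outcome (Bernoulli GLM)

    K = composite_kernel(X, X)
    alpha = np.zeros(len(X))
    for _ in range(500):                           # gradient descent on logistic loss
        p = 1.0 / (1.0 + np.exp(-K @ alpha))
        alpha -= 0.01 * (K @ (p - y) / len(X) + 1e-3 * alpha)

    print("train accuracy:", ((K @ alpha > 0) == (y > 0.5)).mean())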
[328] Iterative Importance Fine-tuning of Diffusion Models
Alexander Denker, Shreyas Padhy, Francisco Vargas, Johannes Hertrich
Main category: cs.LG
TL;DR: Self-supervised algorithm for fine-tuning diffusion models using Doob’s h-transform for efficient conditional sampling
Details
Motivation: Diffusion models serve as effective priors but face challenges in efficiently sampling from posterior distributions for downstream tasks; there is a need for amortized conditional sampling methods.
Method: Introduces a self-supervised algorithm that learns the optimal control via Doob’s h-transform and iteratively refines the control using a synthetic dataset resampled with path-based importance weights.
Result: Demonstrates effectiveness on class-conditional sampling, inverse problems, and reward fine-tuning for text-to-image diffusion models
Conclusion: Proposed framework enables efficient amortized conditional sampling for diffusion models across various applications
Abstract: Diffusion models are an important tool for generative modelling, serving as effective priors in applications such as imaging and protein design. A key challenge in applying diffusion models for downstream tasks is efficiently sampling from resulting posterior distributions, which can be addressed using Doob’s $h$-transform. This work introduces a self-supervised algorithm for fine-tuning diffusion models by learning the optimal control, enabling amortised conditional sampling. Our method iteratively refines the control using a synthetic dataset resampled with path-based importance weights. We demonstrate the effectiveness of this framework on class-conditional sampling, inverse problems and reward fine-tuning for text-to-image diffusion models.
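The resampling step at the heart of the iteration is ordinary self-normalized importance resampling (the log-weights below are a stand-in for the paper's path-based weights):

    import numpy as np

    def resample(samples, log_w, rng):
        # keep generated samples in proportion to their importance weights;
        # the next fine-tuning round then trains on this resampled set
        w = np.exp(log_w - log_w.max())            # stabilize before normalizing
        idx = rng.choice(len(samples), size=len(samples), replace=True, p=w / w.sum())
        return samples[idx]

    rng = np.random.default_rng(0)
    samples = rng.standard_normal((1000, 4))       # synthetic draws from current model
    log_w = -0.5 * (samples ** 2).sum(axis=1)      # hypothetical path-based log-weights
    print(resample(samples, log_w, rng).shape)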
[329] From Classical to Topological Neural Networks Under Uncertainty
Sarah Harkins Dayton, Layal Bou Hamdan, Ioannis D. Schizas, David L. Boothe, Vasileios Maroulas
Main category: cs.LG
TL;DR: Survey of topological data analysis, topological deep learning, and Bayesian methods for military AI applications including image, video, audio, and time-series recognition
Details
Motivation: To explore how topology-aware and uncertainty-aware models can enhance AI robustness, interpretability, and generalization for military applications, particularly in multimodal data processing.
Method: Combines neural networks with topological data analysis, topological deep learning techniques, and statistical Bayesian methods for processing images, time series, and graphs.
Result: Demonstrates practical applications spanning image, video, audio, and time-series recognition, fraud detection, and link prediction for graphical data
Conclusion: Topology-aware and uncertainty-aware models can significantly enhance AI capabilities in military domains by improving robustness, interpretability, and generalization across multimodal data types
Abstract: This chapter explores neural networks, topological data analysis, and topological deep learning techniques, alongside statistical Bayesian methods, for processing images, time series, and graphs to maximize the potential of artificial intelligence in the military domain. Throughout the chapter, we highlight practical applications spanning image, video, audio, and time-series recognition, fraud detection, and link prediction for graphical data, illustrating how topology-aware and uncertainty-aware models can enhance robustness, interpretability, and generalization.
[330] Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models
Kanta Yamaoka, Sumantrak Mukherjee, Thomas Gärtner, David Antony Selby, Stefan Konigorski, Eyke Hüllermeier, Viktor Bengs, Sebastian Josef Vollmer
Main category: cs.LG
TL;DR: LLMs struggle with quantitative causal reasoning in continuous domains; Linear-LLM-SCM framework benchmarks LLMs on linear Gaussian SCM parameter estimation given DAGs.
Details
Motivation: While LLMs show promise in qualitative causal reasoning, their ability to perform quantitative causal reasoning in continuous domains remains underexplored, particularly for estimating effect sizes and functional relationships.
Method: Introduces Linear-LLM-SCM, a plug-and-play benchmarking framework that decomposes DAGs into local parent-child sets, prompts LLMs to produce regression-style structural equations per node, and compares aggregated results against ground-truth parameters.
Result: Experiments reveal challenges: strong stochasticity in results, susceptibility to DAG misspecification via spurious edges, substantial variability in coefficient estimates, and sensitivity to structural and semantic perturbations.
Conclusion: LLMs have significant limitations as quantitative causal parameterizers in continuous domains, highlighting the need for improved methods. The open-source framework enables easy evaluation of LLMs on custom DAGs.
Abstract: Large language models (LLMs) have shown potential in identifying qualitative causal relations, but their ability to perform quantitative causal reasoning – estimating effect sizes that parametrize functional relationships – remains underexplored in continuous domains. We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating LLMs on linear Gaussian structural causal model (SCM) parametrization when the DAG is given. The framework decomposes a DAG into local parent-child sets and prompts an LLM to produce a regression-style structural equation per node, which is aggregated and compared against available ground-truth parameters. Our experiments show several challenges in such benchmarking tasks, namely, strong stochasticity in the results in some of the models and susceptibility to DAG misspecification via spurious edges in the continuous domains. Across models, we observe substantial variability in coefficient estimates for some settings and sensitivity to structural and semantic perturbations, highlighting current limitations of LLMs as quantitative causal parameterizers. We also open-sourced the benchmarking framework so that researchers can utilize their DAGs and any off-the-shelf LLMs plug-and-play for evaluation in their domains effortlessly.
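The local decomposition means the framework never shows the model the whole graph at once; each node with parents yields one query. A sketch with a mock LLM callable (the prompt wording, DAG, and ground-truth coefficients are all invented for illustration):

    import numpy as np

    dag = {"exercise": [], "diet": [], "weight": ["exercise", "diet"]}
    truth = {("exercise", "weight"): -0.8, ("diet", "weight"): 0.5}

    def elicit(node, parents, llm):
        prompt = (f"Give the linear structural equation for '{node}' with parents "
                  f"{parents}: one standardized coefficient per parent, comma-separated.")
        return [float(c) for c in llm(prompt).split(",")]

    def benchmark(dag, truth, llm):
        errs = []
        for node, parents in dag.items():
            if parents:                            # one local query per node
                coefs = elicit(node, parents, llm)
                errs += [abs(c - truth[(p, node)]) for p, c in zip(parents, coefs)]
        return np.mean(errs)                       # mean absolute coefficient error

    mock_llm = lambda prompt: "-0.7, 0.4"          # stand-in for an API call
    print(benchmark(dag, truth, mock_llm))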
[331] What Does Preference Learning Recover from Pairwise Comparison Data?
Rattana Pukdee, Maria-Florina Balcan, Pradeep Ravikumar
Main category: cs.LG
TL;DR: The paper analyzes when Bradley-Terry models are appropriate for pairwise preference learning and what they actually recover from real-world data that may violate model assumptions.
Details
Motivation: While Bradley-Terry models are widely used for preference learning (especially in aligning language models with human preferences), real data often violates model assumptions. The paper aims to understand what BT learning actually recovers from such imperfect data.
Method: The authors formalize preference information through conditional preference distributions (CPRD), establish precise conditions for when BT models are appropriate, and identify key factors governing sample efficiency: margin and connectivity.
Result: The paper provides theoretical foundations for understanding what preference learning recovers from real data, establishing conditions for BT model applicability and identifying critical factors affecting learning efficiency.
Conclusion: The work offers a data-centric foundation for preference learning, clarifying when Bradley-Terry models are suitable and what they actually recover from potentially mis-specified real-world preference data.
Abstract: Pairwise preference learning is central to machine learning, with recent applications in aligning language models with human preferences. A typical dataset consists of triplets $(x, y^+, y^-)$, where response $y^+$ is preferred over response $y^-$ for context $x$. The Bradley–Terry (BT) model is the predominant approach, modeling preference probabilities as a function of latent score differences. Standard practice assumes data follows this model and learns the latent scores accordingly. However, real data may violate this assumption, and it remains unclear what BT learning recovers in such cases. Starting from triplet comparison data, we formalize the preference information it encodes through the conditional preference distribution (CPRD). We give precise conditions for when BT is appropriate for modeling the CPRD, and identify factors governing sample efficiency – namely, margin and connectivity. Together, these results offer a data-centric foundation for understanding what preference learning actually recovers.
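For reference, the BT model and its maximum-likelihood fit are compact; a torch sketch on a toy connected comparison graph (data invented):

    import torch

    def bt_loss(score_pos, score_neg):
        # Bradley-Terry NLL: P(y+ preferred over y-) = sigmoid(s+ - s-)
        return -torch.nn.functional.logsigmoid(score_pos - score_neg).mean()

    scores = torch.zeros(4, requires_grad=True)            # latent scores, one per response
    pairs = torch.tensor([[0, 1], [0, 2], [1, 3], [2, 3]]) # (preferred, dispreferred)
    opt = torch.optim.SGD([scores], lr=0.5)
    for _ in range(200):
        opt.zero_grad()
        bt_loss(scores[pairs[:, 0]], scores[pairs[:, 1]]).backward()
        opt.step()
    print(scores.detach())                                 # identifiable up to a constant

The paper's connectivity condition shows up here directly: if the comparison graph split into disconnected components, the score offsets between components would be unidentifiable no matter how much data is collected.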
[332] Configuration-to-Performance Scaling Law with Neural Ansatz
Huaqing Zhang, Kaiyue Wen, Tengyu Ma
Main category: cs.LG
TL;DR: NCPL uses LLMs to predict training performance from full configurations, outperforming Chinchilla scaling laws and enabling better hyperparameter tuning.
Details
Motivation: Traditional scaling laws assume optimal hyperparameters, requiring significant tuning effort. There's a need for better predictability across diverse configurations and simpler tuning at scale.
Method: Proposes Neural Configuration-to-Performance Scaling Law (NCPL) using LLMs to map full training configurations to performance. Trained on diverse open-source pretraining logs across multiple sources.
Result: NCPL achieves 20-40% lower prediction error than Chinchilla law, generalizes to 10x more compute than training data, supports joint hyperparameter tuning, and extends to loss-curve prediction.
Conclusion: NCPL provides a more flexible and accurate approach to predicting training performance across diverse configurations, enabling better hyperparameter optimization at scale.
Abstract: Researchers build scaling laws to forecast the training performance of expensive large-scale runs with larger model size N and data size D. These laws assume that other training hyperparameters are optimally chosen, which can require significant effort and, in some cases, be impossible due to external hardware constraints. To improve predictability across a broader set of hyperparameters and enable simpler tuning at scale, we propose learning a \textit{Configuration-to-Performance Scaling Law} (CPL): a mapping from the \textit{full training configuration} to training performance. Because no simple functional form can express this mapping, we parameterize it with a large language model (LLM), and fit it with diverse open-source pretraining logs across multiple sources, yielding a \textit{Neural} Configuration-to-Performance Scaling Law (NCPL). NCPL accurately predicts how training configurations influence the final pretraining loss, achieving 20-40% lower prediction error than the configuration-agnostic Chinchilla law and generalizing to runs using up to 10 x more compute than any run in the training set. It further supports joint tuning of multiple hyperparameters with performance comparable to hyperparameter scaling law baselines. Finally, NCPL naturally and effectively extends to richer prediction targets such as loss-curve prediction.
[333] ICODEN: Ordinary Differential Equation Neural Networks for Interval-Censored Data
Haoling Wang, Lang Zeng, Tao Sun, Youngjoo Cho, Ying Ding
Main category: cs.LG
TL;DR: ICODEN: An ODE-based neural network for interval-censored survival analysis that models hazard functions through deep neural networks without requiring proportional hazards assumptions or parametric forms.
Details
Motivation: Existing survival analysis methods for interval-censored data either rely on strong model assumptions or cannot handle high-dimensional predictors, limiting their applicability in modern biomedical research with complex data.
Method: ICODEN uses ordinary differential equations to model hazard functions through deep neural networks, obtaining the cumulative hazard by solving an ODE. It doesn’t require proportional hazards assumptions or prespecified parametric hazard forms, enabling flexible survival modeling.
Result: ICODEN achieves satisfactory predictive accuracy across simulations with proportional/non-proportional hazards and linear/nonlinear covariate effects, remaining stable with increasing predictors. Applications to Alzheimer’s disease and age-related macular degeneration data show robust performance with hundreds to thousands of SNPs.
Conclusion: ICODEN is a practical, assumption-lean tool for prediction with interval-censored survival data in high-dimensional biomedical settings, supporting data-driven subgroup identification with differential progression risk profiles.
Abstract: Predicting time-to-event outcomes when event times are interval censored is challenging because the exact event time is unobserved. Many existing survival analysis approaches for interval-censored data rely on strong model assumptions or cannot handle high-dimensional predictors. We develop ICODEN, an ordinary differential equation-based neural network for interval-censored data that models the hazard function through deep neural networks and obtains the cumulative hazard by solving an ordinary differential equation. ICODEN does not require the proportional hazards assumption or a prespecified parametric form for the hazard function, thereby permitting flexible survival modeling. Across simulation settings with proportional or non-proportional hazards and both linear and nonlinear covariate effects, ICODEN consistently achieves satisfactory predictive accuracy and remains stable as the number of predictors increases. Applications to data from multiple phases of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and to two Age-Related Eye Disease Studies (AREDS and AREDS2) for age-related macular degeneration (AMD) demonstrate ICODEN’s robust prediction performance. In both applications, predicting time-to-AD or time-to-late AMD, ICODEN effectively uses hundreds to more than 1,000 SNPs and supports data-driven subgroup identification with differential progression risk profiles. These results establish ICODEN as a practical assumption-lean tool for prediction with interval-censored survival data in high-dimensional biomedical settings.
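A minimal torch sketch of the core construction: a nonnegative neural hazard, a cumulative hazard obtained by integrating it over time (a crude quadrature stands in for the ODE solver), and the interval-censored likelihood S(L) − S(R). All sizes and data are illustrative:

    import torch

    class HazardNet(torch.nn.Module):
        # neural hazard lambda(t, x) >= 0 via a softplus output
        def __init__(self, x_dim, hidden=32):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(x_dim + 1, hidden), torch.nn.Tanh(),
                torch.nn.Linear(hidden, 1), torch.nn.Softplus())
        def forward(self, t, x):
            return self.net(torch.cat([t, x], dim=-1))

    def cum_hazard(model, x, t_end, steps=64):
        # quadrature stand-in for solving dH/dt = lambda(t, x), H(0) = 0
        H = torch.zeros(x.shape[0], 1)
        for u in torch.linspace(0.0, 1.0, steps):
            H = H + model((u * t_end).reshape(-1, 1), x)
        return H * t_end.reshape(-1, 1) / steps

    def interval_nll(model, x, left, right):
        # event known to lie in (left, right]: contribution is S(left) - S(right)
        S_l = torch.exp(-cum_hazard(model, x, left))
        S_r = torch.exp(-cum_hazard(model, x, right))
        return -torch.log(S_l - S_r + 1e-8).mean()

    model = HazardNet(x_dim=5)
    x = torch.randn(8, 5)
    left, right = torch.rand(8), torch.rand(8) + 1.0   # censoring interval bounds
    print(interval_nll(model, x, left, right))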
[334] A unified framework for geometry-independent operator learning in cardiac electrophysiology simulations
Bei Zhou, Cesare Corrado, Shuang Qian, Maximilian Balmus, Angela W. C. Lee, Cristobal Rodero, Caroline Roney, Marco J. W. Gotte, Luuk H. G. A. Hopman, Gernot Plank, Mengyun Qiao, Steven Niederer
Main category: cs.LG
TL;DR: A unified framework for geometry-independent neural operator learning using intrinsic coordinate representations on manifolds, demonstrated on cardiac electrophysiology and biomechanics.
Details
Motivation: Existing neural operator approaches struggle with heterogeneous and irregular geometries, typically relying on structured discretizations or explicit mappings to reference domains. There's a need for a geometry-independent framework that can handle extreme anatomical variability.
Method: Proposes reformulating operator learning in an intrinsic coordinate space defined on the underlying manifold. Both inputs and outputs are expressed in this shared coordinate domain, decoupling operator learning from mesh discretization and geometric variability while preserving spatial organization.
Result: The approach outperforms established neural operators on both atrial and ventricular geometries in cardiac electrophysiology. Also successfully applied to cardiac biomechanics (volumetric deformation), demonstrating framework generality.
Conclusion: Intrinsic coordinate representations provide a principled and extensible pathway for neural operator learning on complex physical systems with heterogeneous geometry, enabling geometry-independent operator learning across diverse applications.
Abstract: Learning neural operators on heterogeneous and irregular geometries remains a fundamental challenge, as existing approaches typically rely on structured discretisations or explicit mappings to a shared reference domain. We propose a unified framework for geometry-independent operator learning that reformulates the learning problem in an intrinsic coordinate space defined on the underlying manifold. By expressing both inputs and outputs in this shared coordinate domain, the framework decouples operator learning from mesh discretisation and geometric variability, while preserving meaningful spatial organisation and enabling faithful reconstruction on the original geometry. We demonstrate the framework on cardiac electrophysiology, a particularly challenging setting due to extreme anatomical variability across heart geometries. Leveraging a GPU-accelerated simulation pipeline, we generate large-scale datasets of high-fidelity electrophysiology simulations across diverse patient-specific anatomies and train customised neural operators to predict full-field local activation time maps. The proposed approach outperforms established neural operators on both atrial and ventricular geometries. Beyond cardiac electrophysiology, we further show that the same representation enables operator learning in cardiac biomechanics, a distinct problem involving volumetric deformation, highlighting the generality of the proposed framework. Together, these results establish intrinsic coordinate representations as a principled and extensible pathway for neural operator learning on complex physical systems characterised by heterogeneous geometry.
[335] Confounding Robust Continuous Control via Automatic Reward Shaping
Mateo Juliani, Mingxuan Li, Elias Bareinboim
Main category: cs.LG
TL;DR: Automatically learns reward shaping functions for continuous control RL from offline datasets, robust to unobserved confounding variables using causal Bellman equations.
Details
Motivation: Reward shaping accelerates RL training but lacks principled design methods, especially for complex continuous control problems with potential unobserved confounding variables in offline datasets.
Method: Uses the causal Bellman equation to learn tight upper bounds on optimal state values, then applies these as potentials in the Potential-Based Reward Shaping (PBRS) framework, tested with Soft-Actor-Critic on continuous control benchmarks.
Result: Exhibits strong performance guarantees under unobserved confounders and demonstrates effectiveness on multiple continuous control benchmarks.
Conclusion: Marks a solid first step towards confounding robust continuous control from a causal perspective, providing automated reward shaping for RL in complex environments.
Abstract: Reward shaping has been applied widely to accelerate Reinforcement Learning (RL) agents’ training. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explained. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potentials in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft-Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance guarantees under unobserved confounders. More broadly, our work marks a solid first step towards confounding robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.
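For reference, PBRS modifies only the reward and is known to preserve the optimal policy for any potential function; here the learned value upper bound plays the role of Φ (the toy potential below is a stand-in for that learned bound):

    import numpy as np

    def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
        # potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)
        bootstrap = 0.0 if done else gamma * phi(s_next)
        return r + bootstrap - phi(s)

    # toy potential: negative distance-to-goal, standing in for the learned
    # causal upper bound on optimal state values
    phi = lambda s: -np.linalg.norm(np.asarray(s) - np.array([1.0, 1.0]))
    print(shaped_reward(0.0, [0.0, 0.0], [0.5, 0.5], phi))   # positive: progress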
[336] Shortest-Path Flow Matching with Mixture-Conditioned Bases for OOD Generalization to Unseen Conditions
Andrea Rubbi, Amir Akbarnejad, Mohammad Vali Sanian, Aryan Yazdan Parast, Hesam Asadollahzadeh, Arian Amani, Naveed Akhtar, Sarah Cooper, Andrew Bassett, Pietro Liò, Lassi Paavolainen, Sattar Vakili, Mo Lotfollahi
Main category: cs.LG
TL;DR: SP-FM is a shortest-path flow-matching framework that improves out-of-distribution generalization for conditional generative models by conditioning both the base distribution and flow field on the condition.
Details
Motivation: Current conditional flow-based methods often fail to extrapolate to unseen conditions despite fitting training conditions well, highlighting the need for better out-of-distribution generalization in conditional generative modeling.
Method: SP-FM learns a condition-dependent base distribution parameterized as a flexible learnable mixture, together with a condition-dependent vector field trained via shortest-path flow matching, allowing adaptation of starting distributions across conditions.
Result: The method shows effectiveness across heterogeneous domains including single-cell transcriptomics and drug screening, enabling smooth interpolation and more reliable extrapolation beyond observed training ranges.
Conclusion: SP-FM provides a simple yet effective plug-in strategy for improving conditional generative modeling and OOD generalization across diverse domains.
Abstract: Robust generalization under distribution shift remains a key challenge for conditional generative modeling: conditional flow-based methods often fit the training conditions well but fail to extrapolate to unseen ones. We introduce SP-FM, a shortest-path flow-matching framework that improves out-of-distribution (OOD) generalization by conditioning both the base distribution and the flow field on the condition. Specifically, SP-FM learns a condition-dependent base distribution parameterized as a flexible, learnable mixture, together with a condition-dependent vector field trained via shortest-path flow matching. Conditioning the base allows the model to adapt its starting distribution across conditions, enabling smooth interpolation and more reliable extrapolation beyond the observed training range. We provide theoretical insights into the resulting conditional transport and show how mixture-conditioned bases enhance robustness under shift. Empirically, SP-FM is effective across heterogeneous domains, including predicting responses to unseen perturbations in single-cell transcriptomics and modeling treatment effects in high-content microscopy–based drug screening. Overall, SP-FM provides a simple yet effective plug-in strategy for improving conditional generative modeling and OOD generalization across diverse domains.
[337] R2RAG-Flood: A reasoning-reinforced training-free retrieval augmentation generation framework for flood damage nowcasting
Lipai Huang, Kai Yin, Chia-Fu Liu, Ali Mostafavi
Main category: cs.LG
TL;DR: R2RAG-Flood is a training-free retrieval-augmented generation framework for post-storm property damage prediction that uses reasoning trajectories from a knowledge base to guide LLM predictions without fine-tuning.
Details
Motivation: To create a practical damage assessment system that combines the reasoning capabilities of LLMs with the efficiency of retrieval-augmented generation, avoiding costly fine-tuning while providing interpretable predictions.
Method: Builds a knowledge base with labeled tabular records containing structured predictors, text summaries, and reasoning trajectories. During inference, retrieves relevant reasoning from geospatial neighbors and class prototypes to guide LLM predictions through context-augmented prompts in a two-stage damage assessment process.
Result: Achieves 0.613-0.668 overall accuracy and 0.757-0.896 damage class accuracy across 7 LLM backbones, approaching supervised baseline performance (0.714 overall accuracy) while providing structured rationales and showing substantially higher efficiency in severity-per-cost metrics.
Conclusion: R2RAG-Flood demonstrates that training-free retrieval-augmented generation can approach supervised model performance for damage assessment tasks while offering interpretability and computational efficiency advantages.
Abstract: R2RAG-Flood is a reasoning-reinforced, training-free retrieval-augmented generation framework for post-storm property damage nowcasting. Building on an existing supervised tabular predictor, the framework constructs a reasoning-centric knowledge base composed of labeled tabular records, where each sample includes structured predictors, a compact natural language text-mode summary, and a model-generated reasoning trajectory. During inference, R2RAG-Flood issues context-augmented prompts that retrieve and condition on relevant reasoning trajectories from nearby geospatial neighbors and canonical class prototypes, enabling the large language model backbone to emulate and adapt prior reasoning rather than learn new task-specific parameters. Predictions follow a two-stage procedure that first determines property damage occurrence and then refines severity within a three-level Property Damage Extent categorization, with a conditional downgrade step to correct over-predicted severity. In a case study of Harris County, Texas at the 12-digit Hydrologic Unit Code scale, the supervised tabular baseline trained directly on structured predictors achieves 0.714 overall accuracy and 0.859 damage class accuracy for medium and high damage classes. Across seven large language model backbones, R2RAG-Flood attains 0.613 to 0.668 overall accuracy and 0.757 to 0.896 damage class accuracy, approaching the supervised baseline while additionally producing a structured rationale for each prediction. Using a severity-per-cost efficiency metric derived from API pricing and GPU instance costs, lightweight R2RAG-Flood variants demonstrate substantially higher efficiency than both the supervised tabular baseline and larger language models, while requiring no task-specific training or fine-tuning.
[338] Stop Training for the Worst: Progressive Unmasking Accelerates Masked Diffusion Training
Jaeyeon Kim, Jonathan Geuter, David Alvarez-Melis, Sham Kakade, Sitan Chen
Main category: cs.LG
TL;DR: PUMA (Progressive UnMAsking) is a training method for Masked Diffusion Models that aligns training and inference masking patterns to reduce computational cost and improve efficiency.
Details
Motivation: Masked Diffusion Models (MDMs) have training complexity issues due to training on exponentially large sets of random masking patterns, creating a train-test mismatch with structured inference-time unmasking patterns.
Method: PUMA modifies the forward masking process to progressively unmask tokens during training, aligning training-time masking patterns with inference-time unmasking patterns to focus optimization on relevant masks.
Result: PUMA speeds up pretraining at 125M scale by approximately 2.5× and offers complementary advantages on top of common recipes like autoregressive initialization.
Conclusion: PUMA provides a simple yet effective modification to MDM training that reduces computational cost while maintaining performance by better aligning training with inference patterns.
Abstract: Masked Diffusion Models (MDMs) have emerged as a promising approach for generative modeling in discrete spaces. By generating sequences in any order and allowing for parallel decoding, they enable fast inference and strong performance on non-causal tasks. However, this flexibility comes with a training complexity trade-off: MDMs train on an exponentially large set of masking patterns, which is not only computationally expensive, but also creates a train–test mismatch between the random masks used in training and the highly structured masks induced by inference-time unmasking. In this work, we propose Progressive UnMAsking (PUMA), a simple modification of the forward masking process that aligns training-time and inference-time masking patterns, thereby focusing optimization on inference-aligned masks and speeding up training. Empirically, PUMA speeds up pretraining at the 125M scale by $\approx 2.5\times$ and offers complementary advantages on top of common recipes like autoregressive initialization. We open-source our codebase at https://github.com/JaeyeonKim01/PUMA.
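One way to read "aligning training with inference" in code: draw a single unmasking trajectory and train on its nested masks, instead of an independent random mask per draw. A sketch of that reading (not necessarily the paper's exact forward process):

    import torch

    def progressive_masks(seq_len, n_steps, generator=None):
        # nested masks along one reveal order: each mask is a superset of the
        # next, mirroring how inference unmasks tokens step by step
        order = torch.randperm(seq_len, generator=generator)
        masks = []
        for step in range(n_steps):
            n_visible = int(seq_len * (step + 1) / n_steps)
            mask = torch.ones(seq_len, dtype=torch.bool)   # True = masked
            mask[order[:n_visible]] = False
            masks.append(mask)
        return masks

    for m in progressive_masks(8, 4, torch.Generator().manual_seed(0)):
        print(m.int().tolist())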
[339] Identifying Evidence-Based Nudges in Biomedical Literature with Large Language Models
Jaydeep Chauhan, Mark Seidman, Pezhman Raeisian Parvari, Zhi Zheng, Zina Ben-Miled, Cristina Barboi, Andrew Gonzalez, Malaz Boustani
Main category: cs.LG
TL;DR: AI system extracts behavioral nudges from biomedical literature using hybrid filtering and LLM classification, achieving tunable precision-recall tradeoffs for evidence-based healthcare interventions.
Details
Motivation: Behavioral nudges show strong impact on health outcomes but identifying them from PubMed's 8M+ articles is challenging; a scalable system is needed to extract evidence-based interventions from unstructured literature.
Method: Multi-stage pipeline: 1) Hybrid filtering (keywords, TF-IDF, cosine similarity, nudge-term bonus) reduces the corpus to 81K candidates; 2) OpenScholar (quantized LLaMA 3.1 8B) classifies papers and extracts structured fields in a single pass with JSON schema validation.
Result: Best setup achieved 67.0% F1 score and 72.0% recall; high-precision variant with self-consistency (7 randomized passes) achieved 100% precision with 12% recall; system demonstrates tunable trade-off for different use cases.
Conclusion: System enables interpretable, domain-specific retrieval for evidence synthesis and personalized healthcare, being integrated into Agile Nudge+ platform to ground LLM-generated interventions in peer-reviewed evidence.
Abstract: We present a scalable, AI-powered system that identifies and extracts evidence-based behavioral nudges from unstructured biomedical literature. Nudges are subtle, non-coercive interventions that influence behavior without limiting choice, showing strong impact on health outcomes like medication adherence. However, identifying these interventions from PubMed’s 8 million+ articles is a bottleneck. Our system uses a novel multi-stage pipeline: first, hybrid filtering (keywords, TF-IDF, cosine similarity, and a “nudge-term bonus”) reduces the corpus to about 81,000 candidates. Second, we use OpenScholar (quantized LLaMA 3.1 8B) to classify papers and extract structured fields like nudge type and target behavior in a single pass, validated against a JSON schema. We evaluated four configurations on a labeled test set (N=197). The best setup (Title/Abstract/Intro) achieved a 67.0% F1 score and 72.0% recall, ideal for discovery. A high-precision variant using self-consistency (7 randomized passes) achieved 100% precision with 12% recall, demonstrating a tunable trade-off for high-trust use cases. This system is being integrated into Agile Nudge+, a real-world platform, to ground LLM-generated interventions in peer-reviewed evidence. This work demonstrates interpretable, domain-specific retrieval pipelines for evidence synthesis and personalized healthcare.
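The first-stage hybrid filter combines standard pieces; a sketch with invented nudge terms and bonus weight (the paper's actual term list and weighting are not specified here):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    NUDGE_TERMS = {"nudge", "default", "reminder", "framing", "commitment"}

    def score_abstracts(abstracts, query, bonus=0.1):
        # TF-IDF cosine similarity to a query, plus a flat bonus per nudge term
        vec = TfidfVectorizer(stop_words="english")
        X = vec.fit_transform(abstracts + [query])
        sims = cosine_similarity(X[:-1], X[-1]).ravel()
        bonuses = [bonus * sum(t in a.lower() for t in NUDGE_TERMS) for a in abstracts]
        return sims + np.array(bonuses)

    docs = ["Default enrollment nudge improved medication adherence.",
            "A study of protein folding dynamics."]
    print(score_abstracts(docs, "behavioral nudge medication adherence"))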
[340] Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution
Haixu Liao, Yating Zhou, Songyang Zhang, Meng Wang, Shuai Zhang
Main category: cs.LG
TL;DR: Theoretical analysis of contrastive learning dynamics with Transformers under imbalanced data distributions, revealing three-stage neuron evolution and showing pruning can mitigate imbalance effects.
Details
Motivation: Contrastive learning lacks theoretical understanding under imbalanced data distributions common in real-world applications, which can degrade representation quality and cause biased model behavior.
Method: Developed theoretical framework to analyze training dynamics of contrastive learning with Transformer-based encoders under imbalanced data, examining neuron weight evolution through three distinct stages.
Result: Revealed minority features reduce representational capacity, increase architectural complexity needs, and hinder feature-noise separation; pruning restores performance and enhances feature separation.
Conclusion: Provides theoretical insights into contrastive learning dynamics under imbalance and practical guidance that pruning can mitigate negative effects, with findings validated through numerical experiments.
Abstract: Contrastive learning has emerged as a powerful framework for learning generalizable representations, yet its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world applications. Such an imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise. We further show that minority features reduce representational capacity, increase the need for more complex architectures, and hinder the separation of ground-truth features from noise. Inspired by these neuron-level behaviors, we show that pruning restores performance degraded by imbalance and enhances feature separation, offering both conceptual insights and practical guidance. Major theoretical findings are validated through numerical experiments.
[341] Simple LLM Baselines are Competitive for Model Diffing
Elias Kempf, Simon Schrodi, Bartosz Cywiński, Thomas Brox, Neel Nanda, Arthur Conmy
Main category: cs.LG
TL;DR: Paper proposes evaluation metrics for model diffing methods (LLM-based vs SAE-based) to compare their ability to surface systematic behavioral differences between model revisions, finding improved LLM baseline performs comparably to SAE method with more abstract differences.
Details
Motivation: Standard LLM evaluations miss unexpected behavioral differences between model revisions or emergent misaligned tendencies. Model diffing addresses this by automatically surfacing systematic differences, but no systematic comparison exists between LLM-based and SAE-based approaches, nor established evaluation criteria.
Method: Proposes evaluation metrics for key desiderata: generalization (how well differences generalize), interestingness (how novel/surprising), and abstraction level (conceptual vs specific). Uses these metrics to compare existing LLM-based and SAE-based model diffing methods, including an improved LLM-based baseline.
Result: Improved LLM-based baseline performs comparably to SAE-based method in surfacing behavioral differences, while typically identifying more abstract behavioral differences between model revisions.
Conclusion: Model diffing evaluation framework enables systematic comparison of methods; LLM-based approaches can match SAE-based methods in effectiveness while providing more abstract insights into model behavioral changes.
Abstract: Standard LLM evaluations only test capabilities or dispositions that evaluators designed them for, missing unexpected differences such as behavioral shifts between model revisions or emergent misaligned tendencies. Model diffing addresses this limitation by automatically surfacing systematic behavioral differences. Recent approaches include LLM-based methods that generate natural language descriptions and sparse autoencoder (SAE)-based methods that identify interpretable features. However, no systematic comparison of these approaches exists nor are there established evaluation criteria. We address this gap by proposing evaluation metrics for key desiderata (generalization, interestingness, and abstraction level) and use these to compare existing methods. Our results show that an improved LLM-based baseline performs comparably to the SAE-based method while typically surfacing more abstract behavioral differences.
[342] Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs
Luoyang Sun, Jiwen Jiang, Yifeng Ding, Fengfa Li, Yan Song, Haifeng Zhang, Jian Ying, Lei Ren, Kun Zhan, Wei Chen, Yan Xie, Cheng Deng
Main category: cs.LG
TL;DR: Proposes a hardware-software co-design framework for Vision-Language-Action models that jointly optimizes model accuracy and inference latency through scaling laws and roofline modeling for on-device deployment.
Details
Motivation: VLAs are increasingly deployed in resource-constrained on-device settings (autonomous vehicles, robots, smart spaces) where selecting appropriate LLM backbones requires balancing accuracy with strict inference latency and hardware efficiency constraints.
Method: Develops a hardware co-design law that models training loss as a function of architectural hyperparameters and characterizes inference latency via roofline modeling. Evaluates 1,942 candidate architectures on NVIDIA Jetson Orin, trains 170 selected models to fit scaling laws, and couples scaling laws with latency modeling to establish accuracy-latency correspondence.
Result: Reduces architecture selection from months to days. At same latency as Qwen2.5-0.5B on target hardware, co-designed architecture achieves 19.42% lower perplexity on WikiText-2. Identifies Pareto frontier for hardware co-designed LLMs.
Conclusion: Presents first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment, enabling efficient hardware-software co-design for VLAs in resource-constrained environments.
Abstract: Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.
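The roofline model at the heart of the latency side is simple enough to state in a few lines; the hardware numbers below are rough Jetson-class placeholders, not the paper's measured values.

```python
# Minimal roofline latency estimate: an operator is compute-bound or
# memory-bound, and its time is the larger of the two bounds.
def roofline_latency_s(flops, bytes_moved,
                       peak_flops=1.0e12,       # sustained FLOP/s (assumed)
                       peak_bandwidth=1.0e11):  # bytes/s (assumed)
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)

# Batch-1 decode of a ~0.5B-parameter model streams all weights once per
# token, so it is typically memory-bound:
params = 0.5e9
print(roofline_latency_s(flops=2 * params, bytes_moved=2 * params))  # fp16 weights
```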
[343] Deep learning outperforms traditional machine learning methods in predicting childhood malnutrition: evidence from survey data
Deepak Bastola, Yang Li
Main category: cs.LG
TL;DR: Machine learning analysis of childhood malnutrition in Nepal using survey data, with TabNet performing best among 16 algorithms for identifying at-risk children.
Details
Motivation: Childhood malnutrition is a major public health concern in Nepal and low-resource settings, but conventional case-finding approaches are labor-intensive and often unavailable in remote areas. There's a need for scalable, automated screening methods.
Method: Systematically compared 16 algorithms (deep learning, gradient boosting, traditional ML) using Nepal MICS 2019 survey data. Created composite malnutrition indicator from stunting, wasting, and underweight status. Evaluated performance using 10 metrics with emphasis on F1-score and recall due to class imbalance.
Result: TabNet demonstrated best performance among all models, outperforming SVM and AdaBoost. Feature importance analysis identified maternal education, household wealth index, child age as primary predictors, followed by geographic characteristics, vaccination status, and meal frequency.
Conclusion: Demonstrates a scalable, survey-based screening framework for identifying children at elevated malnutrition risk and guiding targeted interventions. Supports SDG progress and offers transferable methodological template for similar low-resource settings globally.
Abstract: Childhood malnutrition remains a major public health concern in Nepal and other low-resource settings, while conventional case-finding approaches are labor-intensive and frequently unavailable in remote areas. This study provides the first comprehensive assessment of machine learning and deep learning methodologies for identifying malnutrition among children under five years of age in Nepal. We systematically compared 16 algorithms spanning deep learning, gradient boosting, and traditional machine learning families, using data from the Nepal Multiple Indicator Cluster Survey (MICS) 2019. A composite malnutrition indicator was constructed by integrating stunting, wasting, and underweight status, and model performance was evaluated using ten metrics, with emphasis on F1-score and recall to account for substantial class imbalance and the high cost of failing to detect malnourished children. Among all models, TabNet demonstrated the best performance, likely attributable to its attention-based architecture, and outperformed both support vector machine and AdaBoost classifiers. A consensus feature importance analysis identified maternal education, household wealth index, and child age as the primary predictors of malnutrition, followed by geographic characteristics, vaccination status, and meal frequency. Collectively, these results demonstrate a scalable, survey-based screening framework for identifying children at elevated risk of malnutrition and for guiding targeted nutritional interventions. The proposed approach supports Nepal’s progress toward the Sustainable Development Goals and offers a transferable methodological template for similar low-resource settings globally.
[344] Time-to-Event Transformer to Capture Timing Attention of Events in EHR Time Series
Jia Li, Yu Hou, Rui Zhang
Main category: cs.LG
TL;DR: LITT is a Timing-Transformer architecture that enables personalized sequential event discovery from clinical time-series data by creating virtual relative timelines for event-timing-focused attention.
Details
Motivation: Current AI models, including transformers, are mostly agnostic to event timing and ordering in clinical time-series data, which limits their ability to discover personalized sequential events and perform causal reasoning needed for precision medicine.
Method: LITT introduces a Timing-Transformer architecture that creates a virtual “relative timeline” to align sequential events, enabling event-timing-focused attention. It treats timing as a computable dimension and assigns relative timestamps to candidate events beyond their observed physical times.
Result: LITT was validated on real-world longitudinal EHR data from 3,276 breast cancer patients to predict cardiotoxicity-induced heart disease onset timing. It outperformed both benchmark and state-of-the-art survival analysis methods on public datasets.
Conclusion: LITT represents a significant step forward for precision medicine in clinical AI by enabling personalized interpretation of clinical trajectories through timing-aware sequential event discovery.
Abstract: Automatically discovering personalized sequential events from large-scale time-series data is crucial for enabling precision medicine in clinical research, yet it remains a formidable challenge even for contemporary AI models. For example, while transformers capture rich associations, they are mostly agnostic to event timing and ordering, thereby bypassing potential causal reasoning.
Intuitively, we need a method capable of evaluating the “degree of alignment” among patient-specific trajectories and identifying their shared patterns, i.e., the significant events in a consistent sequence. This necessitates treating timing as a true \emph{computable} dimension, allowing models to assign “relative timestamps” to candidate events beyond their observed physical times. In this work, we introduce LITT, a novel Timing-Transformer architecture that enables temporary alignment of sequential events on a virtual “relative timeline”, thereby enabling \emph{event-timing-focused attention} and personalized interpretations of clinical trajectories. Its interpretability and effectiveness are validated on real-world longitudinal EHR data from 3,276 breast cancer patients to predict the onset timing of cardiotoxicity-induced heart disease. Furthermore, LITT outperforms both the benchmark and state-of-the-art survival analysis methods on public datasets, positioning it as a significant step forward for precision medicine in clinical AI.
[345] Colorful Talks with Graphs: Human-Interpretable Graph Encodings for Large Language Models
Angelo Zangari, Peyman Baghershahi, Sourav Medya
Main category: cs.LG
TL;DR: LLMs struggle with graph problems due to structural reasoning requirements; this paper introduces a human-interpretable structural encoding method using Weisfeiler-Lehman similarity classes mapped to color tokens, improving LLM performance on graph tasks.
Details
Motivation: Graph problems present fundamental challenges for LLMs because they require reasoning over explicit structure, permutation invariance, and complex relationships - capabilities that don't align well with text-based LLM representations. There's a need to bridge this gap between LLMs' text processing strengths and graph reasoning requirements.
Method: The method introduces a human-interpretable structural encoding strategy for graph-to-text translation. It computes a variant of Weisfeiler-Lehman (WL) similarity classes and maps them to human-like color tokens instead of numeric labels. This approach injects graph structure directly into natural language prompts using semantically meaningful cues that LLMs can process more effectively than opaque symbolic encodings.
Result: Experimental results on multiple algorithmic and predictive graph tasks show considerable improvements on both synthetic and real-world datasets. The method captures both local and global-range dependencies, enhancing LLM performance particularly on graph tasks requiring reasoning over global graph structure.
Conclusion: Human-interpretable structural encoding using semantically meaningful cues (like color tokens) can effectively bridge the gap between LLMs’ text processing capabilities and graph reasoning requirements, enabling better performance on graph problems despite the fundamental challenges.
Abstract: Graph problems are fundamentally challenging for large language models (LLMs). While LLMs excel at processing unstructured text, graph tasks require reasoning over explicit structure, permutation invariance, and computationally complex relationships, creating a mismatch with the representations of text-based models. Our work investigates how LLMs can be effectively applied to graph problems despite these barriers. We introduce a human-interpretable structural encoding strategy for graph-to-text translation that injects graph structure directly into natural language prompts. Our method computes a variant of Weisfeiler-Lehman (WL) similarity classes and maps them to human-like color tokens rather than numeric labels. The key insight is that semantically meaningful and human-interpretable cues may be more effectively processed by LLMs than opaque symbolic encoding. Experimental results on multiple algorithmic and predictive graph tasks show considerable improvements from our method on both synthetic and real-world datasets. By capturing both local and global-range dependencies, our method enhances LLM performance especially on graph tasks that require reasoning over global graph structure.
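The encoding idea is easy to prototype: run Weisfeiler-Lehman color refinement and name the resulting classes with color words. The palette and the toy graph below are our own illustration, not the paper's exact scheme.

```python
# Hedged sketch: WL refinement on an adjacency dict, with refinement
# classes mapped to human-readable color tokens for prompting.
PALETTE = ["red", "blue", "green", "yellow", "purple", "orange"]  # assumed palette

def wl_color_tokens(adj, rounds=2):
    colors = {v: 0 for v in adj}  # start with a uniform coloring
    for _ in range(rounds):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return {v: PALETTE[c % len(PALETTE)] for v, c in colors.items()}

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # triangle plus a pendant node
print(wl_color_tokens(graph))
# Structurally identical nodes 0 and 1 share a token, so a prompt can
# refer to "the blue nodes" instead of quoting opaque numeric labels.
```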
[346] Affordances Enable Partial World Modeling with LLMs
Khimya Khetarpal, Gheorghe Comanici, Jonathan Richens, Jeremy Shar, Fei Xia, Laurent Orseau, Aleksandra Faust, Doina Precup
Main category: cs.LG
TL;DR: Large language models can serve as partial world models for task-agnostic, language-conditioned intents, with affordance-aware extraction improving search efficiency in robotics tasks.
Details
Motivation: While large pre-trained models contain extensive world knowledge, using them directly for search is inefficient. Partial models focusing on affordance-linked states and actions could be more effective, but it's unclear if large models can serve as such partial world models.
Method: Formal analysis proving agents achieving task-agnostic, language-conditioned intents necessarily possess predictive partial-world models. Introduces distribution-robust affordances for multi-task settings and shows how to extract partial models to improve search efficiency.
Result: Empirical evaluations in tabletop robotics tasks demonstrate affordance-aware partial models reduce search branching factor and achieve higher rewards compared to full world models.
Conclusion: Large models can indeed serve as partial world models when properly extracted with affordance awareness, providing more efficient and effective search procedures for task completion.
Abstract: Full models of the world require complex knowledge of immense detail. While pre-trained large models have been hypothesized to contain similar knowledge due to extensive pre-training on vast amounts of internet scale data, using them directly in a search procedure is inefficient and inaccurate. Conversely, partial models focus on making high quality predictions for a subset of state and actions: those linked through affordances that achieve user intents~\citep{khetarpal2020can}. Can we posit large models as partial world models? We provide a formal answer to this question, proving that agents achieving task-agnostic, language-conditioned intents necessarily possess predictive partial-world models informed by affordances. In the multi-task setting, we introduce distribution-robust affordances and show that partial models can be extracted to significantly improve search efficiency. Empirical evaluations in tabletop robotics tasks demonstrate that our affordance-aware partial models reduce the search branching factor and achieve higher rewards compared to full world models.
[347] Tensor Methods: A Unified and Interpretable Approach for Material Design
Shaan Pakala, Aldair E. Gongora, Brian Giera, Evangelos E. Papalexakis
Main category: cs.LG
TL;DR: Tensor completion methods for material design optimization offer interpretable predictions that compete with traditional ML while rediscovering physical phenomena through tensor factors.
Details
Motivation: Material design optimization faces exponential search space growth, making exhaustive synthesis/evaluation impossible and traditional computational methods too heavy. ML surrogate models are difficult to interpret and underperform with non-uniform training data sampling.
Method: Proposes tensor completion methods as an all-in-one approach for interpretability and predictions. Uses classical tensor methods that provide interpretable tensor factors as a byproduct of prediction, and studies specialized tensor methods for non-uniform sampling scenarios.
Result: Tensor methods compete with traditional ML in predictions while offering free interpretable tensor factors. They rediscover physical phenomena, indicating alignment with true underlying physics. Specialized tensor methods improve generalization on non-uniform data, outperforming baseline ML by up to 5% on aggregate R² and halving error in some out-of-distribution regions.
Conclusion: Tensor completion methods provide a promising alternative to traditional ML for material design optimization, offering both competitive predictive performance and interpretability through tensor factors that reveal physical patterns and improve generalization on non-uniform data.
Abstract: When designing new materials, it is often necessary to tailor the material design (with respect to its design parameters) to have some desired properties (e.g. Young’s modulus). As the set of design parameters grows, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even using traditional computational methods such as Finite Element Analysis becomes too computationally heavy to search the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) underperform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which are given completely for free, as a result of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given we are able to rediscover existing ones. We also study the effects of both types of surrogate models when we encounter training data from a non-uniform sampling of the design space. We observe that more specialized tensor methods can give better generalization in these non-uniform sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate $R^2$, and halve the error in some out-of-distribution regions.
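The core primitive, completing a sparsely observed design-space tensor from a low-rank factorization, fits in a short numpy sketch; the synthetic tensor, rank, and optimizer settings below are illustrative, not the specialized methods the paper studies.

```python
# Hedged sketch of masked CP (CANDECOMP/PARAFAC) completion by gradient
# descent: only observed design points contribute to the loss, and the
# factor matrices A, B, C are the interpretable byproduct.
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 8, 8, 8, 2
A0, B0, C0 = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)   # synthetic "property" tensor
mask = rng.random(T.shape) < 0.3             # only 30% of designs evaluated

A, B, C = (0.1 * rng.normal(size=(n, R)) for n in (I, J, K))
lr = 0.01
for _ in range(2000):
    err = mask * (np.einsum('ir,jr,kr->ijk', A, B, C) - T)
    gA = np.einsum('ijk,jr,kr->ir', err, B, C)
    gB = np.einsum('ijk,ir,kr->jr', err, A, C)
    gC = np.einsum('ijk,ir,jr->kr', err, A, B)
    A, B, C = A - lr * gA, B - lr * gB, C - lr * gC

pred = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.sqrt(((pred - T)[~mask] ** 2).mean()))  # held-out reconstruction error
```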
[348] Experimental Demonstration of Online Learning-Based Concept Drift Adaptation for Failure Detection in Optical Networks
Yousuf Moiz Ali, Jaroslaw E. Prilepsky, João Pedro, Antonio Napoli, Sasipim Srivallapanondh, Sergei K. Turitsyn, Pedro Freire
Main category: cs.LG
TL;DR: Online learning approach for concept drift adaptation in optical network failure detection, achieving 70% performance improvement over static models with low latency.
Details
Motivation: Optical network failure detection faces challenges due to concept drift - changing network conditions and failure patterns over time that degrade static model performance.
Method: Novel online learning-based approach that continuously adapts to concept drift in optical network data streams, maintaining low latency for real-time failure detection.
Result: Achieves up to 70% improvement in performance over conventional static models while maintaining low latency requirements for optical network operations.
Conclusion: Online learning with concept drift adaptation significantly improves optical network failure detection performance compared to static approaches.
Abstract: We present a novel online learning-based approach for concept drift adaptation in optical network failure detection, achieving up to a 70% improvement in performance over conventional static models while maintaining low latency.
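As a generic illustration of the idea (not the paper's optical-network pipeline), the sketch below compares an incrementally updated classifier against one frozen after the first batch on a drifting stream; the data and model are placeholders.

```python
# Hedged sketch: online adaptation to concept drift with partial_fit,
# versus a static model trained once. Synthetic 2-D "telemetry" whose
# class-conditional means drift over time.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
online, static = SGDClassifier(), SGDClassifier()

for t in range(10):
    shift = 0.5 * t                                   # the concept drift
    X0 = rng.normal(loc=0.0 + shift, size=(50, 2))    # "healthy" samples
    X1 = rng.normal(loc=2.0 + shift, size=(50, 2))    # "failure" samples
    X = np.vstack([X0, X1])
    y = np.array([0] * 50 + [1] * 50)
    if t == 0:
        static.partial_fit(X, y, classes=[0, 1])      # frozen afterwards
        online.partial_fit(X, y, classes=[0, 1])
    else:
        print(t, "online:", online.score(X, y), "static:", static.score(X, y))
        online.partial_fit(X, y)                      # keep adapting
```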
[349] Modular Multi-Task Learning for Chemical Reaction Prediction
Jiayun Pang, Ahmed M. Zaitoun, Xacobe Couso Cambeiro, Ivan Vulić
Main category: cs.LG
TL;DR: LoRA enables parameter-efficient fine-tuning of LLMs for organic chemistry tasks, matching full fine-tuning performance while better preserving general chemical knowledge and mitigating catastrophic forgetting.
Details
Motivation: The paper addresses the challenge of adapting large language models trained on broad organic chemistry to smaller, domain-specific reaction datasets in chemical and pharmaceutical R&D, requiring effective specialization while preserving general chemical understanding.
Method: The study evaluates Low-Rank Adaptation (LoRA) as a parameter-efficient alternative to full fine-tuning for organic reaction prediction, benchmarking on USPTO reaction classes and challenging C-H functionalisation reactions for forward reaction prediction, retrosynthesis, and reagent prediction tasks.
Result: LoRA achieves accuracy comparable to full fine-tuning while effectively mitigating catastrophic forgetting and better preserving multi-task performance. Both approaches generalize beyond training distributions, producing plausible alternative solvent predictions. C-H functionalisation fine-tuning reveals that LoRA and full fine-tuning encode subtly different reactivity patterns.
Conclusion: As LLMs continue to scale, modular, parameter-efficient fine-tuning strategies like LoRA offer practical solutions for flexible deployment in chemistry applications, with LoRA showing more effective reaction-specific adaptation.
Abstract: Adapting large language models (LLMs) trained on broad organic chemistry to smaller, domain-specific reaction datasets is a key challenge in chemical and pharmaceutical R&D. Effective specialisation requires learning new reaction knowledge while preserving general chemical understanding across related tasks. Here, we evaluate Low-Rank Adaptation (LoRA) as a parameter-efficient alternative to full fine-tuning for organic reaction prediction on limited, complex datasets. Using USPTO reaction classes and challenging C-H functionalisation reactions, we benchmark forward reaction prediction, retrosynthesis and reagent prediction. LoRA achieves accuracy comparable to full fine-tuning while effectively mitigating catastrophic forgetting and better preserving multi-task performance. Both fine-tuning approaches generalise beyond training distributions, producing plausible alternative solvent predictions. Notably, C-H functionalisation fine-tuning reveals that LoRA and full fine-tuning encode subtly different reactivity patterns, suggesting more effective reaction-specific adaptation with LoRA. As LLMs continue to scale, our results highlight the practicality of modular, parameter-efficient fine-tuning strategies for their flexible deployment for chemistry applications.
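For readers unfamiliar with the adapter being benchmarked, a minimal LoRA layer looks like this; the rank and scaling follow common defaults, not necessarily the paper's settings.

```python
# Minimal LoRA linear layer: the pretrained weight is frozen and a
# low-rank update (alpha/r) * B @ A is trained in its place.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 8192, not 262656
```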
[350] Gated Removal of Normalization in Transformers Enables Stable Training and Efficient Inference
Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall
Main category: cs.LG
TL;DR: TaperNorm gradually transitions from standard normalization to learned fixed scaling, enabling norm-free Transformers by eliminating per-token statistics while maintaining training stability.
Details
Motivation: To challenge the necessity of sample-dependent normalization in pre-norm Transformers and develop a method that maintains training stability while eventually eliminating per-token statistics for inference efficiency.
Method: TaperNorm replaces RMSNorm/LayerNorm with a mechanism that starts as standard normalization, then smoothly transitions to learned sample-independent linear/affine maps via cosine-decayed global gate, enabling folding into adjacent linear projections.
Result: TaperNorm matches normalized baselines while eliminating per-token statistics, enabling up to 1.22× higher throughput in last-token logits mode through folding internal scalings.
Conclusion: The work identifies scale anchoring as normalization’s key role and demonstrates a path toward norm-free Transformers while maintaining training stability.
Abstract: Normalization is widely viewed as essential for stabilizing Transformer training. We revisit this assumption for pre-norm Transformers and ask to what extent sample-dependent normalization is needed inside Transformer blocks. We introduce TaperNorm, a drop-in replacement for RMSNorm/LayerNorm that behaves exactly like the standard normalizer early in training and then smoothly tapers to a learned sample-independent linear/affine map. A single global gate is held at $g{=}1$ during gate warmup, used to calibrate the scaling branch via EMAs, and then cosine-decayed to $g{=}0$, at which point per-token statistics vanish and the resulting fixed scalings can be folded into adjacent linear projections. Our theoretical and empirical results isolate scale anchoring as the key role played by output normalization: as a (near) $0$-homogeneous map it removes radial gradients at the output, whereas without such an anchor cross-entropy encourages unbounded logit growth (“logit chasing”). We further show that a simple fixed-target auxiliary loss on the pre-logit residual-stream scale provides an explicit alternative anchor and can aid removal of the final normalization layer. Empirically, TaperNorm matches normalized baselines under identical setups while eliminating per-token statistics and enabling these layers to be folded into adjacent linear projections at inference. On an efficiency microbenchmark, folding internal scalings yields up to $1.22\times$ higher throughput in last-token logits mode. These results take a step towards norm-free Transformers while identifying the special role output normalization plays.
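The gating mechanism is compact enough to sketch; the EMA calibration of the scaling branch, the gate warmup, and the folding step are omitted, and the schedule below is illustrative.

```python
# Hedged sketch of a TaperNorm-style block: a convex combination of
# RMSNorm and a learned sample-independent scaling, with the global gate
# cosine-decayed from 1 to 0 over training.
import math
import torch
import torch.nn as nn

class TaperNormSketch(nn.Module):
    def __init__(self, dim, decay_steps=10_000):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))   # RMSNorm gain
        self.scale = nn.Parameter(torch.ones(dim))  # fixed per-channel scaling
        self.decay_steps = decay_steps
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def forward(self, x):
        t = min(self.step.item() / self.decay_steps, 1.0)
        g = 0.5 * (1.0 + math.cos(math.pi * t))     # global gate: 1 -> 0
        normed = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
        if self.training:
            self.step += 1
        # at g == 0 per-token statistics vanish, and `scale` can be
        # folded into the adjacent linear projection at inference
        return g * self.gain * normed + (1 - g) * self.scale * x

print(TaperNormSketch(64)(torch.randn(4, 64)).shape)
```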
[351] LUCID: Attention with Preconditioned Representations
Sai Surya Duvvuri, Nirmal Patel, Nilesh Gupta, Inderjit S. Dhillon
Main category: cs.LG
TL;DR: LUCID Attention introduces a preconditioner based on exponentiated key-key similarities to improve attention focus in long sequences without computational overhead, showing significant gains on long-context retrieval tasks.
Details
Motivation: Standard softmax attention diffuses probability mass to irrelevant tokens in long sequences, and attempts to sharpen focus via temperature reduction cause vanishing gradient problems, limiting performance in long-context scenarios.
Method: LUCID Attention applies a preconditioner derived from exponentiated key-key similarities to minimize overlap between keys in RKHS, allowing queries to focus on important keys with same computational complexity as standard attention.
Result: Training ~1B parameter models on up to 128K tokens shows significant improvements: up to 18% on BABILong and 14% on RULER multi-needle performance compared to standard attention, with gains on SCROLLS and LongBench.
Conclusion: LUCID Attention effectively addresses softmax limitations in long sequences through preconditioning, improving retrieval performance without computational overhead or learnability issues from temperature adjustments.
Abstract: Softmax-based dot-product attention is a cornerstone of Transformer architectures, enabling remarkable capabilities such as in-context learning. However, as context lengths increase, a fundamental limitation of the softmax function emerges: it tends to diffuse probability mass to irrelevant tokens degrading performance in long-sequence scenarios. Furthermore, attempts to sharpen focus by lowering softmax temperature hinder learnability due to vanishing gradients. We introduce LUCID Attention, an architectural modification that applies a preconditioner to the attention probabilities. This preconditioner, derived from exponentiated key-key similarities, minimizes overlap between the keys in a Reproducing Kernel Hilbert Space, thus allowing the query to focus on important keys among large number of keys accurately with same computational complexity as standard attention. Additionally, LUCID’s preconditioning-based approach to retrieval bypasses the need for low temperature and the learnability problems associated with it. We validate our approach by training ~1 billion parameter language models evaluated on up to 128K tokens. Our results demonstrate significant gains on long-context retrieval tasks, specifically retrieval tasks from BABILong, RULER, SCROLLS and LongBench. For instance, LUCID achieves up to 18% improvement in BABILong and 14% improvement in RULER multi-needle performance compared to standard attention.
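The abstract does not spell out the exact preconditioning formula, but one plausible reading is to solve the attention scores against the exponentiated key-key Gram matrix so that near-duplicate keys stop splitting probability mass; the sketch below is our loose reconstruction for a single query, not the paper's algorithm.

```python
# Speculative single-query sketch: precondition exponentiated scores by
# the key-key Gram matrix under the same exponential kernel.
import torch

def lucid_like_attention(q, K, V):
    d = q.shape[-1]
    scores = torch.exp(q @ K.T / d ** 0.5)   # (n,)
    G = torch.exp(K @ K.T / d ** 0.5)        # exponentiated key-key similarities
    w = torch.linalg.solve(G + 1e-4 * torch.eye(len(K)), scores)
    w = w.clamp(min=0)
    return (w / w.sum()) @ V

K = torch.randn(6, 16)
K[1] = K[0]                                  # a duplicated key
q, V = K[0].clone(), torch.randn(6, 16)
print(lucid_like_attention(q, K, V).shape)   # duplicates no longer double-count
```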
[352] LightGTS-Cov: Covariate-Enhanced Time Series Forecasting
Yong Shang, Zhipeng Yao, Ning Jin, Xiangfei Qiu, Hui Zhang, Bin Yang
Main category: cs.LG
TL;DR: LightGTS-Cov extends LightGTS foundation model to better incorporate exogenous covariates for time series forecasting, achieving superior performance in energy applications.
Details
Motivation: Existing time series foundation models often ignore or poorly handle exogenous covariates, limiting their effectiveness in covariate-rich applications like electricity price and renewable energy forecasting.
Method: Extends LightGTS with a small MLP plug-in (~0.1M parameters) that residually refines outputs by integrating time-aligned covariates (both past and future-known) into the forecasting process.
Result: Outperforms LightGTS and other covariate-aware baselines on electricity price and energy generation benchmarks, and demonstrates strong real-world performance in photovoltaic power and electricity price forecasting applications.
Conclusion: LightGTS-Cov effectively incorporates covariates while maintaining lightweight architecture, proving valuable for real-world energy forecasting applications with covariate dependencies.
Abstract: Time series foundation models are typically pre-trained on large, multi-source datasets; however, they often ignore exogenous covariates or incorporate them via simple concatenation with the target series, which limits their effectiveness in covariate-rich applications such as electricity price forecasting and renewable energy forecasting. We introduce LightGTS-Cov, a covariate-enhanced extension of LightGTS that preserves its lightweight, period-aware backbone while explicitly incorporating both past and future-known covariates. Built on a $\sim$1M-parameter LightGTS backbone, LightGTS-Cov adds only a $\sim$0.1M-parameter MLP plug-in that integrates time-aligned covariates into the target forecasts by residually refining the outputs of the decoding process. Across covariate-aware benchmarks on electricity price and energy generation datasets, LightGTS-Cov consistently outperforms LightGTS and achieves superior performance over other covariate-aware baselines under both settings, regardless of whether future-known covariates are provided. We further demonstrate its practical value in two real-world energy case applications: long-term photovoltaic power forecasting with future weather forecasts and day-ahead electricity price forecasting with weather and dispatch-plan covariates. Across both applications, LightGTS-Cov achieves strong forecasting accuracy and stable operational performance after deployment, validating its effectiveness in real-world industrial settings.
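The plug-in design reduces to a residual correction on top of the frozen backbone's forecast; the dimensions and stand-alone module below are illustrative, with the real backbone being the pretrained LightGTS.

```python
# Hedged sketch of a covariate plug-in: a small MLP residually refines a
# base forecast from time-aligned (future-known) covariates.
import torch
import torch.nn as nn

class CovariateRefiner(nn.Module):
    def __init__(self, horizon, n_cov, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(horizon + horizon * n_cov, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon))

    def forward(self, base_forecast, covariates):
        # base_forecast: (batch, horizon); covariates: (batch, horizon, n_cov)
        feats = torch.cat([base_forecast, covariates.flatten(1)], dim=-1)
        return base_forecast + self.mlp(feats)   # residual refinement

refiner = CovariateRefiner(horizon=24, n_cov=3)
print(refiner(torch.randn(8, 24), torch.randn(8, 24, 3)).shape)  # (8, 24)
```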
[353] AI-rithmetic
Alex Bie, Travis Dick, Alex Kulesza, Prabhakar Raghavan, Vinod Raman, Sergei Vassilvitskii
Main category: cs.LG
TL;DR: Frontier AI models excel at advanced mathematics but fail at basic integer addition due to digit-length scaling issues, with errors mainly from operand misalignment and carrying failures.
Details
Motivation: Despite AI systems achieving success in complex mathematical tasks like international competitions and research, they consistently fail at basic arithmetic operations like adding two numbers, revealing a fundamental weakness that needs systematic investigation.
Method: The study conducts empirical evaluation of frontier models on integer addition tasks, analyzing error patterns as digit count increases, and categorizing errors into interpretable classes like operand misalignment and carrying failures.
Result: All frontier models show significantly degraded accuracy for integer addition with increasing digits. Most errors are interpretable: 87.9% of Claude Opus 4.1 errors, 62.9% of GPT-5 errors, and 92.4% of Gemini 2.5 Pro errors are attributed to operand misalignment or carrying failures. Misalignment errors relate to tokenization, while carrying errors appear as independent random failures.
Conclusion: Frontier AI models have fundamental limitations in basic arithmetic despite advanced mathematical capabilities, with systematic error patterns revealing underlying architectural weaknesses related to tokenization and carrying mechanisms.
Abstract: Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.
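The two error classes can be operationalized with a small checker: a carrying failure reproduces digit-wise addition with carries dropped, and a misalignment reproduces the sum with one operand shifted by a decimal place. This heuristic is our illustration, not the paper's taxonomy code.

```python
# Hedged sketch: classify a wrong sum as a carrying failure or an
# operand-misalignment error.
def add_no_carry(a: int, b: int) -> int:
    da, db = str(a)[::-1], str(b)[::-1]
    n = max(len(da), len(db))
    digits = [((int(da[i]) if i < len(da) else 0)
               + (int(db[i]) if i < len(db) else 0)) % 10 for i in range(n)]
    return int("".join(map(str, digits))[::-1])

def classify_error(a: int, b: int, answer: int) -> str:
    if answer == a + b:
        return "correct"
    if answer == add_no_carry(a, b):
        return "carrying failure"
    if answer in {a * 10 + b, a + b * 10, a * 100 + b, a + b * 100}:
        return "operand misalignment"
    return "other"

print(classify_error(57, 68, 15))   # carrying failure: 7+8 -> 5, 5+6 -> 1
print(classify_error(57, 68, 638))  # operand misalignment: 57 shifted one place
```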
[354] Equivariant Evidential Deep Learning for Interatomic Potentials
Zhongyao Wang, Taoyong Cui, Jiawen Zou, Shufei Zhang, Bo Yan, Wanli Ouyang, Weimin Tan, Mao Su
Main category: cs.LG
TL;DR: Proposes e²IP, an equivariant evidential deep learning framework for uncertainty quantification in machine learning interatomic potentials that models atomic forces and their uncertainty with rotation-equivariant covariance tensors.
Details
Motivation: Existing uncertainty quantification methods for machine learning interatomic potentials are computationally expensive or perform poorly. Evidential deep learning offers a single-model alternative but faces challenges extending to vector-valued quantities like atomic forces while maintaining rotational equivariance.
Method: Develops e²IP, a backbone-agnostic framework that models atomic forces and uncertainty jointly using full 3×3 symmetric positive definite covariance tensors that transform equivariantly under rotations. This maintains statistical self-consistency while enabling single-forward-pass uncertainty estimation.
Result: Experiments on diverse molecular benchmarks show e²IP achieves better accuracy-efficiency-reliability balance than non-equivariant evidential baselines and ensemble methods, with improved data efficiency while retaining single-model inference efficiency.
Conclusion: e²IP provides an effective framework for uncertainty quantification in machine learning interatomic potentials that addresses rotational equivariance challenges while maintaining computational efficiency and reliability.
Abstract: Uncertainty quantification (UQ) is critical for assessing the reliability of machine learning interatomic potentials (MLIPs) in molecular dynamics (MD) simulations, identifying extrapolation regimes and enabling uncertainty-aware workflows such as active learning for training dataset construction. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. Evidential deep learning (EDL) provides a theoretically grounded single-model alternative that determines both aleatoric and epistemic uncertainty in a single forward pass. However, extending evidential formulations from scalar targets to vector-valued quantities such as atomic forces introduces substantial challenges, particularly in maintaining statistical self-consistency under rotational transformations. To address this, we propose \textit{Equivariant Evidential Deep Learning for Interatomic Potentials} ($\text{e}^2$IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly by representing uncertainty as a full $3\times3$ symmetric positive definite covariance tensor that transforms equivariantly under rotations. Experiments on diverse molecular benchmarks show that $\text{e}^2$IP provides a stronger accuracy-efficiency-reliability balance than the non-equivariant evidential baseline and the widely used ensemble method. It also achieves better data efficiency through the fully equivariant architecture while retaining single-model inference efficiency.
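The equivariance constraint itself is a one-line transformation rule, Σ(Rx) = R Σ(x) Rᵀ, which the check below verifies numerically for an SPD covariance built from a Cholesky-style factor; how the network produces that factor from atomic features is paper-specific and omitted.

```python
# Numerical check of the SPD covariance transformation rule under a
# proper rotation: positivity and the spectrum are preserved.
import numpy as np

rng = np.random.default_rng(0)
L = np.tril(rng.normal(size=(3, 3)))
Sigma = L @ L.T + 1e-6 * np.eye(3)            # SPD force-uncertainty tensor

R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                             # ensure a proper rotation

Sigma_rot = R @ Sigma @ R.T
assert np.all(np.linalg.eigvalsh(Sigma_rot) > 0)   # still SPD
assert np.allclose(np.linalg.eigvalsh(Sigma),      # spectrum is invariant
                   np.linalg.eigvalsh(Sigma_rot))
print("rotation acts equivariantly on the covariance")
```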
[355] Breaking the Curse of Repulsion: Optimistic Distributionally Robust Policy Optimization for Off-Policy Generative Recommendation
Jie Jiang, Yusen Huo, Xiangxin Zhan, Changping Wang, Jun Zhang
Main category: cs.LG
TL;DR: DRPO introduces distributionally robust policy optimization for offline RL in recommendation systems to prevent model collapse from low-quality data by identifying and filtering high-quality behaviors.
Details
Motivation: Policy-based RL methods for generative recommendation suffer from model collapse when trained on offline historical logs due to dominance of low-quality data, requiring a solution to reconcile variance reduction and noise imitation.
Method: Proposes Distributionally Robust Policy Optimization (DRPO) that reformulates the objective as an Optimistic Distributionally Robust Optimization problem, using hard filtering as the exact solution to recover high-quality behaviors while discarding divergence-inducing noise.
Result: DRPO achieves state-of-the-art performance on mixed-quality recommendation benchmarks by preventing model collapse and effectively recovering high-quality behaviors from noisy data.
Conclusion: The paper presents a theoretical framework and practical solution for offline RL in recommendation systems that addresses the critical issue of model collapse from low-quality data through distributionally robust optimization.
Abstract: Policy-based Reinforcement Learning (RL) has established itself as the dominant paradigm in generative recommendation for optimizing sequential user interactions. However, when applied to offline historical logs, these methods suffer a critical failure: the dominance of low-quality data induces severe model collapse. We first establish the Divergence Theory of Repulsive Optimization, revealing that negative gradient updates inherently trigger exponential intensity explosion during off-policy training. This theory elucidates the inherent dilemma of existing methods, exposing their inability to reconcile variance reduction and noise imitation. To break this curse, we argue that the solution lies in rigorously identifying the latent high-quality distribution entangled within the noisy behavior policy. Accordingly, we reformulate the objective as an Optimistic Distributionally Robust Optimization (DRO) problem. Guided by this formulation, we propose Distributionally Robust Policy Optimization (DRPO). We prove that hard filtering is the exact solution to this DRO objective, enabling DRPO to optimally recover high-quality behaviors while strictly discarding divergence-inducing noise. Extensive experiments demonstrate that DRPO achieves state-of-the-art performance on mixed-quality recommendation benchmarks.
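The hard-filtering rule the paper derives can be sketched in a few lines: imitate only the top fraction of logged behaviors by reward so that no negative gradient is ever applied to noisy actions; the quantile level below is an assumed knob.

```python
# Hedged sketch of hard filtering as a DRO-style training objective:
# pure imitation on the kept subset, zero gradient on the rest.
import torch

def drpo_style_loss(logprobs, rewards, top_q=0.2):
    cutoff = torch.quantile(rewards, 1.0 - top_q)
    keep = rewards >= cutoff                  # latent high-quality behaviors
    return -logprobs[keep].mean()             # no repulsive negative gradients

logprobs = torch.randn(1000, requires_grad=True)
rewards = torch.rand(1000)
drpo_style_loss(logprobs, rewards).backward()
print((logprobs.grad != 0).float().mean())    # ~0.2: noisy logs get no gradient
```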
[356] QTALE: Quantization-Robust Token-Adaptive Layer Execution for LLMs
Kanghyun Noh, Jinheon Choi, Yulwha Kim
Main category: cs.LG
TL;DR: QTALE is a framework that enables seamless integration of token-adaptive layer execution (reduces FLOPs) with quantization (reduces memory) for efficient LLM deployment while preserving accuracy.
Details
Motivation: LLMs require substantial computational and memory resources, making efficient deployment challenging. While token-adaptive execution reduces FLOPs and quantization reduces memory footprint, naively combining them causes accuracy degradation due to reduced redundancy in token-adaptive models.
Method: QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of execution ratio at inference to reintroduce redundancy when needed.
Result: QTALE enables seamless integration of token-adaptive layer execution with quantization, showing no noticeable accuracy difference, with the gap to quantization-only models kept below 0.5% on CommonsenseQA benchmarks.
Conclusion: By combining token-adaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment while preserving model accuracy.
Abstract: Large language models (LLMs) demand substantial computational and memory resources, posing challenges for efficient deployment. Two complementary approaches have emerged to address these issues: token-adaptive layer execution, which reduces floating-point operations (FLOPs) by selectively bypassing layers, and quantization, which lowers memory footprint by reducing weight precision. However, naively integrating these techniques leads to additional accuracy degradation due to reduced redundancy in token-adaptive models. We propose QTALE (Quantization-Robust Token-Adaptive Layer Execution for LLMs), a novel framework that enables seamless integration of token-adaptive execution with quantization while preserving accuracy. Conventional token-adaptive methods reduce redundancy in two ways: (1) by limiting the diversity of training paths explored during fine-tuning, and (2) by lowering the number of parameters actively involved in inference. To overcome these limitations, QTALE introduces two key components: (1) a training strategy that ensures diverse execution paths are actively explored during fine-tuning, and (2) a post-training mechanism that allows flexible adjustment of the execution ratio at inference to reintroduce redundancy when needed. Experimental results show that QTALE enables seamless integration of token-adaptive layer execution with quantization, showing no noticeable accuracy difference, with the gap to quantization-only models kept below 0.5% on CommonsenseQA benchmarks. By combining token-adaptive execution for FLOPs reduction and quantization for memory savings, QTALE provides an effective solution for efficient LLM deployment.
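Token-adaptive layer execution with a post-training execution-ratio knob can be sketched as a router over the residual stream; the router design and ratio below are illustrative, and the quantization of the layer weights is orthogonal to this skeleton.

```python
# Hedged sketch: a per-token router decides which tokens execute the
# layer; the execution ratio is adjustable at inference time.
import torch
import torch.nn as nn

class AdaptiveLayer(nn.Module):
    def __init__(self, dim, exec_ratio=0.7):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.router = nn.Linear(dim, 1)
        self.exec_ratio = exec_ratio            # the post-training knob

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x).squeeze(-1)
        k = max(1, int(self.exec_ratio * x.shape[0]))
        run = scores >= scores.topk(k).values[-1]
        out = x.clone()
        out[run] = x[run] + self.layer(x[run])  # skipped tokens ride the residual path
        return out

print(AdaptiveLayer(64)(torch.randn(10, 64)).shape)
```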
[357] A Dual-Stream Physics-Augmented Unsupervised Architecture for Runtime Embedded Vehicle Health Monitoring
Enzo Nicolas Spotorno, Antonio Augusto Medeiros Frohlich
Main category: cs.LG
TL;DR: A dual-stream architecture for vehicle health monitoring that combines unsupervised anomaly detection with physics-based load estimation to distinguish between transient surface shocks and sustained mechanical fatigue, enabling edge-based monitoring on resource-constrained ECUs.
Details
Motivation: Traditional vehicle monitoring metrics like mileage fail to capture mechanical burden, while unsupervised deep learning models often conflate statistical stability with mechanical rest, missing critical high-load steady states (like hill climbing with heavy payloads) that cause significant drivetrain fatigue.
Method: Proposes a Dual-Stream Architecture that fuses unsupervised learning for surface anomaly detection with macroscopic physics proxies for cumulative load estimation, using low-frequency sensor data to generate multi-dimensional health vectors distinguishing dynamic hazards from sustained mechanical effort.
Result: Validated on a RISC-V embedded platform, the architecture demonstrates low computational overhead, enabling comprehensive edge-based health monitoring on resource-constrained ECUs without cloud latency or bandwidth costs.
Conclusion: The approach successfully addresses the blind spot in traditional monitoring by distinguishing between transient anomalies and sustained mechanical fatigue, providing practical edge-based vehicle health monitoring for commercial fleets.
Abstract: Runtime quantification of vehicle operational intensity is essential for predictive maintenance and condition monitoring in commercial and heavy-duty fleets. Traditional metrics like mileage fail to capture mechanical burden, while unsupervised deep learning models detect statistical anomalies, typically transient surface shocks, but often conflate statistical stability with mechanical rest. We identify this as a critical blind spot: high-load steady states, such as hill climbing with heavy payloads, appear statistically normal yet impose significant drivetrain fatigue. To resolve this, we propose a Dual-Stream Architecture that fuses unsupervised learning for surface anomaly detection with macroscopic physics proxies for cumulative load estimation. This approach leverages low-frequency sensor data to generate a multi-dimensional health vector, distinguishing between dynamic hazards and sustained mechanical effort. Validated on a RISC-V embedded platform, the architecture demonstrates low computational overhead, enabling comprehensive, edge-based health monitoring on resource-constrained ECUs without the latency or bandwidth costs of cloud-based monitoring.
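The paper does not publish its load proxy here, but a plausible macroscopic stand-in integrates positive tractive power from speed, acceleration, and road grade; all coefficients below are assumptions for illustration.

```python
# Speculative load proxy: integrate positive tractive power over a trip
# from low-frequency speed and grade signals.
G = 9.81     # m/s^2
C_RR = 0.01  # rolling-resistance coefficient (assumed)

def cumulative_load_j(mass_kg, speeds, grades, dt=1.0):
    """speeds in m/s, grades as sin(theta), one sample every dt seconds."""
    load = 0.0
    for i in range(1, len(speeds)):
        a = (speeds[i] - speeds[i - 1]) / dt
        force = mass_kg * (a + G * grades[i] + G * C_RR)
        load += max(force * speeds[i], 0.0) * dt  # braking adds no drivetrain load
    return load

# A loaded hill climb is statistically "calm" yet mechanically heavy:
print(cumulative_load_j(12_000, [5.0] * 600, [0.08] * 600))  # 10-min 8% climb
print(cumulative_load_j(12_000, [5.0] * 600, [0.00] * 600))  # same trip on the flat
```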
[358] Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering
Seonglae Cho, Zekun Wu, Adriano Koshiyama
Main category: cs.LG
TL;DR: CRL trains a policy to select sparse autoencoder features for steering language model outputs, providing interpretable intervention logs and new analysis capabilities for mechanistic interpretability.
Details
Motivation: Existing sparse autoencoder methods only show which features activate in language models, but not which features actually change model outputs when amplified. There's a need for dynamic intervention methods that can identify causal features and provide interpretable steering capabilities.
Method: Introduces Control Reinforcement Learning (CRL) which trains a policy to select SAE features for steering at each token. Uses Adaptive Feature Masking to encourage diverse feature discovery while preserving single-feature interpretability. The framework includes branch point tracking, critic trajectory analysis, and layer-wise comparison techniques.
Result: CRL achieves improvements on Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest benchmarks while providing per-token intervention logs. The method reveals syntactic features in early layers and semantic features in later layers, and enables tracking of branch points where feature choice determines output correctness.
Conclusion: Learned feature steering through CRL establishes a new mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes, providing interpretable logs of which features causally affect model outputs.
Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma-2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.
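The intervention primitive the policy controls is simple in isolation: amplifying a feature adds its SAE decoder direction back into the residual stream. Hooking this into a specific Gemma-2 layer and learning the per-token policy are omitted; the shapes below are illustrative.

```python
# Hedged sketch of single-feature steering on a residual stream.
import torch

def steer(resid, W_dec, feature, alpha=4.0):
    """resid: (tokens, d_model); W_dec: (n_features, d_model)."""
    return resid + alpha * W_dec[feature]        # broadcast over tokens

resid = torch.randn(12, 256)
W_dec = torch.nn.functional.normalize(torch.randn(4096, 256), dim=-1)
steered = steer(resid, W_dec, feature=123)
print((steered - resid).norm(dim=-1)[:3])        # each token moved by ~alpha
```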
[359] LakeMLB: Data Lake Machine Learning Benchmark
Feiyu Pan, Tianbin Zhang, Aoqian Zhang, Yu Sun, Zheng Wang, Lixing Chen, Li Pan, Jianhua Li
Main category: cs.LG
TL;DR: LakeMLB is a benchmark for evaluating machine learning performance in data lake environments, focusing on multi-table scenarios (Union and Join) with real-world datasets and supporting three integration strategies.
Details
Motivation: Standardized benchmarks for evaluating machine learning performance in data lake environments are scarce, despite data lakes being foundational platforms for large-scale ML with heterogeneous data storage.
Method: Developed LakeMLB benchmark focusing on two representative multi-table scenarios (Union and Join) with three real-world datasets each, covering domains like government open data, finance, Wikipedia, and online marketplaces. Supports three integration strategies: pre-training-based, data augmentation-based, and feature augmentation-based approaches.
Result: Conducted extensive experiments with state-of-the-art tabular learning methods, providing insights into their performance under complex data lake scenarios. Released both datasets and code publicly.
Conclusion: LakeMLB addresses the gap in standardized benchmarks for ML in data lakes, facilitating rigorous research on machine learning in data lake ecosystems with multi-table scenarios.
Abstract: Modern data lakes have emerged as foundational platforms for large-scale machine learning, enabling flexible storage of heterogeneous data and structured analytics through table-oriented abstractions. Despite their growing importance, standardized benchmarks for evaluating machine learning performance in data lake environments remain scarce. To address this gap, we present LakeMLB (Data Lake Machine Learning Benchmark), designed for the most common multi-source, multi-table scenarios in data lakes. LakeMLB focuses on two representative multi-table scenarios, Union and Join, and provides three real-world datasets for each scenario, covering government open data, finance, Wikipedia, and online marketplaces. The benchmark supports three representative integration strategies: pre-training-based, data augmentation-based, and feature augmentation-based approaches. We conduct extensive experiments with state-of-the-art tabular learning methods, offering insights into their performance under complex data lake scenarios. We release both datasets and code to facilitate rigorous research on machine learning in data lake ecosystems; the benchmark is available at https://github.com/zhengwang100/LakeMLB.
[360] Chamfer-Linkage for Hierarchical Agglomerative Clustering
Kishen N Gowda, Willem Fletcher, MohammadHossein Bateni, Laxman Dhulipala, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki
Main category: cs.LG
TL;DR: Proposes Chamfer-linkage, a novel linkage function for hierarchical agglomerative clustering that uses Chamfer distance between clusters, showing improved clustering quality across diverse datasets compared to classical linkages.
Details
Motivation: Classical linkage functions (single-linkage, average-linkage, Ward's method) show high variability across real-world datasets and don't consistently produce high-quality clusterings. There's a need for more robust linkage functions that better capture cluster structure.
Method: Introduces Chamfer-linkage, which measures distance between clusters using Chamfer distance - a popular distance metric between point-clouds in ML/CV. The method maintains O(n²) time complexity like classical linkages and can be implemented as a drop-in replacement.
Result: Chamfer-linkage consistently yields higher-quality clusterings than classical linkages across diverse datasets. It satisfies desirable concept representation properties that other measures struggle with, while maintaining computational efficiency.
Conclusion: Chamfer-linkage is a practical drop-in replacement for classical linkage functions that broadens the hierarchical clustering toolkit, offering more consistent and higher-quality results across different datasets.
Abstract: Hierarchical Agglomerative Clustering (HAC) is a widely-used clustering method based on repeatedly merging the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore primarily evaluated empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage and Ward’s method show high variability across real-world datasets and do not consistently produce high-quality clusterings in practice. In this paper, we propose \emph{Chamfer-linkage}, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point-clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in $O(n^2)$ time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward’s method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice.
[361] A Unified Theory of Random Projection for Influence Functions
Pingbang Hu, Yuzheng Hu, Jiaqi W. Ma, Han Zhao
Main category: cs.LG
TL;DR: Theoretical analysis of when random projection preserves influence functions in overparameterized models, addressing limitations of Johnson-Lindenstrauss lemma for curvature inversion and interaction with regularization.
Details
Motivation: Influence functions require computing gᵀF⁻¹g' where F is a curvature matrix, but forming/inverting F is prohibitive in overparameterized models. Random projection via sketching is used but lacks theoretical justification for inversion and interaction with regularization techniques.
Method: Develop unified theory analyzing when projection preserves influence functions. Analyze three settings: 1) Unregularized projection requiring injectivity on range(F), 2) Regularized projection with ridge regularization altering sketching requirements, 3) Factorized influence for Kronecker-factored curvatures with decoupled sketches. Also analyze out-of-range test gradients with leakage terms.
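The practice being analyzed can be sketched in a few lines on a synthetic low-rank curvature (all names and sizes illustrative): compute the ridge-regularized influence exactly, then again in the m-dimensional sketched space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, lam = 2000, 100, 256, 1e-2

G = rng.standard_normal((n, d))      # per-example gradients
F = G.T @ G / n                      # curvature with rank(F) <= n
g, gp = G[0], G[1]                   # gradients lying in range(F)

# exact ridge-regularized influence g^T (F + lam I)^{-1} g'
exact = g @ np.linalg.solve(F + lam * np.eye(d), gp)

# sketched: project with P (m x d), then invert in m dimensions
P = rng.standard_normal((m, d)) / np.sqrt(m)
sketched = (P @ g) @ np.linalg.solve(P @ F @ P.T + lam * np.eye(m), P @ gp)

print(exact, sketched)  # agreement improves as m passes the effective dim
```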
Result: Shows exact preservation requires m ≥ rank(F) for unregularized case, ridge regularization changes sketching barrier with guarantees governed by effective dimension, and Kronecker-factored curvatures work with decoupled sketches despite violating i.i.d. assumptions. Quantifies leakage for general test points.
Conclusion: Provides novel theory characterizing when projection provably preserves influence functions and principled guidance for choosing sketch size in practice, addressing gaps in existing justification for sketching methods.
Abstract: Influence functions and related data attribution scores take the form of $g^{\top}F^{-1}g^{\prime}$, where $F\succeq 0$ is a curvature operator. In modern overparameterized models, forming or inverting $F\in\mathbb{R}^{d\times d}$ is prohibitive, motivating scalable influence computation via random projection with a sketch $P \in \mathbb{R}^{m\times d}$. This practice is commonly justified via the Johnson–Lindenstrauss (JL) lemma, which ensures approximate preservation of Euclidean geometry for a fixed dataset. However, JL does not address how sketching behaves under inversion. Furthermore, there is no existing theory that explains how sketching interacts with other widely-used techniques, such as ridge regularization and structured curvature approximations. We develop a unified theory characterizing when projection provably preserves influence functions. When $g,g^{\prime}\in\text{range}(F)$, we show that: 1) Unregularized projection: exact preservation holds iff $P$ is injective on $\text{range}(F)$, which necessitates $m\geq \text{rank}(F)$; 2) Regularized projection: ridge regularization fundamentally alters the sketching barrier, with approximation guarantees governed by the effective dimension of $F$ at the regularization scale; 3) Factorized influence: for Kronecker-factored curvatures $F=A\otimes E$, the guarantees continue to hold for decoupled sketches $P=P_A\otimes P_E$, even though such sketches exhibit row correlations that violate i.i.d. assumptions. Beyond this range-restricted setting, we analyze out-of-range test gradients and quantify a \emph{leakage} term that arises when test gradients have components in $\ker(F)$. This yields guarantees for influence queries on general test points. Overall, this work develops a novel theory that characterizes when projection provably preserves influence and provides principled guidance for choosing the sketch size in practice.
[362] Constructing Industrial-Scale Optimization Modeling Benchmark
Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, Zaiwen Wen
Main category: cs.LG
TL;DR: MIPLIB-NL: A benchmark for evaluating LLMs on translating natural language to optimization formulations, built from real industrial mixed-integer linear programs with 10³-10⁶ variables/constraints.
Details
Motivation: Current LLM evaluation for optimization modeling uses toy/synthetic benchmarks, masking difficulties of real industrial problems. Need benchmarks aligning natural-language specs with real optimization models to properly assess LLM capabilities.
Method: Structure-aware reverse construction from MIPLIB 2017: (1) recover compact model structure from flat solver formulations, (2) reverse-generate natural-language specs tied to recovered structure using model-data separation, (3) iterative semantic validation through expert review and human-LLM interaction with reconstruction checks.
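For a sense of what "solver-executable code from a natural-language spec" means, here is a toy knapsack ("choose items maximizing value within a weight budget") under model-data separation; this is purely illustrative and not an instance from the benchmark.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

values = np.array([10.0, 13.0, 7.0, 8.0])   # instance data, kept separate
weights = np.array([5.0, 6.0, 3.0, 4.0])    # from the generic model template
capacity = 10.0

res = milp(
    c=-values,  # milp minimizes, so negate to maximize total value
    constraints=LinearConstraint(weights[None, :], ub=capacity),
    integrality=np.ones(len(values)),       # all decision variables integer
    bounds=Bounds(0, 1),                    # binary selection
)
print(res.x, -res.fun)
```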
Result: Created MIPLIB-NL with 223 one-to-one reconstructions preserving original mathematical content. Experiments show substantial performance degradation on MIPLIB-NL for systems strong on existing benchmarks, exposing failure modes invisible at toy scale.
Conclusion: MIPLIB-NL fills critical gap for realistic natural-language-to-optimization evaluation, revealing LLM limitations on industrial-scale problems and providing benchmark for advancing optimization modeling with LLMs.
Abstract: Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$–$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model–data separation format, and (iii) performs iterative semantic validation through expert review and human–LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
[363] A Multimodal Conditional Mixture Model with Distribution-Level Physics Priors
Jinkyo Han, Bahador Bahmani
Main category: cs.LG
TL;DR: Physics-informed multimodal conditional modeling framework using mixture density networks (MDNs) with physics regularization for scientific systems with intrinsic multimodality from regime switching and non-unique physical mechanisms.
Details
Motivation: Scientific systems often exhibit multimodal behavior from latent regime switching and non-unique physical mechanisms, making it challenging to learn full conditional distributions in a physically consistent and interpretable way. Current ML generative models lack integration with physics constraints.
Method: Uses mixture density networks (MDNs) for explicit multimodal conditional distributions, with physics knowledge embedded through component-specific regularization terms that penalize violations of governing equations or physical laws. Accommodates non-uniqueness and stochasticity while remaining computationally efficient.
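A compact PyTorch sketch of the idea, under the assumption of a scalar target and a user-supplied residual function for the governing equation (names and architecture illustrative):

```python
import math
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Mixture density network over a scalar target."""
    def __init__(self, d_in, K):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, 64), nn.Tanh())
        self.pi, self.mu, self.log_sig = (nn.Linear(64, K) for _ in range(3))

    def forward(self, x):
        h = self.body(x)
        return self.pi(h), self.mu(h), self.log_sig(h)

def physics_informed_loss(model, x, y, residual, alpha=0.1):
    logits, mu, log_sig = model(x)
    log_pi = torch.log_softmax(logits, -1)
    log_norm = (-0.5 * ((y[:, None] - mu) / log_sig.exp()) ** 2
                - log_sig - 0.5 * math.log(2 * math.pi))
    nll = -torch.logsumexp(log_pi + log_norm, -1).mean()
    # component-specific regularization: each component mean should drive
    # the governing equation's residual r(x, mu_k) toward zero
    phys = (log_pi.exp() * residual(x, mu) ** 2).sum(-1).mean()
    return nll + alpha * phys

# e.g., for the two-branch (multimodal) relation y^2 = x with 1-D inputs:
# residual = lambda x, mu: mu ** 2 - x
```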
Result: Evaluated on scientific problems with intrinsic multimodality (bifurcation phenomena, stochastic PDEs, atomistic-scale shock dynamics). Compared with conditional flow matching (CFM) and showed MDNs achieve competitive performance with simpler, more interpretable formulation.
Conclusion: Physics-informed MDNs provide an effective framework for multimodal conditional modeling in scientific applications, offering interpretability and physical consistency while maintaining competitive performance with state-of-the-art generative models.
Abstract: Many scientific and engineering systems exhibit intrinsically multimodal behavior arising from latent regime switching and non-unique physical mechanisms. In such settings, learning the full conditional distribution of admissible outcomes in a physically consistent and interpretable manner remains a challenge. While recent advances in machine learning have enabled powerful multimodal generative modeling, their integration with physics-constrained scientific modeling remains nontrivial, particularly when physical structure must be preserved or data are limited. This work develops a physics-informed multimodal conditional modeling framework based on mixture density representations. Mixture density networks (MDNs) provide an explicit and interpretable parameterization of multimodal conditional distributions. Physical knowledge is embedded through component-specific regularization terms that penalize violations of governing equations or physical laws. This formulation naturally accommodates non-uniqueness and stochasticity while remaining computationally efficient and amenable to conditioning on contextual inputs. The proposed framework is evaluated across a range of scientific problems in which multimodality arises from intrinsic physical mechanisms rather than observational noise, including bifurcation phenomena in nonlinear dynamical systems, stochastic partial differential equations, and atomistic-scale shock dynamics. In addition, the proposed method is compared with a conditional flow matching (CFM) model, a representative state-of-the-art generative modeling approach, demonstrating that MDNs can achieve competitive performance while offering a simpler and more interpretable formulation.
[364] Driving Reaction Trajectories via Latent Flow Matching
Yili Shen, Xiangliang Zhang
Main category: cs.LG
TL;DR: LatentRxnFlow models chemical reactions as continuous latent trajectories using Conditional Flow Matching, achieving state-of-the-art accuracy on USPTO benchmarks while providing trajectory-level diagnostics and uncertainty estimation.
Details
Motivation: Current reaction prediction models offer limited insight into reaction processes, either as one-shot mappings or requiring mechanism-specific supervision. There's a need for models that combine predictive accuracy with transparency and diagnostic capabilities.
Method: Proposes LatentRxnFlow, which models reactions as continuous latent trajectories anchored at thermodynamic product states using Conditional Flow Matching. Learns time-dependent latent dynamics directly from reactant-product pairs without mechanistic annotations.
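The training objective reduces to a few lines; below is a generic Conditional Flow Matching loss on a straight-line path (the paper's exact latent path and product-state anchoring are not reproduced here):

```python
import torch

def cfm_loss(v_model, x0, x1, cond):
    """x0: base (noise) samples, x1: target latents, cond: reactant context.

    v_model(x_t, t, cond) predicts a velocity field; on the linear
    interpolant the target velocity is the constant x1 - x0.
    """
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    return ((v_model(xt, t, cond) - (x1 - x0)) ** 2).mean()
```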
Result: Achieves state-of-the-art performance on USPTO benchmarks. Enables trajectory-level diagnostics, failure mode localization, error mitigation via gated inference, and provides intrinsic epistemic uncertainty signals from geometric properties of learned trajectories.
Conclusion: LatentRxnFlow combines strong predictive accuracy with improved transparency, diagnosability, and uncertainty awareness, advancing reaction prediction toward more trustworthy deployment in high-throughput discovery workflows.
Abstract: Recent advances in reaction prediction have achieved near-saturated accuracy on standard benchmarks (e.g., USPTO), yet most state-of-the-art models formulate the task as a one-shot mapping from reactants to products, offering limited insight into the underlying reaction process. Procedural alternatives introduce stepwise generation but often rely on mechanism-specific supervision, discrete symbolic edits, and computationally expensive inference. In this work, we propose LatentRxnFlow, a new reaction prediction paradigm that models reactions as continuous latent trajectories anchored at the thermodynamic product state. Built on Conditional Flow Matching, our approach learns time-dependent latent dynamics directly from standard reactant-product pairs, without requiring mechanistic annotations or curated intermediate labels. While LatentRxnFlow achieves state-of-the-art performance on USPTO benchmarks, more importantly, the continuous formulation exposes the full generative trajectory, enabling trajectory-level diagnostics that are difficult to realize with discrete or one-shot models. We show that latent trajectory analysis allows us to localize and characterize failure modes and to mitigate certain errors via gated inference. Furthermore, geometric properties of the learned trajectories provide an intrinsic signal of epistemic uncertainty, helping prioritize reliably predictable reaction outcomes and flag ambiguous cases for additional validation. Overall, LatentRxnFlow combines strong predictive accuracy with improved transparency, diagnosability, and uncertainty awareness, moving reaction prediction toward more trustworthy deployment in high-throughput discovery workflows.
[365] Analyzing Fairness of Neural Network Prediction via Counterfactual Dataset Generation
Brian Hyeongseok Kim, Jacqueline L. Mitchell, Chao Wang
Main category: cs.LG
TL;DR: Counterfactual dataset analysis method that identifies minimal label changes needed to alter model predictions, revealing training data bias influences.
Details
Motivation: To develop interpretability methods that go beyond input perturbations and instead examine how training data bias affects model predictions through counterfactual dataset analysis.
Method: Proposes counterfactual dataset approach that heuristically ranks and modifies a bounded number of training labels to construct alternative datasets, then retrains models to see if predictions change on test cases.
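A toy version of the loop, with a deliberately crude ranking heuristic standing in for the paper's bias-propagation analysis (binary labels assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def counterfactual_dataset(X, y, x_test, budget=10):
    """Flip at most `budget` training labels to try to change f(x_test)."""
    base = LogisticRegression(max_iter=1000).fit(X, y)
    pred = base.predict(x_test[None])[0]
    # crude influence proxy: similarity to the test point, restricted to
    # examples whose label currently agrees with the prediction
    scores = (X @ x_test) * (y == pred)
    flip = np.argsort(-scores)[:budget]
    y_cf = y.copy()
    y_cf[flip] = 1 - y_cf[flip]
    new = LogisticRegression(max_iter=1000).fit(X, y_cf)
    return flip, new.predict(x_test[None])[0] != pred
```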
Result: Method modifies only small subsets of training labels across 1100+ test cases from 7 fairness datasets, effectively pinpointing critical training examples that drive prediction changes.
Conclusion: Counterfactual dataset analysis provides interpretable way to probe dataset bias and understand connections between training examples and test predictions, offering new fairness assessment approach.
Abstract: Interpreting the inference-time behavior of deep neural networks remains a challenging problem. Existing approaches to counterfactual explanation typically ask: What is the closest alternative input that would alter the model’s prediction in a desired way? In contrast, we explore counterfactual datasets. Rather than perturbing the input, our method efficiently finds the closest alternative training dataset, one that differs from the original dataset by changing a few labels. Training a new model on this altered dataset can then lead to a different prediction of a given test instance. This perspective provides a new way to assess fairness by directly analyzing the influence of label bias on training and inference. Our approach can be characterized as probing whether a given prediction depends on biased labels. Since exhaustively enumerating all possible alternate datasets is infeasible, we develop analysis techniques that trace how bias in the training data may propagate through the learning algorithm to the trained network. Our method heuristically ranks and modifies the labels of a bounded number of training examples to construct a counterfactual dataset, retrains the model, and checks whether its prediction on a chosen test case changes. We evaluate our approach on feedforward neural networks across over 1100 test cases from 7 widely-used fairness datasets. Results show that it modifies only a small subset of training labels, highlighting its ability to pinpoint the critical training examples that drive prediction changes. Finally, we demonstrate how our counterfactual datasets reveal connections between training examples and test cases, offering an interpretable way to probe dataset bias.
[366] Learning Adaptive Distribution Alignment with Neural Characteristic Function for Graph Domain Adaptation
Wei Chen, Xingyu Guo, Shuang Li, Zhao Zhang, Yan Zhong, Fuzhen Zhuang, Deqing Wang
Main category: cs.LG
TL;DR: ADAlign is an adaptive distribution alignment framework for Graph Domain Adaptation that automatically identifies and aligns relevant discrepancies between source and target graphs without manual specification, using a Neural Spectral Discrepancy metric to capture feature-structure dependencies.
Details
Motivation: Existing Graph Domain Adaptation methods rely on manually designed graph filters and heuristic alignment of specific graph elements, making them inflexible and scenario-specific. They struggle when dominant discrepancies vary across transfer scenarios.
Method: Proposes ADAlign with Neural Spectral Discrepancy (NSD) - a parametric distance using neural characteristic functions in the spectral domain to encode feature-structure dependencies. Includes learnable frequency sampler that adaptively emphasizes informative spectral components via minimax optimization.
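The backbone of NSD is the gap between empirical characteristic functions evaluated at sampled frequencies. The sketch below uses fixed Gaussian frequencies and plain vectors, whereas ADAlign feeds spectral-domain graph features and trains the frequency sampler in a minimax loop.

```python
import numpy as np

def ecf(X, T):
    """Empirical characteristic function of rows of X at frequencies T."""
    return np.exp(1j * (X @ T.T)).mean(axis=0)   # (m,) complex

def cf_discrepancy(Xs, Xt, T):
    return float((np.abs(ecf(Xs, T) - ecf(Xt, T)) ** 2).mean())

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(0, 1, (500, 8)), rng.normal(0.5, 1, (500, 8))
print(cf_discrepancy(Xs, Xt, rng.normal(size=(64, 8))))  # shifted domains
```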
Result: Extensive experiments on 10 datasets and 16 transfer tasks show ADAlign outperforms state-of-the-art baselines while achieving efficiency gains with lower memory usage and faster training.
Conclusion: ADAlign provides a flexible, scenario-aware, and robust framework for Graph Domain Adaptation that automatically adapts to diverse distributional shifts without manual intervention.
Abstract: Graph Domain Adaptation (GDA) transfers knowledge from labeled source graphs to unlabeled target graphs but is challenged by complex, multi-faceted distributional shifts. Existing methods attempt to reduce distributional shifts by aligning manually selected graph elements (e.g., node attributes or structural statistics), which typically require manually designed graph filters to extract relevant features before alignment. However, such approaches are inflexible: they rely on scenario-specific heuristics, and struggle when dominant discrepancies vary across transfer scenarios. To address these limitations, we propose \textbf{ADAlign}, an Adaptive Distribution Alignment framework for GDA. Unlike heuristic methods, ADAlign requires no manual specification of alignment criteria. It automatically identifies the most relevant discrepancies in each transfer and aligns them jointly, capturing the interplay between attributes, structures, and their dependencies. This makes ADAlign flexible, scenario-aware, and robust to diverse and dynamically evolving shifts. To enable this adaptivity, we introduce the Neural Spectral Discrepancy (NSD), a theoretically principled parametric distance that provides a unified view of cross-graph shifts. NSD leverages the neural characteristic function in the spectral domain to encode feature-structure dependencies of all orders, while a learnable frequency sampler adaptively emphasizes the most informative spectral components for each task via a minimax paradigm. Extensive experiments on 10 datasets and 16 transfer tasks show that ADAlign not only outperforms state-of-the-art baselines but also achieves efficiency gains with lower memory usage and faster training.
[367] Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
Yongzhong Xu
Main category: cs.LG
TL;DR: Transformers trained on modular arithmetic collapse to 3-4D execution manifolds despite high-dimensional parameter spaces, revealing geometric structure underlying attention concentration and SGD dynamics.
Details
Motivation: To understand the geometric structure of learning dynamics in overparameterized transformer models, particularly how high-dimensional parameter spaces relate to actual computational execution during training.
Method: Carefully controlled modular arithmetic tasks with transformer models (d=128), analyzing training trajectories, dimensional collapse, and geometric properties through projection onto execution manifolds.
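A simple way to reproduce the headline measurement on any training run (a proxy, not necessarily the paper's estimator): PCA on saved parameter checkpoints and count the components needed to reach a variance threshold.

```python
import numpy as np

def trajectory_dimension(checkpoints, var_threshold=0.95):
    """checkpoints: (T, d) flattened parameters saved during training."""
    W = checkpoints - checkpoints.mean(axis=0)
    s = np.linalg.svd(W, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, var_threshold) + 1)
```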
Result: Training trajectories rapidly collapse onto low-dimensional execution manifolds (3-4D), robust across random seeds; sharp attention concentration emerges as saturation along routing coordinates, SGD shows integrable dynamics on execution subspace.
Conclusion: Transformers’ core computation occurs in dramatically reduced subspaces while most parameters absorb optimization interference, providing geometric framework for understanding transformer learning with implications for interpretability and curriculum design.
Abstract: We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces ($d=128$), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension $3$–$4$. This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) stochastic gradient descent (SGD) exhibits approximately integrable dynamics when projected onto the execution subspace, with non-integrability confined to orthogonal staging directions, and (3) sparse autoencoders capture auxiliary routing structure but fail to isolate execution itself, which remains distributed across the low-dimensional manifold. Our results suggest a unifying geometric framework for understanding transformer learning, where the vast majority of parameters serve to absorb optimization interference while core computation occurs in a dramatically reduced subspace. These findings have implications for interpretability, training curriculum design, and understanding the role of overparameterization in neural network learning.
[368] Learning Structure-Semantic Evolution Trajectories for Graph Domain Adaptation
Wei Chen, Xingyu Guo, Shuang Li, Yan Zhong, Zhao Zhang, Fuzhen Zhuang, Hongrui Liu, Libang Zhang, Guo Ye, Huimei He
Main category: cs.LG
TL;DR: DiffGDA: A diffusion-based graph domain adaptation method that models adaptation as continuous-time generative process using SDEs, outperforming state-of-the-art on 14 graph transfer tasks.
Details
Motivation: Existing graph domain adaptation methods use discrete adaptation strategies (intermediate graphs or stepwise alignment) that fail in real-world scenarios where graph structures evolve continuously and nonlinearly. Fixed-step alignment cannot approximate actual transformation processes.
Method: Proposes DiffGDA, a diffusion-based GDA method that models domain adaptation as continuous-time generative process using stochastic differential equations (SDEs). Uses domain-aware network to steer generative process toward target domain, encouraging diffusion trajectory to follow optimal adaptation path.
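At inference time the continuous adaptation amounts to integrating a learned SDE; a bare Euler-Maruyama sketch (drift network assumed trained, constant diffusion coefficient assumed):

```python
import torch

def bridge_sde(z_src, drift_net, steps=100, sigma=0.1):
    """Integrate dz = f_theta(z, t) dt + sigma dW from source latents."""
    z, dt = z_src.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((z.shape[0], 1), i * dt)
        z = z + drift_net(z, t) * dt + sigma * dt ** 0.5 * torch.randn_like(z)
    return z
```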
Result: Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines. Theoretically shows diffusion process converges to optimal solution bridging source and target domains in latent space.
Conclusion: DiffGDA provides a continuous-time framework for graph domain adaptation that better handles real-world graph evolution, achieving superior performance through diffusion-based modeling of structural and semantic transitions.
Abstract: Graph Domain Adaptation (GDA) aims to bridge distribution shifts between domains by transferring knowledge from well-labeled source graphs to given unlabeled target graphs. One promising recent approach addresses graph transfer by discretizing the adaptation process, typically through the construction of intermediate graphs or stepwise alignment procedures. However, such discrete strategies often fail in real-world scenarios, where graph structures evolve continuously and nonlinearly, making it difficult for fixed-step alignment to approximate the actual transformation process. To address these limitations, we propose \textbf{DiffGDA}, a \textbf{Diff}usion-based \textbf{GDA} method that models the domain adaptation process as a continuous-time generative process. We formulate the evolution from source to target graphs using stochastic differential equations (SDEs), enabling the joint modeling of structural and semantic transitions. To guide this evolution, a domain-aware network is introduced to steer the generative process toward the target domain, encouraging the diffusion trajectory to follow an optimal adaptation path. We theoretically show that the diffusion process converges to the optimal solution bridging the source and target domains in the latent space. Extensive experiments on 14 graph transfer tasks across 8 real-world datasets demonstrate DiffGDA consistently outperforms state-of-the-art baselines.
[369] Enhancing Ride-Hailing Forecasting at DiDi with Multi-View Geospatial Representation Learning from the Web
Xixuan Hao, Guicheng Li, Daiqiang Wu, Xusen Guo, Yumeng Zhu, Zhichao Zou, Peng Zhen, Yao Yao, Yuxuan Liang
Main category: cs.LG
TL;DR: MVGR-Net is a two-stage framework for ride-hailing forecasting that learns geospatial representations from POI and temporal mobility patterns, then uses prompt-empowered LLMs with external event integration.
Details
Motivation: Ride-hailing forecasting is crucial for urban mobility optimization but faces challenges from geospatial heterogeneity and external event sensitivity. Existing methods struggle to capture comprehensive regional characteristics.
Method: Two-stage approach: 1) Pretraining learns geospatial representations from Points-of-Interest and temporal mobility patterns (semantic attribute and temporal mobility views). 2) Forecasting uses prompt-empowered LLMs fine-tuned with external events.
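Schematically, the forecasting stage assembles inputs like the following (all names hypothetical; the actual prompt format is not given in the summary):

```python
def forecasting_prompt(region_tokens, recent_demand, events):
    """region_tokens: learned geospatial representation, used as soft
    prefix tokens; the rest is verbalized for the fine-tuned LLM."""
    text = (f"Recent demand: {recent_demand}. "
            f"External events: {', '.join(events) or 'none'}. "
            "Predict ride-hailing demand for the next interval.")
    return region_tokens, text
```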
Result: Extensive experiments on DiDi’s real-world datasets demonstrate state-of-the-art performance in ride-hailing forecasting.
Conclusion: MVGR-Net effectively addresses geospatial heterogeneity and external event challenges in ride-hailing forecasting through multi-view representation learning and LLM integration.
Abstract: The proliferation of ride-hailing services has fundamentally transformed urban mobility patterns, making accurate ride-hailing forecasting crucial for optimizing passenger experience and urban transportation efficiency. However, ride-hailing forecasting faces significant challenges due to geospatial heterogeneity and high susceptibility to external events. This paper proposes MVGR-Net (Multi-View Geospatial Representation Learning), a novel framework that addresses these challenges through a two-stage approach. In the pretraining stage, we learn comprehensive geospatial representations by integrating Points-of-Interest and temporal mobility patterns to capture regional characteristics from both semantic attribute and temporal mobility pattern views. The forecasting stage leverages these representations through a prompt-empowered framework that fine-tunes Large Language Models while incorporating external events. Extensive experiments on DiDi’s real-world datasets demonstrate state-of-the-art performance.
[370] Don’t Eliminate Cut: Exponential Separations in LLM-Based Theorem Proving
Sho Sonoda, Shunta Akiyama, Yuya Uezato
Main category: cs.LG
TL;DR: Theoretical analysis of LLM-guided theorem proving in interactive proof assistants, modeling tactic proposal as stochastic policy in MDP, with separation result showing hierarchical learners outperform flat learners for proof structures with reusable components.
Details
Motivation: To provide theoretical understanding of why LLM-guided theorem proving works empirically despite worst-case hardness, and to justify subgoal decomposition approaches in agentic theorem provers through formal analysis.
Method: Model tactic proposal as stochastic policy in finite-horizon deterministic MDP with state/action spaces as compact metric spaces. Introduce problem distributions generated by reference policy with latent variable model for proof DAGs. Analyze under top-k search protocol with Tsybakov-type margin conditions.
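To get a feel for the separation, plug in small illustrative constants (symbols as in the abstract):

```python
Lam, lam, D = 4, 2, 12   # branching factors and proof-DAG depth
print(Lam ** D, lam ** D)    # 16777216 vs 4096
print(Lam ** D // lam ** D)  # cut-free tree is 4096x the hierarchical size
```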
Result: Derived lower bounds on success probability decomposing into search and learning terms. Main separation result: when cut elimination expands DAG depth D into cut-free tree of size Ω(Λ^D) while cut-aware hierarchical process has size O(λ^D) with λ≪Λ, flat learners require exponentially more data than hierarchical learners.
Conclusion: Provides principled justification for subgoal decomposition in agentic theorem provers, showing hierarchical learning approaches are theoretically superior for proof structures with reusable components like cuts/lemmas/sketches.
Abstract: We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy $q$, including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top-$k$ search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth $D$ into a cut-free tree of size $\Omega(\Lambda^D)$ while the cut-aware hierarchical process has size $O(\lambda^D)$ with $\lambda \ll \Lambda$, a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
[371] Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models
Jonathan Williams, Esin Tureci
Main category: cs.LG
TL;DR: RLTT is a reinforcement learning framework that distributes reward across full latent reasoning trajectories in LoopLMs, solving credit assignment problems and improving mathematical reasoning performance.
Details
Motivation: Standard RL objectives like GRPO only assign credit to final latent states in LoopLMs, creating a mismatch with the model's internal multi-step reasoning process. This fundamental limitation prevents effective RL-based improvement of reasoning capabilities.
Method: RLTT (Reward Latent Thought Trajectories) distributes reward across the full latent reasoning trajectory rather than just the final state. It provides dense, trajectory-level credit assignment without external verifiers and can directly replace GRPO with minimal overhead.
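The core change can be sketched as replacing final-state credit with a weighting over loop steps (the paper's actual weighting scheme is not reproduced here):

```python
import torch

def trajectory_credit(reward, weights):
    """reward: (B,) outcome signal; weights: (T,) over loop iterations.

    A one-hot weight on the last step recovers GRPO-style credit; any
    denser weighting spreads reward across the latent trajectory.
    """
    w = weights / weights.sum()
    return reward[:, None] * w[None, :]   # (B, T) per-step credit
```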
Result: RLTT yields substantial improvements over GRPO on mathematical reasoning benchmarks: +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite math-only training, it also transfers effectively to non-mathematical reasoning benchmarks.
Conclusion: Trajectory-level credit assignment via RLTT effectively improves reasoning in LoopLMs, demonstrating that distributing reward across full latent reasoning trajectories solves the credit assignment problem and enables better RL-based optimization of reasoning capabilities.
Abstract: Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed - standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model’s internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
[372] A Swap-Adversarial Framework for Improving Domain Generalization in Electroencephalography-Based Parkinson’s Disease Prediction
Seongwon Jin, Hanseul Choi, Sunggu Yang, Sungho Park, Jibum Kim
Main category: cs.LG
TL;DR: A new ECoG dataset for Parkinson’s disease prediction from rat models, plus a Swap-Adversarial Framework (SAF) that uses channel swapping and domain-adversarial training to handle inter-subject variability and improve generalization across ECoG and EEG datasets.
Details
Motivation: ECoG offers better spatial resolution than EEG for early PD prediction, but lacks open benchmark datasets due to ethical constraints in human studies. There's also a need to address high inter-subject variability and HDLSS problems in ECoG data.
Method: Proposes SAF with three components: (1) robust preprocessing, (2) Inter-Subject Balanced Channel Swap (ISBCS) for cross-subject augmentation by randomly swapping channels between subjects, and (3) domain-adversarial training to suppress subject-specific bias and learn task-relevant shared features.
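A minimal version of the swap augmentation (the "balanced" sampling rule is simplified to a uniform choice here):

```python
import numpy as np

def isbcs(batch, subject_ids, p=0.2, rng=None):
    """batch: (N, C, T) recordings; swap a fraction p of channels of each
    sample with a randomly chosen sample from a different subject."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = batch.copy()
    n_swap = max(1, int(p * batch.shape[1]))
    for i in range(len(batch)):
        others = np.flatnonzero(subject_ids != subject_ids[i])
        if len(others) == 0:
            continue
        j = rng.choice(others)
        ch = rng.choice(batch.shape[1], size=n_swap, replace=False)
        out[i, ch] = batch[j, ch]
    return out
```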
Result: Method consistently outperformed all baselines in cross-subject, cross-session, and cross-dataset settings, showing most significant improvements in highly variable environments. Achieved superior cross-dataset performance between public EEG benchmarks, demonstrating strong generalization from ECoG to EEG data.
Conclusion: The new ECoG dataset provides first reproducible benchmark for PD prediction, and SAF effectively addresses inter-subject variability and HDLSS problems while enabling robust domain generalization across different brain signal modalities.
Abstract: Electrocorticography (ECoG) offers a promising alternative to conventional electroencephalography (EEG) for the early prediction of Parkinson’s disease (PD), providing higher spatial resolution and a broader frequency range. However, reproducible comparisons have been limited by ethical constraints in human studies and the lack of open benchmark datasets. To address this gap, we introduce a new dataset, the first reproducible benchmark for PD prediction. It is constructed from long-term ECoG recordings of 6-hydroxydopamine (6-OHDA)-induced rat models and annotated with neural responses measured before and after electrical stimulation. In addition, we propose a Swap-Adversarial Framework (SAF) that mitigates high inter-subject variability and the high-dimensional low-sample-size (HDLSS) problem in ECoG data, while achieving robust domain generalization across ECoG and EEG-based Brain-Computer Interface (BCI) datasets. The framework integrates (1) robust preprocessing, (2) Inter-Subject Balanced Channel Swap (ISBCS) for cross-subject augmentation, and (3) domain-adversarial training to suppress subject-specific bias. ISBCS randomly swaps channels between subjects to reduce inter-subject variability, and domain-adversarial training jointly encourages the model to learn task-relevant shared features. We validated the effectiveness of the proposed method through extensive experiments under cross-subject, cross-session, and cross-dataset settings. Our method consistently outperformed all baselines across all settings, showing the most significant improvements in highly variable environments. Furthermore, the proposed method achieved superior cross-dataset performance between public EEG benchmarks, demonstrating strong generalization capability not only within ECoG but also to EEG data. The new dataset and source code will be made publicly available upon publication.
[373] What Makes Value Learning Efficient in Residual Reinforcement Learning?
Guozheng Ma, Lu Li, Haoyu Wang, Zixuan Liu, Pierre-Luc Bacon, Dacheng Tao
Main category: cs.LG
TL;DR: DAWN addresses value learning bottlenecks in residual RL through data-anchored warmup and critic normalization for efficient online policy refinement.
Details
Motivation: Residual RL enables stable online refinement of pretrained policies but suffers from value learning challenges: cold start pathology (critic lacks knowledge around base policy) and structural scale mismatch (residual contributions dwarfed by base actions).
Method: Proposes DAWN with two key components: 1) Base-policy transitions for implicit warmup to anchor value learning, and 2) Critic normalization to restore representation sensitivity for discerning value differences.
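One plausible instantiation of the two fixes (the normalization choice is an assumption; the paper's exact critic design may differ):

```python
import torch
import torch.nn as nn

class NormalizedCritic(nn.Module):
    """Q-network with normalized penultimate features, so that the small
    value differences induced by bounded residual actions stay visible."""
    def __init__(self, d_obs, d_act):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_obs + d_act, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU())
        self.norm = nn.LayerNorm(256)
        self.head = nn.Linear(256, 1)

    def forward(self, obs, act):
        return self.head(self.norm(self.body(torch.cat([obs, act], -1))))

# warmup anchor: before any residual exploration, fill the replay buffer
# with transitions from the frozen base policy and fit the critic on them
```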
Result: DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities by addressing the identified bottlenecks.
Conclusion: Simple yet principled solutions (data-anchored warmup and critic normalization) effectively solve value learning bottlenecks in residual RL, enabling more efficient online policy refinement.
Abstract: Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.
[374] $μ$pscaling small models: Principled warm starts and hyperparameter transfer
Yuxin Ma, Nan Chen, Mateo Díaz, Soufiane Hayou, Dmitriy Kunisky, Soledad Villar
Main category: cs.LG
TL;DR: A principled approach to model upscaling with theoretical guarantees for equivalence between original and widened models, plus hyperparameter transfer techniques for efficient tuning.
Details
Motivation: Large neural networks are released in multiple sizes for different inference budgets, but upscaling smaller trained models to larger ones can be sensitive to hyperparameters that are costly to tune directly at target sizes. Current workarounds using scaling laws may not be reliable with upscaling.
Method: 1) Introduces a general upscaling method based on μP and any-dimensional architectures that guarantees model equivalence between original and widened versions, allowing rigorous analysis of infinite-width limits. 2) Extends μTransfer theory to create hyperparameter transfer techniques specifically for upscaled models.
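For intuition, one classical function-preserving widening (Net2Net-style duplication with split outgoing weights) is sketched below; the paper's μP-based construction additionally controls hyperparameter transfer and the infinite-width limit.

```python
import numpy as np

def widen_hidden(W, b, W_next, new_width, rng=None):
    """Widen hidden layer h = act(W x + b) to new_width units while the
    two-layer map W_next @ act(.) computes exactly the same function
    (assumes an elementwise activation between the layers)."""
    if rng is None:
        rng = np.random.default_rng(0)
    old = W.shape[0]
    idx = rng.integers(0, old, size=new_width - old)   # units to duplicate
    counts = np.bincount(idx, minlength=old) + 1.0     # copies per unit
    W2 = np.vstack([W, W[idx]])                        # copy incoming rows
    b2 = np.concatenate([b, b[idx]])
    Wn = W_next / counts[None, :]                      # split weight mass
    return W2, b2, np.hstack([Wn, Wn[:, idx]])
```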
Result: Empirical demonstration shows the method is effective on realistic datasets and architectures, providing efficient hyperparameter tuning for upscaled models.
Conclusion: Provides principled approaches to model upscaling with theoretical guarantees and practical hyperparameter transfer techniques that work effectively on real-world applications.
Abstract: Modern large-scale neural networks are often trained and released in multiple sizes to accommodate diverse inference budgets. To improve efficiency, recent work has explored model upscaling: initializing larger models from trained smaller ones in order to transfer knowledge and accelerate convergence. However, this method can be sensitive to hyperparameters that need to be tuned at the target upscaled model size, which is prohibitively costly to do directly. It remains unclear whether the most common workaround – tuning on smaller models and extrapolating via hyperparameter scaling laws – is still sound when using upscaling. We address this with principled approaches to upscaling with respect to model widths and efficiently tuning hyperparameters in this setting. First, motivated by $μ$P and any-dimensional architectures, we introduce a general upscaling method applicable to a broad range of architectures and optimizers, backed by theory guaranteeing that models are equivalent to their widened versions and allowing for rigorous analysis of infinite-width limits. Second, we extend the theory of $μ$Transfer to a hyperparameter transfer technique for models upscaled using our method and empirically demonstrate that this method is effective on realistic datasets and architectures.
[375] Bridging the Compression-Precision Paradox: A Hybrid Architecture for Clinical EEG Report Generation with Guaranteed Measurement Accuracy
Wuyang Zhang, Zhen Luo, Chuqiao Gu, Jianming Ma, Yebo Cao, Wangming Yuan, Yinzhi Jin
Main category: cs.LG
TL;DR: A hybrid architecture for automated EEG reporting that separates measurement extraction from text generation to guarantee clinical accuracy, using signal processing for exact values before compression and cross-modal translation with constrained decoding.
Details
Motivation: Current LLM-based EEG monitoring systems face two critical limitations: 1) EEG recordings exceed LLM context windows requiring extreme compression (400:1+ ratios) that destroys fine-grained temporal precision needed for clinical diagnosis, and 2) LLMs lack inherent time-series comprehension and hallucinate clinically incorrect measurement values, where even a 0.5 Hz error can distinguish between different epilepsy syndromes.
Method: Hybrid architecture that separates measurement extraction from text generation. First computes exact clinical values via signal processing before compression, employs a cross-modal bridge for EEG-to-language translation, uses parameter-efficient fine-tuning with constrained decoding around frozen slots, and implements multirate sampling to maintain long-range context while preserving event-level precision.
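The separation is easy to illustrate: the measurement comes from signal processing, and the generated text can only reference it through a frozen slot. A schematic sketch (the real system's constrained decoding is far more involved):

```python
import numpy as np

def dominant_frequency(x, fs):
    """Exact spectral measurement computed before any compression."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return freqs[spec[1:].argmax() + 1]   # skip the DC bin

freq = dominant_frequency(np.random.randn(2560), fs=256)
# the LLM fills the prose; the number is injected verbatim into its slot,
# so the report cannot hallucinate the measurement
report = f"Rhythmic discharges with a dominant frequency of {freq:.1f} Hz."
```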
Result: Evaluation on TUH and CHB-MIT datasets shows 60% fewer false alarms, 50% faster detection, and sub-clinical measurement precision. This is the first system guaranteeing clinical measurement accuracy in automated EEG reports.
Conclusion: The proposed hybrid architecture successfully addresses the limitations of LLM-based EEG monitoring by separating measurement extraction from text generation, ensuring clinical accuracy while maintaining the benefits of language generation for automated reporting.
Abstract: Automated EEG monitoring requires clinician-level precision for seizure detection and reporting. Clinical EEG recordings exceed LLM context windows, requiring extreme compression (400:1+ ratios) that destroys fine-grained temporal precision. A 0.5 Hz error distinguishes absence epilepsy from Lennox-Gastaut syndrome. LLMs lack inherent time-series comprehension and rely on statistical associations from compressed representations. This dual limitation causes systems to hallucinate clinically incorrect measurement values. We separate measurement extraction from text generation. Our hybrid architecture computes exact clinical values via signal processing before compression, employs a cross-modal bridge for EEG-to-language translation, and uses parameter-efficient fine-tuning with constrained decoding around frozen slots. Multirate sampling maintains long-range context while preserving event-level precision. Evaluation on TUH and CHB-MIT datasets achieves 60% fewer false alarms, 50% faster detection, and sub-clinical measurement precision. This is the first system guaranteeing clinical measurement accuracy in automated EEG reports.
[376] Contrastive Learning for Multi Label ECG Classification with Jaccard Score Based Sigmoid Loss
Junichiro Takahashi, Masataka Sato, Satoshi Kodeta, Norihiko Takeda
Main category: cs.LG
TL;DR: Developed a robust ECG encoder for multimodal medical AI using SigLIP with modified loss function, improved multi-label ECG classification through medical knowledge integration and data augmentation techniques.
Details
Motivation: Current multimodal medical AI models like MedGemini have limited ECG performance, and some (MedGemma) don't support ECG data at all. ECG interpretation is challenging with variable diagnostic accuracy, while echocardiography has accessibility limitations. Need robust ECG encoder for multimodal pretraining using real-world hospital data.
Method: Used SigLIP (CLIP-based model with sigmoid loss for multi-label prediction) with modified loss function tailored to ECG data's multi-label nature. Incorporated medical knowledge in language model, increased embedding dimensionality, applied random cropping to mitigate data drift, and conducted per-label analysis.
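The loss modification can be sketched as a pairwise sigmoid objective over an ECG-by-label similarity matrix; the Jaccard-score-based reweighting from the title is omitted here, leaving plain binary cross-entropy over all pairs.

```python
import torch
import torch.nn.functional as F

def sigmoid_multilabel_loss(ecg_emb, label_emb, targets, t=10.0, b=-10.0):
    """ecg_emb: (B, D) encoder outputs; label_emb: (L, D) text embeddings
    of the findings; targets: (B, L) binary matrix of findings present.
    t and b play the role of SigLIP's learnable temperature and bias."""
    logits = ecg_emb @ label_emb.T * t + b
    return F.binary_cross_entropy_with_logits(logits, targets.float())
```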
Result: Modified loss function and medical knowledge integration significantly improved multi-label ECG classification. Increased embedding dimensionality and random cropping enhanced performance. Per-label analysis identified which ECG findings were easier/harder to predict.
Conclusion: Provides foundational framework for developing medical models utilizing ECG data, addressing limitations of current multimodal medical AI in ECG interpretation.
Abstract: Recent advances in large language models (LLMs) have enabled the development of multimodal medical AI. While models such as MedGemini achieve high accuracy on VQA tasks like USMLE MM, their performance on ECG-based tasks remains limited, and some models, such as MedGemma, do not support ECG data at all. Interpreting ECGs is inherently challenging, and diagnostic accuracy can vary depending on the interpreter’s experience. Although echocardiography provides rich diagnostic information, it requires specialized equipment and personnel, limiting its availability. In this study, we focus on constructing a robust ECG encoder for multimodal pretraining using real-world hospital data. We employ SigLIP, a CLIP-based model with a sigmoid-based loss function enabling multi-label prediction, and introduce a modified loss function tailored to the multi-label nature of ECG data. Experiments demonstrate that incorporating medical knowledge in the language model and applying the modified loss significantly improve multi-label ECG classification. To further enhance performance, we increase the embedding dimensionality and apply random cropping to mitigate data drift. Finally, per-label analysis reveals which ECG findings are easier or harder to predict. Our study provides a foundational framework for developing medical models that utilize ECG data.
[377] SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining
Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai
Main category: cs.LG
TL;DR: SnapMLA is an FP8 quantization framework optimized for DeepSeek MLA architecture’s decoding phase, addressing numerical heterogeneity, quantization scale misalignment, and system-level optimization challenges.
Details
Motivation: FP8 attention shows promise but faces challenges when integrated into MLA decoding: numerical heterogeneity from decoupled positional embeddings, quantization scale misalignment in FP8 PV GEMM, and lack of optimized system-level support.
Method: Three hardware-aware algorithm-kernel co-optimization techniques: 1) RoPE-Aware Per-Token KV Quantization (keeping RoPE in high precision), 2) Quantized PV Computation Pipeline Reconstruction (fixing scale misalignment), 3) End-to-End Dataflow Optimization (specialized kernels for efficient data flow).
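The first technique can be simulated in a few lines (float32 stands in for real FP8 storage; the RoPE sub-dimensions are assumed here to be the leading dims):

```python
import torch

FP8_MAX = 448.0  # e4m3 dynamic range

def quantize_kv_per_token(kv, rope_dims):
    """kv: (T, D) latent KV cache. The RoPE part stays high precision; the
    rest gets one scale per token, matching autoregressive decoding."""
    nope = kv[:, rope_dims:]
    scale = nope.abs().amax(dim=-1, keepdim=True) / FP8_MAX
    q = (nope / scale).clamp(-FP8_MAX, FP8_MAX)  # would be cast to fp8
    return kv[:, :rope_dims].clone(), q, scale
```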
Result: Achieves up to 1.91x throughput improvement with negligible performance degradation on challenging long-context tasks including mathematical reasoning and code generation benchmarks.
Conclusion: SnapMLA successfully addresses FP8 integration challenges in MLA decoding through hardware-aware optimizations, delivering significant efficiency gains for long-context tasks while maintaining model quality.
Abstract: While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan-longcat/SGLang-FluentLLM.
[378] Online Min-Max Optimization: From Individual Regrets to Cumulative Saddle Points
Abhijeet Vyas, Brian Bullins
Main category: cs.LG
TL;DR: Online min-max optimization framework with new performance measures and algorithms for various function classes beyond convex-concave settings.
Details
Motivation: Existing min-max optimization frameworks are limited to convex-concave settings and static equilibrium concepts. There's a need for online min-max optimization with performance measures compatible with individual regrets and dynamic environments.
Method: Proposes static duality gap (SDual-Gap_T) and dynamic saddle point regret (DSP-Reg_T) measures. Uses reduction to classic online convex optimization problems. Develops algorithms for strong convexity-strong concavity, min-max exponential concavity, and two-sided Polyak-Łojasiewicz conditions.
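The protocol can be pictured with a plain online gradient descent-ascent learner whose trajectory is then scored by a cumulative duality-gap-style measure (an illustrative baseline, not the paper's algorithms):

```python
import numpy as np

def online_gda(grad_x, grad_y, x0, y0, T, eta=0.1):
    """grad_x(t, x, y), grad_y(t, x, y): gradients of the round-t payoff;
    the min player descends while the max player ascends."""
    x, y, traj = x0, y0, []
    for t in range(T):
        x, y = x - eta * grad_x(t, x, y), y + eta * grad_y(t, x, y)
        traj.append((x.copy(), y.copy()))
    return traj
```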
Result: Established bounds for SDual-Gap_T and DSP-Reg_T under various function classes. Identified a class of functions satisfying min-max exponential concavity that captures two-player portfolio selection. Derived bounds for dynamic regret compatible with individual regrets under PL conditions.
Conclusion: The paper introduces a comprehensive online min-max optimization framework with novel performance measures and provides algorithmic solutions with theoretical guarantees for various function classes beyond traditional convex-concave settings.
Abstract: We propose and study an online version of min-max optimization based on cumulative saddle points under a variety of performance measures beyond convex-concave settings. After first observing the incompatibility of (static) Nash equilibrium (SNE-Reg$_T$) with individual regrets even for strongly convex-strongly concave functions, we propose an alternate \emph{static} duality gap (SDual-Gap$_T$) inspired by the online convex optimization (OCO) framework. We provide algorithms that, using a reduction to classic OCO problems, achieve bounds for SDual-Gap$_T$ and a novel \emph{dynamic} saddle point regret (DSP-Reg$_T$), which we suggest naturally represents a min-max version of the dynamic regret in OCO. We derive our bounds for SDual-Gap$_T$ and DSP-Reg$_T$ under strong convexity-strong concavity and a min-max notion of exponential concavity (min-max EC), and in addition we establish a class of functions satisfying min-max EC that captures a two-player variant of the classic portfolio selection problem. Finally, for a dynamic notion of regret compatible with individual regrets, we derive bounds under a two-sided Polyak-Łojasiewicz (PL) condition.
[379] Gauss-Newton Unlearning for the LLM Era
Lev McKinney, Anvith Thudi, Juhan Bae, Tara Rezaei, Nicolas Papernot, Sheila A. McIlraith, Roger Grosse
Main category: cs.LG
TL;DR: K-FADE: A novel LLM unlearning method using Gauss-Newton steps with K-FAC approximations to efficiently erase unwanted distributions while preserving performance on retained data
Details
Motivation: Standard LLM training can produce undesirable outputs, and existing unlearning methods degrade model performance on desired distributions. Need better trade-off between forgetting unwanted data and retaining good performance.
Method: Uses forget set to compute a few uphill Gauss-Newton steps with parametric Hessian approximations (K-FAC). Transforms output constraints on retain set into weight constraints, minimizing behavior changes on retain data.
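Schematically, the update is an ascent step on the forget loss preconditioned by a curvature approximation; a diagonal Hessian stands in for K-FAC below.

```python
import torch

@torch.no_grad()
def uphill_gn_step(params, grads, curv_diag, lr=1.0, damping=1e-3):
    """params/grads/curv_diag: matching lists of tensors for the forget
    loss. Ascending the preconditioned gradient raises forget-set loss
    while the curvature metric limits movement along directions that
    matter for retained behavior (K-FADE uses K-FAC factors instead)."""
    for p, g, h in zip(params, grads, curv_diag):
        p.add_(lr * g / (h + damping))   # note the +=: an uphill step
```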
Result: Suppresses outputs from forget set, approximates retraining results without forget set, alters retain set outputs less than previous methods. Updates can be reapplied after further training.
Conclusion: K-FADE provides state-of-the-art unlearning for LLMs with better retain set preservation and maintainable unlearning updates.
Abstract: Standard large language model training can create models that produce outputs their trainer deems unacceptable in deployment. The probability of these outputs can be reduced using methods such as LLM unlearning. However, unlearning a set of data (called the forget set) can degrade model performance on other distributions where the trainer wants to retain the model’s behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning approach for LLMs. While Gauss-Newton steps adapt Newton’s method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FADE (K-FAC for Distribution Erasure). Our evaluation on the WMDP and ToFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set and approximates, in output space, the results of retraining without the forget set. Critically, our method does this while altering the outputs on the retain set less than previous methods. This is because K-FADE transforms a constraint on the model’s outputs across the entire retain set into a constraint on the model’s weights, allowing the algorithm to minimally change the model’s behavior on the retain set at each step. Moreover, the unlearning updates computed by K-FADE can be reapplied later if the model undergoes further training, allowing unlearning to be cheaply maintained.
[380] LLM-Based Scientific Equation Discovery via Physics-Informed Token-Regularized Policy Optimization
Boxiao Wang, Kai Li, Tianyi Liu, Chen Li, Junzhe Wang, Yifan Zhang, Jian Cheng
Main category: cs.LG
TL;DR: PiT-PO: A reinforcement learning framework that evolves LLMs into adaptive generators for symbolic regression, enforcing physical validity and structural parsimony through dual constraints.
Details
Motivation: Existing LLM-based symbolic regression frameworks treat LLMs as static generators without updating internal representations based on search feedback, often producing physically inconsistent or mathematically redundant expressions. Method: PiT-PO uses reinforcement learning with a dual-constraint mechanism: hierarchical physical validity enforcement and token-level penalties to suppress redundant structures, evolving LLMs into adaptive generators (a toy reward sketch follows this entry).
Result: Achieves state-of-the-art performance on standard benchmarks, discovers novel turbulence models for fluid dynamics, and enables small-scale models to outperform closed-source giants.
Conclusion: PiT-PO democratizes access to high-performance scientific discovery by evolving LLMs into adaptive generators that produce scientifically consistent and structurally parsimonious equations.
Abstract: Symbolic regression aims to distill mathematical equations from observational data. Recent approaches have successfully leveraged Large Language Models (LLMs) to generate equation hypotheses, capitalizing on their vast pre-trained scientific priors. However, existing frameworks predominantly treat the LLM as a static generator, relying on prompt-level guidance to steer exploration. This paradigm fails to update the model’s internal representations based on search feedback, often yielding physically inconsistent or mathematically redundant expressions. In this work, we propose PiT-PO (Physics-informed Token-regularized Policy Optimization), a unified framework that evolves the LLM into an adaptive generator via reinforcement learning. Central to PiT-PO is a dual-constraint mechanism that rigorously enforces hierarchical physical validity while simultaneously applying fine-grained, token-level penalties to suppress redundant structures. Consequently, PiT-PO aligns LLM to produce equations that are both scientifically consistent and structurally parsimonious. Empirically, PiT-PO achieves state-of-the-art performance on standard benchmarks and successfully discovers novel turbulence models for challenging fluid dynamics problems. We also demonstrate that PiT-PO empowers small-scale models to outperform closed-source giants, democratizing access to high-performance scientific discovery.
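The digest does not spell out PiT-PO's constraints or penalty schedule, so the sketch below is only a hypothetical instance of the dual-constraint idea: a hard validity gate plus a token-level parsimony penalty, with made-up redundancy patterns and weights.

```python
# Hypothetical dual-constraint reward in the spirit of PiT-PO: a hard
# physical-validity gate plus a token-level penalty on redundant structures.
# The validity check and penalty rule are stand-ins, not the paper's actual
# constraints.
import math

REDUNDANT_PATTERNS = ["+ 0", "* 1", "exp(log", "log(exp"]  # illustrative only

def equation_reward(tokens, mse, dimensionally_valid):
    expr = " ".join(tokens)
    if not dimensionally_valid:        # hierarchical gate: invalid physics
        return -1.0                    # gets a flat penalty, no fit credit
    fit = math.exp(-mse)               # fit term in (0, 1]
    redundancy = sum(expr.count(p) for p in REDUNDANT_PATTERNS)
    return fit - 0.1 * redundancy      # token-level parsimony penalty

print(equation_reward(["x", "*", "1", "+", "y"], mse=0.05, dimensionally_valid=True))
print(equation_reward(["x", "+", "y"], mse=0.05, dimensionally_valid=False))
```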
[381] Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers
Feilong Liu
Main category: cs.LG
TL;DR: RoPE positional embeddings analyzed as phase modulation, revealing lower bounds for positional coherence and upper bounds from floating-point precision, defining a “Goldilocks zone” for long-context transformers.
Details
Motivation: Rotary positional embeddings (RoPE) are widely used in LLMs but their behavior at long context lengths remains poorly characterized, leading to issues with positional coherence in long-context models. Method: Reinterpret RoPE as phase modulation applied to complex oscillators, enabling analysis through classical signal processing theory to derive principled lower bounds (aliasing and DC-component stability) and upper bounds (from floating-point precision); a numeric check of the precision wall follows this entry.
Result: Derived precision- and depth-dependent feasibility region for RoPE base parameter; validated framework on state-of-the-art models (LLaMA, Mistral, DeepSeek), showing observed successes/failures align with predicted bounds.
Conclusion: The analysis provides theoretical foundations for RoPE behavior at long contexts, explaining attention collapse and long-range degradation in models violating stability bounds, and revealing a hard precision wall for scaling beyond one million tokens.
Abstract: Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a Goldilocks zone, for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training.
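The paper's bound formulas are not reproduced in this summary; the script below merely demonstrates the precision-wall phenomenon numerically: in low-precision arithmetic, absolute RoPE phases for consecutive positions eventually round to the same representable number.

```python
# Numeric illustration of the RoPE "precision wall": the absolute phase
# p * theta_i for consecutive positions p and p+1 eventually rounds to the
# same value in low precision, erasing position. Demonstrates the
# phenomenon only; the paper's exact bounds are not reproduced here.
import numpy as np

base, d = 10000.0, 128
i = np.arange(d // 2)
theta = base ** (-2.0 * i / d)          # per-pair rotation frequencies

def first_collision(dtype):
    """Smallest (doubling) position p where p*theta and (p+1)*theta round
    identically in the fastest (largest-theta) dimension."""
    p = 1
    while p < 10**8:
        a = np.asarray(p * theta[0], dtype=dtype)
        b = np.asarray((p + 1) * theta[0], dtype=dtype)
        if a == b:
            return p
        p *= 2
    return None

print("float16 phase collision near position:", first_collision(np.float16))
print("float32 phase collision near position:", first_collision(np.float32))
```

In this toy setup the fastest rotary dimension collides near position 2048 in float16 and near $2^{24}$ in float32, which is the flavor of precision wall the paper formalizes.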
[382] When Gradient Clipping Becomes a Control Mechanism for Differential Privacy in Deep Learning
Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen
Main category: cs.LG
TL;DR: Adaptive gradient clipping for differential privacy using weight-only spectral analysis and feedback control
Details
Motivation: Existing adaptive clipping methods for differentially private training rely on per-example gradient statistics, which adds computational overhead and is sensitive to datasets/architectures. A lightweight, architecture-agnostic approach is needed. Method: Proposes control-driven clipping using a weight-only spectral diagnostic: periodically analyzes a weight matrix via spectral decomposition, estimates a heavy-tailed spectral indicator related to training stability, smooths it over time, and uses a bounded feedback controller to update the clipping threshold multiplicatively in the log domain (a sketch of the feedback loop follows this entry).
Result: Method provides adaptive clipping without per-example gradient statistics, reduces computational overhead, and maintains privacy guarantees since updates are post-processing of parameters from privacy-preserving training.
Conclusion: Weight-only spectral diagnostic with feedback control offers efficient, architecture-agnostic adaptive clipping for differentially private training, addressing optimization bias vs. noise trade-off without additional privacy cost.
Abstract: Privacy-preserving training on sensitive data commonly relies on differentially private stochastic optimization with gradient clipping and Gaussian noise. The clipping threshold is a critical control knob: if set too small, systematic over-clipping induces optimization bias; if too large, injected noise dominates updates and degrades accuracy. Existing adaptive clipping methods often depend on per-example gradient norm statistics, adding computational overhead and introducing sensitivity to datasets and architectures. We propose a control-driven clipping strategy that adapts the threshold using a lightweight, weight-only spectral diagnostic computed from model parameters. At periodic probe steps, the method analyzes a designated weight matrix via spectral decomposition and estimates a heavy-tailed spectral indicator associated with training stability. This indicator is smoothed over time and fed into a bounded feedback controller that updates the clipping threshold multiplicatively in the log domain. Because the controller uses only parameters produced during privacy-preserving training, the resulting threshold updates are post-processing and do not increase privacy loss beyond that of the underlying DP optimizer under standard composition accounting.
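A minimal sketch of the feedback loop under stated assumptions: a Hill-style tail-index estimate over singular values stands in for the paper's heavy-tailed spectral indicator, and the controller gains, target, and EMA factor are invented.

```python
# Illustrative weight-only feedback controller for the DP clipping
# threshold: probe a weight matrix's spectrum, estimate a heavy-tail
# indicator, smooth it, and nudge the threshold multiplicatively in the
# log domain. The indicator and all constants are stand-ins.
import numpy as np

def tail_indicator(W, k=10):
    """Crude Hill-style tail-index estimate from the top-k singular values."""
    s = np.linalg.svd(W, compute_uv=False)[:k]
    return 1.0 / np.mean(np.log(s[:-1] / s[-1]))

class ClipController:
    def __init__(self, c0=1.0, target=2.0, gain=0.1, max_step=0.2, ema=0.9):
        self.log_c = np.log(c0)
        self.target, self.gain = target, gain
        self.max_step, self.ema = max_step, ema
        self.smoothed = None

    def update(self, W):
        alpha = tail_indicator(W)                 # probe: weight-only diagnostic
        if self.smoothed is None:
            self.smoothed = alpha
        else:                                     # temporal smoothing (EMA)
            self.smoothed = self.ema * self.smoothed + (1 - self.ema) * alpha
        step = np.clip(self.gain * (self.target - self.smoothed),
                       -self.max_step, self.max_step)   # bounded feedback
        self.log_c += step                        # multiplicative, via log domain
        return float(np.exp(self.log_c))          # new clipping threshold

ctrl = ClipController()
W = np.random.default_rng(0).normal(size=(256, 256))
print("clip threshold after one probe:", ctrl.update(W))
```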
[383] Neural Additive Experts: Context-Gated Experts for Controllable Model Additivity
Guangzhi Xiong, Sanchit Sinha, Aidong Zhang
Main category: cs.LG
TL;DR: Neural Additive Experts (NAEs) is a novel framework that balances interpretability and accuracy by using a mixture of experts approach to relax the rigid additive constraints of Generalized Additive Models while maintaining feature-level interpretability.
Details
Motivation: The paper addresses the core trade-off in machine learning between interpretability and accuracy. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are constrained by their strictly additive nature, limiting predictive performance. Introducing feature interactions can boost accuracy but obscures individual feature contributions. Method: NAEs employ a mixture of experts framework, learning multiple specialized networks per feature. A dynamic gating mechanism integrates information across features, relaxing rigid additive constraints. The authors also propose targeted regularization techniques to mitigate variance among expert predictions, enabling smooth transition from purely additive models to those capturing feature interactions.
Result: Theoretical analysis and experiments on synthetic data illustrate the model’s flexibility. Extensive evaluations on real-world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature-level explanations.
Conclusion: NAEs provide a novel framework that successfully balances interpretability and accuracy by relaxing the additive constraints of traditional GAMs while maintaining clear feature attributions through a mixture of experts approach with targeted regularization.
Abstract: The trade-off between interpretability and accuracy remains a core challenge in machine learning. Standard Generalized Additive Models (GAMs) offer clear feature attributions but are often constrained by their strictly additive nature, which can limit predictive performance. Introducing feature interactions can boost accuracy yet may obscure individual feature contributions. To address these issues, we propose Neural Additive Experts (NAEs), a novel framework that seamlessly balances interpretability and accuracy. NAEs employ a mixture of experts framework, learning multiple specialized networks per feature, while a dynamic gating mechanism integrates information across features, thereby relaxing rigid additive constraints. Furthermore, we propose targeted regularization techniques to mitigate variance among expert predictions, facilitating a smooth transition from an exclusively additive model to one that captures intricate feature interactions while maintaining clarity in feature attributions. Our theoretical analysis and experiments on synthetic data illustrate the model’s flexibility, and extensive evaluations on real-world datasets confirm that NAEs achieve an optimal balance between predictive accuracy and transparent, feature-level explanations. The code is available at https://github.com/Teddy-XiongGZ/NAE.
[384] ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression
Ammar Ali, Baher Mohammad, Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Stamatios Lefkimmiatis
Main category: cs.LG
TL;DR: ROCKET is a training-free model compression method that formulates layer-wise compression as a multi-choice knapsack problem and uses single-step sparse matrix factorization to achieve state-of-the-art compression performance without fine-tuning.
Details
Motivation: The motivation is to develop an efficient model compression method that can achieve high compression rates while maintaining model performance, without requiring extensive fine-tuning or iterative optimization processes. Method: ROCKET uses two key innovations: 1) Formulates layer-wise compression allocation as a multi-choice knapsack problem to select optimal compression levels per layer, and 2) Introduces single-step sparse matrix factorization inspired by dictionary learning that sparsifies weights based on activation-weight sensitivity and updates dictionaries via closed-form least squares (a toy knapsack sketch follows this entry).
Result: ROCKET outperforms existing compression methods at 20-50% compression rates, retains over 90% of original model performance at 30% compression without fine-tuning, and with light fine-tuning can compress Qwen3-14B to 8B parameters while maintaining performance comparable to original Qwen3-8B.
Conclusion: ROCKET provides an effective training-free compression approach that achieves state-of-the-art results through intelligent layer-wise compression allocation and efficient sparse factorization techniques.
Abstract: We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations: First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weight sensitivity and then updates the dictionary in closed form via least squares, bypassing iterative optimization, sparse coding, or backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50% compression rates. Notably, it retains over 90% of the original model’s performance at 30% compression without any fine-tuning. Moreover, when applying a light fine-tuning phase, recovery is substantially enhanced: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at github.com/mts-ai/ROCKET/tree/main.
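The first component is easy to make concrete. Below is a minimal multi-choice knapsack solver that picks one (size, error) compression level per layer under a global budget; the option tables are invented for illustration.

```python
# Minimal multi-choice knapsack for layer-wise compression allocation, as in
# ROCKET's first component: pick exactly one (size, error) option per layer
# minimizing total reconstruction error under a global size budget. The
# option tables below are invented.
INF = float("inf")

def allocate(options, budget):
    """options[l] = list of (size, error); returns (total_error, choices)."""
    dp = {0: (0.0, [])}                    # size_used -> (best_error, picks)
    for layer_opts in options:
        nxt = {}
        for used, (err, picks) in dp.items():
            for choice, (size, e) in enumerate(layer_opts):
                s = used + size
                if s <= budget and (s not in nxt or err + e < nxt[s][0]):
                    nxt[s] = (err + e, picks + [choice])
        dp = nxt
    return min(dp.values()) if dp else (INF, None)

options = [                                # (size in MB, reconstruction error)
    [(10, 0.00), (6, 0.02), (4, 0.08)],    # layer 0 compression levels
    [(10, 0.00), (6, 0.05), (4, 0.20)],    # layer 1
    [(10, 0.00), (6, 0.01), (4, 0.03)],    # layer 2
]
print(allocate(options, budget=20))        # -> minimal error + per-layer levels
```

For real models the option table would come from measured per-layer reconstruction errors at each candidate compression level.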
[385] TRACE: Theoretical Risk Attribution under Covariate-shift Effects
Hosein Anjidani, S. Yahya S. R. Tehrani, Mohammad Mahdi Mojahedian, Mohammad Hossein Yassaee
Main category: cs.LG
TL;DR: TRACE framework decomposes risk change between source and shifted-data models into interpretable factors for diagnostic analysis under covariate shift.
Details
Motivation: When models are retrained on shifted data, their performance on the original domain can change unpredictably. There's a need for interpretable diagnostic tools to understand why performance changes occur during model replacement under covariate shift. Method: Introduces the TRACE framework, which decomposes the absolute risk change into four factors: two generalization gaps, a model change penalty, and a covariate shift penalty. Uses a model sensitivity factor (from high-quantile input gradients) and data-shift measures (Optimal Transport or MMD) to estimate the covariate shift penalty. The model change penalty is controlled by the average output distance between models on target samples (a sketch of these diagnostics follows this entry).
Result: TRACE bound correctly captures scaling of true risk difference with shift magnitude in linear regression. Across synthetic and vision benchmarks, TRACE diagnostics maintain strong monotonic relationship with true performance degradation. Deployment gate score correlates strongly with risk change and achieves high AUROC/AUPRC for gating decisions.
Conclusion: TRACE provides an interpretable, computable diagnostic framework for understanding performance changes during model replacement under covariate shift, enabling safe and label-efficient model deployment decisions.
Abstract: When a source-trained model $Q$ is replaced by a model $\tilde{Q}$ trained on shifted data, its performance on the source domain can change unpredictably. To address this, we study the two-model risk change, $\Delta R := R_P(\tilde{Q}) - R_P(Q)$, under covariate shift. We introduce TRACE (Theoretical Risk Attribution under Covariate-shift Effects), a framework that decomposes $|\Delta R|$ into an interpretable upper bound. This decomposition disentangles the risk change into four actionable factors: two generalization gaps, a model change penalty, and a covariate shift penalty, transforming the bound into a powerful diagnostic tool for understanding why performance has changed. To make TRACE a fully computable diagnostic, we instantiate each term. The covariate shift penalty is estimated via a model sensitivity factor (from high-quantile input gradients) and a data-shift measure; we use feature-space Optimal Transport (OT) by default and provide a robust alternative using Maximum Mean Discrepancy (MMD). The model change penalty is controlled by the average output distance between the two models on the target sample. Generalization gaps are estimated on held-out data. We validate our framework in an idealized linear regression setting, showing the TRACE bound correctly captures the scaling of the true risk difference with the magnitude of the shift. Across synthetic and vision benchmarks, TRACE diagnostics are valid and maintain a strong monotonic relationship with the true performance degradation. Crucially, we derive a deployment gate score that correlates strongly with $|\Delta R|$ and achieves high AUROC/AUPRC for gating decisions, enabling safe, label-efficient model replacement.
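A hedged sketch of the diagnostic ingredients with simple instantiations: an RBF-kernel MMD as the data-shift measure, a high-quantile input-gradient norm as the sensitivity factor, and mean output distance as the model change penalty. Linear models keep the gradients closed-form; how the terms combine in the actual TRACE bound is not reproduced here.

```python
# Toy TRACE-style diagnostics: RBF-kernel MMD for data shift, a 95th-
# percentile input-gradient norm for sensitivity, and mean output distance
# for model change. All constants and the combination rule are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def mmd_rbf(X, Y, gamma=0.5):
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

X_src = rng.normal(size=(200, 3))
X_tgt = rng.normal(loc=0.5, size=(200, 3))          # covariate-shifted inputs
w_q, w_qt = rng.normal(size=3), rng.normal(size=3)  # models Q and Q-tilde

# Sensitivity: high quantile of per-sample input-gradient norms (constant
# for a linear model, but written generally).
grad_norms = np.linalg.norm(np.broadcast_to(w_qt, X_tgt.shape), axis=1)
sensitivity = np.quantile(grad_norms, 0.95)

shift_penalty = sensitivity * np.sqrt(max(mmd_rbf(X_src, X_tgt), 0.0))
model_change = np.abs(X_tgt @ w_qt - X_tgt @ w_q).mean()
print(f"shift penalty ~ {shift_penalty:.3f}, model change ~ {model_change:.3f}")
```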
[386] Hierarchical Zero-Order Optimization for Deep Neural Networks
Sansheng Cao, Zhengyu Ma, Yonghong Tian
Main category: cs.LG
TL;DR: HZO optimization reduces query complexity from O(ML²) to O(ML log L) for deep networks by decomposing depth dimension, achieving competitive accuracy with backpropagation on image datasets.
Details
Motivation: Zeroth-order optimization has biological plausibility and handles non-differentiable objectives, but its computational complexity limits application in deep neural networks. The paper challenges the layer-by-layer gradient propagation paradigm. Method: Hierarchical Zeroth-Order (HZO) optimization uses a divide-and-conquer strategy that decomposes the depth dimension of neural networks. It operates near the unitary limit (L_lip ≈ 1) for numerical stability (the basic zeroth-order estimator it builds on is sketched after this entry).
Result: HZO reduces query complexity from O(ML²) to O(ML log L) for networks of width M and depth L. Extensive evaluations on CIFAR-10 and ImageNet show competitive accuracy compared to backpropagation.
Conclusion: HZO represents a significant advancement in zeroth-order optimization for deep neural networks, offering biological plausibility while achieving practical performance comparable to backpropagation.
Abstract: Zeroth-order (ZO) optimization has long been favored for its biological plausibility and its capacity to handle non-differentiable objectives, yet its computational complexity has historically limited its application in deep neural networks. Challenging the conventional paradigm that gradients propagate layer-by-layer, we propose Hierarchical Zeroth-Order (HZO) optimization, a novel divide-and-conquer strategy that decomposes the depth dimension of the network. We prove that HZO reduces the query complexity from $O(ML^2)$ to $O(ML \log L)$ for a network of width $M$ and depth $L$, representing a significant leap over existing ZO methodologies. Furthermore, we provide a detailed error analysis showing that HZO maintains numerical stability by operating near the unitary limit ($L_{lip} \approx 1$). Extensive evaluations on CIFAR-10 and ImageNet demonstrate that HZO achieves competitive accuracy compared to backpropagation.
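HZO's divide-and-conquer over depth is the paper's contribution and is not reproduced here; the sketch shows only the standard two-point zeroth-order gradient estimator that query-based methods of this kind build on.

```python
# Standard two-point zeroth-order gradient estimator: no backpropagation,
# only forward evaluations of the objective along random directions.
import numpy as np

rng = np.random.default_rng(0)

def zo_gradient(f, x, mu=1e-3, n_samples=32):
    """Average of (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random u."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / n_samples

f = lambda x: np.sum(x ** 2)               # toy objective, true gradient 2x
x = np.array([1.0, -2.0, 3.0])
print("ZO estimate:", zo_gradient(f, x))   # approximately [2, -4, 6]
print("true grad:  ", 2 * x)
```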
[387] Learning Page Order in Shuffled WOO Releases
Efe Kahraman, Giulio Tosato
Main category: cs.LG
TL;DR: Document page ordering using page embeddings on heterogeneous Dutch freedom of information documents, comparing pointer networks, seq2seq transformers, and pairwise ranking models, with best performance on documents up to 15 pages.
Details
Motivation: The paper addresses the challenge of ordering shuffled pages in heterogeneous document collections (emails, legal texts, spreadsheets) compiled into single PDFs, where traditional semantic ordering signals are unreliable. This is a practical problem in document processing and information retrieval. Method: Five methods are compared, including pointer networks, seq2seq transformers, specialized pairwise ranking models, and curriculum learning approaches. The study uses page embeddings on 5,461 shuffled WOO documents (Dutch freedom of information releases) and analyzes attention patterns and positional encodings (the evaluation metric is illustrated after this entry).
Result: Best performance successfully reorders documents up to 15 pages with Kendall’s tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15-page documents. Key findings: seq2seq transformers fail to generalize on long documents (tau drops from 0.918 to 0.014), curriculum learning underperforms direct training by 39% on long documents, and model specialization improves longer documents by +0.21 tau.
Conclusion: Document page ordering requires different strategies for short vs. long documents, explaining why curriculum learning fails. Learned positional encodings contribute to seq2seq failure but multiple interacting causes exist. Model specialization is effective for longer documents.
Abstract: We investigate document page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections such as emails, legal texts, and spreadsheets compiled into single PDFs, where semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best performing approach successfully reorders documents up to 15 pages, with Kendall’s tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15-page documents. We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall’s tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest learned positional encodings are one contributing factor to seq2seq failure, though the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, explaining why curriculum learning fails. Model specialization achieves substantial improvements on longer documents (+0.21 tau).
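The headline metric is Kendall's tau between predicted and true page order; scoring one document is a one-liner with scipy (the orders below are made up).

```python
# Scoring a predicted page order against the ground truth with Kendall's
# tau: 1.0 is a perfect ordering, values near 0 indicate a random shuffle.
from scipy.stats import kendalltau

true_order = [0, 1, 2, 3, 4]            # ground-truth page sequence
pred_order = [0, 2, 1, 3, 4]            # model's predicted sequence

tau, _ = kendalltau(true_order, pred_order)
print(f"Kendall's tau = {tau:.3f}")     # one adjacent swap -> 0.8
```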
[388] Roughness-Informed Federated Learning
Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen
Main category: cs.LG
TL;DR: RI-FedAvg improves federated learning in non-IID settings by using a Roughness Index regularization to mitigate client drift and enhance convergence.
Details
Motivation: Federated Learning faces challenges in non-IID settings due to client drift, which impairs convergence and model performance. Existing methods struggle with heterogeneous data distributions across clients. Method: Proposes RI-FedAvg algorithm that incorporates a Roughness Index-based regularization term into local objectives. The RI quantifies roughness of high-dimensional loss functions and adaptively penalizes updates based on local loss landscape fluctuations.
Result: RI-FedAvg outperforms state-of-the-art baselines (FedAvg, FedProx, FedDyn, SCAFFOLD, DP-FedAvg) on MNIST, CIFAR-10, and CIFAR-100 datasets, achieving higher accuracy and faster convergence in non-IID scenarios.
Conclusion: RI-FedAvg enhances robustness and efficiency of federated learning in practical heterogeneous environments by effectively mitigating client drift through roughness-aware regularization.
Abstract: Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy, yet faces challenges in non-independent and identically distributed (non-IID) settings due to client drift, which impairs convergence. We propose RI-FedAvg, a novel FL algorithm that mitigates client drift by incorporating a Roughness Index (RI)-based regularization term into the local objective, adaptively penalizing updates based on the fluctuations of local loss landscapes. This paper introduces RI-FedAvg, leveraging the RI to quantify the roughness of high-dimensional loss functions, ensuring robust optimization in heterogeneous settings. We provide a rigorous convergence analysis for non-convex objectives, establishing that RI-FedAvg converges to a stationary point under standard assumptions. Extensive experiments on MNIST, CIFAR-10, and CIFAR-100 demonstrate that RI-FedAvg outperforms state-of-the-art baselines, including FedAvg, FedProx, FedDyn, SCAFFOLD, and DP-FedAvg, achieving higher accuracy and faster convergence in non-IID scenarios. Our results highlight RI-FedAvg’s potential to enhance the robustness and efficiency of federated learning in practical, heterogeneous environments.
[389] Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo
Main category: cs.LG
TL;DR: BNRM is a Bayesian non-negative reward modeling framework that uses sparse latent factors to mitigate reward hacking in LLM alignment, improving robustness and interpretability.
Details
Motivation: Human preference-based reward models for LLM alignment are vulnerable to reward hacking due to noisy annotations and systematic biases like response length or style, which can lead to over-optimization and poor generalization. Method: Proposes the Bayesian Non-Negative Reward Model (BNRM), integrating non-negative factor analysis into the Bradley-Terry preference model. Uses a sparse, non-negative latent factor generative process with instance-specific latent variables for disentangled reward representations and global latent factor sparsity for implicit debiasing. Develops an amortized variational inference network conditioned on deep model representations for efficient end-to-end training.
Result: BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions compared to strong baselines.
Conclusion: BNRM provides a principled framework for robust uncertainty-aware reward learning that addresses reward hacking through disentanglement and debiasing mechanisms, enabling more reliable LLM alignment via human feedback.
Abstract: Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
[390] Learning Mixture Density via Natural Gradient Expectation Maximization
Yutao Chen, Jasmine Bayrooti, Steven Morad
Main category: cs.LG
TL;DR: Proposes natural gradient expectation maximization (nGEM) to improve training of mixture density networks by leveraging information geometry, achieving faster convergence and better scaling to high-dimensional data.
Details
Motivation: Standard maximum likelihood training of mixture density networks using negative log-likelihood suffers from slow convergence and mode collapse. The authors aim to improve optimization by integrating information geometry principles. Method: Interpret mixture density networks as deep latent-variable models and analyze them through an expectation maximization framework. Derive the natural gradient expectation maximization (nGEM) objective by exploiting connections between EM and natural gradient descent (the classical EM baseline is sketched after this entry).
Result: Empirically shows nGEM achieves up to 10× faster convergence with minimal computational overhead. Scales well to high-dimensional data where standard NLL fails.
Conclusion: nGEM provides an effective optimization method for mixture density networks by leveraging information geometry, offering significant improvements in convergence speed and scalability.
Abstract: Mixture density networks are neural networks that produce Gaussian mixtures to represent continuous multimodal conditional densities. Standard training procedures involve maximum likelihood estimation using the negative log-likelihood (NLL) objective, which suffers from slow convergence and mode collapse. In this work, we improve the optimization of mixture density networks by integrating their information geometry. Specifically, we interpret mixture density networks as deep latent-variable models and analyze them through an expectation maximization framework, which reveals surprising theoretical connections to natural gradient descent. We then exploit such connections to derive the natural gradient expectation maximization (nGEM) objective. We show that empirically nGEM achieves up to 10$\times$ faster convergence while adding almost zero computational overhead, and scales well to high-dimensional data where NLL otherwise fails.
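The nGEM objective itself is not given in this digest; as a reference point, here is the classical EM loop for a one-dimensional Gaussian mixture, the latent-variable view the paper starts from: responsibilities in the E-step, closed-form weighted updates in the M-step.

```python
# Classical EM for a 1-D Gaussian mixture: E-step computes per-point
# responsibilities, M-step is a closed-form weighted maximum-likelihood
# update. nGEM builds on the connection between this loop and natural
# gradient descent.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])

K = 2
weights = np.full(K, 1 / K)
mu = rng.normal(size=K)
var = np.ones(K)

for _ in range(50):
    # E-step: responsibility of component k for each point
    log_p = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
    r = weights * np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates
    nk = r.sum(axis=0)
    weights = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:", weights.round(2), "means:", mu.round(2))
```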
[391] dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning
Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu
Main category: cs.LG
TL;DR: dnaHNet is a tokenizer-free autoregressive model for genomic sequences that uses differentiable dynamic chunking to adaptively compress raw nucleotides into latent tokens, achieving better scaling and efficiency than existing architectures while preserving biological coherence.
Details
Motivation: Genomic foundation models face a fundamental tradeoff: fixed-vocabulary tokenizers fragment biologically meaningful motifs, while nucleotide-level models preserve biological coherence but have prohibitive computational costs for long contexts. Method: Introduces dnaHNet, a tokenizer-free autoregressive model with differentiable dynamic chunking mechanism that segments and models genomic sequences end-to-end, adaptively compressing raw nucleotides into latent tokens to balance compression with predictive accuracy.
Result: Outperforms leading architectures including StripedHyena2 in scaling and efficiency, achieves >3× inference speedup over Transformers, and shows superior performance on zero-shot tasks like predicting protein variant fitness and gene essentiality while automatically discovering hierarchical biological structures.
Conclusion: dnaHNet establishes a scalable, interpretable framework for next-generation genomic modeling that overcomes the tokenization tradeoff in genomic foundation models.
Abstract: Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.
[392] VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
Main category: cs.LG
TL;DR: VESPO: Variational sEquence-level Soft Policy Optimization - a stable RL training method for LLMs that addresses policy staleness and distribution shift through variational optimization with closed-form reshaping kernel.
Details
Motivation: Training stability is a major challenge in RL for LLMs due to policy staleness, asynchronous training, and mismatches between training/inference engines causing behavior policy divergence, risking training collapse. Importance sampling helps but suffers from high variance, with existing remedies lacking a unified theoretical foundation. Method: Proposes VESPO (Variational sEquence-level Soft Policy Optimization), which incorporates variance reduction into a variational formulation over proposal distributions, deriving a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization (the general weighting pattern is sketched after this entry).
Result: Experiments on mathematical reasoning benchmarks show VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, delivering consistent gains across both dense and Mixture-of-Experts models.
Conclusion: VESPO provides a theoretically grounded solution to RL training stability for LLMs, enabling stable training under challenging conditions like high staleness and asynchronous execution.
Abstract: Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
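VESPO's closed-form kernel is not spelled out in this digest, so the sketch below shows only the general pattern: sequence-level importance weights (summed token log-ratios, no length normalization) passed through a smooth, variance-reducing reshaping. The tempering used here is a stand-in, not the paper's kernel.

```python
# Sequence-level importance weights with a stand-in smooth reshaping.
# Log-ratios are summed over tokens (no length normalization), then
# tempered and self-normalized for variance reduction.
import numpy as np

rng = np.random.default_rng(0)

# Per-token log-probs for 4 sampled sequences of different lengths.
logp_current = [rng.normal(-2.0, 0.3, size=n) for n in (8, 12, 20, 20)]
logp_behavior = [lp + rng.normal(0, 0.2, size=lp.shape) for lp in logp_current]

# Sequence-level log importance weights: sum over tokens, no normalization.
log_w = np.array([lc.sum() - lb.sum()
                  for lc, lb in zip(logp_current, logp_behavior)])

def reshape_weights(log_w, tau=0.5):
    """Stand-in reshaping: temper in log space, then self-normalize."""
    w = np.exp(tau * (log_w - log_w.max()))   # max-shift for stability
    return w / w.sum()

print("raw weights:     ", np.exp(log_w).round(3))
print("reshaped weights:", reshape_weights(log_w).round(3))
```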
[393] Just on Time: Token-Level Early Stopping for Diffusion Language Models
Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv
Main category: cs.LG
TL;DR: Training-free token-level early stopping method for diffusion language models that identifies when individual tokens converge during generation, reducing computational cost without quality loss.
Details
Motivation: Diffusion language models are computationally inefficient because many tokens stabilize early in the denoising process, but current approaches continue processing all tokens until the final step. Method: Introduces a training-free, token-level early stopping approach that uses lightweight signals from model predictions and local context to dynamically determine when individual tokens can be finalized, enabling adaptive per-token freezing without task-specific fine-tuning (a toy freezing loop follows this entry).
Result: Achieves state-of-the-art efficiency gains across diverse benchmarks (mathematical reasoning, general QA, scientific understanding) while preserving generation quality, substantially reducing total diffusion steps required.
Conclusion: The method provides an effective way to improve computational efficiency of diffusion language models through adaptive token-level early stopping without compromising output quality.
Abstract: Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model’s predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
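The paper's exact convergence signals are not specified in this summary; this toy loop freezes a token once its argmax prediction has been stable for a few consecutive denoising steps with confidence above a threshold (the patience and threshold values are invented).

```python
# Toy token-level early stopping for an iterative denoiser: freeze a token
# once its argmax has been unchanged for `patience` steps with confidence
# above `tau`. Random logits stand in for the denoiser's predictions.
import numpy as np

rng = np.random.default_rng(0)
T, L, V = 10, 6, 50                      # steps, sequence length, vocab
patience, tau = 2, 0.5

frozen = np.full(L, -1)                  # -1 = not yet finalized
stable = np.zeros(L, dtype=int)
prev = np.full(L, -1)

for t in range(T):
    logits = rng.normal(size=(L, V))     # stand-in denoiser output
    logits[:, 7] += t                    # token 7 grows increasingly dominant
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    pred, conf = probs.argmax(-1), probs.max(-1)
    stable = np.where(pred == prev, stable + 1, 0)
    prev = pred
    newly = (frozen == -1) & (stable >= patience) & (conf > tau)
    frozen[newly] = pred[newly]          # finalize converged tokens
    print(f"step {t}: {int((frozen >= 0).sum())}/{L} tokens frozen")
    if (frozen >= 0).all():
        break                            # all positions finalized early
```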
[394] On the Role of Consistency Between Physics and Data in Physics-Informed Neural Networks
Nicolás Becerra-Zuniga, Lucas Lacasa, Eusebio Valero, Gonzalo Rubio
Main category: cs.LG
TL;DR: PINNs face accuracy limitations due to data-to-PDE inconsistencies, with an intrinsic “consistency barrier” that sets a lower bound on error based on data fidelity mismatches.
Details
Motivation: PINNs are widely used for PDE modeling with limited data, but real-world data often contains inconsistencies with governing equations due to noise, discretization errors, or modeling assumptions. The impact of these data-to-PDE inconsistencies on PINN accuracy and convergence is not well understood. Method: Systematic analysis of data inconsistency effects on PINN accuracy using the 1D viscous Burgers equation with a manufactured analytical solution. PINNs are trained with datasets of progressively increasing numerical accuracy and perfectly consistent analytical data to isolate and quantify the consistency barrier (the composite loss is sketched after this entry).
Result: PINNs can partially mitigate low-fidelity data using PDE residuals and recover dominant physical structure, but training ultimately saturates at an error level dictated by data inconsistency. With high-fidelity numerical data, PINN solutions become indistinguishable from those trained on analytical data, effectively removing the consistency barrier.
Conclusion: Data quality fundamentally limits PINN accuracy through a consistency barrier, but high-fidelity data can overcome this limitation. The findings clarify the interplay between data quality and physics enforcement in PINNs, providing practical guidance for physics-informed surrogate model construction.
Abstract: Physics-informed neural networks (PINNs) have gained significant attention as a surrogate modeling strategy for partial differential equations (PDEs), particularly in regimes where labeled data are scarce and physical constraints can be leveraged to regularize the learning process. In practice, however, PINNs are frequently trained using experimental or numerical data that are not fully consistent with the governing equations due to measurement noise, discretization errors, or modeling assumptions. The implications of such data-to-PDE inconsistencies on the accuracy and convergence of PINNs remain insufficiently understood. In this work, we systematically analyze how data inconsistency fundamentally limits the attainable accuracy of PINNs. We introduce the concept of a consistency barrier, defined as an intrinsic lower bound on the error that arises from mismatches between the fidelity of the data and the exact enforcement of the PDE residual. To isolate and quantify this effect, we consider the 1D viscous Burgers equation with a manufactured analytical solution, which enables full control over data fidelity and residual errors. PINNs are trained using datasets of progressively increasing numerical accuracy, as well as perfectly consistent analytical data. Results show that while the inclusion of the PDE residual allows PINNs to partially mitigate low-fidelity data and recover the dominant physical structure, the training process ultimately saturates at an error level dictated by the data inconsistency. When high-fidelity numerical data are employed, PINN solutions become indistinguishable from those trained on analytical data, indicating that the consistency barrier is effectively removed. These findings clarify the interplay between data quality and physics enforcement in PINNs, providing practical guidance for the construction and interpretation of physics-informed surrogate models.
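The study's central object, the composite data-plus-residual loss for the 1D viscous Burgers equation $u_t + u u_x = \nu u_{xx}$, can be sketched in a few lines of PyTorch; the network size, collocation points, and loss weights below are illustrative choices, not the paper's setup.

```python
# Composite PINN loss for 1-D viscous Burgers: a data term on (possibly
# inconsistent) observations plus a PDE-residual term on collocation
# points. Architecture and sampling are illustrative.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
nu = 0.01 / torch.pi

def pde_residual(xt):
    """Residual u_t + u*u_x - nu*u_xx at points xt = (x, t)."""
    xt = xt.clone().requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u, xt, torch.ones_like(u), create_graph=True)[0]
    u_x, u_t = grads[:, :1], grads[:, 1:]
    u_xx = torch.autograd.grad(u_x, xt, torch.ones_like(u_x),
                               create_graph=True)[0][:, :1]
    return u_t + u * u_x - nu * u_xx

# Possibly inconsistent "data" (zeros plus noise) and collocation points.
xt_data = torch.rand(64, 2) * 2 - 1
u_data = 0.01 * torch.randn(64, 1)
xt_col = torch.rand(256, 2) * 2 - 1

loss = (torch.mean((net(xt_data) - u_data) ** 2)   # data term
        + torch.mean(pde_residual(xt_col) ** 2))   # physics term
loss.backward()
print("composite loss:", float(loss))
```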
[395] Interpretable Graph-Level Anomaly Detection via Contrast with Normal Prototypes
Qiuran Zhao, Kai Ming Ting, Xinpeng Li
Main category: cs.LG
TL;DR: ProtoGLAD is an interpretable unsupervised framework for graph-level anomaly detection that provides explanations by contrasting anomalies with nearest normal prototype graphs from the dataset.
Details
Motivation: Current deep graph-level anomaly detection methods are black-box and lack interpretability. Existing explanation methods either don't reference normal graphs or use abstract latent vectors rather than concrete dataset graphs, limiting reliability and deployment in real-world applications. Method: Uses a point-set kernel to iteratively discover multiple normal prototype graphs and their associated clusters from the dataset. Identifies anomalies as graphs distant from all discovered normal clusters. Provides explanations by explicitly contrasting each detected anomaly with its nearest normal prototype graph.
Result: Extensive experiments on multiple real-world datasets show ProtoGLAD achieves competitive anomaly detection performance compared to state-of-the-art GLAD methods while providing better human-interpretable prototype-based explanations.
Conclusion: ProtoGLAD addresses interpretability limitations in graph-level anomaly detection by providing concrete prototype-based explanations, making it more reliable and deployable in real-world applications.
Abstract: The task of graph-level anomaly detection (GLAD) is to identify anomalous graphs that deviate significantly from the majority of graphs in a dataset. While deep GLAD methods have shown promising performance, their black-box nature limits their reliability and deployment in real-world applications. Although some recent methods have made attempts to provide explanations for anomaly detection results, they either provide explanations without referencing normal graphs, or rely on abstract latent vectors as prototypes rather than concrete graphs from the dataset. To address these limitations, we propose Prototype-based Graph-Level Anomaly Detection (ProtoGLAD), an interpretable unsupervised framework that provides an explanation for each detected anomaly by explicitly contrasting with its nearest normal prototype graph. It employs a point-set kernel to iteratively discover multiple normal prototype graphs and their associated clusters from the dataset, then identifies graphs distant from all discovered normal clusters as anomalies. Extensive experiments on multiple real-world datasets demonstrate that ProtoGLAD achieves competitive anomaly detection performance compared to state-of-the-art GLAD methods while providing better human-interpretable prototype-based explanations.
[396] Weight Decay Improves Language Model Plasticity
Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade
Main category: cs.LG
TL;DR: Weight decay during LLM pretraining improves model plasticity for downstream fine-tuning, leading to better adaptation despite potentially worse base model performance.
Details
Motivation: Current LLM development focuses on base model validation loss but ignores downstream adaptability. The paper aims to study pretraining from the perspective of model plasticity, the ability to successfully adapt to downstream tasks through fine-tuning. Method: Systematic experiments focusing on weight decay as a key regularization parameter during pretraining. Investigates how different weight decay values affect model plasticity and downstream fine-tuning performance.
Result: Models trained with larger weight decay values are more plastic, showing larger performance gains when fine-tuned. This creates counterintuitive trade-offs where worse-performing base models can outperform better ones after fine-tuning. Weight decay encourages linearly separable representations, regularizes attention matrices, and reduces overfitting.
Conclusion: Evaluation metrics beyond cross-entropy loss are important for hyperparameter optimization. A single optimization hyperparameter like weight decay plays a multifaceted role in shaping model behavior and downstream adaptability.
Abstract: The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model’s validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay’s mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
[397] Pupillometry and Brain Dynamics for Cognitive Load in Working Memory
Nusaibah Farrukh, Malavika Pradeep, Akshay Sasi, Rahul Venugopal, Elizabeth Sherly
Main category: cs.LG
TL;DR: Pupillometry can compete with EEG for cognitive load classification using interpretable feature-based approaches, challenging EEG’s necessity and supporting wearable monitoring systems.
Details
Motivation: Cognitive load assessment is crucial for adaptive learning, clinical monitoring, and brain-computer interfaces. While EEG and pupillometry are established biomarkers, their comparative utility and practical integration as lightweight, wearable solutions remain underexplored. Current deep learning approaches lack interpretability and are computationally expensive. Method: Integrated feature-based and model-driven approaches using the OpenNeuro ‘Digit Span Task’ dataset. Applied Catch-22 features with classical machine learning models for cognitive load classification from EEG and pupillometry data, comparing performance in binary and multiclass tasks.
Result: Feature-based approaches with classical ML outperformed deep learning. Pupillometry alone could compete with EEG for cognitive load classification, challenging the assumption that EEG is necessary. SHAP-based feature analysis provided physiologically meaningful insights into pupil dynamics.
Conclusion: Pupillometry serves as a portable, practical proxy for cognitive load detection, supporting development of wearable, affordable monitoring systems for neuropsychiatry, education, and healthcare without requiring resource-intensive EEG.
Abstract: Cognitive load, the mental effort required during working memory, is central to neuroscience, psychology, and human-computer interaction. Accurate assessment is vital for adaptive learning, clinical monitoring, and brain-computer interfaces. Physiological signals such as pupillometry and electroencephalography are established biomarkers of cognitive load, but their comparative utility and practical integration as lightweight, wearable monitoring solutions remain underexplored. EEG provides high temporal resolution of neural activity. Although non-invasive, it is technologically demanding and limited in wearability and cost due to its resource-intensive nature, whereas pupillometry is non-invasive, portable, and scalable. Existing studies often rely on deep learning models with limited interpretability and substantial computational expense. This study integrates feature-based and model-driven approaches to advance time-series analysis. Using the OpenNeuro ‘Digit Span Task’ dataset, this study investigates cognitive load classification from EEG and pupillometry. Feature-based approaches using Catch-22 features and classical machine learning models outperform deep learning in both binary and multiclass tasks. The findings demonstrate that pupillometry alone can compete with EEG, serving as a portable and practical proxy for real-world applications. These results challenge the assumption that EEG is necessary for load detection, showing that pupil dynamics combined with interpretable models and SHAP based feature analysis provide physiologically meaningful insights. This work supports the development of wearable, affordable cognitive monitoring systems for neuropsychiatry, education, and healthcare.
[398] Diffusion-Pretrained Dense and Contextual Embeddings
Sedigheh Eslami, Maksim Gaiduk, Markus Krimmel, Louis Milliken, Bo Wang, Denis Bykov
Main category: cs.LG
TL;DR: pplx-embed is a family of multilingual embedding models using diffusion-pretrained language models with multi-stage contrastive learning for web-scale retrieval, offering both standard and contextualized embedding variants.
Details
Motivation: To develop effective multilingual embedding models for large-scale retrieval that can capture comprehensive bidirectional context and preserve global document information, addressing limitations in existing retrieval systems for real-world web-scale applications. Method: Uses a diffusion-pretrained language model backbone with multi-stage contrastive learning; leverages bidirectional attention from diffusion pretraining; employs mean pooling and a late chunking strategy to preserve global context; offers two model types: standard retrieval (pplx-embed-v1) and contextualized embeddings (pplx-embed-context-v1). Late chunking is illustrated after this entry.
Result: pplx-embed-v1 achieves competitive performance on MTEB (Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet benchmarks; pplx-embed-context-v1 sets new records on ConTEB benchmark; both show strong performance on internal large-scale search evaluation with tens of millions of documents.
Conclusion: The diffusion-pretrained approach enables effective bidirectional context capture for multilingual embeddings, validating the models’ effectiveness in production-scale retrieval environments where both quality and efficiency are critical.
Abstract: In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models’ effectiveness in production environments where retrieval quality and efficiency are critical at scale.
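Late chunking is simple to illustrate: encode the whole document once so every token embedding carries global context, then mean-pool the token embeddings within each chunk span. A random matrix stands in for the encoder here.

```python
# Late chunking in miniature: pool *document-level* token embeddings per
# chunk span, instead of re-encoding each chunk in isolation. The token
# embeddings below are random stand-ins for encoder output.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, dim = 300, 64
token_embs = rng.normal(size=(n_tokens, dim))     # stand-in contextual embeddings

chunk_spans = [(0, 100), (100, 200), (200, 300)]  # token index ranges

chunk_embs = np.stack([token_embs[a:b].mean(axis=0) for a, b in chunk_spans])
chunk_embs /= np.linalg.norm(chunk_embs, axis=1, keepdims=True)
print(chunk_embs.shape)                           # (3, 64), one vector per chunk
```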
[399] Generative clinical time series models trained on moderate amounts of patient data are privacy preserving
Rustam Zhumagambetov, Niklas Giesa, Sebastian D. Boie, Stefan Haufe
Main category: cs.LG
TL;DR: Privacy audit of generative AI models for hospital time series data shows established privacy attacks are ineffective when trained on large datasets, and differential privacy mechanisms offer little benefit while reducing utility.
Details
Motivation: Medical data sharing for ML training is limited by privacy concerns. Synthetic data from generative AI models is seen as a solution, but privacy protection isn't guaranteed. Current privacy mechanisms (k-anonymization, differential privacy) have limitations for time series models, making privacy audits essential. Method: Used a battery of established privacy attacks to audit state-of-the-art hospital time series models trained on the MIMIC-IV dataset. Also used the eICU dataset to mount privacy attacks against synthetic data generators trained on MIMIC-IV. Evaluated the effectiveness of differential privacy mechanisms for these models (a minimal membership test is sketched after this entry).
Result: Established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough training datasets. Existing differential privacy mechanisms would not improve privacy but would decrease utility for ML prediction tasks.
Conclusion: For large-scale hospital time series datasets, current privacy attacks are ineffective against synthetic data generators, and applying differential privacy offers minimal privacy benefit while significantly reducing data utility for downstream ML tasks.
Abstract: Sharing medical data for machine learning model training purposes is often impossible due to the risk of disclosing identifying information about individual patients. Synthetic data produced by generative artificial intelligence (genAI) models trained on real data is often seen as one possible solution to comply with privacy regulations. While powerful genAI models for heterogeneous hospital time series have recently been introduced, such modeling does not guarantee privacy protection, as the generated data may still reveal identifying information about individuals in the models’ training cohort. Applying established privacy mechanisms to generative time series models, however, proves challenging as post-hoc data anonymization through k-anonymization or similar techniques is limited, while model-centered privacy mechanisms that implement differential privacy (DP) may lead to unstable training, compromising the utility of generated data. Given these known limitations, privacy audits for generative time series models are currently indispensable regardless of the concrete privacy mechanisms applied to models and/or data. In this work, we use a battery of established privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset, with respect to privacy preservation. Furthermore, the eICU dataset was used to mount a privacy attack against the synthetic data generator trained on the MIMIC-IV dataset. Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough training datasets. Furthermore, we discuss how the use of existing DP mechanisms for these synthetic data generators would not bring desired improvement in privacy, but only a decrease in utility for machine learning prediction tasks.
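One of the simplest attacks in audit batteries of this kind is a distance-to-closest-record membership test (not necessarily one the authors used): if the generator memorizes, training records should sit unusually close to some synthetic record compared with held-out records.

```python
# Minimal distance-to-closest-record membership test on toy data: compare
# how close training vs. holdout records lie to their nearest synthetic
# record. Roughly equal TPR and FPR means the attack fails.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))         # records the generator saw
holdout = rng.normal(size=(500, 8))       # records it never saw
synthetic = train + 0.9 * rng.normal(size=train.shape)  # toy "generator"

def min_dists(queries, reference):
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

d_train, d_out = min_dists(train, synthetic), min_dists(holdout, synthetic)
threshold = np.median(np.concatenate([d_train, d_out]))
tpr = (d_train < threshold).mean()        # members flagged as members
fpr = (d_out < threshold).mean()          # non-members flagged as members
print(f"attack TPR={tpr:.2f} vs FPR={fpr:.2f}")
```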
[400] Coarse-Grained Boltzmann Generators
Weilong Chen, Bojun Zhao, Jan Eckwert, Julija Zavadlav
Main category: cs.LG
TL;DR: CG-BGs combine coarse-grained modeling with Boltzmann Generators for scalable, unbiased molecular sampling using learned potentials of mean force.
Details
Motivation: Traditional Boltzmann Generators have limited scalability for large molecular systems, while coarse-grained models often lack proper reweighting for correct statistics. There's a need for a framework that combines the scalability of reduced-order modeling with exact importance sampling.
Method: Propose Coarse-Grained Boltzmann Generators (CG-BGs) that operate in coarse-grained coordinate space, using a learned potential of mean force (PMF) to reweight samples from a flow-based model. The PMF is efficiently learned from force matching with rapidly converged data.
Result: CG-BGs faithfully capture complex interactions mediated by explicit solvent within highly reduced representations, establishing a scalable pathway for unbiased sampling of larger molecular systems.
Conclusion: CG-BGs provide a principled framework that unifies scalable reduced-order modeling with exact importance sampling, enabling unbiased sampling of larger molecular systems through efficient PMF learning.
Abstract: Sampling equilibrium molecular configurations from the Boltzmann distribution is a longstanding challenge. Boltzmann Generators (BGs) address this by combining exact-likelihood generative models with importance sampling, but their practical scalability is limited. Meanwhile, coarse-grained surrogates enable the modeling of larger systems by reducing effective dimensionality, yet often lack the reweighting process required to ensure asymptotically correct statistics. In this work, we propose Coarse-Grained Boltzmann Generators (CG-BGs), a principled framework that unifies scalable reduced-order modeling with the exactness of importance sampling. CG-BGs act in a coarse-grained coordinate space, using a learned potential of mean force (PMF) to reweight samples generated by a flow-based model. Crucially, we show that this PMF can be efficiently learned from rapidly converged data via force matching. Our results demonstrate that CG-BGs faithfully capture complex interactions mediated by explicit solvent within highly reduced representations, establishing a scalable pathway for the unbiased sampling of larger molecular systems.
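The reweighting step at the heart of CG-BGs is standard importance sampling. A toy sketch, with a Gaussian proposal standing in for the flow and a hand-written double-well standing in for the learned PMF:

```python
# Toy sketch of the CG-BG reweighting step: sample from a tractable
# proposal (stand-in for the flow), weight by the Boltzmann factor of a
# toy 1-D potential of mean force, and form a self-normalized estimate.
import numpy as np

rng = np.random.default_rng(0)

def pmf(x):                       # toy double-well PMF (not the learned one)
    return (x**2 - 1.0)**2

# Proposal: a broad Gaussian playing the role of the flow-based generator.
x = rng.normal(0.0, 1.5, size=100_000)
log_q = -0.5 * (x / 1.5)**2 - np.log(1.5 * np.sqrt(2 * np.pi))

kT = 0.5
log_w = -pmf(x) / kT - log_q      # importance log-weights: target / proposal
w = np.exp(log_w - log_w.max())   # numerically stabilized
w /= w.sum()                      # self-normalize

# Asymptotically unbiased estimate of an observable under the Boltzmann density.
print("E[x^2] under exp(-PMF/kT):", np.sum(w * x**2))
```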
[401] Exploring the impact of adaptive rewiring in Graph Neural Networks
Charlotte Cambier van Nooten, Christos Aronis, Yuliya Shapovalova, Lucia Cavallaro
Main category: cs.LG
TL;DR: Sparsification methods in GNNs using Erdős-Rényi techniques improve efficiency for large-scale graph applications like power grid reliability analysis, with adaptive rewiring showing promise for balancing sparsity and learning complex patterns.
Details
Motivation: Address high memory usage and computational costs in large-scale graph applications by applying sparsification methods as regularization in Graph Neural Networks, particularly for critical real-world applications like N-1 contingency assessment in electrical grids.
Method: Apply sparsification techniques from Network Science and Machine Learning, including the Erdős-Rényi model for model sparsification, to Graph Convolutional Networks (GCN) and Graph Isomorphism Networks (GIN). Explore different degrees of sparsification and rewiring across three datasets of varying sizes, with an adaptive rewiring approach that allows the connectivity structure to adapt during training.
Result: Experiments show importance of tuning sparsity parameters - while sparsity can improve generalization, excessive sparsity may hinder learning of complex patterns. Adaptive rewiring combined with early stopping proves promising by allowing model to adapt connectivity structure during training.
Conclusion: Sparsification methods effectively enhance GNN efficiency and scalability for critical applications like power grid reliability analysis, with adaptive rewiring showing particular promise for balancing computational efficiency and model performance.
Abstract: This paper explores sparsification methods as a form of regularization in Graph Neural Networks (GNNs) to address high memory usage and computational costs in large-scale graph applications. Using techniques from Network Science and Machine Learning, including Erdős-Rényi for model sparsification, we enhance the efficiency of GNNs for real-world applications. We demonstrate our approach on N-1 contingency assessment in electrical grids, a critical task for ensuring grid reliability. We apply our methods to three datasets of varying sizes, exploring Graph Convolutional Networks (GCN) and Graph Isomorphism Networks (GIN) with different degrees of sparsification and rewiring. Comparison across sparsification levels shows the potential of combining insights from both research fields to improve GNN performance and scalability. Our experiments highlight the importance of tuning sparsity parameters: while sparsity can improve generalization, excessive sparsity may hinder learning of complex patterns. Our adaptive rewiring approach, particularly when combined with early stopping, proves promising by allowing the model to adapt its connectivity structure during training. This research contributes to understanding how sparsity can be effectively leveraged in GNNs for critical applications like power grid reliability analysis.
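For intuition, here is a minimal sketch of Erdős-Rényi-style weight masking with periodic rewiring, under assumed details (the paper's exact masking and rewiring schedule may differ):

```python
# Assumed-minimal sketch: each connection is kept independently with
# probability p (an Erdős-Rényi random bipartite graph over the weight
# matrix), and the mask is resampled during training ("rewiring").
import numpy as np

rng = np.random.default_rng(0)

def er_mask(shape, p, rng):
    """Bernoulli(p) keep-mask over a weight matrix."""
    return (rng.random(shape) < p).astype(np.float32)

W = rng.normal(size=(64, 32)).astype(np.float32)   # a GNN layer's weights
p = 0.2                                            # target density
mask = er_mask(W.shape, p, rng)

for epoch in range(4):                             # training loop skeleton
    W_sparse = W * mask                            # forward pass uses masked weights
    # ... compute loss / gradients on W_sparse ...
    if epoch % 2 == 1:                             # adaptive rewiring step
        mask = er_mask(W.shape, p, rng)
print("density:", mask.mean())
```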
[402] Evaluation metrics for temporal preservation in synthetic longitudinal patient data
Katariina Perkonoja, Parisa Movahedi, Antti Airola, Kari Auranen, Joni Virta
Main category: cs.LG
TL;DR: Proposes metrics for evaluating temporal preservation in synthetic longitudinal patient data, assessing how well synthetic data reproduces temporal characteristics across marginal, covariance, individual-level, and measurement structures.
Details
Motivation: Current synthetic data generation methods for longitudinal patient data may preserve marginal statistics while distorting important temporal dependencies and individual trajectories, requiring better evaluation metrics.
Method: Develops a set of metrics categorized into four temporal characteristics: marginal structure (overall distributions), covariance structure (relationships between variables), individual-level structure (patient trajectories), and measurement structure (timing patterns).
Result: Shows that strong marginal-level resemblance can hide distortions in covariance and individual trajectories; temporal preservation depends on data quality, measurement frequency, and preprocessing; sparse/irregular measurements reduce temporal resemblance.
Conclusion: No single metric captures temporal preservation adequately; multidimensional evaluation across all characteristics provides comprehensive assessment; metrics enable better evaluation and improvement of generative models for temporally realistic synthetic data.
Abstract: This study introduces a set of metrics for evaluating temporal preservation in synthetic longitudinal patient data, defined as artificially generated data that mimic real patients’ repeated measurements over time. The proposed metrics assess how synthetic data reproduces key temporal characteristics, categorized into marginal, covariance, individual-level and measurement structures. We show that strong marginal-level resemblance may conceal distortions in covariance and disruptions in individual-level trajectories. Temporal preservation is influenced by factors such as original data quality, measurement frequency, and preprocessing strategies, including binning, variable encoding and precision. Variables with sparse or highly irregular measurement times provide limited information for learning temporal dependencies, resulting in reduced resemblance between the synthetic and original data. No single metric adequately captures temporal preservation; instead, a multidimensional evaluation across all characteristics provides a more comprehensive assessment of synthetic data quality. Overall, the proposed metrics clarify how and why temporal structures are preserved or degraded, enabling more reliable evaluation and improvement of generative models and supporting the creation of temporally realistic synthetic longitudinal patient data.
[403] LOREN: Low Rank-Based Code-Rate Adaptation in Neural Receivers
Bram Van Bolderik, Vlado Menkovski, Sonia Heemstra de Groot, Manil Dev Gomony
Main category: cs.LG
TL;DR: LOREN is a low-rank adaptation neural receiver that enables code-rate adaptability with minimal hardware overhead by using small adapters instead of full retraining.
Details
Motivation: Neural network receivers outperform traditional ones but have high memory/power requirements due to needing separate weight sets for each code rate, limiting practical deployment.
Method: Proposes LOREN with lightweight low-rank adaptation adapters integrated into convolutional layers; freezes shared base network and trains only small adapters per code rate; uses end-to-end training over 3GPP CDL channels.
Result: Achieves comparable or superior performance to fully retrained base neural receivers; hardware implementation in 22nm technology shows >65% silicon area savings and up to 15% power reduction for three code rates.
Conclusion: LOREN enables practical deployment of neural receivers by providing code-rate adaptability with minimal hardware overhead while maintaining performance.
Abstract: Neural network based receivers have recently demonstrated superior system-level performance compared to traditional receivers. However, their practicality is limited by high memory and power requirements, as separate weight sets must be stored for each code rate. To address this challenge, we propose LOREN, a Low Rank-Based Code-Rate Adaptation Neural Receiver that achieves adaptability with minimal overhead. LOREN integrates lightweight low rank adaptation adapters (LOREN adapters) into convolutional layers, freezing a shared base network while training only small adapters per code rate. An end-to-end training framework over 3GPP CDL channels ensures robustness across realistic wireless environments. LOREN achieves comparable or superior performance relative to fully retrained base neural receivers. The hardware implementation of LOREN in 22nm technology shows more than 65% savings in silicon area and up to 15% power reduction when supporting three code rates.
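A hedged sketch of the general pattern (frozen shared conv, trainable low-rank bypass per code rate); the module name, rank, and initialization below are illustrative, not the paper's exact adapter:

```python
# Minimal sketch, assuming a LoRA-style bypass on a conv layer: the base
# conv is frozen and only the rank-r path (k×k conv into r channels, then
# a 1×1 conv) is trained per code rate.
import torch
import torch.nn as nn

class LowRankConvAdapter(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, rank=4):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        for p in self.base.parameters():       # shared backbone stays frozen
            p.requires_grad = False
        self.down = nn.Conv2d(in_ch, rank, k, padding=k // 2, bias=False)
        self.up = nn.Conv2d(rank, out_ch, 1, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

layer = LowRankConvAdapter(16, 16)
y = layer(torch.randn(2, 16, 8, 8))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, "trainable params:", trainable)
```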
[404] Domain Knowledge Guided Bayesian Optimization For Autonomous Alignment Of Complex Scientific Instruments
Aashwin Mishra, Matt Seaberg, Ryan Roussel, Daniel Ratner, Apurva Mehta
Main category: cs.LG
TL;DR: Domain knowledge guided Bayesian Optimization uses physical insight to transform high-dimensional search spaces, aligning coordinate axes with active subspaces to solve needle-in-a-haystack optimization problems.
Details
Motivation: Bayesian Optimization struggles with high-dimensional problems featuring tightly coupled parameters and sparse rewards (needle-in-a-haystack scenarios). As scientific instruments become more complex, robust high-dimensional optimization methods are needed.
Method: Proposes domain knowledge guided Bayesian Optimization that leverages physical insight to transform coordinates, decoupling input features and aligning active subspaces with the primary search axes. Combined with a reverse annealing exploration strategy.
Result: Demonstrated on 12-dimensional, 6-crystal Split-and-Delay optical system where conventional BO methods failed. The approach reliably converges to global optimum, with coordinate transformation being key to success.
Conclusion: Physical insight can transform high-dimensional, coupled optimization problems into simpler representations, enabling rapid automated tuning while retaining current optimization algorithms.
Abstract: Bayesian Optimization (BO) is a powerful tool for optimizing complex non-linear systems. However, its performance degrades in high-dimensional problems with tightly coupled parameters and highly asymmetric objective landscapes, where rewards are sparse. In such needle-in-a-haystack scenarios, even advanced methods like trust-region BO (TuRBO) often lead to unsatisfactory results. We propose a domain knowledge guided Bayesian Optimization approach, which leverages physical insight to fundamentally simplify the search problem by transforming coordinates to decouple input features and align the active subspaces with the primary search axes. We demonstrate this approach’s efficacy on a challenging 12-dimensional, 6-crystal Split-and-Delay optical system, where conventional approaches, including standard BO, TuRBO and multi-objective BO, consistently led to unsatisfactory results. When combined with a reverse annealing exploration strategy, this approach reliably converges to the global optimum. The coordinate transformation itself is the key to this success, significantly accelerating the search by aligning input coordinate axes with the problem’s active subspaces. As increasingly complex scientific instruments, from large telescopes to new spectrometers at X-ray Free Electron Lasers, are deployed, the demand for robust high-dimensional optimization grows. Our results demonstrate a generalizable paradigm: leveraging physical insight to transform high-dimensional, coupled optimization problems into simpler representations can enable rapid and robust automated tuning for consistent high performance while still retaining current optimization algorithms.
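The key claim, that a physics-informed coordinate rotation turns a needle-in-a-haystack search into an axis-aligned one, can be seen on a toy 2-D objective. Everything below is illustrative (random search stands in for BO, and the transform is hand-chosen, not the paper's):

```python
# Toy illustration: f is a narrow ridge coupled across both inputs, but
# separable in rotated coordinates u0 = x0 + x1, u1 = x0 - x1.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-50 * (x[0] + x[1])**2 - 0.5 * (x[0] - x[1])**2)

T = np.array([[1.0, 1.0], [1.0, -1.0]])    # physics-inspired rotation
T_inv = np.linalg.inv(T)

def f_transformed(u):
    return f(T_inv @ u)                    # nearly separable along u's axes

# Axis-aligned search in u-space (a stand-in for BO), tight along the
# known-narrow direction u0.
best_u, best_val = None, -np.inf
for _ in range(2000):
    u = rng.uniform(-2, 2, size=2) * np.array([0.2, 2.0])
    v = f_transformed(u)
    if v > best_val:
        best_u, best_val = u, v
print("best value:", best_val, "at x =", T_inv @ best_u)
```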
[405] Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks
Enrico Ahlers, Daniel Passon, Yannic Noller, Lars Grunske
Main category: cs.LG
TL;DR: FIRE is an inference-time backdoor mitigation approach that manipulates latent representations to neutralize triggers in poisoned models by reversing backdoor directions in feature space.
Details
Motivation: Existing backdoor mitigation strategies are ineffective or inefficient for already deployed vulnerable models, requiring a runtime solution that can neutralize triggers during inference without expensive input modifications.
Method: FIRE hypothesizes that triggers induce structured changes in model representations and treats them as directions in latent spaces between layers. It manipulates latent representations by moving poisoned sample features along backdoor directions in reverse to neutralize triggers.
Result: FIRE shows low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
Conclusion: FIRE provides an effective inference-time defense against backdoor attacks by leveraging the model’s own feature representations to neutralize triggers, offering a practical solution for deployed vulnerable models.
Abstract: Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model’s internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample’s features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.
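The core repair operation is a projection-and-subtract in feature space. A toy sketch with a mean-difference estimate of the backdoor direction (one simple estimator; the paper's procedure for finding directions may differ):

```python
# Conceptual sketch: estimate a backdoor direction in a latent space and
# move features against it at inference time. Clean/poisoned feature sets
# here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0, 1, size=(500, 64))          # latent features, clean inputs
poisoned = clean[:100] + 3.0 * np.ones(64) / 8.0  # trigger shifts features

# Backdoor direction: difference of mean representations (simple estimator).
d = poisoned.mean(0) - clean.mean(0)
d /= np.linalg.norm(d)

def repair(z, d, strength=1.0):
    """Move features in reverse along the backdoor direction."""
    return z - strength * (z @ d)[..., None] * d

z_fixed = repair(poisoned, d)
print("mean projection before:", (poisoned @ d).mean(),
      "after:", (z_fixed @ d).mean())
```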
[406] Reducing Estimation Uncertainty Using Normalizing Flows and Stratification
Paweł Lorek, Rafał Topolnicki, Tomasz Trzciński, Maciej Zięba, Aleksandra Krystecka
Main category: cs.LG
TL;DR: A flow-based model with stratified sampling for flexible estimation of expectations from unknown data distributions, reducing uncertainty compared to parametric methods.
Details
Motivation: Current expectation estimation methods rely on parametric distribution assumptions (Gaussian/mixed Gaussian) which cause significant uncertainty when assumptions don't hold. Need more flexible approaches for unknown data distributions.
Method: Proposes a flow-based model integrated with stratified sampling, using a parametrized neural network to flexibly model unknown data distributions for expectation estimation.
Result: Shows marked reduction in estimation uncertainty across multiple datasets including high-dimensional ones (30D and 128D), outperforming crude Monte Carlo estimators and Gaussian mixture models.
Conclusion: Flow-based model with stratified sampling provides more flexible and accurate expectation estimation for unknown distributions compared to traditional parametric methods.
Abstract: Estimating the expectation of a real-valued function of a random variable from sample data is a critical aspect of statistical analysis, with far-reaching implications in various applications. Current methodologies typically assume (semi-)parametric distributions such as Gaussian or mixed Gaussian, leading to significant estimation uncertainty if these assumptions do not hold. We propose a flow-based model, integrated with stratified sampling, that leverages a parametrized neural network to offer greater flexibility in modeling unknown data distributions, thereby mitigating this limitation. Our model shows a marked reduction in estimation uncertainty across multiple datasets, including high-dimensional (30 and 128) ones, outperforming crude Monte Carlo estimators and Gaussian mixture models. Reproducible code is available at https://github.com/rnoxy/flowstrat.
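Why stratification reduces estimator variance is visible already on a uniform base distribution (in the method, a flow maps such a base to the data distribution; the toy below omits the flow):

```python
# Minimal illustration of variance reduction from stratified sampling:
# split [0,1) into K strata and sample N/K points in each, so every
# region is always covered. Compare estimator spread against crude MC.
import numpy as np

rng = np.random.default_rng(0)
f = lambda u: np.sin(2 * np.pi * u) ** 2     # observable, E[f] = 0.5

N, K = 1000, 50                              # total samples, number of strata

def crude_mc():
    return f(rng.random(N)).mean()

def stratified():
    u = (np.arange(K)[:, None] + rng.random((K, N // K))) / K
    return f(u).mean()

crude = np.std([crude_mc() for _ in range(500)])
strat = np.std([stratified() for _ in range(500)])
print(f"std crude MC: {crude:.5f}  std stratified: {strat:.5f}")
```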
[407] Rising Multi-Armed Bandits with Known Horizons
Seockbean Song, Chenyu Gan, Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok
Main category: cs.LG
TL;DR: CURE-UCB algorithm for Rising Multi-Armed Bandits that explicitly incorporates horizon information to outperform horizon-agnostic strategies, with applications in hyperparameter tuning and robotics.
Details
Motivation: The paper addresses the underexplored horizon-aware setting in Rising Multi-Armed Bandits (RMAB), where optimal strategies shift dramatically with available budget T. Knowledge of T provides significant utility in RMAB for aligning decision-making with shifting optimality, which is crucial in practical scenarios like hyperparameter tuning and robotics.
Method: Proposes CUmulative Reward Estimation UCB (CURE-UCB), a novel algorithm that explicitly integrates the horizon T into the decision-making process. The method provides rigorous analysis establishing new regret upper bounds and proves strict superiority over horizon-agnostic strategies in structured environments like “linear-then-flat” instances.
Result: Extensive experiments demonstrate significant superiority of CURE-UCB over baseline methods. The algorithm strictly outperforms horizon-agnostic strategies in structured environments, with rigorous theoretical guarantees on regret bounds.
Conclusion: CURE-UCB successfully addresses the horizon-aware setting in RMAB, providing a principled approach that leverages horizon information to achieve better performance than horizon-agnostic methods, with applications in practical scenarios where performance improves with repeated usage.
Abstract: The Rising Multi-Armed Bandit (RMAB) framework models environments where expected rewards of arms increase with plays, capturing practical scenarios where the performance of each option improves with repeated usage, such as in robotics and hyperparameter tuning. For instance, in hyperparameter tuning, the validation accuracy of a model configuration (arm) typically increases with each training epoch. A defining characteristic of RMAB is horizon-dependent optimality: unlike standard settings, the optimal strategy here shifts dramatically depending on the available budget $T$. This implies that knowledge of $T$ yields significantly greater utility in RMAB, empowering the learner to align its decision-making with this shifting optimality. However, the horizon-aware setting remains underexplored. To address this, we propose a novel CUmulative Reward Estimation UCB (CURE-UCB) that explicitly integrates the horizon. We provide a rigorous analysis establishing a new regret upper bound and prove that our method strictly outperforms horizon-agnostic strategies in structured environments like “linear-then-flat” instances. Extensive experiments demonstrate its significant superiority over baselines.
[408] Transport, Don’t Generate: Deterministic Geometric Flows for Combinatorial Optimization
Benjy Friedmann, Nadav Dym
Main category: cs.LG
TL;DR: CycFlow replaces diffusion-based heatmap generation with deterministic point transport for neural combinatorial optimization, achieving 1000x speedup while maintaining competitive solution quality for TSP.
Details
Motivation: Current neural combinatorial optimization methods use diffusion models that treat TSP as stochastic heatmap generation, which has quadratic computational complexity and slow iterative denoising processes. The authors aim to develop a more efficient approach.
Method: CycFlow learns an instance-conditioned vector field that continuously transports 2D input coordinates to a canonical circular arrangement. The optimal tour is recovered from this 2N-dimensional representation via angular sorting, using data-dependent flow matching instead of iterative edge denoising.
Result: CycFlow achieves up to three orders of magnitude (1000x) speedup compared to state-of-the-art diffusion baselines while maintaining competitive optimality gaps for the Euclidean Traveling Salesman Problem.
Conclusion: The framework successfully shifts from quadratic edge scoring to linear coordinate dynamics, demonstrating that deterministic point transport can be more efficient than stochastic heatmap generation for neural combinatorial optimization tasks.
Abstract: Recent advances in Neural Combinatorial Optimization (NCO) have been dominated by diffusion models that treat the Euclidean Traveling Salesman Problem (TSP) as a stochastic $N \times N$ heatmap generation task. In this paper, we propose CycFlow, a framework that replaces iterative edge denoising with deterministic point transport. CycFlow learns an instance-conditioned vector field that continuously transports input 2D coordinates to a canonical circular arrangement, where the optimal tour is recovered from this $2N$-dimensional representation via angular sorting. By leveraging data-dependent flow matching, we bypass the quadratic bottleneck of edge scoring in favor of linear coordinate dynamics. This paradigm shift accelerates solving speed by up to three orders of magnitude compared to state-of-the-art diffusion baselines, while maintaining competitive optimality gaps.
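The decoding step is refreshingly simple: once points sit on a circle, the tour is their angular order. A sketch with a stand-in for the learned transport:

```python
# Tour decoding by angular sorting. The transport here is faked (random
# placement on the unit circle); the model would choose angles realizing
# a good tour, conditioned on the instance.
import numpy as np

rng = np.random.default_rng(0)
n = 10
cities = rng.random((n, 2))                 # a TSP instance (the model
                                            # conditions on these points)
a = rng.permutation(n) * 2 * np.pi / n      # stand-in for learned transport
transported = np.stack([np.cos(a), np.sin(a)], axis=1)

angles = np.arctan2(transported[:, 1], transported[:, 0])
tour = np.argsort(angles)                   # O(n log n) decoding, no n×n heatmap
print("decoded tour:", tour.tolist())
```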
[409] Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking
Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, Amos Storkey
Main category: cs.LG
TL;DR: KLA introduces Kalman Linear Attention, a neural sequence modeling primitive that reframes sequence modeling through probabilistic Bayesian filters, enabling parallel training while maintaining explicit uncertainty tracking.
Details
Motivation: State-space language models like Mamba and GLA offer efficient linear complexity but lack expressivity and robust state-tracking for complex reasoning. The paper aims to address these limitations by incorporating probabilistic Bayesian filters for principled state estimation and uncertainty tracking.
Method: Reparameterizes the Kalman filter in information form to enable computation via associative scan for parallel training. Introduces Kalman Linear Attention (KLA) layer that performs time-parallel probabilistic inference while maintaining explicit belief-state uncertainty.
Result: KLA offers more expressive nonlinear updates and gating than GLA variants while retaining computational advantages. On language modeling tasks, KLA matches or outperforms modern SSMs and GLAs across discrete token-manipulation and state-tracking benchmarks.
Conclusion: The Kalman Linear Attention layer successfully bridges the gap between efficient state-space models and probabilistic reasoning, providing a more expressive alternative to existing SSMs while maintaining computational efficiency.
Abstract: State-space language models such as Mamba and gated linear attention (GLA) offer efficient alternatives to transformers due to their linear complexity and parallel training, but often lack the expressivity and robust state-tracking needed for complex reasoning. We address these limitations by reframing sequence modelling through a probabilistic lens, using Bayesian filters as a core primitive. While classical filters such as Kalman filters provide principled state estimation and uncertainty tracking, they are typically viewed as inherently sequential. We show that reparameterising the Kalman filter in information form enables its updates to be computed via an associative scan, allowing efficient parallel training. Building on this insight, we introduce the Kalman Linear Attention (KLA) layer, a neural sequence-modelling primitive that performs time-parallel probabilistic inference while maintaining explicit belief-state uncertainty. KLA offers strictly more expressive nonlinear updates and gating than GLA variants while retaining their computational advantages. On language modelling tasks, KLA matches or outperforms modern SSMs and GLAs across representative discrete token-manipulation and state-tracking benchmarks.
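The parallelization insight can be seen in the simplest case: for a static latent state, information-form Kalman updates are running sums, so a prefix sum (an associative scan) computes every posterior at once. A stripped-down numpy sketch (the paper handles the general dynamic case):

```python
# Static-state special case: the information-form posterior after t steps
# is a cumulative sum of per-step information terms, so all posteriors
# can be computed in parallel with a scan instead of a sequential loop.
import numpy as np

rng = np.random.default_rng(0)
true_x, obs_var = 1.3, 0.5
y = true_x + rng.normal(0, np.sqrt(obs_var), size=200)   # noisy observations

# Per-step contributions: precision 1/R and precision-weighted observation.
lam_steps = np.full_like(y, 1.0 / obs_var)
eta_steps = y / obs_var

# "Scan" = cumulative sum; each prefix is the posterior after t steps.
lam = 1e-6 + np.cumsum(lam_steps)     # prior precision ~ 0
eta = np.cumsum(eta_steps)
posterior_mean = eta / lam            # all means at once
posterior_var = 1.0 / lam             # explicit uncertainty, also in parallel
print("final mean:", posterior_mean[-1], "final var:", posterior_var[-1])
```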
[410] Predicting integers from continuous parameters
Bas Maat, Peter Bloem
Main category: cs.LG
TL;DR: The paper proposes modeling integer-valued labels directly using discrete probability distributions in neural networks, comparing several distributions including novel Bitwise and discrete Laplace approaches.
Details
Motivation: Traditional regression treats integer labels as continuous, changing the underlying distribution from discrete to continuous. The authors want to model integer labels directly with discrete distributions while maintaining continuous parameters for backpropagation in neural networks.
Method: Investigated several discrete probability distributions whose parameters can be predicted from features and are continuous for gradient-based learning. Compared existing and novel distributions including Bitwise (Bernoulli distribution on each bit of binary representation) and discrete Laplace (exponentially decaying tails around continuous mean).
Result: Tested on tabular learning, sequential prediction, and image generation tasks. Found that Bitwise and discrete Laplace distributions performed best overall among the evaluated approaches.
Conclusion: Direct modeling of integer labels with appropriate discrete distributions outperforms traditional continuous regression approaches, with Bitwise and discrete Laplace being the most effective methods for neural network applications.
Abstract: We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers. For example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values, and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the question whether such integer labels can be modeled directly by a discrete distribution, whose parameters are predicted from the features of a given instance. Moreover, we focus on the use case of output distributions of neural networks, which adds the requirement that the parameters of the distribution be continuous so that backpropagation and gradient descent may be used to learn the weights of the network. We investigate several options for such distributions, some existing and some novel, and test them on a range of tasks, including tabular learning, sequential prediction and image generation. We find that overall the best performance comes from two distributions: Bitwise, which represents the target integer in bits and places a Bernoulli distribution on each, and a discrete analogue of the Laplace distribution, which uses a distribution with exponentially decaying tails around a continuous mean.
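The Bitwise distribution is compact to write down: one Bernoulli per bit, with the integer's log-likelihood the sum of per-bit terms. A small sketch (the LSB-first bit order is an assumption):

```python
# Bitwise likelihood: the network emits one Bernoulli probability per bit
# of the target's binary representation; the integer's log-probability is
# the sum of per-bit Bernoulli log-probabilities.
import numpy as np

def bitwise_log_prob(n, probs):
    """log p(n) under independent Bernoulli bits; probs[i] = P(bit i = 1),
    least-significant bit first."""
    bits = (n >> np.arange(len(probs))) & 1
    return np.sum(bits * np.log(probs) + (1 - bits) * np.log1p(-probs))

probs = np.array([0.9, 0.2, 0.7, 0.1])   # e.g. a network head's sigmoid outputs
for n in [5, 4, 13]:
    print(n, f"{bitwise_log_prob(n, probs):.3f}")
```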
[411] Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval
Fanpu Cao, Lu Dai, Jindong Han, Hui Xiong
Main category: cs.LG
TL;DR: GTR is a lightweight plug-and-play module that enhances multivariate time series forecasting by retrieving and aligning global periodic patterns beyond limited historical context.
Details
Motivation: Existing MTSF models are limited by short historical context, preventing them from capturing global periodic patterns that span cycles longer than input horizons. Naive solutions like extending windows cause overfitting, computational costs, and redundant processing.
Method: GTR maintains adaptive global temporal embeddings of entire cycles, dynamically retrieves and aligns relevant global segments with input sequences, and uses 2D convolution with residual fusion to jointly model local and global dependencies without altering the host model architecture.
Result: Extensive experiments on six real-world datasets show GTR consistently achieves state-of-the-art performance in both short-term and long-term forecasting with minimal parameter and computational overhead.
Conclusion: GTR provides an efficient, general solution for enhancing global periodicity modeling in MTSF tasks through its lightweight, plug-and-play design that bridges short-term observations with long-term patterns.
Abstract: Multivariate time series forecasting (MTSF) plays a vital role in numerous real-world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon - despite such patterns carrying strong predictive signals. Naive solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug-and-play module designed to extend any forecasting model’s temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short-term observations with long-term periodicity without altering the host model architecture. Extensive experiments on six real-world datasets demonstrate that GTR consistently delivers state-of-the-art performance across both short-term and long-term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks. Code is available at this repository: https://github.com/macovaseas/GTR.
[412] Collaborative Threshold Watermarking
Tameem Bakr, Anish Ambreth, Nils Lukas
Main category: cs.LG
TL;DR: A threshold watermarking scheme for federated learning that enables collaborative watermark embedding by K clients, where only coalitions of at least t clients can reconstruct and verify the watermark, preventing individual clients from removing it.
Details
Motivation: In federated learning, clients need mechanisms to prove model provenance after joint training. Existing watermarking approaches either don't scale well with many clients (watermarks dilute) or give individual clients too much power to verify/remove watermarks.
Method: Introduces (t,K)-threshold watermarking: clients collaboratively embed a shared watermark during training using secret sharing of the watermark key τ. Only coalitions of at least t clients can reconstruct τ and verify models, and verification can be done without revealing τ in the clear.
Result: The watermark remains detectable at scale (K=128) with minimal accuracy loss and stays above detection threshold (z≥4) under attacks including adaptive fine-tuning using up to 20% of training data.
Conclusion: The proposed threshold watermarking scheme provides scalable and secure provenance verification for federated learning models while preventing individual clients from compromising the watermark.
Abstract: In federated learning (FL), $K$ clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale with many clients as per-client watermarks dilute as $K$ grows, or give any individual client the ability to verify and potentially remove the watermark. We introduce $(t,K)$-threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least $t$ clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key $\tau$ so that coalitions of fewer than $t$ clients cannot reconstruct it, and verification can be performed without revealing $\tau$ in the clear. We instantiate our protocol in the white-box setting and evaluate on image classification. Our watermark remains detectable at scale ($K=128$) with minimal accuracy loss and stays above the detection threshold ($z\ge 4$) under attacks including adaptive fine-tuning using up to 20% of the training data.
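The (t, K) threshold mechanics can be instantiated with Shamir secret sharing over a prime field; the sketch below shows the sharing and reconstruction arithmetic only (the paper's exact construction and verification protocol may differ):

```python
# Shamir (t, K) sharing: the key is the constant term of a random degree
# t-1 polynomial over a prime field; any t shares reconstruct it by
# Lagrange interpolation at x = 0, fewer reveal nothing.
import random

P = 2**127 - 1                      # a Mersenne prime as the field modulus

def share(secret, t, K):
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def poly(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, poly(x)) for x in range(1, K + 1)]

def reconstruct(shares):
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # den^{-1} mod P
    return total

key = 123456789                     # stand-in for the watermark key tau
shares = share(key, t=3, K=5)
assert reconstruct(shares[:3]) == key           # any t shares suffice
assert reconstruct(random.sample(shares, 3)) == key
print("reconstructed:", reconstruct(shares[1:4]))
```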
[413] Time Series Foundation Models for Energy Load Forecasting on Consumer Hardware: A Multi-Dimensional Zero-Shot Benchmark
Luigi Simeone
Main category: cs.LG
TL;DR: TSFMs show strong zero-shot electricity demand forecasting capabilities with 47% error reduction over baselines, maintaining stability with minimal context while traditional models fail, though calibration varies across models.
Details
Motivation: To evaluate whether Time Series Foundation Models' zero-shot capabilities translate to mission-critical applications like electricity demand forecasting where accuracy, calibration, and robustness directly affect grid operations.
Method: Multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, TinyTimeMixer) alongside Prophet, SARIMA, and Seasonal Naive using ERCOT hourly load data (2020-2024) on consumer-grade hardware. Evaluation spans four axes: context length sensitivity, probabilistic forecast calibration, robustness under distribution shifts, and prescriptive analytics.
Result: Top-performing foundation models achieve MASE values near 0.31 at long context lengths (2048h), 47% reduction over Seasonal Naive baseline. TSFMs maintain stable accuracy with minimal context while Prophet fails when fitting window is shorter than seasonality period. Calibration varies: Chronos-2 produces well-calibrated intervals while Moirai-2 and Prophet exhibit overconfidence.
Conclusion: TSFMs demonstrate practical zero-shot forecasting capabilities for electricity demand with structural advantages over traditional methods, though model selection requires consideration of calibration characteristics. The benchmark provides practical guidelines and reproducible framework.
Abstract: Time Series Foundation Models (TSFMs) have introduced zero-shot prediction capabilities that bypass the need for task-specific training. Whether these capabilities translate to mission-critical applications such as electricity demand forecasting–where accuracy, calibration, and robustness directly affect grid operations–remains an open question. We present a multi-dimensional benchmark evaluating four TSFMs (Chronos-Bolt, Chronos-2, Moirai-2, and TinyTimeMixer) alongside Prophet as an industry-standard baseline and two statistical references (SARIMA and Seasonal Naive), using ERCOT hourly load data from 2020 to 2024. All experiments run on consumer-grade hardware (AMD Ryzen 7, 16GB RAM, no GPU). The evaluation spans four axes: (1) context length sensitivity from 24 to 2048 hours, (2) probabilistic forecast calibration, (3) robustness under distribution shifts including COVID-19 lockdowns and Winter Storm Uri, and (4) prescriptive analytics for operational decision support. The top-performing foundation models achieve MASE values near 0.31 at long context lengths (C = 2048h, day-ahead horizon), a 47% reduction over the Seasonal Naive baseline. The inclusion of Prophet exposes a structural advantage of pre-trained models: Prophet fails when the fitting window is shorter than its seasonality period (MASE > 74 at 24-hour context), while TSFMs maintain stable accuracy even with minimal context because they recognise temporal patterns learned during pre-training rather than estimating them from scratch. Calibration varies substantially across models–Chronos-2 produces well-calibrated prediction intervals (95% empirical coverage at 90% nominal level) while both Moirai-2 and Prophet exhibit overconfidence (~70% coverage). We provide practical model selection guidelines and release the complete benchmark framework for reproducibility.
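For readers unfamiliar with the headline metric: MASE divides the forecast's MAE by the in-sample MAE of a seasonal-naive forecast, so 1.0 matches seasonal naive and the reported ~0.31 is roughly a 3x error reduction. A self-contained sketch with synthetic hourly load:

```python
# MASE with seasonal scaling (season m = 24 hours): forecast MAE divided
# by the in-sample MAE of the seasonal-naive predictor y[t] = y[t - m].
import numpy as np

def mase(y_train, y_true, y_pred, m=24):
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))   # seasonal-naive MAE
    return np.mean(np.abs(y_true - y_pred)) / scale

rng = np.random.default_rng(0)
t = np.arange(24 * 40)
load = 50 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 1, t.size)
train, test = load[:-24], load[-24:]

naive_pred = train[-24:]                      # seasonal naive: last day repeats
print("seasonal naive MASE:", mase(train, test, naive_pred))   # ~1.0
```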
[414] Semi-Supervised Cross-Domain Imitation Learning
Li-Min Chu, Kai-Siang Ma, Ming-Hong Chen, Ping-Chun Hsieh
Main category: cs.LG
TL;DR: Semi-supervised cross-domain imitation learning (SS-CDIL) method that uses minimal target expert demonstrations and unlabeled imperfect trajectories to achieve stable, data-efficient policy transfer across domains.
Details
Motivation: Cross-domain imitation learning is valuable when expert data collection is costly, but existing methods are either supervised (requiring proxy tasks and explicit alignment) or unsupervised (often unstable). There's a need for a semi-supervised approach that balances stability and data efficiency.
Method: Proposes SS-CDIL algorithm using offline data: small number of target expert demonstrations + unlabeled imperfect trajectories. Introduces novel cross-domain loss function for learning inter-domain state-action mappings and adaptive weight function to balance source and target knowledge.
Result: Experiments on MuJoCo and Robosuite show consistent gains over baselines, demonstrating stable and data-efficient policy learning with minimal supervision.
Conclusion: SS-CDIL provides a theoretically justified approach for cross-domain imitation learning that achieves stable performance with minimal target expert data, addressing limitations of both supervised and unsupervised methods.
Abstract: Cross-domain imitation learning (CDIL) accelerates policy learning by transferring expert knowledge across domains, which is valuable in applications where the collection of expert data is costly. Existing methods are either supervised, relying on proxy tasks and explicit alignment, or unsupervised, aligning distributions without paired data, but often unstable. We introduce the Semi-Supervised CDIL (SS-CDIL) setting and propose the first algorithm for SS-CDIL with theoretical justification. Our method uses only offline data, including a small number of target expert demonstrations and some unlabeled imperfect trajectories. To handle domain discrepancy, we propose a novel cross-domain loss function for learning inter-domain state-action mappings and design an adaptive weight function to balance the source and target knowledge. Experiments on MuJoCo and Robosuite show consistent gains over the baselines, demonstrating that our approach achieves stable and data-efficient policy learning with minimal supervision. Our code is available at https://github.com/NYCU-RL-Bandits-Lab/CDIL.
[415] ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents
Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, Xin Lou
Main category: cs.LG
TL;DR: Visual-native search framework using webpage snapshots with layout cues and Information-Aware Credit Assignment (ICA) for better web information retrieval
Details
Motivation: Current reinforcement learning agents for web information retrieval suffer from low signal-to-noise feedback: text parsers discard layout semantics, and sparse rewards obscure which retrieval actions matter.
Method: Visual-native framework representing webpages as visual snapshots to leverage layout cues, combined with Information-Aware Credit Assignment (ICA), which estimates each retrieved snapshot’s contribution via posterior analysis and propagates dense learning signals.
Result: Consistently outperforms text-based baselines on diverse information-seeking benchmarks, showing visual snapshot grounding with information-level credit assignment alleviates credit-assignment bottleneck
Conclusion: Visual-native approach with ICA effectively addresses credit assignment challenges in open-ended web environments for information-seeking agents
Abstract: Despite the strong performance achieved by reinforcement learning-trained information-seeking agents, learning in open-ended web environments remains severely constrained by low signal-to-noise feedback. Text-based parsers often discard layout semantics and introduce unstructured noise, while long-horizon training typically relies on sparse outcome rewards that obscure which retrieval actions actually matter. We propose a visual-native search framework that represents webpages as visual snapshots, allowing agents to leverage layout cues to quickly localize salient evidence and suppress distractors. To learn effectively from these high-dimensional observations, we introduce Information-Aware Credit Assignment (ICA), a post-hoc method that estimates each retrieved snapshot’s contribution to the final outcome via posterior analysis and propagates dense learning signals back to key search turns. Integrated with a GRPO-based training pipeline, our approach consistently outperforms text-based baselines on diverse information-seeking benchmarks, providing evidence that visual snapshot grounding with information-level credit assignment alleviates the credit-assignment bottleneck in open-ended web environments. The code and datasets will be released in https://github.com/pc-inno/ICA_MM_deepsearch.git.
[416] PRISM: Parallel Residual Iterative Sequence Model
Jie Jiang, Ke Cheng, Xin Xu, Mengyang Pang, Tianhao Lu, Jiaheng Li, Yue Liu, Yuan Wang, Jun Zhang, Huan Yu, Zhouchen Lin
Main category: cs.LG
TL;DR: PRISM introduces a parallelizable iterative sequence model that combines Transformer expressivity with linear model efficiency through solver-inspired architecture and write-forget decoupling.
Details
Motivation: To resolve the fundamental tension between Transformer expressivity and linear model efficiency, overcoming the limitations of existing efficient architectures bounded by shallow linear updates and the hardware parallelism issues of iterative methods like Test-Time Training.
Method: PRISM uses a solver-inspired inductive bias with Write-Forget Decoupling to isolate non-linearity, a two-stage proxy architecture with short-convolution for initial residual anchoring and learned predictor for refinement updates, enabling parallelizable feedforward operations.
Result: Theoretically achieves Rank-L accumulation beyond single-step Rank-1 bottleneck, empirically achieves comparable performance to explicit optimization methods with 174x higher throughput.
Conclusion: PRISM successfully bridges the gap between expressivity and efficiency in sequence modeling through parallelizable iterative refinement, offering a promising direction for efficient generative modeling.
Abstract: Generative sequence modeling faces a fundamental tension between the expressivity of Transformers and the efficiency of linear sequence models. Existing efficient architectures are theoretically bounded by shallow, single-step linear updates, while powerful iterative methods like Test-Time Training (TTT) break hardware parallelism due to state-dependent gradients. We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We employ a Write-Forget Decoupling strategy that isolates non-linearity within the injection operator. To bypass the serial dependency of explicit solvers, PRISM utilizes a two-stage proxy architecture: a short-convolution anchors the initial residual using local history energy, while a learned predictor estimates the refinement updates directly from the input. This design distills structural patterns associated with iterative correction into a parallelizable feedforward operator. Theoretically, we prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck. Empirically, it achieves comparable performance to explicit optimization methods while achieving 174x higher throughput.
[417] FedPS: Federated data Preprocessing via aggregated Statistics
Xuefeng Xu, Graham Cormode
Main category: cs.LG
TL;DR: FedPS: A federated data preprocessing framework using aggregated statistics and data sketching for privacy-preserving FL without raw data sharing
Details
Motivation: Federated Learning requires data preprocessing but faces challenges due to privacy constraints (no raw data centralization) and communication efficiency needs, which are largely overlooked in FL research despite being critical for model performance.
Method: FedPS uses data-sketching techniques to summarize local datasets efficiently while preserving essential statistical information. It designs federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extends preprocessing-related models (k-Means, k-NN, Bayesian Linear Regression) to both horizontal and vertical FL settings.
Result: FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments, enabling privacy-preserving data preprocessing without raw data sharing.
Conclusion: FedPS addresses the critical but overlooked preprocessing stage in FL, offering a unified framework that enables effective data preprocessing while maintaining privacy and communication efficiency constraints.
Abstract: Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
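To show the flavor of preprocessing from aggregated statistics, here is the simplest case, standard scaling: each client ships only (count, sum, sum of squares), and every client then applies the identical global transform. (Other steps in the paper would use richer sketches; this toy is not the FedPS API.)

```python
# Federated standard scaling from per-client aggregates: no raw rows
# leave a client, yet the fitted transform is globally consistent.
import numpy as np

rng = np.random.default_rng(0)
clients = [rng.normal(5, 2, size=n) for n in (100, 250, 75)]  # local features

# Each client reports only three numbers.
stats = [(len(x), x.sum(), (x**2).sum()) for x in clients]

n = sum(s[0] for s in stats)
mean = sum(s[1] for s in stats) / n
var = sum(s[2] for s in stats) / n - mean**2
std = np.sqrt(var)

# Every client applies the identical transform locally.
scaled = [(x - mean) / std for x in clients]
print("global mean/std:", mean, std)
```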
[418] RePO: Bridging On-Policy Learning and Off-Policy Knowledge through Rephrasing Policy Optimization
Linxuan Xia, Xiaolong Yang, Yongyuan Chen, Enyue Zhao, Deng Cai, Yasheng Wang, Boxi Wu
Main category: cs.LG
TL;DR: RePO is a novel reinforcement learning method that improves domain-specific alignment of LLMs by having the model rephrase off-policy knowledge into on-policy trajectories, enabling better hard sample utilization while maintaining training stability.
Details
Motivation: Current methods for aligning LLMs to domain-specific data face trade-offs: SFT degrades model generality, on-policy RL preserves generality but fails with hard samples, and off-policy RL suffers from training instability due to distribution shift.
Method: RePO prompts the policy model to first comprehend off-policy knowledge, then rephrase it into trajectories that match its own stylistic and parametric distribution. These rephrased high-quality trajectories dynamically replace low-reward rollouts while preserving on-policy training dynamics.
Result: Experiments on several benchmarks show RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
Conclusion: RePO effectively reconciles off-policy knowledge absorption with on-policy RL stability, offering a promising approach for domain-specific LLM alignment without sacrificing generality.
Abstract: Aligning large language models (LLMs) on domain-specific data remains a fundamental challenge. Supervised fine-tuning (SFT) offers a straightforward way to inject domain knowledge but often degrades the model’s generality. In contrast, on-policy reinforcement learning (RL) preserves generality but fails to effectively assimilate hard samples that exceed the model’s current reasoning level. Recent off-policy RL attempts improve hard sample utilization, yet they suffer from severe training instability due to the forced distribution shift toward off-policy knowledge. To reconcile effective off-policy knowledge absorption with the stability of on-policy RL, we propose Rephrasing Policy Optimization (RePO). In RePO, the policy model is prompted to first comprehend off-policy knowledge and then rephrase it into trajectories that conform to its own stylistic and parametric distribution. RePO dynamically replaces low-reward rollouts with these rephrased, high-quality trajectories. This strategy guides the model toward correct reasoning paths while strictly preserving on-policy training dynamics. Experiments on several benchmarks demonstrate that RePO improves hard-sample utilization and outperforms existing baselines, achieving state-of-the-art performance.
[419] Adaptive Sampling for Private Worst-Case Group Optimization
Max Cairney-Leeming, Amartya Sanyal, Christoph H. Lampert
Main category: cs.LG
TL;DR: ASC algorithm for differentially private worst-case group optimization that adaptively controls sampling rates and clipping thresholds to ensure consistent privacy guarantees while improving worst-case group accuracy.
Details
Motivation: Standard models often fail on minority/hard-to-learn groups. While worst-case group optimization methods exist, they create privacy issues in differential privacy settings: unequal weighting leads to inhomogeneous privacy guarantees, particularly weaker privacy for minority groups.
Method: ASC (Adaptively Sampled and Clipped Worst-case Group Optimization) algorithm that adaptively controls both the sampling rate and clipping threshold for each group. Allows harder-to-learn groups to be sampled more often while ensuring consistent privacy guarantees across all groups.
Result: ASC achieves lower-variance gradients, tighter privacy guarantees, and substantially higher worst-case group accuracy without sacrificing overall average accuracy compared to prior work.
Conclusion: ASC provides an effective solution for differentially private worst-case group optimization that balances privacy fairness with model performance across different groups.
Abstract: Models trained by minimizing the average loss often fail to be accurate on small or hard-to-learn groups of the data. Various methods address this issue by optimizing a weighted objective that focuses on the worst-performing groups. However, this approach becomes problematic when learning with differential privacy, as unequal data weighting can result in inhomogeneous privacy guarantees, in particular weaker privacy for minority groups. In this work, we introduce a new algorithm for differentially private worst-case group optimization called ASC (Adaptively Sampled and Clipped Worst-case Group Optimization). It adaptively controls both the sampling rate and the clipping threshold of each group. Thereby, it allows for harder-to-learn groups to be sampled more often while ensuring consistent privacy guarantees across all groups. Comparing ASC to prior work, we show that it results in lower-variance gradients, tighter privacy guarantees, and substantially higher worst-case group accuracy without sacrificing overall average accuracy.
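A very rough sketch of the knobs the method controls, per-group sampling rates and clipping thresholds, in a DP-SGD-style step; the noise calibration and privacy accounting here are toy placeholders, not the paper's analysis:

```python
# Toy per-group step: each group g gets its own sampling rate q_g and
# clip norm C_g; clipped gradients are summed and noised. Illustrative
# only -- real DP accounting is more careful than this.
import numpy as np

rng = np.random.default_rng(0)
groups = {"majority": rng.normal(0, 1, (1000, 10)),   # per-example gradients
          "minority": rng.normal(0, 3, (50, 10))}
q = {"majority": 0.05, "minority": 0.4}               # harder group sampled more
C = {"majority": 1.0, "minority": 2.0}                # per-group clip norms
sigma = 1.0

update, n_sampled = np.zeros(10), 0
for g, grads in groups.items():
    take = rng.random(len(grads)) < q[g]              # Poisson subsampling
    for grad in grads[take]:
        norm = np.linalg.norm(grad)
        update += grad * min(1.0, C[g] / norm)        # per-group clipping
    update += rng.normal(0, sigma * C[g], size=10)    # noise scaled to C_g
    n_sampled += take.sum()

update /= max(n_sampled, 1)
print("noised update:", update[:3], "examples used:", n_sampled)
```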
[420] Resource-Efficient Model-Free Reinforcement Learning for Board Games
Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada
Main category: cs.LG
TL;DR: Proposes a model-free RL algorithm for board games that achieves more efficient learning than search-based methods like AlphaZero, validated across five board game environments.
Details
Motivation: Search-based RL methods like AlphaZero have achieved remarkable success in board games but suffer from significant computational demands that hinder reproducibility. The authors aim to develop a more efficient model-free alternative.
Method: Proposes a model-free reinforcement learning algorithm specifically designed for board games. The method is validated through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello.
Result: The proposed method achieves more efficient learning than existing methods across all five board game environments. An extensive ablation study demonstrates the importance of the core techniques used in the proposed method.
Conclusion: The efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods, offering a more computationally efficient alternative.
Abstract: Board games have long served as complex decision-making benchmarks in artificial intelligence. In this field, search-based reinforcement learning methods such as AlphaZero have achieved remarkable success. However, their significant computational demands have been pointed out as barriers to their reproducibility. In this study, we propose a model-free reinforcement learning algorithm designed for board games to achieve more efficient learning. To validate the efficiency of the proposed method, we conducted comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. The results demonstrate that the proposed method achieves more efficient learning than existing methods across these environments. In addition, our extensive ablation study shows the importance of core techniques used in the proposed method. We believe that our efficient algorithm shows the potential of model-free reinforcement learning in domains traditionally dominated by search-based methods.
[421] SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios
Yanan Wang, Renxi Wang, Yongxin Wang, Xuezhi Liang, Fajri Koto, Timothy Baldwin, Xiaodan Liang, Haonan Li
Main category: cs.LG
TL;DR: SimuScene is a systematic study evaluating LLMs’ ability to simulate physical scenarios via code across 5 physics domains and 52 concepts, showing current models struggle (21.5% pass rate) and proposing RL training with vision-language model rewards.
Details
Motivation: While LLMs excel at math, coding, and reasoning tasks, their ability to accurately represent and simulate physical scenarios through code generation remains underexplored. The authors aim to systematically study this capability gap.
Method: 1) Created SimuScene dataset with 7,659 physical scenarios across 5 physics domains and 52 concepts, using automatic pipeline with human verification; 2) Evaluated 10 contemporary LLMs; 3) Introduced RL pipeline with visual rewards using vision-language model as judge to train textual models.
Result: Even the strongest LLM achieved only 21.5% pass rate on physical simulation tasks, demonstrating significant difficulty. Training with their data improved physical simulation via code while substantially enhancing general code generation performance.
Conclusion: Physical scenario simulation via code is a challenging task for current LLMs, but can be improved through specialized training with visual rewards from vision-language models, which also benefits general code generation capabilities.
Abstract: Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
[422] Automated Model Design using Gated Neuron Selection in Telecom
Adam Orucu, Marcus Medhage, Farnaz Moradi, Andreas Johnsson, Sarunas Girdzijauskas
Main category: cs.LG
TL;DR: TabGNS is a gradient-based Neural Architecture Search method specifically designed for tabular data in telecommunications networks, achieving smaller models and faster search times.
Details
Motivation: The telecommunications industry needs automated neural architecture design for resource-constrained environments, as manual design is challenging and time-consuming for tasks like traffic prediction and QoS optimization.
Method: TabGNS (Tabular Gated Neuron Selection) is a novel gradient-based NAS method tailored for tabular data, using gated neuron selection to automatically design compact neural network architectures.
Result: TabGNS reduces architecture size by 51-82% and search time by up to 36x compared to state-of-the-art tabular NAS methods while improving prediction performance across telecommunications and generic tabular datasets.
Conclusion: TabGNS enables automated neural network design throughout the model lifecycle, accelerating ML deployment in telecommunications networks through efficient, compact architecture search.
Abstract: The telecommunications industry is experiencing rapid growth in adopting deep learning for critical tasks such as traffic prediction, signal strength prediction, and quality of service optimisation. However, designing neural network architectures for these applications remains challenging and time-consuming, particularly when targeting compact models suitable for resource-constrained network environments. Therefore, there is a need for automating the model design process to create high-performing models efficiently. This paper introduces TabGNS (Tabular Gated Neuron Selection), a novel gradient-based Neural Architecture Search (NAS) method specifically tailored for tabular data in telecommunications networks. We evaluate TabGNS across multiple telecommunications and generic tabular datasets, demonstrating improvements in prediction performance while reducing the architecture size by 51-82% and reducing the search time by up to 36x compared to state-of-the-art tabular NAS methods. Integrating TabGNS into the model lifecycle management enables automated design of neural networks throughout the lifecycle, accelerating deployment of ML solutions in telecommunications networks.
[423] The Sample Complexity of Uniform Approximation for Multi-Dimensional CDFs and Fixed-Price Mechanisms
Matteo Castiglioni, Anna Lunghi, Alberto Marchesi
Main category: cs.LG
TL;DR: The paper studies sample complexity of learning n-dimensional CDFs with one-bit feedback, showing near-dimensional-invariance with sample complexity ε⁻³log(1/ε)^O(n), and applies this to learning fixed-price mechanisms in small markets.
Details
Motivation: To understand the sample complexity of learning multivariate cumulative distribution functions (CDFs) when only minimal one-bit feedback is available, as opposed to full feedback scenarios. This extends the multivariate DKW inequality to bandit feedback settings and has applications in mechanism design for small markets.
Method: The paper develops theoretical analysis for learning n-dimensional CDFs with one-bit feedback, establishing sample complexity bounds. The approach involves analyzing uniform approximation of CDFs over arbitrarily fine grids under bandit feedback constraints.
Result: Main result shows near-dimensional-invariance: achieving uniform ε-approximation with sample complexity ε⁻³log(1/ε)^O(n), where dimensionality n only affects logarithmic terms. Provides tight sample complexity bounds and novel regret guarantees for learning fixed-price mechanisms in small markets like bilateral trade.
Conclusion: The paper establishes fundamental sample complexity bounds for learning multivariate CDFs with minimal feedback, demonstrating dimensional invariance in logarithmic terms. These results have practical implications for mechanism design in small markets where only limited feedback is available.
Abstract: We study the sample complexity of learning a uniform approximation of an $n$-dimensional cumulative distribution function (CDF) within an error $\varepsilon > 0$, when observations are restricted to a minimal one-bit feedback. This serves as a counterpart to the multivariate DKW inequality under "full feedback", extending it to the setting of "bandit feedback". Our main result shows a near-dimensional-invariance in the sample complexity: we get a uniform $\varepsilon$-approximation with a sample complexity of $\frac{1}{\varepsilon^3}\log\left(\frac{1}{\varepsilon}\right)^{\mathcal{O}(n)}$ over an arbitrarily fine grid, where the dimensionality $n$ only affects logarithmic terms. As direct corollaries, we provide tight sample complexity bounds and novel regret guarantees for learning fixed-price mechanisms in small markets, such as bilateral trade settings.
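As a toy illustration of the feedback model (not the paper's algorithm), the naive estimator below spends fresh samples per grid point and extracts exactly one comparison bit 1{X ≤ x} from each draw; the paper's contribution is achieving uniform accuracy over arbitrarily fine grids far more sample-efficiently, at the rate quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_cdf_one_bit(sample_fn, grid, queries_per_point=2000):
    """Estimate an n-dimensional CDF on `grid` when each draw from
    `sample_fn` may be touched only through a single one-bit comparison
    1{X <= x} (coordinatewise), i.e. bandit rather than full feedback."""
    est = np.empty(len(grid))
    for j, x in enumerate(grid):
        bits = 0
        for _ in range(queries_per_point):
            bits += int(np.all(sample_fn() <= x))  # the one allowed bit
        est[j] = bits / queries_per_point
    return est

# Toy usage: a 2-D standard normal, CDF estimated on a small grid.
grid = [np.array([a, b]) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
cdf_hat = estimate_cdf_one_bit(lambda: rng.standard_normal(2), grid)
```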
[424] Natural Hypergradient Descent: Algorithm Design, Convergence Analysis, and Parallel Implementation
Deyi Kong, Zaiwei Chen, Shuzhong Zhang, Shancong Mou
Main category: cs.LG
TL;DR: NHGD is a new bilevel optimization method that uses empirical Fisher information as a Hessian inverse surrogate, enabling parallel optimization with reduced computational overhead while maintaining theoretical guarantees.
Details
Motivation: Bilevel optimization problems suffer from computational bottlenecks in hypergradient estimation, particularly the need to compute or approximate Hessian inverses, which is expensive and limits scalability in large-scale machine learning settings.
Method: NHGD exploits the statistical structure of inner optimization problems by using the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This enables a parallel optimize-and-approximate framework where Hessian-inverse approximation is updated synchronously with stochastic inner optimization, reusing gradient information at minimal additional cost.
Result: Theoretical analysis establishes high-probability error bounds and sample complexity guarantees matching state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks demonstrate practical advantages in scalability and effectiveness.
Conclusion: NHGD provides an efficient solution to bilevel optimization problems by addressing the computational bottleneck of hypergradient estimation through statistical structure exploitation and parallel optimization, making it suitable for large-scale machine learning applications.
Abstract: In this work, we propose Natural Hypergradient Descent (NHGD), a new method for solving bilevel optimization problems. To address the computational bottleneck in hypergradient estimation–namely, the need to compute or approximate Hessian inverse–we exploit the statistical structure of the inner optimization problem and use the empirical Fisher information matrix as an asymptotically consistent surrogate for the Hessian. This design enables a parallel optimize-and-approximate framework in which the Hessian-inverse approximation is updated synchronously with the stochastic inner optimization, reusing gradient information at negligible additional cost. Our main theoretical contribution establishes high-probability error bounds and sample complexity guarantees for NHGD that match those of state-of-the-art optimize-then-approximate methods, while significantly reducing computational time overhead. Empirical evaluations on representative bilevel learning tasks further demonstrate the practical advantages of NHGD, highlighting its scalability and effectiveness in large-scale machine learning settings.
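The core substitution can be sketched as follows: the implicit-function hypergradient, grad_lam f minus the cross-Jacobian applied to a Hessian solve against grad_y f, is computed with the damped empirical Fisher F = (1/m) sum_i g_i g_i^T standing in for the inner Hessian. The dense solve and the damping constant are illustrative simplifications; the paper maintains this approximation incrementally, in parallel with the inner optimization.

```python
import numpy as np

def fisher_solve(per_sample_grads, v, damping=1e-3):
    """Apply F^{-1} to v, where F is the damped empirical Fisher built from
    inner-loss gradients the inner optimizer already computed (shape (m, d))."""
    m, d = per_sample_grads.shape
    fisher = per_sample_grads.T @ per_sample_grads / m + damping * np.eye(d)
    return np.linalg.solve(fisher, v)

def hypergradient(grad_f_lam, cross_jac, per_sample_grads, grad_f_y):
    """Illustrative NHGD-style hypergradient:
    d f / d lam ~= grad_lam f - J_{lam,y}^T F^{-1} grad_y f,
    with the Hessian solve replaced by a Fisher solve.
    cross_jac: (d_y, d_lam) mixed Jacobian of the inner stationarity map."""
    return grad_f_lam - cross_jac.T @ fisher_solve(per_sample_grads, grad_f_y)
```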
[425] Tuning the burn-in phase in training recurrent neural networks improves their performance
Julian D. Schiller, Malte Heinrich, Victor G. Lopez, Matthias A. Müller
Main category: cs.LG
TL;DR: Truncated BPTT for RNNs reduces computational overhead but introduces performance trade-offs; proper burn-in phase tuning can significantly improve prediction accuracy.
Details
Motivation: Standard BPTT for RNNs is computationally expensive with long sequences. Truncated BPTT over shorter segments reduces overhead but may affect performance. The paper aims to understand the theoretical and practical implications of this trade-off.
Method: Establishes theoretical bounds on accuracy and performance loss when optimizing over subsequences instead of full sequences. Identifies burn-in phase as critical tuning parameter. Validates through experiments on system identification and time series forecasting benchmarks.
Result: Burn-in phase significantly influences training process. Proper tuning can reduce prediction error by over 60% on training and test data. Theoretical bounds align with experimental observations.
Conclusion: Truncated BPTT is practical but requires careful burn-in phase tuning. The burn-in phase serves as important performance knob, with proper tuning yielding substantial accuracy improvements.
Abstract: Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
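In code, the burn-in knob is simply the number of leading steps per segment that warm up the hidden state without contributing gradients or loss. A minimal PyTorch sketch, assuming a batch-first `nn.RNN`/`nn.GRU` (for an LSTM the state is a tuple) and burn_in >= 1; naming is ours, not the authors' implementation:

```python
import torch
import torch.nn as nn

def tbptt_step(rnn, head, loss_fn, seq, target, h0, burn_in):
    """One truncated-BPTT segment with an explicit burn-in phase: the first
    `burn_in` steps only warm up the hidden state (no gradient), and the
    loss is taken on the remaining steps. `burn_in` is the tuning knob the
    paper analyzes; this is a generic sketch, not their exact code."""
    with torch.no_grad():                      # burn-in: state warm-up only
        _, h = rnn(seq[:, :burn_in], h0)
    out, h = rnn(seq[:, burn_in:], h)          # gradients flow from here on
    loss = loss_fn(head(out), target[:, burn_in:])
    loss.backward()
    return loss.item(), h.detach()             # detached state for next segment
```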
[426] RiemannGL: Riemannian Geometry Changes Graph Deep Learning
Li Sun, Qiqi Wan, Suyang Zhou, Zhenhao Huang, Philip S. Yu
Main category: cs.LG
TL;DR: Riemannian geometry provides a principled foundation for graph representation learning, offering a unifying paradigm beyond isolated techniques, with focus on intrinsic manifold structures rather than just hyperbolic spaces.
Details
Motivation: Graphs have non-Euclidean structures with complex interactions, requiring geometric foundations beyond Euclidean spaces. Current approaches are limited to narrow manifolds (mainly hyperbolic) and extrinsic formulations, lacking intrinsic manifold integration with graph neural networks.
Method: Proposes Riemannian geometry as foundational framework for graph learning, identifies conceptual gaps, and outlines research agenda along three dimensions: manifold type (beyond hyperbolic), neural architecture (intrinsic structures), and learning paradigm.
Result: Presents a structured research agenda and identifies key challenges for advancing Riemannian graph learning, including theoretical foundations and promising directions for future exploration.
Conclusion: Riemannian geometry should be viewed as a unifying paradigm for graph learning, with intrinsic manifold structures as central mission. The paper provides a coherent viewpoint to stimulate broader exploration of Riemannian foundations for graph representation learning.
Abstract: Graphs are ubiquitous, and learning on graphs has become a cornerstone in artificial intelligence and data mining communities. Unlike pixel grids in images or sequential structures in language, graphs exhibit a typical non-Euclidean structure with complex interactions among the objects. This paper argues that Riemannian geometry provides a principled and necessary foundation for graph representation learning, and that Riemannian graph learning should be viewed as a unifying paradigm rather than a collection of isolated techniques. While recent studies have explored the integration of graph learning and Riemannian geometry, most existing approaches are limited to a narrow class of manifolds, particularly hyperbolic spaces, and often adopt extrinsic manifold formulations. We contend that the central mission of Riemannian graph learning is to endow graph neural networks with intrinsic manifold structures, which remains underexplored. To advance this perspective, we identify key conceptual and methodological gaps in existing approaches and outline a structured research agenda along three dimensions: manifold type, neural architecture, and learning paradigm. We further discuss open challenges, theoretical foundations, and promising directions that are critical for unlocking the full potential of Riemannian graph learning. This paper aims to provide a coherent viewpoint and to stimulate broader exploration of Riemannian geometry as a foundational framework for future graph learning research.
[427] Near-Constant Strong Violation and Last-Iterate Convergence for Online CMDPs via Decaying Safety Margins
Qian Zuo, Zhiyong Wang, Fengxiang He
Main category: cs.LG
TL;DR: FlexDOME algorithm achieves near-constant strong constraint violation with sublinear regret in constrained MDPs, using time-varying safety margins and regularization for safe online RL.
Details
Motivation: Existing primal-dual methods for safe online RL in CMDPs either incur growing constraint violation or are limited to average-iterate convergence due to oscillations. There's a need for algorithms that achieve both sublinear strong regret and near-constant strong constraint violation with last-iterate convergence.
Method: FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. It uses a novel term-wise asymptotic dominance strategy where safety margins are scheduled to asymptotically majorize optimization and statistical error decay rates, clamping cumulative violations to near-constant levels. The analysis employs policy-dual Lyapunov arguments for convergence guarantees.
Result: FlexDOME is the first algorithm to provably achieve near-constant Õ(1) strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence in CMDPs. Experiments validate the theoretical findings.
Conclusion: FlexDOME successfully addresses limitations of existing safe online RL methods by achieving both strong regret and constraint violation guarantees with last-iterate convergence through carefully designed safety margins and regularization.
Abstract: We study safe online reinforcement learning in Constrained Markov Decision Processes (CMDPs) under strong regret and violation metrics, which forbid error cancellation over time. Existing primal-dual methods that achieve sublinear strong reward regret inevitably incur growing strong constraint violation or are restricted to average-iterate convergence due to inherent oscillations. To address these limitations, we propose the Flexible safety Domain Optimization via Margin-regularized Exploration (FlexDOME) algorithm, the first to provably achieve near-constant $\tilde{O}(1)$ strong constraint violation alongside sublinear strong regret and non-asymptotic last-iterate convergence. FlexDOME incorporates time-varying safety margins and regularization terms into the primal-dual framework. Our theoretical analysis relies on a novel term-wise asymptotic dominance strategy, where the safety margin is rigorously scheduled to asymptotically majorize the functional decay rates of the optimization and statistical errors, thereby clamping cumulative violations to a near-constant level. Furthermore, we establish non-asymptotic last-iterate convergence guarantees via a policy-dual Lyapunov argument. Experiments corroborate our theoretical findings.
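The margin mechanism is easy to sketch: run a standard primal-dual loop, but tighten the cost constraint by a decaying margin eps_t so early conservatism absorbs optimization and statistical error. The polynomial schedule and gradient oracles below are placeholders; the paper derives the exact schedule needed to majorize the error terms.

```python
import numpy as np

def primal_dual_with_margin(grad_reward, grad_cost, cost_value, theta0,
                            budget, T, eta=0.05, c=1.0, alpha=0.5):
    """Generic primal-dual loop with a decaying safety margin eps_t = c/t^alpha,
    in the spirit of FlexDOME: the policy parameters are optimized against a
    *tightened* constraint cost <= budget - eps_t (illustrative schedule only)."""
    theta, lam = theta0.copy(), 0.0
    for t in range(1, T + 1):
        eps_t = c / t ** alpha                       # decaying safety margin
        # Primal ascent on the Lagrangian reward - lam * cost.
        theta += eta * (grad_reward(theta) - lam * grad_cost(theta))
        # Dual ascent on the tightened constraint; projection keeps lam >= 0.
        lam = max(0.0, lam + eta * (cost_value(theta) - (budget - eps_t)))
    return theta, lam
```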
[428] Spatial-Morphological Modeling for Multi-Attribute Imputation of Urban Blocks
Vasilii Starikov, Ruslan Kozliak, Georgii Kontsevik, Sergey Mityagin
Main category: cs.LG
TL;DR: A spatial-morphological imputer tool that combines data-driven morphological clustering with neighborhood methods to reconstruct missing urban morphological indicators (FSI/GSI) at city block level.
Details
Motivation: Accurate reconstruction of missing morphological indicators is crucial for urban planning and data-driven analysis, but existing methods may not adequately capture both morphological structure and spatial context.
Method: Combines data-driven morphological clustering (global priors) with neighborhood-based methods (local spatial information) using inverse distance weighting (IDW) or spatial k-nearest neighbor (sKNN) for context-dependent interpolation of floor space index (FSI) and ground space index (GSI).
Result: While SM alone captures meaningful morphological structure, its combination with IDW or sKNN provides superior performance compared to existing state-of-the-art models, demonstrating complementary advantages of morphological and spatial approaches.
Conclusion: Composite methods that combine morphological and spatial approaches offer the best performance for reconstructing missing urban morphological indicators, with the spatial-morphological imputer tool being effective for urban planning applications.
Abstract: Accurate reconstruction of missing morphological indicators of a city is crucial for urban planning and data-driven analysis. This study presents the spatial-morphological (SM) imputer tool, which combines data-driven morphological clustering with neighborhood-based methods to reconstruct missing values of the floor space index (FSI) and ground space index (GSI) at the city block level, inspired by the SpaceMatrix framework. This approach combines city-scale morphological patterns as global priors with local spatial information for context-dependent interpolation. The evaluation shows that while SM alone captures meaningful morphological structure, its combination with inverse distance weighting (IDW) or spatial k-nearest neighbor (sKNN) methods provides superior performance compared to existing SOTA models. Composite methods demonstrate the complementary advantages of combining morphological and spatial approaches.
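For reference, the spatial half of the composite reduces to textbook inverse distance weighting over nearby observed blocks. A minimal sketch follows; the SM morphological cluster prior would be blended with this estimate, and the function names are ours.

```python
import numpy as np

def idw_impute(coords, values, mask, power=2.0, k=8):
    """Fill missing block-level indicators (e.g., FSI/GSI) by inverse
    distance weighting over the k nearest observed blocks.
    coords: (n, 2) block centroids; values: (n,); mask: (n,) True if observed."""
    obs = np.where(mask)[0]                  # indices with observed values
    out = values.copy()
    for i in np.where(~mask)[0]:
        d = np.linalg.norm(coords[obs] - coords[i], axis=1)
        nearest = obs[np.argsort(d)[:k]]
        w = 1.0 / (np.linalg.norm(coords[nearest] - coords[i], axis=1) ** power
                   + 1e-9)
        out[i] = np.sum(w * values[nearest]) / w.sum()
    return out
```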
[429] From Circuits to Dynamics: Understanding and Stabilizing Failure in 3D Diffusion Transformers
Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins
Main category: cs.LG
TL;DR: 3D diffusion transformers for surface completion suffer from “Meltdown” - small input perturbations cause output fragmentation, traced to early cross-attention activation; PowerRemap test-time control stabilizes this failure.
Details
Motivation: The paper addresses a critical failure mode in state-of-the-art 3D diffusion transformers for surface completion from sparse point clouds, where arbitrarily small perturbations cause catastrophic fragmentation of outputs.
Method: Used mechanistic interpretability (activation-patching) to localize the failure to a single early denoising cross-attention activation. Analyzed its singular-value spectrum as a proxy for fragmentation. Developed PowerRemap, a test-time control method to stabilize conditioning.
Result: Identified Meltdown across multiple architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB), and denoising strategies (DDPM, DDIM). PowerRemap achieved stabilization rates up to 98.3% by preventing symmetry-breaking bifurcations.
Conclusion: This work demonstrates how mechanistic analysis of diffusion models can link circuit-level mechanisms to diffusion dynamics, enabling effective test-time interventions to stabilize model behavior.
Abstract: Reliable surface completion from sparse point clouds underpins many applications spanning content creation and robotics. While 3D diffusion transformers attain state-of-the-art results on this task, we uncover that they exhibit a catastrophic mode of failure: arbitrarily small on-surface perturbations to the input point cloud can fracture the output into multiple disconnected pieces – a phenomenon we call Meltdown. Using activation-patching from mechanistic interpretability, we localize Meltdown to a single early denoising cross-attention activation. We find that the singular-value spectrum of this activation provides a scalar proxy: its spectral entropy rises when fragmentation occurs and returns to baseline when patched. Interpreted through diffusion dynamics, we show that this proxy tracks a symmetry-breaking bifurcation of the reverse process. Guided by this insight, we introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning. We demonstrate that Meltdown persists across state-of-the-art architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB) and denoising strategies (DDPM, DDIM), and that PowerRemap effectively counters this failure with stabilization rates of up to 98.3%. Overall, this work is a case study on how diffusion model behavior can be understood and guided based on mechanistic analysis, linking a circuit-level cross-attention mechanism to diffusion-dynamics accounts of trajectory bifurcations.
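The scalar proxy itself is a one-liner: the Shannon entropy of the normalized singular-value spectrum of the flagged cross-attention activation. A generic sketch (the layer choice and any thresholding are the paper's and are not shown here):

```python
import torch

def spectral_entropy(activation):
    """Entropy of the normalized singular-value spectrum of a cross-attention
    activation (tokens x channels). Per the abstract, this scalar rises when
    the output fragments ("Meltdown") and returns to baseline when patched."""
    s = torch.linalg.svdvals(activation)   # singular values, descending
    p = s / s.sum()
    return -(p * torch.log(p + 1e-12)).sum()
```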
[430] CMAD: Cooperative Multi-Agent Diffusion via Stochastic Optimal Control
Riccardo Barbano, Alexander Denker, Zeljko Kereta, Runchang Li, Francisco Vargas
Main category: cs.LG
TL;DR: Proposes compositional generation as cooperative Stochastic Optimal Control, treating pre-trained diffusion models as interacting agents whose trajectories are jointly steered toward shared objectives.
Details
Motivation: Current approaches treat composition as algebraic composition of probability densities, assuming target distribution is known explicitly, which is rarely the case. Need better methods for controlling composition of multiple pre-trained models.
Method: Formulates compositional generation as cooperative Stochastic Optimal Control problem. Treats pre-trained diffusion models as interacting agents whose diffusion trajectories are jointly steered via optimal control toward shared objective defined on aggregated output.
Result: Validated on conditional MNIST generation and compared against naive inference-time DPS-style baseline replacing learned cooperative control with per-step gradient guidance.
Conclusion: Proposes new paradigm for compositional generation that moves beyond algebraic probability density composition to cooperative control of interacting diffusion agents.
Abstract: Continuous-time generative models have achieved remarkable success in image restoration and synthesis. However, controlling the composition of multiple pre-trained models remains an open challenge. Current approaches largely treat composition as an algebraic composition of probability densities, such as via products or mixtures of experts. This perspective assumes the target distribution is known explicitly, which is almost never the case. In this work, we propose a different paradigm that formulates compositional generation as a cooperative Stochastic Optimal Control problem. Rather than combining probability densities, we treat pre-trained diffusion models as interacting agents whose diffusion trajectories are jointly steered, via optimal control, toward a shared objective defined on their aggregated output. We validate our framework on conditional MNIST generation and compare it against a naive inference-time DPS-style baseline replacing learned cooperative control with per-step gradient guidance.
[431] GENIUS: Generative Fluid Intelligence Evaluation Suite
Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
Main category: cs.LG
TL;DR: GENIUS is a benchmark for evaluating Generative Fluid Intelligence (GFI) in multimodal models, focusing on pattern induction, constraint execution, and contextual adaptation rather than just knowledge recall.
Details
Motivation: Existing benchmarks for Unified Multimodal Models (UMMs) primarily assess crystallized intelligence (knowledge recall), overlooking generative fluid intelligence - the capacity for dynamic reasoning, pattern induction, and adaptation to novel scenarios.
Method: Introduces GENIUS benchmark that formalizes GFI as three primitives: inducing implicit patterns, executing ad-hoc constraints, and adapting to contextual knowledge. Evaluates 12 representative models and proposes a training-free attention intervention strategy to address identified deficits.
Result: Evaluation reveals significant performance deficits in GFI tasks across models. Diagnostic analysis shows failures stem from limited context comprehension rather than insufficient generative capability.
Conclusion: GENIUS establishes a rigorous standard for evaluating generative fluid intelligence, guiding multimodal models beyond knowledge utilization toward dynamic, general-purpose reasoning.
Abstract: Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks $\textit{Generative Fluid Intelligence (GFI)}$: the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce $\textbf{GENIUS}$ ($\textbf{GEN}$erative Fluid $\textbf{I}$ntelligence Eval$\textbf{U}$ation $\textbf{S}$uite). We formalize $\textit{GFI}$ as a synthesis of three primitives. These include $\textit{Inducing Implicit Patterns}$ (e.g., inferring personalized visual preferences), $\textit{Executing Ad-hoc Constraints}$ (e.g., visualizing abstract metaphors), and $\textit{Adapting to Contextual Knowledge}$ (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, $\textbf{GENIUS}$ establishes a rigorous standard for $\textit{GFI}$, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: $\href{https://github.com/arctanxarc/GENIUS}{https://github.com/arctanxarc/GENIUS}$.
[432] Stochastic Parroting in Temporal Attention – Regulating the Diagonal Sink
Victoria Hankemeier, Malte Hankemeier
Main category: cs.LG
TL;DR: Temporal attention mechanisms suffer from diagonal attention sink bias where attention scores concentrate on diagonal positions, causing information degeneration in spatio-temporal models.
Details
Motivation: Spatio-temporal models analyzing spatial structures and temporal dynamics are prone to information degeneration between space and time. Prior work shows over-squashing in causal attention or temporal convolutions creates bias on first tokens, but it's unclear if similar bias exists in temporal attention mechanisms.
Method: Derived sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. Theoretically analyzed how off-diagonal attention scores depend on sequence length, showing temporal attention matrices suffer from diagonal attention sink. Proposed regularization methods to address this issue.
Result: Theoretical analysis reveals temporal attention mechanisms exhibit diagonal attention sink bias where attention scores concentrate on diagonal positions. Experimental results demonstrate the effectiveness of the proposed regularization methods in mitigating this bias.
Conclusion: Temporal attention mechanisms suffer from diagonal attention sink bias, which can be mitigated through appropriate regularization methods, addressing information degeneration issues in spatio-temporal models.
Abstract: Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration among space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias on the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.
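The abstract does not spell out the regularizers, so the following is only one plausible instantiation: an auxiliary loss term that penalizes the average attention mass sitting on the diagonal of each temporal attention matrix.

```python
import torch

def diagonal_sink_penalty(attn):
    """Penalize attention mass concentrated on the diagonal of a temporal
    attention matrix (batch x heads x T x T). A plausible instantiation of
    the regularization the paper suggests; the exact form used in their
    experiments may differ."""
    return attn.diagonal(dim1=-2, dim2=-1).mean()

# Usage: loss = task_loss + 0.1 * diagonal_sink_penalty(attn_weights)
```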
[433] OSIL: Learning Offline Safe Imitation Policies with Safety Inferred from Non-preferred Trajectories
Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran
Main category: cs.LG
TL;DR: OSIL: Offline safe imitation learning algorithm that learns safe policies from demonstrations without explicit safety cost or reward labels, using non-preferred trajectories to infer safety constraints.
Details
Motivation: Online learning in real-world domains can be risky, and specifying accurate safety costs is difficult. While demonstrations without safety/reward labels are available, non-preferred trajectories (unsafe behavior) can implicitly convey what to avoid, enabling offline safe policy learning.
Method: Formulates safe policy learning as Constrained Markov Decision Process (CMDP), derives lower bound on reward maximizing objective, learns cost model estimating likelihood of non-preferred behavior from demonstrations, and infers safety from non-preferred trajectories without explicit safety cost annotations.
Result: Empirically demonstrates that OSIL learns safer policies satisfying cost constraints without degrading reward performance, outperforming several baselines in offline safe imitation learning.
Conclusion: OSIL enables agents to learn safe and reward-maximizing behavior entirely from offline demonstrations without explicit safety cost or reward information, addressing practical challenges in real-world safe imitation learning.
Abstract: This work addresses the problem of offline safe imitation learning (IL), where the goal is to learn safe and reward-maximizing policies from demonstrations that do not have per-timestep safety cost or reward information. In many real-world domains, online learning in the environment can be risky, and specifying accurate safety costs can be difficult. However, it is often feasible to collect trajectories that reflect undesirable or unsafe behavior, implicitly conveying what the agent should avoid. We refer to these as non-preferred trajectories. We propose a novel offline safe IL algorithm, OSIL, that infers safety from non-preferred demonstrations. We formulate safe policy learning as a Constrained Markov Decision Process (CMDP). Instead of relying on explicit safety cost and reward annotations, OSIL reformulates the CMDP problem by deriving a lower bound on reward maximizing objective and learning a cost model that estimates the likelihood of non-preferred behavior. Our approach allows agents to learn safe and reward-maximizing behavior entirely from offline demonstrations. We empirically demonstrate that our approach can learn safer policies that satisfy cost constraints without degrading the reward performance, thus outperforming several baselines.
[434] MoEEdit: Efficient and Routing-Stable Knowledge Editing for Mixture-of-Experts LLMs
Yupu Gu, Rongzhe Wei, Andy Zhu, Pan Li
Main category: cs.LG
TL;DR: MoEEdit: A routing-stable knowledge editing framework for sparse Mixture-of-Experts LLMs that prevents routing distribution shifts through per-expert null-space projections.
Details
Motivation: Existing knowledge editing methods are designed for dense LLM architectures and don't work well with sparse Mixture-of-Experts models, causing routing distribution shifts that undermine stability and efficiency when adapted naively.
Method: Reparameterizes expert updates via per-expert null-space projections to keep router inputs invariant, preventing routing shifts. Uses block coordinate descent solver for efficient block-structured optimization.
Result: MoEEdit achieves state-of-the-art efficacy and generalization while preserving high specificity and routing stability, with superior compute and memory efficiency compared to existing methods.
Conclusion: Establishes a robust foundation for scalable, precise knowledge editing in sparse LLMs and highlights the importance of routing-stable interventions for MoE architectures.
Abstract: Knowledge editing (KE) enables precise modifications to factual content in large language models (LLMs). Existing KE methods are largely designed for dense architectures, limiting their applicability to the increasingly prevalent sparse Mixture-of-Experts (MoE) models that underpin modern scalable LLMs. Although MoEs offer strong efficiency and capacity scaling, naively adapting dense-model editors is both computationally costly and prone to routing distribution shifts that undermine stability and consistency. To address these challenges, we introduce MoEEdit, the first routing-stable framework for parameter-modifying knowledge editing in MoE LLMs. Our method reparameterizes expert updates via per-expert null-space projections that keep router inputs invariant and thereby suppress routing shifts. The resulting block-structured optimization is solved efficiently with a block coordinate descent (BCD) solver. Experiments show that MoEEdit attains state-of-the-art efficacy and generalization while preserving high specificity and routing stability, with superior compute and memory efficiency. These results establish a robust foundation for scalable, precise knowledge editing in sparse LLMs and underscore the importance of routing-stable interventions.
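The reparameterization is in the family of null-space-constrained editing: project each expert's raw weight update away from the subspace of inputs whose downstream (router-visible) features must remain unchanged, so W_new x = W_old x on that subspace. The SVD-based construction below is a generic sketch of such a projection, not the paper's exact per-expert operator.

```python
import torch

def nullspace_project(delta_w, preserved_inputs, rcond=1e-5):
    """Project an expert weight update onto the null space of a set of cached
    inputs, so the edited expert's outputs (and hence the router inputs they
    feed downstream) stay identical on that subspace.
    delta_w: (d_out, d_in) raw edit; preserved_inputs: (n, d_in) cached inputs."""
    # Orthonormal basis of the row space spanned by the preserved inputs.
    _, s, vh = torch.linalg.svd(preserved_inputs, full_matrices=False)
    rank = int((s > rcond * s[0]).sum())
    v = vh[:rank]                              # (rank, d_in)
    # Remove the component of the update that acts on that subspace.
    return delta_w - (delta_w @ v.T) @ v
```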
[435] A Jointly Efficient and Optimal Algorithm for Heteroskedastic Generalized Linear Bandits with Adversarial Corruptions
Sanghwa Kim, Junghyun Lee, Se-Young Yun
Main category: cs.LG
TL;DR: A corruption-robust algorithm for heteroskedastic generalized linear bandits with adversarial corruptions, achieving near-optimal regret bounds across various contextual bandit settings.
Details
Motivation: Address the challenge of heteroskedastic generalized linear bandits with adversarial corruptions, which encompasses many practical bandit problems including heteroskedastic linear bandits and logistic/Poisson bandits where standard algorithms may fail under corruption.
Method: Propose HCW-GLB-OMD algorithm with two components: 1) an online mirror descent (OMD)-based estimator for parameter estimation, and 2) Hessian-based confidence weights to achieve corruption robustness. The method is computationally efficient with O(1) space and time complexity per iteration.
Result: Achieves regret bound of Õ(d√∑g(τ_t)μ̇_{t,⋆} + d²g_maxκ + dκC), where C is total corruption budget. Provide matching lower bound of Ω̃(d√∑g(τ_t)μ̇_{t,⋆} + dC), showing algorithm is instance-wise minimax optimal up to κ-factor in corruption term.
Conclusion: The proposed algorithm achieves near-optimal performance across various heteroskedastic GLB instances with adversarial corruptions, unifying previous problem-specific results and providing a general corruption-robust solution.
Abstract: We consider the problem of heteroskedastic generalized linear bandits (GLBs) with adversarial corruptions, which subsumes various stochastic contextual bandit settings, including heteroskedastic linear bandits and logistic/Poisson bandits. We propose HCW-GLB-OMD, which consists of two components: an online mirror descent (OMD)-based estimator and Hessian-based confidence weights to achieve corruption robustness. This is computationally efficient in that it only requires $O(1)$ space and time complexity per iteration. Under the self-concordance assumption on the link function, we show a regret bound of $\tilde{O}\left( d \sqrt{\sum_t g(\tau_t) \dot{\mu}_{t,\star}} + d^2 g_{\max} \kappa + d \kappa C \right)$, where $\dot{\mu}_{t,\star}$ is the slope of $\mu$ around the optimal arm at time $t$, the $g(\tau_t)$'s are potentially exogenously time-varying dispersions (e.g., $g(\tau_t) = \sigma_t^2$ for heteroskedastic linear bandits, $g(\tau_t) = 1$ for Bernoulli and Poisson), $g_{\max} = \max_{t \in [T]} g(\tau_t)$ is the maximum dispersion, and $C \geq 0$ is the total corruption budget of the adversary. We complement this with a lower bound of $\tilde{\Omega}(d \sqrt{\sum_t g(\tau_t) \dot{\mu}_{t,\star}} + d C)$, unifying previous problem-specific lower bounds. Thus, our algorithm achieves, up to a $\kappa$-factor in the corruption term, instance-wise minimax optimality simultaneously across various instances of heteroskedastic GLBs with adversarial corruptions.
[436] In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution
Frank Xiao, Santiago Aranguri
Main category: cs.LG
TL;DR: Activation-based data attribution method traces behavioral changes in language models to specific training datapoints, enabling identification and mitigation of harmful behaviors like distractor-triggered compliance.
Details
Motivation: To develop a method that can causally attribute specific behaviors in post-trained language models to responsible training datapoints, particularly for identifying and mitigating harmful behaviors that emerge from contaminated preference data in real-world training scenarios.
Method: Computes activation-difference vectors for test prompts and preference pairs, ranks training datapoints by cosine similarity to identify responsible datapoints, validates attributions causally through retraining with modified data, and clusters behavior-datapoint similarity matrices for unsupervised discovery of emergent behaviors.
Result: Applied to OLMo 2’s DPO training, discovered “distractor-triggered compliance” - a harmful behavior where models comply with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduced this behavior by 63%, while switching their labels achieved 78% reduction. Method outperforms gradient-based attribution and LLM-judge baselines while being 10x cheaper.
Conclusion: Activation-based data attribution provides an effective, efficient method for tracing behavioral changes to training data, enabling identification and mitigation of harmful emergent behaviors in real-world language models, with applications for safety techniques and model debugging.
Abstract: We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for both test prompts and preference pairs and ranking by cosine similarity, we identify datapoints that cause specific behaviors and validate these attributions causally by retraining with modified data. Clustering behavior-datapoint similarity matrices also enables unsupervised discovery of emergent behaviors. Applying this to OLMo 2’s production DPO training, we surfaced distractor-triggered compliance: a harmful behavior where the model complies with dangerous requests when benign formatting instructions are appended. Filtering top-ranked datapoints reduces this behavior by 63% while switching their labels achieves 78%. Our method outperforms gradient-based attribution and LLM-judge baselines while being over 10 times cheaper than both. This in-the-wild model organism - emerging from contaminated preference data rather than deliberate injection - provides a realistic benchmark for safety techniques.
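The ranking step is lightweight enough to show in full. In the sketch below, activation differences are assumed to be residual-stream activations averaged over tokens, taken before vs. after post-training for the test prompt and chosen-minus-rejected for each preference pair; the layer choice and pooling are our assumptions, not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def rank_training_pairs(test_act_diff, pair_act_diffs):
    """Rank preference pairs by cosine similarity between activation-difference
    vectors: `test_act_diff` is a (d,) difference vector for a
    behavior-eliciting prompt; `pair_act_diffs` is an (N, d) matrix of
    per-pair difference vectors. Returns indices, most responsible first."""
    sims = F.cosine_similarity(pair_act_diffs,
                               test_act_diff.unsqueeze(0), dim=1)
    return torch.argsort(sims, descending=True)
```

The top-ranked indices are then the candidates for the filtering or label-switching interventions the abstract reports.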
[437] Sample Efficient Generative Molecular Optimization with Joint Self-Improvement
Serra Korkmaz, Adam Izdebski, Jonathan Pirnay, Rasmus Møller-Larsen, Michal Kmicikiewicz, Pankhil Gawade, Dominik G. Grimm, Ewa Szczurek
Main category: cs.LG
TL;DR: Joint Self-Improvement: A method combining joint generative-predictive modeling with self-improving sampling for efficient molecular optimization under limited evaluation budgets.
Details
Motivation: Molecular optimization requires designing molecules with superior properties, but candidate evaluation is expensive and rare. Surrogate models suffer from distribution shift as optimization pushes candidates out-of-distribution, creating sample efficiency challenges.
Method: Introduces Joint Self-Improvement with two components: (1) a joint generative-predictive model that aligns generator with surrogate to reduce distribution shift, and (2) a self-improving sampling scheme that biases the generative component using the predictive model to efficiently generate optimized molecules at inference time.
Result: Experiments across offline and online molecular optimization benchmarks show that Joint Self-Improvement outperforms state-of-the-art methods under limited evaluation budgets.
Conclusion: The proposed approach effectively addresses distribution shift and sample efficiency challenges in molecular optimization through joint modeling and self-improving sampling.
Abstract: Generative molecular optimization aims to design molecules with properties surpassing those of existing compounds. However, such candidates are rare and expensive to evaluate, making sample efficiency essential. Additionally, surrogate models, introduced to predict molecule evaluations, suffer from distribution shift as optimization drives candidates increasingly out-of-distribution. To address these challenges, we introduce Joint Self-Improvement, which benefits from (i) a joint generative-predictive model and (ii) a self-improving sampling scheme. The former aligns the generator with the surrogate, alleviating distribution shift, while the latter biases the generative part of the joint model using the predictive one to efficiently generate optimized molecules at inference time. Experiments across offline and online molecular optimization benchmarks demonstrate that Joint Self-Improvement outperforms state-of-the-art methods under limited evaluation budgets.
[438] GRASP: group-Shapley feature selection for patients
Yuheng Luo, Shuyan Li, Zhong Cao
Main category: cs.LG
TL;DR: GRASP is a feature selection framework for medical prediction that combines Shapley value attribution with group L21 regularization to extract compact, non-redundant feature sets.
Details
Motivation: Existing feature selection methods like LASSO lack robustness and interpretability in medical prediction tasks. There's a need for methods that can identify stable, non-redundant feature sets while maintaining predictive accuracy.
Method: GRASP first uses SHAP (Shapley Additive exPlanations) to distill group-level importance scores from a pretrained tree model. It then applies group L21 regularization in logistic regression to enforce structured sparsity, resulting in compact feature selections.
Result: GRASP consistently delivers comparable or superior predictive accuracy compared to LASSO, SHAP, and deep learning methods, while identifying fewer, less redundant, and more stable features.
Conclusion: GRASP provides a robust and interpretable feature selection framework for medical prediction that addresses limitations of existing methods by combining Shapley value attribution with structured regularization.
Abstract: Feature selection remains a major challenge in medical prediction, where existing approaches such as LASSO often lack robustness and interpretability. We introduce GRASP, a novel framework that couples Shapley value driven attribution with group $L_{21}$ regularization to extract compact and non-redundant feature sets. GRASP first distills group level importance scores from a pretrained tree model via SHAP, then enforces structured sparsity through group $L_{21}$ regularized logistic regression, yielding stable and interpretable selections. Extensive comparisons with LASSO, SHAP, and deep learning based methods show that GRASP consistently delivers comparable or superior predictive accuracy, while identifying fewer, less redundant, and more stable features.
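The second stage is a standard group-lasso fit. A self-contained proximal-gradient sketch follows; the group assignments would come from the SHAP stage, and the step size, penalty weight, and iteration count are illustrative.

```python
import numpy as np

def group_l21_logreg(X, y, groups, lam=0.1, lr=0.1, iters=500):
    """Logistic regression with a group-L21 penalty, fitted by proximal
    gradient descent: each feature group is shrunk jointly, so whole groups
    drop out together. `groups` maps each feature index to a group id."""
    n, d = X.shape
    w = np.zeros(d)
    gids = np.unique(groups)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))            # sigmoid
        w -= lr * X.T @ (p - y) / n                   # gradient step on NLL
        for g in gids:                                # group soft-thresholding
            idx = groups == g
            norm = np.linalg.norm(w[idx])
            w[idx] = 0.0 if norm <= lr * lam else w[idx] * (1 - lr * lam / norm)
    return w
```

Features whose groups survive the soft-thresholding form the selected set.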
[439] TVCACHE: A Stateful Tool-Value Cache for Post-Training LLM Agents
Abhishek Vijaya Kumar, Bhaskar Kataria, Byungsoo Oh, Emaad Manzoor, Rachee Singh
Main category: cs.LG
TL;DR: TVCACHE is a stateful tool-value cache for LLM agent post-training that reduces idle GPU time by caching tool outputs with environment state awareness, achieving up to 70% cache hit rates and 6.9X speedup.
Details
Motivation: In RL post-training of LLM agents, external tool calls take seconds to minutes, leaving GPUs idle and inflating training time/cost. While many tool invocations repeat across parallel rollouts, naive caching is incorrect because tool outputs depend on environment state from prior agent interactions.
Method: TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups. A cache hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state.
Result: On three diverse workloads (terminal-based tasks, SQL generation, and video understanding), TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9X, with no degradation in post-training reward accumulation.
Conclusion: TVCACHE effectively reduces LLM agent post-training time and cost by intelligently caching tool outputs while maintaining correctness through state-aware caching, making RL post-training more efficient.
Abstract: In RL post-training of LLM agents, calls to external tools take several seconds or even minutes, leaving allocated GPUs idle and inflating post-training time and cost. While many tool invocations repeat across parallel rollouts and could in principle be cached, naively caching their outputs for reuse is incorrect since tool outputs depend on the environment state induced by prior agent interactions. We present TVCACHE, a stateful tool-value cache for LLM agent post-training. TVCACHE maintains a tree of observed tool-call sequences and performs longest-prefix matching for cache lookups: a hit occurs only when the agent's full tool history matches a previously executed sequence, guaranteeing identical environment state. On three diverse workloads (terminal-based tasks, SQL generation, and video understanding), TVCACHE achieves cache hit rates of up to 70% and reduces median tool call execution time by up to 6.9X, with no degradation in post-training reward accumulation.
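The state-aware cache is essentially a trie keyed by tool-call histories. A minimal sketch (the hashable call tuples and this exact node layout are our simplifications):

```python
class ToolValueCache:
    """Minimal stateful tool-value cache in the spirit of TVCACHE: outputs are
    keyed by the agent's *entire* tool-call history, stored as a trie, so a
    hit is only possible when the full prefix of prior calls matches,
    guaranteeing the environment reached the same state."""

    def __init__(self):
        self.root = {}           # call -> {"out": output, "next": subtree}

    def lookup(self, history):
        """history: tuple of hashable tool calls. Returns the cached output of
        the last call iff the whole sequence was seen before; otherwise None
        (a miss, so the real tool must be executed)."""
        node, out = self.root, None
        for call in history:
            if call not in node:
                return None
            out, node = node[call]["out"], node[call]["next"]
        return out

    def insert(self, history, output):
        node = self.root
        for call in history[:-1]:
            node = node.setdefault(call, {"out": None, "next": {}})["next"]
        node.setdefault(history[-1], {"out": None, "next": {}})["out"] = output


# Usage: same history hits; a different prior call (different state) misses.
cache = ToolValueCache()
cache.insert((("ls", "/tmp"),), "file_a\nfile_b")
assert cache.lookup((("ls", "/tmp"),)) == "file_a\nfile_b"
assert cache.lookup((("cd", "/tmp"), ("ls", "."))) is None
```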
[440] General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies
Jianxun Wang, Grant C. Forbes, Leonardo Villalobos-Arias, David L. Roberts
Main category: cs.LG
TL;DR: Offline RL algorithm using flexible f-divergence constraints to balance RL objectives with behavior policy constraints when learning from limited/diverse datasets
Details
Motivation: Practical offline RL datasets often have limited exploration and diverse behavior policies, making standard constraints too conservative. Need to balance RL objectives with appropriate constraints based on dataset characteristics.
Method: Identifies connection between f-divergence and Bellman residual constraints via Linear Programming form for RL and convex conjugate. Introduces flexible f-divergence formulation with adaptive constraints based on dataset characteristics.
Result: Experiments on MuJoCo, Fetch, and AdroitHand environments show correctness of LP form and improved performance with flexible f-divergence when learning from challenging datasets.
Conclusion: Flexible f-divergence constraints can effectively balance RL objectives with behavior policy constraints, improving offline RL performance on datasets with limited exploration and diverse behavior policies.
Abstract: Offline RL algorithms aim to improve upon the behavior policy that produces the collected data while constraining the learned policy to be within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, and from multiple behavior policies with diverse expertise levels. Limited exploration can impair the offline RL algorithm’s ability to estimate \textit{Q} or \textit{V} values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between $f$-divergence and optimization constraint on the Bellman residual through a more general Linear Programming form for RL and the convex conjugate. Following this, we introduce the general flexible function formulation for the $f$-divergence to incorporate an adaptive constraint on algorithms’ learning objectives based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence in improving performance for learning from a challenging dataset when applied to a compatible constrained optimization algorithm.
[441] When Fusion Helps and When It Breaks: View-Aligned Robustness in Same-Source Financial Imaging
Rui Ma
Main category: cs.LG
TL;DR: Multi-view learning for financial prediction using price charts and technical indicators, with analysis of fusion strategies and adversarial robustness under different attack scenarios.
Details
Motivation: To study same-source multi-view learning and adversarial robustness for next-day direction prediction in financial markets using image representations of financial data.
Method: Constructs two window-aligned views from rolling windows: OHLCV-rendered price/volume charts and technical-indicator matrices. Uses leakage-resistant time-block splits with embargo. Evaluates early fusion (channel stacking) vs late fusion (dual encoders with fusion head) with cross-view consistency regularization. Tests adversarial robustness with FGSM and PGD attacks under view-constrained and joint attack scenarios.
Result: Results depend strongly on label-noise regime. Late fusion provides dominant clean-performance gains, while early fusion can exhibit negative transfer. Cross-view consistency regularization has secondary, backbone-dependent effects. Models show severe vulnerability to tiny adversarial budgets with strong view asymmetry. Late fusion improves robustness under view-constrained attacks, but joint attacks remain challenging.
Conclusion: Multi-view learning with financial image representations shows promise but requires careful handling of label noise and fusion strategies. Adversarial robustness is a significant concern, with late fusion offering some protection against view-constrained attacks but vulnerability to coordinated joint attacks.
Abstract: We study same-source multi-view learning and adversarial robustness for next-day direction prediction with financial image representations. On Shanghai Gold Exchange (SGE) spot gold data (2005-2025), we construct two window-aligned views from each rolling window: an OHLCV-rendered price/volume chart and a technical-indicator matrix. To ensure reliable evaluation, we adopt leakage-resistant time-block splits with embargo and use Matthews correlation coefficient (MCC). We find that results depend strongly on the label-noise regime: we apply an ex-post minimum-movement filter that discards samples with realized next-day absolute return below tau to define evaluation subsets with reduced near-zero label ambiguity. This induces a non-monotonic data-noise trade-off that can reveal predictive signal but eventually increases variance as sample size shrinks; the filter is used for offline benchmark construction rather than an inference-time decision rule. In the stabilized subsets, fusion is regime dependent: early fusion by channel stacking can exhibit negative transfer, whereas late fusion with dual encoders and a fusion head provides the dominant clean-performance gains; cross-view consistency regularization has secondary, backbone-dependent effects. We further evaluate test-time L-infinity perturbations using FGSM and PGD under two threat scenarios: view-constrained attacks that perturb one view and joint attacks that perturb both. We observe severe vulnerability at tiny budgets with strong view asymmetry. Late fusion consistently improves robustness under view-constrained attacks, but joint attacks remain challenging and can still cause substantial worst-case degradation.
[442] Direct Learning of Calibration-Aware Uncertainty for Neural PDE Surrogates
Carlos Stein Brito
Main category: cs.LG
TL;DR: Cross-regularized uncertainty framework learns uncertainty parameters during training using a regularization split to reduce train-test mismatch, applied to neural PDE surrogates like Fourier Neural Operators.
Details
Motivation: Neural PDE surrogates need calibrated uncertainty in data-limited or partially observed regimes for downstream decision-making. Existing uncertainty methods (ensembles, dropout, post-hoc calibration) have limitations, requiring per-regime tuning or lacking adaptivity.
Method: Cross-regularized uncertainty learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on training split for fit, while low-dimensional uncertainty controls are optimized on regularization split to reduce train-test mismatch. Can learn continuous noise levels at output head, within hidden features, or within operator-specific components like spectral modes. Implemented in Fourier Neural Operators.
Result: Evaluated on APEBench sweeps over observed fraction and training-set size. Learned predictive distributions are better calibrated on held-out splits, and uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
Conclusion: Cross-regularized uncertainty provides regime-adaptive uncertainty without per-regime noise tuning, yielding better calibrated uncertainty for neural PDE surrogates in data-limited scenarios.
Abstract: Neural PDE surrogates are often deployed in data-limited or partially observed regimes where downstream decisions depend on calibrated uncertainty in addition to low prediction error. Existing approaches obtain uncertainty through ensemble replication, fixed stochastic noise such as dropout, or post hoc calibration. Cross-regularized uncertainty learns uncertainty parameters during training using gradients routed through a held-out regularization split. The predictor is optimized on the training split for fit, while low-dimensional uncertainty controls are optimized on the regularization split to reduce train-test mismatch, yielding regime-adaptive uncertainty without per-regime noise tuning. The framework can learn continuous noise levels at the output head, within hidden features, or within operator-specific components such as spectral modes. We instantiate the approach in Fourier Neural Operators and evaluate on APEBench sweeps over observed fraction and training-set size. Across these sweeps, the learned predictive distributions are better calibrated on held-out splits and the resulting uncertainty fields concentrate in high-error regions in one-step spatial diagnostics.
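The gradient routing reduces to two optimizers over disjoint parameter sets fed by disjoint splits. A minimal sketch with a single global log-noise parameter and a Gaussian likelihood (both our assumptions; the paper also places learned noise in hidden features or spectral modes of an FNO):

```python
import torch

def gaussian_nll(pred, y, log_sigma):
    # Gaussian negative log-likelihood with one learned global noise level.
    return (0.5 * (y - pred) ** 2 * torch.exp(-2 * log_sigma) + log_sigma).mean()

def cross_regularized_step(model, log_sigma, opt_fit, opt_unc,
                           train_batch, reg_batch):
    """One illustrative step: predictor weights are fit on the training split,
    while `log_sigma` (e.g., torch.zeros((), requires_grad=True), owned by
    opt_unc) gets its gradient from a held-out regularization split."""
    x_tr, y_tr = train_batch
    x_rg, y_rg = reg_batch

    # Fit step: predictor parameters trained on the training split.
    opt_fit.zero_grad()
    gaussian_nll(model(x_tr), y_tr, log_sigma.detach()).backward()
    opt_fit.step()

    # Uncertainty step: only log_sigma receives a gradient, routed through
    # the regularization split to reduce train/test mismatch.
    opt_unc.zero_grad()
    with torch.no_grad():
        pred = model(x_rg)
    gaussian_nll(pred, y_rg, log_sigma).backward()
    opt_unc.step()
```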
[443] Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models
Xinyu Yuan, Yan Qiao, Zonghui Wang, Wenzhi Chen
Main category: cs.LG
TL;DR: Pram is an ML-based method using multimodal language models to solve multi-commodity flow problems, achieving near-optimal solutions with significantly faster runtime than traditional solvers.
Details
Motivation: Existing optimization engines struggle to balance optimality and tractability in large-scale allocation systems. There's a need for practical solutions that can handle the trade-off dilemma faced by service providers in network flow problems.
Method: Pram divides the original MCF problem into local subproblems solved by MLM-powered agents, then ensures global consistency via multi-agent reinforcement learning. The method learns to perform gradient descent in context.
Result: Pram achieves performance comparable to linear programming solvers (very close to optimal), with 1-2 orders of magnitude faster runtime. It shows strong robustness (<10% performance degradation under failures) and generalization to unforeseen events.
Conclusion: Pram provides a practical, scalable solution for MCF problems that integrates with mainstream allocation systems, demonstrating MLMs’ potential for combinatorial optimization tasks.
Abstract: The multi-commodity flow (MCF) problem is a fundamental topic in network flow and combinatorial optimization, with broad applications in transportation, communication, and logistics. Nowadays, the rapid expansion of allocation systems has posed challenges for existing optimization engines in balancing optimality and tractability. In this paper, we present Pram, the first ML-based method that leverages the reasoning power of multimodal language models (MLMs) for addressing this trade-off dilemma – a pressing need for service providers. As part of our proposal, Pram (i) quickly computes high-quality allocations by dividing the original problem into local subproblems, which are then resolved by an MLM-powered “agent”, and (ii) ensures global consistency by harmonizing these subproblems via a multi-agent reinforcement learning algorithm. Theoretically, we show that Pram, which learns to perform gradient descent in context, provably converges to the optimum within the family of MCF problems. Empirically, on real-world datasets and public topologies, Pram achieves performance comparable to, and in some cases even surpassing, linear programming solvers (very close to the optimal solution), with substantially lower runtimes (1 to 2 orders of magnitude faster). Moreover, Pram exhibits strong robustness (<10% performance degradation under link failures or flow bursts), demonstrating MLMs’ generalization ability to unforeseen events. Pram is objective-agnostic and seamlessly integrates with mainstream allocation systems, providing a practical and scalable solution for future networks.
[444] MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation
Jialin Liu, Zhaorui Zhang, Ray C. C. Cheung
Main category: cs.LG
TL;DR: MoToRec transforms multimodal recommendation into discrete semantic tokenization using a sparsely-regularized RQ-VAE to address data sparsity and item cold-start problems in recommender systems.
Details
Motivation: GNN-based recommender systems struggle with data sparsity and item cold-start problems, especially for new items with limited interaction history. While multimodal content offers potential solutions, existing methods produce suboptimal representations due to noise and entanglement in sparse data.
Method: Proposes MoToRec framework with three components: (1) sparsely-regularized Residual Quantized VAE for generating compositional semantic codes of discrete, interpretable tokens, (2) adaptive rarity amplification for prioritized learning of cold-start items, and (3) hierarchical multi-source graph encoder for robust signal fusion with collaborative signals.
Result: Extensive experiments on three large-scale datasets demonstrate MoToRec’s superiority over state-of-the-art methods in both overall and cold-start scenarios.
Conclusion: Discrete tokenization provides an effective and scalable alternative for mitigating the long-standing cold-start challenge in multimodal recommendation systems.
Abstract: Graph neural networks (GNNs) have revolutionized recommender systems by effectively modeling complex user-item interactions, yet data sparsity and the item cold-start problem significantly impair performance, particularly for new items with limited or no interaction history. While multimodal content offers a promising solution, existing methods result in suboptimal representations for new items due to noise and entanglement in sparse data. To address this, we transform multimodal recommendation into discrete semantic tokenization. We present Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation (MoToRec), a framework centered on a sparsely-regularized Residual Quantized Variational Autoencoder (RQ-VAE) that generates a compositional semantic code of discrete, interpretable tokens, promoting disentangled representations. MoToRec’s architecture is enhanced by three synergistic components: (1) a sparsely-regularized RQ-VAE that promotes disentangled representations, (2) a novel adaptive rarity amplification that promotes prioritized learning for cold-start items, and (3) a hierarchical multi-source graph encoder for robust signal fusion with collaborative signals. Extensive experiments on three large-scale datasets demonstrate MoToRec’s superiority over state-of-the-art methods in both overall and cold-start scenarios. Our work validates that discrete tokenization provides an effective and scalable alternative for mitigating the long-standing cold-start challenge.
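Since the framework centers on residual quantization, here is a minimal sketch of the RQ encoding step: each stage quantizes the residual left by the previous stage, yielding a compositional code of discrete tokens. Codebooks here are random for illustration; MoToRec's sparsity regularizer and rarity amplification are omitted.

```python
# Sketch of residual quantization (RQ) as used in an RQ-VAE. Codebooks and
# embeddings are placeholders, not the paper's trained components.
import torch

def rq_encode(z, codebooks):
    """z: (B, D); codebooks: list of (K, D) tensors. Returns token ids per stage."""
    residual, codes = z.clone(), []
    for cb in codebooks:
        d = torch.cdist(residual, cb)      # (B, K) distances to codewords
        idx = d.argmin(dim=1)              # nearest codeword = discrete token
        codes.append(idx)
        residual = residual - cb[idx]      # pass the residual to the next stage
    return codes

z = torch.randn(4, 16)                     # e.g. a fused multimodal item embedding
codebooks = [torch.randn(256, 16) for _ in range(3)]
print([c.tolist() for c in rq_encode(z, codebooks)])
```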
[445] Motion Capture is Not the Target Domain: Scaling Synthetic Data for Learning Motion Representations
Firas Darwish, George Nicholson, Aiden Doherty, Hang Yuan
Main category: cs.LG
TL;DR: Synthetic motion data pretraining improves HAR model generalization when mixed with real data or scaled sufficiently, but large-scale motion-capture pretraining has limited gains due to domain mismatch with wearable signals.
Details
Motivation: Synthetic data offers scalable pretraining when real-world data is scarce, particularly in full-body human motion where large-scale data collection is infeasible but essential for wearable-based Human Activity Recognition (HAR).
Method: Pretrain motion time-series models using synthetic motion data generated from motion-capture-derived representations, then evaluate transfer across diverse downstream HAR tasks.
Result: Synthetic pretraining improves generalization when mixed with real data or scaled sufficiently. Large-scale motion-capture pretraining yields only marginal gains due to domain mismatch with wearable signals.
Conclusion: Clarifies key sim-to-real challenges and the limits/opportunities of synthetic motion data for transferable HAR representations, highlighting domain mismatch as a critical issue.
Abstract: Synthetic data offers a compelling path to scalable pretraining when real-world data is scarce, but models pretrained on synthetic data often fail to transfer reliably to deployment settings. We study this problem in full-body human motion, where large-scale data collection is infeasible but essential for wearable-based Human Activity Recognition (HAR), and where synthetic motion can be generated from motion-capture-derived representations. We pretrain motion time-series models using such synthetic data and evaluate their transfer across diverse downstream HAR tasks. Our results show that synthetic pretraining improves generalisation when mixed with real data or scaled sufficiently. We also demonstrate that large-scale motion-capture pretraining yields only marginal gains due to domain mismatch with wearable signals, clarifying key sim-to-real challenges and the limits and opportunities of synthetic motion data for transferable HAR representations.
[446] Token-Efficient Change Detection in LLM APIs
Timothée Chauvin, Clément Lalanne, Erwan Le Merrer, Jean-Michel Loubes, François Taïani, Gilles Tredan
Main category: cs.LG
TL;DR: B3IT is a black-box method for detecting changes in LLMs using border inputs where multiple output tokens are equally likely, achieving 30x cost reduction compared to existing methods.
Details
Motivation: Existing methods for remote change detection in LLMs are either too expensive for large-scale deployment or require white-box/grey-box access to model weights or log probabilities. There's a need for low-cost, strict black-box methods that only observe output tokens.
Method: Proposes the Black-Box Border Input Tracking (B3IT) scheme, which uses specific “border inputs” for which more than one top output token exists. The approach leverages statistical analysis showing optimal change detection depends on the model’s Jacobian and Fisher information, with border inputs enabling powerful detection tests in low-temperature regimes.
Result: Extensive experiments show border inputs are easily found for non-reasoning tested endpoints, achieving performance comparable to best available grey-box approaches. B3IT reduces costs by 30x compared to existing methods while operating in strict black-box setting.
Conclusion: B3IT provides an effective, low-cost solution for black-box change detection in LLMs using border inputs, offering significant cost advantages over existing methods while maintaining strong detection performance.
Abstract: Remote change detection in LLMs is a difficult problem. Existing methods are either too expensive for deployment at scale, or require initial white-box access to model weights or grey-box access to log probabilities. We aim to achieve both low cost and strict black-box operation, observing only output tokens. Our approach hinges on specific inputs we call Border Inputs, for which more than one top output token exists. From a statistical perspective, optimal change detection depends on the model’s Jacobian and the Fisher information of the output distribution. Analyzing these quantities in low-temperature regimes shows that border inputs enable powerful change detection tests. Building on this insight, we propose the Black-Box Border Input Tracking (B3IT) scheme. Extensive in-vivo and in-vitro experiments show that border inputs are easily found for the non-reasoning endpoints we test, and achieve performance on par with the best available grey-box approaches. B3IT reduces costs by $30\times$ compared to existing methods, while operating in a strict black-box setting.
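As a rough illustration of why border inputs help: at an input where two tokens are near 50/50, the sampled top-token frequency is a Bernoulli whose parameter is very sensitive to model changes, so a simple binomial test can flag a swap. This is only a statistical caricature under that assumption, not B3IT's actual tracking protocol.

```python
# Sketch of change detection at a border input, assuming strict black-box
# access (only sampled output tokens). Counts and the alpha level are illustrative.
from scipy.stats import binomtest

def detect_change(count_a_before, n_before, count_a_after, n_after, alpha=0.01):
    p_ref = count_a_before / n_before            # reference rate of token A
    test = binomtest(count_a_after, n_after, p_ref)
    return test.pvalue < alpha                   # True => model likely changed

# 48% token-A rate before, 80% after: flagged as a change.
print(detect_change(480, 1000, 800, 1000))
```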
[447] MerLin: A Discovery Engine for Photonic and Hybrid Quantum Machine Learning
Cassandre Notton, Benjamin Stott, Philippe Schoeb, Anthony Walsh, Grégoire Leboucher, Vincent Espitalier, Vassilis Apostolou, Louis-Félix Vigneux, Alexia Salavrakos, Jean Senellart
Main category: cs.LG
TL;DR: MerLin is an open-source framework for systematic benchmarking and reproducible research in photonic and hybrid quantum machine learning, integrating quantum circuit simulation with standard ML workflows.
Details
Motivation: The paper addresses the need for systematic empirical exploration of quantum machine learning models across different datasets and hardware constraints, moving beyond isolated algorithmic proposals to establish reproducible benchmarks and shared experimental baselines.
Method: Developed MerLin framework that integrates optimized strong simulation of linear optical circuits into PyTorch and scikit-learn workflows, enabling end-to-end differentiable training of quantum layers. The framework includes systematic benchmarking capabilities and reproduces 18 state-of-the-art photonic and hybrid QML works.
Result: Successfully reproduced 18 state-of-the-art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional/recurrent architectures, generative models, and modern training paradigms. Released these as reusable, modular experiments establishing a shared experimental baseline.
Conclusion: MerLin provides a discovery engine for photonic and hybrid QML that enables practitioners to leverage existing ML tooling for ablation studies, cross-modality comparisons, and hybrid classical-quantum workflows, positioning it as a future-proof co-design tool linking algorithms, benchmarks, and hardware.
Abstract: Identifying where quantum models may offer practical benefits in near-term quantum machine learning (QML) requires moving beyond isolated algorithmic proposals toward systematic and empirical exploration across models, datasets, and hardware constraints. We introduce MerLin, an open-source framework designed as a discovery engine for photonic and hybrid quantum machine learning. MerLin integrates optimized strong simulation of linear optical circuits into standard PyTorch and scikit-learn workflows, enabling end-to-end differentiable training of quantum layers. MerLin is designed around systematic benchmarking and reproducibility. As an initial contribution, we reproduce eighteen state-of-the-art photonic and hybrid QML works spanning kernel methods, reservoir computing, convolutional and recurrent architectures, generative models, and modern training paradigms. These reproductions are released as reusable, modular experiments that can be directly extended and adapted, establishing a shared experimental baseline consistent with empirical benchmarking methodologies widely adopted in modern artificial intelligence. By embedding photonic quantum models within established machine learning ecosystems, MerLin allows practitioners to leverage existing tooling for ablation studies, cross-modality comparisons, and hybrid classical-quantum workflows. The framework already implements hardware-aware features, allowing tests on available quantum hardware while enabling exploration beyond its current capabilities, positioning MerLin as a future-proof co-design tool linking algorithms, benchmarks, and hardware.
[448] Statistical Learning Analysis of Physics-Informed Neural Networks
David A. Barajas-Solano
Main category: cs.LG
TL;DR: Statistical learning perspective on Physics-Informed Neural Networks (PINNs) for initial/boundary value problems, analyzing training as fitting residual distributions via KL divergence and using singular learning theory.
Details
Motivation: To provide a statistical learning framework for understanding PINNs training, moving beyond viewing physics constraints as regularization to treating them as infinite indirect data sources.
Method: Reformulates PINN parameter estimation as statistical learning problem with hard constraints, analyzes physics penalty as infinite indirect data, uses KL divergence between true and PINN residual distributions, and applies singular learning theory with Local Learning Coefficient.
Result: Shows physics-informed learning with PINNs is a singular learning problem, provides analysis framework using Local Learning Coefficient for heat equation IBVP, and discusses implications for predictive uncertainty quantification and extrapolation capacity.
Conclusion: Provides statistical learning perspective on PINNs that offers deeper understanding of training dynamics, uncertainty quantification, and generalization capabilities beyond traditional regularization views.
Abstract: We study the training and performance of physics-informed learning for initial and boundary value problems (IBVP) with physics-informed neural networks (PINNs) from a statistical learning perspective. Specifically, we restrict ourselves to parameterizations with hard initial and boundary condition constraints and reformulate the problem of estimating PINN parameters as a statistical learning problem. From this perspective, the physics penalty on the IBVP residuals can be better understood not as a regularizing term but as an infinite source of indirect data, and the learning process as fitting the PINN distribution of residuals $p(y \mid x, t, w)\, q(x, t)$ to the true data-generating distribution $\delta(0)\, q(x, t)$ by minimizing the Kullback-Leibler divergence between the true and PINN distributions. Furthermore, this analysis shows that physics-informed learning with PINNs is a singular learning problem, and we employ singular learning theory tools, namely the so-called Local Learning Coefficient (Lau et al., 2025), to analyze the estimates of PINN parameters obtained via stochastic optimization for a heat equation IBVP. Finally, we discuss implications of this analysis for the quantification of predictive uncertainty of PINNs and the extrapolation capacity of PINNs.
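To spell out the step from the KL objective to the familiar physics penalty (a standard manipulation under the abstract's notation; the Gaussian residual model below is an assumption for illustration):

```latex
\mathrm{KL}\!\left(\delta(0)\,q(x,t)\;\middle\|\;p(y \mid x,t,w)\,q(x,t)\right)
  = \mathbb{E}_{(x,t)\sim q}\!\left[-\log p(0 \mid x,t,w)\right] + C,
```

where $C$ collects terms independent of $w$. With a Gaussian residual model $p(y \mid x,t,w) = \mathcal{N}\!\left(y;\, R[u_w](x,t),\, \sigma^2\right)$, where $R[u_w]$ denotes the IBVP residual of the PINN solution, the expected negative log-likelihood reduces (up to constants) to $\mathbb{E}_{q}\!\left[R[u_w](x,t)^2\right]/(2\sigma^2)$, i.e., the usual mean-squared physics penalty.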
[449] From Natural Language to Materials Discovery: The Materials Knowledge Navigation Agent
Genmao Zhuang, Amir Barati Farimani
Main category: cs.LG
TL;DR: MKNA is a language-driven AI agent that translates natural language scientific queries into executable actions for materials discovery, automating database retrieval, property prediction, structure generation, and stability evaluation.
Details
Motivation: Traditional materials discovery workflows rely heavily on expert intuition and computationally expensive simulations, creating bottlenecks in finding high-performance materials for energy, electronics, and aerospace applications.
Method: The Materials Knowledge Navigation Agent (MKNA) uses language understanding to translate scientific intent into executable actions, autonomously extracts quantitative thresholds and design motifs from literature/databases, and enables data-grounded hypothesis formation for materials exploration.
Result: MKNA successfully identified high-Debye-temperature ceramics (Theta_D > 800 K), rediscovered canonical ultra-stiff materials (diamond, SiC, SiN, BeO), and proposed novel thermodynamically stable Be-C-rich compounds in the 1500-1700 K regime.
Conclusion: MKNA establishes a generalizable platform for autonomous, language-guided materials exploration that not only finds stable candidates but also reconstructs interpretable design heuristics from scientific literature and databases.
Abstract: Accelerating the discovery of high-performance materials remains a central challenge across energy, electronics, and aerospace technologies, where traditional workflows depend heavily on expert intuition and computationally expensive simulations. Here we introduce the Materials Knowledge Navigation Agent (MKNA), a language-driven system that translates natural-language scientific intent into executable actions for database retrieval, property prediction, structure generation, and stability evaluation. Beyond automating tool invocation, MKNA autonomously extracts quantitative thresholds and chemically meaningful design motifs from literature and database evidence, enabling data-grounded hypothesis formation. Applied to the search for high-Debye-temperature ceramics, the agent identifies a literature-supported screening criterion (Theta_D > 800 K), rediscovers canonical ultra-stiff materials such as diamond, SiC, SiN, and BeO, and proposes thermodynamically stable, previously unreported Be-C-rich compounds that populate the sparsely explored 1500-1700 K regime. These results demonstrate that MKNA not only finds stable candidates but also reconstructs interpretable design heuristics, establishing a generalizable platform for autonomous, language-guided materials exploration.
[450] The Offline-Frontier Shift: Diagnosing Distributional Limits in Generative Multi-Objective Optimization
Stephanie Holly, Alexandru-Ciprian Zăvoianu, Siegfried Silber, Sepp Hochreiter, Werner Zellinger
Main category: cs.LG
TL;DR: Generative methods for offline multi-objective optimization systematically underperform evolutionary alternatives on metrics like generational distance due to offline-frontier shift, which limits their ability to sample out-of-distribution in objective space.
Details
Motivation: The paper investigates why recent generative approaches (including diffusion models) for offline multi-objective optimization show strong performance under the hypervolume metric but systematically underperform evolutionary alternatives on other established MOO metrics like generational distance.
Method: The authors identify and analyze the “offline-frontier shift”, the displacement of the offline dataset from the Pareto front, as a fundamental limitation. They argue that overcoming this requires out-of-distribution sampling in objective space, measured via an integral probability metric, and empirically observe that generative methods remain conservatively close to the offline objective distribution.
Result: Generative methods systematically underperform evolutionary alternatives with respect to metrics like generational distance. They remain conservatively close to the offline objective distribution and fail to adequately sample out-of-distribution in objective space, which is necessary to overcome the offline-frontier shift limitation.
Conclusion: Offline MOO is fundamentally limited by distribution shift problems, with generative methods failing to overcome the offline-frontier shift. The paper provides a diagnostic framework for understanding when and why generative optimization methods fail in offline MOO settings.
Abstract: Offline multi-objective optimization (MOO) aims to recover Pareto-optimal designs given a finite, static dataset. Recent generative approaches, including diffusion models, show strong performance under hypervolume, yet their behavior under other established MOO metrics is less understood. We show that generative methods systematically underperform evolutionary alternatives with respect to other metrics, such as generational distance. We relate this failure mode to the offline-frontier shift, i.e., the displacement of the offline dataset from the Pareto front, which acts as a fundamental limitation in offline MOO. We argue that overcoming this limitation requires out-of-distribution sampling in objective space (via an integral probability metric) and empirically observe that generative methods remain conservatively close to the offline objective distribution. Our results position offline MOO as a distribution-shift–limited problem and provide a diagnostic lens for understanding when and why generative optimization methods fail.
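For reference, generational distance, the metric on which the abstract reports generative methods underperforming, is simply the mean distance from generated designs to their nearest reference Pareto-front point. A minimal sketch with illustrative data:

```python
# Sketch of generational distance (GD) in objective space. The arrays here are
# toy values, not from the paper's benchmarks.
import numpy as np

def generational_distance(candidates, pareto_front):
    """candidates: (N, M), pareto_front: (P, M), both in objective space."""
    d = np.linalg.norm(candidates[:, None, :] - pareto_front[None, :, :], axis=-1)
    return d.min(axis=1).mean()      # mean distance of each candidate to the front

front = np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]])
generated = np.array([[0.6, 0.7], [0.9, 0.4]])   # conservative, near the dataset
print(generational_distance(generated, front))
```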
[451] Asymmetric Prompt Weighting for Reinforcement Learning with Verifiable Rewards
Reinhard Heckel, Mahdi Soltanolkotabi, Christos Thramboulidis
Main category: cs.LG
TL;DR: Asymmetric prompt weighting in RL for LLMs upweights low-success prompts to accelerate training in low-accuracy regimes, particularly beneficial for from-scratch RL.
Details
Motivation: Current RL algorithms for LLM post-training (GRPO, DAPO, RLOO) focus on ambiguous prompts with intermediate success probability, but neglect prompts with very low or zero success probability. The paper investigates whether asymmetric weighting that emphasizes low-success prompts could accelerate training, especially in from-scratch RL scenarios.
Method: Proposes asymmetric prompt weighting schemes that assign higher weights to prompts with low empirical success probability. Provides theoretical analysis characterizing optimal prompt weights that minimize time needed to raise success probability from initial to target accuracy under fixed update budget. Focuses on low-success regimes where informative responses are rare and response cost dominates.
Result: Asymmetric weighting particularly benefits from-scratch RL (like R1-Zero) where training traverses wide accuracy range, but less beneficial in post-SFT RL where models start at high accuracy. Theoretical analysis shows optimal weights become asymmetric in low-success regimes, upweighting low success probabilities to accelerate effective-time convergence.
Conclusion: Asymmetric prompt weighting is a valuable technique for RL training of LLMs, especially in low-accuracy regimes and from-scratch training scenarios. The approach addresses the challenge of rare informative responses in low-success settings by strategically reweighting prompts to accelerate convergence.
Abstract: Reinforcement learning with verifiable rewards has driven recent advances in LLM post-training, in particular for reasoning. Policy optimization algorithms generate a number of responses for a given prompt and then effectively weight the corresponding gradients depending on the rewards. The most popular algorithms including GRPO, DAPO, and RLOO focus on ambiguous prompts, i.e., prompts with intermediate success probability, while downgrading gradients with very easy and very hard prompts. In this paper, we consider asymmetric prompt weightings that assign higher weights to prompts with low, or even zero, empirical success probability. We find that asymmetric weighting particularly benefits from-scratch RL (as in R1-Zero), where training traverses a wide accuracy range, and less so in post-SFT RL where the model already starts at high accuracy. We also provide theory that characterizes prompt weights which minimize the time needed to raise success probability from an initial level to a target accuracy under a fixed update budget. In low-success regimes, where informative responses are rare and response cost dominates, these optimal weights become asymmetric, upweighting low success probabilities and thereby accelerating effective-time convergence.
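To illustrate the contrast with symmetric weightings, here is a sketch of an asymmetric prompt weight against the p(1-p)-style emphasis of GRPO-like schemes. The power-law form below is only an illustrative choice; the paper derives the optimal weights rather than assuming this form.

```python
# Sketch of asymmetric prompt weighting over empirical success probabilities
# p_hat (estimated from G sampled responses per prompt). The exponent gamma
# and the eps floor are illustrative assumptions.
import numpy as np

def prompt_weight(p_hat, gamma=0.5, eps=1e-3):
    """Larger weight for prompts with small empirical success probability."""
    return (p_hat + eps) ** (-gamma)

for p in [0.0, 0.05, 0.5, 0.95]:
    print(f"p_hat={p:.2f}  asym={prompt_weight(p):6.2f}  sym={p * (1 - p):.3f}")
```

The symmetric column vanishes at p_hat = 0, which is exactly where from-scratch RL spends most of its time; the asymmetric weight keeps those prompts in play.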
[452] TabICLv2: A better, faster, scalable, and open tabular foundation model
Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
Main category: cs.LG
TL;DR: TabICLv2 is a new state-of-the-art foundation model for tabular data that outperforms previous methods through improved synthetic data generation, architectural innovations, and optimized pretraining protocols.
Details
Motivation: Tabular foundation models have recently surpassed gradient-boosted trees in predictive benchmarks, demonstrating the value of in-context learning for tabular data. The authors aim to create an even more advanced model that can handle larger datasets more efficiently.
Method: Three key innovations: (1) novel synthetic data generation engine for high pretraining diversity, (2) architectural improvements including scalable softmax in attention for better generalization to larger datasets, and (3) optimized pretraining protocols using the Muon optimizer instead of AdamW.
Result: TabICLv2 surpasses the current state-of-the-art RealTabPFN-2.5 on TabArena and TALENT benchmarks without any tuning. It generalizes effectively to million-scale datasets under 50GB GPU memory and is faster than RealTabPFN-2.5.
Conclusion: TabICLv2 represents a significant advancement in tabular foundation models, achieving state-of-the-art performance through improved synthetic data generation, architectural innovations, and optimized training protocols, with open research commitment.
Abstract: Tabular foundation models, such as TabPFNv2 and TabICL, have recently dethroned gradient-boosted trees at the top of predictive benchmarks, demonstrating the value of in-context learning for tabular data. We introduce TabICLv2, a new state-of-the-art foundation model for regression and classification built on three pillars: (1) a novel synthetic data generation engine designed for high pretraining diversity; (2) various architectural innovations, including a new scalable softmax in attention improving generalization to larger datasets without prohibitive long-sequence pretraining; and (3) optimized pretraining protocols, notably replacing AdamW with the Muon optimizer. On the TabArena and TALENT benchmarks, TabICLv2 without any tuning surpasses the performance of the current state of the art, RealTabPFN-2.5 (hyperparameter-tuned, ensembled, and fine-tuned on real data). With only moderate pretraining compute, TabICLv2 generalizes effectively to million-scale datasets under 50GB GPU memory while being markedly faster than RealTabPFN-2.5. We provide extensive ablation studies to quantify these contributions and commit to open research by first releasing inference code and model weights at https://github.com/soda-inria/tabicl, with synthetic data engine and pretraining code to follow.
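The abstract does not spell out the scalable softmax; the sketch below follows one published variant (SSMax-style, with logits scaled by s·log n so attention stays peaked as the number of in-context rows grows). Treat it as a plausible instantiation only; TabICLv2's exact formulation may differ.

```python
# Sketch of an SSMax-style scalable softmax for attention. The scale factor
# s and the interpretation of n are assumptions, not taken from the paper.
import torch

def scalable_softmax(logits, s=0.4):
    n = logits.shape[-1]                                # sequence / table length
    scale = s * torch.log(torch.tensor(float(n)))       # grows with context size
    return torch.softmax(scale * logits, dim=-1)

attn = scalable_softmax(torch.randn(2, 8, 128))         # (batch, queries, keys)
print(attn.shape, attn.sum(-1)[0, 0].item())            # rows still sum to 1
```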
[453] From Belief Entrenchment to Robust Reasoning in LLM Agents
Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun
Main category: cs.LG
TL;DR: DReaMAD improves multi-agent debate for LLM reasoning by addressing belief entrenchment through strategic prior knowledge elicitation and enforced perspective diversity.
Details
Motivation: Multi-Agent Debate (MAD) suffers from belief entrenchment where agents reinforce shared errors rather than correcting them, limiting its effectiveness for LLM reasoning.
Method: Proposes DReaMAD framework that first rectifies static initial belief via strategic prior knowledge elicitation, then reshapes debate dynamics by enforcing perspective diversity.
Result: Achieves +9.5% accuracy gain over ReAct prompting and +19.0% higher win rate than standard MAD on the new MetaNIM Arena benchmark.
Conclusion: DReaMAD significantly mitigates belief entrenchment in multi-agent debate systems, improving reasoning performance through diversity-enhancing mechanisms.
Abstract: Multi-Agent Debate (MAD) has emerged as a promising inference scaling method for Large Language Model (LLM) reasoning. However, it frequently suffers from belief entrenchment, where agents reinforce shared errors rather than correcting them. Going beyond merely identifying this failure, we decompose it into two distinct root causes: (1) the model’s biased static initial belief and (2) homogenized debate dynamics that amplify the majority view regardless of correctness. To address these sequentially, we propose DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt). Our framework first rectifies the static belief via strategic prior knowledge elicitation, then reshapes the debate dynamics by enforcing perspective diversity. Validated on our new MetaNIM Arena benchmark, DReaMAD significantly mitigates entrenchment, achieving a +9.5% accuracy gain over ReAct prompting and a +19.0% higher win rate than standard MAD.
[454] Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Main category: cs.LG
TL;DR: Uni-DPO improves Direct Preference Optimization by dynamically weighting preference pairs based on data quality and model performance during training, achieving state-of-the-art results across text, math, and multimodal tasks.
Details
Motivation: Current DPO methods treat all preference pairs equally, ignoring variations in data quality and learning difficulty, leading to inefficient data utilization and suboptimal performance.
Method: Proposes Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) inherent quality of preference pairs and (b) model’s evolving performance during training, adaptively reweighting samples based on both factors.
Result: On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses Claude 3 Opus by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks.
Conclusion: Uni-DPO enables more effective use of preference data and achieves superior performance, demonstrating effectiveness and robustness across diverse tasks and models.
Abstract: Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model’s evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses the leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, providing strong empirical evidence of its effectiveness and robustness.
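A minimal sketch of the reweighting mechanism on top of the standard DPO loss, assuming per-pair weights that combine a fixed data-quality score with the model's current implicit reward margin. Uni-DPO's actual weighting scheme is not specified in the summary above; the functional forms below are illustrative only.

```python
# Sketch of dynamically weighted DPO. logp_* are policy log-probs of the
# chosen (w) and rejected (l) responses; ref_logp_* come from the frozen
# reference model. The quality scores and weighting rule are assumptions.
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, quality, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    per_pair = -F.logsigmoid(margin)              # standard DPO per-pair loss
    difficulty = torch.sigmoid(-margin).detach()  # harder pairs get more weight
    w = quality * difficulty                      # combine data quality + difficulty
    return (w * per_pair).sum() / w.sum()

# Toy batch of 3 preference pairs:
lw, ll = torch.tensor([-5.0, -6.0, -4.0]), torch.tensor([-7.0, -6.5, -8.0])
rw, rl = torch.tensor([-5.5, -6.0, -5.0]), torch.tensor([-6.5, -6.2, -7.0])
q = torch.tensor([1.0, 0.5, 0.9])                 # e.g. annotator-derived quality
print(weighted_dpo_loss(lw, ll, rw, rl, q))
```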
[455] Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning
Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu
Main category: cs.LG
TL;DR: Attention outputs in transformers are surprisingly low-dimensional (~60% effective rank) compared to MLP outputs (~90% rank), causing dead feature problems in sparse autoencoders that can be fixed via subspace-constrained initialization.
Details
Motivation: To understand the geometric structure of transformer activations and address the prevalent dead feature problem in sparse dictionary learning for large language models.
Method: Analyzed effective dimensionality of attention outputs vs MLP outputs across diverse models, identified attention output projection matrix as key factor, and proposed subspace-constrained training for sparse autoencoders that initializes features into the active subspace.
Result: Attention outputs consistently show ~60% effective rank vs ~90% for MLP outputs; subspace-constrained initialization reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features.
Conclusion: Attention mechanisms operate in surprisingly low-dimensional subspaces, which explains dead feature problems in sparse dictionary learning; subspace-constrained training provides practical solution and new insights into transformer geometry.
Abstract: Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60\%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full-rank, exhibiting effective ranks around $90\%$. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we find this low-rank structure is a key factor in the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.
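A minimal sketch of the two quantities at play: the effective rank of an activation matrix, and SAE decoder directions initialized inside the active subspace. The explained-variance cutoff used to define "effective rank" here is an assumption; the paper's precise definition may differ.

```python
# Sketch of effective rank + subspace-constrained SAE decoder initialization.
# The 0.99 variance threshold and synthetic activations are illustrative.
import torch

def effective_rank(acts, var_threshold=0.99):
    """acts: (N, d). Number of singular directions covering 99% of variance."""
    s = torch.linalg.svdvals(acts - acts.mean(0))
    energy = (s ** 2).cumsum(0) / (s ** 2).sum()
    return int((energy < var_threshold).sum()) + 1

def subspace_init(acts, n_features, var_threshold=0.99):
    """Random unit-norm decoder directions drawn inside the active subspace."""
    k = effective_rank(acts, var_threshold)
    _, _, Vh = torch.linalg.svd(acts - acts.mean(0), full_matrices=False)
    basis = Vh[:k]                                   # (k, d) active-subspace basis
    W_dec = torch.randn(n_features, k) @ basis       # features lie in the subspace
    return W_dec / W_dec.norm(dim=1, keepdim=True)

# Synthetic low-rank activations (rank ~150 of 256, i.e. ~60% of the space):
acts = torch.randn(4096, 150) @ torch.randn(150, 256)
print(effective_rank(acts), subspace_init(acts, 1024).shape)
```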
[456] ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
Main category: cs.LG
TL;DR: ButterflyQuant: Learnable butterfly transforms for adaptive 2-bit quantization of LLMs, replacing fixed Hadamard rotations with continuous orthogonal transforms optimized per layer to suppress outliers.
Details
Motivation: Extreme 2-bit quantization suffers from catastrophic performance loss due to activation outliers. Existing rotation-based methods (QuIP, QuaRot) use fixed orthogonal transforms (Hadamard matrices) that cannot adapt to specific weight distributions and outlier patterns in different transformer layers.
Method: Proposes ButterflyQuant which replaces fixed Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. These transforms are orthogonal by construction, differentiable, and enable gradient-based optimization. Uses uniformity regularization on post-transformation activations to promote smoother distributions. Requires only 128 calibration samples and converges quickly on a single GPU.
Result: Achieves O(n log n) computational complexity with only (n log n)/2 learnable parameters. The method adapts to layer-specific outlier patterns while maintaining theoretical guarantees for outlier suppression.
Conclusion: ButterflyQuant provides an adaptive, learnable alternative to fixed rotation methods for extreme quantization, enabling better outlier suppression and performance preservation in 2-bit LLM quantization through layer-specific optimization.
Abstract: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms – Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$ – that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard’s discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms’ continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU.
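A minimal sketch of the butterfly parameterization itself: log2(n) stages of paired Givens rotations, giving O(n log n) work and (n log n)/2 angles, orthogonal by construction. This illustrates only the transform; ButterflyQuant's layer placement, quantizer, and regularizer are not shown.

```python
# Sketch of an orthogonal butterfly transform built from learnable Givens
# rotation angles. Stage/pairing conventions are one standard choice.
import torch

def butterfly_apply(x, angles):
    """x: (B, n) with n a power of two; angles: (log2(n), n // 2)."""
    B, n = x.shape
    for s in range(angles.shape[0]):
        half = 1 << s                                   # pair stride at this stage
        xb = x.reshape(B, n // (2 * half), 2 * half)
        u, v = xb[..., :half], xb[..., half:]
        th = angles[s].reshape(n // (2 * half), half)   # one angle per pair
        c, si = torch.cos(th), torch.sin(th)
        x = torch.cat([c * u - si * v, si * u + c * v], dim=-1).reshape(B, n)
    return x

n = 8
angles = torch.randn(3, n // 2, requires_grad=True)     # (n log n)/2 learnable angles
x = torch.randn(4, n)
y = butterfly_apply(x, angles)
# Every stage is a block of plane rotations, so norms are preserved exactly:
print(torch.allclose(x.norm(dim=-1), y.norm(dim=-1), atol=1e-5))
```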
[457] Building Production-Ready Probes For Gemini
János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy
Main category: cs.LG
TL;DR: Novel activation probe architectures for misuse mitigation in frontier language models that address generalization challenges under production distribution shifts, particularly long-context inputs.
Details
Motivation: As frontier language models become more powerful, stronger misuse mitigation techniques are needed. Activation probes show promise but fail to generalize under important production distribution shifts, especially the shift from short-context to long-context inputs.
Method: Proposed several new probe architectures designed to handle long-context distribution shifts. Evaluated in cyber-offensive domain with various production-relevant distribution shifts including multi-turn conversations, long context prompts, and adaptive red teaming. Combined architecture choice with training on diverse distributions for broad generalization. Also paired probes with prompted classifiers for optimal accuracy at low computational cost.
Result: Novel architectures successfully address context length challenges. Combination of architecture choice and diverse training data enables broad generalization. Probes paired with prompted classifiers achieve optimal accuracy with computational efficiency. Findings informed successful deployment in user-facing Gemini instances. Early positive results using AlphaEvolve for automated probe architecture search and adaptive red teaming improvements.
Conclusion: New probe architectures effectively handle long-context distribution shifts for misuse mitigation in frontier language models. Successful deployment demonstrates practical applicability, and automation of AI safety research shows promising early results.
Abstract: Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant distribution shifts, including multi-turn conversations, long context prompts, and adaptive red teaming. Our results demonstrate that while our novel architectures address context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google’s frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
[458] TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models
Shreshth Saini, Avinab Saha, Balu Adsumilli, Neil Birkbeck, Yilin Wang, Alan C. Bovik
Main category: cs.LG
TL;DR: BoE Steering uses gradient-guided inference with Token Influence Scores to improve masked diffusion model sampling by approximating infinite-horizon lookahead via single backward passes, addressing trajectory lock-in issues in non-autoregressive generation.
Details
Motivation: Current masked diffusion model sampling methods use simple confidence-based heuristics that ignore long-term impacts of local decisions, causing trajectory lock-in where early hallucinations lead to global incoherence. Search-based methods help but are computationally expensive (O(K) forward passes per step).
Method: Proposes Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via single backward pass. Derives Token Influence Score (TIS) from first-order expansion of trajectory cost functional, using gradient of future entropy w.r.t input embeddings as optimal control signal. Introduces ActiveQueryAttention sparse adjoint primitive to reduce backward pass complexity.
Result: BoE achieves superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating gradient-guided steering offers mathematically principled and efficient path to robust non-autoregressive generation.
Conclusion: Gradient-guided steering provides efficient solution to trajectory lock-in in masked diffusion models, balancing computational efficiency with generation quality through principled mathematical framework.
Abstract: Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ($O(K)$ forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce \texttt{ActiveQueryAttention}, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.
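A minimal sketch of the core mechanic, an entropy-gradient influence score from one backward pass, assuming a toy model that maps input embeddings to per-position logits. The TIS derivation and the ActiveQueryAttention primitive are not reproduced; this only shows the gradient-of-entropy signal.

```python
# Sketch: per-token influence via the gradient of future entropy w.r.t. the
# input embeddings, computed with a single backward pass. Model and shapes
# are placeholders, not the paper's denoiser.
import torch, torch.nn as nn

vocab, d, L = 100, 32, 16
model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, vocab))

emb = torch.randn(L, d, requires_grad=True)        # current input embeddings
logits = model(emb)                                # (L, vocab) per-position logits
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)   # per-position entropy

still_masked = torch.ones(L, dtype=torch.bool)     # positions not yet committed
entropy[still_masked].sum().backward()             # one backward pass, no search
tis = emb.grad.norm(dim=-1)                        # token influence scores
print(tis.argmax().item())                         # most influential position
```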
[459] Proficient Graph Neural Network Design by Accumulating Knowledge on Large Language Models
Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou
Main category: cs.LG
TL;DR: DesiGNN is a knowledge-centered framework that converts past GNN design experience into structured knowledge priors for LLMs, enabling efficient data-aware model design for unseen graph datasets.
Details
Motivation: LLMs struggle with specialized, data-sensitive tasks like GNN design due to knowledge gaps in graph property-architecture relationships and external noise from misleading inputs, leading to generic or poor model suggestions.
Method: DesiGNN systematically converts past model design experience into structured, fine-grained knowledge priors for meta-learning with LLMs, aligning empirical property filtering from benchmarks with adaptive elicitation of literature insights via LLMs.
Result: DesiGNN delivers top-5.77% initial model proposals for unseen datasets within seconds and achieves consistently superior performance with minimal search cost compared to baselines.
Conclusion: The framework demonstrates that constructing solid meta-knowledge between unseen graph understanding and known effective architecture patterns enables efficient, high-quality GNN design automation.
Abstract: High-level automation is increasingly critical in AI, driven by rapid advances in large language models (LLMs) and AI agents. However, LLMs, despite their general reasoning power, struggle significantly in specialized, data-sensitive tasks such as designing Graph Neural Networks (GNNs). This difficulty arises from (1) the inherent knowledge gaps in modeling the intricate, varying relationships between graph properties and suitable architectures and (2) the external noise from misleading descriptive inputs, often resulting in generic or even misleading model suggestions. Achieving proficiency in designing data-aware models – defined as the meta-level capability to systematically accumulate, interpret, and apply data-specific design knowledge – remains challenging for existing automated approaches, due to their inefficient construction and application of meta-knowledge. To achieve meta-level proficiency, we propose DesiGNN, a knowledge-centered framework that systematically converts past model design experience into structured, fine-grained knowledge priors well-suited for meta-learning with LLMs. To account for the inherent variability and external noise, DesiGNN aligns empirical property filtering from extensive benchmarks with adaptive elicitation of literature insights via LLMs. By constructing a solid meta-knowledge between unseen graph understanding and known effective architecture patterns, DesiGNN can deliver top-5.77% initial model proposals for unseen datasets within seconds and achieve consistently superior performance with minimal search cost compared to baselines.
[460] Enhancing Inverse Reinforcement Learning through Encoding Dynamic Information in Reward Shaping
Simon Sinong Zhan, Philip Wang, Qingyuan Wu, Yixuan Wang, Ruochen Jiao, Chao Huang, Qi Zhu
Main category: cs.LG
TL;DR: Proposes Model-Enhanced AIRL framework that incorporates dynamics information into reward shaping for improved performance in stochastic environments with theoretical guarantees.
Details
Motivation: Addresses limitations of Adversarial Inverse Reinforcement Learning (AIRL) in stochastic environments where theoretical results don't hold and performance degrades.
Method: Infuses dynamics information into reward shaping with theoretical guarantees and integrates transition model estimation directly into the shaping term, creating the Model-Enhanced AIRL framework.
Result: Achieves superior performance in stochastic environments and competitive performance in deterministic environments with significant sample efficiency improvements in MuJoCo benchmarks.
Conclusion: The proposed method effectively addresses AIRL’s limitations in stochastic environments through model-enhanced reward shaping with theoretical guarantees.
Abstract: In this paper, we aim to tackle the limitation of the Adversarial Inverse Reinforcement Learning (AIRL) method in stochastic environments, where its theoretical results cannot hold and performance is degraded. To address this issue, we propose a novel method that infuses dynamics information into reward shaping, with a theoretical guarantee for the induced optimal policy in stochastic environments. Incorporating these model-enhanced rewards, we present the Model-Enhanced AIRL framework, which integrates transition model estimation directly into reward shaping. Furthermore, we provide a comprehensive theoretical analysis of the reward error bound and performance difference bound for our method. The experimental results on MuJoCo benchmarks show that our method achieves superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines.
[461] Spatiotemporal Field Generation Based on Hybrid Mamba-Transformer with Physics-informed Fine-tuning
Peimian Du, Jiabin Liu, Xiaowei Jin, Wangmeng Zuo, Hui Li
Main category: cs.LG
TL;DR: HMT-PF: A hybrid Mamba-Transformer model for spatiotemporal physical field generation with physics-informed fine-tuning to reduce equation discrepancies
Details
Motivation: Address substantial physical equation discrepancies in data-driven spatiotemporal physical field generation models, aiming to improve physical consistency while maintaining field characteristics.
Method: Developed HMT-PF based on hybrid Mamba-Transformer architecture with unstructured grid inputs; introduced physics-enhanced fine-tuning block; used point query mechanism for efficient gradient evaluation of equation residuals; encoded residuals into latent space; employed self-supervised learning for fine-tuning
Result: Hybrid Mamba-Transformer model achieves good performance in generating spatiotemporal fields; physics-informed fine-tuning effectively reduces significant physical errors; developed MSE-R evaluation method for accuracy and realism assessment
Conclusion: The proposed HMT-PF framework successfully addresses physical equation discrepancies in spatiotemporal field generation through hybrid architecture and physics-informed fine-tuning, improving both accuracy and physical consistency
Abstract: This research confronts the challenge of substantial physical equation discrepancies encountered in the generation of spatiotemporal physical fields through data-driven trained models. A spatiotemporal physical field generation model, named HMT-PF, is developed based on the hybrid Mamba-Transformer architecture, incorporating unstructured grid information as input. A fine-tuning block, enhanced with physical information, is introduced to effectively reduce the physical equation discrepancies. The physical equation residuals are computed through a point query mechanism for efficient gradient evaluation, then encoded into latent space for refinement. The fine-tuning process employs a self-supervised learning approach to achieve physical consistency while maintaining essential field characteristics. Results show that the hybrid Mamba-Transformer model achieves good performance in generating spatiotemporal fields, while the physics-informed fine-tuning mechanism further reduces significant physical errors effectively. An MSE-R evaluation method is developed to assess the accuracy and realism of physical field generation.
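A minimal sketch of the point-query residual idea: sample query points, treat the generated field as a differentiable function there, and evaluate a PDE residual by autograd. The 1-D heat equation and the MLP stand in for HMT-PF's generated spatiotemporal field; both are illustrative assumptions.

```python
# Sketch: PDE residual at random query points via autograd, assuming a
# differentiable field u(x, t). Residual form (u_t = alpha * u_xx) is a toy choice.
import torch, torch.nn as nn

field = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))  # u(x, t)

def heat_residual(xt, alpha=0.1):
    xt = xt.clone().requires_grad_(True)
    u = field(xt)
    g = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]        # (u_x, u_t)
    u_x, u_t = g[:, :1], g[:, 1:]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]
    return u_t - alpha * u_xx                      # residual of u_t = alpha * u_xx

queries = torch.rand(256, 2)                       # random (x, t) query points
print(heat_residual(queries).pow(2).mean())        # residual to encode / minimize
```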
[462] Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures
Yu He, Yingxi Li, Colin White, Ellen Vitercik
Main category: cs.LG
TL;DR: DSR-Bench: A diagnostic benchmark for evaluating LLMs’ algorithmic reasoning through data structure operations, revealing significant limitations in structural reasoning abilities.
Details
Motivation: As LLMs are deployed on increasingly complex multi-step decision-making tasks, there's a need to understand their algorithmic reasoning capabilities. Current benchmarks lack diagnostic tools for evaluating this specific ability, particularly structural reasoning about relationships like order, hierarchy, and connectivity.
Method: Proposes data structures as a principled lens for evaluating algorithmic reasoning. Introduces DSR-Bench with 20 data structures, 35 operations, and 4,140 problem instances. Features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Includes three auxiliary probes for more realistic usage scenarios.
Result: Evaluation of 13 state-of-the-art LLMs reveals critical limitations: top-performing model achieves only 0.46/1 on challenging instances. Models perform poorly on spatial data and context-rich scenarios, and struggle to reason over their own code.
Conclusion: Current LLMs have significant limitations in algorithmic reasoning through data structures, particularly in structural reasoning about relationships. The proposed benchmark provides diagnostic tools for evaluating and improving these capabilities.
Abstract: Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating this capability. We propose data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning: the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench, spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.
[463] Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
Simon Sinong Zhan, Qingyuan Wu, Philip Wang, Frank Yang, Xiangyu Shi, Chao Huang, Qi Zhu
Main category: cs.LG
TL;DR: DT-CORL is an offline RL framework that addresses deployment challenges by handling delayed observations during online execution without seeing delays during training, using transformer-based belief prediction.
Details
Motivation: Addresses two key gaps in RL deployment: (1) sim-to-real gap where real systems introduce latency/imperfections, and (2) interaction gap where offline-trained policies face out-of-distribution states during online execution. Standard offline RL learns from delay-free logs but must act under delays that break Markov assumption and hurt performance.
Method: DT-CORL (Delay-Transformer belief policy Constrained Offline RL) uses a transformer-based belief predictor to produce delay-robust actions even though it never sees delayed observations during training. It’s more sample-efficient than naïve history-augmentation baselines.
Result: Experiments on D4RL benchmarks with several delay settings show DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.
Conclusion: DT-CORL effectively bridges the offline-to-online deployment gap by handling delayed dynamics at deployment through transformer-based belief prediction, making RL agents more robust to real-world latency issues.
Abstract: Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations during training, and (ii) is markedly more sample-efficient than naïve history-augmentation baselines. Experiments on D4RL benchmarks with several delay settings show that DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.
[464] Towards Privacy-Guaranteed Label Unlearning in Vertical Federated Learning: Few-Shot Forgetting without Disclosure
Hanlin Gu, Hong Xi Tae, Chee Seng Chan, Lixin Fan
Main category: cs.LG
TL;DR: First method for label unlearning in Vertical Federated Learning using representation-level manifold mixup to generate synthetic embeddings, followed by gradient-based forgetting and recovery optimization.
Details
Motivation: Addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), which has received far less attention than horizontal federated learning. Labels in VFL play a dual role as both essential inputs and sensitive information, creating unique privacy challenges.
Method: Uses representation-level manifold mixup to generate synthetic embeddings for both unlearned and retained samples. These augmented embeddings undergo gradient-based label forgetting to remove associated label information, followed by a recovery-phase optimization step to refine remaining embeddings and maintain performance on retained data.
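To make the stages concrete, below is a minimal PyTorch sketch of representation-level mixup followed by gradient ascent for forgetting and a recovery phase; the classifier head, step counts, Beta(0.2, 0.2) mixing, and the use of anchor labels are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    def manifold_mixup(emb, alpha=0.2):
        # Mix each embedding with a randomly paired one; lam ~ Beta(alpha, alpha).
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        return lam * emb + (1 - lam) * emb[torch.randperm(emb.size(0))]

    def unlearn_labels(head, forget_emb, forget_y, retain_emb, retain_y,
                       lr=1e-3, forget_steps=50, recover_steps=50):
        opt = torch.optim.Adam(head.parameters(), lr=lr)
        # Forgetting: ascend the loss on mixed embeddings of the samples whose
        # label information must be removed (anchor labels kept for simplicity).
        for _ in range(forget_steps):
            loss = -F.cross_entropy(head(manifold_mixup(forget_emb)), forget_y)
            opt.zero_grad(); loss.backward(); opt.step()
        # Recovery: refine on mixed retained embeddings to restore utility.
        for _ in range(recover_steps):
            loss = F.cross_entropy(head(manifold_mixup(retain_emb)), retain_y)
            opt.zero_grad(); loss.backward(); opt.step()
        return head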
Result: Extensive experiments on diverse datasets (MNIST, CIFAR-10, CIFAR-100, ModelNet, Brain Tumor MRI, COVID-19 Radiography, Yahoo Answers) demonstrate strong efficacy and scalability. The method achieves effective label unlearning while maintaining computational efficiency.
Conclusion: Establishes a new direction for unlearning in VFL, showing that re-imagining mixup as an efficient mechanism can unlock practical and utility-preserving unlearning. Provides first solution for label unlearning in vertical federated settings.
Abstract: This paper addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), a setting that has received far less attention than its horizontal counterpart. Specifically, we propose the first method tailored to label unlearning in VFL, where labels play a dual role as both essential inputs and sensitive information. To this end, we employ a representation-level manifold mixup mechanism to generate synthetic embeddings for both unlearned and retained samples. This is to provide richer signals for the subsequent gradient-based label forgetting and recovery steps. These augmented embeddings are then subjected to gradient-based label forgetting, effectively removing the associated label information from the model. To recover performance on the retained data, we introduce a recovery-phase optimization step that refines the remaining embeddings. This design achieves effective label unlearning while maintaining computational efficiency. We validate our method through extensive experiments on diverse datasets, including MNIST, CIFAR-10, CIFAR-100, ModelNet, Brain Tumor MRI, COVID-19 Radiography, and Yahoo Answers, which demonstrate strong efficacy and scalability. Overall, this work establishes a new direction for unlearning in VFL, showing that re-imagining mixup as an efficient mechanism can unlock practical and utility-preserving unlearning. The code is publicly available at https://github.com/bryanhx/Towards-Privacy-Guaranteed-Label-Unlearning-in-Vertical-Federated-Learning.
[465] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai
Main category: cs.LG
TL;DR: Shuffle-R1 improves RL fine-tuning efficiency for multimodal LLMs by addressing advantage collapsing and rollout silencing through pairwise trajectory sampling and advantage-based batch shuffling.
Details
Motivation: Current RL pipelines for MLLMs suffer from training inefficiencies due to advantage collapsing (most advantages near zero) and rollout silencing (few rollouts contribute gradients), leading to suboptimal updates and poor long-term learning.
Method: Proposes Shuffle-R1 framework with two key components: (1) Pairwise Trajectory Sampling that selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle that increases exposure of valuable rollouts through informed batch reshuffling.
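A rough NumPy sketch of the two components, with the selection and reshuffling rules simplified relative to the paper:

    import numpy as np

    def pairwise_trajectory_sampling(advantages, n_pairs):
        # High-contrast pairs: largest advantages matched with the smallest.
        order = np.argsort(advantages)
        return np.stack([order[::-1][:n_pairs], order[:n_pairs]], axis=1)

    def advantage_based_shuffle(indices, advantages, rng):
        # Reorder rollouts so high-|advantage| ones are drawn more often,
        # increasing how many rollouts contribute non-zero gradients per batch.
        w = np.abs(advantages[indices]) + 1e-8
        return rng.choice(indices, size=len(indices), replace=False, p=w / w.sum())

    rng = np.random.default_rng(0)
    adv = rng.normal(size=64)                      # per-rollout advantages
    pairs = pairwise_trajectory_sampling(adv, 8)   # 8 high-contrast pairs
    batch_order = advantage_based_shuffle(pairs.ravel(), adv, rng)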
Result: Experiments across multiple reasoning benchmarks show consistent outperformance over strong RL baselines with minimal overhead, demonstrating improved training efficiency.
Conclusion: The work highlights the importance of data-centric adaptations for more efficient RL training in MLLMs, addressing fundamental issues in current RL fine-tuning pipelines.
Abstract: Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address these issues, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
[466] Adapt before Continual Learning
Aojun Lu, Tao Feng, Hangjie Yuan, Chunhui Ding, Yanan Sun
Main category: cs.LG
TL;DR: ACL introduces a plug-and-play adaptation phase before continual learning tasks to balance stability and plasticity in pre-trained models by aligning embeddings with original class prototypes while distancing from irrelevant classes.
Details
Motivation: Existing continual learning approaches struggle to balance stability (retaining existing knowledge) and plasticity (acquiring new knowledge) when using pre-trained models. Freezing the backbone limits plasticity, while fine-tuning causes catastrophic forgetting.
Method: Proposes ACL framework with adaptation phase before each new task. Refines PTM backbone by aligning embeddings with original class prototypes while distancing from irrelevant classes, achieving better stability-plasticity trade-off.
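One plausible shape for the adaptation objective is a prototype-contrastive loss, sketched below in PyTorch; the temperature and exact loss form are assumptions, not the paper's published objective.

    import torch
    import torch.nn.functional as F

    def acl_adaptation_loss(embeddings, labels, prototypes, tau=0.1):
        # Pull each embedding toward its own class prototype and push it away
        # from the prototypes of all other (irrelevant) classes.
        z = F.normalize(embeddings, dim=-1)
        p = F.normalize(prototypes, dim=-1)          # (num_classes, dim)
        logits = z @ p.T / tau                       # similarity to every prototype
        return F.cross_entropy(logits, labels)

    # Illustrative adaptation step on the PTM backbone before a new task:
    # loss = acl_adaptation_loss(backbone(x), y, class_prototypes)
    # loss.backward(); optimizer.step()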
Result: The method demonstrates improved continual learning performance across benchmarks and integrated methods, showing desirable balance between stability and plasticity both theoretically and empirically.
Conclusion: ACL effectively addresses the stability-plasticity trade-off in PTM-based continual learning through a novel adaptation phase, enhancing performance while maintaining knowledge retention.
Abstract: Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model’s plasticity, particularly when incoming data distribution diverges largely from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from irrelevant classes. This mechanism theoretically and empirically demonstrates desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.
[467] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Long Li, Zhijian Zhou, Jiaran Hao, Jason Klein Liu, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, Chao Qu, Yuan Qi
Main category: cs.LG
TL;DR: DPH-RL framework uses mass-covering f-divergences (forward-KL, JS-divergence) as rehearsal mechanism to preserve diversity in RL fine-tuning, solving Pass@k degradation while improving both Pass@1 and Pass@k performance.
Details
Motivation: Standard RLVR fine-tuning often degrades multi-attempt performance (Pass@k) despite improving single-attempt accuracy (Pass@1), causing catastrophic forgetting of diverse skills. Current approaches overlook the divergence term's potential as a proactive solution for knowledge retention.
Method: Proposes Diversity-Preserving Hybrid RL (DPH-RL) that uses mass-covering f-divergences (forward-KL, JS-divergence) as rehearsal mechanisms. The framework continuously references the initial policy to maintain broad solution coverage, computing f-divergence using generator functions that only require sampling from the initial policy without needing online reference models.
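The generator-function trick is compact: for D_f(pi_theta || pi_0) = E_{y ~ pi_0}[f(pi_theta(y) / pi_0(y))], only samples from the frozen initial policy are needed. A PyTorch sketch (the clamping constant is an implementation detail added here for numerical safety):

    import torch

    def f_divergence_penalty(logp_theta, logp_init, kind="forward_kl"):
        # logp_* are log-probs, under each policy, of responses sampled from pi_0.
        t = torch.exp(logp_theta - logp_init)        # density ratio pi_theta / pi_0
        if kind == "forward_kl":                     # f(t) = -log t  ->  KL(pi_0 || pi_theta)
            f_t = -torch.log(t.clamp_min(1e-12))
        elif kind == "js":                           # f(t) = t log t - (t+1) log((t+1)/2)
            f_t = t * torch.log(t.clamp_min(1e-12)) - (t + 1) * torch.log((t + 1) / 2)
        else:
            raise ValueError(kind)
        return f_t.mean()                            # Monte-Carlo estimate of D_f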
Result: Extensive experiments on math and SQL generation show DPH-RL resolves Pass@k degradation and improves both Pass@1 and Pass@k in- and out-of-domain. The approach is more training-efficient as it requires only sampling from the initial policy and no online reference model.
Conclusion: The proper selection of divergence measure is a powerful tool for building more general and diverse reasoning models. DPH-RL highlights a crucial, overlooked axis for improving RLVR by using divergence terms as proactive solutions for knowledge retention.
Abstract: A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives – both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely – lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
[468] Generalization of Diffusion Models Arises with a Balanced Representation Space
Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu
Main category: cs.LG
TL;DR: Theoretical analysis of memorization vs generalization in diffusion models through representation learning, showing memorization creates spiky localized representations while generalization produces balanced ones, with practical detection and editing methods.
Details
Motivation: Diffusion models risk memorizing training data when overfit, but the distinction between memorization and generalization remains unclear. The paper aims to analyze this through representation learning to understand what constitutes meaningful generative modeling.
Method: Analyze a two-layer ReLU denoising autoencoder (DAE) theoretically to prove memorization corresponds to storing raw samples in weights (spiky representations) while generalization captures local data statistics (balanced representations). Validate findings on real-world unconditional and text-to-image diffusion models, then propose representation-based memorization detection and training-free editing via representation steering.
Result: Theoretical proofs show distinct representation structures for memorization vs generalization. Empirical validation confirms these patterns emerge in practical diffusion models. Proposed methods successfully detect memorization and enable precise control through representation steering without retraining.
Conclusion: Learning good representations is central to novel and meaningful generative modeling. The distinction between memorization (spiky representations) and generalization (balanced representations) provides fundamental insights for improving diffusion models, with practical implications for detection and control.
Abstract: Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized spiky representations, whereas (ii) generalization arises when the model captures local data statistics, producing balanced representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
[469] Beyond Aggregation: Guiding Clients in Heterogeneous Federated Learning
Zijian Wang, Xiaofei Zhang, Xin Zhang, Yukun Liu, Qiong Zhang
Main category: cs.LG
TL;DR: Federated learning framework where server actively guides new queries to most appropriate clients using density ratio modeling and empirical likelihood, leveraging heterogeneity as a feature rather than just a bug.
Details
Motivation: Current FL algorithms focus on aggregating model updates from heterogeneous clients but under-exploit the central server's potential. Inspired by healthcare scenarios where a server could guide patients to the best-equipped hospital, the paper proposes a more intelligent FL paradigm that actively allocates new tasks/queries to the most appropriate clients.
Method: Introduces a density ratio model and empirical likelihood-based framework that simultaneously addresses two goals: (1) learning effective local models on each client, and (2) finding the best matching client for a new query. The framework enables the server to guide query allocation based on client expertise.
Result: Empirical results on benchmark datasets demonstrate improvements in both model accuracy and the precision of client guidance compared to standard FL approaches. The framework shows effectiveness in leveraging heterogeneity as a feature.
Conclusion: This work opens a new direction for building more intelligent and resource-efficient FL systems that actively leverage statistical heterogeneity rather than just mitigating it. The proposed paradigm transforms the server from a passive coordinator to an active guide for query allocation.
Abstract: Federated learning (FL) is increasingly adopted in domains like healthcare, where data privacy is paramount. A fundamental challenge in these systems is statistical heterogeneity-the fact that data distributions vary significantly across clients (e.g., different hospitals may treat distinct patient demographics). While current FL algorithms focus on aggregating model updates from these heterogeneous clients, the potential of the central server remains under-explored. This paper is motivated by a healthcare scenario: could a central server not only coordinate model training but also guide a new patient to the hospital best equipped for their specific condition? We generalize this idea to propose a novel paradigm for FL systems where the server actively guides the allocation of new tasks or queries to the most appropriate client. To enable this, we introduce a density ratio model and empirical likelihood-based framework that simultaneously addresses two goals: (1) learning effective local models on each client, and (2) finding the best matching client for a new query. Empirical results demonstrate the framework’s effectiveness on benchmark datasets, showing improvements in both model accuracy and the precision of client guidance compared to standard FL approaches. This work opens a new direction for building more intelligent and resource-efficient FL systems that leverage heterogeneity as a feature, not just a bug. Code is available at https://github.com/zijianwang0510/FedDRM.git.
[470] Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Lin Long, Changdae Oh, Seongheon Park, Sharon Li
Main category: cs.LG
TL;DR: Analysis of language prior in LVLMs through chain-of-embedding reveals Visual Integration Points and introduces TVI estimator to quantify visual influence.
Details
Motivation: Large vision-language models often default to language priors (memorized textual patterns) while under-utilizing visual evidence, but existing input-output probing fails to reveal internal mechanisms of visual influence.
Method: Systematic analysis through chain-of-embedding examines layer-wise representation dynamics, identifies Visual Integration Points (VIPs) where visual information reshapes hidden representations, and introduces Total Visual Integration (TVI) estimator to quantify visual influence.
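In spirit, the analysis compares per-layer hidden states for the same prompt with and without the visual input; a model-agnostic NumPy sketch is below, where the cosine distance and threshold are illustrative stand-ins for the paper's exact VIP/TVI definitions.

    import numpy as np

    def vip_and_tvi(h_with_image, h_text_only, tau=0.1):
        # Per-layer discrepancy between hidden states (lists of vectors).
        dists = np.array([
            1.0 - hv @ ht / (np.linalg.norm(hv) * np.linalg.norm(ht) + 1e-8)
            for hv, ht in zip(h_with_image, h_text_only)
        ])
        above = np.nonzero(dists > tau)[0]
        vip = int(above[0]) if len(above) else len(dists)  # first layer vision reshapes states
        tvi = float(dists[vip:].sum())                     # aggregate discrepancy beyond the VIP
        return vip, tvi                                    # low TVI -> strong language prior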
Result: Across 60 model-dataset combinations spanning 10 contemporary LVLMs and 6 benchmarks, VIP consistently emerges, and TVI reliably predicts the strength of language prior, offering a principled diagnostic toolkit.
Conclusion: The chain-of-embedding analysis provides fundamental insights into when and how vision influences LVLM behavior, enabling better understanding and diagnosis of language prior issues in multimodal models.
Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) – memorized textual patterns from pre-training – while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding for multimodal reasoning. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representational discrepancy beyond the VIP to quantify how strongly visual query influences response generation. Across 60 model-dataset combinations spanning 10 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.
[471] Multi-modal Gaussian Process Variational Autoencoders for Neural and Behavioral Data
Rabia Gondur, Usama Bin Sikandar, Evan Schaffer, Mikio Christian Aoi, Stephen L Keeley
Main category: cs.LG
TL;DR: Proposes MM-GPVAE, an unsupervised latent variable model that extracts shared and independent temporal latents from multi-modal experimental data, combining GPFA and GP-VAE approaches with Fourier domain parameterization.
Details
Motivation: Existing latent variable models are typically designed for single data types, making it difficult to identify structure shared across different experimental modalities. Neuroscience needs methods to characterize relationships between neural population activity and behavioral data across modalities.
Method: Combines Gaussian Process Factor Analysis (GPFA) for neural spiking data with Gaussian Process Variational Autoencoders (GP-VAEs), partitioning latent variability into shared and independent components across modalities. Uses Fourier domain parameterization for improved latent identification.
Result: Validated on simulated multi-modal data (Poisson spike counts and MNIST images) and real-world experiments (Drosophila calcium imaging with limb tracking, Manduca sexta spike trains during visual tracking). Accurately identifies shared/independent latent structure and provides good reconstructions.
Conclusion: MM-GPVAE successfully extracts interpretable shared and independent temporal latents from multi-modal neuroscience data, enabling better characterization of relationships between neural activity and behavior across different experimental modalities.
Abstract: Characterizing the relationship between neural population activity and behavioral data is a central goal of neuroscience. While latent variable models (LVMs) are successful in describing high-dimensional time-series data, they are typically only designed for a single type of data, making it difficult to identify structure shared across different experimental data modalities. Here, we address this shortcoming by proposing an unsupervised LVM which extracts temporally evolving shared and independent latents for distinct, simultaneously recorded experimental modalities. We do this by combining Gaussian Process Factor Analysis (GPFA), an interpretable LVM for neural spiking data with temporally smooth latent space, with Gaussian Process Variational Autoencoders (GP-VAEs), which similarly use a GP prior to characterize correlations in a latent space, but admit rich expressivity due to a deep neural network mapping to observations. We achieve interpretability in our model by partitioning latent variability into components that are either shared between or independent to each modality. We parameterize the latents of our model in the Fourier domain, and show improved latent identification using this approach over standard GP-VAE methods. We validate our model on simulated multi-modal data consisting of Poisson spike counts and MNIST images that scale and rotate smoothly over time. We show that the multi-modal GP-VAE (MM-GPVAE) is able to not only identify the shared and independent latent structure across modalities accurately, but provides good reconstructions of both images and neural rates on held-out trials. Finally, we demonstrate our framework on two real world multi-modal experimental settings: Drosophila whole-brain calcium imaging alongside tracked limb positions, and Manduca sexta spike train measurements from ten wing muscles as the animal tracks a visual stimulus.
[472] Discrete Variational Autoencoding via Policy Search
Michael Drolet, Firas Al-Hafez, Aditya Bhatt, Jan Peters, Oleg Arenz
Main category: cs.LG
TL;DR: A novel training framework for discrete VAEs using natural gradient updates from non-parametric encoders, enabling scalable high-dimensional image reconstruction without reparameterization tricks.
Details
Motivation: Discrete latent bottlenecks in VAEs offer bit efficiency but face gradient estimation challenges due to non-differentiable discrete variables, limiting their effectiveness on high-dimensional tasks like image reconstruction.
Method: Proposes a training framework leveraging natural gradient from non-parametric encoder to update parametric encoder, combined with automatic step size adaptation and transformer-based encoder architecture.
Result: Method scales to challenging datasets like ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.
Conclusion: The proposed natural gradient approach enables effective training of discrete VAEs for high-dimensional data reconstruction without requiring reparameterization tricks, advancing discrete latent representation learning.
Abstract: Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.
[473] Learning-based agricultural management in partially observable environments subject to climate variability
Zhaoan Wang, Shaoping Xiao, Junchao Li, Jun Wang
Main category: cs.LG
TL;DR: Deep Reinforcement Learning framework for optimal nitrogen fertilization in corn crops using Gym-DSSAT simulator, comparing POMDP vs MDP models for climate-adaptive management.
Details
Motivation: Conventional agricultural guidelines fail under extreme weather conditions like heatwaves and droughts, necessitating adaptive fertilization strategies that can optimize crop yield, economic profitability, and environmental sustainability.
Method: Integration of Deep Reinforcement Learning with Recurrent Neural Networks, trained using Gym-DSSAT simulator on corn crops in Iowa. Comparison of Partially Observable Markov Decision Process (POMDP) models with Markov Decision Process (MDP) models to handle sequential observations.
Result: POMDP models with sequential observations outperform MDP models for nitrogen fertilization policies. Fixed policies work well for minor climate fluctuations but require agent retraining for extreme weather events to maintain optimal yields, cost-effectiveness, and environmental conservation.
Conclusion: DRL with RNNs provides adaptable fertilization strategies for dynamic climate scenarios, though retraining is needed for extreme weather. The framework offers promising direction for climate-resilient crop management optimization.
Abstract: Agricultural management, with a particular focus on fertilization strategies, holds a central role in shaping crop yield, economic profitability, and environmental sustainability. While conventional guidelines offer valuable insights, their efficacy diminishes when confronted with extreme weather conditions, such as heatwaves and droughts. In this study, we introduce an innovative framework that integrates Deep Reinforcement Learning (DRL) with Recurrent Neural Networks (RNNs). Leveraging the Gym-DSSAT simulator, we train an intelligent agent to master optimal nitrogen fertilization management. Through a series of simulation experiments conducted on corn crops in Iowa, we compare Partially Observable Markov Decision Process (POMDP) models with Markov Decision Process (MDP) models. Our research underscores the advantages of utilizing sequential observations in developing more efficient nitrogen input policies. Additionally, we explore the impact of climate variability, particularly during extreme weather events, on agricultural outcomes and management. Our findings demonstrate the adaptability of fertilization policies to varying climate conditions. Notably, a fixed policy exhibits resilience in the face of minor climate fluctuations, leading to commendable corn yields, cost-effectiveness, and environmental conservation. However, our study illuminates the need for agent retraining to acquire new optimal policies under extreme weather events. This research charts a promising course toward adaptable fertilization strategies that can seamlessly align with dynamic climate scenarios, ultimately contributing to the optimization of crop management practices.
[474] GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models
Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer
Main category: cs.LG
TL;DR: GLASS Flows enables efficient reward-aligned sampling for flow/diffusion models by using a “flow matching model within a flow matching model” approach, eliminating the SDE sampling bottleneck while maintaining stochastic evolution.
Details
Motivation: Current reward alignment algorithms for flow matching and diffusion models suffer from efficiency limitations due to reliance on SDE sampling, which is slower and less performant than ODE sampling.
Method: Introduces GLASS Flows, a sampling paradigm that simulates an “inner” flow matching model within a pre-trained model to sample Markov transitions efficiently without retraining, combining ODE efficiency with SDE stochastic evolution.
Result: GLASS Flows eliminate the trade-off between stochastic evolution and efficiency in large-scale text-to-image models, and when combined with Feynman-Kac Steering, achieve state-of-the-art performance in text-to-image generation.
Conclusion: GLASS Flows provide a simple, drop-in solution for inference-time scaling of flow and diffusion models, enabling efficient reward alignment without sacrificing performance.
Abstract: The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms were proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms require to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. As we show in this work, this “inner” flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making it a simple, drop-in solution for inference-time scaling of flow and diffusion models.
[475] Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks
Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco
Main category: cs.LG
TL;DR: Theoretical framework connecting Deep Neural Collapse (DNC1) to implicit low-rank bias in networks with L2 weight decay, showing quantitative relations between feature variation and weight matrix rank.
Details
Motivation: To establish a unified theoretical connection between the first property of Deep Neural Collapse (DNC1) and the emergence of implicit low-rank bias in nonlinear networks trained with L2 weight decay regularization, explaining why these phenomena co-occur.
Method: Develops a theoretical framework with three main contributions: 1) Derives quantitative relation between Total Cluster Variation (TCV) of embeddings and numerical rank of weight matrices, 2) Proves global optimality of DNC1 in constrained representation-cost setting for feedforward and residual architectures, 3) Establishes benign landscape property showing existence of continuous loss-decreasing paths to DNC1 configurations.
Result: Shows that distance from weight matrices to rank-K matrices is bounded by TCV scaled inversely with weight decay, proves zero TCV minimizes representation cost, and demonstrates benign optimization landscape. Empirical validation confirms predicted relations among TCV, singular-value structure, and weight decay.
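A schematic rendering of that bound, with a placeholder constant and distance (see the paper for the precise norms and conditions):

    \[
      \operatorname{dist}\big(W_\ell,\ \mathcal{M}_{\le K}\big)
      \;\le\; \frac{C}{\lambda}\, \mathrm{TCV}\big(\Phi_{\ell-1}\big),
    \]
    where $W_\ell$ is a stationary weight matrix, $\mathcal{M}_{\le K}$ the set of
    matrices of rank at most $K$, $\lambda$ the weight-decay parameter, and
    $\mathrm{TCV}(\Phi_{\ell-1})$ the Total Cluster Variation of the layer-$(\ell-1)$ features.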
Conclusion: Neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay, providing theoretical understanding of why these properties emerge during training.
Abstract: We present a unified theoretical framework connecting the first property of Deep Neural Collapse (DNC1) to the emergence of implicit low-rank bias in nonlinear networks trained with $L^2$ weight decay regularization. Our main contributions are threefold. First, we derive a quantitative relation between the Total Cluster Variation (TCV) of intermediate embeddings and the numerical rank of stationary weight matrices. In particular, we establish that, at any critical point, the distance from a weight matrix to the set of rank-$K$ matrices is bounded by a constant times the TCV of earlier-layer features, scaled inversely with the weight-decay parameter. Second, we prove global optimality of DNC1 in a constrained representation-cost setting for both feedforward and residual architectures, showing that zero TCV across intermediate layers minimizes the representation cost under natural architectural constraints. Third, we establish a benign landscape property: for almost every interpolating initialization there exists a continuous, loss-decreasing path from the initialization to a globally optimal, DNC1-satisfying configuration. Our theoretical claims are validated empirically; numerical experiments confirm the predicted relations among TCV, singular-value structure, and weight decay. These results indicate that neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay.
[476] ACT: Agentic Classification Tree
Vincent Grari, Tim Arni, Thibault Laugel, Sylvain Lamprier, James Zou, Marcin Detyniecki
Main category: cs.LG
TL;DR: ACT extends decision trees to unstructured text data by using natural language questions as splits, refined through LLM feedback, achieving competitive performance while maintaining interpretability.
Details
Motivation: AI systems in high-stakes settings need transparent, interpretable decisions. Traditional decision trees work only on tabular data, while LLMs handle unstructured data but lack verifiable reasoning. There's a need for interpretable models that work directly on unstructured inputs like text.
Method: ACT extends decision trees to unstructured data by formulating each split as a natural-language question. The questions are refined through impurity-based evaluation and LLM feedback using TextGrad. This creates interpretable decision paths while operating on text inputs.
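The split-selection step can be pictured as scoring candidate questions by the weighted Gini impurity of the partition they induce; in the sketch below, ask_llm is a hypothetical yes/no LLM call, not an API from the paper.

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_impurity(question, texts, labels, ask_llm):
        # Weighted impurity of the yes/no split the question induces.
        labels = np.asarray(labels)
        answers = np.array([ask_llm(question, t) for t in texts], dtype=bool)
        score = 0.0
        for mask in (answers, ~answers):
            if mask.any():
                score += mask.sum() / len(labels) * gini(labels[mask])
        return score  # ACT refines the question (LLM feedback via TextGrad) to lower this

    # best_q = min(candidates, key=lambda q: split_impurity(q, X, y, ask_llm))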
Result: Experiments on text benchmarks show ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
Conclusion: ACT provides a way to apply interpretable decision-tree methodology to unstructured text data, offering both competitive performance and the transparency needed for high-stakes applications.
Abstract: When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
[477] Goal-Conditioned Reinforcement Learning from Sub-Optimal Data on Metric Spaces
Alfredo Reichlin, Miguel Vasco, Hang Yin, Danica Kragic
Main category: cs.LG
TL;DR: MetricRL combines metric learning for value functions with weighted imitation learning for goal-conditioned offline RL with sparse rewards, invertible actions, and deterministic transitions.
Details
Motivation: Addresses the challenge of learning optimal behavior from sub-optimal datasets in goal-conditioned offline RL, particularly dealing with distribution shift issues that arise when learning from imperfect demonstrations.
Method: Proposes MetricRL which combines metric learning for value function approximation with weighted imitation learning for policy estimation. Introduces distance monotonicity property linking metric representations to optimality and designs an objective that explicitly promotes this property.
Result: MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
Conclusion: MetricRL effectively mitigates distribution shift effects without requiring conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes.
Abstract: We study the problem of learning optimal behavior from sub-optimal datasets for goal-conditioned offline reinforcement learning under sparse rewards, invertible actions and deterministic transitions. To mitigate the effects of distribution shift, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal regimes. We introduce distance monotonicity as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
[478] Kernel-based Optimally Weighted Conformal Time-Series Prediction
Jonghyeok Lee, Chen Xu, Yao Xie
Main category: cs.LG
TL;DR: KOWCPI is a novel conformal prediction method for time-series that uses kernel-based optimal weighting to create narrower confidence intervals while maintaining coverage guarantees for non-exchangeable data.
Details
Motivation: Traditional conformal prediction methods assume exchangeable data, which doesn't hold for time-series data with temporal dependencies. There's a need for methods that can provide valid confidence intervals for non-exchangeable time-series data while achieving narrower intervals than existing approaches.
Method: KOWCPI adapts the Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. It establishes conditional coverage guarantees for non-exchangeable data under strong mixing conditions on the non-conformity scores.
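The flavour of the method: non-conformity scores from past time points are weighted by similarity to the current state, and the interval comes from weighted empirical quantiles. A NumPy sketch with a fixed Gaussian kernel, whereas KOWCPI learns optimal data-adaptive weights:

    import numpy as np

    def weighted_quantile(scores, weights, q):
        order = np.argsort(scores)
        s, w = scores[order], weights[order]
        cdf = np.cumsum(w) / w.sum()
        return s[min(np.searchsorted(cdf, q), len(s) - 1)]

    def kernel_weighted_interval(point_pred, past_residuals, past_feats,
                                 feat_now, alpha=0.1, h=1.0):
        # Residuals from states similar to the current one get larger weight.
        d2 = np.sum((past_feats - feat_now) ** 2, axis=1)
        w = np.exp(-d2 / (2 * h ** 2))
        lo = weighted_quantile(past_residuals, w, alpha / 2)
        hi = weighted_quantile(past_residuals, w, 1 - alpha / 2)
        return point_pred + lo, point_pred + hi   # (1 - alpha) interval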
Result: KOWCPI demonstrates superior performance on both real and synthetic time-series data compared to state-of-the-art methods, achieving narrower confidence intervals without losing coverage.
Conclusion: KOWCPI provides an effective conformal prediction framework for time-series that handles non-exchangeable data while producing more precise confidence intervals than existing methods.
Abstract: In this work, we present a novel conformal prediction method for time-series, which we call Kernel-based Optimally Weighted Conformal Prediction Intervals (KOWCPI). Specifically, KOWCPI adapts the classic Reweighted Nadaraya-Watson (RNW) estimator for quantile regression on dependent data and learns optimal data-adaptive weights. Theoretically, we tackle the challenge of establishing a conditional coverage guarantee for non-exchangeable data under strong mixing conditions on the non-conformity scores. We demonstrate the superior performance of KOWCPI on real and synthetic time-series data against state-of-the-art methods, where KOWCPI achieves narrower confidence intervals without losing coverage.
[479] IGC-Net for conditional average potential outcome estimation over time
Konstantin Hess, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel
Main category: cs.LG
TL;DR: IGC-Net: A neural end-to-end model for estimating conditional average potential outcomes over time using iterative G-computation to adjust for time-varying confounding in observational medical data.
Details
Motivation: Existing methods for estimating potential outcomes from observational medical data often fail to properly adjust for time-varying confounding, leading to biased estimates. Neural methods with proper adjustments have limitations like division by propensity scores close to zero, resulting in poor performance.
Method: IGC-Net is a novel neural end-to-end model that performs fully regression-based iterative G-computation for conditional average potential outcomes in time-varying settings. It adjusts for time-varying confounding through iterative G-computation rather than propensity score division.
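Classical iterative G-computation, which IGC-Net turns into a single end-to-end neural model, regresses a pseudo-outcome backwards through time while imposing the treatment plan. A linear-model NumPy sketch of that classical scheme (current covariates only, for brevity):

    import numpy as np

    def iterative_g_computation(X, A, Y, a_plan):
        # X: (n, T, d) covariates, A: (n, T) treatments, Y: (n,) outcomes,
        # a_plan: length-T treatment plan to evaluate.
        n, T, _ = X.shape
        v = Y.astype(float)                       # pseudo-outcome, starts at Y
        for t in range(T - 1, -1, -1):
            feats = np.column_stack([X[:, t, :], A[:, t], np.ones(n)])
            beta, *_ = np.linalg.lstsq(feats, v, rcond=None)
            feats_plan = np.column_stack([X[:, t, :], np.full(n, a_plan[t]), np.ones(n)])
            v = feats_plan @ beta                 # predict under the planned treatment
        return v.mean()                           # average potential outcome estimate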
Result: The paper evaluates IGC-Net across various experiments, demonstrating its effectiveness in estimating conditional average potential outcomes over time from observational data.
Conclusion: IGC-Net represents a significant step towards personalized decision-making from electronic health records by providing a neural approach that properly adjusts for time-varying confounding in outcome estimation.
Abstract: Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. However, many existing methods for this task fail to properly adjust for time-varying confounding and thus yield biased estimates. There are only a few neural methods with proper adjustments, but these have inherent limitations (e.g., division by propensity scores that are often close to zero), which result in poor performance. As a remedy, we introduce the iterative G-computation network (IGC-Net). Our IGC-Net is a novel, neural end-to-end model which adjusts for time-varying confounding in order to estimate conditional average potential outcomes (CAPOs) over time. Specifically, our IGC-Net is the first neural model to perform fully regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our IGC-Net across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.
[480] Measuring Orthogonality as the Blind-Spot of Uncertainty Disentanglement
Ivo Pascal de Jong, Andreea Ioana Sburlea, Matthia Sabatelli, Matias Valdenegro-Toro
Main category: cs.LG
TL;DR: The paper proposes orthogonal disentanglement of aleatoric and epistemic uncertainties, introduces Uncertainty Disentanglement Error (UDE) metric, and demonstrates that deep ensembles trained from scratch achieve orthogonal epistemic uncertainty estimates but aleatoric uncertainty still fails orthogonality.
Details
Motivation: Current methods for jointly estimating aleatoric (data) and epistemic (knowledge) uncertainty are problematic and evaluation methods are insufficient. The paper argues that these uncertainties should be orthogonally disentangled - each unaffected by the other - which is often not met in practice.
Method: Proves orthogonality and consistency as necessary and sufficient criteria for disentanglement. Constructs Uncertainty Disentanglement Error (UDE) metric to measure these criteria. Uses Deep Ensemble trained from scratch on ImageNet-1k with Information Theoretic disentangling approach.
Result: Empirical evaluation shows: 1) finetuned models give different orthogonality results than models trained from scratch, 2) UDE can be optimized through dropout rate, 3) Deep Ensemble achieves consistent and orthogonal estimates of epistemic uncertainty, but aleatoric uncertainty estimates still fail orthogonality.
Conclusion: Orthogonal disentanglement of uncertainties is crucial but challenging. While epistemic uncertainty can be made orthogonal, aleatoric uncertainty remains problematic. UDE provides a principled metric for evaluating uncertainty disentanglement.
Abstract: Aleatoric (data) and epistemic (knowledge) uncertainty are textbook components of Uncertainty Quantification. Jointly estimating these components has been shown to be problematic and non-trivial. As a result, there are multiple ways to disentangle these uncertainties, but current methods to evaluate them are insufficient. We propose that aleatoric and epistemic uncertainty estimates should be orthogonally disentangled - meaning that each uncertainty is not affected by the other - a necessary condition that is often not met. We prove that orthogonality and consistency are necessary and sufficient criteria for disentanglement, and construct Uncertainty Disentanglement Error (UDE) as a metric to measure these criteria, with further empirical evaluation showing that finetuned models give different orthogonality results than models trained from scratch and that UDE can be optimized for through the dropout rate. We demonstrate that a Deep Ensemble trained from scratch on ImageNet-1k with Information Theoretic disentangling achieves consistent and orthogonal estimates of epistemic uncertainty, but estimates of aleatoric uncertainty still fail on orthogonality.
[481] Hypercube Policy Regularization Framework for Offline Reinforcement Learning
Yi Shen, Hanyan Huang
Main category: cs.LG
TL;DR: A hypercube policy regularization framework for offline RL that reduces over-conservatism by allowing exploration of actions from similar states in the dataset, improving performance on low-quality datasets.
Details
Motivation: Standard offline RL methods suffer from over-conservatism due to policy constraints that force agents to strictly follow actions in the static dataset, leading to suboptimal policies especially with low-quality datasets.
Method: Proposes a hypercube policy regularization framework that relaxes constraints by allowing agents to explore actions corresponding to similar states in the dataset, implemented as TD3-BC-C and Diffusion-QL-C variants.
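One way to picture the relaxed constraint: project the policy's proposed action into the axis-aligned hypercube spanned by the actions taken at the k most similar dataset states, instead of cloning a single dataset action. A NumPy sketch (k and the Euclidean metric are illustrative choices):

    import numpy as np

    def hypercube_project(state, action, data_states, data_actions, k=10):
        # Nearest dataset states define per-dimension action bounds.
        d2 = np.sum((data_states - state) ** 2, axis=1)
        nn = np.argsort(d2)[:k]
        lo = data_actions[nn].min(axis=0)
        hi = data_actions[nn].max(axis=0)
        return np.clip(action, lo, hi)   # keep the action inside the hypercube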
Result: TD3-BC-C and Diffusion-QL-C outperform state-of-the-art algorithms like IQL, CQL, TD3-BC and Diffusion-QL on most D4RL environments with similar computational time.
Conclusion: The hypercube policy regularization framework effectively reduces over-conservatism in offline RL, improving algorithm performance especially on low-quality datasets while maintaining theoretical guarantees.
Abstract: Offline reinforcement learning has received extensive attention because it avoids interaction between the agent and the environment by learning a policy from a static dataset. However, general reinforcement learning methods cannot achieve satisfactory results in the offline setting because of out-of-distribution state-action pairs that the dataset cannot cover during training. To address this problem, policy regularization methods that try to directly clone the policies used in static datasets have been widely studied due to their simplicity and effectiveness. However, such policy constraints force the agent to choose the corresponding actions in the static dataset. This type of constraint is usually over-conservative and results in suboptimal policies, especially on low-quality static datasets. In this paper, a hypercube policy regularization framework is proposed; it alleviates the constraints of policy constraint methods by allowing the agent to explore the actions corresponding to similar states in the static dataset, which increases the effectiveness of algorithms on low-quality datasets. It is also theoretically demonstrated that the hypercube policy regularization framework can effectively improve the performance of the original algorithms. In addition, the hypercube policy regularization framework is combined with TD3-BC and Diffusion-QL, yielding TD3-BC-C and Diffusion-QL-C, for experiments on D4RL datasets. The experimental scores demonstrate that TD3-BC-C and Diffusion-QL-C perform better than state-of-the-art algorithms such as IQL, CQL, TD3-BC and Diffusion-QL in most D4RL environments, with comparable runtime.
[482] HypeRL: Hypernetwork-Based Reinforcement Learning for Control of Parametrized Dynamical Systems
Nicolò Botteghi, Stefania Fresca, Mengwu Guo, Andrea Manzoni
Main category: cs.LG
TL;DR: HypeRL: A deep reinforcement learning framework using hypernetworks to learn optimal control policies for parametric dynamical systems, enabling generalization across parameter variations.
Details
Motivation: Traditional numerical methods for optimal control of parametric dynamical systems (common in applied sciences/engineering) become computationally infeasible for high-dimensional or parameter-dependent problems. There's a need for more efficient approaches that can handle parameter variations.
Method: Uses actor-critic deep reinforcement learning with hypernetworks - additional neural networks that learn the weights and biases of the policy and value function networks, effectively embedding parametric information into the control framework.
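The core mechanism is a hypernetwork mapping the system parameters mu to the weights of the policy network; a minimal PyTorch sketch with illustrative layer sizes follows.

    import torch
    import torch.nn as nn

    class HyperPolicy(nn.Module):
        # Hypernetwork emits the weights of a one-hidden-layer policy MLP.
        def __init__(self, mu_dim, obs_dim, act_dim, hidden=32):
            super().__init__()
            self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
            n_w = (obs_dim + 1) * hidden + (hidden + 1) * act_dim
            self.hyper = nn.Sequential(nn.Linear(mu_dim, 64), nn.Tanh(),
                                       nn.Linear(64, n_w))

        def forward(self, obs, mu):
            w = self.hyper(mu)                    # all policy weights, flattened
            i = self.obs_dim * self.hidden
            W1 = w[:i].view(self.hidden, self.obs_dim)
            b1 = w[i:i + self.hidden]; i += self.hidden
            W2 = w[i:i + self.hidden * self.act_dim].view(self.act_dim, self.hidden)
            b2 = w[i + self.hidden * self.act_dim:]
            h = torch.tanh(obs @ W1.T + b1)
            return h @ W2.T + b2                  # parameter-conditioned action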
Result: Validated on two parametric control problems: 1D Kuramoto-Sivashinsky equation with in-domain control, and navigation of particle dynamics in 2D gyre flow. Showed that physical/task-dependent information encoding via hypernetworks is essential for learning parameter-dependent control policies.
Conclusion: HypeRL successfully overcomes limitations of traditional methods by learning optimal control policies that generalize across parameter variations through hypernetwork-based parameter encoding.
Abstract: In this work, we devise a new, general-purpose reinforcement learning strategy for the optimal control of parametric dynamical systems. Such problems frequently arise in applied sciences and engineering and entail a significant complexity when control and/or state variables are distributed in high-dimensional space or depend on varying parameters. Traditional numerical methods, relying on either iterative minimization algorithms – exploiting, e.g., the solution of the adjoint problem – or dynamic programming – also involving the solution of the Hamilton-Jacobi-Bellman (HJB) equation – while reliable, often become computationally infeasible. In this paper, we propose HypeRL, a deep reinforcement learning (DRL) framework to overcome the limitations shown by traditional methods. HypeRL aims at approximating the optimal control policy directly. Specifically, we employ an actor-critic DRL approach to learn an optimal feedback control strategy that can generalize across the range of variation of the parameters. To effectively learn such optimal control laws for different instances of the parameters, encoding the parameter information into the DRL policy and value function neural networks (NNs) is essential. HypeRL uses two additional NNs, called hypernetworks, to learn the weights and biases of the value function and the policy NNs. In this way, HypeRL effectively embeds the parametric information into the value function and policy. We validate the proposed approach on two parametric control problems, namely (i) a 1D parametric Kuramoto-Sivashinsky equation with in-domain control, and (ii) a navigation problem of particle dynamics in a parametric 2D gyre flow. We show that the knowledge of physical and task-dependent information and the encoding of this information via a hypernetwork, are essential ingredients for learning parameter-dependent control policies.
[483] Finding Kissing Numbers with Game-theoretic Reinforcement Learning
Chengdong Ma, Théo Tao Zhaowei, Pengyu Li, Minghao Liu, Haojun Chen, Zihao Mao, Yuan Cheng, Yuan Qi, Yaodong Yang
Main category: cs.LG
TL;DR: AI-driven reinforcement learning system PackingStar solves high-dimensional kissing number problems by modeling them as matrix completion games, breaking decades-old records and discovering thousands of new geometric structures.
Details
Motivation: The kissing number problem (maximal non-overlapping spheres around a central sphere) is a fundamental geometric challenge dating back to Newton, with limited progress due to high-dimensional complexity, combinatorial explosion, and reliance on rational structures.
Method: Model the problem as a two-player matrix completion game where entries represent pairwise cosines of sphere center vectors. One player fills entries while another corrects suboptimal ones, cooperatively maximizing matrix size (kissing number). Matrices are decomposed into representative substructures to guide subsequent games.
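The game state is easy to validate: a completed matrix of pairwise cosines encodes a kissing configuration in dimension d exactly when it has unit diagonal, off-diagonal entries at most 1/2 (sphere centers at angles of at least 60 degrees), and is positive semidefinite with rank at most d. A NumPy check, demonstrated on the classic hexagonal configuration that attains the kissing number 6 in two dimensions:

    import numpy as np

    def is_valid_kissing_gram(G, dim, tol=1e-8):
        n = G.shape[0]
        if not np.allclose(np.diag(G), 1.0, atol=tol):
            return False                       # centers must be unit vectors
        if G[~np.eye(n, dtype=bool)].max() > 0.5 + tol:
            return False                       # pairwise angle >= 60 degrees
        eig = np.linalg.eigvalsh(G)
        return eig.min() > -tol and np.sum(eig > tol) <= dim  # realizable in R^dim

    ang = np.deg2rad(60 * np.arange(6))
    V = np.stack([np.cos(ang), np.sin(ang)], axis=1)   # hexagonal configuration
    print(is_valid_kissing_gram(V @ V.T, dim=2))       # True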
Result: PackingStar surpasses records from dimensions 25 to 31, sets new lower bounds for generalized kissing numbers, achieves first breakthrough beyond rational structures since 1971 in 13 dimensions, discovers over 6000 new structures, and reveals configurations challenging antipodal paradigms with algebraic correspondences to finite simple groups.
Conclusion: AI demonstrates power to explore high-dimensional spaces beyond human intuition via extreme-scale reinforcement learning, opening new pathways for the kissing number problem and broader geometry research, with discovered patterns inspiring further human constructions.
Abstract: Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a central sphere has remained a fundamental challenge. This problem is the local analogue of Hilbert’s 18th problem, bridging geometry, number theory, and information theory. Although significant progress has been made through lattices and codes, the irregularities of high-dimensional geometry, dimensional structure variability, and combinatorial explosion beyond Go limit the scalability and generality of existing methods. Here we model the problem as a two-player matrix completion game and train the reinforcement learning system, PackingStar, to play the games. The matrix entries represent pairwise cosines of sphere center vectors. One player fills entries while another corrects suboptimal ones to improve exploration quality, cooperatively maximizing the matrix size, corresponding to the kissing number. These matrices are decomposed into representative substructures, providing diverse bases and structural constraints that steer subsequent games and make extremely large spaces tractable. PackingStar surpasses records from dimensions 25 to 31 and sets new lower bounds for generalized kissing numbers under various angular constraints. It achieves the first breakthrough beyond rational structures from 1971 in 13 dimensions and discovers over 6000 new structures in other dimensions. Notably, some configurations challenge long-held antipodal paradigms, revealing algebraic correspondences with finite simple groups as well as geometric relationships across dimensions. Inspired by these patterns, humans devised further improved constructions. These results demonstrate AI’s power to explore high-dimensional spaces beyond human intuition via extreme-scale reinforcement learning and open new pathways for the Kissing Number Problem and broader geometry research.
[484] Rethinking Approximate Gaussian Inference in Classification
Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig
Main category: cs.LG
TL;DR: Proposes replacing softmax with normCDF/sigmoid for efficient uncertainty quantification in classification, enabling sampling-free Gaussian inference and Dirichlet approximations.
Details
Motivation: Softmax only captures aleatoric uncertainty, not epistemic uncertainty. Existing Gaussian inference methods for epistemic uncertainty require costly Monte Carlo sampling due to intractable softmax integrals.
Method: Replace softmax with element-wise normCDF or sigmoid activations, enabling analytical approximations of Gaussian pushforwards via Dirichlet distributions with moment matching. Eliminates MC sampling overhead; see the sketch after the abstract.
Result: Improved uncertainty quantification on ImageNet, CIFAR-100, CIFAR-10 compared to softmax MC sampling, with no runtime/memory overhead from sampling.
Conclusion: Probit/sigmoid activations enable efficient, sampling-free uncertainty quantification for classification tasks, outperforming softmax-based MC approximations.
Abstract: In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation by element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.
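The tractability claim rests on a classical identity: if $Z \sim \mathcal{N}(\mu, \sigma^2)$, then $\mathbb{E}[\Phi(Z)] = \Phi(\mu / \sqrt{1 + \sigma^2})$, so the probit pushforward of a Gaussian logit needs no sampling at all. A quick numerical check of that identity (our naming; the paper's full method additionally moment-matches Dirichlet distributions across classes):

```python
import numpy as np
from scipy.stats import norm

def probit_predictive(mu, sigma):
    """Sampling-free E[Phi(Z)] for a Gaussian logit Z ~ N(mu, sigma^2).

    The softmax analogue of this integral is intractable; for normCDF it
    collapses to the closed form Phi(mu / sqrt(1 + sigma^2)).
    """
    return norm.cdf(mu / np.sqrt(1.0 + sigma**2))

mu, sigma = 0.7, 1.3
mc = norm.cdf(np.random.default_rng(0).normal(mu, sigma, 200_000)).mean()
print(probit_predictive(mu, sigma), mc)  # closed form matches the costly MC estimate
```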
[485] KernelBand: Steering LLM-based Kernel Optimization via Hardware-Aware Multi-Armed Bandits
Dezhi Ran, Shuxiao Xie, Mingfang Ji, Anmin Liu, Mengzhou Wu, Yuan Cao, Yuzhe Guo, Hao Yu, Linyi Li, Yitao Hu, Wei Yang, Tao Xie
Main category: cs.LG
TL;DR: KernelBand formulates GPU kernel optimization for LLM serving as a Multi-Armed Bandit problem, using hardware-aware pruning and trace-driven clustering to efficiently explore optimization space with code LLMs.
Details
Motivation: Optimizing GPU kernels for efficient LLM serving requires deep system expertise and is a search problem over vast optimization spaces. Existing code LLMs can generate functionally correct code but struggle with efficient exploration of optimization strategies for diverse hardware.
Method: Formulates kernel optimization as a Multi-Armed Bandit problem. Uses hardware-aware pruning via profiling bounds and trace-driven clustering leveraging Lipschitz continuity to navigate the infinite optimization strategy space. Theoretically reduces the regret bound to depend on the compact covering number of runtime clusters. A bandit skeleton is sketched after the abstract.
Result: Extensive experiments on TritonBench-G with three GPU architectures and four code LLMs show KernelBand consistently outperforms state-of-the-art methods with over 33% average improvement.
Conclusion: KernelBand bridges the gap between code LLMs and kernel optimization by framing it as an MAB problem, enabling sample-efficient discovery of high-performance kernels for LLM serving across diverse hardware.
Abstract: High-performance GPU kernels are critical for efficient LLM serving, yet their optimization remains a bottleneck requiring deep system expertise. While code LLMs show promise in generating functionally correct code, kernel optimization is intrinsically a search problem over a vast optimization space. The fundamental mismatch prevents existing LLM agents from efficiently exploring the optimization space for diverse hardware and compute patterns. To bridge the gap, we present KernelBand, a framework that formulates kernel optimization as a Multi-Armed Bandit (MAB) problem, explicitly balancing exploration and exploitation to unlock the potential of code LLMs. To navigate the infinite arm space of optimization strategies applied to candidate kernels, we design two key mechanisms: a hardware-aware pruning strategy via profiling bounds and a trace-driven clustering algorithm that leverages Lipschitz continuity. Theoretically, we prove that KernelBand reduces the regret bound to depend on the compact covering number of runtime clusters, ensuring sample-efficient discovery of high-performance kernels. Extensive experiments on TritonBench-G with three GPU architectures and four code LLMs show that KernelBand consistently and substantially outperforms state-of-the-art methods with over 33% average improvement.
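At its core the method wraps kernel generation in a bandit loop: pulling an arm means asking a code LLM for a kernel variant from one strategy family and profiling it. The UCB1 skeleton below mocks the profiler with noise; KernelBand's actual contributions (hardware-aware pruning via profiling bounds, Lipschitz trace clustering) sit on top of a loop like this, and every name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical strategy clusters standing in for KernelBand's runtime clusters;
# the true mean speedups are unknown to the agent.
true_speedup = {"tiling": 1.4, "vectorize": 1.1, "pipeline": 1.7}
counts = {a: 0 for a in true_speedup}
means = {a: 0.0 for a in true_speedup}

def profile(arm):
    # Stand-in for "generate a kernel with a code LLM, then benchmark it".
    return rng.normal(true_speedup[arm], 0.3)

for t in range(1, 201):
    # UCB1: exploit high observed speedup, explore rarely tried strategies.
    ucb = {a: means[a] + np.sqrt(2 * np.log(t) / counts[a]) if counts[a] else np.inf
           for a in counts}
    arm = max(ucb, key=ucb.get)
    r = profile(arm)
    counts[arm] += 1
    means[arm] += (r - means[arm]) / counts[arm]

print(max(means, key=means.get))  # settles on "pipeline", the best strategy
```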
[486] ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model
Sagnik Bhattacharya, Abhiram Gorle, Ahsan Bilal, Connor Ding, Amit Kumar Singh Yadav, Tsachy Weissman
Main category: cs.LG
TL;DR: ItDPDM is a discrete diffusion model for non-negative discrete data that combines exact likelihood estimation with discrete-state modeling using an information-theoretic Poisson reconstruction loss.
Details
Motivation: Existing methods for generative modeling of discrete data like symbolic music have two key limitations: (1) they often model continuous embeddings, which is suboptimal for discrete data, and (2) they optimize variational bounds rather than exact likelihood, leading to inaccurate likelihood estimates and degraded sampling quality.
Method: Introduces the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), inspired by photon arrival processes. Uses an information-theoretic Poisson Reconstruction Loss (PRL) that has a provable exact relationship with the true data likelihood, enabling exact likelihood estimation with fully discrete-state modeling. A stand-in loss sketch follows the abstract.
Result: ItDPDM achieves improved likelihood and sampling performance over prior discrete and continuous diffusion models on synthetic discrete datasets. On real-world datasets including symbolic music and images, it attains superior likelihood estimates and competitive generation quality.
Conclusion: The work demonstrates a proof of concept for distribution-robust discrete generative modeling that addresses both discrete-state modeling and exact likelihood estimation simultaneously.
Abstract: Generative modeling of non-negative, discrete data, such as symbolic music, remains challenging due to two persistent limitations in existing methods. Firstly, many approaches rely on modeling continuous embeddings, which is suboptimal for inherently discrete data distributions. Secondly, most models optimize variational bounds rather than exact data likelihood, resulting in inaccurate likelihood estimates and degraded sampling quality. While recent diffusion-based models have addressed these issues separately, we tackle them jointly. In this work, we introduce the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), inspired by the photon arrival process, which combines exact likelihood estimation with fully discrete-state modeling. Central to our approach is an information-theoretic Poisson Reconstruction Loss (PRL) that has a provable exact relationship with the true data likelihood. ItDPDM achieves improved likelihood and sampling performance over prior discrete and continuous diffusion models on a variety of synthetic discrete datasets. Furthermore, on real-world datasets such as symbolic music and images, ItDPDM attains superior likelihood estimates and competitive generation quality, demonstrating a proof of concept for distribution-robust discrete generative modeling.
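We won't reproduce the exact form of the paper's PRL here, but the standard Poisson negative log-likelihood below conveys the flavor: score a non-negative integer observation against a predicted rate, the natural discrete analogue of an L2 reconstruction loss (names and the loss form are our stand-ins, not the paper's):

```python
import numpy as np
from scipy.special import gammaln

def poisson_nll(rate, x):
    """Standard Poisson negative log-likelihood -log p(x | rate).

    A stand-in for the paper's information-theoretic Poisson Reconstruction
    Loss: both compare non-negative integer data to a predicted rate rather
    than embedding discrete values in a continuous space.
    """
    rate = np.clip(rate, 1e-8, None)
    return rate - x * np.log(rate) + gammaln(x + 1.0)  # gammaln(x + 1) = log(x!)

x = np.array([0, 3, 7])            # e.g. quantized note velocities or pixel counts
rate = np.array([0.5, 2.8, 7.2])   # the model's predicted Poisson rates
print(poisson_nll(rate, x).sum())
```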
[487] Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
Ichiro Hashimoto
Main category: cs.LG
TL;DR: Theoretical analysis of benign overfitting in leaky ReLU two-layer neural networks trained on mixture data via gradient descent, establishing directional convergence and classification error bounds.
Details
Motivation: Previous work on benign overfitting in neural networks was limited to nearly orthogonal data settings and gradient flow analysis. This paper aims to extend understanding to more realistic mixture data settings and gradient descent optimization.
Method: Established directional convergence of network parameters for leaky ReLU two-layer networks trained via gradient descent on mixture data. Derived classification error bounds for the convergent direction and analyzed phase transitions in overfitting behavior.
Result: Proved benign overfitting occurs with high probability in wider scenarios than previously known, characterized cases where benign overfitting fails even with directional convergence, and discovered a new phase transition phenomenon.
Conclusion: Provides a more complete theoretical picture of benign overfitting in leaky ReLU two-layer networks, extending results to mixture data settings and gradient descent optimization beyond previous limitations.
Abstract: In this paper, we provide sufficient conditions for benign overfitting of fixed-width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. Our results are derived by establishing directional convergence of the network parameters and a classification error bound for the convergent direction. Our classification error bound also leads to the discovery of a newly identified phase transition. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. Due to the lack of directional convergence, previous results on benign overfitting were limited to those trained on nearly orthogonal data. All of our results hold on mixture data, which is a broader data setting than the nearly orthogonal data setting in prior work. We demonstrate our findings by showing that benign overfitting occurs with high probability in a much wider range of scenarios than previously known. Our results also allow us to characterize cases when benign overfitting provably fails even if directional convergence occurs. Our work thus provides a more complete picture of benign overfitting in leaky ReLU two-layer neural networks.
[488] Orion-Bix: Bi-Axial Attention for Tabular In-Context Learning
Mohamed Bouadi, Pratinav Seth, Aditya Tanna, Vinay Kumar Sankarapu
Main category: cs.LG
TL;DR: Orion-Bix is a tabular foundation model using biaxial attention and meta-learned in-context reasoning for few-shot learning on mixed numeric/categorical data.
Details
Motivation: Tabular data is crucial for real-world ML but challenging for general-purpose modeling due to mixed data types, weak feature structure, and limited labeled data. Current approaches struggle with scaling and generalization.
Method: Combines biaxial attention (alternating standard, grouped, hierarchical, and relational attention) with multi-CLS summarization to capture local/global dependencies. Uses meta-learned in-context reasoning with hierarchical decision routing for few-shot adaptation to large label spaces. A minimal bi-axial attention sketch follows the abstract.
Result: Outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models on public benchmarks, demonstrating robust few-shot learning capabilities.
Conclusion: Biaxial attention with episodic meta-training enables effective few-shot tabular learning, providing a scikit-learn compatible foundation model for heterogeneous tabular data.
Abstract: Tabular data drive most real-world machine learning applications, yet building general-purpose models for them remains difficult. Mixed numeric and categorical fields, weak feature structure, and limited labeled data make scaling and generalization challenging. To this end, we introduce Orion-Bix, a tabular foundation model that combines biaxial attention with meta-learned in-context reasoning for few-shot tabular learning. Its encoder alternates standard, grouped, hierarchical, and relational attention, fusing their outputs through multi-CLS summarization to capture both local and global dependencies efficiently. A label-aware ICL head adapts on the fly and scales to large label spaces via hierarchical decision routing. Meta-trained on synthetically generated, structurally diverse tables with causal priors, Orion-Bix learns transferable inductive biases across heterogeneous data. Delivered as a scikit-learn compatible foundation model, it outperforms gradient-boosting baselines and remains competitive with state-of-the-art tabular foundation models on public benchmarks, showing that biaxial attention with episodic meta-training enables robust, few-shot-ready tabular learning. The model is publicly available at https://github.com/Lexsi-Labs/Orion-BiX.
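A weights-free numpy sketch of the bi-axial idea: alternate plain self-attention along the sample axis and the feature axis of a table embedding. The real Orion-Bix encoder interleaves standard, grouped, hierarchical, and relational attention with multi-CLS summarization and learned projections, none of which appears here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain self-attention over axis 0 of a (tokens, dim) matrix, no learned weights."""
    A = softmax(X @ X.T / np.sqrt(X.shape[-1]), axis=-1)
    return A @ X

def biaxial_block(T):
    """One bi-axial step on a (rows, cols, dim) table embedding: attend across
    samples within each feature, then across features within each sample."""
    rows, cols, _ = T.shape
    T = np.stack([self_attention(T[:, j]) for j in range(cols)], axis=1)  # sample axis
    T = np.stack([self_attention(T[i]) for i in range(rows)], axis=0)     # feature axis
    return T

T = np.random.default_rng(0).normal(size=(8, 5, 16))  # 8 rows x 5 columns, dim 16
print(biaxial_block(T).shape)  # (8, 5, 16)
```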
[489] On Transferring Transferability: Towards a Theory for Size Generalization
Eitan Levin, Yuxin Ma, Mateo Díaz, Soledad Villar
Main category: cs.LG
TL;DR: A framework for analyzing transferability of models across different input dimensions, showing it corresponds to continuity in a limit space formed by identifying small and large problem instances.
Details
Motivation: Many learning tasks require models that handle inputs of varying sizes, and there's interest in whether models trained on low-dimensional data can transfer to higher-dimensional inputs. The paper aims to provide a general framework for understanding transferability across dimensions.
Method: Introduces a general framework for transferability across dimensions, showing it corresponds to continuity in a limit space formed by identifying small problem instances with equivalent large ones. The identification is driven by the data and learning task. Instantiates the framework on existing architectures and implements the necessary changes for transferability.
Result: Numerical experiments support the findings. The framework provides design principles for creating new transferable models and shows how to modify existing architectures to ensure transferability.
Conclusion: Transferability across dimensions can be precisely characterized as continuity in a limit space, providing both theoretical understanding and practical design principles for creating dimension-independent models.
Abstract: Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures, and implement the necessary changes to ensure their transferability. Finally, we provide design principles for designing new transferable models. Numerical experiments support our findings.
[490] EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs
Chama Bensmail
Main category: cs.LG
TL;DR: EvoXplain is a diagnostic framework that measures explanation stability across repeated training runs to reveal whether models with high accuracy use consistent or competing internal mechanisms.
Details
Motivation: Current ML practice assumes that high-accuracy models have correct explanations, but overlooks whether different models achieving similar accuracy use the same internal logic or competing mechanisms.
Method: Treats explanations as samples from training pipelines without aggregating predictions, and examines whether they form one coherent explanatory basin or separate into multiple structured basins across repeated training runs. A toy stability measurement follows the abstract.
Result: Found varying explanation stability: DNNs on Breast Cancer converge to single basin, same architecture on Adult Income separates into distinct basins, Logistic Regression on Breast Cancer shows conditional multiplicity controlled by regularization.
Conclusion: EvoXplain reframes interpretability as a property of training pipelines under repeated instantiation, making explanatory structure visible and quantifiable rather than selecting a “correct” explanation.
Abstract: Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. This assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different and potentially competing mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles. It examines whether these samples form a single coherent explanatory basin or separate into multiple structured explanatory basins. We evaluate EvoXplain on the Adult Income and Breast Cancer datasets using deep neural networks and Logistic Regression. Although all models achieve high predictive accuracy, explanation stability differs across pipelines. Deep neural networks on Breast Cancer converge to a single explanatory basin, while the same architecture on Adult Income separates into distinct explanatory basins despite identical training conditions. Logistic Regression on Breast Cancer exhibits conditional multiplicity, where basin accessibility is controlled by regularisation configuration. EvoXplain does not attempt to select a correct explanation. Instead, it makes explanatory structure visible and quantifiable, revealing when single instance explanations obscure the existence of multiple admissible predictive mechanisms. More broadly, EvoXplain reframes interpretability as a property of the training pipeline under repeated instantiation, rather than of any single trained model.
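The measurement is easy to prototype: rerun the same pipeline under different randomness, treat each run's explanation as a sample, and inspect the samples' geometry. A toy version with logistic regression on Breast Cancer, using normalized coefficient vectors as the explanation (EvoXplain's explanation extraction and basin analysis are richer than this):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

coefs = []
for seed in range(20):  # 20 runs of the same pipeline, different randomness
    Xtr, _, ytr, _ = train_test_split(X, y, test_size=0.3, random_state=seed)
    w = LogisticRegression(max_iter=5000).fit(Xtr, ytr).coef_.ravel()
    coefs.append(w / np.linalg.norm(w))  # this run's "explanation"

C = np.array(coefs)
sim = C @ C.T  # pairwise cosine similarity between explanations
print(sim.min())  # near 1.0: one explanatory basin; clearly lower: multiplicity
```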
[491] Tailored Behavior-Change Messaging for Physical Activity: Integrating Contextual Bandits and Large Language Models
Haochen Song, Dominik Hofer, Rania Islambouli, Laura Hawkins, Ananya Bhattacharjee, Zahra Hassanzadeh, Jan Smeddinck, Meredith Franklin, Joseph Jay Williams
Main category: cs.LG
TL;DR: Hybrid cMABxLLM approach combines contextual multi-armed bandits for intervention type selection with LLMs for personalized message content generation in a 30-day physical activity intervention.
Details
Motivation: Traditional cMABs require large samples and use fixed message templates, limiting personalization. The authors aim to create a hybrid system that combines the adaptive decision-making of cMABs with the generative personalization capabilities of LLMs for more effective behavioral interventions.
Method: Deployed a 30-day physical activity intervention comparing five delivery models: equal randomization (RCT), cMAB only, LLM only, LLM with interaction history, and cMABxLLM. The hybrid approach uses the cMAB to select the intervention type (behavioral self-monitoring, gain-framing, loss-framing, social comparison) and the LLM to personalize message content using dynamic contextual factors like self-efficacy, social influence, and regulatory focus.
Result: The cMABxLLM approach retained perceived acceptance of LLM-generated messages while reducing token usage and providing explicit, reproducible decision rules. It also avoided skew in intervention delivery by improving support for under-delivered intervention types.
Conclusion: The hybrid cMABxLLM approach provides a deployable template for combining Bayesian adaptive experimentation with generative models, supporting both personalization and interpretability in behavioral interventions.
Abstract: Contextual multi-armed bandit (cMAB) algorithms offer a promising framework for adapting behavioral interventions to individuals over time. However, cMABs often require large samples to learn effectively and typically rely on a finite pre-set of fixed message templates. In this paper, we present a hybrid cMABxLLM approach in which the cMAB selects an intervention type and a large language model (LLM) personalizes the message content within the selected type. We deployed this approach in a 30-day physical-activity intervention, comparing four behavioral change intervention types: behavioral self-monitoring, gain-framing, loss-framing, and social comparison, delivered as daily motivational messages to support motivation and achieve a daily step count. Message content is personalized using dynamic contextual factors, including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over the trial, participants received daily messages assigned by one of five models: equal randomization (RCT), cMAB only, LLM only, LLM with interaction history, or cMABxLLM. Outcomes include motivation towards physical activity and message usefulness, assessed via ecological momentary assessments (EMAs). We evaluate and compare the five delivery models using pre-specified statistical analyses that account for repeated measures and time trends. We find that the cMABxLLM approach retains the perceived acceptance of LLM-generated messages, while reducing token usage and providing an explicit, reproducible decision rule for intervention selection. This hybrid approach also avoids the skew in intervention delivery by improving support for under-delivered intervention types. More broadly, our approach provides a deployable template for combining Bayesian adaptive experimentation with generative models in a way that supports both personalization and interpretability.
[492] Little By Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
Main category: cs.LG
TL;DR: MoRAM: A continual learning method using fine-grained rank-1 adapters as associative memory units, eliminating explicit routers and improving plasticity-stability trade-off for large pre-trained models.
Details
Motivation: Existing LoRA-based Mixture-of-Experts methods for continual learning suffer from task interference, catastrophic forgetting, redundancy, and ambiguous routing due to coarse-grained experts. New experts often duplicate or conflict with existing ones, causing routing degradation as experts accumulate.
Method: Proposes MoRAM (Mixture of Rank-1 Associative Memory), which treats weight matrices as linear associative memories. Uses fine-grained rank-1 adapters as atomic memory experts, viewing them as key-value pairs. Eliminates explicit routers via a self-activation mechanism where each memory atom evaluates its own relevance through an intrinsic key, transforming adaptation into content-addressable retrieval. A toy layer sketch follows the abstract.
Result: Extensive experiments on CLIP and LLMs show MoRAM significantly outperforms state-of-the-art baselines, achieving superior plasticity-stability trade-offs, improving generalization while mitigating forgetting.
Conclusion: MoRAM provides an effective continual learning approach for large pre-trained models by using fine-grained rank-1 associative memory units, eliminating routing complexity, and achieving better balance between learning new tasks and retaining old knowledge.
Abstract: Continual learning (CL) with large pre-trained models is challenged by task interference and catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods mitigate forgetting by adding new task-specific adapters and freezing old ones, but often suffer from redundancy, interference, and ambiguous routing due to coarse-grained experts and routing. Coarse-grained experts (i.e., full LoRA adapters with large rank) encode low-specialty information. Newly added experts often duplicate or conflict with existing ones, causing redundancy and interference. Their low specialization further confuses the router, accelerating routing degradation and forgetting as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices function as linear associative memories, MoRAM achieves CL as gradual incrementing of atomic rank-1 memory experts. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value pairs, we eliminate explicit routers in MoE-LoRA, using a self-activation mechanism where each memory atom evaluates its own relevance via its intrinsic key. This transforms the adaptation process into robust, content-addressable retrieval. Extensive experiments on CLIP and LLMs demonstrate that MoRAM significantly outperforms state-of-the-art baselines, achieving superior plasticity-stability trade-offs, improving generalization while mitigating forgetting.
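A toy forward pass showing the router-free mechanism: each rank-1 expert is a key-value pair (k_i, v_i) whose relevance to the input is its own key response <x, k_i>, so retrieval is content-addressable. Shapes and the gating form are our guesses for illustration, not the paper's exact parameterization:

```python
import numpy as np

def moram_layer(x, W0, keys, values, temp=1.0):
    """Frozen base weights W0 plus self-activated rank-1 key-value memory experts.

    Each expert scores its own relevance as <x, k_i> (no separate router);
    the output adds sum_i gate_i * <x, k_i> * v_i, i.e. associative retrieval.
    """
    scores = keys @ x                       # (num_experts,) self-evaluated relevance
    s = temp * scores
    gates = np.exp(s - s.max())
    gates /= gates.sum()                    # soft selection among memory atoms
    return W0 @ x + values.T @ (gates * scores)

rng = np.random.default_rng(0)
d_in, d_out, n_exp = 16, 8, 4
x = rng.normal(size=d_in)
W0 = 0.1 * rng.normal(size=(d_out, d_in))   # frozen pre-trained weights
keys = rng.normal(size=(n_exp, d_in))       # one key per rank-1 memory expert
values = rng.normal(size=(n_exp, d_out))    # one value per expert
print(moram_layer(x, W0, keys, values).shape)  # (8,)
```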
[493] Provably Robust Bayesian Counterfactual Explanations under Model Changes
Jamie Duell, Xiuyi Fan
Main category: cs.LG
TL;DR: PSCE generates counterfactual explanations with probabilistic safety guarantees for model updates, ensuring high confidence and low variance predictions.
Details
Motivation: Existing counterfactual explanations become invalid when models are updated frequently in real-world settings, creating a need for explanations that remain reliable under model changes.
Method: Probabilistically Safe CEs (PSCE) uses Bayesian principles to generate δ-safe (high predictive confidence) and ε-robust (low predictive variance) counterfactual explanations with formal probabilistic guarantees under model changes.
Result: PSCE produces more plausible and discriminative counterfactual explanations compared to state-of-the-art Bayesian CE methods, with provable robustness under model changes.
Conclusion: PSCE provides a robust framework for generating counterfactual explanations that maintain reliability even when machine learning models are updated, addressing a critical limitation in real-world deployment.
Abstract: Counterfactual explanations (CEs) offer interpretable insights into machine learning predictions by answering “what if?” questions. However, in real-world settings where models are frequently updated, existing counterfactual explanations can quickly become invalid or unreliable. In this paper, we introduce Probabilistically Safe CEs (PSCE), a method for generating counterfactual explanations that are $\delta$-safe, to ensure high predictive confidence, and $\varepsilon$-robust to ensure low predictive variance. Based on Bayesian principles, PSCE provides formal probabilistic guarantees for CEs under model changes which are adhered to in what we refer to as the $\langle \delta, \varepsilon \rangle$-set. Uncertainty-aware constraints are integrated into our optimization framework and we validate our method empirically across diverse datasets. We compare our approach against state-of-the-art Bayesian CE methods, where PSCE produces counterfactual explanations that are not only more plausible and discriminative, but also provably robust under model change.
[494] Uncertainty-driven Embedding Convolution
Sungjun Lim, Kangjun Noh, Youngjun Choi, Heeyoung Lee, Kyungwoo Song
Main category: cs.LG
TL;DR: UEC (Uncertainty-driven Embedding Convolution) is a novel ensemble method that transforms deterministic embeddings into probabilistic ones and uses uncertainty modeling to improve performance and robustness across NLP tasks.
Details
Motivation: No single embedding model dominates across all domains and tasks, motivating ensemble techniques. However, existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting robustness and reliability.
Method: UEC transforms deterministic embeddings into probabilistic ones post hoc, computes adaptive ensemble coefficients based on embedding uncertainty from a surrogate-loss formulation, and uses an uncertainty-aware similarity function that incorporates uncertainty into similarity scoring. An inverse-variance fusion sketch follows the abstract.
Result: Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
Conclusion: UEC provides a theoretically grounded and efficient approach to embedding ensemble that accounts for model uncertainty, offering improved performance and robustness over existing methods.
Abstract: Text embeddings are essential components in modern NLP pipelines. Although numerous embedding models have been proposed, no single model consistently dominates across domains and tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble coefficients based on embedding uncertainty, derived from a principled surrogate-loss formulation. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
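The simplest instance of uncertainty-driven ensemble coefficients is inverse-variance (precision) weighting of Gaussian embeddings; the paper derives its coefficients from a surrogate loss, but the classic special case below shows the mechanism, with all names ours:

```python
import numpy as np

def uec_combine(mus, vars_):
    """Precision-weighted fusion of per-model probabilistic embeddings.

    mus, vars_: (n_models, dim) means and elementwise variances. Less certain
    models (larger variance) receive smaller ensemble coefficients.
    """
    prec = 1.0 / vars_
    w = prec / prec.sum(axis=0)      # adaptive, per-dimension ensemble weights
    mu = (w * mus).sum(axis=0)       # fused embedding
    var = 1.0 / prec.sum(axis=0)     # fused uncertainty, usable in similarity scoring
    return mu, var

rng = np.random.default_rng(0)
mus = rng.normal(size=(3, 4))                        # 3 embedding models, dim 4
vars_ = np.array([[0.1] * 4, [1.0] * 4, [5.0] * 4])  # model 0 is most confident
mu, var = uec_combine(mus, vars_)
print(mu.round(2), var.round(3))  # fused mean tracks the confident model
```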
[495] Rank-1 Approximation of Inverse Fisher for Natural Policy Gradients in Deep Reinforcement Learning
Yingxiao Huo, Satya Prakash Dash, Radu Stoican, Samuel Kaski, Mingfei Sun
Main category: cs.LG
TL;DR: Efficient natural policy optimization using rank-1 approximation to inverse Fisher Information Matrix for faster convergence in deep reinforcement learning.
Details
Motivation: Natural gradients offer fast convergence in deep RL but require computationally expensive inversion of the Fisher Information Matrix at each iteration, making them impractical for large-scale applications.
Method: Proposes a scalable natural policy optimization technique using a rank-1 approximation to the full inverse FIM, reducing computational complexity while maintaining convergence benefits. A Sherman-Morrison sketch follows the abstract.
Result: The method achieves superior performance to standard actor-critic and trust-region baselines across diverse environments, with theoretical guarantees of faster convergence than policy gradients.
Conclusion: Rank-1 approximation to inverse-FIM provides an efficient and scalable approach to natural policy optimization that maintains convergence benefits while being computationally feasible.
Abstract: Natural gradients have long been studied in deep reinforcement learning due to their fast convergence properties and covariant weight updates. However, computing natural gradients requires inversion of the Fisher Information Matrix (FIM) at each iteration, which is computationally prohibitive in nature. In this paper, we present an efficient and scalable natural policy optimization technique that leverages a rank-1 approximation to full inverse-FIM. We theoretically show that under certain conditions, a rank-1 approximation to inverse-FIM converges faster than policy gradients and, under some conditions, enjoys the same sample complexity as stochastic policy gradient methods. We benchmark our method on a diverse set of environments and show that it achieves superior performance to standard actor-critic and trust-region baselines.
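Why a rank-1 approximation is cheap: model the Fisher as F ≈ λI + uuᵀ, and Sherman-Morrison gives its inverse in closed form, so the natural-gradient direction costs O(d) with no matrix ever formed or inverted. A sketch under that modeling assumption (the paper's exact rank-1 construction may differ):

```python
import numpy as np

def natural_grad_rank1(g, u, lam=1e-2):
    """Natural-gradient direction for the Fisher model F = lam * I + u u^T.

    Sherman-Morrison: F^{-1} g = (g - u * (u.g) / (lam + u.u)) / lam,
    an O(d) computation with no d x d inversion.
    """
    return (g - u * (u @ g) / (lam + u @ u)) / lam

rng = np.random.default_rng(0)
d = 1000
g = rng.normal(size=d)   # vanilla policy gradient
u = rng.normal(size=d)   # rank-1 Fisher factor (e.g. an averaged score vector)
step = natural_grad_rank1(g, u)

# Sanity check against an explicit solve on the same Fisher model:
F = 1e-2 * np.eye(d) + np.outer(u, u)
print(np.allclose(step, np.linalg.solve(F, g)))  # True
```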
[496] Deep Network Trainability via Persistent Subspace Orthogonality
Alex Massucco, Davide Murari, Carola-Bibiane Schönlieb
Main category: cs.LG
TL;DR: Paper proposes architectures with orthogonal Jacobians to mitigate vanishing/exploding gradients, introducing persistent subspace orthogonality for deeper network training.
Details
Motivation: Training deep neural networks via backpropagation is hindered by vanishing or exploding gradients, which limits network depth and training stability.
Method: Design architectures by analyzing and controlling the network Jacobian; characterize networks with orthogonal Jacobian; introduce persistent subspace orthogonality (Jacobians that are isometries on a non-trivial subspace); propose practical mechanisms to enforce this condition. A norm-preservation check follows the abstract.
Result: Empirical results show the proposed condition is necessary to preserve gradient norms during backpropagation, enabling training of very deep networks.
Conclusion: Controlling network Jacobian through orthogonal or subspace-orthogonal designs effectively mitigates gradient issues and enables deeper network training.
Abstract: Training neural networks via backpropagation is often hindered by vanishing or exploding gradients. In this work, we design architectures that mitigate these issues by analyzing and controlling the network Jacobian. We first provide a unified characterization for a class of networks with orthogonal Jacobian including known architectures and yielding new trainable designs. We then introduce the relaxed notion of persistent subspace orthogonality. This applies to a broader class of networks whose Jacobians are isometries only on a non-trivial subspace. We propose practical mechanisms to enforce this condition and empirically show that it is necessary to sufficiently preserve the gradient norms during backpropagation, enabling the training of very deep networks. We support our theory with extensive experiments.
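The orthogonal-Jacobian case is easy to experience directly: stack orthogonally parametrized linear layers and watch gradient norms survive a 100-layer backward pass. A generic PyTorch recipe in the spirit of the paper's orthogonal class (its constructions, and the subspace-orthogonal relaxation, are more general; nonlinearities are omitted here for clarity):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

depth, width = 100, 64
# Each layer's weight is constrained to be orthogonal, so each layer's
# Jacobian is an isometry and gradient norms are preserved exactly.
net = nn.Sequential(*[orthogonal(nn.Linear(width, width, bias=False))
                      for _ in range(depth)])

x = torch.randn(8, width, requires_grad=True)
net(x).pow(2).sum().backward()
# Depth-independent gradient norm: equals 2 * ||x|| for this loss,
# with no vanishing or exploding despite 100 layers.
print(x.grad.norm().item(), 2 * x.norm().item())
```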
[497] Conformal Unlearning: A New Paradigm for Unlearning in Conformal Predictors
Yahya Alkhatib, Muhammad Ahmar Jamal, Wee Peng Tay
Main category: cs.LG
TL;DR: A new framework for conformal unlearning that ensures trained conformal predictors miscovers targeted data while maintaining valid coverage on remaining data, with rigorous statistical guarantees.
Details
Motivation: Existing machine unlearning methods lack rigorous uncertainty-aware statistical measures for evaluating unlearning effectiveness in conformal prediction settings, leading to "fake conformal unlearning" where models appear unlearned but still correctly cover forgotten data.
Method: Proposes a new paradigm for conformal machine unlearning that provides finite-sample, uncertainty-aware guarantees without relying on retrained models as a reference. Formalizes requirements for high coverage on retained data and high miscoverage on forgotten data, introduces practical empirical metrics, and presents an algorithm optimizing these conformal objectives. A toy metric sketch follows the abstract.
Result: Extensive experiments on vision and text benchmarks demonstrate the approach effectively removes targeted information while preserving utility, addressing the limitations of existing methods.
Conclusion: The proposed framework provides a principled approach to conformal unlearning with rigorous statistical guarantees, solving the problem of “fake conformal unlearning” and enabling effective removal of targeted information while maintaining coverage on retained data.
Abstract: Conformal unlearning aims to ensure that a trained conformal predictor miscovers data points with specific shared characteristics, such as those from a particular label class, associated with a specific user, or belonging to a defined cluster, while maintaining valid coverage on the remaining data. Existing machine unlearning methods, which typically approximate a model retrained from scratch after removing the data to be forgotten, face significant challenges when applied to conformal unlearning. These methods often lack rigorous, uncertainty-aware statistical measures to evaluate unlearning effectiveness and exhibit a mismatch between their degraded performance on forgotten data and the frequency with which that data are still correctly covered by conformal predictors, a phenomenon we term “fake conformal unlearning”. To address these limitations, we propose a new paradigm for conformal machine unlearning that provides finite-sample, uncertainty-aware guarantees on unlearning performance without relying on a retrained model as a reference. We formalize conformal unlearning to require high coverage on retained data and high miscoverage on forgotten data, introduce practical empirical metrics for evaluation, and present an algorithm that optimizes these conformal objectives. Extensive experiments on vision and text benchmarks demonstrate that the proposed approach effectively removes targeted information while preserving utility.
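The paper's two requirements translate directly into empirical metrics: coverage of retained points should stay near 1 - α while forgotten points should fall outside the prediction sets. A generic split-conformal version of those metrics (the paper's own metrics and algorithm are more refined; scores here are synthetic):

```python
import numpy as np

def conformal_covered(cal_scores, test_scores, alpha=0.1):
    """Split conformal: a point is covered when its nonconformity score is
    below the ceil((n + 1)(1 - alpha)) / n calibration quantile."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return test_scores <= np.quantile(cal_scores, level)

def unlearning_metrics(cal, retain, forget, alpha=0.1):
    """Coverage on retained data (want ~1 - alpha) and miscoverage on
    forgotten data (want ~1): the two conformal unlearning requirements."""
    return (conformal_covered(cal, retain, alpha).mean(),
            1.0 - conformal_covered(cal, forget, alpha).mean())

rng = np.random.default_rng(0)
cal = rng.normal(0, 1, 500)      # calibration nonconformity scores
retain = rng.normal(0, 1, 500)   # same distribution: should stay covered
forget = rng.normal(4, 1, 500)   # successfully unlearned: should be miscovered
print(unlearning_metrics(cal, retain, forget))  # roughly (0.9, 1.0)
```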
[498] MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification
Tiantian Yang, Zhiqian Chen
Main category: cs.LG
TL;DR: MOTGNN is an interpretable framework for binary disease classification that integrates multi-omics data using XGBoost for graph construction, modality-specific GNNs for representation learning, and deep feedforward networks for cross-omics integration.
Details
Motivation: Multi-omics data integration is challenging due to high dimensionality, modality heterogeneity, the lack of reliable biological networks, reliance on handcrafted similarity graphs, vulnerability to class imbalance, and the lack of interpretability in existing models.
Method: Uses XGBoost for omics-specific supervised graph construction, modality-specific Graph Neural Networks for hierarchical representation learning, and deep feedforward networks for cross-omics integration.
Result: Outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score across three real-world disease datasets, remains robust to severe class imbalance, and provides interpretability through top-ranked biomarkers and modality contributions.
Conclusion: MOTGNN improves both predictive accuracy and interpretability in multi-omics disease modeling, offering computational efficiency through sparse graphs and built-in interpretability features.
Abstract: Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality of multi-omics data, the heterogeneity across modalities, and the lack of reliable biological interaction networks make meaningful integration challenging. In addition, many existing models rely on handcrafted similarity graphs, are vulnerable to class imbalance, and often lack built-in interpretability, limiting their usefulness in biomedical applications. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) for omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. Across three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance. The model maintains computational efficiency through the use of sparse graphs and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight the potential of MOTGNN to improve both predictive accuracy and interpretability in multi-omics disease modeling.
[499] Semantic-Enhanced Time-Series Forecasting via Large Language Models
Hao Liu, Chun Yang, Zhang xiaoxing, Xiaobin Zhu
Main category: cs.LG
TL;DR: SE-LLM: A semantic-enhanced large language model for time series forecasting that bridges modality gaps between linguistic knowledge and time series patterns by embedding periodicity and anomaly characteristics into semantic space.
Details
Motivation: Existing LLM approaches for time series forecasting focus on token-level modal alignment but fail to bridge the intrinsic modality gap between linguistic knowledge structures and time series data patterns, limiting semantic representation. Additionally, Transformer-based LLMs are good at long-range dependencies but weak at modeling short-term anomalies in time series.
Method: Proposes SE-LLM, which embeds the inherent periodicity and anomalous characteristics of time series into the semantic space, enhancing token embeddings for LLMs. Also introduces a plugin module embedded within self-attention that models both long-term and short-term dependencies. The approach freezes the LLM and reduces token sequence dimensionality to cut computational cost. A period-extraction sketch follows the abstract.
Result: Experiments demonstrate superior performance of SE-LLM against state-of-the-art methods in time series forecasting.
Conclusion: SE-LLM effectively bridges the modality gap between linguistic knowledge and time series patterns, enhances interpretability of tokens for LLMs, and activates LLMs’ potential for temporal sequence analysis while being computationally efficient.
Abstract: Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that embeds the inherent periodicity and anomalous characteristics of time series into the semantic space to enhance the token embeddings. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against state-of-the-art (SOTA) methods.
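A minimal illustration of extracting the "inherent periodicity" the method feeds into the semantic space: take the strongest bins of the FFT amplitude spectrum. SE-LLM's actual featurization (and its anomaly characterization) is richer; function names are ours:

```python
import numpy as np

def dominant_periods(x, k=3):
    """Return the k strongest periodicities of a series, via the FFT
    amplitude spectrum, as periods measured in time steps."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                                # drop the DC component
    top = np.argsort(amp)[-k:][::-1]            # strongest frequency bins
    return [len(x) / f for f in top if f > 0]

t = np.arange(672)  # four weeks of hourly readings
x = (np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 168)
     + 0.1 * np.random.default_rng(0).normal(size=t.size))
print(dominant_periods(x))  # ~[24.0, 168.0, ...]: daily and weekly cycles
```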
[500] Energy Injection Identification enabled Disaggregation with Deep Multi-Task Learning
Xudong Wang, Guoming Tang, Junyu Xue, Srinivasan Keshav, Tongxin Li, Chris Ding
Main category: cs.LG
TL;DR: DualNILM: A Transformer-based multi-task learning framework for appliance state recognition and injected energy identification in smart homes with behind-the-meter renewable energy sources.
Details
Motivation: The increasing adoption of behind-the-meter renewable energy sources like solar panels and batteries creates challenges for conventional Non-Intrusive Load Monitoring methods, as injected energy from these sources obscures appliance power signatures and reduces NILM performance.
Method: DualNILM uses a Transformer-based multi-task architecture that integrates sequence-to-point and sequence-to-sequence strategies to capture multiscale temporal dependencies in aggregate power consumption patterns for the dual tasks of appliance state recognition and injected energy identification.
Result: Extensive evaluation on self-collected and synthesized datasets shows DualNILM maintains excellent performance for both tasks, significantly outperforming conventional methods, and demonstrates robustness for energy disaggregation in modern energy systems with renewable penetration.
Conclusion: The framework shows strong potential for robust energy disaggregation in modern energy systems with renewable energy sources, and the authors open-sourced synthetic photovoltaic augmented datasets with realistic injection simulation methodology.
Abstract: Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter (BTM) energy sources such as solar panels and battery storage poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The energy injected from the BTM sources can obscure the power signatures of individual appliances, leading to a significant decrease in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification. Using a Transformer-based architecture that integrates sequence-to-point and sequence-to-sequence strategies, DualNILM effectively captures multiscale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. Extensive evaluation on self-collected and synthesized datasets demonstrates that DualNILM maintains excellent performance on both NILM tasks, substantially outperforming conventional methods. Our work underscores the framework's potential for robust energy disaggregation in modern energy systems with renewable penetration. Synthetic photovoltaic augmented datasets with realistic injection simulation methodology are open-sourced at https://github.com/MathAdventurer/PV-Augmented-NILM-Datasets.
[501] Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics
Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu
Main category: cs.LG
TL;DR: STAR-MD is a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales using joint spatio-temporal attention in a causal diffusion transformer architecture.
Details
Motivation: Molecular dynamics simulations are computationally expensive for studying protein dynamics at biologically relevant timescales. Existing generative models struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics.
Method: STAR-MD uses a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding memory bottlenecks. The model is SE(3)-equivariant and designed for scalable trajectory generation.
Result: On the ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. It successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically.
Conclusion: STAR-MD’s joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, addressing severe limitations in current models for long-horizon generation and paving the way for accelerated exploration of protein function.
Abstract: Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD's joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.
[502] A Law of Data Reconstruction for Random Features (and Beyond)
Leonardo Iurada, Simone Bombari, Tatiana Tommasi, Marco Mondelli
Main category: cs.LG
TL;DR: Deep learning models can reconstruct entire training datasets when number of parameters exceeds dimensionality times number of samples (p > dn), revealing a law of data reconstruction.
Details
Motivation: To understand memorization in deep learning from a data reconstruction perspective, moving beyond the traditional label-fitting/interpolation view to examine when models can actually reconstruct training data.
Method: Theoretical analysis of the random features model showing that, when p > dn, the subspace spanned by the training samples in feature space contains enough information to identify individual samples in input space. Proposes an optimization method to reconstruct the dataset from model parameters.
Result: Demonstrates successful data reconstruction on various architectures (random features, two-layer fully-connected, deep residual networks) when p exceeds threshold dn. Reveals a “law of data reconstruction” where entire training dataset can be recovered.
Conclusion: Deep learning models can memorize training data through reconstruction when parameters exceed dimensionality-times-samples threshold, providing new perspective on memorization beyond interpolation/label fitting.
Abstract: Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters $p$ in the model is larger than the number of training samples $n$. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when $p$ is larger than $dn$, where $d$ is the dimensionality of the data. More specifically, we show that, in the random features model, when $p \gg dn$, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as $p$ exceeds the threshold $dn$.
[503] Lipschitz Bandits with Stochastic Delayed Feedback
Zhongxuan Liu, Yue Kang, Thomas C. M. Lee
Main category: cs.LG
TL;DR: Lipschitz bandit problem with stochastic delayed feedback, where rewards are observed after random delays. Algorithms proposed for both bounded and unbounded delay settings with sublinear regret guarantees.
Details
Motivation: Extend Lipschitz bandit problems to more realistic scenarios where feedback is not immediate but delayed, as occurs in real-world applications like online advertising, recommendation systems, and clinical trials where outcomes are observed only after some time.
Method: For bounded delays: propose a delay-aware zooming algorithm that adapts to the maximum delay τ_max. For unbounded delays: develop a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals. A toy delayed-feedback bandit follows the abstract.
Result: Achieve sublinear regret guarantees for both settings. For bounded delays: optimal performance of delay-free setting plus additional term scaling with maximal delay. For unbounded delays: nearly optimal up to logarithmic factors with established regret lower bound.
Conclusion: Successfully extend Lipschitz bandit framework to handle stochastic delayed feedback with provable guarantees, providing practical algorithms for real-world applications where immediate feedback is unavailable.
Abstract: The Lipschitz bandit problem extends stochastic bandits to a continuous action set defined over a metric space, where the expected reward function satisfies a Lipschitz condition. In this work, we introduce a new Lipschitz bandit problem in the presence of stochastic delayed feedback, where the rewards are not observed immediately but after a random delay. We consider both bounded and unbounded stochastic delays, and design algorithms that attain sublinear regret guarantees in each setting. For bounded delays, we propose a delay-aware zooming algorithm that retains the optimal performance of the delay-free setting up to an additional term that scales with the maximal delay $\tau_{\max}$. For unbounded delays, we propose a novel phased learning strategy that accumulates reliable feedback over carefully scheduled intervals, and establish a regret lower bound showing that our method is nearly optimal up to logarithmic factors. Finally, we present experimental results to demonstrate the efficiency of our algorithms under various delay scenarios.
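The bookkeeping that delayed feedback forces on a bandit is easy to see in a toy loop: pulls go out immediately, but rewards sit in a queue until their random delay elapses, and the index is computed from observed feedback only. This sketch uses plain UCB on a fixed discretization of [0, 1]; the paper's zooming algorithm instead refines the action space adaptively, and its phased strategy handles unbounded delays:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
arms = np.linspace(0, 1, 21)               # fixed discretization of the action set
mean_reward = lambda a: 1 - abs(a - 0.7)   # a 1-Lipschitz reward, peak at 0.7

counts = np.zeros(len(arms))
sums = np.zeros(len(arms))
pending = []                               # min-heap of (arrival_time, arm, reward)

for t in range(1, 2001):
    while pending and pending[0][0] <= t:  # ingest feedback whose delay elapsed
        _, i, r = heapq.heappop(pending)
        counts[i] += 1
        sums[i] += r
    safe = np.maximum(counts, 1)
    ucb = np.where(counts > 0,
                   sums / safe + np.sqrt(2 * np.log(t) / safe),
                   np.inf)                 # unobserved arms stay maximally optimistic
    i = int(np.argmax(ucb))
    r = mean_reward(arms[i]) + rng.normal(0, 0.1)
    delay = int(rng.geometric(0.1))        # stochastic delay, mean 10 rounds
    heapq.heappush(pending, (t + delay, i, r))

best = np.where(counts > 0, sums / np.maximum(counts, 1), -np.inf)
print(arms[int(np.argmax(best))])          # ~0.7
```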
[504] Learning to Explore with Parameter-Space Noise: A Deep Dive into Parameter-Space Noise for Reinforcement Learning with Verifiable Rewards
Bizhe Bai, Xinyue Wang, Peng Ye, Tao Chen
Main category: cs.LG
TL;DR: PSN-RLVR introduces parameter-space noise for better exploration in RL-based LLM reasoning, addressing exploration ceilings in existing methods through temporally consistent trajectory-level exploration with efficient adaptive noise control.
Details
Motivation: Existing RL with Verifiable Rewards (RLVR) methods for LLM reasoning hit an exploration ceiling: they mainly reweight existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets like pass-at-256.
Method: PSN-RLVR perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that preserves chain-of-thought coherence. It uses truncated importance sampling to mitigate the sampling-update mismatch and proposes a computationally efficient real-time adaptive noise scheduler driven by semantic diversity and normalized self-certainty. A minimal perturbation sketch follows the abstract.
Result: PSN-GRPO (instantiated on GRPO) consistently expands effective reasoning capability boundaries across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods.
Conclusion: Parameter-space noise exploration addresses fundamental exploration limitations in RLVR for LLM reasoning, providing orthogonal improvements that can be composed with other methods for additional gains.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves LLM reasoning, yet growing evidence indicates an exploration ceiling: it often reweights existing solution traces rather than discovering new strategies, limiting gains under large sampling budgets (e.g., pass-at-256). We address this limitation with PSN-RLVR, which perturbs policy parameters before rollout generation to induce temporally consistent, trajectory-level exploration that better preserves long-horizon chain-of-thought coherence than action-space noise. To mitigate the resulting sampling-update mismatch, we incorporate truncated importance sampling (TIS). To avoid expensive KL-based adaptive noise control, we propose a computationally efficient real-time adaptive noise scheduler driven by a lightweight surrogate that combines semantic diversity with normalized self-certainty. Instantiated on GRPO, a widely used RLVR method, PSN-GRPO consistently expands the effective reasoning capability boundary across multiple mathematical reasoning benchmarks and model families, yielding higher pass-at-k under large sampling budgets and outperforming prior exploration-oriented RLVR methods (e.g., Pass-at-k-style training) while remaining orthogonal and thus composable for additional gains.
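The mechanism itself is a few lines of PyTorch: draw one Gaussian perturbation of the policy weights per rollout (so exploration is consistent across the whole trajectory) and reweight with truncated importance ratios when updating the clean policy. A toy sketch, not the PSN-RLVR algorithm; the adaptive scheduler would set sigma:

```python
import copy
import torch
import torch.nn as nn

def perturbed_rollout_policy(policy: nn.Module, sigma: float) -> nn.Module:
    """Copy the policy and add Gaussian parameter-space noise once.

    A single draw per rollout keeps exploration temporally consistent over
    the full generated trajectory, unlike per-token action-space noise.
    """
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

policy = nn.Sequential(nn.Linear(16, 64), nn.Tanh(), nn.Linear(64, 4))
sampler = perturbed_rollout_policy(policy, sigma=0.01)

obs = torch.randn(2, 16)
logits_clean = policy(obs)    # the policy actually being updated
logits_noisy = sampler(obs)   # the policy that generated the rollouts

# Truncated importance weights pi_clean(a) / pi_noisy(a) correct the
# sampling-update mismatch (toy version over greedy actions).
a = logits_noisy.argmax(-1, keepdim=True)
logp_clean = torch.log_softmax(logits_clean, -1).gather(-1, a)
logp_noisy = torch.log_softmax(logits_noisy, -1).gather(-1, a)
w = torch.clamp((logp_clean - logp_noisy).exp(), max=2.0)
print(w.squeeze(-1))
```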
[505] Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Main category: cs.LG
TL;DR: GDPO introduces a new RL algorithm for diffusion language models using semi-deterministic Monte Carlo to reduce variance in ELBO estimation, outperforming previous methods on reasoning tasks.
Details
Motivation: Diffusion language models offer parallel generation advantages over autoregressive LLMs, but RL fine-tuning for DLMs is challenging due to intractable likelihoods. Existing methods like diffu-GRPO use biased token-level approximations, while principled ELBO-based approaches are computationally prohibitive.
Method: GDPO (Group Diffusion Policy Optimization) uses semi-deterministic Monte Carlo schemes to reduce variance in ELBO estimation by employing fast, deterministic integral approximations along key dimensions, avoiding the variance explosion of vanilla double Monte Carlo sampling.
Result: GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO on the majority of math, reasoning, and coding benchmarks, demonstrating superior performance for RL fine-tuning of diffusion language models.
Conclusion: GDPO provides an effective RL algorithm for diffusion language models by addressing the variance issues in ELBO estimation, enabling better fine-tuning for reasoning and coding tasks compared to existing methods.
Abstract: Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
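The variance-reduction idea (replace Monte Carlo along one dimension of a nested expectation with a deterministic quadrature) can be shown on a toy integrand. This is a generic illustration under an assumed toy function `f`, not GDPO's ELBO estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

def f(t, mask):                       # toy stand-in for a per-timestep loss
    return (1.0 - t) * mask.mean() + t * mask.var()

def double_mc(n):                     # sample both t and the mask
    vals = []
    for _ in range(n):
        t = rng.uniform()
        mask = rng.uniform(size=d) < t
        vals.append(f(t, mask))
    return np.mean(vals)

def semi_deterministic(n_t, n_mask):  # deterministic grid over t, MC over mask
    ts = (np.arange(n_t) + 0.5) / n_t     # midpoint rule on [0, 1]
    vals = [f(t, rng.uniform(size=d) < t) for t in ts for _ in range(n_mask)]
    return np.mean(vals)

ests_a = [double_mc(16) for _ in range(200)]
ests_b = [semi_deterministic(4, 4) for _ in range(200)]  # same 16-eval budget
print(f"double MC variance:        {np.var(ests_a):.2e}")
print(f"semi-deterministic variance: {np.var(ests_b):.2e}")
```

Stratifying the timestep dimension removes the between-t component of the estimator's variance at the same evaluation budget, which is the effect the paper exploits along its "pivotal dimensions".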
[506] Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding
Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Sun, Boyu Wang, Pingzhao Hu
Main category: cs.LG
TL;DR: EDT-Former is a novel transformer that generates dynamic tokens aligned with informative molecular patches for better molecular graph understanding, enabling efficient alignment between frozen graph encoders and LLMs without LLM backbone tuning.
Details
Motivation: Current LLMs struggle with molecular graph understanding, and existing graph-LLM bridges use fixed-length static tokens designed for vision tasks, which overlook stereochemistry and substructural context while requiring costly LLM fine-tuning.
Method: EDT-Former (Entropy-guided Dynamic Token Transformer) generates tokens aligned with informative molecular patches, preserving both local and global structural features. It enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding embedding layer).
Result: Achieves state-of-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), demonstrating effectiveness for scalable and generalizable multimodal molecular understanding.
Conclusion: EDT-Former provides an efficient and effective approach for molecular graph understanding that preserves structural features while enabling computationally efficient fine-tuning with frozen LLM backbones.
Abstract: Molecular understanding is central to advancing areas such as scientific discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph-LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which was originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient fine-tuning, and achieves state-of-the-art results on MoleculeQA, Molecule-oriented Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
[507] LLM Priors for ERM over Programs
Shivam Singhal, Priyadarsi Mishra, Eran Malach, Tomer Galanti
Main category: cs.LG
TL;DR: LLM-PV uses pretrained LLMs to propose candidate programs for ERM-style selection without exhaustive enumeration, achieving better generalization than gradient-based methods on algorithmic tasks.
Details
Motivation: Classical ERM over program classes requires exponential enumeration, while gradient-based methods can need exponentially many samples. There's a need for methods that are efficient in both samples and computation for learning short programs.
Method: LLM-PV uses a propose-and-verify approach: pretrained LLMs generate candidate programs, which are executed and scored on a validation set, with the best program selected. No gradient updates or validation feedback adaptation is used.
Result: LLM-PV recovers exact underlying rules from small labeled sets and generalizes beyond training sequence lengths on algorithmic tasks (parity variants, pattern matching, primality testing). SGD-trained transformers and other baselines fit training data but fail to generalize reliably.
Conclusion: Pretrained LLM priors can serve as effective search biases for ERM, narrowing the gap between statistical and computational efficiency for program learning.
Abstract: We study program-learning methods that are efficient in both samples and computation. Classical learning theory suggests that when the target admits a short program description (for example, a short piece of ``Python code''), it can be learned from relatively few examples by performing ERM over the program class. However, this approach relies on enumerating candidate programs, which is typically exponential in the description length. In contrast, gradient-based training avoids explicit search, but for some families of short programs it can require exponentially many samples to succeed. We propose \textsc{LLM-PV}, a propose-and-verify recipe that enables ERM-style selection over a discrete program class without exhaustive enumeration. A pretrained LLM induces a proposal distribution over candidate programs; each proposal is executed, scored on a held-out validation set, and the best program is selected. The method uses no gradient updates and does not use validation feedback to adapt the sampling distribution. Across algorithmic tasks including parity variants, pattern matching, and primality testing, \textsc{LLM-PV} often recovers the exact underlying rule from a small labeled set and generalizes far beyond the training sequence lengths. In the same regimes, SGD-trained transformers and standard adaptation baselines (fine-tuning and in-context learning), as well as classical ML baselines, can fit the training data yet fail to generalize reliably. Together, these results suggest that pretrained LLM priors can serve as effective search biases for ERM, narrowing the gap between statistical and computational efficiency. The code is available at [\href{https://github.com/DLFundamentals/LLM_PV}{code}].
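The propose-and-verify loop is simple enough to sketch end to end. Here the "proposer" is a hard-coded list standing in for LLM samples (the real method draws proposals from a pretrained LLM), and the task is a toy parity problem; like the paper's recipe, selection uses no gradients and never adapts the proposal distribution from validation feedback.

```python
val_set = [([1, 0, 1, 1], 1), ([1, 1], 0), ([0, 0], 0), ([1, 1, 1], 1)]  # parity

proposals = [                      # hypothetical LLM-sampled program texts
    "def h(xs): return sum(xs) % 2",
    "def h(xs): return int(any(xs))",
    "def h(xs): return len(xs) % 2",
]

def score(src):
    env = {}
    try:
        exec(src, env)                      # compile the proposed program
        h = env["h"]
        return sum(h(x) == y for x, y in val_set)
    except Exception:
        return -1                           # non-executing proposals score worst

best = max(proposals, key=score)            # ERM-style selection, no gradients
print(best, "->", score(best), "/", len(val_set))
```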
[508] Live or Lie: Action-Aware Capsule Multiple Instance Learning for Risk Assessment in Live Streaming Platforms
Yiran Qiao, Jing Chen, Xiang Ao, Qiwei Zhong, Yang Liu, Qing He
Main category: cs.LG
TL;DR: AC-MIL is a novel framework for risk assessment in live streaming rooms using Multiple Instance Learning with action-aware capsules to detect coordinated malicious behaviors.
Details
Motivation: Live streaming faces severe risks from sparse, coordinated malicious behaviors among multiple participants that are hard to detect in a timely and accurate manner, especially with only room-level labels available (weak supervision).
Method: Formulates the task as a Multiple Instance Learning problem where each room is a bag and structured user-timeslot capsules are instances. Proposes AC-MIL framework that models individual behaviors and group-level coordination patterns through serial and parallel architecture encoding temporal dynamics and cross-user dependencies.
Result: Extensive experiments on large-scale industrial datasets from Douyin show AC-MIL significantly outperforms MIL and sequential baselines, establishing new state-of-the-art performance in room-level risk assessment.
Conclusion: AC-MIL provides effective risk assessment for live streaming with room-level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention.
Abstract: Live streaming has become a cornerstone of today’s internet, enabling massive real-time social interactions. However, it faces severe risks arising from sparse, coordinated malicious behaviors among multiple participants, which are often concealed within normal activities and challenging to detect in a timely and accurate manner. In this work, we provide a pioneering study on risk assessment in live streaming rooms, characterized by weak supervision where only room-level labels are available. We formulate the task as a Multiple Instance Learning (MIL) problem, treating each room as a bag and defining structured user-timeslot capsules as instances. These capsules represent subsequences of user actions within specific time windows, encapsulating localized behavioral patterns. Based on this formulation, we propose AC-MIL, an Action-aware Capsule MIL framework that models both individual behaviors and group-level coordination patterns. AC-MIL captures multi-granular semantics and behavioral cues through a serial and parallel architecture that jointly encodes temporal dynamics and cross-user dependencies. These signals are integrated for robust room-level risk prediction, while also offering interpretable evidence at the behavior segment level. Extensive experiments on large-scale industrial datasets from Douyin demonstrate that AC-MIL significantly outperforms MIL and sequential baselines, establishing new state-of-the-art performance in room-level risk assessment for live streaming. Moreover, AC-MIL provides capsule-level interpretability, enabling identification of risky behavior segments as actionable evidence for intervention. The project page is available at: https://qiaoyran.github.io/AC-MIL/.
[509] Provably Optimal Reinforcement Learning under Safety Filtering
Donggeon David Oh, Duy P. Nguyen, Haimin Hu, Jaime F. Fisac
Main category: cs.LG
TL;DR: Safety filters in RL don’t inherently degrade performance: with sufficiently permissive filters, asymptotic performance is preserved while ensuring categorical safety.
Details
Motivation: RL lacks formal safety guarantees for safety-critical applications, and safety filters are often seen as sacrificing performance. The paper aims to show this tradeoff is not inherent.
Method: Formalizes RL safety with Safety-Critical MDPs (SC-MDPs) requiring categorical avoidance of failure states. Defines filtered MDPs where safety filters are part of the environment. Proves theoretical guarantees about safety, convergence, and performance.
Result: Theoretical proof that learning in filtered MDPs is safe categorically, standard RL convergence carries over, and optimal filtered policies achieve same asymptotic return as best safe policies in SC-MDPs. Validated on Safety Gymnasium with zero violations and matching/exceeding unfiltered baselines.
Conclusion: Safety-performance tradeoff is not inherent; train and deploy RL policies with the most permissive safety filter available for principled safe RL.
Abstract: Recent advances in reinforcement learning (RL) enable its use on increasingly complex tasks, but the lack of formal safety guarantees still limits its application in safety-critical settings. A common practical approach is to augment the RL policy with a safety filter that overrides unsafe actions to prevent failures during both training and deployment. However, safety filtering is often perceived as sacrificing performance and hindering the learning process. We show that this perceived safety-performance tradeoff is not inherent and prove, for the first time, that enforcing safety with a sufficiently permissive safety filter does not degrade asymptotic performance. We formalize RL safety with a safety-critical Markov decision process (SC-MDP), which requires categorical, rather than high-probability, avoidance of catastrophic failure states. Additionally, we define an associated filtered MDP in which all actions result in safe effects, thanks to a safety filter that is considered to be a part of the environment. Our main theorem establishes that (i) learning in the filtered MDP is safe categorically, (ii) standard RL convergence carries over to the filtered MDP, and (iii) any policy that is optimal in the filtered MDP, when executed through the same filter, achieves the same asymptotic return as the best safe policy in the SC-MDP, yielding a complete separation between safety enforcement and performance optimization. We validate the theory on Safety Gymnasium with representative tasks and constraints, observing zero violations during training and final performance matching or exceeding unfiltered baselines. Together, these results shed light on a long-standing question in safety-filtered learning and provide a simple, principled recipe for safe RL: train and deploy RL policies with the most permissive safety filter that is available.
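The "filter as part of the environment" construction maps directly onto a standard environment wrapper: any off-the-shelf RL algorithm then trains on the wrapped environment unchanged. In this sketch, `is_unsafe` and `safe_fallback` are hypothetical filter components (e.g., from a reachability-based filter), not artifacts from the paper.

```python
import gymnasium as gym

class SafetyFilteredEnv(gym.Wrapper):
    """Filtered-MDP sketch: every action the agent sends has a safe effect."""

    def __init__(self, env, is_unsafe, safe_fallback):
        super().__init__(env)
        self.is_unsafe = is_unsafe          # (state, action) -> bool
        self.safe_fallback = safe_fallback  # state -> guaranteed-safe action
        self._state = None

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._state = obs
        return obs, info

    def step(self, action):
        # Override only actions the filter flags; the paper's recipe suggests
        # using the most permissive filter available, i.e., overriding rarely.
        if self.is_unsafe(self._state, action):
            action = self.safe_fallback(self._state)
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._state = obs
        return obs, reward, terminated, truncated, info
```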
[510] Position: Many generalization measures for deep learning are fragile
Shuofeng Zhang, Ard Louis
Main category: cs.LG
TL;DR: Many post-mortem generalization measures are fragile: small training modifications can substantially change their values, trends, or scaling behavior, even when network performance remains stable.
Details
Motivation: To demonstrate that many generalization measures computed on trained networks (post-mortem measures) are fragile and unreliable, as small training modifications can dramatically alter their values without affecting actual network performance.
Method: Position paper analyzing fragility of post-mortem generalization measures through empirical observations and theoretical arguments, examining how minor hyperparameter changes affect measures like path norm and PAC-Bayes bounds.
Result: Found that many post-mortem measures are fragile: minor changes (learning rate adjustments, SGD variants) can reverse learning curve slopes; PAC-Bayes origin measure fails to capture data complexity differences; function-based marginal-likelihood PAC-Bayes bound captures data complexity but isn’t post-mortem.
Conclusion: Post-mortem generalization bounds are often fragile, and developers of new measures should explicitly audit them for fragility to ensure reliability and meaningful interpretation.
Abstract: In this position paper, we argue that many post-mortem generalization measures (those computed on trained networks) are \textbf{fragile}: small training modifications that barely affect the performance of the underlying deep neural network can substantially change a measure’s value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning rate adjustments or switching between SGD variants, can reverse the slope of a learning curve in widely used generalization measures such as the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data-complexity, including scaling behavior, in learning curves, but which is not a post-mortem measure. Beyond demonstrating that many post-mortem bounds are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
[511] Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Konstantin Hess, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
Main category: cs.LG
TL;DR: A novel overlap-weighted orthogonal meta-learner for estimating heterogeneous treatment effects in time-varying settings that addresses severe overlap problems when treatment sequences have low observational probability.
Details
Motivation: Estimating heterogeneous treatment effects in time-varying settings is challenging due to exponentially decreasing probability of observing certain treatment sequences, creating severe overlap problems where existing meta-learners suffer from exploding estimation variance when overlap is low.
Method: Introduces an overlap-weighted orthogonal (WO) meta-learner that targets regions in observed data with high probability of receiving interventional treatment sequences. Develops a Neyman-orthogonal population risk function minimizing overlap-weighted oracle risk, making it robust against nuisance function misspecification and model-agnostic.
Result: The WO-learner demonstrates benefits through extensive experiments with transformer and LSTM backbones, showing it can counteract instabilities in existing meta-learners and obtain more reliable HTE estimates.
Conclusion: The proposed overlap-weighted orthogonal meta-learner provides a fully data-driven approach for reliable heterogeneous treatment effect estimation in time-varying settings with low treatment overlap, offering robustness against model misspecification.
Abstract: Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
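The weighting idea can be seen in a toy static analogue (the paper's setting is time-varying and uses a Neyman-orthogonal risk, which this sketch does not implement): IPW pseudo-outcomes explode where the estimated propensity is extreme, while overlap weights w = e(x)(1 - e(x)) concentrate the second-stage regression where both treatments are actually observed. All modeling choices below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))
e = 1 / (1 + np.exp(-3 * x[:, 0]))        # propensity: overlap shrinks in tails
a = rng.uniform(size=n) < e
tau = 1.0 + x[:, 0]                        # true heterogeneous effect
y = x[:, 0] + a * tau + rng.normal(size=n)

e_hat = LogisticRegression().fit(x, a).predict_proba(x)[:, 1]

# IPW pseudo-outcomes are unbiased for tau(x) but blow up where e_hat is near
# 0 or 1; overlap weights downweight exactly those low-support regions.
pseudo = (a / e_hat - (1 - a) / (1 - e_hat)) * y
w = e_hat * (1 - e_hat)

plain = LinearRegression().fit(x, pseudo)
weighted = LinearRegression().fit(x, pseudo, sample_weight=w)
print("unweighted tau(0):      ", plain.predict([[0.0]])[0])
print("overlap-weighted tau(0):", weighted.predict([[0.0]])[0])  # true value: 1
```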
[512] SeeDNorm: Self-Rescaled Dynamic Normalization
Wenrui Cai, Defa Zhu, Qingjie Liu, Qiyang Min
Main category: cs.LG
TL;DR: SeeDNorm is a dynamic normalization layer that enhances transformers by preserving input norm information and using data-dependent scaling coefficients, improving performance over RMSNorm and LayerNorm with minimal parameter overhead.
Details
Motivation: RMSNorm discards input norm information in forward passes and uses static scaling factors that may not accommodate input variability and distributional shifts, limiting performance improvements especially in zero-shot scenarios common to large language models.
Method: SeeDNorm dynamically adjusts scaling coefficients based on current input, preserving input norm information and enabling data-dependent, self-rescaled dynamic normalization while maintaining RMSNorm’s ability to dynamically adjust gradients according to input norm during backpropagation.
Result: SeeDNorm achieves consistently superior performance compared to RMSNorm, LayerNorm, and DyT across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks, with minimal parameters and negligible efficiency impact.
Conclusion: SeeDNorm effectively addresses limitations of existing normalization layers by preserving input norm information and enabling dynamic scaling, offering improved representational capability for transformer models across diverse tasks.
Abstract: Normalization layers constitute an essential component of neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust gradients according to the input norm. We provide a detailed analysis of the training optimization of SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
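A minimal PyTorch sketch of the general idea: an RMSNorm whose per-dimension scale is modulated by a data-dependent term so the input norm is not fully discarded. The exact parameterization below (a tanh gate driven by the pre-normalization RMS, initialized so the module starts as plain RMSNorm) is an assumption for illustration, not the paper's formula.

```python
import torch
import torch.nn as nn

class DynamicRMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # static scale, as in RMSNorm
        self.alpha = nn.Parameter(torch.zeros(dim))  # strength of dynamic term
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        x_hat = x / (rms + self.eps)                 # project to unit hypersphere
        # Data-dependent rescaling: with alpha = 0 this is exactly RMSNorm;
        # the learned gate lets the scale react to the current input's norm.
        scale = self.gamma * (1.0 + torch.tanh(self.alpha * rms))
        return scale * x_hat

x = torch.randn(2, 8, 64)
print(DynamicRMSNorm(64)(x).shape)  # torch.Size([2, 8, 64])
```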
[513] Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics
Tai Hoang, Alessandro Trenta, Alessio Gravina, Niklas Freymuth, Philipp Becker, Davide Bacciu, Gerhard Neumann
Main category: cs.LG
TL;DR: IGNS is a graph-based neural simulator that uses Hamiltonian dynamics principles to preserve information and handle complex physical systems with better long-range interaction modeling and reduced error accumulation.
Details
Motivation: Traditional numerical solvers are computationally expensive for high-fidelity solutions, and existing Graph Neural Simulators (GNSs) struggle with long-range interactions and error accumulation during autoregressive rollouts.
Method: Proposes Information-preserving Graph Neural Simulators (IGNS) based on Hamiltonian dynamics principles, extending to port-Hamiltonian systems to capture non-conservative effects. Includes warmup phase for global context initialization, geometric encoding for irregular meshes, and multi-step training objective for PDE matching.
Result: IGNS consistently outperforms state-of-the-art GNSs across all tasks, achieving higher accuracy and stability under challenging and complex dynamical systems, particularly in handling long-range dependencies and external forcing scenarios.
Conclusion: IGNS provides an effective framework for learning to simulate complex physical systems by preserving information across graphs and handling a broader class of dynamics through port-Hamiltonian extensions, with demonstrated superior performance over existing methods.
Abstract: Learning to simulate complex physical systems from data has emerged as a promising way to overcome the limitations of traditional numerical solvers, which often require prohibitive computational costs for high-fidelity solutions. Recent Graph Neural Simulators (GNSs) accelerate simulations by learning dynamics on graph-structured data, yet often struggle to capture long-range interactions and suffer from error accumulation under autoregressive rollouts. To address these challenges, we propose Information-preserving Graph Neural Simulators (IGNS), a graph-based neural simulator built on the principles of Hamiltonian dynamics. This structure guarantees preservation of information across the graph, while extending to port-Hamiltonian systems allows the model to capture a broader class of dynamics, including non-conservative effects. IGNS further incorporates a warmup phase to initialize global context, geometric encoding to handle irregular meshes, and a multi-step training objective that facilitates PDE matching, where the trajectory produced by integrating the port-Hamiltonian core aligns with the ground-truth trajectory, thereby reducing rollout error. To evaluate these properties systematically, we introduce new benchmarks that target long-range dependencies and challenging external forcing scenarios. Across all tasks, IGNS consistently outperforms state-of-the-art GNSs, achieving higher accuracy and stability under challenging and complex dynamical systems. Our project page: https://thobotics.github.io/neural_pde_matching.
[514] Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
Main category: cs.LG
TL;DR: VSD improves speculative decoding for MLLMs by training draft models using variational inference over multiple draft paths, optimizing for target-model acceptance rather than single greedy trajectories.
Details
Motivation: Existing speculative decoding methods suffer from a training-decoding discrepancy: they optimize for single greedy trajectories during training, but actual decoding involves verifying and ranking multiple sampled draft paths. This mismatch limits decoding efficiency.
Method: Proposes Variational Speculative Decoding (VSD) that formulates draft training as variational inference over latent proposals (draft paths). Uses an ELBO to maximize marginal probability of target-model acceptance, incorporates path-level utility, and optimizes via Expectation-Maximization with MCMC sampling from oracle-filtered posterior and weighted likelihood maximization using Adaptive Rejection Weighting and Confidence-Aware Regularization.
Result: VSD achieves up to 9.6% speedup over EAGLE-3 and 7.9% over ViSpec across LLMs and MLLMs, significantly improving decoding efficiency with theoretical guarantees of increased expected acceptance length and speedup.
Conclusion: VSD effectively addresses the training-decoding discrepancy in speculative decoding by optimizing draft models for the actual multi-path verification process, leading to substantial efficiency gains for both LLMs and MLLMs.
Abstract: Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
[515] Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces
Bryon Tjanaka, Henry Chen, Matthew C. Fontaine, Stefanos Nikolaidis
Main category: cs.LG
TL;DR: DMS (Discount Model Search) is a new QD algorithm that uses a continuous model of discount values to handle high-dimensional measure spaces, outperforming existing methods like CMA-MAE.
Details
Motivation: Current QD algorithms struggle with high-dimensional measure spaces due to distortion issues where many solutions map to similar measures, causing stagnation in exploration. Existing methods like CMA-MAE use histograms that fail to distinguish between similar solutions in high dimensions.
Method: Proposes Discount Model Search (DMS) which uses a smooth, continuous model to represent discount values instead of discrete histograms. This allows DMS to distinguish between solutions with similar measures in high-dimensional spaces and continue effective exploration.
Result: DMS outperforms CMA-MAE and other black-box QD algorithms on high-dimensional benchmarks. It enables new capabilities like using image datasets as measure spaces, allowing users to specify desired measures through image examples rather than hand-designed functions.
Conclusion: DMS addresses fundamental limitations of existing QD algorithms in high-dimensional measure spaces by using continuous discount models, enabling effective exploration and new applications like image-based measure specification.
Abstract: Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms are typically limited to low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the state-of-the-art CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and hence receive the same discount value. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new capabilities for QD algorithms by introducing two new domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other existing black-box QD algorithms.
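The contrast with histogram-based discounts can be sketched with a continuous stand-in: a kernel smoother over previously seen measures replaces the histogram cell lookup, so nearby but distinct measures receive distinct discount values. The kernel model, objective, and constants below are illustrative, not the paper's model.

```python
import numpy as np

class KernelDiscountModel:
    """Continuous discount values: a smooth analogue of a CMA-MAE histogram."""

    def __init__(self, bandwidth=0.1, lr=0.5):
        self.pts, self.vals = [], []          # (measure, discount) memory
        self.h, self.lr = bandwidth, lr

    def discount(self, m):
        if not self.pts:
            return 0.0
        d2 = ((np.asarray(self.pts) - m) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * self.h ** 2))
        if w.sum() < 1e-12:
            return 0.0
        return float(w @ np.asarray(self.vals) / w.sum())

    def update(self, m, f):
        # Soft analogue of the histogram threshold rule t <- (1 - a) t + a f.
        d = self.discount(m)
        self.pts.append(np.asarray(m, dtype=float))
        self.vals.append((1 - self.lr) * d + self.lr * f)

model = KernelDiscountModel()
rng = np.random.default_rng(0)
for _ in range(100):
    m = rng.uniform(size=8)                   # high-dimensional measure
    f = float(-np.linalg.norm(m - 0.5))       # toy objective value
    improvement = f - model.discount(m)       # ranking signal for the optimizer
    model.update(m, f)
print("discount near center:", model.discount(np.full(8, 0.5)))
```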
[516] Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression
Paul Saegert, Ullrich Köthe
Main category: cs.LG
TL;DR: SimpliPy: A fast rule-based simplification engine for symbolic regression that achieves 100x speed-up over SymPy, enabling improved amortized SR with the Flash-ANSR framework.
Details
Motivation: Amortized symbolic regression struggles with scaling to realistic scientific complexity due to slow reduction of equivalent expressions to normalized forms using general-purpose Computer Algebra Systems like SymPy.
Method: Proposes SimpliPy, a rule-based simplification engine that achieves 100x speed-up over SymPy. Uses this in the Flash-ANSR framework for amortized symbolic regression with improved training efficiency and systematic decontamination.
Result: Flash-ANSR achieves better accuracy than amortized baselines (NeSymReS, E2E) on FastSRB benchmark and performs on par with state-of-the-art direct optimization (PySR) while recovering more concise expressions.
Conclusion: Fast simplification engines like SimpliPy enable substantial improvements in amortized symbolic regression, making it more scalable and efficient for scientific applications.
Abstract: Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this by general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise, rather than more complex, expressions as the inference budget increases.
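The flavor of a rule-based simplifier is easy to convey: apply local rewrite rules over an expression tree, children first, until a fixed point. The three rules and the tuple-based tree encoding below are toy choices for illustration; SimpliPy's actual rule set and normal form are much richer.

```python
def simplify(e):
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], simplify(e[1]), simplify(e[2])   # simplify children first
    if op == "+" and b == 0:
        return a                                      # x + 0 -> x
    if op == "*" and b == 1:
        return a                                      # x * 1 -> x
    if op == "*" and 0 in (a, b):
        return 0                                      # x * 0 -> 0
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return {"+": a + b, "*": a * b}[op]           # constant folding
    return (op, a, b)

# ((x * 1) + (2 * 0)) + (3 + 4) reduces to ('+', 'x', 7)
expr = ("+", ("+", ("*", "x", 1), ("*", 2, 0)), ("+", 3, 4))
print(simplify(expr))
```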
[517] Tractable Gaussian Phase Retrieval with Heavy Tails and Adversarial Corruption with Near-Linear Sample Complexity
Santanu Das, Jatin Batra
Main category: cs.LG
TL;DR: First polynomial-time algorithm for robust phase retrieval with heavy-tailed noise and adversarial corruptions, with near-linear sample complexity.
Details
Motivation: Phase retrieval has applications in optics, crystallography, astrophysics, etc., but existing algorithms lack robustness against measurement errors and adversarial corruptions. Recent breakthroughs in robust statistics haven't been applied to phase retrieval efficiently.
Method: Connects robust spectral initialization with recent advances in robust PCA to develop polynomial-time algorithms. Uses robust covariance estimation techniques to handle heavy-tailed noise and adversarial corruptions in both measurements and sensing vectors.
Result: Achieves the first polynomial-time algorithm for robust phase retrieval with heavy-tailed noise and adversarial corruptions, with near-linear sample complexity $O(n \log n)$. Improves upon the previous exponential-time algorithm.
Conclusion: Establishes efficient algorithmic framework for robust phase retrieval by bridging robust spectral initialization with robust PCA techniques, enabling practical applications in noisy environments.
Abstract: Phase retrieval is the classical problem of recovering a signal $x^* \in \mathbb{R}^n$ from its noisy phaseless measurements $y_i = \langle a_i, x^* \rangle^2 + \zeta_i$ (where $\zeta_i$ denotes noise, and $a_i$ is the sensing vector) for $i \in [m]$. The problem of phase retrieval has a rich history, with a variety of applications such as optics, crystallography, heteroscedastic regression, astrophysics, etc. A major consideration in algorithms for phase retrieval is robustness against measurement errors. In recent breakthroughs in algorithmic robust statistics, efficient algorithms have been developed for several parameter estimation tasks such as mean estimation, covariance estimation, robust principal component analysis (PCA), etc. in the presence of heavy-tailed noise and adversarial corruptions. In this paper, we study efficient algorithms for robust phase retrieval with heavy-tailed noise when a constant fraction of both the measurements $y_i$ and the sensing vectors $a_i$ may be arbitrarily adversarially corrupted. For this problem, Buna and Rebeschini (AISTATS 2025) very recently gave an exponential time algorithm with sample complexity $O(n \log n)$. Their algorithm needs a robust spectral initialization, specifically, a robust estimate of the top eigenvector of a covariance matrix, which they deemed to be beyond known efficient algorithmic techniques (similar spectral initializations are a key ingredient of a large family of phase retrieval algorithms). In this work, we make a connection between robust spectral initialization and recent algorithmic advances in robust PCA, yielding the first polynomial-time algorithms for robust phase retrieval with both heavy-tailed noise and adversarial corruptions, in fact with near-linear (in $n$) sample complexity.
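For context, the vanilla (non-robust) spectral initialization that the robust version replaces is a few lines: the top eigenvector of the measurement-weighted covariance $(1/m)\sum_i y_i a_i a_i^T$ correlates with $x^*$ under Gaussian sensing. The paper's contribution is making this step robust to heavy tails and corruption via robust PCA, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 2000
x_star = rng.normal(size=n)
x_star /= np.linalg.norm(x_star)

A = rng.normal(size=(m, n))                   # Gaussian sensing vectors
y = (A @ x_star) ** 2 + 0.1 * rng.normal(size=m)

M = (A * y[:, None]).T @ A / m                # (1/m) sum_i y_i a_i a_i^T
eigvals, eigvecs = np.linalg.eigh(M)
x0 = eigvecs[:, -1]                           # top eigenvector

print("|<x0, x*>| =", abs(x0 @ x_star))       # close to 1, up to sign
```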
[518] Analysis of Control Bellman Residual Minimization for Markov Decision Problem
Donghwan Lee, Hyukjun Yang
Main category: cs.LG
TL;DR: The paper establishes foundational results for Bellman residual minimization methods for policy optimization in control tasks, addressing a gap in existing research.
Details
Motivation: Bellman residual minimization has advantages over dynamic programming (more stable convergence with function approximation) but has received less attention for policy optimization compared to policy evaluation, creating a research gap.
Method: The paper establishes theoretical foundations for control Bellman residual minimization methods for policy optimization, likely developing mathematical frameworks and algorithms for this approach.
Result: The paper provides foundational results for Bellman residual minimization in control tasks, though specific empirical results aren’t mentioned in the abstract.
Conclusion: Bellman residual minimization deserves investigation for policy optimization due to its advantages, and this paper lays the groundwork for such methods in control tasks.
Abstract: Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for control Bellman residual minimization for policy optimization.
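A minimal sketch of the control objective being studied, on a toy deterministic chain MDP: directly minimize the squared Bellman optimality residual $L(\theta) = \mathbb{E}[(r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a))^2]$ by gradient descent, with gradients flowing through both Q-terms (no target network). Determinism sidesteps the double-sampling issue that stochastic transitions introduce; the environment and hyperparameters are made up, and this is not the paper's algorithm.

```python
import torch

S, A, gamma = 5, 2, 0.9
Q = torch.zeros(S, A, requires_grad=True)
opt = torch.optim.Adam([Q], lr=0.05)

def step(s, a):                       # deterministic chain: 0 = left, 1 = right
    s2 = max(0, s - 1) if a == 0 else min(S - 1, s + 1)
    return s2, (1.0 if s2 == S - 1 else 0.0)

transitions = [(s, a, *step(s, a)) for s in range(S) for a in range(A)]

for _ in range(2000):
    loss = torch.tensor(0.0)
    for s, a, s2, r in transitions:
        residual = r + gamma * Q[s2].max() - Q[s, a]   # gradient through both terms
        loss = loss + residual ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(Q.argmax(dim=1))  # greedy policy should move right in every state
```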
[519] Learning to Remember, Learn, and Forget in Attention-Based Models
Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci
Main category: cs.LG
TL;DR: Palimpsa is a self-attention model that treats In-Context Learning as a continual learning problem, using Bayesian metaplasticity to manage the stability-plasticity dilemma and expand memory capacity in attention models.
Details
Motivation: Current gated linear attention models have fixed memory capacity and suffer from interference in long sequences, limiting their In-Context Learning capabilities. The paper aims to address the stability-plasticity dilemma in attention mechanisms.
Method: Proposes Palimpsa, a self-attention model using Bayesian metaplasticity where each attention state’s plasticity is tied to an importance state grounded by a prior distribution. This framework shows that existing gated linear attention models (including Mamba2) are special cases of Palimpsa.
Result: Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall benchmark and Commonsense Reasoning tasks, demonstrating expanded memory capacity and better handling of long sequences.
Conclusion: Palimpsa provides a theoretical framework that unifies various attention models and enables transformation of non-metaplastic models into metaplastic ones, significantly improving memory capacity for In-Context Learning.
Abstract: In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
[520] Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations
Kadircan Aksoy, Protim Bhattacharjee, Peter Jung
Main category: cs.LG
TL;DR: Paper studies neural classifier training dynamics through binary hypothesis testing lens, showing well-generalizing networks align with Neyman-Pearson optimal decision rules via KL divergence improvements.
Details
Motivation: To understand supervised training dynamics of neural classifiers by framing classification as binary hypothesis testing between class-conditional distributions, aiming to explain generalization behavior through statistical decision theory.
Method: Models classification as a set of binary tests between class-conditional distributions of representations; empirically analyzes training trajectories to show alignment with Neyman-Pearson optimal decision rules via monotonic KL divergence improvements.
Result: Well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules during training, showing monotonic improvements in KL divergence that relate to error rate exponents, providing insights into generalization.
Conclusion: The binary hypothesis testing framework yields explanations for neural network generalization and suggests possible training/regularization strategies based on statistical decision theory principles.
Abstract: We study the supervised training dynamics of neural classifiers through the lens of binary hypothesis testing. We model classification as a set of binary tests between class-conditional distributions of representations and empirically show that, along training trajectories, well-generalizing networks increasingly align with Neyman-Pearson optimal decision rules via monotonic improvements in KL divergence that relate to error rate exponents. We finally discuss how this yields an explanation and possible training or regularization strategies for different classes of neural networks.
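The quantity this analysis tracks, KL divergence between class-conditional distributions of representations along training, can be estimated with a simple closed-form route. Fitting Gaussians to the features is a simplifying assumption made here, and the synthetic features below stand in for representations extracted at an early and a late checkpoint.

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) in closed form."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)
        + diff @ cov1_inv @ diff
        - d
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )

def class_pair_kl(feats_a, feats_b, ridge=1e-3):
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a.T) + ridge * np.eye(feats_a.shape[1])
    cov_b = np.cov(feats_b.T) + ridge * np.eye(feats_b.shape[1])
    return gaussian_kl(mu_a, cov_a, mu_b, cov_b)

# Per the paper's empirical claim, one would expect this divergence to grow
# monotonically along training for well-generalizing networks.
rng = np.random.default_rng(0)
early = class_pair_kl(rng.normal(0.0, 1, (500, 16)), rng.normal(0.2, 1, (500, 16)))
late = class_pair_kl(rng.normal(0.0, 1, (500, 16)), rng.normal(1.0, 1, (500, 16)))
print(f"early checkpoint KL: {early:.2f}   late checkpoint KL: {late:.2f}")
```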
[521] A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer
Azkaa Nasir, Fatima Dossa, Muhammad Ahmed Atif, Mohammad Shahid Shaikh
Main category: cs.LG
TL;DR: DDQN architecture shows robust transfer learning across environments while Dueling DQN exhibits negative transfer under identical conditions, suggesting architectural inductive bias affects cross-environment transfer robustness in deep RL.
Details
Motivation: To understand how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer learning behavior across environments, particularly examining robustness under domain shift.
Method: Controlled empirical study using CartPole as source task and LunarLander as target task with fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, comparing to baseline agents trained from scratch.
Result: DDQN consistently avoids negative transfer and maintains learning dynamics comparable to baseline performance, while Dueling DQN consistently exhibits negative transfer with degraded rewards and unstable optimization behavior, confirmed by statistical analysis across multiple random seeds.
Conclusion: Architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning, with DDQN showing more robust transfer behavior than Dueling DQN under the examined protocol.
Abstract: Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
[522] Transform-Augmented GRPO Improves Pass@k
Khiem Le, Youssef Mroueh, Phuc Nguyen, Chi-Heng Lin, Shangqian Gao, Ting Hua, Nitesh V. Chawla
Main category: cs.LG
TL;DR: TA-GRPO improves reasoning in LLMs by generating semantically equivalent variants of questions to address diversity collapse and gradient diminishing issues in GRPO, leading to better performance on mathematical reasoning benchmarks.
Details
Motivation: Standard next-token prediction LLMs are overly sensitive to superficial phrasing variations. GRPO was designed to improve reasoning but suffers from two failure modes: diversity collapse (amplifying single solution strategies) and gradient diminishing (zero gradients when all rollouts get identical rewards).
Method: TA-GRPO generates semantically equivalent transformed variants of each question through paraphrasing, variable renaming, and format changes, then computes advantages by pooling rewards across the entire group. This ensures mixed rewards even for easy/hard questions and promotes multiple solution strategies.
Result: Experiments show consistent Pass@k improvements: gains up to 9.84 points on competition math (AMC12, AIME24) and 5.05 points on out-of-distribution scientific reasoning (GPQA-Diamond).
Conclusion: TA-GRPO effectively addresses GRPO’s failure modes by using semantically equivalent transformations, reducing zero-gradient probability and improving generalization via reduced train-test distribution shift.
Abstract: Large language models trained via next-token prediction are fundamentally pattern-matchers: sensitive to superficial phrasing variations even when the underlying problem is identical. Group Relative Policy Optimization (GRPO) was designed to improve reasoning, but in fact it worsens this situation through two failure modes: diversity collapse, where training amplifies a single solution strategy while ignoring alternative strategies, and gradient diminishing, where a large portion of questions yield zero gradient signal because all rollouts receive identical rewards. We propose TA-GRPO (Transform-Augmented GRPO), which generates semantically equivalent transformed variants of each question (via paraphrasing, variable renaming, and format changes) and computes advantages by pooling rewards across the entire group. This pooled computation ensures mixed rewards even when the original question is too easy or too hard, while training on diverse phrasings promotes multiple solution strategies. We provide theoretical justification showing that TA-GRPO reduces zero-gradient probability and improves generalization via reduced train-test distribution shift. Experiments on mathematical reasoning benchmarks show consistent Pass@k improvements, with gains up to 9.84 points on competition math (AMC12, AIME24) and 5.05 points on out-of-distribution scientific reasoning (GPQA-Diamond).
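The pooled-advantage mechanism is easy to show in isolation: plain GRPO normalizes rewards within a single question's rollouts, so an all-correct (or all-wrong) group yields zero advantages, while pooling across semantically equivalent variants restores a signal. Shapes and values below are illustrative.

```python
import torch

def grpo_advantages(rewards):                       # rewards: (n_rollouts,)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def pooled_advantages(rewards_per_variant):         # list of (n_rollouts,)
    pooled = torch.cat(rewards_per_variant)
    mean, std = pooled.mean(), pooled.std() + 1e-6
    return [(r - mean) / std for r in rewards_per_variant]

# An easy original question: every rollout is correct, so plain GRPO yields
# zero advantages (std = 0, no gradient). A harder paraphrase mixes rewards.
original = torch.tensor([1.0, 1.0, 1.0, 1.0])
paraphrase = torch.tensor([1.0, 0.0, 0.0, 1.0])

print(grpo_advantages(original))                    # all ~0: no signal
print(pooled_advantages([original, paraphrase]))    # non-zero for both groups
```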
[523] Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions
J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis
Main category: cs.LG
TL;DR: Infusion framework uses influence functions to craft subtle training data perturbations that induce targeted model behavior changes through parameter shifts, evaluated on vision and language data poisoning tasks.
Details
Motivation: To explore the reverse of traditional influence functions: instead of attributing model behavior to training data, crafting training data that induces specific model behavior, addressing data poisoning vulnerabilities.
Method: Uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts, evaluated on CIFAR-10 vision tasks and preliminary language experiments.
Result: On CIFAR-10, subtle edits to just 0.2% of training data can be competitive with inserting explicit behavior examples. The approach transfers across architectures (ResNet ↔ CNN), and in language experiments, it’s most effective at amplifying behaviors models have already learned.
Conclusion: Small, subtle edits to training data can systematically shape model behavior, highlighting the importance of training data interpretability for both adversaries and defenders in security contexts.
Abstract: Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
[524] Expanding the Capabilities of Reinforcement Learning via Text Feedback
Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette
Main category: cs.LG
TL;DR: RL from Text Feedback (RLTF) uses textual feedback as intermediate supervision between sparse binary rewards and expensive demonstrations for LLM post-training, improving single-turn performance through multi-turn learning.
Details
Motivation: Current RL for LLMs uses uninformative binary rewards, while distillation requires costly demonstrations. Text feedback offers richer supervision than rewards but is cheaper than demonstrations, representing a natural human interaction mode already abundant in real-world settings.
Method: Proposes RL from Text Feedback (RLTF) with two methods: 1) Self Distillation (RLTF-SD) trains single-turn policy to match its own feedback-conditioned second-turn generations; 2) Feedback Modeling (RLTF-FM) predicts feedback as auxiliary objective. Both leverage text feedback during training but not inference.
Result: Both methods consistently outperform strong baselines across reasoning puzzles, competition math, and creative writing tasks, demonstrating effectiveness of text feedback as rich supervision.
Conclusion: Text feedback provides valuable intermediate supervision between binary rewards and demonstrations, enabling more effective RL for LLMs through methods that internalize feedback to improve single-turn performance.
Abstract: The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.
[525] The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting
Chen-Hui Song, Shuoling Liu, Liyuan Chen
Main category: cs.LG
TL;DR: The paper challenges the assumption that training labels must match inference targets in financial forecasting, introducing the Label Horizon Paradox where optimal supervision differs from prediction goals due to market dynamics.
Details
Motivation: The paper questions the fundamental assumption in deep learning for financial forecasting that training labels should strictly mirror inference targets, noting that this design choice is rarely scrutinized despite its importance.
Method: The authors propose a bi-level optimization framework that autonomously identifies optimal proxy labels within a single training run, grounded in theoretical analysis of dynamic signal-noise trade-offs and the Label Horizon Paradox phenomenon.
Result: Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, validating the effectiveness of the proposed approach.
Conclusion: The work opens new avenues for label-centric research in financial forecasting by showing that optimal supervision signals often deviate from prediction targets, challenging conventional wisdom.
Abstract: While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signal-noise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.
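A toy sketch of the bi-level idea, under the assumption that the proxy label is a learned soft mixture over candidate horizons (the paper's exact parameterization may differ):

```python
import torch

def mixed_proxy_label(horizon_labels, alpha):
    """horizon_labels: (n_horizons, batch) returns at candidate horizons,
    e.g. 1-, 5-, and 20-step ahead. alpha: (n_horizons,) outer-loop logits."""
    w = torch.softmax(alpha, dim=0)              # mixture weights over horizons
    return (w[:, None] * horizon_labels).sum(0)  # (batch,) proxy supervision signal

alpha = torch.zeros(3, requires_grad=True)
# Inner loop: fit the forecaster against mixed_proxy_label(labels, alpha).
# Outer loop: backprop the validation loss at the true target horizon into alpha,
# so the optimal supervision horizon is discovered within one training run.
```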
[526] A Function-Space Stability Boundary for Generalization in Interpolating Learning Systems
Ronald Katende
Main category: cs.LG
TL;DR: The paper analyzes when algorithmic stability explains generalization in interpolating learning systems, proposing a contractive propagation condition and stability certificate to measure sensitivity to data perturbations.
Details
Motivation: To understand when algorithmic stability can explain generalization in modern learning systems that interpolate training data but still generalize well, as it remains unclear whether stability is a universal explanation for this phenomenon.
Method: Models training as a function-space trajectory and measures sensitivity to single-sample perturbations along this trajectory. Proposes a contractive propagation condition and derives a stability certificate by unrolling the resulting recursion.
Result: Shows that small certificate implies stability-based generalization, but also proves existence of interpolating regimes with small risk where contractive sensitivity cannot hold. Experiments confirm certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations.
Conclusion: The framework identifies regimes where stability explains generalization and where alternative mechanisms must account for success, showing stability is not a universal explanation for generalization in interpolating systems.
Abstract: Modern learning systems often interpolate training data while still generalizing well, yet it remains unclear when algorithmic stability explains this behavior. We model training as a function-space trajectory and measure sensitivity to single-sample perturbations along this trajectory. We propose a contractive propagation condition and a stability certificate obtained by unrolling the resulting recursion. A small certificate implies stability-based generalization, while we also prove that there exist interpolating regimes with small risk where such contractive sensitivity cannot hold, showing that stability is not a universal explanation. Experiments confirm that certificate growth predicts generalization differences across optimizers, step sizes, and dataset perturbations. The framework therefore identifies regimes where stability explains generalization and where alternative mechanisms must account for success.
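As a worked illustration of the certificate construction (the recursion below is an assumed form, not lifted from the paper): if per-step sensitivity obeys d_{t+1} <= rho_t * d_t + eps_t, unrolling gives a computable bound whose growth tracks generalization.

```python
import numpy as np

def stability_certificate(rhos, epsilons):
    """Unroll d_{t+1} <= rho_t * d_t + eps_t, where d_t bounds the function-space
    deviation after a single-sample swap, rho_t is the per-step propagation
    factor, and eps_t the injected perturbation. Returns the final bound."""
    d = 0.0
    for rho, eps in zip(rhos, epsilons):
        d = rho * d + eps
    return d

print(stability_certificate([0.9] * 100, [0.01] * 100))   # ~0.1: contractive, stable
print(stability_certificate([1.05] * 100, [0.01] * 100))  # ~26: no stability guarantee
```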
[527] Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm
Shizheng Wen, Mingyuan Chi, Tianwei Yu, Ben Moseley, Mike Yan Michelis, Pu Ren, Hao Sun, Siddhartha Mishra
Main category: cs.LG
TL;DR: A unified algorithmic framework for solving, optimizing, and learning PDEs with variational structure using GPU-optimized TensorGalerkin for efficient linear system assembly.
Details
Motivation: To create a unified framework that efficiently handles numerical PDE solving, PDE-constrained optimization, and physics-informed operator learning for variational PDEs, addressing computational efficiency challenges in these applications.
Method: Based on Galerkin discretization of variational forms, using the TensorGalerkin framework that tensorizes element-wise operations in a Python-level Map stage and performs global reduction via sparse matrix multiplication on mesh sparsity graphs.
Result: Demonstrated significant computational efficiency and accuracy gains over baselines for 2D/3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes across all targeted applications.
Conclusion: The framework provides an efficient, unified approach for PDE solving, optimization, and physics-informed learning with GPU acceleration and broad applicability to various PDE types.
Abstract: We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from a novel highly-optimized and GPU-compliant TensorGalerkin framework for linear system assembly (stiffness matrices and load vectors). TensorGalerkin operates by tensorizing element-wise operations within a Python-level Map stage and then performs global reduction with a sparse matrix multiplication that performs message passing on the mesh-induced sparsity graph. It can be seamlessly employed downstream as i) a highly-efficient numerical PDEs solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. With multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant computational efficiency and accuracy gains over a variety of baselines in all the targeted downstream applications.
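The Map/Reduce assembly pattern can be illustrated with a generic sparse scatter (a sketch of the general technique, not the TensorGalerkin code): element stiffness blocks are computed in one batched "Map" step, then summed into the global matrix in one sparse "Reduce".

```python
import numpy as np
import scipy.sparse as sp

def assemble_stiffness(local_K, conn, n_dofs):
    """local_K: (n_elems, k, k) element stiffness blocks, produced in a single
    tensorized step (e.g. one einsum over all elements).
    conn: (n_elems, k) global DOF indices of each element's nodes."""
    n_elems, k, _ = local_K.shape
    rows = np.repeat(conn, k, axis=1).ravel()  # row index of every block entry
    cols = np.tile(conn, (1, k)).ravel()       # matching column indices
    A = sp.coo_matrix((local_K.ravel(), (rows, cols)), shape=(n_dofs, n_dofs))
    return A.tocsr()  # duplicate (row, col) pairs are summed: the global reduction
```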
[528] Escaping Local Minima Provably in Non-convex Matrix Sensing: A Deterministic Framework via Simulated Lifting
Tianqi Shen, Jinji Yang, Junze He, Kunhan Gao, Ziye Ma
Main category: cs.LG
TL;DR: A deterministic framework called Simulated Oracle Direction (SOD) that escapes spurious local minima in low-rank matrix sensing by simulating over-parameterized escape directions without actual tensor lifting.
Details
Motivation: Low-rank matrix sensing has challenging nonconvex landscapes with many spurious local minima. While over-parameterization via tensor lifting can convert local minima to saddle points, actual lifting is computationally intractable. The goal is to achieve the benefits of over-parameterization without the computational cost.
Method: Proposes a Simulated Oracle Direction (SOD) escape mechanism that simulates the landscape and escape directions of the over-parametrized space without actually lifting the problem. Designs a mathematical framework to project over-parametrized escape directions onto the original parameter space to guarantee strict decrease from local minima.
Result: Numerical experiments show the framework reliably escapes local minima and facilitates convergence to global optima with minimal computational cost compared to explicit tensor over-parameterization.
Conclusion: The deterministic framework can escape spurious local minima with guarantee without random perturbations or heuristic estimates. Has implications for nonconvex optimization beyond matrix sensing by showing how simulated over-parameterization can tame challenging optimization landscapes.
Abstract: Low-rank matrix sensing is a fundamental yet challenging nonconvex problem whose optimization landscape typically contains numerous spurious local minima, making it difficult for gradient-based optimizers to converge to the global optimum. Recent work has shown that over-parameterization via tensor lifting can convert such local minima into strict saddle points, an insight that also partially explains why massive scaling can improve generalization and performance in modern machine learning. Motivated by this observation, we propose a Simulated Oracle Direction (SOD) escape mechanism that simulates the landscape and escape direction of the over-parametrized space, without resorting to actually lifting the problem, since that would be computationally intractable. In essence, we designed a mathematical framework to project over-parametrized escape directions onto the original parameter space to guarantee a strict decrease of objective value from existing local minima. To the best of our knowledge, this represents the first deterministic framework that could escape spurious local minima with guarantee, especially without using random perturbations or heuristic estimates. Numerical experiments demonstrate that our framework reliably escapes local minima and facilitates convergence to global optima, while incurring minimal computational cost when compared to explicit tensor over-parameterization. We believe this framework has non-trivial implications for nonconvex optimization beyond matrix sensing, by showcasing how simulated over-parameterization can be leveraged to tame challenging optimization landscapes.
[529] ContextBench: A Benchmark for Context Retrieval in Coding Agents
Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye
Main category: cs.LG
TL;DR: ContextBench is a process-oriented evaluation framework for coding agents that measures context retrieval performance during issue resolution, revealing that sophisticated agent scaffolding provides only marginal gains and LLMs favor recall over precision.
Details
Motivation: Existing evaluations of LLM-based coding agents focus primarily on final task success, providing limited insight into how agents retrieve and use code context during problem solving. There's a need for process-oriented evaluation that examines context retrieval throughout the issue-resolution process.
Method: Developed ContextBench with 1,136 issue-resolution tasks from 66 repositories across 8 programming languages, each augmented with human-annotated gold contexts. Implemented automated evaluation framework tracking agent trajectories and measuring context recall, precision, and efficiency throughout issue resolution. Evaluated 4 frontier LLMs and 5 coding agents.
Result: Sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents). LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench provides intermediate gold-context metrics that augment existing end-to-end benchmarks.
Conclusion: ContextBench offers valuable intermediate signals for guiding LLM reasoning in software tasks by unboxing the issue-resolution process. The framework reveals important insights about coding agent behavior that aren’t captured by traditional success metrics alone.
Abstract: LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.
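The gold-context metrics are straightforward to state; a hypothetical sketch (the paper's exact definitions may differ, e.g. span-level rather than file-level matching):

```python
def context_metrics(retrieved, gold, utilized):
    """retrieved: contexts the agent opened along its trajectory.
    gold: human-annotated contexts required to resolve the issue.
    utilized: retrieved contexts actually referenced in the final patch."""
    retrieved, gold, utilized = set(retrieved), set(gold), set(utilized)
    recall = len(retrieved & gold) / len(gold) if gold else 0.0
    precision = len(retrieved & gold) / len(retrieved) if retrieved else 0.0
    # The explored-vs-utilized gap the paper highlights:
    utilization = len(utilized) / len(retrieved) if retrieved else 0.0
    return recall, precision, utilization
```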
[530] Dense Neural Networks are not Universal Approximators
Levi Rauchwerger, Stefanie Jegelka, Ron Levie
Main category: cs.LG
TL;DR: Dense neural networks lack universal approximation capabilities under natural weight constraints, unlike sparse networks which can achieve true universality.
Details
Motivation: To investigate the approximation capabilities of dense neural networks and understand their limitations compared to sparse architectures, challenging the common assumption that dense networks are universal approximators.
Method: Uses a model compression approach combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks, analyzing ReLU networks under natural weight and dimensional constraints.
Result: Demonstrates existence of Lipschitz continuous functions that cannot be approximated by dense neural networks, showing intrinsic limitations of dense connectivity.
Conclusion: Dense neural networks do not possess universality under natural constraints, motivating sparse connectivity as necessary for achieving true universal approximation capabilities.
Abstract: We investigate the approximation capabilities of dense neural networks. While universal approximation theorems establish that sufficiently large architectures can approximate arbitrary continuous functions if there are no restrictions on the weight values, we show that dense neural networks do not possess this universality. Our argument is based on a model compression approach, combining the weak regularity lemma with an interpretation of feedforward networks as message passing graph neural networks. We consider ReLU neural networks subject to natural constraints on weights and input and output dimensions, which model a notion of dense connectivity. Within this setting, we demonstrate the existence of Lipschitz continuous functions that cannot be approximated by such networks. This highlights intrinsic limitations of neural networks with dense layers and motivates the use of sparse connectivity as a necessary ingredient for achieving true universality.
[531] A Thermodynamic Theory of Learning Part II: Critical Period Closure and Continual Learning Failure
Daisuke Okanohara
Main category: cs.LG
TL;DR: The paper establishes a geometric framework showing that irreversible learning dynamics progressively reduce a model’s reconfiguration capacity, leading to catastrophic forgetting in continual learning when new tasks exceed the residual adaptable degrees of freedom.
Details
Motivation: To understand why catastrophic forgetting occurs in continual learning despite the existence of multi-task solutions, by examining the irreversible nature of learning dynamics and their geometric constraints on future adaptability.
Method: Models learning as transport processes in parameter space, analyzes compositional structure of learning dynamics as transport maps, and defines compatible effective rank to quantify remaining reconfiguration capacity. Uses geometric analysis of Jacobian semigroups and singular value submultiplicativity.
Result: Proves a capacity-threshold criterion: if the stable rank of a new task’s Hessian exceeds the residual compatible effective rank, the task is trajectory-level incompatible and will cause forgetting. Shows catastrophic forgetting arises from irreversible loss of reconfiguration capacity, not absence of multi-task solutions.
Conclusion: Establishes trajectory-level capacity limits for continual learning, showing that finite-time learning irreversibly reduces reconfiguration capacity, leading to catastrophic forgetting when new tasks exceed remaining adaptable degrees of freedom.
Abstract: Learning performed over finite time is inherently irreversible. In Part I of this series, we modeled learning as a transport process in the space of parameter distributions and derived the Epistemic Speed Limit (ESL), which lower-bounds entropy production under finite-time dynamics.
In this work (Part II), we show that irreversibility imposes a geometric restriction on future adaptability through the compositional structure of learning dynamics. Successive learning phases compose multiplicatively as transport maps, and their Jacobians form a semigroup whose rank and singular values are submultiplicative. As a result, dynamically usable degrees of reconfiguration can only decrease under composition.
We formalize the remaining adaptability of a model in terms of compatible effective rank, defined as the log-volume of task-preserving directions that remain dynamically accessible. Although task performance may remain unchanged, finite-time learning can progressively reduce this reconfiguration capacity.
We prove a capacity-threshold criterion for continual learning: let m_B denote the stable rank of the Hessian of a new task B restricted to the task-preserving manifold of a previously learned task A. If m_B exceeds the residual compatible effective rank, then task B is trajectory-level incompatible with task A; any sufficient adaptation necessarily induces forgetting.
Thus catastrophic forgetting arises not from the absence of multi-task solutions, but from irreversible loss of reconfiguration capacity under compositional learning dynamics. This establishes a trajectory-level capacity limit for continual learning.
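The threshold quantity m_B is a standard stable rank; for reference, it can be computed as follows (a generic sketch, with the restriction to the task-preserving manifold omitted):

```python
import numpy as np

def stable_rank(H):
    """Stable rank ||H||_F^2 / ||H||_2^2 of a (restricted) Hessian, i.e. the
    m_B compared against the residual compatible effective rank."""
    s = np.linalg.svd(H, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

# A matrix with r equal nonzero singular values has stable rank exactly r.
print(stable_rank(np.diag([1.0, 1.0, 1.0, 0.0])))  # 3.0
```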
[532] Distribution-Free Robust Predict-Then-Optimize in Function Spaces
Yash Patel, Ambuj Tewari
Main category: cs.LG
TL;DR: Functional conformal prediction for robust PDE-based design under neural operator uncertainty
Details
Motivation: Neural operator models for PDEs lack accuracy guarantees, risking poor design performance when used in engineering optimization. Need robust decision-making under uncertainty in infinite-dimensional function spaces.
Method: Extends conformal prediction from finite-dimensional to infinite-dimensional Sobolev spaces, providing distribution-free uncertainty quantification for neural operators. Uses this uncertainty to formulate robust engineering design problems.
Result: Developed functional conformal coverage method with guarantees for Sobolev spaces. Demonstrated across diverse PDEs (Poisson, heat equations) and showed significant improvement in robust design for quantum state discrimination.
Conclusion: Functional conformal prediction enables robust engineering design under neural operator uncertainty, bridging gap between neural surrogates and classical PDE solvers with uncertainty guarantees.
Abstract: The need to rapidly solve PDEs in engineering design workflows has spurred the rise of neural surrogate models. In particular, neural operator models provide a discretization-invariant surrogate by retaining the infinite-dimensional, functional form of their arguments. Despite improved throughput, such methods lack guarantees on accuracy, unlike classical numerical PDE solvers. Optimizing engineering designs under these potentially miscalibrated surrogates thus runs the risk of producing designs that perform poorly upon deployment. In a similar vein, there is growing interest in automated decision-making under black-box predictors in the finite-dimensional setting, where a similar risk of suboptimality exists under poorly calibrated models. For this reason, methods have emerged that produce adversarially robust decisions under uncertainty estimates of the upstream model. One such framework leverages conformal prediction, a distribution-free post-hoc uncertainty quantification method, to provide these estimates due to its natural pairing with black-box predictors. We herein extend this line of conformally robust decision-making to infinite-dimensional function spaces. We first extend the typical conformal prediction guarantees over finite-dimensional spaces to infinite-dimensional Sobolev spaces. We then demonstrate how such uncertainty can be leveraged to robustly formulate engineering design tasks and characterize the suboptimality of the resulting robust optimal designs. We then empirically demonstrate the generality of our functional conformal coverage method across a diverse collection of PDEs, including the Poisson and heat equations, and showcase the significant improvement of such robust design in a quantum state discrimination task.
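The functional extension inherits the usual split-conformal recipe, with conformity scores given by Sobolev-norm residuals; a minimal sketch under that assumption:

```python
import numpy as np

def functional_conformal_radius(residual_norms, alpha=0.1):
    """residual_norms[i] = ||u_i - u_hat_i||_H, the Sobolev-norm error of the
    neural operator on calibration instance i. Returns a radius r so the ball
    {u : ||u - u_hat||_H <= r} covers a fresh solution w.p. >= 1 - alpha."""
    n = len(residual_norms)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample correction
    return np.quantile(residual_norms, level, method="higher")

scores = np.abs(np.random.default_rng(0).normal(size=200))  # stand-in residuals
print(functional_conformal_radius(scores, alpha=0.1))
```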
[533] Importance inversion transfer identifies shared principles for cross-domain learning
Daniele Caligiore
Main category: cs.LG
TL;DR: X-CDTL framework uses network science and explainable AI to find structural invariants that transfer knowledge across heterogeneous domains like biological, linguistic, molecular, and social networks, improving anomaly detection stability by 56% under noise.
Details
Motivation: Existing transfer learning methods fail to bridge radically heterogeneous systems, especially under data scarcity or noise. The paper aims to develop a principled approach for cross-domain knowledge transfer by identifying shared organizational principles across different scientific domains.
Method: Proposes Explainable Cross-Domain Transfer Learning (X-CDTL) framework that unifies network science and explainable AI. Introduces Importance Inversion Transfer (IIT) mechanism that prioritizes domain-invariant structural anchors over highly discriminative but idiosyncratic features.
Result: In anomaly detection tasks, models guided by X-CDTL principles achieve significant performance gains, including 56% relative improvement in decision stability under extreme noise compared to traditional baselines.
Conclusion: The work provides evidence for shared organizational signatures across heterogeneous domains and establishes a principled paradigm for cross-disciplinary knowledge propagation, advancing machine learning as a robust engine for scientific discovery through explicit structural laws rather than opaque latent representations.
Abstract: The capacity to transfer knowledge across scientific domains relies on shared organizational principles. However, existing transfer-learning methodologies often fail to bridge radically heterogeneous systems, particularly under severe data scarcity or stochastic noise. This study formalizes Explainable Cross-Domain Transfer Learning (X-CDTL), a framework unifying network science and explainable artificial intelligence to identify structural invariants that generalize across biological, linguistic, molecular, and social networks. By introducing the Importance Inversion Transfer (IIT) mechanism, the framework prioritizes domain-invariant structural anchors over idiosyncratic, highly discriminative features. In anomaly detection tasks, models guided by these principles achieve significant performance gains - exhibiting a 56% relative improvement in decision stability under extreme noise - over traditional baselines. These results provide evidence for a shared organizational signature across heterogeneous domains, establishing a principled paradigm for cross-disciplinary knowledge propagation. By shifting from opaque latent representations to explicit structural laws, this work advances machine learning as a robust engine for scientific discovery.
[534] UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation
Jonathan von Rad, Yong Cao, Andreas Geiger
Main category: cs.LG
TL;DR: UniComp is a unified evaluation framework for comparing LLM compression techniques (pruning, quantization, distillation) across performance, reliability, and efficiency dimensions using diverse benchmarks.
Details
Motivation: Existing evaluations of model compression techniques are limited in method coverage and focus primarily on knowledge-centric benchmarks, lacking comprehensive assessment across different compression approaches and broader capabilities.
Method: UniComp evaluates six compression techniques on modern LLMs across 40+ datasets along three dimensions: performance, reliability, and efficiency, using capability- and safety-oriented benchmarks with hardware-aware efficiency analysis.
Result: Compression shows consistent knowledge bias (knowledge-intensive tasks preserved while reasoning, multilingual, and instruction-following degrade), quantization provides best overall trade-off, distillation yields runtime acceleration at high computational cost, and task-specific calibration improves pruned models’ reasoning by up to 50%.
Conclusion: UniComp provides comprehensive evaluation framework revealing compression biases and trade-offs, with quantization offering best balance and task-specific calibration significantly improving pruned models’ reasoning capabilities.
Abstract: Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration gains at high computational cost; and (iii) task-specific calibration can significantly improve the reasoning ability of pruned models by up to 50%.
[535] Beyond Student: An Asymmetric Network for Neural Network Inheritance
Yiyun Zhou, Jingwei Shi, Mingjing Xu, Zhonghua Jiang, Jingyuan Chen
Main category: cs.LG
TL;DR: InherNet: A neural network inheritance method using asymmetric low-rank decomposition of teacher weights to create lightweight networks that better inherit teacher knowledge compared to traditional knowledge distillation.
Details
Motivation: Traditional knowledge distillation has limitations due to capacity gaps between teacher and student networks. The paper explores whether networks can better inherit teacher structure and knowledge, and how such inheriting networks compare to distilled student networks.
Method: Proposes InherNet which performs asymmetric low-rank decomposition on teacher weights using Singular Value Decomposition (SVD) for initialization, reconstructing a lightweight network while preserving principal knowledge and balancing depth, width, and compression efficiency.
Result: Experimental results across unimodal and multimodal tasks show InherNet achieves higher performance compared to student networks of similar parameter sizes.
Conclusion: InherNet reveals a promising direction for efficient model compression beyond traditional distillation, enabling better knowledge inheritance from teacher networks.
Abstract: Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher’s structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher’s weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.
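A minimal sketch of SVD-based inheritance for a single linear layer, assuming a symmetric square-root split of the spectrum (the paper's asymmetric factorization may allocate it differently):

```python
import torch
import torch.nn as nn

def inherit_linear(teacher, rank):
    """Replace a teacher nn.Linear with two thin layers initialized from the
    truncated SVD of its weight, so the lightweight network starts from the
    teacher's principal subspace rather than from random weights."""
    W = teacher.weight.data                          # (out_dim, in_dim)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    root_S = S[:rank].sqrt()
    down = nn.Linear(W.shape[1], rank, bias=False)
    up = nn.Linear(rank, W.shape[0], bias=teacher.bias is not None)
    down.weight.data = root_S[:, None] * Vh[:rank]   # (rank, in_dim)
    up.weight.data = U[:, :rank] * root_S[None, :]   # (out_dim, rank)
    if teacher.bias is not None:
        up.bias.data = teacher.bias.data.clone()
    return nn.Sequential(down, up)                   # up.weight @ down.weight ~ W
```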
[536] Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson’s disease and isolated REM sleep behavior disorder
Jesper Strøm, Casper Skjærbæk, Natasha Becker Bertelsen, Steffen Torpe Simonsen, Niels Okkels, David Bertram, Sinah Röttgen, Konstantin Kufer, Kaare B. Mikkelsen, Marit Otto, Poul Jørgen Jennum, Per Borghammer, Michael Sommerauer, Preben Kidmose
Main category: cs.LG
TL;DR: Adapted U-Sleep deep neural network for sleep staging in Parkinson’s disease and isolated REM sleep behavior disorder, achieving improved performance through fine-tuning and confidence-based thresholds.
Details
Motivation: Manual sleep staging is challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, creating a bottleneck for deploying RBD screening technologies at scale. Video-polysomnography remains the diagnostic gold standard for iRBD, a key prodromal marker of Parkinson's disease.
Method: Fine-tuned a pretrained U-Sleep model (originally trained on large multisite non-neurodegenerative dataset) on research datasets from two centers (PACE and CBC) with PD, iRBD, and control subjects. Evaluated on independent dataset from DCSM. Used confidence-based thresholds to optimize REM sleep staging and conducted interrater study for low-agreement cases.
Result: Fine-tuned model achieved κ = 0.74 on PACE/CBC (vs κ = 0.66 for pretrained model). In independent DCSM dataset, mean κ increased from 0.60 to 0.64 and median from 0.64 to 0.69. Confidence threshold increased correct REM sleep epoch identification from 85% to 95.5% while preserving sufficient REM sleep for 95% of subjects.
Conclusion: The adapted U-Sleep model provides generalizable sleep staging for neurodegenerative diseases, with performance improvements through fine-tuning and confidence-based optimization, potentially enabling scalable RBD screening technologies.
Abstract: Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson’s disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A pretrained U-Sleep model, based on a large, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites), was fine-tuned on research datasets from two centers (Lundbeck Foundation Parkinson’s Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (Cohen’s κ < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean κ = 0.81 in PUB, but κ = 0.66 when applied directly to PACE/CBC. By fine-tuning the model, we developed a generalized model with κ = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median κ increased from 0.60 to 0.64 (p < 0.001) and 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (> 5 min) REM sleep for 95% of subjects.
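The confidence-thresholding step can be illustrated as follows (a hypothetical sketch; the paper's exact rule may differ):

```python
import numpy as np

def confident_rem_mask(probs, rem_class=4, threshold=0.9):
    """probs: (n_epochs, n_stages) softmax outputs per 30-s epoch.
    Keep only epochs predicted REM with confidence >= threshold; raising the
    threshold trades REM quantity for purity, as in the 85% -> 95.5% result."""
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    return (pred == rem_class) & (conf >= threshold)
```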
cs.MA
[537] AIvilization v0: Toward Large-Scale Artificial Social Simulation with a Unified Agent Architecture and Adaptive Agent Profiles
Wenkai Fan, Shurui Zhang, Xiaolong Wang, Haowei Yang, Tsz Wai Chan, Xingyan Chen, Junquan Bi, Zirui Zhou, Jia Liu, Kani Chen
Main category: cs.MA
TL;DR: AIvilization v0 is a large-scale artificial society with LLM agents in a resource-constrained economy, featuring hierarchical planning, adaptive memory, and human steering for long-term autonomy.
Details
Motivation: To create a sustainable artificial society that maintains long-horizon autonomy while adapting to rapidly changing environments, addressing the tension between goal stability and reactive correctness.
Method: Combines: (1) hierarchical branch-thinking planner for goal decomposition and simulation-guided validation, (2) adaptive agent profile with dual-process memory separating short-term execution from long-term semantic consolidation, and (3) human-in-the-loop steering interface with memory-based effect propagation.
Result: The system produces stable markets reproducing key economic patterns (heavy-tailed returns, volatility clustering) and structured wealth stratification driven by education and access constraints. Full architecture shows robustness in multi-objective, long-horizon settings.
Conclusion: The unified LLM-agent architecture successfully creates a sustainable artificial society that balances goal stability with environmental adaptation, demonstrating economic realism and long-term robustness.
Abstract: AIvilization v0 is a publicly deployed large-scale artificial society that couples a resource-constrained sandbox economy with a unified LLM-agent architecture, aiming to sustain long-horizon autonomy while remaining executable under a rapidly changing environment. To mitigate the tension between goal stability and reactive correctness, we introduce (i) a hierarchical branch-thinking planner that decomposes life goals into parallel objective branches and uses simulation-guided validation plus tiered re-planning to ensure feasibility; (ii) an adaptive agent profile with dual-process memory that separates short-term execution traces from long-term semantic consolidation, enabling persistent yet evolving identity; and (iii) a human-in-the-loop steering interface that injects long-horizon objectives and short commands at appropriate abstraction levels, with effects propagated through memory rather than brittle prompt overrides. The environment integrates physiological survival costs, non-substitutable multi-tier production, an AMM-based price mechanism, and a gated education-occupation system. Using high-frequency transactions from the platform’s mature phase, we find stable markets that reproduce key stylized facts (heavy-tailed returns and volatility clustering) and produce structured wealth stratification driven by education and access constraints. Ablations show simplified planners can match performance on narrow tasks, while the full architecture is more robust under multi-objective, long-horizon settings, supporting delayed investment and sustained exploration.
[538] An Ontology-driven Dynamic Knowledge Base for Uninhabited Ground Vehicles
Hsan Sandar Win, Andrew Walters, Cheng-Chew Lim, Daniel Webber, Seth Leslie, Tan Doan
Main category: cs.MA
TL;DR: DCMD (Dynamic Contextual Mission Data) is an ontology-driven dynamic knowledge base for UGVs that provides real-time contextual updates to enhance situation awareness and autonomous decision-making in dynamic environments.
Details
Motivation: UGVs rely heavily on pre-mission a priori information, which becomes problematic when unexpected events occur during missions, causing identification ambiguities and requiring increased user intervention. There's a need for systems that can dynamically update contextual information to help UGVs realize their full potential in complex, changing environments.
Method: Developed an ontology-driven dynamic knowledge base using DCMD concept, supported by near real-time information acquisition and analysis. Implemented on a team of four UGVs executing a laboratory-based surveillance mission to provide in-mission on-platform DCMD updates.
Result: The ontology-driven dynamic representation of the UGV operational environment was machine actionable, producing contextual information that supported successful and timely mission execution and directly contributed to enhanced situation awareness.
Conclusion: DCMD enables UGVs to dynamically adapt to changing environments through real-time contextual updates, improving autonomous decision-making and situation awareness without heavy reliance on pre-mission information.
Abstract: In this paper, the concept of Dynamic Contextual Mission Data (DCMD) is introduced to develop an ontology-driven dynamic knowledge base for Uninhabited Ground Vehicles (UGVs) at the tactical edge. The dynamic knowledge base with DCMD is added to the UGVs to: support enhanced situation awareness; improve autonomous decision making; and facilitate agility within complex and dynamic environments. As UGVs are heavily reliant on the a priori information added pre-mission, unexpected occurrences during a mission can cause identification ambiguities and require increased levels of user input. Updating this a priori information with contextual information can help UGVs realise their full potential. To address this, the dynamic knowledge base was designed using an ontology-driven representation, supported by near real-time information acquisition and analysis, to provide in-mission on-platform DCMD updates. This was implemented on a team of four UGVs that executed a laboratory based surveillance mission. The results showed that the ontology-driven dynamic representation of the UGV operational environment was machine actionable, producing contextual information to support a successful and timely mission, and contributed directly to the situation awareness.
[539] Beyond Task Performance: A Metric-Based Analysis of Sequential Cooperation in Heterogeneous Multi-Agent Destructive Foraging
Alejandro Mendoza Barrionuevo, Samuel Yanes Luis, Daniel Gutiérrez Reina, Sergio L. Toral Marín
Main category: cs.MA
TL;DR: The paper proposes a systematic set of general-purpose cooperation metrics for heterogeneous multi-agent systems operating under partial observability and temporal role dependency, validated in a destructive foraging scenario with autonomous vehicles.
Details
Motivation: Most previous studies focus on algorithmic performance for task completion, but lack comprehensive metrics to characterize cooperation aspects like coordination, dependency, fairness, and sensitivity in heterogeneous multi-agent systems with partial observability and temporal role dependencies.
Method: Proposes a suite of cooperation metrics structured into three categories: primary metrics, inter-team metrics, and intra-team metrics. These metrics are validated in a realistic destructive foraging scenario using heterogeneous autonomous vehicles with two specialized teams (search and destruction) having sequential dependencies.
Result: The metrics provide multilevel characterization of cooperation and have been validated with several representative approaches covering both learning-based algorithms and classical heuristic paradigms in the aquatic surface cleaning scenario.
Conclusion: The proposed metrics offer a comprehensive framework for analyzing cooperation in heterogeneous multi-agent systems beyond just task efficiency, addressing coordination, dependency, fairness, and sensitivity aspects that are transferable to similar sequential domains.
Abstract: This work addresses the problem of analyzing cooperation in heterogeneous multi-agent systems which operate under partial observability and temporal role dependency, framed within a destructive multi-agent foraging setting. Unlike most previous studies, which focus primarily on algorithmic performance with respect to task completion, this article proposes a systematic set of general-purpose cooperation metrics aimed at characterizing not only efficiency, but also coordination and dependency between teams and agents, fairness, and sensitivity. These metrics are designed to be transferable to different multi-agent sequential domains similar to foraging. The proposed suite of metrics is structured into three main categories that jointly provide a multilevel characterization of cooperation: primary metrics, inter-team metrics, and intra-team metrics. They have been validated in a realistic destructive foraging scenario inspired by dynamic aquatic surface cleaning using heterogeneous autonomous vehicles. It involves two specialized teams with sequential dependencies: one focused on the search of resources, and another on their destruction. Several representative approaches have been evaluated, covering both learning-based algorithms and classical heuristic paradigms.
[540] The emergence of numerical representations in communicating artificial agents
Daniela Mihai, Lucas Weber, Francesca Franzon
Main category: cs.MA
TL;DR: Neural agents develop non-compositional numerical communication systems through referential games, achieving high in-distribution accuracy but failing to generalize to unseen numerosities.
Details
Motivation: To investigate whether communication pressure alone can lead to the emergence of numerical representations in artificial agents, and whether these emergent codes resemble human numeral systems.
Method: Two neural network agents play referential games to communicate numerosities using either discrete tokens (symbolic) or continuous sketches (iconic). No pre-defined numeric concepts are provided.
Result: Agents achieve high in-distribution communication accuracy with both discrete and continuous representations. However, the emergent codes are non-compositional - agents fail to generalize to unseen numerosities, reusing symbols for highest trained numerosity (discrete) or collapsing extrapolated values (continuous).
Conclusion: Communication pressure alone suffices for precise transmission of learned numerosities, but additional pressures are needed to develop compositional codes and generalization abilities that resemble human numeral systems.
Abstract: Human languages provide efficient systems for expressing numerosities, but whether the sheer pressure to communicate is enough for numerical representations to arise in artificial agents, and whether the emergent codes resemble human numerals at all, remains an open question. We study two neural network-based agents that must communicate numerosities in a referential game using either discrete tokens or continuous sketches, thus exploring both symbolic and iconic representations. Without any pre-defined numeric concepts, the agents achieve high in-distribution communication accuracy in both communication channels and converge on high-precision symbol-meaning mappings. However, the emergent code is non-compositional: the agents fail to derive systematic messages for unseen numerosities, typically reusing the symbol of the highest trained numerosity (discrete), or collapsing extrapolated values onto a single sketch (continuous). We conclude that the communication pressure alone suffices for precise transmission of learned numerosities, but additional pressures are needed to yield compositional codes and generalisation abilities.
[541] Learning to Compose for Cross-domain Agentic Workflow Generation
Jialiang Wang, Shengxiang Xu, Hanmo Liu, Jiachuan Wang, Yuyu Luo, Shimin Di, Min-Ling Zhang, Lei Chen
Main category: cs.MA
TL;DR: A method for single-pass cross-domain workflow generation using learned reusable capabilities, outperforming iterative refinement approaches.
Details
Motivation: Current workflow generation systems rely on iterative refinement which is costly and unstable under domain shift. There's a need for more efficient, generalizable workflow generation that can adapt to different task distributions and operator sets.
Method: Proposes a decompose-recompose-decide mechanism: 1) Learn compact set of reusable workflow capabilities across diverse domains, 2) Map input tasks to sparse compositions over these bases for single-pass workflow generation, 3) Attribute success/failure to counterfactual contributions from learned capabilities to understand which capabilities drive success.
Result: The 1-pass generator surpasses state-of-the-art refinement baselines that use 20 iterations, while substantially reducing generation latency and cost across multi-domain, cross-domain, and unseen-domain evaluations.
Conclusion: Internalizing workflow generation into a single-pass LLM with learned reusable capabilities enables efficient and effective cross-domain workflow generation, overcoming limitations of iterative refinement approaches.
Abstract: Automatically generating agentic workflows – executable operator graphs or codes that orchestrate reasoning, verification, and repair – has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative workflow refinement to discover a feasible workflow from a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases to generate a task-specific workflow in a single pass. To decide, we attribute the success or failure of workflow generation to counterfactual contributions from learned capabilities, thereby capturing which capabilities actually drive success by their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our 1-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.
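A toy sketch of the recompose step, assuming capabilities live in the same embedding space as tasks (all names and the top-k rule are illustrative, not the paper's):

```python
import numpy as np

def compose_workflow(task_emb, capability_bases, k=3):
    """task_emb: (d,) embedding of the input task.
    capability_bases: (n_caps, d) learned reusable capability vectors.
    Returns a sparse weighted composition used to instantiate a one-pass workflow."""
    scores = capability_bases @ task_emb              # affinity to each capability
    top = np.argsort(scores)[-k:][::-1]               # sparse support of size k
    weights = np.exp(scores[top])
    weights /= weights.sum()                          # normalized composition weights
    return list(zip(top.tolist(), weights.tolist()))  # (capability id, weight) pairs
```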
[542] Convergence and Connectivity: Dynamics of Multi-Agent Q-Learning in Random Networks
Dan Leonte, Aamal Hussain, Raphael Huser, Francesco Belardinelli, Dario Paccagnan
Main category: cs.MA
TL;DR: Analysis of Q-learning convergence in network polymatrix games on random graphs, showing conditions for unique equilibrium convergence in many-agent systems.
Details
Motivation: Multi-agent learning algorithms often fail to converge to equilibrium solutions in many-agent settings, displaying complex non-stationary behaviors. The paper aims to understand when convergence can be achieved in network-based multi-agent systems.
Method: Studies Q-learning dynamics in network polymatrix normal-form games on random graph models (Erdős-Rényi and Stochastic Block models). Establishes sufficient conditions for convergence to unique equilibrium based on exploration rates, payoff matrices, and interaction probabilities.
Result: Theoretical conditions show convergence can be reliably achieved in many-agent systems when network interactions are properly controlled. Numerical simulations validate these findings.
Conclusion: Convergence to equilibrium in multi-agent Q-learning is possible in many-agent systems if network interaction probabilities are appropriately managed, providing insights for designing stable distributed learning systems.
Abstract: Beyond specific settings, many multi-agent learning algorithms fail to converge to an equilibrium solution, instead displaying complex, non-stationary behaviours such as recurrent or chaotic orbits. In fact, recent literature suggests that such complex behaviours are likely to occur when the number of agents increases. In this paper, we study Q-learning dynamics in network polymatrix normal-form games where the network structure is drawn from classical random graph models. In particular, we focus on the Erdős-Rényi model, which is used to analyze connectivity in distributed systems, and the Stochastic Block model, which generalizes the above by accounting for community structures that naturally arise in multi-agent systems. In each setting, we establish sufficient conditions under which the agents’ joint strategies converge to a unique equilibrium. We investigate how this condition depends on the exploration rates, payoff matrices and, crucially, the probabilities of interaction between network agents. We validate our theoretical findings through numerical simulations and demonstrate that convergence can be reliably achieved in many-agent systems, provided interactions in the network are controlled.
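A compact simulation in the spirit of this setting: smoothed (Boltzmann) best-response dynamics, a standard stand-in for Q-learning dynamics with exploration, on a random Erdős-Rényi polymatrix game. Constants and the update rule are illustrative, not the paper's exact system.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T, lr = 20, 0.2, 2.0, 0.05        # agents, edge prob., temperature, step size
adj = np.triu(rng.random((n, n)) < p, 1)
adj = (adj | adj.T).astype(float)        # Erdos-Renyi interaction graph
payoff = rng.normal(size=(n, n, 2, 2))   # payoff[i, j, a, b]: i plays a, j plays b

x = np.full((n, 2), 0.5)                 # mixed strategies over two actions
for _ in range(5000):
    r = np.einsum('ij,ijab,jb->ia', adj, payoff, x)  # expected action payoffs
    br = np.exp(r / T)
    br /= br.sum(1, keepdims=True)       # smoothed best response (exploration T)
    x += lr * (br - x)                   # relax toward it; fixed points are QREs
print(np.abs(br - x).max())              # near 0 once the dynamics have converged
```

With high exploration T the dynamics contract to a unique equilibrium; lowering T or densifying the graph (larger p) is where the paper's conditions come into play.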
[543] LLM-Mediated Guidance of MARL Systems
Philipp D. Siedler, Ian Gemp
Main category: cs.MA
TL;DR: LLM-mediated interventions improve Multi-Agent Reinforcement Learning by guiding agents toward desirable behaviors through natural language and rule-based controllers.
Details
Motivation: Multi-Agent Reinforcement Learning faces challenges in achieving efficient learning and desirable behaviors in complex environments. The paper explores combining MARL with LLM-mediated interventions to guide agents toward better behaviors.
Method: The study investigates two types of LLM-mediated interventions: Natural Language Controller (using 7B/8B LLM to simulate human-like interventions) and Rule-Based Controller. These interventions shape learning trajectories of multiple agents in MARL systems.
Result: Rule-Based Controller showed stronger impact than Natural Language Controller. Both intervention types outperformed baseline without interventions. Early interventions were particularly beneficial, leading to more efficient training and higher performance.
Conclusion: LLM-mediated guidance can accelerate training and enhance MARL performance in challenging environments, with rule-based interventions being particularly effective for shaping agent behaviors.
Abstract: In complex multi-agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi-Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)-mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and facilitate interventions that shape the learning trajectories of multiple agents. We experimented with two types of interventions, referred to as controllers: a Natural Language (NL) Controller and a Rule-Based (RB) Controller. The RB Controller showed a stronger impact than the NL Controller, which uses a small (7B/8B) LLM to simulate human-like interventions. Our findings indicate that agents particularly benefit from early interventions, leading to more efficient training and higher performance. Both intervention types outperform the baseline without interventions, highlighting the potential of LLM-mediated guidance to accelerate training and enhance MARL performance in challenging environments.
[544] Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas
Main category: cs.MA
TL;DR: Multi-agent reinforcement learning approach for long-term river plume monitoring using AUVs, combining spatiotemporal Gaussian process regression with multi-head Q-network controllers for energy-efficient coordination.
Details
Motivation: Need for efficient long-term (multiple days) monitoring of dynamic river plumes using autonomous underwater vehicles, with focus on energy and communication efficiency in multi-agent systems.
Method: Multi-agent reinforcement learning with central coordinator, integrating spatiotemporal Gaussian process regression (GPR) with multi-head Q-network controllers that regulate AUV direction and speed. Uses intermittent communication to balance data collection with energy conservation.
Result: Outperforms single- and multi-agent benchmarks in simulations using Delft3D ocean model. Scaling number of agents improves both mean squared error and operational endurance. Learned policies generalize across unseen seasonal regimes.
Conclusion: Demonstrates promise for data-driven long-term monitoring of dynamic plume environments, showing that multi-agent coordination can significantly improve endurance while maintaining or improving accuracy.
Abstract: We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents improving both mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
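For reference, the coordinator's field estimate can be sketched with an off-the-shelf GPR over space-time inputs (kernel, length scales, and the stand-in measurement model are illustrative, not the paper's):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 3))         # (x, y, t) AUV measurement sites
y = np.sin(4 * X[:, 0]) * np.exp(-X[:, 2])  # stand-in plume salinity readings
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=[0.2, 0.2, 0.5]))
gpr.fit(X, y)
mu, std = gpr.predict(rng.uniform(0, 1, size=(5, 3)), return_std=True)
print(mu.round(2), std.round(2))            # posterior mean and uncertainty maps
```

The posterior std is what a controller like the multi-head Q-network can exploit to steer AUVs toward informative, low-energy sampling routes.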
[545] AOI: Context-Aware Multi-Agent Operations via Dynamic Scheduling and Hierarchical Memory Compression
Zishan Bai, Jing Luo, Ziyi Ni, Enze Ge, Jiacheng Shi, Yichao Zhang, Jiayi Gu, Zhimo Han, Riyang Bao, Junfeng Hao
Main category: cs.MA
TL;DR: AOI is a multi-agent framework with LLM-based context compression for autonomous IT operations management, improving task success and reducing diagnosis time.
Details
Motivation: Modern cloud-native architectures create overwhelming operational data complexity, leading to inefficient processing, poor task coordination, and loss of contextual continuity during fault diagnosis.
Method: Proposes AOI (AI-Oriented Operations) with three specialized agents and an LLM-based Context Compressor, featuring dynamic task scheduling and three-layer memory architecture (Working, Episodic, Semantic).
Result: Achieves 72.4% context compression while preserving 92.8% critical information, improves task success to 94.2%, and reduces MTTR by 34.4% over best baseline.
Conclusion: Presents a paradigm shift toward scalable, adaptive, context-aware autonomous operations for next-generation IT infrastructures with minimal human intervention.
Abstract: The proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on synthetic and real-world benchmarks show that AOI achieves 72.4% context compression while preserving 92.8% critical information, improves task success to 94.2%, and reduces MTTR by 34.4% over the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.
[546] Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning
John Gardiner, Orlando Romero, Brendan Tivnan, Nicolò Dal Fabbro, George J. Pappas
Main category: cs.MA
TL;DR: A framework for training multi-agent reinforcement learning agents to use shared quantum entanglement as a coordination resource, enabling communication-free correlated policies that outperform classical shared randomness approaches.
Details
Motivation: Prior MARL work uses shared randomness for coordination, but quantum entanglement offers a larger class of communication-free correlated policies. Quantum physics shows that for certain cooperative games without communication, shared entanglement enables strategies that outperform classical shared randomness approaches (quantum advantage).
Method: Introduces a novel differentiable policy parameterization that enables optimization over quantum measurements, combined with a policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors.
Result: The method successfully learns strategies that attain quantum advantage in single-round games treated as black box oracles, and also demonstrates quantum advantage in multi-agent sequential decision-making problems formulated as Dec-POMDPs.
Conclusion: The framework enables MARL agents to exploit quantum entanglement as a coordination resource, achieving quantum advantage in both single-round games and sequential decision-making problems, expanding the capabilities of communication-free coordination in multi-agent systems.
Abstract: The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).
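The quantum advantage the paper builds on can be checked numerically: in the CHSH game, the optimal entangled strategy wins with probability cos²(π/8) ≈ 0.854, beating the classical bound of 0.75. The worked example below uses the textbook angles; it is independent of the paper's learned policies.

```python
# CHSH game win probability under the standard optimal entangled strategy.
import numpy as np

def measurement(theta):
    """Projectors for a qubit measured in the real basis rotated by theta."""
    v0 = np.array([np.cos(theta), np.sin(theta)])
    v1 = np.array([-np.sin(theta), np.cos(theta)])
    return [np.outer(v0, v0), np.outer(v1, v1)]

# Shared Bell state |phi+> = (|00> + |11>) / sqrt(2)
phi = np.zeros(4)
phi[0] = phi[3] = 1 / np.sqrt(2)
rho = np.outer(phi, phi)

# Textbook optimal angles: Alice uses 0, pi/4; Bob uses pi/8, -pi/8.
A = [measurement(0.0), measurement(np.pi / 4)]
B = [measurement(np.pi / 8), measurement(-np.pi / 8)]

win = 0.0
for x in (0, 1):
    for y in (0, 1):                      # referee's questions, uniform
        for a in (0, 1):
            for b in (0, 1):
                if (a ^ b) == (x & y):    # CHSH winning condition
                    p = np.trace(rho @ np.kron(A[x][a], B[y][b])).real
                    win += 0.25 * p
print(win)  # ~0.8536 > 0.75, the classical optimum
```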
[547] LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis
Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng
Main category: cs.MA
TL;DR: LingxiDiagBench is a multi-agent benchmark for evaluating LLMs on psychiatric diagnosis, featuring 16K synthetic consultation dialogues in Chinese across 12 ICD-10 categories.
Details
Motivation: Address the shortage of psychiatrists and subjectivity in mental health diagnosis by creating AI benchmarks for psychiatric assessment, overcoming limitations of existing benchmarks lacking realistic patient simulation, clinician-verified labels, and dynamic multi-turn consultation support.
Method: Developed LingxiDiag-16K dataset with 16,000 EMR-aligned synthetic consultation dialogues reproducing real clinical demographic and diagnostic distributions. Created multi-agent benchmark evaluating LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation.
Result: LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%) but performance deteriorates for comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%). Dynamic consultation often underperforms static evaluation, and consultation quality shows only moderate correlation with diagnostic accuracy.
Conclusion: The benchmark reveals significant challenges in AI-assisted psychiatric diagnosis, particularly for complex cases and dynamic consultation. The released dataset and framework support reproducible research in this critical area.
Abstract: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.
cs.MM
[548] Rethinking Security of Diffusion-based Generative Steganography
Jihao Zhu, Zixuan Chen, Jiali Liu, Lingxiao Yang, Yi Zhou, Weiqi Luo, Xiaohua Xie
Main category: cs.MM
TL;DR: A security analysis of diffusion model-based generative image steganography (DM-GIS) methods, showing that disrupting diffusion model noise distribution compromises security, and proposing a noise space-based steganalyzer (NS-DSer) to detect hidden messages.
Details
Motivation: To analyze the security of diffusion model-based generative image steganography methods and identify key factors affecting their security, particularly focusing on how steganographic operations impact the noise distribution of diffusion models.
Method: The paper first analyzes general pipelines of DM-GIS methods, identifies noise space as the primary embedding domain, then theoretically demonstrates that disrupting noise distribution compromises security. Based on this insight, they propose NS-DSer, a steganalysis framework that detects DM-GIS generated images in the diffusion model noise space.
Result: Experimental results validate the theoretical analysis and show NS-DSer’s effectiveness across diverse detection scenarios. The paper uses NS-DSer to reevaluate the security of existing DM-GIS methods under increasingly challenging detection settings.
Conclusion: The noise distribution of diffusion models is crucial for DM-GIS security, and any steganographic operation that disrupts this distribution compromises security. NS-DSer provides an effective framework for detecting such hidden messages.
Abstract: Generative image steganography is a technique that conceals secret messages within generated images, without relying on pre-existing cover images. Recently, a number of diffusion model-based generative image steganography (DM-GIS) methods have been introduced, which effectively combat traditional steganalysis techniques. In this paper, we identify the key factors that influence DM-GIS security and revisit the security of existing methods. Specifically, we first provide an overview of the general pipelines of current DM-GIS methods, finding that the noise space of diffusion models serves as the primary embedding domain. Further, we analyze the relationship between DM-GIS security and the noise distribution of diffusion models, theoretically demonstrating that any steganographic operation that disrupts the noise distribution compromises DM-GIS security. Building on this insight, we propose a Noise Space-based Diffusion Steganalyzer (NS-DSer), a simple yet effective steganalysis framework for detecting DM-GIS generated images in the diffusion model noise space. We reevaluate the security of existing DM-GIS methods using NS-DSer across increasingly challenging detection scenarios. Experimental results validate our theoretical analysis of DM-GIS security and show the effectiveness of NS-DSer across diverse detection scenarios.
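The detection idea can be sketched as follows: invert a suspect image back to its initial diffusion noise (a placeholder here, standing in for inversion through a frozen diffusion model), then test whether that noise is plausibly i.i.d. standard Gaussian; embedding a message typically perturbs these statistics. The tests and threshold are illustrative assumptions, not NS-DSer's actual classifier.

```python
# Noise-space sanity tests as a stand-in for a learned steganalyzer.
import numpy as np
from scipy import stats

def invert_to_noise(image):
    # Placeholder for DDIM-style inversion through a frozen diffusion model.
    # Here we return the array unchanged so the test below is runnable.
    return image

def looks_clean(noise, alpha=0.01):
    z = noise.ravel()
    _, p_ks = stats.kstest(z, "norm")          # distributional test
    lag1 = np.corrcoef(z[:-1], z[1:])[0, 1]    # crude independence check
    return p_ks > alpha and abs(lag1) < 0.05

rng = np.random.default_rng(0)
clean = rng.standard_normal((64, 64))
stego = np.sign(rng.standard_normal((64, 64)))  # e.g., sign-coded message bits
print(looks_clean(invert_to_noise(clean)))  # True: passes the Gaussian test
print(looks_clean(invert_to_noise(stego)))  # False: flagged as suspicious
```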
[549] Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation
Xinyi Che, Wenbo Wang, Jian Guan, Qijun Zhao
Main category: cs.MM
TL;DR: OD-PFA framework for multimodal emotion recognition that disentangles shared and modality-specific emotional cues using orthogonal constraints and projected feature alignment.
Details
Motivation: Existing MERC methods focus on aligning cross-modal semantics but overlook modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language, limiting their ability to capture comprehensive emotional understanding.
Method: Proposes Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA): 1) Decouples unimodal features into shared and modality-specific components, 2) Uses orthogonal disentanglement strategy with reconstruction loss to separate components while preserving emotional information, 3) Applies projected feature alignment to map shared features into common latent space with cross-modal consistency alignment loss.
Result: Extensive evaluations on IEMOCAP and MELD benchmark datasets demonstrate effectiveness of OD-PFA for multimodal emotion recognition tasks compared to state-of-the-art approaches.
Conclusion: OD-PFA successfully captures both shared semantics and modality-specific emotional cues, addressing limitations of existing methods that overlook fine-grained emotional nuances across different modalities.
Abstract: Multimodal Emotion Recognition in Conversation (MERC) significantly enhances emotion recognition performance by integrating complementary emotional cues from text, audio, and visual modalities. While existing methods commonly utilize techniques such as contrastive learning and cross-attention mechanisms to align cross-modal emotional semantics, they typically overlook modality-specific emotional nuances like micro-expressions, tone variations, and sarcastic language. To overcome these limitations, we propose Orthogonal Disentanglement with Projected Feature Alignment (OD-PFA), a novel framework designed explicitly to capture both shared semantics and modality-specific emotional cues. Our approach first decouples unimodal features into shared and modality-specific components. An orthogonal disentanglement strategy (OD) enforces effective separation between these components, aided by a reconstruction loss to maintain critical emotional information from each modality. Additionally, a projected feature alignment strategy (PFA) maps shared features across modalities into a common latent space and applies a cross-modal consistency alignment loss to enhance semantic coherence. Extensive evaluations on widely-used benchmark datasets, IEMOCAP and MELD, demonstrate the effectiveness of our proposed OD-PFA on multimodal emotion recognition tasks compared with state-of-the-art approaches.
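One common way to realize the orthogonality constraint is to penalize the cross-covariance between the shared and modality-specific components; the sketch below shows that form. Shapes and the exact penalty are assumptions, not necessarily the paper's formulation.

```python
# A cross-covariance penalty driving shared/specific components apart.
import torch

def orthogonality_loss(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    # shared, specific: (batch, dim). Penalize the squared Frobenius norm
    # of their cross-covariance so the two subspaces decorrelate.
    s = shared - shared.mean(dim=0)
    p = specific - specific.mean(dim=0)
    cross = s.t() @ p / shared.size(0)
    return (cross ** 2).sum()

shared = torch.randn(32, 128, requires_grad=True)
specific = torch.randn(32, 128, requires_grad=True)
loss = orthogonality_loss(shared, specific)
loss.backward()   # gradients flow to both encoders in a full model
print(loss.item())
```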
eess.AS
[550] AudioRAG: A Challenging Benchmark for Audio Reasoning and Information Retrieval
Jingru Lin, Chen Zhang, Tianrui Wang, Haizhou Li
Main category: eess.AS
TL;DR: AudioRAG benchmark evaluates audio-language models on real-world reasoning tasks requiring external information retrieval, showing current SOTA models struggle and proposing an agentic pipeline solution.
Details
Motivation: Existing audio-language model benchmarks focus only on internal knowledge reasoning, neglecting real-world scenarios requiring external information grounding. There's a need to assess models in realistic web environments where audio reasoning must be augmented with information retrieval.
Method: Introduces AudioRAG benchmark with LLM-generated and manually curated question-answer pairs that require audio understanding combined with external information retrieval. Also proposes an agentic pipeline integrating audio reasoning with retrieval-augmented generation as a baseline.
Result: State-of-the-art Large Audio-Language Models (LALMs) struggle to answer AudioRAG questions, demonstrating the challenge of audio-based reasoning requiring external information grounding.
Conclusion: AudioRAG fills an important gap in evaluating audio-language models for real-world scenarios, and the proposed agentic pipeline provides a stronger baseline for future research in audio reasoning with information retrieval.
Abstract: Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus only on reasoning with internal knowledge, neglecting real-world scenarios that require external information grounding. To bridge this gap, we introduce AudioRAG, a novel benchmark designed to evaluate audio-based reasoning augmented by information retrieval in realistic web environments. This benchmark comprises both LLM-generated and manually curated question-answer pairs. Our evaluations reveal that even the state-of-the-art LALMs struggle to answer these questions. We therefore propose an agentic pipeline that integrates audio reasoning with retrieval-augmented generation, providing a stronger baseline for future research.
[551] From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks
Riccardo Miccini, Clément Laroche, Tobias Piechowiak, Xenofon Fafoutis, Luca Pezzarossa
Main category: eess.AS
TL;DR: Speech enhancement models with dynamic channel pruning can simultaneously estimate voice activity, noise type, and fundamental frequency from their internal pruning masks, eliminating need for separate auxiliary models.
Details
Motivation: Current audio devices need separate models for speech enhancement and auxiliary tasks like VAD, SNR estimation, and acoustic scene classification, which is computationally expensive for on-device deployment and introduces latency/privacy issues for cloud-based solutions.
Method: Leverage dynamic channel pruning masks from speech enhancement models to extract useful signal properties. Use simple interpretable predictors on these masks to perform VAD, noise classification, and F0 estimation without additional computational overhead.
Result: Achieved 93% accuracy on VAD, 84% on noise classification, and R2 of 0.86 on F0 estimation using binary masks that reduce predictions to weighted sums with negligible overhead.
Conclusion: Dynamic channel pruning models can serve dual purpose: efficient speech enhancement and simultaneous estimation of signal properties, providing holistic solution for on-device audio processing while revealing what these models learn through their emergent behavior.
Abstract: Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R2 of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
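The "weighted sum" reading of this result can be demonstrated with a linear probe on binary masks; with on/off channel decisions per frame, inference reduces to sign(w · mask + b). Mask dimensions, labels, and the probe below are synthetic illustrations, not the paper's setup.

```python
# Linear probe over binary pruning masks; all data synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_frames, n_channels = 5000, 64
masks = rng.integers(0, 2, size=(n_frames, n_channels))   # DynCP-style decisions
# Synthetic VAD labels correlated with a subset of "speech" channels.
vad = (masks[:, :8].sum(axis=1) + rng.normal(0, 1, n_frames) > 4).astype(int)

probe = LogisticRegression(max_iter=1000).fit(masks[:4000], vad[:4000])
print("VAD accuracy:", probe.score(masks[4000:], vad[4000:]))
# At inference the prediction is just sign(w @ mask + b): a weighted sum of
# bits, which is why the overhead reported above is negligible.
```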
[552] RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance
Jing-Han Chen, Bo-Hao Su, Ya-Tse Wu, Chi-Chun Lee
Main category: eess.AS
TL;DR: RE-LLM: A speech-LLM that integrates dimensional emotion embeddings and auxiliary learning to enhance emotional exploration and empathetic response generation in human-AI interactions.
Details
Motivation: Current LLMs for empathetic AI focus on emotional reflection but overlook emotional exploration, which is key for deeper engagement. Text-only approaches capture limited emotion nuances, so there's a need to incorporate speech signals for richer emotional understanding.
Method: Proposes RE-LLM, a speech-LLM that integrates dimensional emotion embeddings and auxiliary learning to enhance emotional understanding and empathetic response generation.
Result: Statistically significant gains in empathy metrics across three datasets (IEMOCAP, ESD, MSP-PODCAST). RE-LLM improves Emotional Reaction score by 14.79% and 6.76% compared to text-only and speech-LLM baselines on ESD. Exploration scores increased substantially across all datasets (35.42% to 139.28% improvements). Also boosts speech emotion recognition accuracy by 2.3-6.9%.
Conclusion: RE-LLM demonstrates enriched emotional understanding and improved empathetic response generation by integrating speech signals with dimensional emotion embeddings, addressing the gap in emotional exploration for more engaging human-AI interactions.
Abstract: With generative AI advancing, empathy in human-AI interaction is essential. While prior work focuses on emotional reflection, emotional exploration, key to deeper engagement, remains overlooked. Existing LLMs rely on text which captures limited emotion nuances. To address this, we propose RE-LLM, a speech-LLM integrating dimensional emotion embeddings and auxiliary learning. Experiments show statistically significant gains in empathy metrics across three datasets. RE-LLM relatively improves the Emotional Reaction score by 14.79% and 6.76% compared to text-only and speech-LLM baselines on ESD. Notably, it raises the Exploration score by 35.42% and 3.91% on IEMOCAP, 139.28% and 9.83% on ESD, and 60.95% and 22.64% on MSP-PODCAST. It also boosts unweighted accuracy by 5.4% on IEMOCAP, 2.3% on ESD, and 6.9% on MSP-PODCAST in speech emotion recognition. These results highlight the enriched emotional understanding and improved empathetic response generation of RE-LLM.
[553] Self-Supervised Learning for Speaker Recognition: A study and review
Theo Lepage, Reda Dehak
Main category: eess.AS
TL;DR: Survey paper reviewing self-supervised learning frameworks (SimCLR, MoCo, DINO) adapted from computer vision to speaker recognition, analyzing their hyperparameters, components, and performance on in-domain/out-of-domain data.
Details
Motivation: Supervised learning for audio/speech tasks depends heavily on human-annotated data, making it costly and prone to poor generalization. Self-supervised learning offers a promising alternative by leveraging unlabeled data, but while SSL for ASR is well-studied, SSL for speaker recognition remains under-explored.
Method: Comprehensive review and analysis of SSL frameworks originally developed for computer vision (SimCLR, MoCo, DINO) adapted to speaker recognition. Investigates hyperparameter effects, SSL components (data augmentation, projector, positive sampling), and evaluates frameworks on in-domain and out-of-domain data with consistent experimental setup.
Result: DINO achieves best downstream performance and effectively models intra-speaker variability but is highly sensitive to hyperparameters. SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. Comprehensive comparison of SSL methods from literature provided.
Conclusion: SSL frameworks show promise for speaker recognition, with different frameworks offering complementary strengths. DINO excels in performance but requires careful tuning, while SimCLR/MoCo offer robustness. The work highlights current trends, advancements, and challenges in applying SSL to speaker recognition.
Abstract: Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
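For concreteness, SimCLR's NT-Xent objective, the contrastive loss at the heart of one of the reviewed frameworks, can be written in a few lines: two augmented views of the same segment are positives, everything else in the batch is a negative. Batch size, embedding dimension, and temperature below are illustrative.

```python
# NT-Xent (SimCLR-style) contrastive loss over paired views.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2n, d), unit norm
    sim = z @ z.t() / tau                                # cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)                 # positive = other view

z1, z2 = torch.randn(16, 256), torch.randn(16, 256)     # two views per sample
print(nt_xent(z1, z2).item())
```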
[554] SLM-S2ST: A multimodal language model for direct speech-to-speech translation
Yuxuan Hu, Haibin Wu, Ruchao Fan, Xiaofei Wang, Heng Lu, Yao Qian, Jinyu Li
Main category: eess.AS
TL;DR: SLM-S2ST is a multimodal language model for direct speech-to-speech translation that extends Phi4-MM with an audio transformer head and streaming vocoder, achieving superior performance on CVSS-C dataset.
Details
Motivation: While speech-aware language models can understand spoken language and generate text responses, enabling them to produce speech output efficiently and effectively remains challenging. The paper aims to address this gap by creating a multimodal LM for direct speech-to-speech translation.
Method: Built on the open-source Phi4-MM model, SLM-S2ST extends its predecessor by generating translated speech using an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis.
Result: Experimental results on the CVSS-C dataset show SLM-S2ST’s superior performance, significantly surpassing existing baseline models trained on the same dataset. When scaled up with more training data and larger model size, it reaches on-par performance with current state-of-the-art models.
Conclusion: SLM-S2ST demonstrates effective direct speech-to-speech translation capabilities through multimodal language modeling with delayed audio token prediction and streaming vocoder synthesis.
Abstract: Speech-aware language models (LMs) have demonstrated capabilities in understanding spoken language while generating text-based responses. However, enabling them to produce speech output efficiently and effectively remains a challenge. In this paper, we present SLM-S2ST, a multimodal LM for direct speech-to-speech translation (S2ST), built on the open-source Phi4-MM model. SLM-S2ST extends its predecessor by generating translated speech using an audio transformer head that predicts audio tokens with a delay relative to text tokens, followed by a streaming vocoder for waveform synthesis. Our experimental results on the CVSS-C dataset demonstrate SLM-S2ST’s superior performance, significantly surpassing existing baseline models trained on the same dataset. Furthermore, when we scale up the training data and the model size, SLM-S2ST reaches on-par performance with the current SOTA model.
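The delayed-prediction pattern can be visualized with a small sketch: audio tokens are emitted k steps behind the text stream so each audio token can condition on the text already produced. Token values, the delay, and the PAD marker are illustrative assumptions.

```python
# Aligning a text stream with a delayed audio-token stream.
PAD = "_"

def delayed_streams(text_tokens, audio_tokens, delay=2):
    steps = max(len(text_tokens), len(audio_tokens) + delay)
    rows = []
    for t in range(steps):
        txt = text_tokens[t] if t < len(text_tokens) else PAD
        aud = audio_tokens[t - delay] if 0 <= t - delay < len(audio_tokens) else PAD
        rows.append((txt, aud))
    return rows

for txt, aud in delayed_streams(list("HELLO"), ["a0", "a1", "a2", "a3", "a4"]):
    print(f"text: {txt}  audio: {aud}")   # audio lags the text by two steps
```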
[555] Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model
Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li
Main category: eess.AS
TL;DR: Systematic comparison of speech-text joint decoding paradigms for speech language models, proposing an early-stop interleaved approach that improves both speed and performance.
Details
Motivation: Speech language models enable end-to-end speech-text modeling for spoken dialogue systems, but the choice of joint decoding paradigm critically affects performance, efficiency, and alignment quality. There's a need to systematically compare different approaches and improve upon existing limitations.
Method: Systematically compare representative joint speech-text decoding strategies (interleaved and parallel generation paradigms) under controlled experimental setup using same base LM, speech tokenizer, and training data. Propose novel early-stop interleaved (ESI) pattern to address slow inference issues of interleaved approach. Curate high-quality QA datasets to improve speech QA performance.
Result: Interleaved approach achieves best alignment but suffers from slow inference due to long token sequences. Early-stop interleaved (ESI) pattern significantly accelerates decoding while yielding slightly better performance. High-quality QA datasets further improve speech QA performance.
Conclusion: The early-stop interleaved paradigm offers an effective solution balancing alignment quality and inference efficiency for speech language models, advancing practical deployment of spoken dialogue systems.
Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modeling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text joint decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies, including the interleaved and parallel generation paradigms, under a controlled experimental setup using the same base language model, speech tokenizer, and training data. Our results show that the interleaved approach achieves the best alignment. However, it suffers from slow inference due to long token sequence length. To address this, we propose a novel early-stop interleaved (ESI) pattern that not only significantly accelerates decoding but also yields slightly better performance. Additionally, we curate high-quality question answering (QA) datasets to further improve speech QA performance.
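One plausible reading of the early-stop idea, sketched below under that assumption since the abstract does not spell out the mechanism: plain interleaving reserves a text slot in every cycle for the whole audio stream (padding once the text has ended), while the early-stop variant emits an end-of-text token and drops the slot afterwards, shortening the sequence. Slot sizes and the special tokens are illustrative.

```python
# Plain vs. early-stop interleaving of text and audio token slots.
def interleave(text, audio, a_chunk=2, early_stop=False):
    out, ti, ai = [], 0, 0
    while ai < len(audio):
        if ti < len(text):
            out.append(text[ti]); ti += 1
        elif ti == len(text):
            out.append("<eot>"); ti += 1      # text stream terminates
        elif not early_stop:
            out.append("<pad>")               # plain pattern keeps the slot
        out += audio[ai:ai + a_chunk]
        ai += a_chunk
    return out

text = ["T0", "T1", "T2"]
audio = [f"A{i}" for i in range(12)]
plain = interleave(text, audio)
esi = interleave(text, audio, early_stop=True)
print(len(plain), len(esi))  # the early-stop sequence is strictly shorter
```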
[556] MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
Junhyeok Lee, Helin Wang, Yaohan Guan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak
Main category: eess.AS
TL;DR: MaskVCT is a zero-shot voice conversion model with multi-factor controllability using classifier-free guidance, allowing flexible control over speaker identity, linguistic content, and prosody.
Details
Motivation: Previous voice conversion models rely on fixed conditioning schemes, limiting their flexibility and controllability. There's a need for a model that can integrate diverse conditions and allow users to balance different factors like speaker identity, linguistic content, and prosody in a zero-shot setting.
Method: MaskVCT uses multiple classifier-free guidances (CFGs) to achieve multi-factor controllability. It can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. The model integrates diverse conditions in a single framework.
Result: Extensive experiments show MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines.
Conclusion: MaskVCT provides a flexible, controllable zero-shot voice conversion system that allows users to balance speaker identity, linguistic content, and prosodic factors through multiple classifier-free guidances.
Abstract: We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intelligibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
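The standard way to compose several classifier-free guidances is to nudge the unconditional prediction along each condition direction with its own weight; the sketch below shows that composition. Shapes, weights, and the three condition names are illustrative assumptions, not MaskVCT's internals.

```python
# Composing multiple classifier-free guidance terms.
import numpy as np

def multi_cfg(eps_uncond, eps_conds, weights):
    # eps_uncond: (d,); eps_conds: list of (d,) conditional predictions.
    out = eps_uncond.copy()
    for eps_c, w in zip(eps_conds, weights):
        out += w * (eps_c - eps_uncond)   # steer along each condition direction
    return out

rng = np.random.default_rng(0)
d = 8
eps_uncond = rng.standard_normal(d)
eps_speaker, eps_text, eps_pitch = (rng.standard_normal(d) for _ in range(3))
# Raising a weight strengthens that factor; zeroing one drops the cue,
# e.g. omitting the pitch contour to leave prosody unconstrained.
guided = multi_cfg(eps_uncond, [eps_speaker, eps_text, eps_pitch], [2.0, 1.5, 0.0])
print(guided)
```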
[557] Physics-Guided Variational Model for Unsupervised Sound Source Tracking
Luan Vinícius Fiorio, Ivana Nikoloska, Bruno Defraene, Alex Young, Johan David, Ronald M. Aarts
Main category: eess.AS
TL;DR: Unsupervised physics-guided variational model for sound source tracking using microphone arrays without ground-truth labels
Details
Motivation: Traditional sound source tracking uses classical array-processing algorithms, while machine learning approaches require expensive precise source position labels. There's a need for unsupervised methods that don't rely on ground-truth labels.
Method: Combines variational encoder with physics-based decoder that injects geometric constraints into latent space through analytically derived pairwise time-delay likelihoods. Learns to estimate source directions directly from microphone array signals without ground-truth labels.
Result: Outperforms traditional baselines, achieves accuracy and computational complexity comparable to state-of-the-art supervised models. Generalizes well to mismatched array geometries and exhibits strong robustness to corrupted microphone position metadata.
Conclusion: The physics-guided variational approach enables fully unsupervised sound source tracking with performance comparable to supervised methods, and can be extended to multi-source tracking.
Abstract: Sound source tracking is commonly performed using classical array-processing algorithms, while machine-learning approaches typically rely on precise source position labels that are expensive or impractical to obtain. This paper introduces a physics-guided variational model capable of fully unsupervised single-source sound source tracking. The method combines a variational encoder with a physics-based decoder that injects geometric constraints into the latent space through analytically derived pairwise time-delay likelihoods. Without requiring ground-truth labels, the model learns to estimate source directions directly from microphone array signals. Experiments on real-world data demonstrate that the proposed approach outperforms traditional baselines and achieves accuracy and computational complexity comparable to state-of-the-art supervised models. We further show that the method generalizes well to mismatched array geometries and exhibits strong robustness to corrupted microphone position metadata. Finally, we outline a natural extension of the approach to multi-source tracking and present the theoretical modifications required to support it.
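The geometric constraint the decoder injects can be made concrete: under a far-field assumption, a candidate direction predicts the time delay at every microphone pair, and the observed delays score the latent direction. The array geometry, noise scale, and grid search below are illustrative stand-ins for the learned encoder.

```python
# Pairwise time-delay likelihood for a 2D far-field source direction.
import numpy as np

C = 343.0  # speed of sound, m/s

def expected_tdoa(direction, mic_i, mic_j):
    # Far-field plane wave: the delay is the projection of the mic baseline
    # onto the unit propagation direction, divided by the speed of sound.
    return (mic_j - mic_i) @ direction / C

def log_likelihood(direction, mics, observed, sigma=1e-5):
    ll = 0.0
    for (i, j), tau in observed.items():
        ll += -0.5 * ((tau - expected_tdoa(direction, mics[i], mics[j])) / sigma) ** 2
    return ll

mics = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
true_dir = np.array([np.cos(0.7), np.sin(0.7)])
observed = {(0, 1): expected_tdoa(true_dir, mics[0], mics[1]),
            (0, 2): expected_tdoa(true_dir, mics[0], mics[2])}

# Grid-search the azimuth: the likelihood peaks at the true direction.
angles = np.linspace(0, 2 * np.pi, 720)
scores = [log_likelihood(np.array([np.cos(a), np.sin(a)]), mics, observed)
          for a in angles]
print("estimated azimuth:", angles[int(np.argmax(scores))], "true:", 0.7)
```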
eess.IV
[558] A Systematic Review on Data-Driven Brain Deformation Modeling for Image-Guided Neurosurgery
Tiago Assis, Colin P. Galvin, Joshua P. Castillo, Nazim Haouchine, Marta Kersten-Oertel, Zeyu Gao, Mireia Crispin-Ortuzar, Stephen J. Price, Thomas Santarius, Yangming Ou, Sarah Frisken, Nuno C. Garcia, Alexandra J. Golby, Reuben Dorent, Ines P. Machado
Main category: eess.IV
TL;DR: Systematic review of AI-driven brain deformation compensation methods for neurosurgical image guidance, covering deep learning registration, deformation field regression, multimodal alignment, and hybrid models from 2020-2025.
Details
Motivation: Brain deformation during neurosurgery misaligns preoperative planning images with intraoperative anatomy, requiring accurate compensation for reliable image-guided surgery. AI methods offer promising solutions but need systematic evaluation.
Method: Comprehensive literature review of 41 studies from 2020-2025 using PubMed, IEEE Xplore, Scopus, and Web of Science. Analyzed deep learning-based image registration, direct deformation field regression, multimodal alignment, resection-aware architectures, and hybrid biomechanical models.
Result: AI-based deformation models show promising performance and computational efficiency but have limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and clinical deployment readiness.
Conclusion: Review identifies gaps in current approaches and outlines opportunities for more robust, generalizable, and clinically translatable deformation compensation solutions for neurosurgical guidance.
Abstract: Accurate compensation of brain deformation is a critical challenge for reliable image-guided neurosurgery, as surgical manipulation and tumor resection induce tissue motion that misaligns preoperative planning images with intraoperative anatomy and longitudinal studies. In this systematic review, we synthesize recent AI-driven approaches developed between January 2020 and April 2025 for modeling and correcting brain deformation. A comprehensive literature search was conducted in PubMed, IEEE Xplore, Scopus, and Web of Science, with predefined inclusion and exclusion criteria focused on computational methods applied to brain deformation compensation for neurosurgical imaging, resulting in 41 studies meeting these criteria. We provide a unified analysis of methodological strategies, including deep learning-based image registration, direct deformation field regression, synthesis-driven multimodal alignment, resection-aware architectures addressing missing correspondences, and hybrid models that integrate biomechanical priors. We also examine dataset utilization, reported evaluation metrics, validation protocols, and how uncertainty and generalization have been assessed across studies. While AI-based deformation models demonstrate promising performance and computational efficiency, current approaches exhibit limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and readiness for clinical deployment. Our review highlights these gaps and outlines opportunities for future research aimed at achieving more robust, generalizable, and clinically translatable deformation compensation solutions for neurosurgical guidance. By organizing recent advances and critically evaluating evaluation practices, this work provides a comprehensive foundation for researchers and clinicians engaged in developing and applying AI-based brain deformation methods.
[559] Anatomy-Preserving Latent Diffusion for Generation of Brain Segmentation Masks with Ischemic Infarct
Lucia Borrego, Vajira Thambawita, Marco Ciuffreda, Ines del Val, Alejandro Dominguez, Josep Munuera
Main category: eess.IV
TL;DR: A generative framework using VAE and diffusion models to synthesize multi-class brain segmentation masks for medical imaging, addressing data scarcity in NCCT neuroimaging.
Details
Motivation: High-quality segmentation masks are scarce in medical image analysis, especially for non-contrast CT neuroimaging, where manual annotation is expensive and inconsistent. This limits training data availability for segmentation models.
Method: Combines a variational autoencoder trained on segmentation masks to learn anatomical latent representations, with a diffusion model operating in this latent space to generate new samples from noise. At inference, synthetic masks are decoded through the frozen VAE decoder, with optional binary prompt control over lesion presence.
Result: Generated masks preserve global brain anatomy, discrete tissue semantics, and realistic variability while avoiding structural artifacts common in pixel-space generative models.
Conclusion: The framework provides a simple and scalable solution for anatomy-aware mask generation in data-scarce medical imaging scenarios, particularly useful for augmenting training data for segmentation models.
Abstract: The scarcity of high-quality segmentation masks remains a major bottleneck for medical image analysis, particularly in non-contrast CT (NCCT) neuroimaging, where manual annotation is costly and variable. To address this limitation, we propose an anatomy-preserving generative framework for the unconditional synthesis of multi-class brain segmentation masks, including ischemic infarcts. The proposed approach combines a variational autoencoder trained exclusively on segmentation masks to learn an anatomical latent representation, with a diffusion model operating in this latent space to generate new samples from pure noise. At inference, synthetic masks are obtained by decoding denoised latent vectors through the frozen VAE decoder, with optional coarse control over lesion presence via a binary prompt. Qualitative results show that the generated masks preserve global brain anatomy, discrete tissue semantics, and realistic variability, while avoiding the structural artifacts commonly observed in pixel-space generative models. Overall, the proposed framework offers a simple and scalable solution for anatomy-aware mask generation in data-scarce medical imaging scenarios.
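The sampling path described above, denoise a latent from pure noise and decode it into a discrete mask, can be sketched compactly. The schedule, shapes, and both networks below are placeholders, not the paper's models; only the control flow is the point.

```python
# DDPM-style ancestral sampling in latent space, then frozen-decoder decode.
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = torch.cumprod(alphas, dim=0)

def eps_model(z, t, lesion_prompt):            # placeholder noise predictor
    return torch.zeros_like(z)

def vae_decode(z):                             # placeholder frozen VAE decoder
    return torch.softmax(z.repeat(1, 4, 1, 1), dim=1)   # 4 tissue classes

z = torch.randn(1, 1, 16, 16)                  # start from pure noise
for t in reversed(range(T)):                   # reverse diffusion steps
    eps = eps_model(z, t, lesion_prompt=1)     # optional binary lesion prompt
    mean = (z - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])
    z = mean + (torch.sqrt(betas[t]) * torch.randn_like(z) if t > 0 else 0)
mask = vae_decode(z).argmax(dim=1)             # discrete tissue labels
print(mask.shape)
```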
[560] Uncertainty-Aware Ordinal Deep Learning for Cross-Dataset Diabetic Retinopathy Grading
Ali El Bellaj, Aya Benradi, Salman El Youssoufi, Taha El Marzouki, Mohammed-Amine Cheddadi
Main category: eess.IV
TL;DR: Uncertainty-aware deep learning framework for diabetic retinopathy severity grading using ordinal evidential learning with lesion-query attention and Dirichlet-based regression for robust cross-dataset generalization.
Details
Motivation: Diabetic retinopathy is a severe complication of diabetes that can lead to vision loss. Early and reliable detection is critical, but automated grading systems need to handle uncertainty and domain shifts between different clinical datasets while respecting the ordinal nature of disease progression.
Method: Combines convolutional backbone with lesion-query attention pooling and evidential Dirichlet-based ordinal regression head. Uses ordinal evidential loss with annealed regularization for calibrated confidence under domain shift. Trained on multi-domain datasets (APTOS, Messidor-2, EyePACS subset).
Result: Achieves strong cross-dataset generalization with competitive classification accuracy and high quadratic weighted kappa on held-out test sets. Provides meaningful uncertainty estimates for low-confidence cases.
Conclusion: Ordinal evidential learning is promising for robust and clinically reliable diabetic retinopathy grading, offering both accurate severity prediction and principled uncertainty estimation.
Abstract: Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to insufficient insulin production or impaired insulin utilization. One of its most severe complications is diabetic retinopathy (DR), a progressive retinal disease caused by microvascular damage, leading to hemorrhages, exudates, and potential vision loss. Early and reliable detection of DR is therefore critical for preventing irreversible blindness. In this work, we propose an uncertainty-aware deep learning framework for automated DR severity grading that explicitly models the ordinal nature of disease progression. Our approach combines a convolutional backbone with lesion-query attention pooling and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty. The model is trained using an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. We evaluate the proposed method on a multi-domain training setup combining APTOS, Messidor-2, and a subset of EyePACS fundus datasets. Experimental results demonstrate strong cross-dataset generalization, achieving competitive classification accuracy and high quadratic weighted kappa on held-out test sets, while providing meaningful uncertainty estimates for low-confidence cases. These results suggest that ordinal evidential learning is a promising direction for robust and clinically reliable diabetic retinopathy grading.
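The evidential head can be sketched in standard form: non-negative evidence per grade defines a Dirichlet over class probabilities, and the total evidence doubles as a confidence signal (uncertainty = K / S). The toy evidence values are illustrative; the ordinal loss itself is omitted here.

```python
# Evidential Dirichlet outputs: expected probabilities plus uncertainty.
import torch
import torch.nn.functional as F

def evidential_outputs(logits):
    evidence = F.softplus(logits)          # non-negative evidence per grade
    alpha = evidence + 1.0                 # Dirichlet concentration parameters
    S = alpha.sum(dim=-1, keepdim=True)    # total evidence
    probs = alpha / S                      # expected class probabilities
    uncertainty = logits.size(-1) / S      # K / S, in (0, 1]
    return probs, uncertainty

confident = torch.tensor([[0.1, 0.2, 8.0, 0.3, 0.1]])   # strong grade-2 evidence
vague = torch.tensor([[0.2, 0.1, 0.3, 0.2, 0.1]])       # little evidence anywhere
for logits in (confident, vague):
    p, u = evidential_outputs(logits)
    print(p.argmax().item(), float(u))     # same argmax mechanics, different u
```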
[561] Analyzing Model Misspecification in Quantitative MRI: Application to Perfusion ASL
Jiachen Wang, Jon Tamir, Adam Bush
Main category: eess.IV
TL;DR: Proposes a framework to assess model misspecification in quantitative MRI using misspecified Cramer-Rao bound tests, demonstrated with arterial spin labeling showing brain model is well-specified but kidney model is moderately misspecified.
Details
Motivation: Quantitative MRI models are often confounded and difficult to validate in vivo. Model misspecification occurs when assumed signal models differ from true data-generating processes, leading to biased estimates and incorrect uncertainty quantification.
Method: Uses misspecified Cramer-Rao bound (MCRB) theory to assess model validity. Two tests: (1) examine if empirical MCRB asymptotically approaches CRB as repeated measurements increase; (2) compare MLE estimates from two equal-sized subsets and evaluate if empirical variance aligns with theoretical CRB predictions.
Result: Demonstrated with arterial spin labeling (ASL) showing the commonly used ASL signal model appears to be well-specified in the brain but moderately misspecified in the kidney.
Conclusion: Provides a general, theoretically grounded framework for assessing model validity in quantitative MRI, helping identify when models may produce biased or inconsistent estimates.
Abstract: Quantitative MRI (qMRI) involves parameter estimation governed by an explicit signal model. However, these models are often confounded and difficult to validate in vivo. A model is misspecified when the assumed signal model differs from the true data-generating process. Under misspecification, the variance of any unbiased estimator is lower-bounded by the misspecified Cramer-Rao bound (MCRB), and maximum-likelihood estimates (MLE) may exhibit bias and inconsistency. Based on these principles, we assess misspecification in qMRI using two tests: (i) examining whether the empirical MCRB asymptotically approaches the CRB as repeated measurements increase; (ii) comparing MLE estimates from two equal-sized subsets and evaluating whether their empirical variance aligns with theoretical CRB predictions. We demonstrate the framework using arterial spin labeling (ASL) as an illustrative example. Our result shows the commonly used ASL signal model appears to be well-specified in the brain and moderately misspecified in the kidney. The proposed framework offers a general, theoretically grounded approach for assessing model validity in quantitative MRI.
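Test (ii) can be illustrated on the simplest possible model: split repeated measurements in two halves, compute the MLE on each, and compare the empirical variance of the difference with what the CRB predicts. A Gaussian-mean model fitted to heavier-tailed data breaks the match; the model choice and sample sizes are illustrative, not the ASL setting.

```python
# Toy subset-variance vs. CRB check for a Gaussian-mean model.
import numpy as np

rng = np.random.default_rng(0)
n, trials, sigma = 200, 2000, 1.0
crb = sigma**2 / (n // 2)                     # CRB for the mean, per half

for name, sampler in [("well-specified", lambda: rng.normal(0, sigma, n)),
                      ("misspecified", lambda: rng.standard_t(df=3, size=n))]:
    diffs = []
    for _ in range(trials):
        x = sampler()
        diffs.append(x[: n // 2].mean() - x[n // 2 :].mean())
    # Var(theta1 - theta2) = 2 * Var(theta_hat) for independent halves,
    # so the ratio below should be ~1.0 when the assumed model holds.
    print(name, np.var(diffs) / (2 * crb))
```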
[562] Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT
Jineel H Raythatha, Shuchang Ye, Jeremy Hsu, Jinman Kim
Main category: eess.IV
TL;DR: Foundation models show equivalent discrimination to task-specific models for traumatic bowel injury detection but suffer specificity deficits due to negative-class heterogeneity, not just class imbalance.
Details
Motivation: To evaluate foundation models in clinical practice under compound distribution shift (class imbalance + heterogeneous imaging appearances), specifically for rare traumatic bowel injury diagnosis, and investigate whether specificity deficits are associated with negative-class heterogeneity.
Method: Retrospective study using multi-institutional RSNA Abdominal Traumatic Injury CT dataset (23 centers, 2019-2023). Compared two foundation models (MedCLIP zero-shot, RadDINO linear probe) against three task-specific approaches (CNN, Transformer, Ensemble). Trained on 3,147 patients (2.3% bowel injury prevalence), evaluated on enriched 100-patient test set. Isolated negative-class effects by assessing specificity in patients without bowel injury who had concurrent solid organ injury vs no abdominal pathology.
Result: Foundation models achieved equivalent discrimination (AUC 0.64-0.68 vs 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models showed high specificity without abdominal pathology (84-100%), but specificity declined substantially for foundation models when solid organ injuries were present (50-51 percentage point drop vs 12-41 point drop for task-specific models).
Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labeled training, suggesting adaptation is required before clinical implementation.
Abstract: Purpose: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift, where severe class imbalance coexists with heterogeneous imaging appearances. This challenge is relevant for traumatic bowel injury, a rare but high-mortality diagnosis. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class. Methods: This retrospective study used the multi-institutional, RSNA Abdominal Traumatic Injury CT dataset (2019-2023), comprising scans from 23 centres. Two foundation models (MedCLIP, zero-shot; RadDINO, linear probe) were compared against three task-specific approaches (CNN, Transformer, Ensemble). Models were trained on 3,147 patients (2.3% bowel injury prevalence) and evaluated on an enriched 100-patient test set. To isolate negative-class effects, specificity was assessed in patients without bowel injury who had concurrent solid organ injury (n=58) versus no abdominal pathology (n=50). Results: Foundation models achieved equivalent discrimination to task-specific models (AUC, 0.64-0.68 versus 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models demonstrated high specificity in patients without abdominal pathology (84-100%). When solid organ injuries were present, specificity declined substantially for foundation models (50-51 percentage points) compared with smaller reductions of 12-41 percentage points for task-specific models. Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labelled training, suggesting adaptation is required before clinical implementation.
[563] Benchmarking Deep Learning and Statistical Target Detection Methods for PFM-1 Landmine Detection in UAV Hyperspectral Imagery
Sagar Lekhak, Prasanna Reddy Pulakurthi, Ramesh Bhatta, Emmett J. Ientilucci
Main category: eess.IV
TL;DR: Benchmark study comparing classical spectral detection algorithms with a lightweight neural network for UAV-based hyperspectral landmine detection, emphasizing precision-focused evaluation over ROC-AUC.
Details
Motivation: Standardized benchmarks for UAV-based hyperspectral landmine detection are scarce, and there's a need to evaluate both classical algorithms and learning-based approaches for reliable detection in scenarios with extremely sparse target pixels.
Method: Systematic benchmark of four classical statistical detection algorithms (SAM, MF, ACE, CEM) and a proposed lightweight Spectral Neural Network with Parametric Mish activations for PFM-1 landmine detection using VNIR hyperspectral data and pixel-level binary ground truth masks.
Result: While ACE achieved highest ROC-AUC (0.989), the Spectral-NN outperformed classical detectors in precision-focused evaluation (PR and AP), highlighting that ROC-AUC can be misleading when target pixels are extremely sparse relative to background.
Conclusion: Precision-focused evaluation, scene-aware benchmarking, and learning-based spectral models are essential for reliable UAV-based hyperspectral landmine detection, with the proposed Spectral-NN showing superior performance in precision metrics.
Abstract: In recent years, unmanned aerial vehicles (UAVs) equipped with imaging sensors and automated processing algorithms have emerged as a promising tool to accelerate large-area surveys while reducing risk to human operators. Although hyperspectral imaging (HSI) enables material discrimination using spectral signatures, standardized benchmarks for UAV-based landmine detection remain scarce. In this work, we present a systematic benchmark of four classical statistical detection algorithms, including Spectral Angle Mapper (SAM), Matched Filter (MF), Adaptive Cosine Estimator (ACE), and Constrained Energy Minimization (CEM), alongside a proposed lightweight Spectral Neural Network utilizing Parametric Mish activations for PFM-1 landmine detection. We also release pixel-level binary ground truth masks (target/background) to enable standardized, reproducible evaluation. Evaluations were conducted on inert PFM-1 targets across multiple scene crops using a recently released VNIR hyperspectral dataset. Metrics such as receiver operating characteristic (ROC) curve, area under the curve (AUC), precision-recall (PR) curve, and average precision (AP) were used. While all methods achieve high ROC-AUC on an independent test set, the ACE method achieves the highest AUC of 0.989. However, because target pixels are extremely sparse relative to background, ROC-AUC alone can be misleading; under precision-focused evaluation (PR and AP), the Spectral-NN outperforms classical detectors, achieving the highest AP. These results emphasize the need for precision-focused evaluation, scene-aware benchmarking, and learning-based spectral models for reliable UAV-based hyperspectral landmine detection. The code and pixel-level annotations will be released.
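Two of the benchmarked classical detectors are simple enough to state in a few lines: SAM scores the angle between each pixel spectrum and a target signature, and the matched filter whitens by the background covariance. Inputs are a (pixels, bands) cube and a target spectrum; the data below are synthetic.

```python
# Minimal numpy versions of SAM and the matched filter.
import numpy as np

def sam_scores(X, target):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    t = target / np.linalg.norm(target)
    return np.arccos(np.clip(Xn @ t, -1.0, 1.0))   # small angle = good match

def matched_filter(X, target):
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(cov, target - mu)          # whitened target direction
    return (X - mu) @ w / ((target - mu) @ w)      # normalized MF score

rng = np.random.default_rng(0)
bands = 30
background = rng.normal(0, 1, (1000, bands))
target = rng.normal(0, 1, bands)
scene = np.vstack([background, target + 0.1 * rng.normal(0, 1, (5, bands))])
print(sam_scores(scene, target)[-5:])       # near-zero angles at the targets
print(matched_filter(scene, target)[-5:])   # scores near 1 at the targets
```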
[564] FPGA Implementation of Sketched LiDAR for a 192 x 128 SPAD Image Sensor
Zhenya Zang, Mike Davies, Istvan Gyongy
Main category: eess.IV
TL;DR: FPGA implementation of polynomial spline compression algorithm for SPAD arrays achieves 512x compression ratio, enabling histogram-free online depth reconstruction in LiDAR systems.
Details
Motivation: Address the massive data transfer bandwidth challenge in high-resolution SPAD arrays where data rates reach tens of GB/s, overcoming the time-stamp transfer bottleneck for real-time depth reconstruction.
Method: Optimize polynomial spline function-based statistical compression algorithm using fixed-point arithmetic and LUTs, then implement online sketch processing elements (SPEs) on FPGA to directly process SPAD time-stamp streams.
Result: Achieves 512x compression ratio compared to conventional histogram-based outputs, validated with 192x128-pixel SPAD array in customized LiDAR setup, enabling high-fidelity histogram-free online depth reconstruction.
Conclusion: The FPGA implementation effectively alleviates SPAD array time-stamp transfer bottleneck, offers scalability for future higher-pixel-count SPADs, and demonstrates practical hardware solution for real-time depth sensing applications.
Abstract: This study presents an efficient field-programmable gate array (FPGA) implementation of a polynomial spline function-based statistical compression algorithm designed to address the critical challenge of massive data transfer bandwidth in emerging high-spatial-resolution single-photon avalanche diode (SPAD) arrays, where data rates can reach tens of gigabytes per second. In our experiments, the proposed hardware implementation achieves a compression ratio of 512x compared with conventional histogram-based outputs, with the potential for further improvement. The algorithm is first optimized in software using fixed-point (FXP) arithmetic and look-up tables (LUTs) to eliminate explicit additions, multiplications, and non-linear operations. This enables a careful balance between accuracy and hardware resource utilization. Guided by this trade-off analysis, online sketch processing elements (SPEs) are implemented on an FPGA to directly process time-stamp streams from the SPAD sensor. The implementation is validated using a customized LiDAR setup with a 192 x 128-pixel SPAD array. This work demonstrates histogram-free online depth reconstruction with high fidelity, effectively alleviating the time-stamp transfer bottleneck of SPAD arrays and offering scalability as pixel counts continue to increase for future SPADs.
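A toy view of the "sketch" idea: instead of shipping a full time-of-flight histogram per pixel, accumulate a handful of running statistics of the raw time-stamps online and recover depth from them. Low-order moments are used below purely as a stand-in for the paper's spline-based sketches; the bin counts and photon numbers are illustrative.

```python
# Moment sketch of SPAD time-stamps in place of a full histogram.
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_photons = 1024, 5000
true_bin = 317.0
stamps = np.clip(rng.normal(true_bin, 4.0, n_photons), 0, n_bins - 1)

# Online accumulation: k registers per pixel instead of n_bins counts.
k = 4
sketch = np.array([np.sum((stamps / n_bins) ** p) for p in range(1, k + 1)])

mean_est = sketch[0] / n_photons * n_bins    # depth peak from the 1st moment
print(f"estimated peak bin: {mean_est:.1f} (true {true_bin})")
print(f"kept {k} registers instead of {n_bins} histogram bins")
```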
[565] Training-Free Stimulus Encoding for Retinal Implants via Sparse Projected Gradient Descent
Henning Konermann, Yuli Wu, Emil Mededovic, Volkmar Schulz, Peter Walter, Johannes Stegmaier
Main category: eess.IV
TL;DR: A retinal implant encoding method using constrained sparse least-squares optimization with efficient solver that improves visual reconstruction fidelity compared to traditional downsampling approaches.
Details
Motivation: Retinal implants have fundamental limitations: low-resolution electrode arrays and patient-specific perceptual distortions. Current encoders use suboptimal task-agnostic downsampling and linear brightness mappings that don't account for realistic perceptual models.
Method: Formulates stimulus encoding as constrained sparse least-squares problem under a linearized perceptual forward model. Uses efficient projected residual norm steepest descent solver that exploits sparsity in perception matrix and supports stimulus bounds via projection.
Result: In silico experiments across four simulated patients and implant resolutions (15×15 to 100×100 electrodes) show improved reconstruction fidelity: up to +0.265 SSIM increase, +12.4 dB PSNR, and 81.4% MAE reduction on Fashion-MNIST compared to Lanczos downsampling.
Conclusion: The proposed constrained sparse least-squares approach with efficient solver significantly improves retinal implant encoding quality by exploiting sparsity in perceptual models, offering better visual reconstruction than traditional methods.
Abstract: Retinal implants aim to restore functional vision despite photoreceptor degeneration, yet are fundamentally constrained by low-resolution electrode arrays and patient-specific perceptual distortions. Most deployed encoders rely on task-agnostic downsampling and linear brightness-to-amplitude mappings, which are suboptimal under realistic perceptual models. While global inverse problems have been formulated as neural networks, such approaches can be fast at inference and achieve high reconstruction fidelity, but require training and have limited generalizability to arbitrary inputs. We cast stimulus encoding as a constrained sparse least-squares problem under a linearized perceptual forward model. Our key observation is that the resulting perception matrix can be highly sparse, depending on patient and implant configuration. Building on this, we apply an efficient projected residual norm steepest descent solver that exploits sparsity and supports stimulus bounds via projection. In silico experiments across four simulated patients and implant resolutions from $15\times15$ to $100\times100$ electrodes demonstrate improved reconstruction fidelity, with up to $+0.265$ SSIM increase, $+12.4\,\mathrm{dB}$ PSNR, and $81.4\%$ MAE reduction on Fashion-MNIST compared to Lanczos downsampling.
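The bounded sparse least-squares formulation admits a very short projected-gradient baseline (a plain gradient step plus a box projection, rather than the paper's steepest-descent variant): minimize ||A s - y||² subject to 0 <= s <= s_max with A a sparse perception matrix. Problem sizes, density, and the step rule below are illustrative assumptions.

```python
# Projected gradient descent for box-constrained sparse least squares.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
m, n = 400, 225                          # percept pixels, 15x15 electrodes
A = sparse.random(m, n, density=0.05, random_state=0, format="csr")
s_true = np.clip(rng.normal(0.5, 0.3, n), 0, 1)
y = A @ s_true                           # target percept

s = np.zeros(n)
L = svds(A, k=1, return_singular_vectors=False)[0] ** 2   # Lipschitz constant
for _ in range(500):
    grad = A.T @ (A @ s - y)             # sparse matvecs dominate the cost
    s = np.clip(s - grad / L, 0.0, 1.0)  # gradient step + box projection
print("relative error:", np.linalg.norm(s - s_true) / np.linalg.norm(s_true))
```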
[566] MITI: SLAM Benchmark for Laparoscopic Surgery
Regine Hartwig, Daniel Ostler, Jean-Claude Rosenthal, Hubertus Feußner, Dirk Wilhelm, Dirk Wollherr
Main category: eess.IV
TL;DR: A benchmark dataset (MITI) for evaluating stereoscopic visual-inertial computer vision algorithms in minimally invasive abdominal surgery, providing multimodal sensor data with ground truth IR tracking.
Details
Motivation: To provide a comprehensive clinical training dataset for evaluating and advancing visual-inertial algorithms (SLAM/SfM/3D reconstruction/VIO) specifically designed for minimally invasive surgical applications in the abdomen.
Method: Created the MITI Dataset by recording a complete handheld surgical intervention with multimodal sensors: IMU, stereoscopic video, and infrared tracking as ground truth. Includes calibration data for all sensors, rigid transformations, and time-offsets.
Result: A publicly available dataset containing a full abdominal scan with minimal cutting/tissue deformation, making it ideal for testing SLAM algorithms in surgical contexts.
Conclusion: The MITI Dataset enables researchers to enhance visual-inertial algorithms for minimally invasive surgery by providing necessary multimodal sensor data with ground truth for evaluation.
Abstract: We propose a new benchmark for evaluating stereoscopic visual-inertial computer vision algorithms (SLAM/SfM/3D reconstruction/visual-inertial odometry) for minimally invasive surgical (MIS) interventions in the abdomen. Our MITI Dataset, available at [https://mediatum.ub.tum.de/1621941], provides all the necessary data through a complete recording of a handheld surgical intervention at Research Hospital Rechts der Isar of TUM. It contains multimodal sensor information from IMU, stereoscopic video, and infrared (IR) tracking as ground truth for evaluation. Furthermore, calibrations for the stereoscope, accelerometer, and magnetometer, the rigid transformations in the sensor setup, and time-offsets are available. We deliberately chose a suitable intervention that involves very little cutting and tissue deformation and shows a full scan of the abdomen with a handheld camera, making it ideal for testing SLAM algorithms. Intending to promote the progress of visual-inertial algorithms designed for MIS applications, we hope that our clinical training dataset helps and enables researchers to enhance their algorithms.
[567] Airway Tree Modeling Using Dual-channel 3D UNet 3+ with Vesselness Prior
Hsiang-Chin Chien, Ching-Ping Wang, Jung-Chih Chen, Chia-Yen Lee
Main category: eess.IV
TL;DR: A dual-channel 3D UNet 3+ combined with Frangi filter for lung airway tree modeling from CT images, using vessel-like features to guide segmentation.
Details
Motivation: Lung airway tree modeling is crucial for pulmonary disease diagnosis from CT scans, providing 3D measurements like wall thickness. Existing approaches have limitations: model-based methods require manual parameter tuning, while deep learning methods like UNet variants need improvement for airway segmentation accuracy.
Method: Combines Frangi filter (for vessel-like feature extraction) with UNet 3+ architecture to create a dual-channel 3D UNet 3+. The Frangi filter extracts vessel-like features which are used as input to guide the training and testing procedures of the dual-channel network.
Result: The paper claims improved accuracy for lung airway tree modeling compared to other UNet variations, though specific quantitative results are not provided in the abstract.
Conclusion: The proposed dual-channel 3D UNet 3+ with Frangi filter guidance shows promise for enhancing lung airway tree segmentation accuracy in CT images, potentially improving pulmonary disease diagnosis.
Abstract: Lung airway tree modeling is essential for the diagnosis of pulmonary diseases, especially from X-ray computed tomography (CT). Airway tree modeling on CT images provides experts with 3-dimensional measurements such as wall thickness, which can tremendously aid the diagnosis of pulmonary diseases like chronic obstructive pulmonary disease [1-4]. Many scholars have attempted various ways to model the lung airway tree, and these can be split into two major categories: model-based approaches and deep learning approaches. The performance of a typical model-based approach usually depends on manual tuning of the model parameters, which is both its advantage and its disadvantage. The advantage is that it doesn't require a large amount of training data, which is beneficial for small datasets such as those common in medical imaging. On the other hand, the reliance on manual tuning can limit the performance of model-based approaches [5,6]. In recent years, deep learning has achieved good results in the field of medical image processing, and many scholars have used UNet-based methods for medical image segmentation [7-11]. Among all the variations of UNet, UNet 3+ [11] achieves relatively good results compared with the rest. Therefore, to further improve the accuracy of lung airway tree modeling, this study combines the Frangi filter [5] with UNet 3+ [11] to develop a dual-channel 3D UNet 3+. The Frangi filter is used to extract vessel-like features, which are then used as input to guide the dual-channel UNet 3+ training and testing procedures.
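To make the dual-channel idea concrete, a minimal sketch using scikit-image's frangi filter is shown below: the CT volume and its vesselness response are stacked as a two-channel network input. The HU window, filter scales, and normalization are assumptions for illustration, not values from the paper.

```python
import numpy as np
from skimage.filters import frangi

def dual_channel_input(ct_volume):
    """Stack a CT volume with its Frangi vesselness response as a
    two-channel input (2, D, H, W) for a dual-channel 3D network."""
    # Normalize intensities with an assumed lung HU window.
    v = np.clip(ct_volume, -1000, 400).astype(np.float32)
    v = (v + 1000.0) / 1400.0

    # Multi-scale Frangi vesselness; bright tubular structures enhanced.
    ves = frangi(v, sigmas=(1, 2, 3), black_ridges=False).astype(np.float32)
    ves /= ves.max() + 1e-8

    return np.stack([v, ves], axis=0)

# Toy usage on a synthetic volume (D, H, W).
vol = np.random.default_rng(0).normal(-700, 300, size=(32, 64, 64))
x = dual_channel_input(vol)   # shape (2, 32, 64, 64)
```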
[568] Deformation-Recovery Diffusion Model (DRDM): Instance Deformation for Image Manipulation and Synthesis
Jian-Qing Zheng, Yuanhan Mo, Yang Sun, Jiahua Li, Fuping Wu, Ziyang Wang, Tonia Vincent, Bartłomiej W. Papież
Main category: eess.IV
TL;DR: DRDM is a diffusion-based generative model for medical imaging that generates anatomically plausible deformation fields rather than direct images, enabling realistic data augmentation for tasks like few-shot learning and registration.
Details
Motivation: Existing diffusion models for medical imaging often lack interpretable connections between generated and real images, and can create anatomically implausible structures. There's a need for generative models that preserve anatomical integrity while enabling realistic data augmentation.
Method: Proposes Deformation-Recovery Diffusion Model (DRDM) that learns to generate topology-preserving deformation fields. Uses multi-scale Deformation Velocity Fields (DVFs) and trains the model to recover unrealistic deformation components, restoring randomly deformed images to realistic distributions.
Result: DRDM creates diverse, large-scale deformations while maintaining anatomical plausibility. Experiments on cardiac MRI and pulmonary CT show improved performance in 2D segmentation and 3D registration tasks compared to baseline methods.
Conclusion: DRDM offers a novel approach to generative modeling in medical imaging by focusing on deformation field generation rather than direct image synthesis, enabling anatomically plausible data augmentation that benefits downstream tasks like few-shot learning and registration.
Abstract: In medical imaging, diffusion models have shown great potential for synthetic image generation tasks. However, these approaches often lack interpretable connections between the generated and real images and can create anatomically implausible structures or illusions. To address these limitations, we propose the Deformation-Recovery Diffusion Model (DRDM), a novel diffusion-based generative model that emphasises morphological transformation through deformation fields rather than direct image synthesis. DRDM introduces a topology-preserving deformation field generation strategy, which randomly samples and integrates multi-scale Deformation Velocity Fields (DVFs). DRDM is trained to recover unrealistic deformation components, thus restoring randomly deformed images to a realistic distribution. This formulation enables the generation of diverse yet anatomically plausible deformations that preserve structural integrity, thereby improving data augmentation and synthesis for downstream tasks such as few-shot learning and image registration. Experiments on cardiac Magnetic Resonance Imaging and pulmonary Computed Tomography show that DRDM is capable of creating diverse, large-scale deformations, while maintaining anatomical plausibility of deformation fields. Additional evaluations on 2D image segmentation and 3D image registration tasks indicate notable performance gains, underscoring DRDM’s potential to enhance both image manipulation and generative modelling in medical imaging applications. Project page: https://jianqingzheng.github.io/def_diff_rec/
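For readers unfamiliar with velocity-field integration, the sketch below shows the standard scaling-and-squaring recipe for turning a smooth velocity field into a near-diffeomorphic, topology-preserving displacement field, which is the kind of deformation DRDM generates. This is a generic 2D illustration; the model's multi-scale DVF sampling and recovery training are not reproduced.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def integrate_velocity(vel, steps=6):
    """Scaling-and-squaring: integrate a stationary 2D velocity field
    of shape (2, H, W) into a near-diffeomorphic displacement field."""
    h, w = vel.shape[1:]
    grid = np.mgrid[0:h, 0:w].astype(np.float64)
    disp = vel / (2 ** steps)              # start from a tiny displacement
    for _ in range(steps):                 # square: d <- d + d o (id + d)
        coords = grid + disp
        disp = disp + np.stack([
            map_coordinates(disp[c], coords, order=1, mode="nearest")
            for c in range(2)
        ])
    return disp

def warp(image, disp):
    """Warp an image with a displacement field (linear interpolation)."""
    h, w = image.shape
    return map_coordinates(image, np.mgrid[0:h, 0:w] + disp,
                           order=1, mode="nearest")

# Toy usage: a smooth random velocity field deforms a square plausibly.
rng = np.random.default_rng(0)
vel = np.stack([gaussian_filter(v, 8) for v in rng.normal(0, 3, (2, 64, 64))])
vel *= 5.0 / (np.abs(vel).max() + 1e-8)    # ~5 px peak velocity
img = np.zeros((64, 64)); img[16:48, 16:48] = 1.0
warped = warp(img, integrate_velocity(vel))
```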
[569] Accurate, provable and fast polychromatic tomographic reconstruction: A variational inequality approach
Mengqi Lou, Kabir Aladin Verchand, Sara Fridovich-Keil, Ashwin Pananjady
Main category: eess.IV
TL;DR: EXACT is an iterative algorithm for single-material CT reconstruction that handles nonlinear forward models with exponential attenuation, polychromatic sources, and various noise types, offering improved sample and computational efficiency.
Details
Motivation: CT reconstruction faces challenges with nonlinear forward models accounting for exponential signal attenuation, polychromatic X-ray sources, and various noise types. Existing methods may require more X-ray views, higher source intensity, or more computation time than necessary.
Method: Developed EXACT (EXtragradient Algorithm for Computed Tomography), an iterative algorithm based on formulating the estimate as the fixed point of a monotone variational inequality. The method handles single-material reconstruction under realistic measurement assumptions.
Result: Proved statistical and computational performance guarantees under realistic measurement assumptions. For Gaussian measurement variants, EXACT achieves improved sample and iteration complexity bounds compared to existing algorithms. Applied to CT phantom recovery, EXACT often requires fewer X-ray views, lower source intensity, and less computation time while achieving similar reconstruction quality.
Conclusion: EXACT provides an efficient algorithm for CT reconstruction with nonlinear forward models, offering theoretical guarantees and practical improvements in sample efficiency, computational requirements, and radiation exposure compared to existing methods.
Abstract: We consider the problem of signal reconstruction for computed tomography (CT) under a nonlinear forward model that accounts for exponential signal attenuation, a polychromatic X-ray source, general measurement noise (e.g., Poisson shot noise), and observations acquired over multiple wavelength windows. We develop a simple iterative algorithm for single-material reconstruction, which we call EXACT (EXtragradient Algorithm for Computed Tomography), based on formulating our estimate as the fixed point of a monotone variational inequality. We prove guarantees on the statistical and computational performance of EXACT under realistic assumptions on the measurement process. We also consider a recently introduced variant of this model with Gaussian measurements and present sample and iteration complexity bounds for EXACT that improve upon those of existing algorithms. We apply our EXACT algorithm to a CT phantom image recovery task and show that it often requires fewer X-ray views, lower source intensity, and less computation time to achieve reconstruction quality similar to existing methods. Code is available at https://github.com/voilalab/exact.
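The extragradient template EXACT builds on is short enough to sketch. Below is the textbook Korpelevich iteration for a monotone variational inequality, demonstrated on a toy monotone affine operator over the nonnegative orthant; the paper's polychromatic CT operator and its guarantees are not reproduced here.

```python
import numpy as np

def extragradient(F, project, x0, tau, iters=500, tol=1e-9):
    """Extragradient (Korpelevich) iteration for a monotone variational
    inequality: find x* with <F(x*), x - x*> >= 0 over the feasible set."""
    x = x0
    for _ in range(iters):
        y = project(x - tau * F(x))        # extrapolation step
        x_next = project(x - tau * F(y))   # corrected step using F(y)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Toy usage: monotone affine operator F(x) = Ax + b (PSD part + skew part)
# over the nonnegative orthant; tau kept below 1/L for the Lipschitz
# constant L of F.
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M @ M.T / 20 + (M - M.T) / 5           # monotone by construction
b = rng.normal(size=20)
tau = 0.9 / np.linalg.norm(A, 2)
sol = extragradient(lambda x: A @ x + b,
                    lambda x: np.maximum(x, 0.0),
                    x0=np.zeros(20), tau=tau)
```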
[570] A UAV-Based VNIR Hyperspectral Benchmark Dataset for Landmine and UXO Detection
Sagar Lekhak, Emmett J. Ientilucci, Jasper Baur, Susmita Ghosh
Main category: eess.IV
TL;DR: A benchmark dataset of UAV-based VNIR hyperspectral imagery for landmine/UXO detection, collected over a controlled test field with 143 surrogate targets in various configurations.
Details
Motivation: To address the lack of open-access UAV-based hyperspectral data for landmine and unexploded ordnance detection research, providing a standardized benchmark for algorithm development and evaluation.
Method: Used a Headwall Nano-Hyperspec sensor on a UAV platform at 20.6m altitude to capture 270 spectral bands (398-1002 nm) over a test field with 143 surrogate targets. Applied radiometric calibration, orthorectification, mosaicking, and reflectance retrieval using Empirical Line Method with SVC spectroradiometer reference spectra.
Result: Created a high-fidelity hyperspectral dataset with RMSE values below 1.0 and SAM values between 1-6 degrees in the 400-900 nm range. The dataset includes raw radiance cubes, GCP/AeroPoint data, and reference spectra for reproducible research.
Conclusion: This dataset fills a critical gap in open-access UAV-based hyperspectral data for landmine detection and serves as a multi-sensor benchmark when combined with previously published drone-based electromagnetic induction data from the same test field.
Abstract: This paper introduces a novel benchmark dataset of Visible and Near-Infrared (VNIR) hyperspectral imagery acquired via an unmanned aerial vehicle (UAV) platform for landmine and unexploded ordnance (UXO) detection research. The dataset was collected over a controlled test field seeded with 143 realistic surrogate landmine and UXO targets, including surface, partially buried, and fully buried configurations. Data acquisition was performed using a Headwall Nano-Hyperspec sensor mounted on a multi-sensor drone platform, flown at an altitude of approximately 20.6 m, capturing 270 contiguous spectral bands spanning 398-1002 nm. Radiometric calibration, orthorectification, and mosaicking were performed followed by reflectance retrieval using a two-point Empirical Line Method (ELM), with reference spectra acquired using an SVC spectroradiometer. Cross-validation against six reference objects yielded RMSE values below 1.0 and SAM values between 1 and 6 degrees in the 400-900 nm range, demonstrating high spectral fidelity. The dataset is released alongside raw radiance cubes, GCP/AeroPoint data, and reference spectra to support reproducible research. This contribution fills a critical gap in open-access UAV-based hyperspectral data for landmine detection and offers a multi-sensor benchmark when combined with previously published drone-based electromagnetic induction (EMI) data from the same test field.
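The two-point Empirical Line Method used here for reflectance retrieval is a per-band linear fit between two reference panels; the sketch below illustrates it on synthetic data. Array names, shapes, and panel reflectances are assumptions for illustration.

```python
import numpy as np

def empirical_line(radiance_cube, rad_dark, rad_bright, refl_dark, refl_bright):
    """Two-point Empirical Line Method: per-band linear mapping from
    at-sensor radiance to surface reflectance, fit from a dark and a
    bright calibration panel with known reference spectra."""
    gain = (refl_bright - refl_dark) / (rad_bright - rad_dark)  # (bands,)
    offset = refl_dark - gain * rad_dark
    # Broadcast the per-band line over the (rows, cols, bands) cube.
    return radiance_cube * gain + offset

# Toy usage with a synthetic 270-band cube (matching the paper's 398-1002 nm).
bands = 270
rng = np.random.default_rng(0)
rad_dark = rng.uniform(5, 10, bands)                # dark panel radiance
rad_bright = rad_dark + rng.uniform(40, 60, bands)  # bright panel radiance
refl_dark = np.full(bands, 0.05)                    # assumed panel reflectances
refl_bright = np.full(bands, 0.90)
cube = rng.uniform(5, 70, size=(64, 64, bands))
refl = empirical_line(cube, rad_dark, rad_bright, refl_dark, refl_bright)
```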