Daily arXiv Papers - 2026-02-09

AI-enhanced summaries of the latest research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

Kaiyuan Zhang, Mohan Shi, Eray Eren, Natarajan Balaji Shankar, Zilai Wang, Abeer Alwan

Main category: eess.AS

TL;DR: STACodec: A unified audio codec that integrates semantic information into residual vector quantization via semantic token assignment, achieving better balance between acoustic fidelity and semantic capability than existing hybrid codecs.

Motivation: Traditional neural audio codecs preserve acoustic details well but lack semantic information, while recent hybrid codecs that incorporate semantic information through distillation often degrade reconstruction performance. There's a need for a unified approach that achieves both good acoustic fidelity and semantic capability.

Method: STACodec integrates semantic information from self-supervised learning models into the first layer of residual vector quantization via semantic token assignment. It also includes a semantic pre-distillation module that predicts semantic tokens directly for assignment during inference, eliminating reliance on SSL-based semantic tokenizers and improving efficiency.
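Code sketch: a minimal illustration of the semantic token assignment idea, assuming a toy RVQ stack with random stand-ins for the encoder output and SSL tokens; shapes, codebook sizes, and the assignment details are our illustration, not the paper's configuration. The first quantizer's codes are overridden with semantic tokens, and later layers quantize the remaining acoustic residual.

```python
# Hedged sketch of semantic token assignment (STA) in a residual VQ stack.
import numpy as np

rng = np.random.default_rng(0)
T, D, K, n_layers = 50, 64, 256, 4           # frames, dim, codebook size, RVQ depth
codebooks = rng.normal(size=(n_layers, K, D))
features = rng.normal(size=(T, D))           # encoder output (stand-in)
ssl_tokens = rng.integers(0, K, size=T)      # semantic tokens from an SSL tokenizer

residual, codes = features.copy(), []
for layer in range(n_layers):
    if layer == 0:
        idx = ssl_tokens                     # STA: assign semantic tokens to RVQ-1
    else:
        # ordinary nearest-codeword search on the residual
        d = ((residual[:, None, :] - codebooks[layer][None]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
    codes.append(idx)
    residual = residual - codebooks[layer][idx]  # later layers refine acoustics

print([c.shape for c in codes])  # 4 layers of (T,) token indices
```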

Result: Experimental results show STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.

Conclusion: STACodec successfully addresses the limitation of existing hybrid codecs by integrating semantic information into the quantization process while maintaining reconstruction quality, offering a unified solution for audio compression with both acoustic and semantic fidelity.

Abstract: Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.

Relevance: 9/10

[2] EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

Main category: cs.CV

TL;DR: EgoAVU introduces a scalable data engine for generating egocentric audio-visual narrations, questions, and answers to train multimodal LLMs for joint audio-visual understanding in egocentric videos.

Motivation: Current multimodal LLMs struggle with joint audio-visual understanding in egocentric videos due to lack of coherent joint-modality text labels, showing bias toward visual signals while neglecting audio cues.

Method: EgoAVU data engine automatically generates egocentric audio-visual narrations through cross-modal correlation modeling, with token-based video filtering and modular graph-based curation for quality and diversity. Creates EgoAVU-Instruct (3M training samples) and EgoAVU-Bench (evaluation dataset).

Result: EgoAVU-Bench reveals existing MLLMs’ limitations: heavy visual bias, audio neglect, and failure to correlate audio with visual sources. Finetuning on EgoAVU-Instruct achieves up to 113% performance improvement on EgoAVU-Bench and transfers to other benchmarks (up to 28% gain).

Conclusion: EgoAVU addresses the critical gap in joint audio-visual understanding for egocentric videos, enabling MLLMs to effectively process both modalities and showing strong transfer learning capabilities.

Abstract: Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they bias heavily toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.

Relevance: 9/10

[3] Scaling Speech Tokenizers with Diffusion Autoencoders

Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, Qing He, Xubo Liu

Main category: cs.SD

TL;DR: Speech Diffusion Tokenizer (SiTok) is a 1.6B parameter diffusion autoencoder that learns semantic-rich representations for speech understanding while enabling high-fidelity audio reconstruction at extremely low token rates (12.5 Hz) and bit rates (200 bps).

Motivation: Existing speech tokenizers face challenges in balancing semantic encoding for understanding vs. acoustic encoding for reconstruction, and achieving both low bit rates and low token rates simultaneously.

Method: Proposes SiTok, a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. Scaled to 1.6B parameters and trained on 2 million hours of speech.
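Code sketch: a back-of-envelope consistency check of the reported rates. This is our arithmetic, assuming a single codebook per frame, which the paper does not specify.

```python
# At 12.5 tokens/s and 200 bits/s, each token carries 200 / 12.5 = 16 bits,
# i.e. one code from a codebook of 2**16 = 65536 entries (single-codebook assumption).
token_rate_hz, bit_rate_bps = 12.5, 200
bits_per_token = bit_rate_bps / token_rate_hz
print(bits_per_token, 2 ** int(bits_per_token))  # 16.0 65536
```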

Result: Outperforms strong baselines on understanding, reconstruction, and generation tasks at extremely low token rate of 12.5 Hz and bit-rate of 200 bits-per-second.

Conclusion: SiTok successfully addresses the trade-off between semantic understanding and acoustic reconstruction in speech tokenization, achieving state-of-the-art performance at unprecedented low rates.

Abstract: Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit-rate of 200 bits-per-second.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Recontextualizing Famous Quotes for Brand Slogan Generation

Ziao Yang, Zizhang Chen, Lei Zhang, Hongfu Liu

Main category: cs.CL

TL;DR: A framework for generating novel advertising slogans by recontextualizing famous quotes, using modular subtasks to create brand-aligned, creative slogans that balance novelty with familiarity.

Motivation: Advertising fatigue reduces effectiveness of repeated slogans, creating demand for novel, creative slogan generation. Existing LLM approaches produce stylistically redundant outputs lacking brand persona and appearing overtly machine-generated.

Method: Proposes recontextualizing persona-related famous quotes for slogan generation. Introduces modular framework with interpretable subtasks: quote matching, structural decomposition, vocabulary replacement, and remix generation.
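Code sketch: a toy illustration of the pipeline's shape. In practice each stage would be an LLM call; all strings and helper names here are hypothetical stand-ins, not the paper's implementation.

```python
# Hedged sketch: match a quote, decompose its structure, swap brand vocabulary, remix.
def match_quote(brand_persona, quotes):
    return max(quotes, key=lambda q: sum(w in q.lower() for w in brand_persona))

def decompose(quote):                  # split into a reusable template
    head, _, tail = quote.partition(",")
    return head, tail

def replace_vocab(tail, brand_term):   # toy vocabulary replacement
    return tail.replace("country", brand_term)

quotes = ["Ask not what your country can do for you, ask what you can do for your country"]
head, tail = decompose(match_quote({"do", "ask"}, quotes))
print(head + "," + replace_vocab(tail, "coffee"))   # remix step (toy)
```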

Result: Extensive automatic and human evaluations show marginal improvements in diversity, novelty, emotional impact, and human preference over three state-of-the-art LLM baselines.

Conclusion: Famous quotes provide powerful resource for creative slogan generation, offering natural slogan-length text with rich rhetorical devices and depth. Modular framework enables effective balance of novelty with familiarity.

Abstract: Slogans are concise and memorable catchphrases that play a crucial role in advertising by conveying brand identity and shaping public perception. However, advertising fatigue reduces the effectiveness of repeated slogans, creating a growing demand for novel, creative, and insightful slogan generation. While recent work leverages large language models (LLMs) for this task, existing approaches often produce stylistically redundant outputs that lack a clear brand persona and appear overtly machine-generated. We argue that effective slogans should balance novelty with familiarity and propose a new paradigm that recontextualizes persona-related famous quotes for slogan generation. Well-known quotes naturally align with slogan-length text, employ rich rhetorical devices, and offer depth and insight, making them a powerful resource for creative generation. Technically, we introduce a modular framework that decomposes slogan generation into interpretable subtasks, including quote matching, structural decomposition, vocabulary replacement, and remix generation. Extensive automatic and human evaluations demonstrate marginal improvements in diversity, novelty, emotional impact, and human preference over three state-of-the-art LLM baselines.

[2] Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering

Jongha Kim, Byungoh Ko, Jeehye Na, Jinsung Yoon, Hyunwoo J. Kim

Main category: cs.CL

TL;DR: RMCD is a novel decoding method for retrieval-augmented LVLMs that aggregates information from multiple contexts by weighting each context’s contribution based on its relevance to the question, improving performance on knowledge-intensive visual QA tasks.

Motivation: Large Vision Language Models lack detailed entity knowledge, and while retrieval-augmented generation helps, previous decoding methods fail to effectively leverage multiple relevant contexts and suppress negative effects of irrelevant contexts.

Method: Proposes Relevance-aware Multi-context Contrastive Decoding (RMCD) that combines outputs predicted with each retrieved context, weighting each output based on its relevance to the question, enabling aggregation of useful information while counteracting irrelevant contexts.
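Code sketch: a minimal illustration of the core decoding step, assuming softmaxed relevance scores and a simple weighted mixture of per-context distributions; the paper's exact contrastive weighting may differ.

```python
# Hedged sketch of relevance-weighted multi-context decoding (placeholder inputs).
import numpy as np

def rmcd_step(logits_per_context, relevance):
    """Combine next-token logits from N retrieved contexts, weighting each
    context's prediction by its (softmaxed) relevance to the question."""
    w = np.exp(relevance - relevance.max())
    w /= w.sum()                                   # relevance -> weights
    probs = np.exp(logits_per_context)
    probs /= probs.sum(axis=-1, keepdims=True)     # per-context distributions
    mixed = (w[:, None] * probs).sum(axis=0)       # relevance-weighted mixture
    return mixed.argmax()

rng = np.random.default_rng(1)
logits = rng.normal(size=(3, 1000))   # 3 contexts, vocab of 1000
rel = np.array([2.0, 0.1, -1.5])      # context 0 is most relevant
print(rmcd_step(logits, rel))
```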

Result: RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving best performance on three knowledge-intensive visual question-answering benchmarks, and is robust to varying retrieval quality without requiring additional training.

Conclusion: RMCD provides an effective plug-and-play decoding method for retrieval-augmented LVLMs that improves knowledge-intensive visual QA by better leveraging multiple contexts and mitigating irrelevant information.

Abstract: Despite the remarkable capabilities of Large Vision Language Models (LVLMs), they still lack detailed knowledge about specific entities. Retrieval-augmented Generation (RAG) is a widely adopted solution that enhances LVLMs by providing additional contexts from an external Knowledge Base. However, we observe that previous decoding methods for RAG are sub-optimal as they fail to sufficiently leverage multiple relevant contexts and suppress the negative effects of irrelevant contexts. To this end, we propose Relevance-aware Multi-context Contrastive Decoding (RMCD), a novel decoding method for RAG. RMCD outputs a final prediction by combining outputs predicted with each context, where each output is weighted based on its relevance to the question. By doing so, RMCD effectively aggregates useful information from multiple relevant contexts while also counteracting the negative effects of irrelevant ones. Experiments show that RMCD consistently outperforms other decoding methods across multiple LVLMs, achieving the best performance on three knowledge-intensive visual question-answering benchmarks. Also, RMCD can be simply applied by replacing the decoding method of LVLMs without additional training. Analyses also show that RMCD is robust to the retrieval results, consistently performing the best across the weakest to the strongest retrieval results. Code is available at https://github.com/mlvlab/RMCD.

[3] CAST: Character-and-Scene Episodic Memory for Agents

Kexin Ma, Bojun Li, Yuhua Tang, Ruochun Jin, Liting Sun

Main category: cs.CL

TL;DR: CAST proposes a character-and-scene based episodic memory architecture for AI agents, inspired by dramatic theory, to better represent and retrieve coherent events compared to traditional semantic memory systems.

Motivation: Current agent memory systems focus on semantic recall using structures like key-value pairs, vectors, or graphs, but struggle to represent coherent episodic memories (who, when, where). There's a need for memory architectures that can better capture human-like episodic memory for events.

Method: CAST constructs 3D scenes organized by time, place, and topic, and organizes them into character profiles that summarize events. It complements this episodic memory with a graph-based semantic memory, creating a dual memory design.
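Code sketch: a hypothetical data-structure view of the dual memory design; field names and the summary format are ours, not the paper's.

```python
# Hedged sketch: (time, place, topic) scenes feeding per-character profiles,
# alongside a simple graph for semantic memory.
from dataclasses import dataclass, field

@dataclass
class Scene:
    time: str
    place: str
    topic: str
    events: list[str] = field(default_factory=list)

@dataclass
class CharacterProfile:
    name: str
    scenes: list[Scene] = field(default_factory=list)

    def summary(self) -> str:
        return f"{self.name}: " + "; ".join(
            f"[{s.time}@{s.place}/{s.topic}] " + " ".join(s.events)
            for s in self.scenes)

semantic_graph: dict[str, set[str]] = {}   # entity -> related entities

alice = CharacterProfile("Alice")
alice.scenes.append(Scene("Mon 9am", "office", "planning", ["proposed the demo"]))
semantic_graph.setdefault("Alice", set()).add("demo")
print(alice.summary())
```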

Result: CAST improves performance by 8.11% F1 and 10.21% J (LLM-as-a-Judge) compared to baselines across various datasets, with particularly strong results on open and time-sensitive conversational questions.

Conclusion: The character-and-scene based episodic memory architecture effectively addresses limitations of traditional semantic memory systems, enabling better representation and retrieval of coherent events in AI agents.

Abstract: Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems only emphasize semantic recall and treat experience as structures such as key-value, vector, or graph, which makes them struggle to represent and retrieve coherent events. To address this challenge, we propose a Character-and-Scene based memory architecture (CAST) inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments demonstrate that CAST improves F1 by 8.11% and J (LLM-as-a-Judge) by 10.21% on average over baselines on various datasets, especially on open and time-sensitive conversational questions.

[4] Rethinking Memory Mechanisms of Foundation Agents in the Second Half

Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen Wang, Xiongxiao Xu, Baixiang Huang, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Ahmed A. Metwally, Jun Yan, Chen-Yu Lee, Hanqing Zeng, Yinglong Xia, Xiaokai Wei, Ali Payani, Yu Wang, Haitong Ma, Wenya Wang, Chengguang Wang, Yu Zhang, Xin Wang, Yongfeng Zhang, Jiaxuan You, Hanghang Tong, Xiao Luo, Yizhou Sun, Wei Wang, Julian McAuley, James Zou, Jiawei Han, Philip S. Yu, Kai Shu

Main category: cs.CL

TL;DR: A comprehensive survey on foundation agent memory systems, analyzing memory substrates, cognitive mechanisms, and memory subjects to address real-world utility challenges in long-horizon AI interactions.

Motivation: The AI field is shifting from benchmark-focused research to real-world utility, where agents face context explosion in long-horizon, dynamic environments. Memory emerges as the critical solution to bridge the utility gap by enabling continuous accumulation, management, and selective reuse of information across extended interactions.

Method: The survey provides a unified framework for foundation agent memory across three dimensions: 1) Memory substrate (internal vs. external), 2) Cognitive mechanisms (episodic, semantic, sensory, working, and procedural memory), and 3) Memory subject (agent-centric vs. user-centric). It analyzes memory instantiation under different agent topologies and examines learning policies for memory operations.

Result: The paper organizes the rapidly growing field of agent memory (hundreds of papers released this year) into a coherent framework, highlighting how memory systems can address context explosion and enable real-world utility in dynamic, user-dependent environments.

Conclusion: Memory is identified as the critical component for achieving real utility in the “second half” of AI research. The survey provides a comprehensive taxonomy, evaluation benchmarks, and outlines open challenges for future research in foundation agent memory systems.

Abstract: The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the “second half,” the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.

[5] PersonaPlex: Voice and Role Control for Full Duplex Conversational Speech Models

Rajarshi Roy, Jonathan Raiman, Sang-gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, Bryan Catanzaro

Main category: cs.CL

TL;DR: PersonaPlex is a duplex conversational speech model that enables role-driven, voice-personalized speech-to-speech interactions through hybrid system prompts combining role conditioning and voice cloning.

Motivation: Existing duplex speech models are limited to fixed roles and voices, restricting their ability to support structured, role-driven real-world applications and personalized interactions. The authors aim to create a more flexible system that can adapt to different personas and voices.

Method: PersonaPlex incorporates hybrid system prompts combining role conditioning (via text prompts) with voice cloning (via speech samples). It’s trained on a large-scale synthetic dataset of paired prompts and user-agent conversations generated using open-source LLMs and TTS models. The authors also extend the Full-Duplex-Bench benchmark to multi-role customer service scenarios for evaluation.

Result: PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness. It surpasses state-of-the-art duplex speech models and hybrid LLM-based speech systems in role adherence, speaker similarity, latency, and naturalness.

Conclusion: PersonaPlex successfully addresses the limitations of existing duplex speech models by enabling flexible role-driven and voice-personalized conversational interactions, demonstrating superior performance in multi-role scenarios.

Abstract: Recent advances in duplex speech models have enabled natural, low-latency speech-to-speech interactions. However, existing models are restricted to a fixed role and voice, limiting their ability to support structured, role-driven real-world applications and personalized interactions. In this work, we introduce PersonaPlex, a duplex conversational speech model that incorporates hybrid system prompts, combining role conditioning with text prompts and voice cloning with speech samples. PersonaPlex is trained on a large-scale synthetic dataset of paired prompts and user-agent conversations, generated with open-source large language models (LLM) and text-to-speech (TTS) models. To evaluate role conditioning in real-world settings, we extend the Full-Duplex-Bench benchmark beyond a single assistant role to multi-role customer service scenarios. Experiments show that PersonaPlex achieves strong role-conditioned behavior, voice-conditioned speech, and natural conversational responsiveness, surpassing state-of-the-art duplex speech models and hybrid large language model-based speech systems in role adherence, speaker similarity, latency, and naturalness.

[6] Reading Between the Waves: Robust Topic Segmentation Using Inter-Sentence Audio Features

Steffen Freisinger, Philipp Seeberger, Tobias Bocklet, Korbinian Riedhammer

Main category: cs.CL

TL;DR: Multi-modal topic segmentation model combining text and audio encoders to leverage acoustic cues for improved segmentation of spoken content across multiple languages.

Motivation: Current topic segmentation methods for spoken content (videos, podcasts) don't fully utilize acoustic features, missing important cues that could improve segmentation accuracy and robustness.

Method: Proposes a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder to capture acoustic cues around sentence boundaries for better topic segmentation.
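Code sketch: a hedged architectural illustration, assuming mel-spectrogram inputs and a GRU audio encoder; the paper's encoders and fusion may differ. The key point is that one shared (Siamese) audio encoder embeds the audio on both sides of a candidate boundary.

```python
# Hedged sketch of a text + Siamese-audio boundary scorer (illustrative sizes).
import torch, torch.nn as nn

class BoundaryScorer(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512):
        super().__init__()
        self.audio_enc = nn.GRU(80, audio_dim, batch_first=True)  # shared (Siamese)
        self.head = nn.Linear(text_dim + 2 * audio_dim, 1)

    def forward(self, text_emb, mel_left, mel_right):
        # the same audio encoder embeds audio before and after the boundary
        _, h_l = self.audio_enc(mel_left)
        _, h_r = self.audio_enc(mel_right)
        fused = torch.cat([text_emb, h_l[-1], h_r[-1]], dim=-1)
        return self.head(fused)  # logit: is this a topic boundary?

m = BoundaryScorer()
score = m(torch.randn(2, 768), torch.randn(2, 100, 80), torch.randn(2, 100, 80))
print(score.shape)  # (2, 1)
```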

Result: Substantial gains over text-only and multi-modal baselines on a large-scale dataset of YouTube videos; more resilient to ASR noise; outperforms a larger text-only baseline on Portuguese, German, and English datasets.

Conclusion: Acoustic features provide valuable cues for robust topic segmentation across languages, demonstrating the importance of multi-modal approaches for spoken content analysis.

Abstract: Spoken content, such as online videos and podcasts, often spans multiple topics, which makes automatic topic segmentation essential for user navigation and downstream applications. However, current methods do not fully leverage acoustic features, leaving room for improvement. We propose a multi-modal approach that fine-tunes both a text encoder and a Siamese audio encoder, capturing acoustic cues around sentence boundaries. Experiments on a large-scale dataset of YouTube videos show substantial gains over text-only and multi-modal baselines. Our model also proves more resilient to ASR noise and outperforms a larger text-only baseline on three additional datasets in Portuguese, German, and English, underscoring the value of learned acoustic features for robust topic segmentation.

[7] What Is Novel? A Knowledge-Driven Framework for Bias-Aware Literature Originality Evaluation

Abeer Mostafa, Thi Huyen Nguyen, Zahra Ahmadi

Main category: cs.CL

TL;DR: A literature-aware novelty assessment framework that learns from peer-review reports to evaluate research novelty through structured comparison to prior work, using LLMs and similarity graphs.

Motivation: Research novelty assessment in peer review is subjective and based on incomplete comparisons. The authors aim to create a systematic, evidence-based approach that learns from human reviewers to improve consistency and reduce bias in novelty evaluation.

Method: Fine-tune a large language model on 80K novelty-annotated reviews from AI conferences. For each manuscript, extract structured representations of ideas/methods/claims, retrieve related papers, construct similarity graphs for concept-level comparison, and produce calibrated novelty scores with explanations.
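Code sketch: an illustrative take on concept-level comparison via a similarity graph, with random embeddings and an arbitrary threshold standing in for the paper's retrieval and scoring; the naive novelty score below (fraction of unmatched concepts) is our invention.

```python
# Hedged sketch: manuscript concepts vs. prior-work concepts as a similarity graph.
import numpy as np

rng = np.random.default_rng(7)
ms_concepts = rng.normal(size=(5, 32))       # manuscript ideas/methods/claims
prior_concepts = rng.normal(size=(40, 32))   # concepts from retrieved papers

sim = ms_concepts @ prior_concepts.T
sim /= (np.linalg.norm(ms_concepts, axis=1, keepdims=True)
        * np.linalg.norm(prior_concepts, axis=1))       # cosine similarity
edges = np.argwhere(sim > 0.5)               # graph: concept -> prior concept
novelty = (sim.max(axis=1) <= 0.5).mean()    # unmatched-concept fraction (toy)
print(len(edges), f"novelty={novelty:.2f}")
```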

Result: The system produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency compared to existing approaches.

Conclusion: The framework provides a systematic, evidence-based approach to novelty assessment that aligns with human reviewer judgments while offering more consistent and grounded evaluations.

Abstract: Assessing research novelty is a core yet highly subjective aspect of peer review, typically based on implicit judgment and incomplete comparison to prior work. We introduce a literature-aware novelty assessment framework that explicitly learns how humans judge novelty from peer-review reports and grounds these judgments in structured comparison to existing research. Using nearly 80K novelty-annotated reviews from top-tier AI conferences, we fine-tune a large language model to capture reviewer-aligned novelty evaluation behavior. For a given manuscript, the system extracts structured representations of its ideas, methods, and claims, retrieves semantically related papers, and constructs a similarity graph that enables fine-grained, concept-level comparison to prior work. Conditioning on this structured evidence, the model produces calibrated novelty scores and human-like explanatory assessments, reducing overestimation and improving consistency relative to existing approaches.

[8] Quantifying and Attributing Polarization to Annotator Groups

Dimitris Tsirmpas, John Pavlopoulos

Main category: cs.CL

TL;DR: A new metric for measuring polarization between annotator groups in subjective annotation tasks, addressing limitations of current agreement metrics.

Motivation: Current annotation agreement metrics are inadequate for inter-group analysis, sensitive to group size imbalances, and restricted to single-annotation settings, making them insufficient for subjective tasks like toxicity and hate-speech detection.

Method: Introduces a quantifiable metric paired with a statistical significance test that attributes polarization to various annotator groups, enabling comparisons between imbalanced sociodemographic subgroups across datasets and tasks, including multi-label settings.
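Code sketch: since the paper's exact metric is not given here, the sketch below only shows the general shape of such an analysis: a between-minus-within disagreement statistic for two imbalanced groups, with significance from a permutation test. The statistic is our stand-in, not the paper's definition.

```python
# Hedged sketch of a group-polarization statistic plus permutation test.
import numpy as np

def polarization(labels, groups):
    a, b = labels[groups == 0], labels[groups == 1]
    between = np.abs(a[:, None] - b[None, :]).mean()
    within = (np.abs(a[:, None] - a[None, :]).mean()
              + np.abs(b[:, None] - b[None, :]).mean()) / 2
    return between - within

rng = np.random.default_rng(2)
labels = np.r_[rng.binomial(1, 0.8, 40), rng.binomial(1, 0.3, 15)]  # imbalanced
groups = np.r_[np.zeros(40, int), np.ones(15, int)]

obs = polarization(labels, groups)
null = [polarization(labels, rng.permutation(groups)) for _ in range(2000)]
p = (np.sum(np.array(null) >= obs) + 1) / (len(null) + 1)
print(f"polarization={obs:.3f}, p={p:.4f}")
```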

Result: Applied to three hate speech datasets and one toxicity detection dataset, finding that: (1) polarization is strongly and persistently attributed to annotator race; (2) religious annotators disagree with other groups but not with each other; (3) less educated annotators are more subjective, while educated ones agree more among themselves. Also estimates the minimum number of annotators needed for robust results.

Conclusion: Provides a new metric for analyzing annotation polarization across demographic groups, with findings reflecting current annotation patterns, and offers an open-source Python library implementation.

Abstract: Current annotation agreement metrics are not well-suited for inter-group analysis, are sensitive to group size imbalances and restricted to single-annotation settings. These restrictions render them insufficient for many subjective tasks such as toxicity and hate-speech detection. For this reason, we introduce a quantifiable metric, paired with a statistical significance test, that attributes polarization to various annotator groups. Our metric enables direct comparisons between heavily imbalanced sociodemographic and ideological subgroups across different datasets and tasks, while also enabling analysis on multi-label settings. We apply this metric to three datasets on hate speech, and one on toxicity detection, discovering that: (1) Polarization is strongly and persistently attributed to annotator race, especially on the hate speech task. (2) Religious annotators do not fundamentally disagree with each other, but do with other annotators, a trend that is gradually diminished and then reversed with irreligious annotators. (3) Less educated annotators are more subjective, while educated ones tend to broadly agree more between themselves. Overall, our results reflect current findings around annotation patterns for various subgroups. Finally, we estimate the minimum number of annotators needed to obtain robust results, and provide an open-source Python library that implements our metric.

[9] Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding

Yanzheng Xiang, Lan Wei, Yizhen Yao, Qinglin Zhu, Hanqi Yan, Chen Jin, Philip Alexander Teare, Dandan Zhang, Lin Gui, Amrutha Saseendran, Yulan He

Main category: cs.CL

TL;DR: COVER is a novel verification method for parallel diffusion language model decoding that reduces flip-flop oscillations and unnecessary revisions through KV cache override and stability-aware seed selection.

Motivation: Existing parallel diffusion decoding methods suffer from flip-flop oscillations where tokens are repeatedly remasked and restored, slowing inference by weakening conditioning context and wasting revision budget without progress.

Method: COVER uses KV cache override to create two attention views in a single forward pass: selected seeds are masked for verification while their cached states are injected for other queries. It includes a diagonal correction to prevent self-leakage and uses stability-aware scoring to prioritize seeds based on uncertainty, downstream influence, and cache drift.
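Code sketch: a heavily hedged illustration of only the seed-prioritization step, with invented weights and inputs; the KV cache override and leave-one-out verification themselves are not reproduced here.

```python
# Hedged sketch of a stability-aware seed score in the spirit of COVER:
# combine uncertainty (entropy), downstream influence, and cache drift.
import numpy as np

def seed_scores(probs, influence, cache_drift, alpha=1.0, beta=0.5, gamma=0.5):
    entropy = -(probs * np.log(probs + 1e-9)).sum(-1)   # per-position uncertainty
    return alpha * entropy + beta * influence + gamma * cache_drift

rng = np.random.default_rng(3)
probs = rng.dirichlet(np.ones(50), size=16)   # 16 drafted positions, vocab 50
scores = seed_scores(probs, rng.random(16), rng.random(16))
k = 4                                          # re-verify the k least stable seeds
print(np.argsort(scores)[-k:])
```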

Result: COVER significantly reduces unnecessary revisions and achieves faster decoding while maintaining output quality across benchmarks compared to existing verification schemes.

Conclusion: COVER effectively addresses flip-flop oscillations in parallel diffusion decoding through efficient single-pass verification with KV cache override, leading to faster inference without quality degradation.

Abstract: Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key-value states are injected for all other queries to preserve contextual information, with a closed-form diagonal correction preventing self-leakage at the seed positions. COVER further prioritises seeds using a stability-aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.

[10] Uncertainty Drives Social Bias Changes in Quantized Large Language Models

Stanley Z. Hua, Sanae Lotfi, Irene Y. Chen

Main category: cs.CL

TL;DR: Quantization of large language models alters social biases in unpredictable ways, with up to 21% of responses flipping between biased/unbiased states despite unchanged aggregate scores, disproportionately affecting uncertain responses and creating asymmetric impacts across demographic groups.

Motivation: Current post-training quantization methods focus on computational efficiency but overlook how compression fundamentally changes model behavior, particularly social biases. Aggregate metrics fail to capture nuanced bias changes that occur during quantization, potentially leading to misleading assessments of model fairness.

Method: Conducted large-scale study of 50 quantized models using PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. Analyzed quantization-induced masked bias flipping phenomenon, measured uncertainty-driven response changes, and examined asymmetric impacts across demographic groups with varying quantization strengths (4-bit vs 8-bit).
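Code sketch: a small synthetic illustration of "masked bias flipping": balanced flips leave the aggregate bias rate unchanged while a large fraction of individual responses change. Labels are synthetic, not from the benchmark.

```python
# Hedged sketch: aggregate bias score masks per-item flips after quantization.
import numpy as np

rng = np.random.default_rng(4)
before = rng.binomial(1, 0.3, 1000)                 # 1 = biased response
biased_idx = np.flatnonzero(before == 1)
unbiased_idx = np.flatnonzero(before == 0)
n_flip = 105                                        # ~21% total, balanced directions
flip_idx = np.r_[rng.choice(biased_idx, n_flip, replace=False),
                 rng.choice(unbiased_idx, n_flip, replace=False)]
after = before.copy()
after[flip_idx] = 1 - after[flip_idx]

print(before.mean(), after.mean())                  # aggregate unchanged
print((before != after).mean())                     # yet 21% of items flipped
```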

Result: Quantization causes up to 21% of responses to flip between biased/unbiased states while aggregate scores remain unchanged. High-uncertainty responses are 3-11x more likely to change. 4-bit models show 4-6x more behavioral changes than 8-bit models. Bias worsens by up to 18.6% for some demographic groups while improving by 14.1% for others. Larger models show no consistent robustness advantage.

Conclusion: Compression fundamentally alters bias patterns in unpredictable ways, requiring crucial post-quantization evaluation and interventions. Current aggregate metrics are insufficient for assessing bias changes, and group-specific impacts vary unpredictably across model families, necessitating more nuanced evaluation approaches.

Abstract: Post-training quantization reduces the computational cost of large language models but fundamentally alters their social biases in ways that aggregate metrics fail to capture. We present the first large-scale study of 50 quantized models evaluated on PostTrainingBiasBench, a unified benchmark of 13 closed- and open-ended bias datasets. We identify a phenomenon we term quantization-induced masked bias flipping, in which up to 21% of responses flip between biased and unbiased states after quantization, despite showing no change in aggregate bias scores. These flips are strongly driven by model uncertainty, where the responses with high uncertainty are 3-11x more likely to change than the confident ones. Quantization strength amplifies this effect, with 4-bit quantized models exhibiting 4-6x more behavioral changes than 8-bit quantized models. Critically, these changes create asymmetric impacts across demographic groups, where bias can worsen by up to 18.6% for some groups while improving by 14.1% for others, yielding misleadingly neutral aggregate outcomes. Larger models show no consistent robustness advantage, and group-specific shifts vary unpredictably across model families. Our findings demonstrate that compression fundamentally alters bias patterns, requiring crucial post-quantization evaluation and interventions to ensure reliability in practice.

[11] BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Nishant Balepur, Bhavya Rajasekaran, Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: BenchMarker is an education-inspired toolkit that uses LLM judges to detect three common MCQ flaws in NLP benchmarks: contamination, shortcuts, and writing errors, revealing how these flaws degrade evaluation quality.

Motivation: Current multiple-choice question answering (MCQA) benchmarks in NLP lack rigorous quality control, containing flaws that degrade evaluation reliability. The paper aims to bridge education research with NLP by developing systematic quality assessment tools.

Method: Developed BenchMarker toolkit using LLM judges to flag three MCQ flaws: 1) contamination (items appearing exactly online), 2) shortcuts (cues enabling guessing), and 3) writing errors (structural/grammatical issues based on 19-rule education rubric). Validated with human annotations and applied to audit 12 benchmarks.

Result: Audit revealed: contaminated MCQs tend to inflate accuracy, writing errors tend to lower accuracy and change rankings beyond random. Prior benchmark repairs address targeted issues but inadvertently add new flaws like implausible distractors and multiple correct answers.

Conclusion: Flaws in MCQs degrade NLP evaluation, but education research offers solutions. BenchMarker bridges education and NLP fields to improve MCQA benchmark design. The toolkit is released for community use.

Abstract: Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing exactly online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 1) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 2) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (i.e., implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.

[12] Can One-sided Arguments Lead to Response Change in Large Language Models?

Pedro Cisneros-Velarde

Main category: cs.CL

TL;DR: LLMs can be steered to adopt specific viewpoints on polemic questions by providing one-sided arguments, with effectiveness varying across stance types, question formulations, and argument presentation methods.

Motivation: To investigate whether LLMs' initial balanced responses to polemic questions can be steered toward specific viewpoints using simple, intuitive methods like providing one-sided arguments, and to systematically study the factors influencing this opinion steering.

Method: Constructed a small dataset and systematically studied three dimensions: (i) which stance is induced in LLM responses, (ii) how polemic questions are formulated, (iii) how arguments are presented. Tested across diverse models, numbers of arguments, and topics.

Result: Remarkably found that opinion steering occurs across all three dimensions for diverse models, argument quantities, and topics. Switching to other arguments consistently decreases opinion steering effectiveness.

Conclusion: LLMs can be effectively steered toward specific viewpoints on polemic issues through simple presentation of one-sided arguments, with the steering effectiveness being robust across various factors but sensitive to argument consistency.

Abstract: Polemic questions need more than one viewpoint to express a balanced answer. Large Language Models (LLMs) can provide a balanced answer, but also take a single aligned viewpoint or refuse to answer. In this paper, we study if such initial responses can be steered to a specific viewpoint in a simple and intuitive way: by only providing one-sided arguments supporting the viewpoint. Our systematic study has three dimensions: (i) which stance is induced in the LLM response, (ii) how the polemic question is formulated, (iii) how the arguments are shown. We construct a small dataset and remarkably find that opinion steering occurs across (i)-(iii) for diverse models, number of arguments, and topics. Switching to other arguments consistently decreases opinion steering.

[13] Is my model “mind blurting”? Interpreting the dynamics of reasoning tokens with Recurrence Quantification Analysis (RQA)

Quoc Tuan Pham, Mehdi Jafari, Flora Salim

Main category: cs.CL

TL;DR: Proposes Recurrence Quantification Analysis (RQA) as a non-textual method to analyze reasoning models’ token generation dynamics using hidden embeddings, showing it captures signals beyond response length and improves task complexity prediction.

Motivation: Current methods for analyzing reasoning behavior in large models rely on generated text, which is impractical and unreliable. Response length is a poor proxy for reasoning effort as it fails to capture the dynamics and effectiveness of Chain of Thoughts or token generation patterns.

Method: Treats token generation as a dynamical system, extracts hidden embeddings at each generation step, and applies Recurrence Quantification Analysis (RQA) to the resulting trajectories. RQA metrics like Determinism and Laminarity quantify patterns of repetition and stalling in latent representations.
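Code sketch: a compact, self-contained RQA implementation using conventional (simplified) definitions of Determinism and Laminarity; the threshold heuristic, minimum line length, and toy trajectory are our choices, not the paper's.

```python
# Hedged RQA sketch over a trajectory of hidden states.
import numpy as np

def _run_mass(seq, lmin):
    """Total points lying in runs of 1s with length >= lmin."""
    mass = run = 0
    for v in list(seq) + [0]:
        if v:
            run += 1
        else:
            if run >= lmin:
                mass += run
            run = 0
    return mass

def rqa(X, eps=None, lmin=2):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    eps = np.quantile(D, 0.1) if eps is None else eps   # heuristic threshold
    R = (D <= eps).astype(int)                          # recurrence matrix
    n = len(R)
    recurrent = R.sum()
    diag_mass = sum(_run_mass(np.diagonal(R, k), lmin)
                    for k in range(-(n - 1), n) if k != 0)  # skip identity line
    vert_mass = sum(_run_mass(R[:, j], lmin) for j in range(n))
    det = diag_mass / max(recurrent - n, 1)   # Determinism: diagonal structure
    lam = vert_mass / max(recurrent, 1)       # Laminarity: vertical (stalling)
    return det, lam

# toy "hidden state" trajectory: a repeated loop yields high determinism
t = np.linspace(0, 4 * np.pi, 120)
X = np.c_[np.cos(t), np.sin(t)]
print(rqa(X))
```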

Result: Analyzed 3,600 generation traces from DeepSeek-R1-Distill, showing RQA captures signals not reflected by response length and improves prediction of task complexity by 8%.

Conclusion: RQA establishes itself as a principled tool for studying latent token generation dynamics in reasoning models during test-time scaling, providing insights beyond traditional text-based analysis.

Abstract: Test-time compute is central to large reasoning models, yet analysing their reasoning behaviour through generated text is increasingly impractical and unreliable. Response length is often used as a brute proxy for reasoning effort, but this metric fails to capture the dynamics and effectiveness of the Chain of Thoughts (CoT) or the generated tokens. We propose Recurrence Quantification Analysis (RQA) as a non-textual alternative for analysing a model’s reasoning chains at test time. By treating token generation as a dynamical system, we extract hidden embeddings at each generation step and apply RQA to the resulting trajectories. RQA metrics, including Determinism and Laminarity, quantify patterns of repetition and stalling in the model’s latent representations. Analysing 3,600 generation traces from DeepSeek-R1-Distill, we show that RQA not only captures signals not reflected by response length, but also substantially improves prediction of task complexity by 8%. These results help establish RQA as a principled tool for studying the latent token generation dynamics of test-time scaling in reasoning models.

[14] MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs

Junhyeok Lee, Han Jang, Kyu Sung Choi

Main category: cs.CL

TL;DR: MPIB is a benchmark for evaluating clinical safety of LLMs and RAG systems against prompt injection attacks, focusing on measuring clinical harm events rather than just attack success.

Motivation: LLMs and RAG systems are being integrated into clinical workflows, but prompt injection attacks can produce clinically unsafe outputs. Current benchmarks don't adequately measure clinical harm outcomes.

Method: Created Medical Prompt Injection Benchmark (MPIB) with 9,697 curated instances using multi-stage quality gates and clinical safety linting. Introduces Clinical Harm Event Rate (CHER) to measure high-severity clinical harm events, alongside Attack Success Rate (ASR).
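Code sketch: a toy illustration of why ASR and CHER can diverge; an attack may elicit compliance (counted by ASR) without producing a high-severity harm event (all that CHER counts). Compliance and severity labels here are synthetic.

```python
# Hedged sketch: ASR vs. CHER over synthetic adversarial instances.
records = [  # (attack_complied, harm_severity) per adversarial instance
    (True, "none"), (True, "high"), (True, "low"),
    (False, "none"), (True, "none"), (False, "none"),
]
asr = sum(c for c, _ in records) / len(records)
cher = sum(s == "high" for _, s in records) / len(records)
print(f"ASR={asr:.2f}, CHER={cher:.2f}")  # 0.67 vs 0.17: divergent
```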

Result: Found that ASR and CHER can diverge substantially, and robustness depends on whether adversarial instructions appear in user queries or retrieved context. Evaluated diverse LLMs and defense configurations.

Conclusion: MPIB enables systematic research on clinical prompt injection by providing comprehensive benchmark, evaluation code, adversarial baselines, and documentation for reproducible safety assessment.

Abstract: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly integrated into clinical workflows; however, prompt injection attacks can steer these systems toward clinically unsafe or misleading outputs. We introduce the Medical Prompt Injection Benchmark (MPIB), a dataset-and-benchmark suite for evaluating clinical safety under both direct prompt injection and indirect, RAG-mediated injection across clinically grounded tasks. MPIB emphasizes outcome-level risk via the Clinical Harm Event Rate (CHER), which measures high-severity clinical harm events under a clinically grounded taxonomy, and reports CHER alongside Attack Success Rate (ASR) to disentangle instruction compliance from downstream patient risk. The benchmark comprises 9,697 curated instances constructed through multi-stage quality gates and clinical safety linting. Evaluating MPIB across a diverse set of baseline LLMs and defense configurations, we find that ASR and CHER can diverge substantially, and that robustness depends critically on whether adversarial instructions appear in the user query or in retrieved context. We release MPIB with evaluation code, adversarial baselines, and comprehensive documentation to support reproducible and systematic research on clinical prompt injection. Code and data are available at GitHub (code) and Hugging Face (data).

[15] VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang

Main category: cs.CL

TL;DR: VowelPrompt enhances LLM-based speech emotion recognition by incorporating interpretable vowel-level prosodic features and using two-stage adaptation for improved performance and generalization.

Motivation: Current LLMs for speech emotion recognition focus on textual transcriptions but neglect fine-grained prosodic information, limiting effectiveness and interpretability. There's a need to combine linguistic content with vocal expressivity for better emotion understanding.

Method: Extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, converts them to natural language descriptions, and uses two-stage adaptation: supervised fine-tuning followed by Reinforcement Learning with Verifiable Reward via Group Relative Policy Optimization.
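Code sketch: a hedged illustration of the descriptor-to-text step, assuming vowel segments and frame-level pitch/energy arrive from an upstream aligner and pitch tracker (synthetic here); the thresholds and wording are ours, not the paper's.

```python
# Hedged sketch: vowel-level prosodic descriptors -> natural-language prompt.
import numpy as np

def describe_vowel(phone, f0, energy, dur, f0_med=160.0, en_med=0.05):
    pitch = "high" if np.nanmean(f0) > f0_med else "low"
    loud = "loud" if energy.mean() > en_med else "soft"
    length = "lengthened" if dur > 0.12 else "short"
    return f"vowel /{phone}/ is {length} ({dur*1000:.0f} ms), {pitch}-pitched, {loud}"

rng = np.random.default_rng(5)
segments = [("a", 0.18), ("i", 0.07)]          # (vowel, duration in seconds)
for phone, dur in segments:
    f0 = 160 + 40 * rng.standard_normal(10)    # per-frame pitch (Hz), synthetic
    en = np.abs(0.05 + 0.02 * rng.standard_normal(10))
    print(describe_vowel(phone, f0, en, dur))
```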

Result: Outperforms state-of-the-art emotion recognition methods across diverse benchmarks in zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while generating interpretable explanations.

Conclusion: VowelPrompt successfully integrates fine-grained prosodic cues with linguistic content for enhanced speech emotion recognition, offering improved performance, generalization, and interpretability through linguistically grounded vowel-level analysis.

Abstract: Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.

[16] RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution

Isaac Picov, Ritesh Goru

Main category: cs.CL

TL;DR: RoPE-LIME is a method for explaining outputs of closed-source LLMs using a smaller open-source surrogate model with improved perturbation strategies and similarity metrics.

Motivation: Closed-source LLMs present challenges for explanation because API access prevents gradient-based attribution methods, and existing perturbation methods are costly and noisy when they require regenerating text outputs.

Method: Uses a smaller open-source surrogate model to compute token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. Incorporates: (1) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (2) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets.
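Code sketch: a minimal take on the RWMD-based locality kernel, with uniform token weights and random embeddings standing in for RoPE-space representations; the kernel width is arbitrary.

```python
# Hedged sketch of a Relaxed Word Mover's Distance locality kernel.
import numpy as np

def rwmd(A, B):
    """Relaxed WMD: each token matches its nearest counterpart; take the
    larger of the two one-sided costs (a standard lower bound on WMD)."""
    D = np.linalg.norm(A[:, None] - B[None, :], axis=-1)
    return max(D.min(axis=1).mean(), D.min(axis=0).mean())

def locality_kernel(A, B, width=1.0):
    return np.exp(-rwmd(A, B) ** 2 / width ** 2)   # LIME-style proximity weight

rng = np.random.default_rng(6)
orig = rng.normal(size=(12, 64))          # 12 tokens, 64-dim embeddings
masked = orig[[i for i in range(12) if i not in (3, 7)]]   # perturbation: drop 2
print(locality_kernel(orig, masked))      # close to 1: small perturbation
```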

Result: Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.

Conclusion: RoPE-LIME provides an effective approach for explaining closed-source LLM outputs by decoupling reasoning from explanation and using efficient perturbation strategies with improved similarity metrics.

Abstract: Explaining closed-source LLM outputs is challenging because API access prevents gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text. We introduce RoPE-LIME, an open-source extension of gSMILE that decouples reasoning from explanation: given a fixed output from a closed model, a smaller open-source surrogate computes token-level attributions from probability-based objectives (negative log-likelihood and divergence targets) under input perturbations. RoPE-LIME incorporates (i) a locality kernel based on Relaxed Word Mover’s Distance computed in RoPE embedding space for stable similarity under masking, and (ii) Sparse-K sampling, an efficient perturbation strategy that improves interaction coverage under limited budgets. Experiments on HotpotQA (sentence features) and a hand-labeled MMLU subset (word features) show that RoPE-LIME produces more informative attributions than leave-one-out sampling and improves over gSMILE while substantially reducing closed-model API calls.

[17] Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Hyunwoo Ko, Amit Agarwal, Sunghee Ahn, Kyong-Ha Lee, Youngjae Yu

Main category: cs.CL

TL;DR: Consequence-Based Utility is an oracle-free evaluator that scores candidate solutions by testing their value as in-context exemplars for solving related verifiable questions, outperforming existing methods on research-level math problems.

Motivation: While reasoning models can generate plausible attempts for research-level mathematics, verification remains a bottleneck that consumes scarce expert time. The authors hypothesize that meaningful solutions should contain enough method-level information to help solve related questions.

Method: Proposes Consequence-Based Utility, an oracle-free evaluator that scores each candidate solution by testing its value as an in-context exemplar for solving related yet verifiable questions. The approach evaluates solutions based on their downstream performance on related problems rather than direct verification.
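Code sketch: a minimal illustration of the scoring loop, with a toy `solve` stand-in for the LLM call; in practice `solve` would prompt a model with the candidate as an in-context exemplar.

```python
# Hedged sketch of Consequence-Based Utility: score a candidate solution by
# its downstream help on related, verifiable questions.
def consequence_utility(candidate, neighbors, solve):
    """neighbors: list of (question, verifiable_answer) near the target problem."""
    hits = sum(solve(q, exemplar=candidate) == ans for q, ans in neighbors)
    return hits / len(neighbors)

# toy stand-in: an exemplar carrying the right "method" makes solving work
def solve(question, exemplar):
    return "42" if "telescoping" in exemplar else "?"

neighbors = [("related q1", "42"), ("related q2", "42"), ("related q3", "42")]
good = "Sum telescopes: the telescoping argument gives ..."
bad = "Hand-wavy claim with no method."
print(consequence_utility(good, neighbors, solve),   # 1.0
      consequence_utility(bad, neighbors, solve))    # 0.0
```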

Result: Consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. For GPT-OSS-120B, improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similar large AUC gains on GPT-OSS-20B (69.0 to 79.2). Also exhibits a larger solver-evaluator gap and maintains stronger correct-wrong separation.

Conclusion: Consequence-Based Utility provides an effective oracle-free approach for evaluating research-level mathematical solutions by leveraging their utility as in-context exemplars for related problems, addressing the verification bottleneck in mathematical reasoning.

Abstract: Recent progress in reasoning models suggests that generating plausible attempts for research-level mathematics may be within reach, but verification remains a bottleneck, consuming scarce expert time. We hypothesize that a meaningful solution should contain enough method-level information that, when applied to a neighborhood of related questions, it should yield better downstream performance than incorrect solutions. Building on this idea, we propose Consequence-Based Utility, an oracle-free evaluator that scores each candidate by testing its value as an in-context exemplar in solving related yet verifiable questions. Our approach is evaluated on an original set of research-level math problems, each paired with one expert-written solution and nine LLM-generated solutions. Notably, Consequence-Based Utility consistently outperforms reward models, generative reward models, and LLM judges on ranking quality. Specifically, for GPT-OSS-120B, it improves Acc@1 from 67.2 to 76.3 and AUC from 71.4 to 79.6, with similarly large AUC gains on GPT-OSS-20B (69.0 to 79.2). Furthermore, compared to LLM-Judges, it also exhibits a larger solver-evaluator gap, maintaining a stronger correct-wrong separation even on instances where the underlying solver often fails to solve.

[18] Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions

Nemika Tyagi, Holly Hendrix, Nelvin Licona-Guevara, Justin Mackie, Phanos Kareen, Muhammad Imran, Megan Michelle Smith, Tatiana Gallego Hernande, Chitta Baral, Olga Kellert

Main category: cs.CL

TL;DR: The paper addresses challenges in parsing spoken code-switching by introducing a new benchmark, evaluation metric, and decoupled parsing framework that improves performance on spoken language phenomena.

Motivation: Spoken code-switching presents unique challenges for syntactic parsing due to disfluencies, repetition, ellipsis, and discourse-driven structures that violate standard Universal Dependencies assumptions, causing existing parsers and LLMs to fail despite strong performance on written text.

Method: The authors introduce: 1) A linguistically grounded taxonomy of spoken CSW phenomena; 2) SpokeBench, an expert-annotated gold benchmark for spoken-language structure; 3) FLEX-UD, an ambiguity-aware evaluation metric; and 4) DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis.

Result: DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations reveal qualitative improvements that are masked by standard metrics.

Conclusion: The paper demonstrates that a systems-oriented approach with specialized handling of spoken language phenomena significantly improves parsing performance for spoken code-switching, and highlights the limitations of rigid evaluation metrics for this domain.

Abstract: Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.

[19] Can Post-Training Transform LLMs into Causal Reasoners?

Junqi Chen, Sirui Chen, Chaochao Lu

Main category: cs.CL

TL;DR: Post-training enhances LLMs’ causal inference capabilities, with smaller models achieving competitive performance against larger ones through targeted training on comprehensive causal datasets.

Motivation: Causal inference is crucial for decision-making but challenging for non-experts. While LLMs show promise, their causal estimation capabilities are limited, and the impact of post-training on these abilities is insufficiently explored.

Method: Introduces CauGym dataset with seven core causal tasks for training and five diverse test sets. Systematically evaluates five post-training approaches: SFT, DPO, KTO, PPO, and GRPO across five in-domain and four existing benchmarks.

Result: Appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. A 14B parameter model achieves 93.5% accuracy on the CaLM benchmark vs 55.4% by OpenAI o3. Post-trained LLMs show strong generalization and robustness under real-world conditions.

Conclusion: Targeted post-training can produce reliable and robust LLM-based causal reasoners, providing first systematic evidence of this capability enhancement.

Abstract: Causal inference is essential for decision-making but remains challenging for non-experts. While large language models (LLMs) show promise in this domain, their precise causal estimation capabilities are still limited, and the impact of post-training on these abilities is insufficiently explored. This paper examines the extent to which post-training can enhance LLMs’ capacity for causal inference. We introduce CauGym, a comprehensive dataset comprising seven core causal tasks for training and five diverse test sets. Using this dataset, we systematically evaluate five post-training approaches: SFT, DPO, KTO, PPO, and GRPO. Across five in-domain and four existing benchmarks, our experiments demonstrate that appropriate post-training enables smaller LLMs to perform causal inference competitively, often surpassing much larger models. Our 14B parameter model achieves 93.5% accuracy on the CaLM benchmark, compared to 55.4% by OpenAI o3. Furthermore, the post-trained LLMs exhibit strong generalization and robustness under real-world conditions such as distribution shifts and noisy data. Collectively, these findings provide the first systematic evidence that targeted post-training can produce reliable and robust LLM-based causal reasoners. Our data and GRPO-model are available at https://github.com/OpenCausaLab/CauGym.

[20] SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

Yewei Liu, Xiyuan Wang, Yansheng Mao, Yoav Gelbery, Haggai Maron, Muhan Zhang

Main category: cs.CL

TL;DR: SHINE is a scalable hypernetwork that maps diverse contexts into LoRA adapters for LLMs in a single forward pass, enabling in-context knowledge to become in-parameter knowledge without fine-tuning.

Motivation: To overcome limitations of prior hypernetworks and enable efficient adaptation of LLMs to diverse contexts without expensive fine-tuning, transforming in-context knowledge into in-parameter knowledge.

Method: Uses a scalable hypernetwork design that reuses frozen LLM parameters, introduces architectural innovations, and employs pretraining + instruction fine-tuning pipeline to generate high-quality LoRA adapters from diverse contexts in one forward pass.

Result: Achieves outstanding results on various tasks, greatly saves time, computation and memory costs compared to SFT-based LLM adaptation, and shows great potential for scaling.

Conclusion: SHINE enables efficient transformation of in-context knowledge to in-parameter knowledge, offering a scalable solution for LLM adaptation with significant resource savings.

Abstract: We propose SHINE (Scalable Hyper In-context NEtwork), a scalable hypernetwork that can map diverse meaningful contexts into high-quality LoRA adapters for large language models (LLMs). By reusing the frozen LLM’s own parameters in an in-context hypernetwork design and introducing architectural innovations, SHINE overcomes key limitations of prior hypernetworks and achieves strong expressive power with a relatively small number of parameters. We introduce a pretraining and instruction fine-tuning pipeline, and train our hypernetwork to generate high-quality LoRA adapters from diverse meaningful contexts in a single forward pass. It updates LLM parameters without any fine-tuning, and immediately enables complex question answering tasks related to the context without directly accessing the context, effectively transforming in-context knowledge to in-parameter knowledge in one pass. Our work achieves outstanding results on various tasks, greatly reduces the time, computation, and memory costs of SFT-based LLM adaptation, and shows great potential for scaling. Our code is available at https://github.com/Yewei-Liu/SHINE
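
A minimal sketch of the core context-to-LoRA mapping, assuming a pooled context representation and illustrative dimensions; this is not the authors' architecture, just the one-pass idea in code.

```python
import torch
import torch.nn as nn

class ContextToLoRA(nn.Module):
    """Hypothetical hypernetwork: context hidden states in, LoRA factors out."""
    def __init__(self, ctx_dim: int, target_dim: int, rank: int = 8):
        super().__init__()
        self.rank, self.d = rank, target_dim
        # One shared trunk, two heads emitting the flattened LoRA factors.
        self.trunk = nn.Sequential(nn.Linear(ctx_dim, 1024), nn.GELU())
        self.to_A = nn.Linear(1024, rank * target_dim)
        self.to_B = nn.Linear(1024, target_dim * rank)

    def forward(self, ctx_hidden: torch.Tensor):
        # ctx_hidden: (batch, seq, ctx_dim) hidden states of the context;
        # mean-pool into one summary vector per context.
        h = self.trunk(ctx_hidden.mean(dim=1))
        A = self.to_A(h).view(-1, self.rank, self.d)  # (batch, r, d)
        B = self.to_B(h).view(-1, self.d, self.rank)  # (batch, d, r)
        return A, B  # a frozen weight W is adapted as W + B @ A per context

hyper = ContextToLoRA(ctx_dim=4096, target_dim=4096)
A, B = hyper(torch.randn(2, 128, 4096))
print(A.shape, B.shape)  # torch.Size([2, 8, 4096]) torch.Size([2, 4096, 8])
```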

[21] Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production

Alberto Andres Valdes Gonzalez

Main category: cs.CL

TL;DR: Systematic comparison of LLM prompting vs fine-tuned encoders for text classification, showing encoders offer better cost/latency with competitive performance.

Motivation: While LLMs excel at open-ended tasks, their use for structured text classification often overlooks operational constraints like cost and latency in production systems.

Method: Compare zero-/few-shot LLM prompting vs fine-tuned BERT-family encoders across 4 benchmarks (IMDB, SST-2, AG News, DBPedia), evaluating F1, latency, and cost using Pareto frontiers and parameterized utility functions.

Result: Fine-tuned encoders achieve competitive/superior performance with 1-2 orders of magnitude lower cost and latency compared to LLM prompting.

Conclusion: Fine-tuned encoders are optimal for structured classification tasks, while LLMs should be used as complementary components in hybrid architectures rather than indiscriminately.

Abstract: Large language models (LLMs) such as GPT-4o and Claude Sonnet 4.5 have demonstrated strong capabilities in open-ended reasoning and generative language tasks, leading to their widespread adoption across a broad range of NLP applications. However, for structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone, overlooking operational constraints encountered in production systems. In this work, we present a systematic comparison of two contrasting paradigms for text classification: zero- and few-shot prompt-based large language models, and fully fine-tuned encoder-only architectures. We evaluate these approaches across four canonical benchmarks (IMDB, SST-2, AG News, and DBPedia), measuring predictive quality (macro F1), inference latency, and monetary cost. We frame model evaluation as a multi-objective decision problem and analyze trade-offs using Pareto frontier projections and a parameterized utility function reflecting different deployment regimes. Our results show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance while operating at one to two orders of magnitude lower cost and latency compared to zero- and few-shot LLM prompting. Overall, our findings suggest that indiscriminate use of large language models for standard text classification workloads can lead to suboptimal system-level outcomes. Instead, fine-tuned encoders emerge as robust and efficient components for structured NLP pipelines, while LLMs are better positioned as complementary elements within hybrid architectures. We release all code, datasets, and evaluation protocols to support reproducibility and cost-aware NLP system design.
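
A minimal sketch of the selection logic, with illustrative metrics; the utility form and weights below are assumptions standing in for the paper's parameterized function.

```python
import math

def pareto_frontier(models):
    """models: dicts with 'f1' (higher is better), 'latency_ms' and
    'cost_usd_per_1k' (lower is better). Keep the non-dominated ones."""
    def dominates(a, b):
        return (a["f1"] >= b["f1"] and a["latency_ms"] <= b["latency_ms"]
                and a["cost_usd_per_1k"] <= b["cost_usd_per_1k"] and a != b)
    return [m for m in models if not any(dominates(o, m) for o in models)]

def utility(m, w_quality=1.0, w_latency=0.1, w_cost=0.5):
    # Trade predictive quality against log-scaled operational costs.
    return (w_quality * m["f1"]
            - w_latency * math.log1p(m["latency_ms"])
            - w_cost * math.log1p(m["cost_usd_per_1k"]))

models = [
    {"name": "bert-finetuned", "f1": 0.93, "latency_ms": 8, "cost_usd_per_1k": 0.02},
    {"name": "llm-few-shot", "f1": 0.94, "latency_ms": 900, "cost_usd_per_1k": 4.0},
]
front = pareto_frontier(models)          # both survive: neither dominates
best = max(front, key=utility)           # the cheap, fast encoder wins here
print(best["name"])
```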

[22] ReBeCA: Unveiling Interpretable Behavior Hierarchy behind the Iterative Self-Reflection of Language Models with Causal Analysis

Tianqiang Yan, Sihan Shang, Yuheng Li, Song Qiu, Hao Peng, Wenjian Luo, Jue Xie, Lizhen Qu, Yuan Gao

Main category: cs.CL

TL;DR: ReBeCA framework uses causal analysis to understand self-reflection mechanisms in language models, revealing hierarchical behavioral influences and sparse genuine causal factors.

Motivation: Existing analyses of self-reflection in language models provide only correlation-based insights that fail to generalize, lacking understanding of underlying causal mechanisms.

Method: ReBeCA models self-reflection trajectories as causal graphs and uses a three-stage Invariant Causal Prediction (ICP) pipeline to isolate genuine determinants of performance.

Result: Identifies hierarchical semantic behaviors influencing self-reflection, shows generalizability limited to few behaviors, reveals that more behaviors ≠ better performance, with ICP achieving up to 49.6% structural likelihood gains.

Conclusion: ReBeCA provides rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics, with interventions confirming causal relationships hold out-of-distribution.

Abstract: While self-reflection can enhance language model reliability, its underlying mechanisms remain opaque, with existing analyses often yielding correlation-based insights that fail to generalize. To address this, we introduce \textbf{\texttt{ReBeCA}} (self-\textbf{\texttt{Re}}flection \textbf{\texttt{Be}}havior explained through \textbf{\texttt{C}}ausal \textbf{\texttt{A}}nalysis), a framework that unveils the interpretable behavioral hierarchy governing the self-reflection outcome. By modeling self-reflection trajectories as causal graphs, ReBeCA isolates genuine determinants of performance through a three-stage Invariant Causal Prediction (ICP) pipeline. We establish three critical findings: (1) \textbf{Behavioral hierarchy:} Semantic behaviors of the model influence final self-reflection results hierarchically: directly or indirectly; (2) \textbf{Causation matters:} Generalizability in self-reflection effects is limited to just a few semantic behaviors; (3) \textbf{More $\mathbf{\neq}$ better:} The confluence of seemingly positive semantic behaviors, even among direct causal factors, can impair the efficacy of self-reflection. ICP-based verification identifies sparse causal parents achieving up to $49.6\%$ structural likelihood gains, stable across tasks where correlation-based patterns fail. Intervention studies on novel datasets confirm these causal relationships hold out-of-distribution ($p = .013$, $\eta^2_\mathrm{p} = .071$). ReBeCA thus provides a rigorous methodology for disentangling genuine causal mechanisms from spurious associations in self-reflection dynamics.

[23] FMBench: Adaptive Large Language Model Output Formatting

Yaoting Wang, Yun Zhou, Henghui Ding

Main category: cs.CL

TL;DR: FMBench is a benchmark for evaluating LLMs’ ability to generate semantically correct and structurally compliant Markdown formatting, with a proposed alignment pipeline combining supervised and reinforcement learning fine-tuning.

Motivation: Markdown formatting is essential for LLM deployment in user-facing applications but prone to subtle errors that degrade usability; current models struggle with complex formatting requirements.

Method: Proposes FMBench benchmark for adaptive Markdown formatting evaluation, and a lightweight alignment pipeline combining supervised fine-tuning (SFT) with reinforcement learning fine-tuning using a composite objective balancing semantic fidelity and structural correctness.

Result: Experiments on OpenPangu and Qwen models show SFT improves semantic alignment, while reinforcement learning provides additional robustness to challenging Markdown instructions; reveals trade-off between semantic and structural objectives.

Conclusion: Carefully designed rewards are crucial for reliable formatted generation; the proposed pipeline effectively improves Markdown compliance without hard decoding constraints.

Abstract: Producing outputs that satisfy both semantic intent and format constraints is essential for deploying large language models in user-facing and system-integrated workflows. In this work, we focus on Markdown formatting, which is ubiquitous in assistants, documentation, and tool-augmented pipelines but still prone to subtle, hard-to-detect errors (e.g., broken lists, malformed tables, inconsistent headings, and invalid code blocks) that can significantly degrade downstream usability. We present FMBench, a benchmark for adaptive Markdown output formatting that evaluates models under a wide range of instruction-following scenarios with diverse structural requirements. FMBench emphasizes real-world formatting behaviors such as multi-level organization, mixed content (natural language interleaved with lists/tables/code), and strict adherence to user-specified layout constraints. To improve Markdown compliance without relying on hard decoding constraints, we propose a lightweight alignment pipeline that combines supervised fine-tuning (SFT) with reinforcement learning fine-tuning. Starting from a base model, we first perform SFT on instruction-response pairs, and then optimize a composite objective that balances semantic fidelity with structural correctness. Experiments on two model families (OpenPangu and Qwen) show that SFT consistently improves semantic alignment, while reinforcement learning provides additional gains in robustness to challenging Markdown instructions when initialized from a strong SFT policy. Our results also reveal an inherent trade-off between semantic and structural objectives, highlighting the importance of carefully designed rewards for reliable formatted generation. Code is available at: https://github.com/FudanCVL/FMBench.
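
A minimal sketch of a composite reward in this spirit; the toy scorers below are assumptions standing in for the paper's semantic-fidelity and structural-correctness signals.

```python
def semantic_score(response: str, reference: str) -> float:
    # Toy proxy: unigram overlap; a real system would use a learned scorer.
    r, g = set(response.lower().split()), set(reference.lower().split())
    return len(r & g) / max(len(g), 1)

def structural_score(response: str) -> float:
    # Toy Markdown checks: code fences must be balanced, and table rows
    # must share a consistent column count.
    fences_ok = response.count("```") % 2 == 0
    rows = [line for line in response.splitlines() if line.strip().startswith("|")]
    tables_ok = len({line.count("|") for line in rows}) <= 1
    return (fences_ok + tables_ok) / 2

def composite_reward(response: str, reference: str, alpha: float = 0.5) -> float:
    # alpha trades semantic fidelity against structural correctness.
    return alpha * semantic_score(response, reference) + (1 - alpha) * structural_score(response)
```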

[24] Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding

Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki

Main category: cs.CL

TL;DR: SureLock accelerates masked diffusion language models by skipping computations for stabilized token positions while maintaining generation quality.

Motivation: Current masked diffusion language models waste computational resources by recomputing attention and feed-forward blocks for all token positions at every step, even when many unmasked tokens are essentially fixed and no longer changing.

Method: Proposes SureLock: when the posterior distribution at an unmasked position has stabilized across steps (the “sure condition”), the system locks that position - skipping its query projection and feed-forward sublayers while caching its attention keys and values so other positions can still attend to it.

Result: On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30-50% relative to the same sampler without locking, while maintaining comparable generation quality. The method reduces computational cost from O(N²d) to O(MNd) where M decreases as iteration progresses.

Conclusion: SureLock provides an efficient optimization for masked diffusion language models that significantly reduces computational costs without sacrificing generation quality, with theoretical analysis supporting its design rationale.

Abstract: Masked Diffusion Language Models generate sequences via iterative sampling that progressively unmasks tokens. However, they still recompute the attention and feed-forward blocks for every token position at every step – even when many unmasked tokens are essentially fixed, resulting in substantial waste in compute. We propose SureLock: when the posterior at an unmasked position has stabilized across steps (our sure condition), we lock that position – thereafter skipping its query projection and feed-forward sublayers – while caching its attention keys and values so other positions can continue to attend to it. This reduces the dominant per-iteration computational cost from $O(N^2d)$ to $O(MNd)$ where $N$ is the sequence length, $M$ is the number of unlocked token positions, and $d$ is the model dimension. In practice, $M$ decreases as the iteration progresses, yielding substantial savings. On LLaDA-8B, SureLock reduces algorithmic FLOPs by 30–50% relative to the same sampler without locking, while maintaining comparable generation quality. We also provide a theoretical analysis to justify the design rationale of SureLock: monitoring only the local KL at the lock step suffices to bound the deviation in final token probabilities. Our code will be available at https://daioba.github.io/surelock .
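
A minimal sketch of the sure condition, assuming per-position posteriors are available at consecutive denoising steps; the threshold and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def kl_div(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    # Per-position KL(p || q) over the vocabulary dimension.
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return (p.exp() * (p - q)).sum(dim=-1)

def update_locks(prev_logits, curr_logits, unmasked, locked, eps=1e-4):
    """prev_logits/curr_logits: (N, vocab) posteriors at consecutive steps;
    unmasked/locked: (N,) bool masks. A position whose posterior has
    stabilized (local KL below eps) gets locked: its query projection and
    FFN sublayers are skipped thereafter, while its cached attention keys
    and values remain visible to the still-active positions."""
    stable = kl_div(curr_logits, prev_logits) < eps
    return locked | (unmasked & stable)
```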

[25] On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang

Main category: cs.CL

TL;DR: HOMER: A humor-theory-driven multi-role LLM collaboration framework for generating funny image captions using GTVH humor theory with retrieval augmentation and hierarchical imagination trees.

Motivation: Humor generation in multimodal scenarios is challenging for LLMs, requiring visual understanding, humor reasoning, and creativity. Existing LLM-based approaches have limited creativity and interpretability.

Method: Proposes HOMER framework with three LLM-based roles: 1) conflicting-script extractor for script oppositions, 2) retrieval-augmented hierarchical imaginator with imagination trees for creative associations, 3) caption generator for funny captions.

Result: Extensive experiments on New Yorker Cartoon datasets show HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multimodal humor captioning.

Conclusion: HOMER effectively addresses humor generation challenges by integrating humor theory with multi-role LLM collaboration and retrieval augmentation, producing funnier and more diverse captions.

Abstract: Humor is a commonly used yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs); it is typically framed as funny caption generation for images, requiring visual understanding, humor reasoning, creative imagination, and so on. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, GTVH. To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmarking datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.

[26] Investigating the structure of emotions by analyzing similarity and association of emotion words

Fumitaka Iwaki, Tatsuji Takahashi

Main category: cs.CL

TL;DR: The paper examines the validity of Plutchik’s wheel of emotion by constructing semantic networks from emotion word similarity/association data and comparing network structures to the wheel model.

Motivation: Plutchik's wheel of emotion is widely used in NLP sentiment analysis, but its validity has not been sufficiently examined. The study aims to empirically validate this emotion model through semantic network analysis.

Method: Collected similarity and association data for ordered pairs of emotion words, constructed semantic networks from this data, and analyzed network structures through community detection. Compared these structures with Plutchik’s wheel of emotion model.

Result: Network structures were mostly similar to the wheel of emotion but showed local differences. This suggests the wheel model has general validity but may need refinement for specific applications.

Conclusion: Plutchik’s wheel of emotion has general validity as an emotion model, though local structural differences exist that could inform refinements for NLP applications.

Abstract: In the field of natural language processing, some studies have attempted sentiment analysis on text by handling emotions as explanatory or response variables. One of the most popular emotion models used in this context is the wheel of emotion proposed by Plutchik. This model schematizes human emotions in a circular structure, and represents them in two or three dimensions. However, the validity of Plutchik’s wheel of emotion has not been sufficiently examined. This study investigated the validity of the wheel by creating and analyzing semantic networks of emotion words. Through our experiments, we collected similarity and association data for ordered pairs of emotion words, and constructed networks using these data. We then analyzed the structure of the networks through community detection, and compared it with that of the wheel of emotion. The results showed that each network’s structure was, for the most part, similar to that of the wheel of emotion, but locally different.
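
A minimal sketch of the pipeline with toy similarity ratings (the word pairs and values below are hypothetical); the communities found in the network can then be compared against the wheel's sectors.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

similarity = {  # hypothetical averaged ratings for ordered word pairs
    ("joy", "ecstasy"): 0.9, ("joy", "serenity"): 0.7,
    ("fear", "terror"): 0.9, ("fear", "apprehension"): 0.6,
    ("joy", "fear"): 0.1,
}

G = nx.Graph()
for (u, v), w in similarity.items():
    if w > 0.3:  # keep only sufficiently similar pairs as edges
        G.add_edge(u, v, weight=w)

communities = greedy_modularity_communities(G, weight="weight")
print([sorted(c) for c in communities])
# e.g., [['ecstasy', 'joy', 'serenity'], ['apprehension', 'fear', 'terror']]
```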

[27] TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

Sung-Hoon Yoon, Ruizhi Qian, Minda Zhao, Weiyue Li, Mengyu Wang

Main category: cs.CL

TL;DR: History-aware RL framework for jailbreaking LLMs that analyzes vulnerability signals from prior interaction steps to improve attack efficiency and success rates.

Motivation: Most existing jailbreaking techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions where each response influences future actions, reinforcement learning provides a natural framework for this problem.

Method: Proposes a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. Introduces an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries.

Result: Extensive experiments on AdvBench and HarmBench demonstrate state-of-the-art jailbreak performance while significantly improving query efficiency. Incorporating historical information alone improves jailbreak success rates.

Conclusion: The results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.

Abstract: Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
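
A minimal sketch of attention-based reweighting over interaction history; representing past steps as vectors of "vulnerability signals" is an assumption for illustration, and the names and shapes are not the paper's.

```python
import numpy as np

def reweight_history(history: np.ndarray, query: np.ndarray) -> np.ndarray:
    """history: (T, d) embeddings of prior attack steps; query: (d,)
    embedding of the current state. Returns a (d,) context vector that
    emphasizes the most vulnerability-revealing past steps."""
    scores = history @ query / np.sqrt(history.shape[1])  # scaled dot product
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over steps
    return weights @ history                              # weighted summary

context = reweight_history(np.random.randn(5, 64), np.random.randn(64))
```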

[28] CORE: Comprehensive Ontological Relation Evaluation for Large Language Models

Satyam Dwivedi, Sanjukta Ghosh, Shivam Dwivedi, Nishi Kumari, Anil Thakur, Anurag Purushottam, Deepak Alok, Praveen Gatla, Manjuprasad B, Bipasha Patgiri

Main category: cs.CL

TL;DR: CORE benchmark reveals LLMs struggle to distinguish meaningful semantic relations from genuine unrelatedness, showing near-ceiling performance on related pairs but severe degradation on unrelated pairs.

Motivation: Current LLM evaluations rarely assess the ability to distinguish meaningful semantic relations from genuine unrelatedness, which is crucial for robust reasoning and safety.

Method: Created CORE dataset with 225K multiple-choice questions across 74 disciplines, plus a benchmark of 203 validated questions covering 24 semantic relation types with equal representation of related and unrelated pairs.

Result: 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%). On the full 225K dataset, accuracy drops to ~2%.

Conclusion: Unrelatedness reasoning is a critical, under-evaluated frontier for LLM evaluation and safety, revealing systematic generation of spurious relations and poor calibration on unrelated pairs.

Abstract: Large Language Models (LLMs) perform well on many reasoning benchmarks, yet existing evaluations rarely assess their ability to distinguish between meaningful semantic relations and genuine unrelatedness. We introduce CORE (Comprehensive Ontological Relation Evaluation), a dataset of 225K multiple-choice questions spanning 74 disciplines, together with a general-domain open-source benchmark of 203 rigorously validated questions (Cohen’s Kappa = 1.0) covering 24 semantic relation types with equal representation of unrelated pairs. A human baseline from 1,000+ participants achieves 92.6% accuracy (95.1% on unrelated pairs). In contrast, 29 state-of-the-art LLMs achieve 48.25-70.9% overall accuracy, with near-ceiling performance on related pairs (86.5-100%) but severe degradation on unrelated pairs (0-41.35%), despite assigning similar confidence (92-94%). Expected Calibration Error increases 2-4x on unrelated pairs, and a mean semantic collapse rate of 37.6% indicates systematic generation of spurious relations. On the CORE 225K MCQs dataset, accuracy further drops to approximately 2%, highlighting substantial challenges in domain-specific semantic reasoning. We identify unrelatedness reasoning as a critical, under-evaluated frontier for LLM evaluation and safety.
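
For reference, a minimal sketch of Expected Calibration Error, the calibration statistic reported above (it rises 2-4x on unrelated pairs in the paper's experiments); the equal-width binning is the standard formulation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1]; correct: 0/1 labels.
    ECE = sum over bins of |accuracy - mean confidence|, weighted by the
    fraction of samples falling in each bin."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# An overconfident model on unrelated pairs: high confidence, low accuracy.
print(expected_calibration_error([0.93, 0.94, 0.92, 0.95], [0, 1, 0, 0]))
```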

[29] Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

Xinxin Lin, Guangxin Dai, Yi Zhong, Xiang Li, Xue Xiao, Yixin Zhang, Zhengdong Wu, Yongbo Zheng, Runchuan Zhu, Ming Zhao, Huizi Yu, Shuo Wu, Jun Zhao, Lingming Hu, Yumei Wang, Ping Yin, Joey W. Y. Chan, Ngan Yin Chan, Sijing Chen, Yun Kwok Wing, Lin Lu, Xin Ma, Lizhou Fan

Main category: cs.CL

TL;DR: ClinMPO is a reinforcement learning framework that aligns light-parameter LLMs with psychiatric clinical reasoning using evidence-based medicine principles, enabling them to outperform medical students on complex diagnostic cases.

Motivation: Current LLMs have limitations in psychiatry due to hallucinations and superficial reasoning, especially problematic for light-parameter models needed for clinical deployment. Existing training prioritizes linguistic fluency over structured clinical logic, creating misalignment with professional diagnostic cognition.

Method: ClinMPO uses reinforcement learning with a specialized reward model trained on 4,474 psychiatry journal articles structured according to evidence-based medicine principles. The framework aligns LLM internal reasoning with professional psychiatric practice.

Result: ClinMPO-tuned Qwen3-8B model achieved 31.4% diagnostic accuracy on complex cases where leading large-parameter LLMs fail, surpassing the human benchmark of 30.8% from 300 medical students.

Conclusion: Medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks, suggesting explicit cognitive alignment offers a scalable pathway to reliable psychiatric decision support.

Abstract: Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items where leading large-parameter LLMs consistently fail. We compared the performance of the ClinMPO-aligned light LLM against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.

[30] RelayGen: Intra-Generation Model Switching for Efficient Reasoning

Jiwon Song, Yoongon Kim, Jae-Joon Kim

Main category: cs.CL

TL;DR: RelayGen is a training-free framework that dynamically switches between large and small models during long-form reasoning generation based on difficulty variation, reducing inference latency while preserving accuracy.

Motivation: Large reasoning models incur high deployment costs due to generating long reasoning trajectories, but existing efficiency approaches either ignore intra-generation difficulty variation or require complex supervised token-level routing.

Method: Uses offline analysis of generation uncertainty via token probability margins to identify difficulty transitions, then implements segment-level runtime model switching where easier segments are delegated to smaller models while difficult reasoning stays on the large model.

Result: Substantially reduces inference latency while preserving most large model accuracy; achieves up to 2.2× end-to-end speedup with <2% accuracy degradation when combined with speculative decoding.

Conclusion: RelayGen demonstrates that coarse-grained segment-level control effectively captures difficulty variation in reasoning, enabling efficient model switching without additional training or learned routing components.

Abstract: Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks by generating long, multi-step reasoning trajectories, but inference-time scaling incurs substantial deployment cost. A key challenge is that generation difficulty varies within a single output, whereas existing efficiency-oriented approaches either ignore this intra-generation variation or rely on supervised token-level routing with high system complexity. We present \textbf{RelayGen}, a training-free, segment-level runtime model switching framework that exploits difficulty variation in long-form reasoning. Through offline analysis of generation uncertainty using token probability margins, we show that coarse-grained segment-level control is sufficient to capture difficulty transitions within a reasoning trajectory. RelayGen identifies model-specific switch cues that signal transitions to lower-difficulty segments and dynamically delegates their continuation to a smaller model, while preserving high-difficulty reasoning on the large model. Across multiple reasoning benchmarks, RelayGen substantially reduces inference latency while preserving most of the accuracy of large models. When combined with speculative decoding, RelayGen achieves up to 2.2$\times$ end-to-end speedup with less than 2% accuracy degradation, without requiring additional training or learned routing components.
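
A minimal sketch of the uncertainty signal, using the margin between the top two token probabilities as a difficulty proxy; the switching threshold is illustrative, not the paper's calibrated switch cue.

```python
import torch

def segment_margin(logits: torch.Tensor) -> float:
    """logits: (seq, vocab) for one decoded segment. A large mean margin
    between the top-1 and top-2 probabilities indicates an 'easy' segment."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values          # (seq, 2)
    return (top2[:, 0] - top2[:, 1]).mean().item()

def choose_model(logits: torch.Tensor, threshold: float = 0.5) -> str:
    # Delegate easy continuations to the small model; keep hard segments
    # on the large model.
    return "small" if segment_margin(logits) > threshold else "large"
```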

[31] Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Main category: cs.CL

TL;DR: DiSPO improves masked diffusion language models by optimizing intermediate token filling decisions through branching and credit assignment, without requiring additional diffusion rollouts.

Motivation: Masked diffusion language models generate iteratively but only receive terminal rewards on final completions, leading to coarse credit assignment over intermediate decisions.

Method: DiSPO branches at intermediate masked states by resampling fillings from cached logits, scores resulting completions, and updates only newly filled tokens without additional multi-step diffusion rollouts.

Result: On LLaDA-8B-Instruct, DiSPO consistently improves over terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched compute and optimizer steps.

Conclusion: DiSPO provides an effective plug-in credit-assignment layer for diffusion language models that optimizes intermediate decisions and improves performance on reasoning tasks.

Abstract: Masked diffusion language models generate by iteratively filling masked tokens over multiple denoising steps, so learning only from a terminal reward on the final completion yields coarse credit assignment over intermediate decisions. We propose DiSPO (Diffusion-State Policy Optimization), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling fillings for the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens – without additional multi-step diffusion rollouts. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that can be combined with terminal-feedback policy optimization using the same rollouts. On LLaDA-8B-Instruct, DiSPO consistently improves over the terminal-feedback diffu-GRPO baseline on math and planning benchmarks under matched rollout compute and optimizer steps. Our code will be available at https://daioba.github.io/dispo .

[32] Improve Large Language Model Systems with User Logs

Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu

Main category: cs.CL

TL;DR: UNO is a framework that uses user interaction logs to improve LLMs by distilling logs into rules/preferences, clustering data, and quantifying knowledge gaps to filter noise and enhance responses.

Motivation: Traditional scaling of LLMs faces data scarcity and diminishing returns; user logs offer rich feedback but are unstructured and noisy, requiring better methods to extract useful signals.

Method: UNO distills logs into semi-structured rules and preference pairs, uses query-and-feedback-driven clustering for data heterogeneity, quantifies cognitive gaps between model knowledge and logs, and adaptively filters noise to construct modules for primary/reflective experiences.

Result: UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming RAG and memory-based baselines in extensive experiments.

Conclusion: UNO provides a unified framework for leveraging user logs to improve LLMs, addressing noise and knowledge gaps while enabling continual learning from real-world deployment.

Abstract: Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model’s prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

[33] Revisiting the Shape Convention of Transformer Language Models

Feng-Ting Liao, Meng-Hsi Chen, Guan-Ting Yi, Da-shan Shiu

Main category: cs.CL

TL;DR: Hourglass-shaped feed-forward networks (FFNs) outperform conventional narrow-wide-narrow FFNs in Transformers, enabling more efficient parameter allocation and comparable/better performance across model scales up to 1B parameters.

Motivation: To challenge the long-standing convention of narrow-wide-narrow MLP shapes in Transformer FFNs, motivated by recent findings that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities.

Method: Developed a Transformer variant that replaces conventional FFNs with deeper hourglass-shaped FFNs comprising stacked hourglass sub-MLPs connected by residual pathways, enabling lighter FFNs with saved parameters reallocated to other components like attention.

Result: Hourglass FFNs outperform conventional FFNs up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets.

Conclusion: The findings prompt rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN parameters towards more efficient and expressive language models.

Abstract: Dense Transformer language models have largely adhered to one consistent architectural shape: each layer consists of an attention module followed by a feed-forward network (FFN) with a narrow-wide-narrow MLP, allocating most parameters to the MLP at expansion ratios between 2 and 4. Motivated by recent results that residual wide-narrow-wide (hourglass) MLPs offer superior function approximation capabilities, we revisit the long-standing MLP shape convention in Transformer, challenging the necessity of the narrow-wide-narrow design. To study this, we develop a Transformer variant that replaces the conventional FFN with a deeper hourglass-shaped FFN, comprising a stack of hourglass sub-MLPs connected by residual pathways. We posit that a deeper but lighter hourglass FFN can serve as a competitive alternative to the conventional FFN, and that parameters saved by using a lighter hourglass FFN can be more effectively utilized, such as by enlarging model hidden dimensions under fixed budgets. We confirm these hypotheses through empirical validation across model scales: hourglass FFNs outperform conventional FFNs at scales up to 400M parameters and achieve comparable performance at larger scales up to 1B parameters; hourglass FFN variants with reduced FFN and increased attention parameters show consistent improvements over conventional configurations at matched budgets. Together, these findings shed new light on recent work and prompt a rethinking of the narrow-wide-narrow MLP convention and the balance between attention and FFN towards efficient and expressive modern language models.
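
A minimal sketch of a residual hourglass FFN stack, with illustrative dimensions; the paper's exact depth, widths, and placement may differ.

```python
import torch
import torch.nn as nn

class HourglassFFN(nn.Module):
    """A stack of residual wide-narrow-wide sub-MLPs, replacing the usual
    single narrow-wide-narrow block."""
    def __init__(self, d_model: int = 512, bottleneck: int = 128, depth: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, bottleneck),  # wide -> narrow
                nn.GELU(),
                nn.Linear(bottleneck, d_model),  # narrow -> wide
            )
            for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual pathways connect the stacked hourglass sub-MLPs.
        for block in self.blocks:
            x = x + block(x)
        return x

ffn = HourglassFFN()
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```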

[34] Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevant Assessment for IR Benchmarks

Minjeong Ban, Jeonghwan Choi, Hyangsuk Min, Nicole Hee-Yeon Kim, Minseok Kim, Jae-Gil Lee, Hwanjun Song

Main category: cs.CL

TL;DR: DREAM is a multi-round debate-based relevance assessment framework using LLM agents to improve IR evaluation by reducing human effort while increasing accuracy, and BRIDGE is a refined benchmark dataset created using this framework.

Motivation: Current IR evaluation faces challenges due to incomplete benchmark datasets with unlabeled relevant chunks. Existing LLM and hybrid approaches suffer from LLM overconfidence and ineffective AI-to-human escalation, leading to biased evaluations.

Method: DREAM uses a multi-round debate framework with LLM agents taking opposing initial stances and engaging in iterative reciprocal critique. The agreement-based debate yields accurate labeling for clear cases and reliable AI-to-human escalation for uncertain ones.

Result: Achieves 95.2% labeling accuracy with only 3.5% human involvement. Creates BRIDGE benchmark with 29,824 missing relevant chunks, enabling fairer retriever comparison and revealing that unaddressed evaluation holes distort retriever rankings and cause retrieval-generation misalignment.

Conclusion: DREAM provides an effective framework for improving IR evaluation through LLM debate, reducing human effort while increasing accuracy, and BRIDGE enables more reliable benchmarking by addressing dataset incompleteness issues.

Abstract: Information retrieval (IR) evaluation remains challenging due to incomplete IR benchmark datasets that contain unlabeled relevant chunks. While LLMs and LLM-human hybrid strategies reduce costly human effort, they remain prone to LLM overconfidence and ineffective AI-to-human escalation. To address this, we propose DREAM, a multi-round debate-based relevance assessment framework with LLM agents, built on opposing initial stances and iterative reciprocal critique. Through our agreement-based debate, it yields more accurate labeling for certain cases and more reliable AI-to-human escalation for uncertain ones, achieving 95.2% labeling accuracy with only 3.5% human involvement. Using DREAM, we build BRIDGE, a refined benchmark that mitigates evaluation bias and enables fairer retriever comparison by uncovering 29,824 missing relevant chunks. We then re-benchmark IR systems and extend evaluation to RAG, showing that unaddressed holes not only distort retriever rankings but also drive retrieval-generation misalignment. The relevance assessment framework is available at https://github.com/DISL-Lab/DREAM-ICLR-26; and the BRIDGE dataset is available at https://github.com/DISL-Lab/BRIDGE-Benchmark.
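
A minimal sketch of agreement-based debate with escalation; `judge` is a hypothetical LLM-agent call, and the stopping rule is illustrative rather than the paper's protocol.

```python
from typing import Callable

def debate_relevance(
    query: str,
    chunk: str,
    judge: Callable[[str, str, str, str], str],  # (query, chunk, stance, transcript) -> verdict
    rounds: int = 3,
):
    """Two agents start from opposing stances and critique each other; a
    shared verdict yields a confident auto-label, persistent disagreement
    escalates to a human annotator."""
    transcript = ""
    for _ in range(rounds):
        a = judge(query, chunk, "argue relevant", transcript)
        b = judge(query, chunk, "argue not relevant", transcript)
        transcript += f"\nA: {a}\nB: {b}"
        if a == b:                       # agreement -> certain case
            return a, "auto"
    return None, "escalate-to-human"     # uncertain case -> human review
```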

[35] MTQE.en-he: Machine Translation Quality Estimation for English-Hebrew

Andy Rosenbaum, Assaf Siani, Ilan Kernerman

Main category: cs.CL

TL;DR: First publicly available English-Hebrew benchmark for Machine Translation Quality Estimation with human-annotated quality scores, showing ensemble methods outperform single models and parameter-efficient fine-tuning works best.

Motivation: Address the lack of publicly available benchmarks for English-Hebrew machine translation quality estimation, enabling research on this under-resourced language pair.

Method: Created MTQE.en-he benchmark with 959 English segments, Hebrew translations, and expert human assessments. Benchmarked ChatGPT prompting, TransQuest, and CometKiwi models, experimented with ensembling and fine-tuning approaches including LoRA, BitFit, and head-only fine-tuning.

Result: Ensembling three models outperformed best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Parameter-efficient fine-tuning methods trained stably and yielded 2-3 percentage point improvements, while full-model updates suffered from overfitting.

Conclusion: MTQE.en-he enables future research on English-Hebrew translation quality estimation, with ensemble methods and parameter-efficient fine-tuning showing promising results for this under-resourced language pair.

Abstract: We release MTQE.en-he: to our knowledge, the first publicly available English-Hebrew benchmark for Machine Translation Quality Estimation. MTQE.en-he contains 959 English segments from WMT24++, each paired with a machine translation into Hebrew, and Direct Assessment scores of the translation quality annotated by three human experts. We benchmark ChatGPT prompting, TransQuest, and CometKiwi and show that ensembling the three models outperforms the best single model (CometKiwi) by 6.4 percentage points Pearson and 5.6 percentage points Spearman. Fine-tuning experiments with TransQuest and CometKiwi reveal that full-model updates are sensitive to overfitting and distribution collapse, yet parameter-efficient methods (LoRA, BitFit, and FTHead, i.e., fine-tuning only the classification head) train stably and yield improvements of 2-3 percentage points. MTQE.en-he and our experimental results enable future research on this under-resourced language pair.
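
A minimal sketch of a score-level ensemble for QE; z-normalizing before averaging is an assumption (a simple way to keep any one model from dominating by scale), and the numbers below are toy values, not the released scores.

```python
import numpy as np
from scipy.stats import spearmanr

def ensemble(scores_per_model: list) -> np.ndarray:
    # Z-normalize each system's quality estimates, then average.
    z = [(s - s.mean()) / (s.std() + 1e-8) for s in scores_per_model]
    return np.mean(z, axis=0)

chatgpt = np.array([0.70, 0.20, 0.90, 0.40])
transquest = np.array([0.60, 0.30, 0.80, 0.50])
cometkiwi = np.array([0.65, 0.25, 0.95, 0.45])
human_da = np.array([70, 30, 92, 48])  # Direct Assessment scores

combined = ensemble([chatgpt, transquest, cometkiwi])
print(spearmanr(combined, human_da).correlation)  # rank agreement with humans
```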

[36] Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

Baichuan-M3 Team, :, Chengfeng Dou, Fan Yang, Fei Li, Jiyuan Jia, Qiang Ju, Shuai Wang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Hongda Zhang, Jinyang Tai, Linzhuang Sun, Peidong Guo, Yichuan Mo, Xiaochuan Wang, Hengfu Cui, Zhishou Zhang

Main category: cs.CL

TL;DR: Baichuan-M3 is a medical-enhanced LLM that shifts from passive QA to active clinical decision support with capabilities for proactive information gathering, long-horizon reasoning, and hallucination suppression.

Motivation: Existing medical AI systems have limitations in open-ended consultations and lack systematic workflow modeling of physicians, needing to move beyond passive question-answering to active clinical decision support.

Method: Utilizes specialized training pipeline to model physician systematic workflow, with key capabilities: proactive information acquisition, long-horizon reasoning to unify scattered evidence, and adaptive hallucination suppression.

Result: Achieves state-of-the-art results on HealthBench, HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety.

Conclusion: Baichuan-M3 represents a paradigm shift in medical AI from passive QA to active clinical decision support, with publicly available models demonstrating superior performance.

Abstract: We introduce Baichuan-M3, a medical-enhanced large language model engineered to shift the paradigm from passive question-answering to active, clinical-grade decision support. Addressing the limitations of existing systems in open-ended consultations, Baichuan-M3 utilizes a specialized training pipeline to model the systematic workflow of a physician. Key capabilities include: (i) proactive information acquisition to resolve ambiguity; (ii) long-horizon reasoning that unifies scattered evidence into coherent diagnoses; and (iii) adaptive hallucination suppression to ensure factual reliability. Empirical evaluations demonstrate that Baichuan-M3 achieves state-of-the-art results on HealthBench, the newly introduced HealthBench-Hallu and ScanBench, significantly outperforming GPT-5.2 in clinical inquiry, advisory and safety. The models are publicly available at https://huggingface.co/collections/baichuan-inc/baichuan-m3.

[37] Inference-Time Rethinking with Latent Thought Vectors for Math Reasoning

Deqian Kong, Minglu Zhao, Aoyang Qin, Bo Pang, Chenxin Tao, David Hartmann, Edouardo Honig, Dehong Xu, Amit Kumar, Matt Sarte, Chuan Li, Jianwen Xie, Ying Nian Wu

Main category: cs.CL

TL;DR: A framework called Inference-Time Rethinking enables iterative self-correction in reasoning by decoupling latent thought vectors from procedural generation, allowing gradient-based optimization over reasoning strategies.

Motivation: Standard chain-of-thought reasoning commits irrevocably to each token without recovery mechanisms for early errors, lacking iterative refinement capabilities.

Method: Factorizes reasoning into continuous latent thought vectors (what to reason about) and a decoder that verbalizes traces conditioned on these vectors. Uses a prior model mapping noise to valid reasoning patterns and a Gibbs-style procedure alternating between generating candidate traces and optimizing latent vectors.

Result: A 0.2B-parameter model trained on GSM8K with 30 rethinking iterations surpasses baselines with 10-15 times more parameters (including a 3B model), demonstrating effective mathematical reasoning emerges from sophisticated inference-time computation.

Conclusion: Effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts, introducing a new paradigm for iterative self-correction in reasoning.

Abstract: Standard chain-of-thought reasoning generates a solution in a single forward pass, committing irrevocably to each token and lacking a mechanism to recover from early errors. We introduce Inference-Time Rethinking, a generative framework that enables iterative self-correction by decoupling declarative latent thought vectors from procedural generation. We factorize reasoning into a continuous latent thought vector (what to reason about) and a decoder that verbalizes the trace conditioned on this vector (how to reason). Beyond serving as a declarative buffer, latent thought vectors compress the reasoning structure into a continuous representation that abstracts away surface-level token variability, making gradient-based optimization over reasoning strategies well-posed. Our prior model maps unstructured noise to a learned manifold of valid reasoning patterns, and at test time we employ a Gibbs-style procedure that alternates between generating a candidate trace and optimizing the latent vector to better explain that trace, effectively navigating the latent manifold to refine the reasoning strategy. Training a 0.2B-parameter model from scratch on GSM8K, our method with 30 rethinking iterations surpasses baselines with 10 to 15 times more parameters, including a 3B counterpart. This result demonstrates that effective mathematical reasoning can emerge from sophisticated inference-time computation rather than solely from massive parameter counts.

[38] Echoes as Anchors: Probabilistic Costs and Attention Refocusing in LLM Reasoning

Zhuoyuan Hao, Zhuo Li, Wu Li, Fangming Liu, Min Zhang, Jing Li

Main category: cs.CL

TL;DR: Echo of Prompt (EOP) analysis shows that large reasoning models naturally repeat questions, which can be harnessed as a compute-shaping mechanism to improve reasoning performance through echo-then-reason patterns.

Motivation: Current approaches to test-time compute allocation in large reasoning models either use task-agnostic tokens or ignore the spontaneous repetition that models exhibit. The paper aims to analyze and harness this natural tendency to restate questions (Echo of Prompt) as a front-loaded compute mechanism.

Method: Formalizes echo removal as rejection-based conditioning and defines Echo Likelihood Gap as a computable proxy. Develops Echo-Distilled SFT (ED-SFT) to instill echo-then-reason patterns through supervised finetuning, and Echoic Prompting (EP) to re-ground models mid-trace without training.

Result: EOP increases answer to answer-prefix attention in middle layers, consistent with attention refocusing. Evaluation on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 shows consistent gains over baselines under identical decoding settings and budgets.

Conclusion: The Echo of Prompt phenomenon provides a natural compute-shaping mechanism that can be harnessed to improve reasoning performance in large models through both training-based (ED-SFT) and prompting-based (EP) approaches.

Abstract: Test-time compute allocation in large reasoning models (LRMs) is widely used and has applications in mathematical problem solving, code synthesis, and planning. Recent work has addressed this problem by scaling self-consistency and parallel thinking, adding ``generic thinking tokens'' and prompting models to re-read the question before answering. Unfortunately, these approaches either inject task-agnostic tokens or mandate heuristics that do not explain -- and often ignore -- the \emph{spontaneous} repetition that many LRMs exhibit at the head of their internal chains. In contrast, we analyze and harness the model's tendency to restate the question, which we term the \emph{Echo of Prompt (EOP)}, as a front-loaded, compute-shaping mechanism. We formalize its probabilistic cost by casting echo removal as rejection-based conditioning and defining the \emph{Echo Likelihood Gap} $\Delta\mathcal{L}$ as a computable proxy. This provides the missing theoretical bridge that links early repetition to likelihood gains and downstream accuracy. However, it does not by itself specify how to exploit EOP. Consequently, we develop \emph{Echo-Distilled SFT (ED-SFT)} to instill an ``echo-then-reason'' pattern through supervised finetuning, and \emph{Echoic Prompting (EP)} to re-ground the model mid-trace without training. While promising, quantifying benefits beyond verbosity is non-trivial. Therefore, we conduct length and suffix-controlled likelihood analyses together with layer-wise attention studies, showing that EOP increases answer to answer-prefix attention in middle layers, consistent with an \emph{attention refocusing} mechanism. We evaluate on GSM8K, MathQA, Hendrycks-MATH, AIME24, and MATH-500 under identical decoding settings and budgets, and find consistent gains over baselines. Code is available at https://github.com/hhh2210/echoes-as-anchors.
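
A minimal sketch of the Echo Likelihood Gap as described above; `logprob` is a hypothetical scoring helper, and the echo template is illustrative.

```python
from typing import Callable

def echo_likelihood_gap(
    prompt: str,
    answer: str,
    logprob: Callable[[str, str], float],  # logprob(text, context) -> log p(text | context)
) -> float:
    """Compare the answer's log-likelihood when the trace opens with a
    restatement of the prompt versus when it does not. A positive gap
    means the echo raises the likelihood of the eventual answer."""
    echoed_ctx = prompt + "\nRestating the question: " + prompt + "\n"
    plain_ctx = prompt + "\n"
    return logprob(answer, echoed_ctx) - logprob(answer, plain_ctx)
```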

[39] Do Prompts Guarantee Safety? Mitigating Toxicity from LLM Generations through Subspace Intervention

Himanshu Singh, Ziwei Xu, A. V. Subramanyam, Mohan Kankanhalli

Main category: cs.CL

TL;DR: Targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns in LLM representations while preserving fluency and coherence

Motivation: LLMs can produce toxic/harmful content even from harmless prompts, presenting serious safety challenges. Toxicity is subtle and context-dependent, making detection difficult. Current detoxification methods face trade-offs between safety and text fluency/coherence.

Method: Targeted subspace intervention strategy that identifies and suppresses hidden toxic patterns from underlying model representations. The approach works at the representation level rather than token or sentence level.

Result: On RealToxicityPrompts, achieves strong mitigation performance with minimal impact on inference complexity. Reduces toxicity of state-of-the-art detoxification systems by 8-20% while maintaining comparable fluency. Consistently outperforms existing baselines.

Conclusion: The approach effectively reduces toxicity without impairing generative performance, offering a promising solution to the safety-fluency trade-off in LLM detoxification.

Abstract: Large Language Models (LLMs) are powerful text generators, yet they can produce toxic or harmful content even when given seemingly harmless prompts. This presents a serious safety challenge and can cause real-world harm. Toxicity is often subtle and context-dependent, making it difficult to detect at the token level or through coarse sentence-level signals. Moreover, efforts to mitigate toxicity often face a trade-off between safety and the coherence or fluency of the generated text. In this work, we present a targeted subspace intervention strategy for identifying and suppressing hidden toxic patterns from underlying model representations, while preserving the model's overall ability to generate safe, fluent content. On the RealToxicityPrompts benchmark, our method achieves strong mitigation performance compared to existing baselines, with minimal impact on inference complexity. Across multiple LLMs, our approach reduces toxicity of state-of-the-art detoxification systems by 8-20%, while maintaining comparable fluency. Through extensive quantitative and qualitative analyses, we show that our approach achieves effective toxicity reduction without impairing generative performance, consistently outperforming existing baselines.
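
A minimal sketch of one common form of subspace intervention (SVD over hidden-state differences, then projection); the paper's exact estimation procedure may differ.

```python
import numpy as np

def toxic_subspace(toxic_h: np.ndarray, safe_h: np.ndarray, k: int = 4):
    """toxic_h, safe_h: (n, d) hidden states from paired toxic/safe text.
    The top-k right singular vectors of the differences span an estimated
    'toxic' subspace."""
    diffs = toxic_h - safe_h
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]  # (k, d) orthonormal basis

def suppress(h: np.ndarray, basis: np.ndarray) -> np.ndarray:
    # Remove the component of h lying in the toxic subspace: h - V^T V h.
    return h - (h @ basis.T) @ basis

V = toxic_subspace(np.random.randn(100, 768), np.random.randn(100, 768))
clean = suppress(np.random.randn(768), V)  # applied at inference time
```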

[40] FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge

Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Xiao Xu, Shijian Li

Main category: cs.CL

TL;DR: FairJudge is an adaptive, debiased LLM-as-a-Judge system that learns judging behavior as a regularized policy to address limitations in existing evaluation systems.

DetailsMotivation: Existing LLM-as-a-Judge systems suffer from three key limitations: poor adaptivity to task-specific criteria, systematic biases from non-semantic cues (position, length, format, model provenance), and evaluation inconsistency across different modes (pointwise vs pairwise).

Method: Models judging behavior as a learnable policy, constructs high-information-density judging dataset with explicit supervision signals, and uses curriculum-style SFT-DPO-GRPO training paradigm to progressively align rubric adherence, bias mitigation, and cross-mode consistency.

Result: Experimental results show FairJudge consistently improves agreement and F1 scores, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs on multiple internal and public benchmarks.

Conclusion: FairJudge addresses fundamental limitations in LLM evaluation systems through adaptive, debiased, and consistent judging behavior learning, with resources to be publicly released for future research.

Abstract: Existing LLM-as-a-Judge systems suffer from three fundamental limitations: limited adaptivity to task- and domain-specific evaluation criteria, systematic biases driven by non-semantic cues such as position, length, format, and model provenance, and evaluation inconsistency that leads to contradictory judgments across different evaluation modes (e.g., pointwise versus pairwise). To address these issues, we propose FairJudge, an adaptive, debiased, and consistent LLM-as-a-Judge. Unlike prior approaches that treat the judge as a static evaluator, FairJudge models judging behavior itself as a learnable and regularized policy. From a data-centric perspective, we construct a high-information-density judging dataset that explicitly injects supervision signals aligned with evaluation behavior. Building on this dataset, we adopt a curriculum-style SFT-DPO-GRPO training paradigm that progressively aligns rubric adherence, bias mitigation, and cross-mode consistency, while avoiding catastrophic forgetting. Experimental results on multiple internal and public benchmarks show that FairJudge consistently improves agreement and F1, reduces non-semantic biases, and outperforms substantially larger instruction-tuned LLMs. All resources will be publicly released after acceptance to facilitate future research.
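
One of the non-semantic biases FairJudge targets, position bias, is easy to probe: swap the order of the two answers and check whether the verdict flips accordingly. A sketch, with `judge` as a hypothetical callable standing in for any pairwise LLM judge:

```python
# Sketch: probing position bias in a pairwise judge by order swapping.
# `judge(prompt, first, second)` is a hypothetical call returning "A"
# (first answer wins) or "B" (second answer wins).
from typing import Callable, List, Tuple

def position_consistency(
    judge: Callable[[str, str, str], str],
    items: List[Tuple[str, str, str]],  # (prompt, answer_a, answer_b)
) -> float:
    """Fraction of pairs judged consistently under both orderings."""
    consistent = 0
    for prompt, ans_a, ans_b in items:
        verdict = judge(prompt, ans_a, ans_b)   # answer_a shown first
        swapped = judge(prompt, ans_b, ans_a)   # order reversed
        # an "A" win in the first ordering should become a "B" win
        if (verdict == "A") == (swapped == "B"):
            consistent += 1
    return consistent / len(items)
```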

[41] Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang

Main category: cs.CL

TL;DR: PACT is a framework for dynamic safety control in LLMs using explicit, risk-aware reasoning with hierarchical policies to address the safety-helpfulness trade-off.

DetailsMotivation: Current LLMs use static, one-size-fits-all safety policies that lack runtime controllability, causing models to either over-refuse benign requests or under-constrain harmful ones, creating a fundamental safety-helpfulness trade-off.

Method: PACT uses a hierarchical policy architecture with non-overridable global safety policies for critical risks and user-defined policies for domain-specific risks. It decomposes safety decisions into structured Classify→Act paths that route queries to appropriate actions (comply, guide, or reject) with transparent reasoning.

Result: Extensive experiments show PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off.

Conclusion: PACT provides a framework for dynamic safety control through explicit reasoning, enabling better utility in real-world deployment while maintaining safety boundaries, with plans to release the model suite, training data, and evaluation protocols.

Abstract: Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present \textbf{PACT} (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify$\rightarrow$Act paths that route queries to the appropriate action (comply, guide, or reject) and render the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
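
The Classify$\rightarrow$Act routing with a non-overridable global layer can be pictured as a small lookup hierarchy. The sketch below is only the routing skeleton under assumed category names; in PACT the classification step itself is performed with risk-aware chain-of-thought.

```python
# Sketch: hierarchical Classify->Act routing in the spirit of PACT.
# Category names, labels, and the policy tables are illustrative.
GLOBAL_POLICY = {          # immutable boundaries: cannot be overridden
    "child_safety": "reject",
    "violent_extremism": "reject",
}

def route(risk_label: str, user_policy: dict) -> str:
    """Map a classified risk label to an action: comply, guide, or reject."""
    if risk_label in GLOBAL_POLICY:            # global policy wins
        return GLOBAL_POLICY[risk_label]
    # user-defined policies cover domain-specific (non-global) risks
    return user_policy.get(risk_label, "comply")

user_policy = {"medical_advice": "guide", "financial_advice": "guide"}
print(route("child_safety", user_policy))      # reject
print(route("medical_advice", user_policy))    # guide
print(route("benign_request", user_policy))    # comply
```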

[42] Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity

Bowen Zhang, Meiyi Wang, Harold Soh

Main category: cs.CL

TL;DR: Selective Layer Restoration (SLR) is a training-free method that recovers generation diversity in post-trained LLMs by restoring specific layers to their pre-trained weights, addressing mode collapse while maintaining output quality.

DetailsMotivation: Post-training improves instruction-following but reduces generation diversity (mode collapse). The authors hypothesize that mode collapse can be localized to specific layers and that restoring selected layers to pre-trained weights can recover diversity while maintaining quality.

Method: 1) Design Constrained Random Character (CRC) proxy task with explicit validity set and diversity objective to identify layers affecting diversity; 2) Propose Selective Layer Restoration (SLR) - restore selected layers in post-trained models to pre-trained weights, creating hybrid models with same architecture; 3) Apply across three model families (Llama, Qwen, Gemma) and three tasks (creative writing, QA, reasoning).

Result: SLR consistently and substantially improves output diversity while maintaining high output quality across all tested models and tasks. CRC task revealed clear diversity-validity trade-off across restoration ranges and identified optimal configurations.

Conclusion: Mode collapse in post-trained LLMs can be effectively addressed by selectively restoring specific layers to pre-trained weights, offering a training-free solution that improves diversity without compromising quality or increasing inference cost.

Abstract: Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task, Constrained Random Character (CRC), with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
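
Mechanically, SLR is a targeted weight copy between two checkpoints that share an architecture. A minimal sketch with Hugging Face models, assuming placeholder model names and a layer range that the CRC proxy task would identify:

```python
# Sketch: Selective Layer Restoration via a partial state_dict copy.
# Model names are placeholders; [lo, hi) is the restoration range the
# CRC proxy task would identify.
from transformers import AutoModelForCausalLM

post = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

def restore_layers(post_model, base_model, lo: int, hi: int):
    """Copy pre-trained weights back into layers lo..hi-1."""
    patch = {
        name: tensor
        for name, tensor in base_model.state_dict().items()
        if any(f"layers.{i}." in name for i in range(lo, hi))
    }
    # strict=False leaves every other (post-trained) weight untouched
    post_model.load_state_dict(patch, strict=False)
    return post_model

# Same architecture and parameter count, so no extra inference cost.
hybrid = restore_layers(post, base, lo=20, hi=28)
```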

[43] compar:IA: The French Government’s LLM arena to collect French-language human prompts and preference data

Lucie Termignon, Simonas Zilinskas, Hadrien Pélissier, Aurélien Barrot, Nicolas Chesnais, Elie Gavoty

Main category: cs.CL

TL;DR: compar:IA is an open-source digital public service from the French government that collects large-scale human preference data for LLMs, primarily in French, to address the scarcity of non-English preference data for RLHF/DPO training.

DetailsMotivation: LLMs show reduced performance, cultural alignment, and safety robustness in non-English languages due to English dominance in pre-training and human preference alignment datasets. RLHF and DPO require human preference data that remains scarce and non-public for many languages beyond English.

Method: Developed an open-source digital public service with a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across diverse language models. Maintains low participation friction and privacy-preserving automated filtering.

Result: Collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of data in French. Released three complementary datasets (conversations, votes, reactions) under open licenses, including a French-language model leaderboard and user interaction patterns.

Conclusion: compar:IA addresses the critical gap in non-English human preference data and is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and human-AI interaction studies.

Abstract: Large Language Models (LLMs) often show reduced performance, cultural alignment, and safety robustness in non-English languages, partly because English dominates both pre-training data and human preference alignment datasets. Training methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) require human preference data, which remains scarce and largely non-public for many languages beyond English. To address this gap, we introduce compar:IA, an open-source digital public service developed inside the French government and designed to collect large-scale human preference data from a predominantly French-speaking general audience. The platform uses a blind pairwise comparison interface to capture unconstrained, real-world prompts and user judgments across a diverse set of language models, while maintaining low participation friction and privacy-preserving automated filtering. As of 2026-02-07, compar:IA has collected over 600,000 free-form prompts and 250,000 preference votes, with approximately 89% of the data in French. We release three complementary datasets – conversations, votes, and reactions – under open licenses, and present initial analyses, including a French-language model leaderboard and user interaction patterns. Beyond the French context, compar:IA is evolving toward an international digital public good, offering reusable infrastructure for multilingual model training, evaluation, and the study of human-AI interaction.
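
Preference votes like these are typically turned into a leaderboard with a Bradley-Terry-style model; the abstract does not specify compar:IA's own ranking method, so the sketch below is one standard recipe with illustrative vote data.

```python
# Sketch: fitting Bradley-Terry log-strengths from pairwise votes by
# gradient ascent. Vote tuples and model names are illustrative.
import math
from collections import defaultdict

votes = [  # (model_a, model_b, winner)
    ("model-x", "model-y", "model-y"),
    ("model-z", "model-x", "model-x"),
    ("model-y", "model-z", "model-y"),
]

def fit_bradley_terry(votes, steps=200, lr=0.1):
    scores = defaultdict(float)            # log-strengths, start at 0
    for _ in range(steps):
        for a, b, winner in votes:
            p_a = 1.0 / (1.0 + math.exp(scores[b] - scores[a]))
            y = 1.0 if winner == a else 0.0
            scores[a] += lr * (y - p_a)    # ascend the vote log-likelihood
            scores[b] -= lr * (y - p_a)
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

print(fit_bradley_terry(votes))            # leaderboard order
```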

[44] Evaluating Prompt Engineering Strategies for Sentiment Control in AI-Generated Texts

Kerstin Sahler, Sophie Jentzsch

Main category: cs.CL

TL;DR: Prompt engineering can effectively control sentiment in LLM-generated text, with Few-Shot prompting being most effective, offering a practical alternative to fine-tuning for emotion-adaptive AI systems.

DetailsMotivation: LLMs offer new opportunities for emotion-adaptive AI, but deliberately controlling sentiment in these systems remains challenging. The study aims to provide a resource-sensitive and accessible alternative to existing methods like fine-tuning.

Method: Investigates prompt engineering techniques for controlling sentiment in LLM-generated text using Ekman’s six basic emotions. Examines Zero-Shot, Few-Shot, and Chain-of-Thought prompting with gpt-3.5-turbo and compares them to fine-tuning.

Result: Prompt engineering effectively steers emotions in AI-generated texts, offering a practical and cost-effective alternative to fine-tuning. Few-Shot prompting with human-written examples was the most effective technique.

Conclusion: Prompt engineering provides valuable insights for developing emotion-adaptive AI systems, especially in data-constrained settings, with Few-Shot prompting offering the best performance.

Abstract: The groundbreaking capabilities of Large Language Models (LLMs) offer new opportunities for enhancing human-computer interaction through emotion-adaptive Artificial Intelligence (AI). However, deliberately controlling the sentiment in these systems remains challenging. The present study investigates the potential of prompt engineering for controlling sentiment in LLM-generated text, providing a resource-sensitive and accessible alternative to existing methods. Using Ekman’s six basic emotions (e.g., joy, disgust), we examine various prompting techniques, including Zero-Shot, Few-Shot, and Chain-of-Thought prompting using gpt-3.5-turbo, and compare them to fine-tuning. Our results indicate that prompt engineering effectively steers emotions in AI-generated texts, offering a practical and cost-effective alternative to fine-tuning, especially in data-constrained settings. In this regard, Few-Shot prompting with human-written examples was the most effective among other techniques, likely due to the additional task-specific guidance. The findings contribute valuable insights towards developing emotion-adaptive AI systems.
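
For a feel of the winning strategy, here is what a Few-Shot sentiment-control prompt could look like; the template and the human-written examples are illustrative, not the study's exact prompts.

```python
# Sketch: a Few-Shot prompt for steering generated text toward one of
# Ekman's basic emotions. Template and examples are illustrative.
def few_shot_prompt(emotion: str, topic: str, examples: list) -> str:
    shots = "\n".join(f"- {e}" for e in examples)
    return (f"Write a short text about {topic} that conveys {emotion}.\n"
            f"Human-written examples conveying {emotion}:\n{shots}\n"
            f"Now write yours:")

print(few_shot_prompt(
    "joy", "a rainy day",
    ["The rain drummed a happy rhythm on the roof.",
     "Puddles became playgrounds for laughing children."]))
```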

[45] Table-as-Search: Formulate Long-Horizon Agentic Information Seeking as Table Completion

Tian Lan, Felix Henry, Bin Zhu, Qianghuai Jia, Junyang Ren, Qihang Pu, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo

Main category: cs.CL

TL;DR: TaS reformulates information seeking as table completion, using structured tables to manage search states and planning for improved long-horizon exploration.

DetailsMotivation: Current InfoSeeking agents struggle with maintaining focus and coherence during long-horizon exploration due to fragile plain-text context management of search states and results.

Method: Introduces Table-as-Search (TaS) framework that maps queries into structured table schemas in external databases, where rows represent search candidates and columns denote constraints/information, with filled cells recording history and empty cells serving as explicit search plans.

Result: TaS significantly outperforms state-of-the-art baselines across three kinds of benchmarks, including multi-agent frameworks and commercial systems, demonstrating superior robustness in long-horizon InfoSeeking with efficiency, scalability and flexibility.

Conclusion: TaS provides an effective structured planning framework for information seeking that addresses limitations of current approaches and unifies different search tasks through table-based state management.

Abstract: Current Information Seeking (InfoSeeking) agents struggle to maintain focus and coherence during long-horizon exploration, as tracking search states, including planning procedure and massive search results, within one plain-text context is inherently fragile. To address this, we introduce \textbf{Table-as-Search (TaS)}, a structured planning framework that reformulates the InfoSeeking task as a Table Completion task. TaS maps each query into a structured table schema maintained in an external database, where rows represent search candidates and columns denote constraints or required information. This table precisely manages the search states: filled cells strictly record the history and search results, while empty cells serve as an explicit search plan. Crucially, TaS unifies three distinct InfoSeeking tasks: Deep Search, Wide Search, and the challenging DeepWide Search. Extensive experiments demonstrate that TaS significantly outperforms numerous state-of-the-art baselines across three kinds of benchmarks, including multi-agent frameworks and commercial systems. Furthermore, our analysis validates TaS's superior robustness in long-horizon InfoSeeking, alongside its efficiency, scalability and flexibility. Code and datasets are publicly released at https://github.com/AIDC-AI/Marco-Search-Agent.
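
The table-shaped search state is easiest to see in miniature. A sketch with SQLite standing in for the external database; the schema and rows are illustrative.

```python
# Sketch: TaS-style search state, with filled cells recording results
# and empty (NULL) cells doubling as the explicit search plan.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE search_state (
    candidate TEXT PRIMARY KEY,   -- rows: search candidates
    founded_year INTEGER,         -- columns: constraints / needed info
    headquarters TEXT
)""")
db.execute("INSERT INTO search_state VALUES ('Acme Corp', 1952, NULL)")
db.execute("INSERT INTO search_state VALUES ('Globex', NULL, 'Springfield')")

def next_searches(db):
    """Empty cells are exactly the remaining search plan."""
    plan = []
    for cand, year, hq in db.execute("SELECT * FROM search_state"):
        if year is None:
            plan.append((cand, "founded_year"))
        if hq is None:
            plan.append((cand, "headquarters"))
    return plan

print(next_searches(db))
# [('Acme Corp', 'headquarters'), ('Globex', 'founded_year')]
```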

[46] R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging

Yanlin Lai, Mitt Huang, Hangyu Guo, Xiangfeng Wang, Haodong Li, Shaoxiong Zhan, Liang Zhao, Chengyuan Yao, Yinmin Zhang, Qi Han, Chun Yuan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Main category: cs.CL

TL;DR: The paper introduces R-Align, a method to improve reasoning fidelity in Generative Reward Models by supervising rationale alignment with gold judgments, reducing spurious correctness and improving downstream RLHF outcomes.

DetailsMotivation: Current Generative Reward Models (GenRMs) focus only on outcome labels during training and evaluation, leaving reasoning quality unchecked. The authors identify that reasoning fidelity - consistency between preference decisions and reference rationales - is crucial for downstream RLHF performance, beyond mere label accuracy.

Method: The paper proposes Rationale-Centric Alignment (R-Align), which augments GenRM training with gold judgments and explicitly supervises rationale alignment. It introduces Spurious Correctness (S-Corr) metric to measure label-correct decisions with misaligned rationales, and uses this to improve training.

Result: Empirical evaluation shows substantial S-Corr even in competitive GenRMs, with higher S-Corr associated with policy degeneration. R-Align reduces S-Corr on RM benchmarks and yields consistent performance gains across STEM, coding, instruction following, and general tasks.

Conclusion: Reasoning fidelity is critical for RLHF robustness, and R-Align effectively improves GenRM performance by aligning rationales with gold judgments, leading to better downstream actor performance across diverse domains.

Abstract: Reinforcement Learning from Human Feedback (RLHF) remains indispensable for aligning large language models (LLMs) in subjective domains. To enhance robustness, recent work shifts toward Generative Reward Models (GenRMs) that generate rationales before predicting preferences. Yet in GenRM training and evaluation, practice remains outcome-label-only, leaving reasoning quality unchecked. We show that reasoning fidelity, the consistency between a GenRM's preference decision and the reference decision rationales, is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. Specifically, we repurpose existing reward-model benchmarks to compute Spurious Correctness (S-Corr), the fraction of label-correct decisions whose rationales are misaligned with golden judgments. Our empirical evaluation reveals substantial S-Corr even for competitive GenRMs, and higher S-Corr is associated with policy degeneration under optimization. To improve fidelity, we propose Rationale-Centric Alignment, R-Align, which augments training with gold judgments and explicitly supervises rationale alignment. R-Align reduces S-Corr on RM benchmarks and yields consistent gains in actor performance across STEM, coding, instruction following, and general tasks.
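
S-Corr itself is a simple ratio once each judgment has been scored for label correctness and rationale alignment. A sketch (in practice, alignment would be checked against the gold judgments):

```python
# Sketch: Spurious Correctness = fraction of label-correct decisions
# whose rationales are misaligned with the gold judgment.
def s_corr(label_correct, rationale_aligned):
    correct = [ra for lc, ra in zip(label_correct, rationale_aligned) if lc]
    if not correct:
        return 0.0
    return sum(1 for ra in correct if not ra) / len(correct)

# 4 label-correct decisions, 1 with a misaligned rationale -> 0.25
print(s_corr([True, True, True, True, False],
             [True, False, True, True, False]))
```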

[47] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling

Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, Huzefa Rangwala

Main category: cs.CL

TL;DR: LLMs struggle with error detection in reasoning traces, especially in technical domains. This paper proposes using automatically constructed error taxonomies (rubrics) to improve LLM-based error identification and create better reward functions for RL training.

DetailsMotivation: LLMs have difficulty reliably identifying errors in reasoning traces, particularly for long outputs, expert domains, and problems without verifiable rewards. This limits their use for reasoning output verification and creating effective reward functions for reinforcement learning.

Method: Develops a data-driven approach to automatically construct granular reasoning error taxonomies (rubrics). These rubrics are used to enhance LLM-driven error detection and build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning.

Result: Rubric-based classification shows strong error identification compared to baselines in technical domains (coding, math, chemical engineering). Reward functions using these rubrics improve task accuracy by +45% over general LLM-as-judge methods and approach performance of models trained with verifiable rewards using only 20% of gold labels.

Conclusion: The approach extends reward rubrics from assessing qualitative behavior to quantitative correctness, enabling teaching models to solve complex technical problems without costly full gold label datasets.

Abstract: An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or “rubrics”, demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards can improve models’ task accuracy on difficult domains by +45% over models trained by general LLMs-as-judges, and approach the performance of models trained with verifiable rewards while using as little as 20% of the gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via reinforcement learning with verifiable rewards (RLVR). This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
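
One natural way to turn such a rubric into a reward is sketched below, with a hypothetical `judge_has_error` LLM call; the paper's exact reward construction may differ.

```python
# Sketch: converting a reasoning-error rubric into an LLM-as-judge
# reward. `judge_has_error(trace, item)` is a hypothetical LLM call
# that checks one rubric item against a reasoning trace.
def rubric_reward(trace, rubric, judge_has_error):
    """Fraction of rubric error types absent from the trace."""
    errors = sum(1 for item in rubric if judge_has_error(trace, item))
    return 1.0 - errors / len(rubric)

rubric = [  # illustrative, data-derived error taxonomy entries
    "off-by-one arithmetic slip",
    "unit conversion error",
    "unjustified assumption introduced mid-derivation",
]
# reward = rubric_reward(trace, rubric, judge_has_error)  # feed into RL
```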

[48] Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations

Shamik Bhattacharya, Daniel Perkins, Yaren Dogan, Vineeth Konjeti, Sudarshan Srinivasan, Edmon Begoli

Main category: cs.CL

TL;DR: Visual Word Sense Disambiguation framework using CLIP with dual-channel prompting and image augmentation improves disambiguation of ambiguous language through visual context.

DetailsMotivation: Address lexical ambiguity challenges in LLMs by leveraging visual domain to resolve word sense disambiguation, creating an interpretable framework for multimodal understanding.

Method: Uses CLIP to project ambiguous text and candidate images into shared multimodal space. Enriches textual embeddings with dual-channel ensemble (semantic + photo-based prompts with WordNet synonyms). Refines image embeddings through test-time augmentations. Uses cosine similarity for alignment.

Result: On SemEval-2023 VWSD dataset: MRR improved from 0.7227 to 0.7590, Hit Rate from 0.5810 to 0.6220. Dual-channel prompting provides strong low-latency performance. Aggressive image augmentation yields marginal gains.

Conclusion: Precise CLIP-aligned prompts are effective for visual word sense disambiguation. Noisy external signals dilute semantic specificity. Dual-channel prompting offers robust performance for multimodal ambiguity resolution.

Abstract: Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.
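
The dual-channel scoring reduces to averaging cosine similarities over two prompt families. A sketch with the Hugging Face CLIP API; the templates and synonym list are illustrative, and the paper's test-time image augmentations are omitted.

```python
# Sketch: dual-channel prompt ensembling with CLIP. Prompt templates
# and synonyms are illustrative; test-time augmentation is omitted.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_images(context, synonyms, images):
    # channel 1: semantic prompts; channel 2: photo-style prompts
    prompts = [f"{context}, meaning {s}" for s in synonyms]
    prompts += [f"a photo of {s}" for s in synonyms]
    inputs = proc(text=prompts, images=images,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (v @ t.T).mean(dim=-1)   # avg cosine sim per candidate image

# best = score_images("bank of the river", ["riverbank", "shore"], imgs).argmax()
```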

[49] The Representational Geometry of Number

Zhimin Hu, Lanhao Niu, Sashank Varma

Main category: cs.CL

TL;DR: Number representations in language models preserve stable relational structure across tasks while being embedded in distinct subspaces that are transformable via linear mappings.

DetailsMotivation: To understand how conceptual representations balance shared structure for generalization versus orthogonal subspaces for task interference, using number concepts as a testbed.

Method: Used language models as high-dimensional computational substrates to analyze number representations, examining geometric relations between concepts across different tasks.

Result: Number representations preserve stable relational structure across tasks, with task-specific representations in distinct subspaces that are transformable via linear mappings.

Conclusion: Understanding arises when task-specific transformations are applied to a shared underlying relational structure of conceptual representations.

Abstract: A central question in cognitive science is whether conceptual representations converge onto a shared manifold to support generalization, or diverge into orthogonal subspaces to minimize task interference. While prior work has discovered evidence for both, a mechanistic account of how these properties coexist and transform across tasks remains elusive. We propose that representational sharing lies not in the concepts themselves, but in the geometric relations between them. Using number concepts as a testbed and language models as high-dimensional computational substrates, we show that number representations preserve a stable relational structure across tasks. Task-specific representations are embedded in distinct subspaces, with low-level features like magnitude and parity encoded along separable linear directions. Crucially, we find that these subspaces are largely transformable into one another via linear mappings, indicating that representations share relational structure despite being located in distinct subspaces. Together, these results provide a mechanistic lens of how language models balance the shared structure of number representation with functional flexibility. It suggests that understanding arises when task-specific transformations are applied to a shared underlying relational structure of conceptual representations.
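
The "transformable via linear mappings" claim corresponds to a simple test: fit a least-squares linear map from one task's number representations to another's and measure how much variance it explains. A sketch with synthetic stand-in data:

```python
# Sketch: testing whether two task-specific representation sets are
# related by a linear map, via least squares. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 128                      # 300 number stimuli, 128-dim states
A = rng.normal(size=(n, d))          # representations under task 1
W_true = rng.normal(size=(d, d))
B = A @ W_true                       # task 2 = linear image of task 1

W, *_ = np.linalg.lstsq(A, B, rcond=None)   # fit the linear map
r2 = 1 - np.linalg.norm(A @ W - B) ** 2 / np.linalg.norm(B - B.mean(0)) ** 2
print(f"variance explained by a linear map: {r2:.3f}")  # ~1.0 here
```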

[50] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Mingqian Feng, Xiaodong Liu, Weiwei Yang, Jialin Song, Xuekai Zhu, Chenliang Xu, Jianfeng Gao

Main category: cs.CL

TL;DR: SEMA is a framework for training multi-turn jailbreak attackers for LLMs using self-tuning and reinforcement learning with intent-drift-aware rewards, achieving state-of-the-art attack success rates.

DetailsMotivation: Multi-turn jailbreaks represent the real threat model for safety-aligned chatbots, but existing approaches suffer from exploration complexity and intent drift issues.

Method: Two-stage approach: 1) Prefilling self-tuning fine-tunes on self-generated non-refusal multi-turn adversarial prompts, 2) Reinforcement learning with intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail.

Result: Achieves 80.1% average ASR@1 across three victim models on AdvBench, outperforming all baselines and exceeding the prior SOTA by 33.9%. The approach is compact, reproducible, and transfers across targets.

Conclusion: SEMA provides a stronger and more realistic stress test for LLM safety, enabling automatic redteaming to expose and localize failure modes in safety-aligned chatbots.

Abstract: Multi-turn jailbreaks capture the real threat model for safety-aligned chatbots, where single-turn attacks are merely a special case. Yet existing approaches break under exploration complexity and intent drift. We propose SEMA, a simple yet effective framework that trains a multi-turn attacker without relying on any existing strategies or external data. SEMA comprises two stages. Prefilling self-tuning enables usable rollouts by fine-tuning on non-refusal, well-structured, multi-turn adversarial prompts that are self-generated with a minimal prefix, thereby stabilizing subsequent learning. Reinforcement learning with an intent-drift-aware reward trains the attacker to elicit valid multi-turn adversarial prompts while maintaining the same harmful objective. We anchor harmful intent in multi-turn jailbreaks via an intent-drift-aware reward that combines intent alignment, compliance risk, and level of detail. Our open-loop attack regime avoids dependence on victim feedback, unifies single- and multi-turn settings, and reduces exploration complexity. Across multiple datasets, victim models, and jailbreak judges, our method achieves state-of-the-art (SOTA) attack success rates (ASR), outperforming all single-turn baselines, manually scripted and template-driven multi-turn baselines, as well as our SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization) variants. For instance, SEMA achieves an average 80.1% ASR@1 across three closed-source and open-source victim models on AdvBench, 33.9% above the previous SOTA. The approach is compact, reproducible, and transfers across targets, providing a stronger and more realistic stress test for large language model (LLM) safety and enabling automatic redteaming to expose and localize failure modes. Our code is available at: https://github.com/fmmarkmq/SEMA.

[51] Uncovering Cross-Objective Interference in Multi-Objective Alignment

Yining Lu, Meng Jiang

Main category: cs.CL

TL;DR: The paper studies cross-objective interference in multi-objective alignment for LLMs, proposes a covariance-based analysis, and introduces CTWA method to mitigate interference.

DetailsMotivation: Multi-objective alignment for LLMs often suffers from interference where improving one objective degrades others, which is a persistent failure mode that needs systematic study and solutions.

Method: The authors formalize cross-objective interference, derive a local covariance law showing objectives improve when their reward has positive covariance with scalarized score, extend analysis to clipped surrogate objectives, and propose Covariance Targeted Weight Adaptation (CTWA) to maintain positive covariance. They also provide global convergence analysis under Polyak-Łojasiewicz condition.

Result: The study shows interference is pervasive and model-dependent across scalarization algorithms. CTWA effectively mitigates cross-objective interference by maintaining positive covariance between objective rewards and training signals.

Conclusion: Cross-objective interference is a fundamental challenge in multi-objective LLM alignment that can be understood through covariance analysis and mitigated with targeted methods like CTWA, with global convergence properties depending on model geometry.

Abstract: We study a persistent failure mode in multi-objective alignment for large language models (LLMs): training improves performance on only a subset of objectives while causing others to degrade. We formalize this phenomenon as cross-objective interference and conduct the first systematic study across classic scalarization algorithms, showing that interference is pervasive and exhibits strong model dependence. To explain this phenomenon, we derive a local covariance law showing that an objective improves at first order when its reward exhibits positive covariance with the scalarized score. We extend this analysis to clipped surrogate objectives used in modern alignment, demonstrating that the covariance law remains valid under mild conditions despite clipping. Building on this analysis, we propose Covariance Targeted Weight Adaptation (CTWA), a plug-and-play method that maintains positive covariance between objective rewards and the training signal to effectively mitigate cross-objective interference. Finally, we complement these local improvement conditions with a global convergence analysis under the Polyak–Łojasiewicz condition, establishing when non-convex scalarized optimization achieves global convergence and how cross-objective interference depends on specific model geometric properties.
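
The covariance law and the CTWA adjustment can be illustrated in a few lines; the update rule below is a deliberately simple stand-in for the paper's adaptation scheme.

```python
# Sketch: the covariance check behind CTWA. An objective locally
# improves when cov(r_i, scalarized score) > 0; weights are nudged up
# for objectives whose covariance turns negative. The update rule is
# an illustrative stand-in for the paper's scheme.
import numpy as np

def ctwa_step(rewards, weights, lr=0.05):
    """rewards: (batch, n_obj) per-sample rewards; weights: (n_obj,)."""
    scalarized = rewards @ weights
    covs = np.array([np.cov(rewards[:, i], scalarized)[0, 1]
                     for i in range(rewards.shape[1])])
    # upweight objectives at risk of degrading (negative covariance)
    weights = weights + lr * (covs < 0).astype(float)
    return weights / weights.sum(), covs

rng = np.random.default_rng(1)
r = rng.normal(size=(64, 3))          # batch of 64, three objectives
w, covs = ctwa_step(r, np.ones(3) / 3)
print(covs.round(3), w.round(3))
```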

[52] Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Samir Abdaljalil, Parichit Sharma, Erchin Serpedin, Hasan Kurban

Main category: cs.CL

TL;DR: Halluverse-M^3 is a multilingual, multi-task dataset for systematic analysis of hallucinations in LLMs across English, Arabic, Hindi, and Turkish, covering question answering and dialogue summarization with entity, relation, and sentence-level hallucination categories.

DetailsMotivation: Hallucinations in LLMs remain a persistent challenge, especially in multilingual and generative settings where factual consistency is difficult. Current models show strong English performance but their behavior across languages, tasks, and hallucination types is not well understood.

Method: Created Halluverse-M^3 dataset covering 4 languages (English, Arabic, Hindi, Turkish) and 2 generation tasks (question answering, dialogue summarization). Hallucinated outputs constructed through controlled editing and human validation. Evaluated diverse open-source and proprietary LLMs on fine-grained hallucination detection.

Result: Question answering easier than dialogue summarization; sentence-level hallucinations remain challenging even for strongest models. Performance highest in English, degrades in lower-resource languages, with Hindi showing lowest detection accuracy.

Conclusion: Halluverse-M^3 provides realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. Dataset released to support future research on hallucination detection and mitigation.

Abstract: Hallucinations in large language models remain a persistent challenge, particularly in multilingual and generative settings where factual consistency is difficult to maintain. While recent models show strong performance on English-centric benchmarks, their behavior across languages, tasks, and hallucination types is not yet well understood. In this work, we introduce Halluverse-M^3, a dataset designed to enable systematic analysis of hallucinations across multiple languages, multiple generation tasks, and multiple hallucination categories. Halluverse-M^3 covers four languages, English, Arabic, Hindi, and Turkish, and supports two generation tasks: question answering and dialogue summarization. The dataset explicitly distinguishes between entity-level, relation-level, and sentence-level hallucinations. Hallucinated outputs are constructed through a controlled editing process and validated by human annotators, ensuring clear alignment between original content and hallucinated generations. Using this dataset, we evaluate a diverse set of contemporary open-source and proprietary language models on fine-grained hallucination detection. Our results show that question answering is consistently easier than dialogue summarization, while sentence-level hallucinations remain challenging even for the strongest models. Performance is highest in English and degrades in lower-resource languages, with Hindi exhibiting the lowest detection accuracy. Overall, Halluverse-M^3 provides a realistic and challenging benchmark for studying hallucinations in multilingual, multi-task settings. We release the dataset to support future research on hallucination detection and mitigation (available at https://huggingface.co/datasets/sabdalja/HalluVerse-M3).
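
Since the dataset is on the Hub, a first look is one call away; the splits and column names are not given in the abstract, so inspect the loaded object before filtering.

```python
# Sketch: loading Halluverse-M^3 from the released Hub path. Splits and
# column names are assumptions to verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("sabdalja/HalluVerse-M3")
print(ds)  # inspect available splits and columns first
```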

[53] Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay

Duygu Altinok

Main category: cs.CL

TL;DR: Comprehensive study of Turkish subword tokenization examining vocabulary size, training corpus size, tokenizer families, and evaluation across semantic, syntactic, and morphology-sensitive tasks with novel diagnostic toolkit.

DetailsMotivation: Tokenization is crucial for neural language modeling in morphologically rich languages like Turkish, where agglutination challenges vocabulary efficiency and morphological fidelity. Prior studies lack systematic control over training corpus, limited diagnostics, and narrow task evaluation.

Method: Systematic study varying vocabulary size and tokenizer training corpus size, comparing WordPiece, morphology-level, and character baselines under matched parameter budgets. Introduces morphology-aware diagnostic toolkit with boundary-level metrics, affix-type coverage, and token-level atomicity measures.

Result: Provides comprehensive analysis of vocabulary-corpus-success triad, identifies when character-level and morphology-level tokenization are beneficial, and establishes unified evaluation framework linking intrinsic diagnostics to extrinsic outcomes.

Conclusion: Delivers actionable guidance for building effective tokenizers in morphologically rich languages and establishes reproducible foundation for future research with open-source release of evaluation code, tokenizer pipelines, and models.

Abstract: Tokenization is a pivotal design choice for neural language modeling in morphologically rich languages (MRLs) such as Turkish, where productive agglutination challenges both vocabulary efficiency and morphological fidelity. Prior studies have explored tokenizer families and vocabulary sizes but typically (i) vary vocabulary without systematically controlling the tokenizer’s training corpus, (ii) provide limited intrinsic diagnostics, and (iii) evaluate a narrow slice of downstream tasks. We present the first comprehensive, principled study of Turkish subword tokenization, a “subwords manifest,” that jointly varies vocabulary size and tokenizer training corpus size (data and vocabulary coupling), compares multiple tokenizer families under matched parameter budgets (WordPiece, morphology-level, and character baselines), and evaluates across semantic (NLI, STS, sentiment analysis, NER), syntactic (POS, dependency parsing), and morphology-sensitive probes. To explain why tokenizers succeed or fail, we introduce a morphology-aware diagnostic toolkit that goes beyond coarse aggregates to boundary-level micro/macro F1, decoupled lemma atomicity vs. surface boundary hits, over/under-segmentation indices, character/word edit distances (CER/WER), continuation rates, and affix-type coverage and token-level atomicity. Our contributions are fourfold: (i) a systematic investigation of the vocabulary-corpus-success triad; (ii) a unified, morphology-aware evaluation framework linking intrinsic diagnostics to extrinsic outcomes; (iii) controlled comparisons identifying when character-level and morphology-level tokenization pay off; and (iv) an open-source release of evaluation code, tokenizer pipelines, and models. As the first work of its kind, this “subwords manifest” delivers actionable guidance for building effective tokenizers in MRLs and establishes a reproducible foundation for future research.
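
Among the toolkit's diagnostics, boundary-level F1 is the most self-contained: compare predicted subword boundaries to gold morpheme boundaries as character offsets. A sketch with an illustrative Turkish example:

```python
# Sketch: boundary-level F1 of a subword segmentation against gold
# morpheme boundaries (character offsets). Example is illustrative.
def boundary_f1(pred_segs, gold_segs):
    def boundaries(segs):
        pos, b = 0, set()
        for s in segs[:-1]:          # no boundary after the last piece
            pos += len(s)
            b.add(pos)
        return b
    p, g = boundaries(pred_segs), boundaries(gold_segs)
    if not p or not g:
        return 0.0
    prec, rec = len(p & g) / len(p), len(p & g) / len(g)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

# "evlerimizde" ('in our houses') = ev-ler-imiz-de
print(boundary_f1(["evler", "imizde"], ["ev", "ler", "imiz", "de"]))  # 0.5
```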

[54] DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

Lizhuo Luo, Zhuoran Shi, Jiajun Luo, Zhi Wang, Shen Ren, Wenya Wang, Tianwei Zhang

Main category: cs.CL

TL;DR: DAWN is a training-free, dependency-aware decoding method for diffusion LLMs that accelerates inference by modeling token dependencies to enable more aggressive parallel decoding while maintaining quality.

DetailsMotivation: Existing diffusion LLM inference solutions use conservative parallel strategies due to the quality-speed trade-off, leaving efficiency potential unexplored. The core challenge is that parallel decoding assumes independent token generation, but tokens are semantically coupled, and ignoring these dependencies leads to deteriorated outputs.

Method: DAWN extracts token dependencies and uses a dependency graph to select reliable unmasking positions at each iteration. It leverages two key insights: (1) positions dependent on unmasked certain positions become more reliable, and (2) simultaneously unmasking strongly coupled uncertain positions induces errors.

Result: Extensive experiments across multiple models and datasets show DAWN speeds up inference by 1.80-8.06x over baselines while preserving generation quality.

Conclusion: DAWN demonstrates that modeling token dependencies enables more aggressive parallel decoding in diffusion LLMs, achieving significant speedups without quality degradation through a training-free approach.

Abstract: Diffusion large language models (dLLMs) have shown advantages in text generation, particularly due to their inherent ability for parallel decoding. However, constrained by the quality–speed trade-off, existing inference solutions adopt conservative parallel strategies, leaving substantial efficiency potential underexplored. A core challenge is that parallel decoding assumes each position can be filled independently, but tokens are often semantically coupled. Thus, the correct choice at one position constrains valid choices at others. Without modeling these inter-token dependencies, parallel strategies produce deteriorated outputs. Motivated by this insight, we propose DAWN, a training-free, dependency-aware decoding method for fast dLLM inference. DAWN extracts token dependencies and leverages two key motivations: (1) positions dependent on unmasked certain positions become more reliable, (2) simultaneously unmasking strongly coupled uncertain positions induces errors. Given those findings, DAWN leverages a dependency graph to select more reliable unmasking positions at each iteration, achieving high parallelism with negligible loss in generation quality. Extensive experiments across multiple models and datasets demonstrate that DAWN speeds up inference by 1.80-8.06x over baselines while preserving generation quality. Code is released at https://github.com/lizhuo-luo/DAWN.
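
DAWN's two observations translate into a simple selection rule over a dependency graph. A sketch with illustrative confidences and dependencies (DAWN extracts both from the model itself):

```python
# Sketch: dependency-aware choice of positions to unmask in one
# iteration, following DAWN's two observations. Inputs are illustrative.
def select_positions(masked, conf, deps, tau=0.9):
    """masked: still-masked positions; conf: pos -> confidence;
    deps: pos -> set of positions it depends on."""
    chosen = []
    for p in sorted(masked, key=lambda q: -conf[q]):
        # (1) wait until everything a position depends on is unmasked
        if deps.get(p, set()) & masked:
            continue
        # (2) don't co-unmask uncertain positions coupled to this batch
        if conf[p] < tau and any(p in deps.get(q, set()) for q in chosen):
            continue
        chosen.append(p)
    return chosen

masked, conf = {3, 5, 7}, {3: 0.95, 5: 0.60, 7: 0.97}
deps = {5: {3}}                      # token 5 depends on token 3
print(select_positions(masked, conf, deps))  # [7, 3]; 5 waits for 3
```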

[55] InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen

Main category: cs.CL

TL;DR: InftyThink+ is an RL framework for optimizing iterative reasoning in large language models, improving accuracy while reducing computational costs.

DetailsMotivation: Traditional chain-of-thought reasoning suffers from quadratic computational costs, context length limits, and degraded reasoning due to lost-in-the-middle effects. Existing iterative reasoning methods rely on fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning.

Method: InftyThink+ uses an end-to-end reinforcement learning framework with model-controlled iteration boundaries and explicit summarization. It employs a two-stage training scheme: supervised cold-start followed by trajectory-level reinforcement learning to learn strategic summarization and continuation decisions.

Result: On DeepSeek-R1-Distill-Qwen-1.5B, InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning. It also generalizes better to out-of-distribution benchmarks while significantly reducing inference latency and accelerating RL training.

Conclusion: InftyThink+ demonstrates improved reasoning efficiency alongside stronger performance by optimizing the entire iterative reasoning trajectory through learned summarization strategies.

Abstract: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. Iterative reasoning mitigates these issues by periodically summarizing intermediate thoughts, yet existing methods rely on supervised learning or fixed heuristics and fail to optimize when to summarize, what to preserve, and how to resume reasoning. We propose InftyThink+, an end-to-end reinforcement learning framework that optimizes the entire iterative reasoning trajectory, building on model-controlled iteration boundaries and explicit summarization. InftyThink+ adopts a two-stage training scheme with supervised cold-start followed by trajectory-level reinforcement learning, enabling the model to learn strategic summarization and continuation decisions. Experiments on DeepSeek-R1-Distill-Qwen-1.5B show that InftyThink+ improves accuracy by 21% on AIME24 and outperforms conventional long chain-of-thought reinforcement learning by a clear margin, while also generalizing better to out-of-distribution benchmarks. Moreover, InftyThink+ significantly reduces inference latency and accelerates reinforcement learning training, demonstrating improved reasoning efficiency alongside stronger performance.
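
The iterate-summarize-continue loop that InftyThink+ optimizes can be sketched as follows; `generate` is a hypothetical LM call, and in the trained model the iteration boundary and summary content are decided by the model itself rather than by fixed templates.

```python
# Sketch: the iterative reasoning loop InftyThink+ optimizes end-to-end.
# `generate` is a hypothetical LM call; templates are illustrative.
def iterative_reason(question, generate, max_rounds=8):
    summary = ""
    for _ in range(max_rounds):
        segment = generate(f"{question}\nSummary so far: {summary}\n"
                           f"Continue reasoning:")
        if "FINAL ANSWER:" in segment:             # model chose to stop
            return segment.split("FINAL ANSWER:")[-1].strip()
        # carry a compressed summary instead of the full trace, keeping
        # per-round context (and cost) roughly constant
        summary = generate(f"Summarize the key findings so far:\n"
                           f"{summary}\n{segment}")
    return summary
```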

[56] ExpressivityBench: Can LLMs Communicate Implicitly?

Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake

Main category: cs.CL

TL;DR: ExpressivityBench: A framework for evaluating LLMs’ ability to communicate implicit information (emotion, identity, tone) beyond literal meanings using information-theoretic models.

DetailsMotivation: Human communication involves implicit aspects like tone, identity, and intent beyond literal meanings. While LLMs excel at explicit tasks, their capacity for expressivity (implicit communication) remains underexplored and needs systematic evaluation.

Method: Introduces ExpressivityBench framework using information-theoretic communication models to quantify how well LLM-generated text communicates target properties without explicit mention. Uses LLM-based graders validated against human judgments across nine tasks spanning emotion, identity, and tone.

Result: Models are adept at expressing affective content but struggle with sociolinguistic signals, lagging behind human baselines. The framework enables scalable and reproducible evaluation of implicit communication capabilities.

Conclusion: Provides a necessary step to evaluate human-like implicit communication in LLMs, with implications for education, mental health support, and socially-aware dialogue systems. The benchmark enables systematic assessment of expressivity beyond literal language understanding.

Abstract: Human communication is often implicit, conveying tone, identity, and intent beyond literal meanings. While large language models have achieved strong performance on explicit tasks such as summarization and reasoning, their capacity for expressivity, or implicit communication, remains underexplored. We introduce \textbf{ExpressivityBench}, a framework for evaluating the expressivity of LLMs using information-theoretic communication models. Our approach quantifies how well LLM-generated text communicates target properties without explicit mention, across nine tasks spanning emotion, identity, and tone. To enable scalable and reproducible evaluation, we employ LLM-based graders validated against human judgments. Our results reveal that while models are adept at expressing affective content, they struggle with sociolinguistic signals, lagging behind human baselines. This study provides a necessary step to evaluate human-like implicit communication, with implications for applications such as education, mental health support, and socially-aware dialogue systems. We provide code and data for our benchmark alongside our paper.
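
The evaluation reduces to a channel-decoding test: a grader must recover the target property from text that never states it. A sketch with a hypothetical `grader` call standing in for the validated LLM-based graders:

```python
# Sketch: scoring implicit communication. The generated text must not
# mention the target explicitly; a grader then tries to recover it.
# `grader` is a hypothetical LLM-based call.
def expressivity_score(samples, grader):
    """samples: list of (generated_text, target_property) pairs."""
    ok = 0
    for text, target in samples:
        if target.lower() in text.lower():
            continue  # explicit mention breaks the implicitness rule
        if grader(text) == target:
            ok += 1   # property was communicated without being named
    return ok / len(samples)
```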

[57] STAR: Stepwise Task Augmentation with Relation Learning for Aspect Sentiment Quad Prediction

Wenna Lai, Haoran Xie, Guandong Xu, Qing Li

Main category: cs.CL

TL;DR: STAR framework uses stepwise task augmentation with relation learning to improve aspect sentiment quad prediction by decomposing ASQP into auxiliary subtasks with increasing relational granularity.

DetailsMotivation: Aspect sentiment quad prediction (ASQP) is challenging due to difficulty modeling dependencies among sentiment elements and scarcity of annotated data, limiting model ability to understand relational dependencies.

Method: Proposes stepwise task augmentation framework with relation learning (STAR) that decomposes ASQP into sequence of auxiliary subtasks with increasing relational granularity, incrementally constructing auxiliary data by augmenting training data with pairwise and overall relation tasks.

Result: Extensive experiments across four benchmark datasets show STAR consistently outperforms existing methods, achieving average F1 improvements of over 2% under low-resource conditions.

Conclusion: Stepwise formulation provides effective relational learning signals that enhance quad prediction performance, particularly in low-resource scenarios.

Abstract: Aspect-based sentiment analysis (ABSA) aims to identify four sentiment elements: aspect term, aspect category, opinion term, and sentiment polarity. Together, these elements form a complete picture of the expressed sentiment. The most challenging task, aspect sentiment quad prediction (ASQP), requires predicting all four elements simultaneously and is hindered by the difficulty of accurately modeling dependencies among sentiment elements. A key challenge lies in the scarcity of annotated data, which limits the model's ability to understand and reason about the relational dependencies required for effective quad prediction. To address this challenge, we propose a stepwise task augmentation framework with relation learning (STAR) that decomposes ASQP into a sequence of auxiliary subtasks with increasing relational granularity. Specifically, STAR incrementally constructs auxiliary data by augmenting the training data with pairwise and overall relation tasks, enabling the model to capture and compose sentiment dependencies in a stepwise manner. This stepwise formulation provides effective relational learning signals that enhance quad prediction performance, particularly in low-resource scenarios. Extensive experiments across four benchmark datasets demonstrate that STAR consistently outperforms existing methods, achieving average F1 improvements of over 2% under low-resource conditions.
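
The stepwise augmentation is essentially programmatic: each annotated quad spawns pairwise relation tasks and an overall relation task. A sketch with illustrative instruction templates:

```python
# Sketch: building STAR-style auxiliary subtasks from one sentiment
# quad, from pairwise relations up to the full quad. Templates are
# illustrative.
from itertools import combinations

def augment(text, quad):
    aspect, category, opinion, polarity = quad
    elems = {"aspect": aspect, "category": category,
             "opinion": opinion, "polarity": polarity}
    tasks = []
    # pairwise relation tasks: one element pair at a time
    for (n1, v1), (n2, v2) in combinations(elems.items(), 2):
        tasks.append((f"Extract the {n1} and {n2}: {text}", f"{v1} | {v2}"))
    # overall relation task: the full quad (ASQP itself)
    tasks.append((f"Extract the sentiment quad: {text}",
                  " | ".join(elems.values())))
    return tasks

quad = ("battery life", "laptop#battery", "amazing", "positive")
for prompt, target in augment("The battery life is amazing.", quad)[:3]:
    print(prompt, "->", target)
```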

[58] Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

Sungduk Yu, Man Luo, Avinash Madasu, Vasudev Lal, Phillip Howard

Main category: cs.CL

TL;DR: A comprehensive dataset of 788,984 AI-written peer reviews paired with human reviews is introduced to benchmark AI text detection in peer review, revealing the difficulty of identifying AI-generated text at the individual review level.

DetailsMotivation: The paper addresses the risk of negligent reviewers using LLMs for peer review and the lack of resources for benchmarking AI text detection in this domain, which threatens the integrity of scientific publishing.

Method: Created a large dataset of AI-written peer reviews paired with human reviews from ICLR and NeurIPS (8 years), evaluated 18 existing AI detection algorithms, explored context-aware detection method called Anchor, and analyzed sensitivity to LLM-assisted editing.

Result: Revealed significant difficulty in identifying AI-generated text at individual peer review level, highlighting limitations of current detection methods and the need for better tools.

Conclusion: Urgent need for new tools and methods to detect unethical use of generative AI in peer review to maintain scientific integrity.

Abstract: Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.

[59] Harnessing the Unseen: The Hidden Influence of Intrinsic Knowledge in Long-Context Language Models

Yu Fu, Haz Sameen Shahgir, Hui Liu, Xianfeng Tang, Qi He, Yue Dong

Main category: cs.CL

TL;DR: The paper investigates how language models’ parametric knowledge affects content generation in long contexts, showing that parametric recall ability doesn’t improve with contextual retrieval ability and can even be hindered by it.

DetailsMotivation: Recent long-context language models focus on external contextual information but neglect the influence of parametric knowledge. The authors want to understand how this internal knowledge affects content generation as context length increases.

Method: The authors design a Hybrid Needle-in-a-Haystack test to evaluate models across both parametric recall ability (using internal knowledge) and extrinsic retrieval ability (using external context), rather than focusing only on retrieval.

Result: Qwen-2.5 models significantly outperform Llama-3.1 models in combining both abilities. Even powerful models like Llama-3.1-70B-Instruct fail to show better performance, revealing limitations in current evaluation approaches.

Conclusion: Models need to be evaluated from a dual-ability perspective considering both parametric recall and contextual retrieval. Better retrieval ability can interfere with parametric recall, limiting model potential.

Abstract: Recent advances in long-context language models (LCLMs), designed to handle extremely long contexts, primarily focus on utilizing external contextual information, often leaving the influence of language models’ parametric knowledge underexplored. In this work, we first investigate how this parametric knowledge affects content generation and demonstrate that its impact becomes increasingly pronounced as context length extends. Furthermore, we show that the model’s ability to utilize parametric knowledge, which we call parametric recall ability, does not improve in step with its ability to leverage contextual knowledge through extrinsic retrieval ability. Moreover, better extrinsic retrieval ability can interfere with the model’s parametric recall ability, limiting its full potential. To bridge this gap, we design a simple yet effective Hybrid Needle-in-a-Haystack test that evaluates models based on their capabilities across both abilities, rather than solely emphasizing extrinsic retrieval ability. Our experimental results reveal that Qwen-2.5 models significantly outperform Llama-3.1 models, demonstrating superior potential to combine various abilities. Moreover, even the more powerful Llama-3.1-70B-Instruct model fails to exhibit better performance, highlighting the importance of evaluating models from a dual-ability perspective.

[60] Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters

Takashi Morita, Timothy J. O’Donnell

Main category: cs.CL

TL;DR: The paper presents a computational method for learning the Germanic-Latinate etymological distinction in English from phonotactic information alone, showing that historical word origins can be inferred from phonological patterns without explicit etymological knowledge.

DetailsMotivation: The paper addresses a learnability paradox: native words and loanwords follow different phonological rules in languages like English, but language learners presumably don't have access to historical etymological information. The authors investigate whether these etymological distinctions can be learned from observable phonotactic patterns.

Method: The authors performed unsupervised clustering on corpus-extracted English words using phonotactic information. They analyzed whether the resulting word clusters would align with the Germanic-Latinate etymological distinction without any explicit etymological labels provided to the model.

Result: The unsupervised clustering successfully recovered the Germanic-Latinate distinction in English words. The model-discovered clusters aligned with known linguistic generalizations about these etymological classes and also uncovered previously unrecognized features of the quasi-etymological clusters.

Conclusion: The findings demonstrate that etymological structure in languages can be discovered from phonotactic cues alone, providing a general cross-linguistic approach to understanding how language learners might acquire etymological distinctions without explicit historical knowledge.

Abstract: Cross-linguistically, native words and loanwords follow different phonological rules. In English, for example, words of Germanic and Latinate origin exhibit different stress patterns, and a certain syntactic structure, double-object datives, is predominantly associated with Germanic verbs rather than Latinate verbs. From the perspective of language acquisition, however, such etymology-based generalizations raise learnability concerns, since the historical origins of words are presumably inaccessible information for general language learners. In this study, we present computational evidence indicating that the Germanic-Latinate distinction in the English lexicon is learnable from the phonotactic information of individual words. Specifically, we performed an unsupervised clustering on corpus-extracted words, and the resulting word clusters largely aligned with the etymological distinction. The model-discovered clusters also recovered various linguistic generalizations documented in the previous literature regarding the corresponding etymological classes. Moreover, our model also uncovered previously unrecognized features of the quasi-etymological clusters. Taken together with prior results from Japanese, our findings indicate that the proposed method provides a general, cross-linguistic approach to discovering etymological structure from phonotactic cues in the lexicon.

[61] Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization

Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin

Main category: cs.CL

TL;DR: Proposes multi-stage influence function to attribute fine-tuned LLM predictions to pre-training data using EK-FAC approximation for scalability

DetailsMotivation: Need to explain fine-tuned LLM predictions by tracing them back to pre-training data, as most knowledge comes from pre-training. Existing influence functions can't handle multi-stage training and don't scale to billion-parameter LLMs.

Method: Multi-stage influence function for full-parameter fine-tuning paradigm, using Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation to handle large-scale LLMs.
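
For orientation, the classical single-stage influence of a training example $z$ on a test example $z_{\text{test}}$ (which the paper extends across training stages) is

$$
\mathcal{I}(z_{\text{test}}, z) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta}),
$$

where $H_{\hat{\theta}}$ is the Hessian of the training loss at the learned parameters $\hat{\theta}$. The Hessian inverse is intractable at billion-parameter scale; EK-FAC replaces it with a per-layer Kronecker-factored approximation whose eigenvalues are corrected against the empirical curvature, which is what makes the computation feasible.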

Result: EK-FAC approximation shows superior scalability, and multi-stage influence function effectively attributes downstream predictions to pre-training data. Case studies on dolly-v2-3b demonstrate interpretive power with practical insights.

Conclusion: Proposed method enables scalable attribution of fine-tuned LLM predictions to pre-training data, providing valuable interpretability for large language models.

Abstract: Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute “multi-stage” influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.

[62] Large Language Models as Formalizers on Constraint Satisfaction Problems

Rikhil Amonkar, Ceyhun Efe Kayan, May Lai, Ronan Le Bras, Li Zhang

Main category: cs.CL

TL;DR: LLMs used as formalizers to generate constraint satisfaction programs rather than direct solvers, with evaluation showing comparable performance to LLM-as-solver while offering verifiability and interpretability.

DetailsMotivation: Recent work explores using LLMs as formalizers rather than end-to-end solvers to improve verifiability, interpretability, and robustness in solving constraint satisfaction problems.

Method: Systematically evaluated 6 LLMs (including 4 large reasoning models) on 4 domains using 5 pipelines with 2 types of formalism, comparing LLM-as-formalizer against LLM-as-solver in zero-shot settings.

Result: LLM-as-formalizer performs on par with LLM-as-solver while offering verifiability, interpretability, and robustness. However, reasoning-token usage and hard-coded solutions both grow with problem complexity, revealing that even state-of-the-art LLMs have limited ability to generate solutions or formal programs.

Conclusion: Even state-of-the-art LLMs have limited ability to generate solutions or formal programs, but LLM-as-formalizer approach shows promise and the paper provides actionable remedies for future research improvement.

Abstract: An emerging line of recent work advocates for using large language models (LLMs) as formalizers instead of as end-to-end solvers for various types of problems. Instead of generating the solution, the LLM generates a formal program that derives a solution via an external solver. We thoroughly investigate the formalization capability of LLMs on real-life constraint satisfaction problems. On 4 domains, we systematically evaluate 6 LLMs, including 4 large reasoning models with inference-time scaling, paired with 5 pipelines, including 2 types of formalism. We show that in zero-shot settings, LLM-as-formalizer performs on par with the mainstream LLM-as-solver while offering verifiability, interpretability, and robustness. We also observe excessive reasoning tokens and hard-coded solutions scaling with problem complexity, which demonstrates that even the state-of-the-art LLMs have limited ability to generate solutions or formal programs. We present our detailed analysis and actionable remedies to drive future research that improves LLM-as-formalizer.

[63] FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang

Main category: cs.CL

TL;DR: FLAT-LLM is a training-free structural compression method for LLMs that uses fine-grained low-rank transformations in activation space via head-wise PCA and greedy budget redistribution to achieve efficient weight compression without accuracy degradation.

DetailsMotivation: LLMs have high computational and memory demands that challenge deployment in resource-constrained environments. Existing low-rank decomposition methods suffer from accuracy degradation, expensive calibration, and inefficient architectures that hinder real-world inference speedups.

Method: FLAT-LLM reduces hidden dimensions by transforming weights using truncated eigenvectors computed via head-wise Principal Component Analysis. It employs a greedy budget redistribution strategy to adaptively allocate ranks across decoders, enabling efficient weight compression without recovery fine-tuning.
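
A minimal numpy sketch of the head-wise PCA step (array shapes and the weight-folding convention are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def headwise_truncated_basis(acts: np.ndarray, rank: int) -> np.ndarray:
    """acts: (n_samples, head_dim) calibration activations for one head.
    Returns the top-`rank` principal directions as a (head_dim, rank) basis."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(centered)
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -rank:]          # keep the leading eigenvectors

# Folding the basis into a downstream weight matrix shrinks its input width:
# W of shape (out_dim, head_dim) becomes W @ U of shape (out_dim, rank).
```

The greedy budget redistribution then adapts `rank` per decoder under a global parameter budget.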

Result: FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance across 5 models and 11 datasets, while delivering inference speedups over decomposition-based methods. Calibration completes within minutes without recovery fine-tuning.

Conclusion: FLAT-LLM provides an effective training-free structural compression method that balances efficiency and accuracy for LLM deployment in resource-constrained environments.

Abstract: Large Language Models (LLMs) have enabled remarkable progress in natural language processing, yet their high computational and memory demands pose challenges for deployment in resource-constrained environments. Although recent low-rank decomposition methods offer a promising path for structural compression, they often suffer from accuracy degradation and expensive calibration procedures, and result in inefficient model architectures that hinder real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and accurate, training-free structural compression method based on fine-grained low-rank transformations in the activation space. Specifically, we reduce the hidden dimension by transforming the weights using truncated eigenvectors computed via head-wise Principal Component Analysis, and employ a greedy budget redistribution strategy to adaptively allocate ranks across decoders. FLAT-LLM achieves efficient and effective weight compression without recovery fine-tuning, completing calibration within a few minutes. Evaluated across 5 models and 11 datasets, FLAT-LLM outperforms structural pruning baselines in generalization and downstream performance, while delivering inference speedups over decomposition-based methods.

[64] code_transformed: The Influence of Large Language Models on Code

Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen

Main category: cs.CL

TL;DR: Study shows LLMs are transforming real-world code style, with measurable trends in naming conventions, complexity, and maintainability across 20k+ GitHub repositories from 2020-2025.

DetailsMotivation: To investigate whether and how Large Language Models have transformed programming code style in real-world development, given their rapid advancement and increasing use in code generation.

Method: Analyzed code from over 20,000 GitHub repositories linked to arXiv papers (2020-2025), focusing on naming conventions, complexity, maintainability, and similarity metrics to identify trends aligned with LLM-generated code characteristics.
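
The headline naming statistic is easy to reproduce in spirit; a sketch of a per-file snake_case share (the paper's exact matching rules may differ, e.g. in how single-word lowercase names are counted):

```python
import ast
import re

SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")  # requires an underscore

def snake_case_share(source: str) -> float:
    """Fraction of function definitions in a Python source file whose
    names are multi-word snake_case."""
    names = [node.name for node in ast.walk(ast.parse(source))
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return sum(bool(SNAKE.match(n)) for n in names) / len(names) if names else 0.0
```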

Result: Found measurable evolution in coding style: snake_case function names in Python increased from 40.7% (Q1 2023) to 49.8% (Q3 2025), indicating LLM influence on real-world programming practices.

Conclusion: LLMs are demonstrably affecting real-world code style, providing first large-scale empirical evidence of their impact on programming practices beyond just code generation capabilities.

Abstract: Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 20,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake_case function names in Python code increased from 40.7% in Q1 2023 to 49.8% in Q3 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Our experimental results may provide the first large-scale empirical evidence that LLMs affect real-world programming style. We release the full experimental dataset and source code at: https://github.com/ignorancex/LLM_code

[65] Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models

Chenchen Yuan, Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.CL

TL;DR: A framework for synthesizing multiple LLMs’ moral judgments into collective moral consensus, with targeted fine-tuning for misaligned models.

DetailsMotivation: LLMs show impressive moral reasoning but diverge on complex moral dilemmas, creating inconsistencies that need addressing for safer AI systems.

Method: Aggregation mechanism fuses continuous moral acceptability scores into collective probability weighted by model reliability, plus embedding-optimization for misaligned models to minimize divergence from consensus.
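
A compact sketch of the two numerical ingredients named above, reliability-weighted fusion and the JS-divergence realignment target (the weighting scheme shown is illustrative):

```python
import numpy as np

def collective_acceptability(scores, reliability) -> float:
    """Fuse per-model moral acceptability scores (each in [0, 1]) into a
    collective probability, weighting each model by its reliability."""
    w = np.asarray(reliability, dtype=float)
    return float(np.dot(w / w.sum(), scores))

def js_divergence(p, q, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence, the quantity minimized when realigning a
    deviating model's distribution q toward the consensus p."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```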

Result: Experiments on large-scale social moral dilemma dataset show the approach builds robust consensus and improves individual model fidelity.

Conclusion: Data-driven moral alignment across multiple models has value for creating safer, more consistent AI systems.

Abstract: Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.

[66] Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs

Michal Podstawski

Main category: cs.CL

TL;DR: Using pretrained text embedding models to enhance semantic analysis in labeled property graphs by embedding textual node/edge properties for improved node classification and relation prediction.

DetailsMotivation: Labeled property graphs contain rich textual attributes that could enhance analytical tasks if properly leveraged, but current approaches may not fully utilize the semantic information in these textual properties.

Method: Integrates pretrained text embedding models to embed textual node and edge properties, enabling semantic analysis without altering graph structure. The embeddings are then used to support downstream tasks like node classification and relation prediction.

Result: Demonstrates that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis, showing improved performance on node classification and relation prediction tasks.

Conclusion: Pretrained text embedding models can effectively leverage textual attributes in property graphs to enhance semantic analysis and improve performance on graph analytical tasks without structural modifications.

Abstract: Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.

[67] Encoding syntactic objects and Merge operations in function spaces

Matilde Marcolli, Robert C. Berwick

Main category: cs.CL

TL;DR: The paper presents a mathematical framework for representing syntactic structures as functions in a function space, showing how Merge operations can be implemented through algebraic structures like semirings and Hopf algebras, with potential neurocomputational realizations.

DetailsMotivation: To bridge theoretical linguistics (specifically syntactic structures and Merge operations) with mathematical representations that could potentially be implemented in neural systems, addressing the gap between abstract linguistic theory and neurocomputational realization.

Method: Develops a mathematical framework using function spaces, commutative non-associative semirings built with second Renyi entropy, operad algebras, and Hopf algebra Markov chains to represent syntactic objects and Merge operations as transformations on functional representations.
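
For reference, the semiring construction rests on the second Rényi (collision) entropy, which for a distribution $p = (p_1, \dots, p_n)$ is

$$
H_2(p) = -\log \sum_{i=1}^{n} p_i^2 .
$$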

Result: Shows that arbitrary syntactic objects can be faithfully represented in function spaces, with Merge operations implementable through algebraic operations, and demonstrates a specific case using cross-frequency phase synchronization on sinusoidal waves.

Conclusion: Provides theoretical evidence for possible neurocomputational realization of core syntactic structures, connecting linguistic theory with mathematical representations that could be implemented in neural systems.

Abstract: We provide a mathematical argument showing that, given a representation of lexical items as functions (wavelets, for instance) in some function space, it is possible to construct a faithful representation of arbitrary syntactic objects in the same function space. This space can be endowed with a commutative non-associative semiring structure built using the second Renyi entropy. The resulting representation of syntactic objects is compatible with the magma structure. The resulting set of functions is an algebra over an operad, where the operations in the operad model circuits that transform the input wave forms into a combined output that encodes the syntactic structure. The action of Merge on workspaces is faithfully implemented as action on these circuits, through a coproduct and a Hopf algebra Markov chain. The results obtained here provide a constructive argument showing the theoretical possibility of a neurocomputational realization of the core computational structure of syntax. We also present a particular case of this general construction where this type of realization of Merge is implemented as a cross-frequency phase synchronization on sinusoidal waves. This also shows that Merge can be expressed in terms of the successor function of a semiring, thus clarifying the well-known observation of its similarities with the successor function of arithmetic.

[68] D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Weibo Zhou, Lingbo Li, Shangsong Liang

Main category: cs.CL

TL;DR: D-SCoRE is a training-free framework that automatically generates diverse QA datasets with Chain-of-Thought reasoning from arbitrary text sources, enabling efficient domain-adaptive fine-tuning of LLMs.

DetailsMotivation: High-quality domain-specific QA datasets are scarce and expensive to create, limiting supervised fine-tuning of LLMs for specialized domains.

Method: Uses LLMs and prompt engineering with document-centric processing, segmentation, Chain-of-Thought reasoning, and structured export. Includes multi-dimensional controls like semantic role transformation, question type balancing, and counterfactual augmentation.

Result: LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. The system generates over 1,100 high-quality QA pairs per GPU-hour end-to-end.

Conclusion: D-SCoRE enables efficient, scalable domain-adaptive fine-tuning of LLMs using automatically generated QA datasets, making high-performance domain adaptation accessible even on consumer-grade hardware.

Abstract: The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export - along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation - D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.

[69] Estimating Semantic Alphabet Size for LLM Uncertainty Quantification

Lucas H. McCabe, Rimon Melamed, Thomas Hartvigsen, H. Howie Huang

Main category: cs.CL

TL;DR: Improved semantic entropy estimation for LLM uncertainty quantification using semantic alphabet size adjustment to address sample coverage issues, maintaining interpretability while enhancing hallucination detection.

DetailsMotivation: Existing black-box uncertainty quantification methods for LLMs often require many samples, which is computationally expensive. Semantic entropy is attractive but its discrete formulation underestimates true semantic entropy. Recent extensions improve performance but lose interpretability and add hyperparameters.

Method: Proposes a modified semantic alphabet size estimator to adjust discrete semantic entropy (DSE) for sample coverage issues. The approach maintains the interpretable discrete formulation while improving accuracy by accounting for undersampling in semantic space.
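
A sketch of the two quantities in play: the plug-in discrete semantic entropy over cluster frequencies, and a coverage-adjusted alphabet-size estimate (shown here in Chao1 style as an illustration; the paper's proposed estimator may differ in detail):

```python
import math
from collections import Counter

def discrete_semantic_entropy(cluster_ids) -> float:
    """Plug-in entropy of the semantic-cluster frequencies of the sampled
    responses; with few samples this underestimates the true entropy."""
    counts = Counter(cluster_ids)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def coverage_adjusted_alphabet_size(cluster_ids) -> float:
    """Chao1-style estimate of the number of semantic clusters, inflating
    the observed count using singleton and doubleton frequencies."""
    counts = Counter(cluster_ids)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    return len(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))
```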

Result: The proposed semantic alphabet size estimator, when used to adjust DSE, provides more accurate semantic entropy estimation. It performs as well as or better than many top-performing alternatives at flagging incorrect LLM responses while remaining highly interpretable.

Conclusion: The modified semantic alphabet size adjustment to DSE offers an improved, interpretable approach for LLM uncertainty quantification that addresses sample coverage issues without sacrificing performance on hallucination detection tasks.

Abstract: Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Therefore, practical applicability demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of SE exhibit improved LLM hallucination detection, but do so with less interpretable methods that admit additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy (DSE) estimator, finding that it underestimates the “true” semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust DSE for sample coverage results in more accurate SE estimation in our setting of interest. Furthermore, we find that two semantic alphabet size estimators, including the one we propose, flag incorrect LLM responses as well as or better than many top-performing alternatives, with the added benefit of remaining highly interpretable.

[70] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Main category: cs.CL

TL;DR: FS-DFM is a discrete flow-matching model for language generation that enables high-quality sampling with only 8 steps instead of hundreds/thousands, achieving 128× faster sampling while maintaining perplexity parity.

DetailsMotivation: Autoregressive language models are serial (one token per forward pass), limiting throughput and increasing latency for long sequences. Diffusion language models parallelize but require hundreds/thousands of steps for quality, trading serial depth for iterative breadth. Need a method that combines parallelization with few-step efficiency.

Method: FS-DFM (Few-Step Discrete Flow-Matching) makes number of sampling steps an explicit parameter, training the model to be consistent across step budgets. Uses a reliable update rule that moves probability without overshooting, plus strong teacher guidance distilled from long-run trajectories. Enables stable, accurate few-step sampling.

Result: With 8 sampling steps, FS-DFM achieves perplexity parity with a 1,024-step discrete-flow baseline when generating 1,024 tokens with a similar-size model, delivering up to 128× faster sampling and corresponding latency/throughput gains.

Conclusion: FS-DFM demonstrates that discrete flow-matching can achieve high-quality language generation with dramatically fewer sampling steps, bridging the gap between parallelization efficiency and generation speed.

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM (Few-Step Discrete Flow-Matching), a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains. Code & pretrained checkpoints: https://github.com/apple/ml-fs-dfm

[71] Let LLMs Speak Embedding Languages: Generative Text Embeddings via Iterative Contrastive Refinement

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Chi Li, Yuan-Hao Chen, Ching-Yu Tsai, Shou-De Lin

Main category: cs.CL

TL;DR: GIRCSE introduces a generative iterative refinement framework for sentence embeddings that uses autoregressive generation to produce soft tokens optimized under contrastive learning, outperforming encoder-only LLM-based embeddings.

DetailsMotivation: Existing LLM-based embeddings treat LLMs as static feature extractors, ignoring their generative capabilities. The authors aim to leverage LLMs' core generative strengths for better semantic representation learning.

Method: Proposes GIRCSE framework using autoregressive generation to iteratively refine semantic representations through sequences of soft tokens. Introduces Iterative Contrastive Refinement (ICR) objective to guide refinement steps toward better representations.
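
The contrastive backbone is standard; a minimal in-batch InfoNCE sketch of the kind of loss the ICR objective applies at each refinement step (hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: row i of `anchors` should match row i of
    `positives`; every other row in the batch serves as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```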

Result: Outperforms strong LLM-based embedding baselines on MTEB benchmark and instruction-following tasks. Exhibits emergent test-time scaling where generating more tokens at inference improves embedding quality.

Conclusion: Establishes generative iterative refinement as a new paradigm for representation learning that captures latent concepts and implicit semantics missed by encoder-only methods.

Abstract: Existing large language model (LLM)-based embeddings typically adopt an encoder-only paradigm, treating LLMs as static feature extractors and overlooking their core generative strengths. We introduce GIRCSE (Generative Iterative Refinement for Contrastive Sentence Embeddings), a novel framework that leverages autoregressive generation to iteratively refine semantic representations. By producing sequences of soft tokens optimized under a contrastive objective, GIRCSE captures latent concepts and implicit semantics that encoder-only methods often miss. To guide this process, we propose an Iterative Contrastive Refinement (ICR) objective that encourages each refinement step to yield better representations. Extensive experiments show that GIRCSE outperforms strong LLM-based embedding baselines on the MTEB benchmark and instruction-following tasks. Moreover, GIRCSE exhibits an emergent test-time scaling property: generating more tokens at inference steadily improves embedding quality. Our results establish generative iterative refinement as a new paradigm for representation learning.

[72] LeWiDi-2025 at NLPerspectives: Third Edition of the Learning with Disagreements Shared Task

Elisa Leonardelli, Silvia Casola, Siyao Peng, Giulia Rizzi, Valerio Basile, Elisabetta Fersini, Diego Frassinelli, Hyewon Jang, Maja Pavlovic, Barbara Plank, Massimo Poesio

Main category: cs.CL

TL;DR: LEWIDI shared task extends benchmark to 4 datasets with categorical and ordinal judgments, evaluates models using soft-label and perspectivist approaches with new metrics for disagreement-aware AI systems.

DetailsMotivation: AI models should be trained to recognize variation and disagreement in human judgments, requiring suitable datasets and evaluation methods to develop disagreement-aware technologies.

Method: Extended LEWIDI benchmark to four datasets (paraphrase identification, irony detection, sarcasm detection, natural language inference) with both categorical and ordinal judgments. Adopted two evaluation paradigms: soft-label (predicting population-level distributions) and perspectivist (predicting individual annotator interpretations). Developed new evaluation metrics beyond standard cross-entropy.
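
In the soft-label paradigm, the basic comparison is a predicted distribution against the empirical distribution of annotator judgments; a minimal per-item cross-entropy sketch (the task's new metrics go beyond this baseline):

```python
import numpy as np

def soft_label_cross_entropy(pred_dist, annotator_labels, num_classes: int,
                             eps: float = 1e-12) -> float:
    """Cross-entropy between the empirical distribution of annotator
    judgments for one item and the model's predicted distribution."""
    target = np.bincount(annotator_labels, minlength=num_classes)
    target = target / target.sum()
    pred = np.clip(np.asarray(pred_dist, dtype=float), eps, 1.0)
    return float(-np.sum(target * np.log(pred)))
```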

Result: Task attracted diverse participation, providing insights into strengths and limitations of methods for modeling variation. Results strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings for disagreement-aware technologies.

Conclusion: LEWIDI series promotes training and evaluating AI models to be aware of human judgment variation and disagreement, with the third edition expanding resources and evaluation approaches to support development of disagreement-aware systems.

Abstract: Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.

[73] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark

Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen

Main category: cs.CL

TL;DR: Novel task of extracting low-resourced, noisy Latin fragments from mixed-language historical documents using multimodal foundation models, benchmarked on 724 annotated pages.

DetailsMotivation: To address the challenge of processing Latin fragments in historical mixed-language documents with varied layouts, which is important for intellectual history and historical linguistics research but difficult due to low-resource nature and document complexity.

Method: Benchmarking and evaluation of large foundation models on a multimodal dataset of 724 annotated pages of mixed-language historical documents, focusing on zero-shot performance for Latin detection and comprehension.

Result: Reliable Latin detection is achievable with contemporary zero-shot models, but these models lack functional comprehension of Latin. The study establishes a comprehensive baseline for processing Latin within mixed-language corpora.

Conclusion: The research provides foundational work for Latin processing in historical documents, showing detection is possible but comprehension remains a challenge, with dataset and code made available for further research.

Abstract: This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.

[74] You Had One Job: Per-Task Quantization Using LLMs’ Hidden Representations

Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz Ziv, Avi Mendelson

Main category: cs.CL

TL;DR: Task-Aware Quantization (TAQ) methods for LLMs that allocate precision based on task relevance using hidden representation analysis, enabling better compression for specific applications.

DetailsMotivation: Current post-training quantization pipelines allocate precision uniformly without considering the target task, potentially wasting bits on irrelevant layers. Task-specific applications need more efficient quantization that focuses precision on task-relevant components.

Method: Three task-aware allocation methods: TAQ (information-stability criterion from activation geometry), TAQO (direct sensitivity to single-layer quantization), and TAQ-KL (output sensitivity via KL divergence under quantization error proxy). Uses small unlabeled calibration prompts to estimate layer importance and allocate higher precision to task-relevant layers.
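
Whatever the importance signal (TAQ, TAQO, or TAQ-KL), the allocation step reduces to spending a bit budget on the highest-ranked layers; a greedy sketch (the precision levels and budget semantics are illustrative assumptions):

```python
def allocate_precision(importance, total_bit_budget: int,
                       high: int = 8, low: int = 4) -> list:
    """Assign `high`-bit precision to the most task-relevant layers until
    the total budget is spent; remaining layers stay at `low` bits."""
    bits = [low] * len(importance)
    spent = low * len(importance)
    for i in sorted(range(len(importance)), key=lambda j: -importance[j]):
        if spent + (high - low) > total_bit_budget:
            break
        bits[i] = high
        spent += high - low
    return bits
```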

Result: Provides a post-training framework connecting mechanistic signals to quantization decisions, enabling task-aligned compression without additional training.

Conclusion: Task-aware quantization methods can improve efficiency by focusing precision on layers most relevant to specific applications, offering better compression-performance tradeoffs for narrow LLM capabilities.

Abstract: Many applications of large language models (LLMs) require only a narrow capability, yet common post-training quantization (PTQ) pipelines assign precision largely without regard to the target task. As a result, they may spend bits on layers that are less relevant to the task. We propose per-task mixed-precision PTQ guided by hidden representations. Given a small set of unlabeled calibration prompts from the target task, we estimate layer importance and allocate higher precision to task-relevant layers and lower precision to the rest, under a bit-allocation budget. We introduce three task-aware allocation signals: TAQ, which scores layers using an information-stability criterion derived from activation geometry; TAQO, which ranks layers by direct sensitivity to single-layer quantization; and TAQ-KL, which measures output sensitivity via KL divergence under a noise proxy for quantization error. Together, these methods provide a simple, post-training framework that connects mechanistic signals to quantization decisions, enabling task-aligned compression without additional training.

[75] SeSE: Black-Box Uncertainty Quantification for Large Language Models Based on Structural Information Theory

Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu

Main category: cs.CL

TL;DR: SeSE is a black-box uncertainty quantification framework for LLMs that uses semantic structural entropy to capture latent semantic structure, enabling more precise uncertainty estimates and extending to long-form generation.

DetailsMotivation: Current semantic uncertainty quantification methods overlook latent semantic structural information, limiting their ability to provide precise uncertainty estimates, especially for safety-critical applications where LLMs need to abstain when uncertain to avoid hallucinations.

Method: Proposes SeSE (Semantic Structural Entropy) framework that constructs optimal hierarchical abstraction of semantic space through encoding trees with minimal structural entropy. This structural entropy quantifies inherent uncertainty after optimal compression, and the method extends to provide interpretable granular uncertainty for long-form outputs.
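
For reference, structural information theory (Li and Pan), which SeSE builds on, defines the structural entropy of a graph $G$ under an encoding tree $T$ as

$$
H^{T}(G) = -\sum_{\alpha \in T,\, \alpha \neq \lambda} \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^{-})},
$$

where $\lambda$ is the root, $g_\alpha$ is the number of edges with exactly one endpoint inside the vertex set of tree node $\alpha$, $\mathrm{vol}(\cdot)$ is the total degree of a vertex set, and $\alpha^{-}$ is the parent of $\alpha$. SeSE minimizes this quantity over encoding trees of the LLM's semantic space.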

Result: Theoretically proves SeSE generalizes semantic entropy (gold standard for UQ in LLMs) and empirically demonstrates superior performance over strong baselines across 24 model-dataset combinations.

Conclusion: SeSE provides a principled black-box UQ framework that captures latent semantic structure for more precise uncertainty quantification, applicable to both open- and closed-source LLMs and extending to long-form generation scenarios.

Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinations, i.e., plausible yet factually incorrect responses. However, while semantic UQ methods have achieved advanced performance, they overlook latent semantic structural information that could enable more precise uncertainty estimates. In this paper, we propose Semantic Structural Entropy (SeSE), a principled black-box UQ framework applicable to both open- and closed-source LLMs. To reveal the intrinsic structure of the semantic space, SeSE constructs its optimal hierarchical abstraction through an encoding tree with minimal structural entropy. The structural entropy of this encoding tree thus quantifies the inherent uncertainty within LLM semantic space after optimal compression. Additionally, unlike existing methods that primarily focus on simple short-form generation, we extend SeSE to provide interpretable, granular uncertainty estimation for long-form outputs. We theoretically prove that SeSE generalizes semantic entropy, the gold standard for UQ in LLMs, and empirically demonstrate its superior performance over strong baselines across 24 model-dataset combinations.

[76] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang

Main category: cs.CL

TL;DR: CLaRa is a unified framework for retrieval-augmented generation that performs embedding-based compression and joint optimization in continuous space, improving efficiency and alignment between retrieval and generation.

DetailsMotivation: Retrieval-augmented generation (RAG) faces challenges with long contexts and disjoint optimization between retrieval and generation modules, leading to inefficiencies and suboptimal performance.

Method: Proposes CLaRa with SCP framework for semantic compression, using question answering and paraphrase supervision to create retrievable compressed vectors. Trains reranker and generator end-to-end via single language modeling loss with differentiable top-k estimator.
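
The differentiable top-k estimator is what lets the language-modeling loss send gradient into the reranker; one common recipe for such an estimator (not necessarily the paper's) is a straight-through relaxation:

```python
import torch

def straight_through_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard top-k selection mask on the forward pass, softmax gradients on
    the backward pass, so document scores receive gradient from the loss."""
    soft = torch.softmax(scores, dim=-1)
    idx = scores.topk(k, dim=-1).indices
    hard = torch.zeros_like(scores).scatter(-1, idx, 1.0)
    return hard + soft - soft.detach()
```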

Result: Achieves state-of-the-art compression and reranking performance across multiple QA benchmarks, outperforming text-based fine-tuned baselines even at 16x text compression rate.

Conclusion: CLaRa provides a unified approach that aligns retrieval relevance with answer quality through continuous latent reasoning, addressing key limitations of traditional RAG systems.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, thereby reducing the document length fed into the generator, we introduce SCP, a key-preserving data synthesis framework based on question answering and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, even at a text compression rate of 16, outperforming text-based fine-tuned baselines.

[77] Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

Zhijun Chen, Zeyu Ji, Qianren Mao, Hao Wu, Junhang Cheng, Bangjie Qin, Zhuoran Li, Jingzheng Li, Kai Sun, Zizhe Wang, Yikun Ban, Zhu Sun, Xiangyang Ji, Hailong Sun

Main category: cs.CL

TL;DR: LLM-PeerReview: An unsupervised ensemble method that selects the best response from multiple LLM-generated candidates using a peer-review-inspired framework with LLM-as-a-Judge scoring and aggregation techniques.

DetailsMotivation: To create an unsupervised ensemble method that leverages multiple LLMs' diverse strengths to select the most ideal response for each query, providing transparent and interpretable mechanisms without requiring supervision.

Method: Three-stage framework: 1) Scoring using LLM-as-a-Judge technique where multiple LLMs evaluate each response, 2) Reasoning via averaging or graphical model-based truth inference to aggregate scores, 3) Selection of highest-scoring response as ensemble output.
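
The averaging variant of the reasoning stage is tiny; a sketch (the truth-inference variant instead infers per-judge reliabilities with a graphical model before aggregating):

```python
import numpy as np

def select_best_response(score_matrix: np.ndarray) -> int:
    """score_matrix[i, j] is judge model i's score for candidate response j;
    average over judges and return the index of the winning candidate."""
    return int(np.argmax(score_matrix.mean(axis=0)))
```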

Result: Outperforms advanced model Smoothie-Global by 6.9% and 7.3% points across four datasets on diverse tasks including factual recall QA, math reasoning, and instruction following.

Conclusion: LLM-PeerReview is a conceptually simple yet empirically powerful unsupervised ensemble method that effectively harnesses collective wisdom of multiple LLMs through a transparent peer-review-inspired framework.

Abstract: We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a transparent and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a straightforward averaging strategy or a principled graphical model-based truth inference algorithm to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. Our results across four datasets show that the two variants of the proposed approach outperform the advanced model Smoothie-Global by 6.9 and 7.3 percentage points across diverse task types including factual recall QA, math reasoning, and instruction following.

[78] A.X K1 Technical Report

Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon

Main category: cs.CL

TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch on 10T tokens with controllable reasoning capabilities and strong Korean-language performance.

DetailsMotivation: To bridge the gap between reasoning capability and inference efficiency in large language models, enabling scalable deployment across diverse real-world scenarios with controllable reasoning.

Method: Uses scaling laws to optimize training configurations under fixed computational budgets, employs multi-stage data processing pipeline for 10T token corpus, and implements Think-Fusion training for user-controlled switching between thinking and non-thinking modes.

Result: A.X K1 achieves competitive performance with leading open-source models and establishes distinctive advantage in Korean-language benchmarks.

Conclusion: The model successfully demonstrates scalable deployment capabilities with controllable reasoning, showing particular strength in Korean language understanding while maintaining general competitiveness.

Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

[79] DimStance: Multilingual Datasets for Dimensional Stance Analysis

Jonas Becker, Liang-Chih Yu, Shamsuddeen Hassan Muhammad, Jan Philip Wahle, Terry Ruas, Idris Abdulmumin, Lung-Hao Lee, Nelson Odhiambo, Lilian Wanzare, Wen-Ni Liu, Tzu-Mi Lin, Zhe-Yu Xu, Ying-Lung Lin, Jin Wang, Maryam Ibrahim Mukhtar, Bela Gipp, Saif M. Mohammad

Main category: cs.CL

TL;DR: DimStance introduces a dimensional stance analysis framework using valence-arousal annotations for multilingual stance detection, moving beyond categorical labels to capture nuanced affective states.

DetailsMotivation: Traditional stance detection uses categorical labels (Favor, Neutral, Against) which oversimplify nuanced affective states. The authors aim to capture richer emotional dimensions underlying stance expressions using valence-arousal framework from affective science.

Method: Created DimStance resource with 11,746 target aspects across 5 languages and 2 domains, annotated with valence-arousal dimensions. Formulated dimensional stance regression task, benchmarked pretrained and large language models under regression and prompting settings.

Result: Fine-tuned LLM regressors show competitive performance, but challenges persist in low-resource languages. Token-based generation has limitations. Cross-lingual VA patterns reveal interesting insights about stance expression across languages.

Conclusion: DimStance provides foundation for multilingual, emotion-aware stance analysis and benchmarking, enabling fine-grained stance analysis beyond categorical labels.

Abstract: Stance detection is an established task that classifies an author’s attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.

[80] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu

Main category: cs.CL

TL;DR: inversedMixup: A framework combining Mixup’s controllability with LLM interpretability by aligning task model embeddings with LLM input space to generate human-readable augmented sentences.

DetailsMotivation: Mixup provides controllable augmentation in latent space but lacks human interpretability, while LLM-based methods produce readable outputs but offer limited control. The gap between latent embeddings and discrete tokens needs bridging for better text augmentation.

Method: Three-stage training: 1) Align task-specific model’s output embedding space with LLM’s input embedding space, 2) Use Mixup in aligned space with controllable ratio, 3) Reconstruct mixed embeddings into interpretable sentences via LLM inversion, plus mitigation of manifold intrusion.
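
Stage two is ordinary Mixup applied in the aligned space; a sketch, where `llm_invert` names a hypothetical stand-in for the LLM inversion decoder:

```python
import torch

def mixed_augmentation(e1: torch.Tensor, e2: torch.Tensor,
                       y1: torch.Tensor, y2: torch.Tensor, lam: float):
    """Interpolate two aligned sentence embeddings (and their labels) with
    ratio `lam`. A hypothetical `llm_invert(e_mix)` would then decode the
    mixed embedding into a human-readable augmented sentence."""
    e_mix = lam * e1 + (1.0 - lam) * e2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return e_mix, y_mix
```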

Result: Extensive experiments show effectiveness in few-shot and fully supervised scenarios, with improved augmentation performance and first empirical evidence of manifold intrusion in text Mixup.

Conclusion: inversedMixup successfully bridges Mixup’s controllability with LLM interpretability, enabling human-readable augmented text generation with controlled mixing ratios, while addressing manifold intrusion issues.

Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.

[81] DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis

Lung-Hao Lee, Liang-Chih Yu, Natalia Loukashevich, Ilseyar Alimova, Alexander Panchenko, Tzu-Mi Lin, Zhe-Yu Xu, Jian-Yu Zhou, Guangmin Zheng, Jin Wang, Sharanya Awasthi, Jonas Becker, Jan Philip Wahle, Terry Ruas, Shamsuddeen Hassan Muhammad, Saif M. Mohammad

Main category: cs.CL

TL;DR: DimABSA introduces the first multilingual dimensional Aspect-Based Sentiment Analysis resource with continuous valence-arousal scores, enabling fine-grained sentiment analysis beyond traditional categorical labels.

DetailsMotivation: Existing ABSA research relies on coarse-grained categorical sentiment labels (positive/negative/neutral), which limits the ability to capture nuanced affective states. The authors aim to address this by adopting a dimensional approach using continuous valence-arousal scores for more fine-grained sentiment analysis.

Method: Created DimABSA, a multilingual dimensional ABSA resource annotated with traditional ABSA elements (aspect terms, categories, opinion terms) plus continuous valence-arousal scores. Contains 76,958 aspect instances across 42,590 sentences in 6 languages and 4 domains. Introduced three subtasks combining VA scores with ABSA elements and proposed continuous F1 (cF1) metric for evaluating both categorical and continuous outputs. Benchmarked using prompted and fine-tuned LLMs.
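
The summary does not reproduce the exact cF1 definition; one plausible instantiation, in which each categorical match earns partial credit that shrinks with its valence-arousal error, might look like the following (purely illustrative, not the paper's formula):

```python
def continuous_f1(match_va_errors, n_pred: int, n_gold: int,
                  max_va_dist: float) -> float:
    """match_va_errors: VA distances for predicted aspects that match a gold
    aspect categorically; each contributes credit in [0, 1] that decreases
    with its VA error. Soft precision/recall then combine as usual."""
    credit = sum(max(0.0, 1.0 - d / max_va_dist) for d in match_va_errors)
    precision = credit / n_pred if n_pred else 0.0
    recall = credit / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```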

Result: DimABSA is shown to be a challenging benchmark. The resource provides a foundation for advancing multilingual dimensional ABSA, with comprehensive evaluation demonstrating the difficulty of the proposed tasks even for large language models.

Conclusion: DimABSA enables fine-grained sentiment analysis at both aspect and sentiment levels through continuous valence-arousal scores, bridging traditional ABSA to dimensional ABSA and providing a valuable resource for multilingual affective computing research.

Abstract: Aspect-Based Sentiment Analysis (ABSA) focuses on extracting sentiment at a fine-grained aspect level and has been widely applied across real-world domains. However, existing ABSA research relies on coarse-grained categorical labels (e.g., positive, negative), which limits its ability to capture nuanced affective states. To address this limitation, we adopt a dimensional approach that represents sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. To this end, we introduce DimABSA, the first multilingual, dimensional ABSA resource annotated with both traditional ABSA elements (aspect terms, aspect categories, and opinion terms) and newly introduced VA scores. This resource contains 76,958 aspect instances across 42,590 sentences, spanning six languages and four domains. We further introduce three subtasks that combine VA scores with different ABSA elements, providing a bridge from traditional ABSA to dimensional ABSA. Given that these subtasks involve both categorical and continuous outputs, we propose a new unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1. We provide a comprehensive benchmark using both prompted and fine-tuned large language models across all subtasks. Our results show that DimABSA is a challenging benchmark and provides a foundation for advancing multilingual dimensional ABSA.

[82] Investigating Disability Representations in Text-to-Image Models

Yang Yian, Yu Fan, Liudmila Zavolokina, Sarah Ebling

Main category: cs.CL

TL;DR: Analysis of disability representation in text-to-image models (Stable Diffusion XL & DALL-E 3) reveals persistent imbalances and the need for more inclusive portrayals.

DetailsMotivation: While text-to-image models have advanced significantly, concerns remain about how they represent social groups. Disability representations are particularly underexplored compared to characteristics like gender and race, creating a gap in understanding AI's portrayal of people with disabilities.

Method: Analyzed outputs from Stable Diffusion XL and DALL-E 3 using structured prompt design. Compared image similarities between generic disability prompts and specific disability category prompts. Evaluated mitigation strategies’ influence on disability portrayals through sentiment polarity analysis combining automatic and human evaluation.

Result: Revealed persistent representational imbalances in how people with disabilities are portrayed in AI-generated images. Found that mitigation strategies affect disability portrayals, highlighting issues with affective framing through sentiment analysis.

Conclusion: Continuous evaluation and refinement of generative models is needed to foster more diverse and inclusive portrayals of disability in AI-generated content.

Abstract: Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.

[83] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang

Main category: cs.CL

TL;DR: PACE introduces a corrective exploration approach for LLM alignment that outperforms standard DPO-R1 with Best-of-N sampling while using significantly less compute, addressing diminishing returns and policy collapse in mathematical reasoning tasks.

DetailsMotivation: Standard DPO implementations rely on Best-of-N sampling to mine golden trajectories, but this scaling approach can lead to diminishing returns and catastrophic policy collapse in mathematical reasoning due to amplified verifier noise and detrimental distribution shifts.

Method: PACE (Proximal Alignment via Corrective Exploration) replaces brute-force mining with a generation-based corrective strategy that operates with minimal budget (2<N<3), synthesizing high-fidelity preference pairs from failed explorations rather than relying on extensive sampling.

Result: PACE outperforms DPO-R1 (N=16) while using only about 1/5 of the compute, demonstrating superior robustness against reward hacking and label noise in mathematical reasoning tasks.

Conclusion: The paper challenges the scaling hypothesis in LLM alignment and shows that aggressive exploration can be counterproductive; PACE provides a more efficient and robust alternative for aligning LLMs on reasoning tasks.

Abstract: Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce \textbf{PACE} (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.

[84] Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong

Main category: cs.CL

TL;DR: CFA² is a causal inference-based jailbreaking attack that models safety mechanisms as unobserved confounders and uses front-door adjustment to bypass LLM safety alignment.

DetailsMotivation: Safety alignment in LLMs operates as latent internal states that obscure inherent capabilities, making it difficult to understand and bypass safety mechanisms. The authors aim to develop a more robust jailbreaking method with mechanistic interpretability.

Method: Models safety mechanisms as unobserved confounders from a causal perspective, applies Pearl’s Front-Door Criterion to sever confounding associations, uses Sparse Autoencoders to physically strip defense-related features, and reduces marginalization to deterministic intervention for low inference complexity.

Result: CFA² achieves state-of-the-art attack success rates while providing mechanistic interpretation of the jailbreaking process.

Conclusion: The causal perspective provides a robust framework for jailbreaking LLMs with interpretability, demonstrating that safety mechanisms can be effectively bypassed through causal inference techniques.

Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model’s inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLMs, a framework that leverages Pearl’s Front-Door Criterion to sever confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.

[85] A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering

Larissa Pusch, Alexandre Courtiol, Tim Conrad

Main category: cs.CL

TL;DR: LLM-powered interactive framework for generating and refining Cypher graph queries on Knowledge Graphs through natural language, improving accessibility while maintaining factual accuracy.

DetailsMotivation: LLMs have limitations in knowledge-intensive domains (hallucinations, outdated info, limited explainability), while text-based RAG struggles with multi-hop reasoning. Knowledge Graphs offer precise querying but require query language expertise, creating an accessibility barrier.

Method: Interactive framework where LLMs generate and explain Cypher graph queries, with users refining them through natural language. Evaluated on synthetic movie KG (90-query benchmark) and real-life KGs (Hyena and MaRDI).
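
A minimal sketch of the interactive loop this describes, assuming a generic chat-completion client (`llm`, returning a dict with `cypher` and `explanation` keys) and a Neo4j-style driver (`graph`); neither interface comes from the paper:

```python
def ask_kg(llm, graph, question, max_turns=3):
    """Generate a Cypher query with an LLM, explain it to the user,
    and let the user refine it in natural language before execution."""
    prompt = f"Write a Cypher query for: {question}\nExplain it briefly."
    reply = None
    for _ in range(max_turns):
        reply = llm(prompt)                       # query + explanation
        print(reply["explanation"])
        feedback = input("Refinement (empty to run): ")
        if not feedback:
            break
        prompt = f"Revise the Cypher query. User feedback: {feedback}"
    return graph.run(reply["cypher"])             # execute on the KG
```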

Result: Framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor. Provides insights into how model performance varies across domains through quantitative evaluation.

Conclusion: Interactive LLM-KG framework bridges the gap between natural language accessibility and precise knowledge graph querying, enabling users to leverage structured knowledge without query language expertise.

Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor, and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.

cs.CV

[86] From Blurry to Believable: Enhancing Low-quality Talking Heads with 3D Generative Priors

Ding-Jiun Huang, Yuanhao Wang, Shao-Ji Yuan, Albert Mosella-Montoro, Francisco Vicente Carrasco, Cheng Zhang, Fernando De la Torre

Main category: cs.CV

TL;DR: SuperHead enhances low-resolution animatable 3D talking head avatars using dynamics-aware 3D inversion with pre-trained generative models, producing high-fidelity 3D Gaussian Splatting models rigged to parametric head models for animation.

DetailsMotivation: Creating high-fidelity animatable 3D talking heads is crucial for immersive applications but often hindered by low-quality source materials that yield poor 3D reconstructions. Existing super-resolution techniques are ill-equipped to handle dynamic 3D inputs.

Method: SuperHead uses a novel dynamics-aware 3D inversion scheme that leverages priors from pre-trained 3D generative models. It optimizes latent representations to produce super-resolved 3D Gaussian Splatting head models, which are then rigged to parametric head models (e.g., FLAME) for animation. The inversion is jointly supervised using upscaled 2D face renderings and depth maps from diverse facial expressions and viewpoints.

Result: Experiments show SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.

Conclusion: SuperHead successfully addresses the challenge of enhancing low-resolution animatable 3D head avatars while maintaining 3D and temporal consistency during animation, preserving subject identity through a novel dynamics-aware 3D inversion approach.

Abstract: Creating high-fidelity, animatable 3D talking heads is crucial for immersive applications, yet often hindered by the prevalence of low-quality image or video sources, which yield poor 3D reconstructions. In this paper, we introduce SuperHead, a novel framework for enhancing low-resolution, animatable 3D head avatars. The core challenge lies in synthesizing high-quality geometry and textures, while ensuring both 3D and temporal consistency during animation and preserving subject identity. Despite recent progress in image, video and 3D-based super-resolution (SR), existing SR techniques are ill-equipped to handle dynamic 3D inputs. To address this, SuperHead leverages the rich priors from pre-trained 3D generative models via a novel dynamics-aware 3D inversion scheme. This process optimizes the latent representation of the generative model to produce a super-resolved 3D Gaussian Splatting (3DGS) head model, which is subsequently rigged to an underlying parametric head model (e.g., FLAME) for animation. The inversion is jointly supervised using a sparse collection of upscaled 2D face renderings and corresponding depth maps, captured from diverse facial expressions and camera viewpoints, to ensure realism under dynamic facial motions. Experiments demonstrate that SuperHead generates avatars with fine-grained facial details under dynamic motions, significantly outperforming baseline methods in visual quality.

[87] EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

Main category: cs.CV

TL;DR: EgoAVU introduces a scalable data engine for generating egocentric audio-visual narrations, questions, and answers to train multimodal LLMs for joint audio-visual understanding in egocentric videos.

DetailsMotivation: Current multimodal LLMs struggle with joint audio-visual understanding in egocentric videos due to lack of coherent joint-modality text labels, showing bias toward visual signals while neglecting audio cues.

Method: EgoAVU data engine automatically generates egocentric audio-visual narrations through cross-modal correlation modeling, with token-based video filtering and modular graph-based curation for quality and diversity. Creates EgoAVU-Instruct (3M training samples) and EgoAVU-Bench (evaluation dataset).

Result: EgoAVU-Bench reveals existing MLLMs’ limitations: heavy visual bias, audio neglect, and failure to correlate audio with visual sources. Finetuning on EgoAVU-Instruct achieves up to 113% performance improvement on EgoAVU-Bench and transfers to other benchmarks (up to 28% gain).

Conclusion: EgoAVU addresses the critical gap in joint audio-visual understanding for egocentric videos, enabling MLLMs to effectively process both modalities and showing strong transfer learning capabilities.

Abstract: Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they bias heavily toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.

[88] Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping

Chao Zhou, Tianyi Wei, Yiling Chen, Wenbo Zhou, Nenghai Yu

Main category: cs.CV

TL;DR: PKA is an efficient attention framework for multi-condition control in Diffusion Transformers that reduces computational overhead by eliminating redundant cross-modal interactions through position-aligned and keyword-scoped attention mechanisms.

DetailsMotivation: Current text-to-image models lack fine-grained control for specific requirements like spatial layouts or subject appearances. Multi-condition control in Diffusion Transformers suffers from quadratic computational and memory overhead due to inefficient "concatenate-and-attend" strategies with redundant cross-modal interactions.

Method: Proposes Position-aligned and Keyword-scoped Attention (PKA) framework with two components: Position-Aligned Attention (PAA) linearizes spatial control via localized patch alignment, and Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions using semantic-aware masking. Also introduces Conditional Sensitivity-Aware Sampling (CSAS) to reweight training objectives toward critical denoising phases.
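
To make the keyword-scoped half (KSA) concrete, here is a rough sketch of how such a semantic-aware mask could be assembled; the span bookkeeping (`cond_spans`, `relevant_ids`) is an illustrative interface, not the paper's:

```python
import torch

def keyword_scoped_mask(n_img, n_cond, cond_spans, relevant_ids):
    """Boolean attention mask (True = may attend): image tokens keep
    full self-attention but only see condition-token spans whose
    condition id survived the keyword-matching step."""
    n = n_img + n_cond
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_img, :n_img] = True                       # image self-attention
    for cid, (start, end) in cond_spans.items():      # spans in cond block
        if cid in relevant_ids:                       # keyword-scoped pruning
            mask[:n_img, n_img + start : n_img + end] = True
    mask.fill_diagonal_(True)                         # every token sees itself
    return mask
```

Pruned condition spans simply never enter the attention computation for image tokens, which is where the quadratic savings come from.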

Result: PKA achieves 10.0× inference speedup and 5.1× VRAM saving compared to conventional approaches, providing scalable and resource-friendly multi-conditioned generation with high fidelity.

Conclusion: PKA offers an efficient solution for multi-condition control in Diffusion Transformers by eliminating redundant cross-modal interactions, enabling practical deployment of fine-grained controllable image generation.

Abstract: While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional “concatenate-and-attend” strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.

[89] MGP-KAD: Multimodal Geometric Priors and Kolmogorov-Arnold Decoder for Single-View 3D Reconstruction in Complex Scenes

Luoxi Zhang, Chun Xie, Itaru Kitahara

Main category: cs.CV

TL;DR: MGP-KAD: A multimodal feature fusion framework for single-view 3D reconstruction that integrates RGB and geometric priors using class-level features and a KAN-based hybrid decoder, achieving SOTA performance on Pix3D.

DetailsMotivation: Single-view 3D reconstruction in complex real-world scenes faces challenges including noise, object diversity, and limited dataset availability. Existing methods struggle with geometric understanding and processing complex multimodal inputs.

Method: Proposes MGP-KAD framework with: 1) Multimodal feature fusion integrating RGB and geometric priors, 2) Geometric prior generation via sampling/clustering of ground-truth data to produce class-level features that dynamically adjust during training, 3) Hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome limitations of traditional linear decoders for complex multimodal inputs.

Result: Extensive experiments on Pix3D dataset demonstrate state-of-the-art performance with significant improvements in geometric integrity, smoothness, and detail preservation compared to existing methods.

Conclusion: MGP-KAD provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes by effectively integrating multimodal features and addressing geometric understanding challenges.

Abstract: Single-view 3D reconstruction in complex real-world scenes is challenging due to noise, object diversity, and limited dataset availability. To address these challenges, we propose MGP-KAD, a novel multimodal feature fusion framework that integrates RGB features and geometric priors to enhance reconstruction accuracy. The geometric prior is generated by sampling and clustering ground-truth object data, producing class-level features that dynamically adjust during training to improve geometric understanding. Additionally, we introduce a hybrid decoder based on Kolmogorov-Arnold Networks (KAN) to overcome the limitations of traditional linear decoders in processing complex multimodal inputs. Extensive experiments on the Pix3D dataset demonstrate that MGP-KAD achieves state-of-the-art (SOTA) performance, significantly improving geometric integrity, smoothness, and detail preservation. Our work provides a robust and effective solution for advancing single-view 3D reconstruction in complex scenes.

[90] Visual Autoregressive Modeling for Instruction-Guided Image Editing

Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, Tao Mei

Main category: cs.CV

TL;DR: VAREdit: A visual autoregressive framework for instruction-guided image editing that reformulates editing as next-scale prediction using scale-aligned conditioning to overcome diffusion model limitations.

DetailsMotivation: Diffusion models for image editing suffer from global denoising that entangles edited regions with entire image context, causing unintended modifications and poor adherence to instructions. Autoregressive models offer a causal, compositional alternative that can better preserve original content while following editing instructions.

Method: VAREdit reframes image editing as next-scale prediction using visual autoregressive modeling. It generates multi-scale target features conditioned on source image features and text instructions. Key innovation is Scale-Aligned Reference (SAR) module that injects scale-matched conditioning information into the first self-attention layer to address the mismatch between finest-scale source features and coarser target predictions.

Result: Outperforms leading diffusion-based methods on EMU-Edit and PIE-Bench benchmarks by substantial margins in both CLIP and GPT scores. Achieves 1.2 seconds for 512×512 editing, making it 2.2× faster than similarly sized UltraEdit.

Conclusion: VAREdit demonstrates that autoregressive models offer superior adherence to editing instructions and efficiency compared to diffusion-based approaches, with the SAR module effectively solving scale-mismatch conditioning challenges in visual autoregressive editing.

Abstract: Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On EMU-Edit and PIE-Bench benchmarks, VAREdit outperforms leading diffusion-based methods by a substantial margin in terms of both CLIP and GPT scores. Moreover, VAREdit completes a 512$\times$512 editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. Code is available at: https://github.com/HiDream-ai/VAREdit.

[91] Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving

Xuyang Chen, Conglang Zhang, Chuanheng Fu, Zihao Yang, Kaixuan Zhou, Yizhi Zhang, Jianan He, Yanfeng Zhang, Mingwei Sun, Zengmao Wang, Zhen Dong, Xiaoxiao Long, Liqiu Meng

Main category: cs.CV

TL;DR: DwD uses DINOv3 vision foundation model features as a unified bridge for Sim2Real video generation in autonomous driving, addressing the consistency-realism dilemma through subspace projection, channel drop, spatial alignment, and temporal aggregation.

DetailsMotivation: Existing Sim2Real methods for autonomous driving video generation face a fundamental Consistency-Realism Dilemma: low-level signals ensure control but compromise realism, while high-level priors facilitate photorealism but lack structural detail for consistent guidance.

Method: DwD leverages DINOv3 VFM features as a unified bridge, using Principal Subspace Projection to remove texture artifacts, Random Channel Tail Drop to mitigate structural loss, a learnable Spatial Alignment Module for high-resolution feature adaptation, and a Causal Temporal Aggregator with causal convolutions for temporal stability.
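
A rough sketch of the two feature-processing steps under stated assumptions (the rank `k`, drop ratio, and the variance-sorted channel ordering are illustrative; the paper's exact recipe may differ):

```python
import torch

def principal_subspace_project(feats, k=64):
    """Keep only the top-k principal directions of (N, C) patch
    features, discarding higher-order 'texture' components."""
    mean = feats.mean(dim=0, keepdim=True)
    f = feats - mean
    _, _, v = torch.pca_lowrank(f, q=k, center=False)  # v: (C, k)
    return f @ v @ v.T + mean                          # rank-k projection

def random_channel_tail_drop(feats, max_drop=0.25):
    """During training, zero a random-length tail of the channel
    dimension so the model cannot over-rely on the trailing channels
    (assumed here to be the lowest-variance ones)."""
    c = feats.shape[-1]
    n_drop = int(torch.randint(0, int(max_drop * c) + 1, (1,)))
    if n_drop:
        feats = feats.clone()
        feats[..., c - n_drop:] = 0.0
    return feats
```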

Result: The framework reconciles realism with control consistency, mitigates motion blur, and guarantees temporal stability in autonomous driving video generation from simulation to real-world domains.

Conclusion: DwD presents an effective solution to the Consistency-Realism Dilemma in Sim2Real video generation by leveraging vision foundation model features with specialized processing techniques for spatial and temporal consistency.

Abstract: Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by “baking in” synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HDMaps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To effectively utilize this, we employ Principal Subspace Projection to discard the high-frequency elements responsible for “texture baking,” while concurrently introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3’s high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator employing causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and guarantees temporal stability. Project page: https://albertchen98.github.io/DwD-project/

[92] MetaSSP: Enhancing Semi-supervised Implicit 3D Reconstruction through Meta-adaptive EMA and SDF-aware Pseudo-label Evaluation

Luoxi Zhang, Chun Xie, Itaru Kitahara

Main category: cs.CV

TL;DR: MetaSSP: A semi-supervised framework for single-view 3D reconstruction using implicit SDFs that leverages unlabeled data through gradient-based parameter importance estimation and SDF-aware pseudo-label weighting.

DetailsMotivation: Implicit SDF-based methods for single-view 3D reconstruction achieve high-quality surfaces but require large labeled datasets, limiting their scalability. There's a need to exploit abundant unlabeled images to reduce annotation requirements.

Method: Proposes MetaSSP with two key components: 1) Gradient-based parameter importance estimation to regularize adaptive EMA updates, and 2) SDF-aware pseudo-label weighting mechanism combining augmentation consistency with SDF variance. Uses a 10% supervised warm-up followed by unified pipeline jointly refining labeled and unlabeled data.
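
One plausible instantiation of importance-regularized EMA, sketched below; the squared-gradient importance and the per-parameter momentum schedule are common choices assumed here, not necessarily MetaSSP's:

```python
import torch

def grad_importance(model, eps=1e-12):
    """Squared-gradient importance, normalized to [0, 1] per tensor."""
    imp = {}
    for name, p in model.named_parameters():
        g = p.grad.detach() ** 2 if p.grad is not None else torch.zeros_like(p)
        imp[name] = g / (g.max() + eps)
    return imp

@torch.no_grad()
def importance_weighted_ema(teacher, student, importance, base_m=0.99):
    """EMA teacher update with per-parameter momentum that rises with
    importance, so important parameters change more cautiously."""
    for (name, t_p), s_p in zip(teacher.named_parameters(),
                                student.parameters()):
        m = base_m + (1.0 - base_m) * importance[name]
        t_p.copy_(m * t_p + (1.0 - m) * s_p)
```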

Result: On Pix3D benchmark, reduces Chamfer Distance by approximately 20.61% and increases IoU by around 24.09% compared to existing semi-supervised baselines, setting new state-of-the-art.

Conclusion: MetaSSP effectively leverages unlabeled data to improve single-view 3D reconstruction quality while reducing annotation requirements, demonstrating significant improvements over existing semi-supervised methods.

Abstract: Implicit SDF-based methods for single-view 3D reconstruction achieve high-quality surfaces but require large labeled datasets, limiting their scalability. We propose MetaSSP, a novel semi-supervised framework that exploits abundant unlabeled images. Our approach introduces gradient-based parameter importance estimation to regularize adaptive EMA updates and an SDF-aware pseudo-label weighting mechanism combining augmentation consistency with SDF variance. Beginning with a 10% supervised warm-up, the unified pipeline jointly refines labeled and unlabeled data. On the Pix3D benchmark, our method reduces Chamfer Distance by approximately 20.61% and increases IoU by around 24.09% compared to existing semi-supervised baselines, setting a new state of the art.

[93] M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning

Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu

Main category: cs.CV

TL;DR: M3 is a training-free multi-agent framework that improves text-to-image generation by decomposing complex prompts into checklists and iteratively refining constraints through specialized agents.

DetailsMotivation: Current generative models struggle with complex compositional prompts involving multiple constraints, despite achieving impressive fidelity in simpler text-to-image synthesis tasks.

Method: M3 uses a multi-agent loop with off-the-shelf foundation models: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents correct constraints one at a time, with a Verifier ensuring monotonic improvement.
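
A compact sketch of the agent loop as summarized above; the five callables are placeholders for prompted foundation models, not a released API:

```python
def m3_generate(prompt, t2i, planner, checker, refiner, editor, verifier,
                max_rounds=5):
    """Iterative inference-time refinement: decompose the prompt into
    a checklist, then fix one failed constraint per round, keeping a
    candidate only if the verifier confirms improvement."""
    checklist = planner(prompt)                     # verifiable constraints
    image = t2i(prompt)
    for _ in range(max_rounds):
        failed = [c for c in checklist if not checker(image, c)]
        if not failed:
            break
        edit_plan = refiner(prompt, image, failed[0])  # one at a time
        candidate = editor(image, edit_plan)
        if verifier(candidate, image, checklist):      # monotonic only
            image = candidate
    return image
```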

Result: M3 achieves state-of-the-art performance on the challenging OneIG-EN benchmark (0.532 overall), surpassing commercial systems like Imagen4 (0.515) and Seedream 3.0 (0.530), and substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance.

Conclusion: Intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives, establishing a new paradigm for compositional generation without costly retraining.

Abstract: Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce \textbf{M3 (Multi-Modal, Multi-Agent, Multi-Round)}, a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.

[94] Unsupervised Anomaly Detection of Diseases in the Female Pelvis for Real-Time MR Imaging

Anika Knupfer, Johanna P. Müller, Jordina A. Verdera, Martin Fenske, Claudius S. Mathy, Smiti Tripathy, Sebastian Arndt, Matthias May, Michael Uder, Matthias W. Beckmann, Stefanie Burghaus, Jana Hutter

Main category: cs.CV

TL;DR: Unsupervised anomaly detection framework for pelvic MRI using residual variational autoencoder trained on healthy scans to detect pathologies without labeled abnormal data.

DetailsMotivation: Pelvic diseases in women are often diagnosed late due to anatomical variability complicating MRI interpretation. Existing AI approaches are disease-specific and lack real-time compatibility, limiting clinical integration.

Method: Residual variational autoencoder trained exclusively on healthy sagittal T2-weighted MRI scans across diverse protocols, augmented with diffusion-generated synthetic data. Reconstruction error heatmaps identify pathological deviations from learned normal anatomy.
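
The inference step reduces to reconstruct-and-subtract; a minimal sketch, assuming `model` returns a reconstruction of the input scan (both names are placeholders):

```python
import torch

@torch.no_grad()
def anomaly_heatmap(model, scan, threshold=None):
    """Reconstruct a T2-weighted scan with an autoencoder trained only
    on healthy anatomy; per-pixel reconstruction error is the anomaly
    score."""
    recon = model(scan)                    # (B, 1, H, W) reconstruction
    heatmap = (scan - recon).abs()         # deviation from learned 'healthy'
    if threshold is None:
        return heatmap
    return heatmap, heatmap > threshold    # heatmap + binary anomaly mask
```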

Result: Achieved AUC of 0.736 on Uterine Myoma MRI Dataset with 0.828 sensitivity and 0.692 specificity. Reconstruction speed of ~92.6 frames per second enables real-time compatibility.

Conclusion: Framework establishes baseline for unsupervised anomaly detection in female pelvis, supporting future real-time MRI integration. Addresses anatomical heterogeneity and inter-observer variability challenges.

Abstract: Pelvic diseases in women of reproductive age represent a major global health burden, with diagnosis frequently delayed due to high anatomical variability, complicating MRI interpretation. Existing AI approaches are largely disease-specific and lack real-time compatibility, limiting generalizability and clinical integration. To address these challenges, we establish a benchmark framework for disease- and parameter-agnostic, real-time-compatible unsupervised anomaly detection in pelvic MRI. The method uses a residual variational autoencoder trained exclusively on healthy sagittal T2-weighted scans acquired across diverse imaging protocols to model normal pelvic anatomy. During inference, reconstruction error heatmaps indicate deviations from learned healthy structure, enabling detection of pathological regions without labeled abnormal data. The model is trained on 294 healthy scans and augmented with diffusion-generated synthetic data to improve robustness. Quantitative evaluation on the publicly available Uterine Myoma MRI Dataset yields an average area-under-the-curve (AUC) value of 0.736, with 0.828 sensitivity and 0.692 specificity. Additional inter-observer clinical evaluation extends analysis to endometrial cancer, endometriosis, and adenomyosis, revealing the influence of anatomical heterogeneity and inter-observer variability on performance interpretation. With a reconstruction speed of approximately 92.6 frames per second, the proposed framework establishes a baseline for unsupervised anomaly detection in the female pelvis and supports future integration into real-time MRI. Code is available upon request (https://github.com/AniKnu/UADPelvis); prospective datasets are available for academic collaboration.

[95] PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining

Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie

Main category: cs.CV

TL;DR: PhenoLIP: A medical vision-language model pretraining framework that incorporates structured phenotype knowledge from a multimodal knowledge graph (PhenoKG) to improve medical image understanding through phenotype-centric learning.

DetailsMotivation: Existing medical VLMs rely on coarse image-text contrastive objectives and fail to capture systematic visual knowledge encoded in medical phenotype ontologies, limiting their ability to understand structured medical concepts.

Method: Two-stage pretraining: 1) Learn knowledge-enhanced phenotype embeddings from textual ontology data, 2) Distill structured knowledge into multimodal pretraining via teacher-guided knowledge distillation using PhenoKG (520K+ image-text pairs linked to 3,000+ phenotypes).

Result: PhenoLIP outperforms previous SOTA baselines, improving BiomedCLIP by 8.85% in phenotype classification accuracy and BIOMEDICA by 15.03% in cross-modal retrieval. Also introduces PhenoBench benchmark (7,800+ image-caption pairs covering 1,000+ phenotypes).

Conclusion: Integrating phenotype-centric priors into medical VLMs enables structured and interpretable medical image understanding, demonstrating the value of incorporating domain-specific knowledge graphs for improved medical AI.

Abstract: Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image–caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.

[96] DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih

Main category: cs.CV

TL;DR: DeDPO is a semi-supervised framework that enhances DPO for diffusion model alignment by using synthetic AI feedback to augment limited human preference data, with debiased estimation to correct systematic bias in synthetic annotations.

DetailsMotivation: DPO's reliance on large-scale human preference labels creates cost and scalability bottlenecks. The need for more efficient alignment methods that can leverage cheaper synthetic feedback while maintaining quality drives this research.

Method: DeDPO integrates debiased estimation techniques from causal inference into the DPO objective, explicitly identifying and correcting systematic bias and noise in synthetic annotators (including self-training and VLMs), enabling robust learning from imperfect feedback sources.

Result: DeDPO achieves performance matching or occasionally exceeding the theoretical upper bound of models trained on fully human-labeled data, demonstrating robustness to variations in synthetic labeling methods.

Conclusion: DeDPO establishes a scalable solution for human-AI alignment using inexpensive synthetic supervision, overcoming the cost and scalability limitations of traditional DPO approaches.

Abstract: Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.

[97] AnyThermal: Towards Learning Universal Representations for Thermal Perception

Parv Maheshwari, Jay Karhade, Yogesh Chawla, Isaiah Adu, Florian Heisen, Andrew Porco, Andrew Jong, Yifei Liu, Santosh Pitla, Sebastian Scherer, Wenshan Wang

Main category: cs.CV

TL;DR: AnyThermal is a thermal backbone that learns task-agnostic thermal features by distilling knowledge from visual foundation models like DINOv2, enabling diverse thermal vision tasks without task-specific training across multiple environments.

DetailsMotivation: Existing thermal backbones are limited by task-specific training on small datasets, restricting their utility to specific environments and tasks. There's a need for a robust, general-purpose thermal feature extractor that works across diverse environments and tasks.

Method: Distills feature representations from visual foundation models (DINOv2) into a thermal encoder using thermal data from multiple environments. Introduces TartanRGBT platform and dataset - first open-source synced RGB-Thermal data collection system, collecting diverse data across 4 environments.
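
A minimal sketch of such a distillation objective, assuming per-patch features from the frozen RGB teacher and the thermal student; cosine matching is one common choice and may not be the paper's exact loss:

```python
import torch.nn.functional as F

def distill_loss(thermal_feats, rgb_feats):
    """Match the thermal student's per-patch features to the frozen
    RGB teacher's features for the aligned frame (1 - cosine)."""
    t = F.normalize(thermal_feats, dim=-1)        # (B, N, C) patch tokens
    r = F.normalize(rgb_feats, dim=-1).detach()   # teacher is frozen
    return (1.0 - (t * r).sum(dim=-1)).mean()
```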

Result: Achieves state-of-the-art results with improvements up to 36% across diverse environments and downstream tasks on existing datasets. Demonstrates effectiveness for cross-modal place recognition, thermal segmentation, and monocular depth estimation.

Conclusion: AnyThermal provides a robust thermal backbone for diverse tasks without task-specific training, enabled by knowledge distillation from visual foundation models and comprehensive multi-environment thermal dataset collection.

Abstract: We present AnyThermal, a thermal backbone that captures robust task-agnostic thermal features suitable for a variety of tasks such as cross-modal place recognition, thermal segmentation, and monocular depth estimation using thermal images. Existing thermal backbones that follow task-specific training from small-scale data result in utility limited to a specific environment and task. Unlike prior methods, AnyThermal can be used for a wide range of environments (indoor, aerial, off-road, urban) and tasks, all without task-specific training. Our key insight is to distill the feature representations from visual foundation models such as DINOv2 into a thermal encoder using thermal data from these multiple environments. To bridge the diversity gap of the existing RGB-Thermal datasets, we introduce the TartanRGBT platform, the first open-source data collection platform with synced RGB-Thermal image acquisition. We use this payload to collect the TartanRGBT dataset - a diverse and balanced dataset collected in 4 environments. We demonstrate the efficacy of AnyThermal and TartanRGBT, achieving state-of-the-art results with improvements of up to 36% across diverse environments and downstream tasks on existing datasets.

[98] Multi-Sensor Attention Networks for Automated Subsurface Delamination Detection in Concrete Bridge Decks

Alireza Moayedikia, Amirhossein Moayedikia

Main category: cs.CV

TL;DR: A deep learning framework using hierarchical attention mechanisms to fuse Ground Penetrating Radar (GPR) and Infrared Thermography (IRT) data for detecting subsurface delaminations in concrete bridge decks, with uncertainty quantification and analysis of failure modes under class imbalance.

DetailsMotivation: Subsurface delaminations in concrete bridge decks cannot be detected through visual inspection alone, requiring automated non-destructive evaluation methods. Current approaches need better integration of multiple sensor modalities (GPR and IRT) with attention mechanisms to improve detection accuracy and provide uncertainty quantification.

Method: Proposes a hierarchical attention framework with: 1) temporal self-attention for GPR electromagnetic signals, 2) spatial attention for thermal imagery, and 3) cross-modal attention with learnable embeddings to model inter-sensor correspondences. Includes Monte Carlo dropout-based uncertainty quantification that decomposes prediction confidence into model and data-driven uncertainty components.
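
The MC-dropout decomposition referenced here is standard; a sketch follows (in practice one would enable only the dropout layers rather than full train mode, and `model` is a placeholder classifier):

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=30, eps=1e-12):
    """Run T stochastic forward passes with dropout active and split
    total predictive entropy into data-driven (aleatoric) and model
    (epistemic) parts."""
    model.train()                                    # keep dropout active
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1)
                             for _ in range(n_samples)])   # (T, B, K)
    mean_p = probs.mean(dim=0)
    total = -(mean_p * (mean_p + eps).log()).sum(-1)            # H[E[p]]
    aleatoric = -(probs * (probs + eps).log()).sum(-1).mean(0)  # E[H[p]]
    epistemic = total - aleatoric                    # mutual information
    return mean_p, epistemic, aleatoric
```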

Result: The approach shows substantial performance gains over single-sensor and concatenation-based baselines on five real-world bridge datasets from SDNET2021 benchmark, particularly for balanced or moderately imbalanced data. Cross-modal attention provides meaningful improvements beyond unimodal attention alone. However, under extreme class imbalance, attention-based architectures show susceptibility to majority class bias.

Conclusion: The framework provides practitioners with criteria for selecting appropriate fusion strategies based on dataset characteristics rather than promoting universal architectural superiority. Attention mechanisms are effective but have limitations under extreme class imbalance where simpler architectures may be more robust.

Abstract: Subsurface delaminations in concrete bridge decks remain undetectable through conventional visual inspection, necessitating automated non-destructive evaluation methods. This work introduces a deep learning framework that integrates Ground Penetrating Radar (GPR) and Infrared Thermography (IRT) through hierarchical attention mechanisms. Our architecture employs temporal self-attention to process GPR electromagnetic signals, spatial attention to analyze thermal imagery, and cross-modal attention with learnable embeddings to model inter-sensor correspondences. We integrate Monte Carlo dropout-based uncertainty quantification, decomposing prediction confidence into model uncertainty and data-driven uncertainty components. Testing across five real-world bridge datasets from the SDNET2021 benchmark reveals that our approach delivers substantial performance gains over single-sensor and concatenation-based baselines when applied to balanced or moderately imbalanced data distributions. Comprehensive ablation analysis confirms that cross-modal attention mechanisms contribute meaningful improvements beyond unimodal attention alone. Critically, we identify and characterize specific failure modes: under extreme class imbalance, attention-based architectures demonstrate susceptibility to majority class bias, indicating scenarios where simpler architectural choices may prove more robust. Our findings equip practitioners with empirically-grounded criteria for selecting appropriate fusion strategies based on dataset characteristics, rather than promoting universal architectural superiority.

[99] DroneKey++: A Size Prior-free Method and New Benchmark for Drone 3D Pose Estimation from Sequential Images

Seo-Bin Hwang, Yeong-Jun Cho

Main category: cs.CV

TL;DR: DroneKey++ is a prior-free framework for 3D pose estimation, classification, and keypoint detection of drones using synthetic data and geometric reasoning.

DetailsMotivation: Existing drone pose estimation methods rely on prior information (physical sizes, 3D meshes) and suffer from limited datasets that are small-scale, single-model, and collected in constrained environments, making generalization validation difficult.

Method: Proposes DroneKey++ with keypoint encoder for simultaneous keypoint detection/classification and pose decoder using ray-based geometric reasoning with class embeddings. Uses synthetic 6DroneSyn dataset with 50K+ images across 7 drone models and 88 outdoor backgrounds via 360-degree panoramic synthesis.

Result: Achieves MAE 17.34° and MedAE 17.1° for rotation, MAE 0.135m and MedAE 0.242m for translation, with 19.25 FPS (CPU) and 414.07 FPS (GPU) inference speeds, demonstrating strong generalization across drone models and real-time capability.

Conclusion: DroneKey++ provides effective prior-free 3D pose estimation with strong generalization using synthetic data, suitable for real-time security/surveillance applications. Dataset is publicly available.

Abstract: Accurate 3D pose estimation of drones is essential for security and surveillance systems. However, existing methods often rely on prior drone information such as physical sizes or 3D meshes. At the same time, current datasets are small-scale, limited to single models, and collected under constrained environments, which makes reliable validation of generalization difficult. We present DroneKey++, a prior-free framework that jointly performs keypoint detection, drone classification, and 3D pose estimation. The framework employs a keypoint encoder for simultaneous keypoint detection and classification, and a pose decoder that estimates 3D pose using ray-based geometric reasoning and class embeddings. To address dataset limitations, we construct 6DroneSyn, a large-scale synthetic benchmark with over 50K images covering 7 drone models and 88 outdoor backgrounds, generated using 360-degree panoramic synthesis. Experiments show that DroneKey++ achieves MAE 17.34 deg and MedAE 17.1 deg for rotation, MAE 0.135 m and MedAE 0.242 m for translation, with inference speeds of 19.25 FPS (CPU) and 414.07 FPS (GPU), demonstrating both strong generalization across drone models and suitability for real-time applications. The dataset is publicly available.

[100] Addressing the Waypoint-Action Gap in End-to-End Autonomous Driving via Vehicle Motion Models

Jorge Daniel Rodríguez-Vidal, Gabriel Villalonga, Diego Porres, Antonio M. López Peña

Main category: cs.CV

TL;DR: A differentiable vehicle-model framework that bridges the gap between waypoint-based and action-based autonomous driving systems by rolling out action sequences to waypoint trajectories for consistent training and evaluation.

DetailsMotivation: Current autonomous driving benchmarks and training pipelines are predominantly waypoint-based, making it difficult to train and compare action-based policies (which directly output throttle, steer, brake), slowing progress in action-based approaches.

Method: Proposes a differentiable vehicle-model framework that takes predicted action sequences and rolls them out to corresponding ego-frame waypoint trajectories, enabling supervision in waypoint space while maintaining action-based architecture.
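
A kinematic bicycle model is the usual choice for such a rollout; the sketch below is an assumption about the specific vehicle model, but it shows how waypoint-space supervision stays differentiable with respect to the actions:

```python
import torch

def rollout_bicycle(actions, state, wheelbase=2.7, dt=0.1):
    """Roll an action sequence (T, 2) of (acceleration, steering angle)
    out to ego-frame waypoints with a kinematic bicycle model.
    `state` is a tuple of 0-d tensors (x, y, yaw, v); every step is
    differentiable, so a waypoint loss reaches the actions."""
    x, y, yaw, v = state
    waypoints = []
    for accel, steer in actions:
        x = x + v * torch.cos(yaw) * dt
        y = y + v * torch.sin(yaw) * dt
        yaw = yaw + v / wheelbase * torch.tan(steer) * dt
        v = v + accel * dt
        waypoints.append(torch.stack([x, y]))
    return torch.stack(waypoints)              # (T, 2) waypoint trajectory
```

Training can then apply an ordinary waypoint loss, e.g. `((rollout_bicycle(a, s0) - gt_waypoints) ** 2).mean()`, and backpropagate through the rollout into the action head.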

Result: Achieves consistent improvements across multiple challenging benchmarks, with state-of-the-art performance on NAVSIM navhard benchmark.

Conclusion: The framework successfully bridges the waypoint-action gap, enabling action-based architectures to be trained and evaluated within existing waypoint-based benchmarks without modifying evaluation protocols.

Abstract: End-to-End Autonomous Driving (E2E-AD) systems are typically grouped by the nature of their outputs: (i) waypoint-based models that predict a future trajectory, and (ii) action-based models that directly output throttle, steer and brake. Most recent benchmark protocols and training pipelines are waypoint-based, which makes action-based policies harder to train and compare, slowing their progress. To bridge this waypoint-action gap, we propose a novel, differentiable vehicle-model framework that rolls out predicted action sequences to their corresponding ego-frame waypoint trajectories while supervising in waypoint space. Our approach enables action-based architectures to be trained and evaluated, for the first time, within waypoint-based benchmarks without modifying the underlying evaluation protocol. We extensively evaluate our framework across multiple challenging benchmarks and observe consistent improvements over the baselines. In particular, on NAVSIM \texttt{navhard} our approach achieves state-of-the-art performance. Our code will be made publicly available upon acceptance.

[101] Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard

Main category: cs.CV

TL;DR: The paper proposes an Aligned Sparse Autoencoder with Iso-Energy Assumption to analyze geometry of vision-language model embedding spaces, revealing that sparse bimodal atoms carry cross-modal alignment while unimodal atoms explain modality gaps.

DetailsMotivation: Vision-language models align images and text successfully but their shared embedding space geometry remains poorly understood. The authors want to probe this geometry to make it interpretable and actionable.

Method: Proposes Iso-Energy Assumption exploiting cross-modal redundancy: truly shared concepts should have same average energy across modalities. Operationalizes this with Aligned Sparse Autoencoder that encourages energy consistency during training while preserving reconstruction.
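
One plausible formulation of the energy-consistency term, sketched below; the squared-activation energy and relative-difference penalty are assumptions, not necessarily the paper's objective:

```python
import torch

def iso_energy_penalty(z_img, z_txt, eps=1e-8):
    """Per-atom energy consistency across modalities: compare each SAE
    atom's mean squared activation on a batch of image codes (n, k)
    against the paired text codes (n, k)."""
    e_img = z_img.pow(2).mean(dim=0)          # (k,) mean energy on images
    e_txt = z_txt.pow(2).mean(dim=0)          # (k,) mean energy on texts
    return ((e_img - e_txt).abs() / (e_img + e_txt + eps)).mean()
```

Added to the usual SAE reconstruction and sparsity losses, a term like this pushes truly shared atoms toward equal energy in both modalities, which is what lets bimodal and unimodal atoms be told apart afterwards.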

Result: Framework reveals clear structure: (1) sparse bimodal atoms carry entire cross-modal alignment signal; (2) unimodal atoms act as modality-specific biases and fully explain modality gap; (3) removing unimodal atoms collapses gap without harming performance; (4) restricting vector arithmetic to bimodal subspace yields in-distribution edits and improved retrieval.

Conclusion: Right inductive bias can both preserve model fidelity and render latent geometry interpretable and actionable. The Iso-Energy approach provides tools for geometric analysis of multimodal embedding spaces.

Abstract: Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.

[102] ForeHOI: Feed-forward 3D Object Reconstruction from Daily Hand-Object Interaction Videos

Yuantao Chen, Jiahao Chang, Chongjie Ye, Chaoran Zhang, Zhaojie Fang, Chenghong Li, Xiaoguang Han

Main category: cs.CV

TL;DR: ForeHOI is a feed-forward model that reconstructs 3D object geometry from monocular hand-object interaction videos in under one minute, using joint 2D mask inpainting and 3D shape completion to handle severe occlusions.

DetailsMotivation: Monocular videos of hand-object interactions are abundant but reconstructing 3D objects from them is challenging due to severe occlusions and complex coupled motion of camera, hands, and objects. Current methods are slow and require preprocessing.

Method: Proposes ForeHOI, a feed-forward model that jointly predicts 2D mask inpainting and 3D shape completion. Uses information exchange between 2D and 3D shape completion to handle severe hand-object occlusion. Also creates a large-scale synthetic dataset of hand-object interactions with comprehensive annotations for training.

Result: Achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around 100x speedup (under one minute inference time). Extensive experiments demonstrate effectiveness in handling severe occlusions.

Conclusion: ForeHOI enables fast, accurate 3D object reconstruction from monocular hand-object interaction videos without preprocessing, addressing occlusion challenges through joint 2D-3D shape completion. The synthetic dataset supports future research.

Abstract: The ubiquity of monocular videos capturing daily hand-object interactions presents a valuable resource for embodied intelligence. While 3D hand reconstruction from in-the-wild videos has seen significant progress, reconstructing the involved objects remains challenging due to severe occlusions and the complex, coupled motion of the camera, hands, and object. In this paper, we introduce ForeHOI, a novel feed-forward model that directly reconstructs 3D object geometry from monocular hand-object interaction videos within one minute of inference time, eliminating the need for any pre-processing steps. Our key insight is that the joint prediction of 2D mask inpainting and 3D shape completion in a feed-forward framework can effectively address the problem of severe occlusion in monocular hand-held object videos, thereby achieving results that outperform optimization-based methods. The information exchange between 2D inpainting and 3D shape completion boosts the overall reconstruction quality, enabling the framework to handle severe hand-object occlusion effectively. Furthermore, to support the training of our model, we contribute the first large-scale, high-fidelity synthetic dataset of hand-object interactions with comprehensive annotations. Extensive experiments demonstrate that ForeHOI achieves state-of-the-art performance in object reconstruction, significantly outperforming previous methods with around a 100x speedup. Code and data are available at: https://github.com/Tao-11-chen/ForeHOI.

[103] ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

Aman Anand, Amir Eskandari, Elyas Rahsno, Farhana Zulkernine

Main category: cs.CV

TL;DR: ASMa: Asymmetric Spatio-temporal Masking for skeleton action representation learning that uses complementary masking strategies for balanced feature learning, with knowledge distillation for efficient deployment.

DetailsMotivation: Existing SSL methods for skeleton-based action recognition rely on biased data augmentations that focus on masking high-motion frames and high-degree joints, leading to incomplete feature representations that struggle with varied motion patterns.

Method: Proposes Asymmetric Spatio-temporal Masking (ASMa) with two complementary masking strategies: (1) masks high-degree joints and low-motion frames, (2) masks low-degree joints and high-motion frames. Includes learnable feature alignment module and knowledge distillation for model compression.
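
One plausible reading of the two complementary masking strategies, sketched for a single skeleton sequence; how the paper combines the joint and frame selections (here, a union), and the mask ratio, are assumptions:

```python
import torch

def asymmetric_masks(seq, degree, mask_ratio=0.4):
    """seq: (T, J, C) joint coordinates; degree: (J,) joint degree in the
    skeleton graph. Returns two boolean masks of shape (T, J)."""
    T, J, _ = seq.shape
    motion = (seq[1:] - seq[:-1]).norm(dim=-1).sum(dim=1)  # per-frame motion, (T-1,)
    motion = torch.cat([motion[:1], motion])               # pad back to length T
    k_t, k_j = max(1, int(mask_ratio * T)), max(1, int(mask_ratio * J))

    low_motion = torch.zeros(T, dtype=torch.bool)
    low_motion[motion.topk(k_t, largest=False).indices] = True
    high_motion = torch.zeros(T, dtype=torch.bool)
    high_motion[motion.topk(k_t).indices] = True
    high_degree = torch.zeros(J, dtype=torch.bool)
    high_degree[degree.topk(k_j).indices] = True
    low_degree = torch.zeros(J, dtype=torch.bool)
    low_degree[degree.topk(k_j, largest=False).indices] = True

    mask_a = low_motion[:, None] | high_degree[None, :]  # view A: high-degree joints, low-motion frames
    mask_b = high_motion[:, None] | low_degree[None, :]  # view B: low-degree joints, high-motion frames
    return mask_a, mask_b
```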

Result: Outperforms existing SSL methods by 2.7-4.4% on NTU RGB+D 60/120 and PKU-MMD datasets, achieves up to 5.9% improvement in transfer learning to noisy datasets. Distilled model achieves 91.4% parameter reduction and 3x faster inference on edge devices.

Conclusion: ASMa provides more balanced and comprehensive skeleton representation learning through complementary masking strategies, enabling practical deployment in resource-constrained scenarios while maintaining competitive performance.

Abstract: Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints, such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies designed to learn the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance compared to fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.

[104] An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes

Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson, Lotta M. Ellingsen

Main category: cs.CV

TL;DR: Vision transformer model uses fingerprint images to distinguish Kabuki and Wiedemann-Steiner syndromes from controls and each other, achieving AUC scores up to 0.85, suggesting fingerprint-based AI could aid genetic syndrome diagnosis.

DetailsMotivation: Kabuki and Wiedemann-Steiner syndromes are rare developmental disorders with overlapping clinical features. Many individuals remain undiagnosed due to barriers in accessing genetic testing and expertise. Dermatoglyphic anomalies (fingerprint patterns) are established hallmarks of genetic syndromes but remain underutilized in diagnosis.

Method: Developed a vision transformer-based deep learning model that leverages fingerprint images for classification. Evaluated performance across three binary classification tasks: control vs. KS, control vs. WSS, and KS vs. WSS. Applied attention-based visualizations to identify fingerprint regions most salient to model predictions for interpretability.

Result: The model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83 respectively. Attention visualizations helped identify syndrome-specific fingerprint features.

Conclusion: The findings suggest the presence of syndrome-specific fingerprint features and demonstrate the feasibility of a fingerprint-based AI tool as a noninvasive, interpretable, and accessible diagnostic aid for early diagnosis of underdiagnosed genetic syndromes.

Abstract: Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. We evaluate model performance across three binary classification tasks. Across the three classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.

[105] MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training

Lucia Gordon, Serge Belongie, Christian Igel, Nico Lang

Main category: cs.CV

TL;DR: MMEarth-Bench introduces a multimodal environmental benchmark with 12 modalities and global data distribution to evaluate geospatial ML models, showing pretraining helps but geographic generalization remains poor, and proposes test-time training with multimodal reconstruction to improve adaptation.

DetailsMotivation: Existing geospatial benchmarks lack multimodal data and global representation, limiting evaluation of multimodal pretrained models at global scales. There's a need for comprehensive multimodal environmental tasks to assess model robustness and geographic generalization.

Method: Introduces MMEarth-Bench with 5 new multimodal environmental tasks, 12 modalities, globally distributed data, and in-/out-of-distribution test splits. Proposes TTT-MMR (test-time training with multimodal reconstruction) - a model-agnostic method using all available test-time modalities as auxiliary tasks for adaptation.
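
A minimal test-time-training loop in the spirit of TTT-MMR, assuming a generic `encoder`, one lightweight reconstruction head per auxiliary modality, and plain MSE reconstruction; the step count, optimizer, and learning rate are illustrative:

```python
import torch

def ttt_mmr_adapt(encoder, recon_heads, x_main, aux_modalities, steps=10, lr=1e-4):
    """Before predicting on a test batch, update the encoder so that small
    heads can reconstruct each auxiliary modality available at test time
    from the encoded features. `recon_heads` maps modality name -> head."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        feats = encoder(x_main)
        loss = sum(torch.nn.functional.mse_loss(recon_heads[m](feats), aux)
                   for m, aux in aux_modalities.items())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder  # adapted in place; now run the task head on encoder(x_main)
```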

Result: Benchmarking shows multimodal pretraining improves robustness in limited data settings but geographic generalization remains poor. TTT-MMR improves model performance on both random and geographic test splits, with geographic batching providing good regularization-specialization trade-off.

Conclusion: MMEarth-Bench fills the gap in multimodal geospatial evaluation, revealing limitations in geographic generalization despite pretraining benefits. TTT-MMR offers effective adaptation to new tasks and domains using multimodal reconstruction at test time.

Abstract: Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.

[106] Unsupervised MRI-US Multimodal Image Registration with Multilevel Correlation Pyramidal Optimization

Jiazheng Wang, Zeyu Liu, Min Liu, Xiang Chen, Hang Zhang

Main category: cs.CV

TL;DR: MCPO: Unsupervised multimodal medical image registration method using multilevel correlation pyramidal optimization for preoperative-to-intraoperative surgical navigation

DetailsMotivation: Surgical navigation requires accurate registration of preoperative and intraoperative multimodal images, but faces challenges due to modality differences and tissue deformation during surgery. The Learn2Reg 2025 competition provides a platform to address these multimodal registration problems.

Method: Uses modality independent neighborhood descriptors to extract features and map multimodal images to feature space. Implements multilevel pyramidal fusion optimization with dense correlation analysis and weight-balanced coupled convex optimization at different scales for global optimization and local detail complementation of displacement fields.

Result: Achieved first place in both validation and test phases of ReMIND2Reg task in Learn2Reg 2025. On Resect dataset, achieved average TRE of 1.798 mm, demonstrating effectiveness for preoperative-to-intraoperative registration.

Conclusion: MCPO method effectively addresses multimodal medical image registration challenges, showing broad applicability for surgical navigation applications with strong performance in benchmark competitions.

Abstract: Surgical navigation based on multimodal image registration has played a significant role in providing intraoperative guidance to surgeons by showing the relative position of the target area to critical anatomical structures during surgery. However, due to the differences between multimodal images and intraoperative image deformation caused by tissue displacement and removal during the surgery, effective registration of preoperative and intraoperative multimodal images faces significant challenges. To address the multimodal image registration challenges in Learn2Reg 2025, an unsupervised multimodal medical image registration method based on multilevel correlation pyramidal optimization (MCPO) is designed to solve these problems. First, the features of each modality are extracted based on the modality independent neighborhood descriptor, and the multimodal images are mapped to the feature space. Second, a multilevel pyramidal fusion optimization mechanism is designed to achieve global optimization and local detail complementation of the displacement field through dense correlation analysis and weight-balanced coupled convex optimization for input features at different scales. Our method focuses on the ReMIND2Reg task in Learn2Reg 2025. Our method achieved first place in both the validation and test phases of ReMIND2Reg. The MCPO is also validated on the Resect dataset, achieving an average TRE of 1.798 mm. This demonstrates the broad applicability of our method in preoperative-to-intraoperative image registration. The code is available at https://github.com/wjiazheng/MCPO.

[107] Accelerating Vision Transformers on Brain Processing Unit

Jinchi Tang, Yan Guo

Main category: cs.CV

TL;DR: A method to restructure Vision Transformers (DeiT) by replacing linear layers with convolutional operators to enable deployment on CNN-optimized BPU hardware without retraining.

DetailsMotivation: There's an architectural mismatch between Vision Transformers and CNN-optimized hardware (BPUs). BPUs are designed for 4D convolution operations but Transformers use 3D linear layers, making it difficult to leverage BPU acceleration for Vision Transformer deployment.

Method: Restructure Vision Transformers by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This allows the original weight parameters to be inherited without retraining or fine-tuning.
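
The core weight-inheritance trick is easy to make concrete: a linear layer is mathematically a 1×1 convolution, so its trained weights can be copied across without retraining. A minimal sketch (the paper's layer normalization replacement is omitted here):

```python
import torch
import torch.nn as nn

def linear_to_conv1x1(linear: nn.Linear) -> nn.Conv2d:
    """Re-express a Linear layer as a 1x1 Conv2d so a CNN accelerator can
    run it; the trained weights are inherited directly."""
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    with torch.no_grad():
        conv.weight.copy_(linear.weight.view(*linear.weight.shape, 1, 1))
        if linear.bias is not None:
            conv.bias.copy_(linear.bias)
    return conv

# Token tensors (B, N, C) become 4D maps (B, C, N, 1) for the convolution:
x = torch.randn(2, 197, 768)
lin = nn.Linear(768, 3072)
conv = linear_to_conv1x1(lin)
y_conv = conv(x.transpose(1, 2).unsqueeze(-1))   # (B, 3072, 197, 1)
assert torch.allclose(lin(x), y_conv.squeeze(-1).transpose(1, 2), atol=1e-4)
```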

Result: Quantized DeiT-Base achieves 80.4% accuracy on ImageNet (vs original 81.8%) with up to 3.8× inference speedup. Fine-tuned DeiT on flower classification shows only 0.5% accuracy drop for DeiT-Base.

Conclusion: The proposed restructuring enables Vision Transformers to fully utilize CNN-optimized hardware acceleration while maintaining performance, representing the first successful deployment of Vision Transformers leveraging BPU capabilities.

Abstract: With the advancement of deep learning technologies, specialized neural processing hardware such as Brain Processing Units (BPUs) have emerged as dedicated platforms for CNN acceleration, offering optimized INT8 computation capabilities for convolutional operations. Meanwhile, Vision Transformer (ViT) models, such as the Data-efficient Image Transformer (DeiT), have demonstrated superior performance and play increasingly crucial roles in computer vision tasks. However, due to the architectural mismatch between CNN-optimized hardware and Vision Transformer computation characteristics (namely, that linear layers in Transformers operate on three-dimensional data while BPU acceleration is designed for four-dimensional convolution operations), it is difficult or even impossible to leverage the BPU's advantages when deploying Vision Transformers. To address this challenge, we propose a novel approach that restructures the Vision Transformer by replacing linear layers and layer normalization operations with carefully designed convolutional operators. This enables DeiT to fully utilize the acceleration capabilities of BPUs, while allowing the original weight parameters to be inherited by the restructured models without retraining or fine-tuning. To the best of our knowledge, this is the first successful deployment of Vision Transformers that fully leverages BPU acceleration. Experiments on classification datasets demonstrate the effectiveness of our approach. Specifically, the quantized DeiT-Base model achieves 80.4% accuracy on ImageNet, compared to the original 81.8%, while obtaining up to a 3.8× inference speedup. Our fine-tuned DeiT model on the flower classification dataset also achieves excellent performance, with only a 0.5% accuracy drop for the DeiT-Base model, further demonstrating the effectiveness of our method.

[108] Adaptive and Balanced Re-initialization for Long-timescale Continual Test-time Domain Adaptation

Yanshuo Wang, Jinguang Tong, Jun Lan, Weiqiang Wang, Huijia Zhu, Haoxing Chen, Xuesong Li, Jie Hong

Main category: cs.CV

TL;DR: Proposes Adaptive-and-Balanced Re-initialization (ABR) method for continual test-time domain adaptation to maintain long-term performance by re-initializing model weights based on label flip changes.

DetailsMotivation: Current continual test-time domain adaptation (CTTA) methods focus on optimizing adaptation processes but lack consideration for long-term performance in non-stationary environments. The paper addresses whether models can adapt to continually changing environments over extended periods.

Method: Proposes ABR method that performs weight re-initialization using adaptive intervals determined by changes in label flip patterns. The method monitors label flip trajectories and resets model weights when significant changes are detected to preserve long-term performance.
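
A hedged sketch of what a flip-driven adaptive reset could look like; the window size, threshold, and the exact statistic being compared are assumptions, not the paper's policy:

```python
import torch

def label_flip_rate(prev_preds, preds):
    """Fraction of samples whose predicted label changed between two
    consecutive adaptation steps."""
    return (prev_preds != preds).float().mean().item()

def maybe_reinit(model, src_state, flip_hist, window=10, thresh=0.15):
    """If the recent label-flip rate drifts far from its running average,
    restore the source weights, so the re-initialization interval adapts
    to the flip trajectory instead of being fixed."""
    if len(flip_hist) < window:
        return False
    recent = sum(flip_hist[-window:]) / window
    overall = sum(flip_hist) / len(flip_hist)
    if abs(recent - overall) > thresh:
        model.load_state_dict(src_state)
        return True
    return False
```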

Result: Validated on extensive CTTA benchmarks, achieving superior performance compared to previous methods, demonstrating effectiveness in maintaining long-term adaptation capabilities.

Conclusion: A simple yet effective re-initialization policy based on label flip changes can significantly improve long-term performance in continual test-time domain adaptation scenarios.

Abstract: Continual test-time domain adaptation (CTTA) aims to adjust models so that they can perform well over time across non-stationary environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: Can the model adapt to continually changing environments over a long time? In this work, we explore facilitating better CTTA in the long run using a re-initialization (or reset) based method. First, we observe that the long-term performance is associated with the trajectory pattern in label flip. Based on this observed correlation, we propose a simple yet effective policy, Adaptive-and-Balanced Re-initialization (ABR), towards preserving the model’s long-term performance. In particular, ABR performs weight re-initialization using adaptive intervals. The adaptive interval is determined based on the change in label flip. The proposed method is validated on extensive CTTA benchmarks, achieving superior performance.

[109] Halt the Hallucination: Decoupling Signal and Semantic OOD Detection Based on Cascaded Early Rejection

Ningkang Peng, Chuanjie Cheng, Jingyang Mao, Xiaoqian Peng, Feng Xing, Bo Zhang, Chao Tan, Zhichao Zheng, Peiheng Li, Yanhui Gu

Main category: cs.CV

TL;DR: CER framework for efficient OOD detection using hierarchical filtering with structural and semantic anomaly detection modules

DetailsMotivation: Existing OOD detection methods waste computational resources on low-level noise and cause semantic hallucination, where networks misinterpret physical anomalies as semantic features

Method: Cascaded Early Rejection (CER) with two modules: Structural Energy Sieve (SES) uses Laplacian operator for non-parametric physical anomaly detection at network entry, and Semantically-aware Hyperspherical Energy (SHE) detector decouples feature magnitude/direction in intermediate layers for semantic deviation detection
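
The first stage of the cascade is simple to illustrate: a fixed Laplacian filter scores signal-level abnormality before any network inference, and inputs whose energy exceeds a threshold fitted on in-distribution data are rejected at the entry. Function and threshold names below are hypothetical:

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def structural_energy(x):
    """Mean squared Laplacian response for grayscale images (B, 1, H, W):
    a cheap, non-parametric proxy for signal-level corruption."""
    resp = F.conv2d(x, LAPLACIAN, padding=1)
    return resp.pow(2).mean(dim=(1, 2, 3))  # (B,)

def cascaded_early_reject(x, tau, model):
    """Reject physically anomalous inputs before full-scale inference;
    only the surviving samples reach the deep network."""
    keep = structural_energy(x) <= tau
    logits = model(x[keep]) if keep.any() else None
    return keep, logits
```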

Result: 32% computational overhead reduction, FPR95 decreased from 33.58% to 22.84%, AUROC improved to 93.97% on CIFAR-100, superior performance in sensor failure scenarios

Conclusion: CER provides efficient hierarchical OOD detection as a universal plugin for SOTA models, addressing computational waste and semantic hallucination issues

Abstract: Efficient and robust Out-of-Distribution (OOD) detection is paramount for safety-critical applications. However, existing methods still execute full-scale inference on low-level statistical noise. This computational mismatch not only incurs resource waste but also induces semantic hallucination, where deep networks forcefully interpret physical anomalies as high-confidence semantic features. To address this, we propose the Cascaded Early Rejection (CER) framework, which realizes hierarchical filtering for anomaly detection via a coarse-to-fine logic. CER comprises two core modules: 1) the Structural Energy Sieve (SES), which establishes a non-parametric barrier at the network entry using the Laplacian operator to efficiently intercept physical signal anomalies; and 2) the Semantically-aware Hyperspherical Energy (SHE) detector, which decouples feature magnitude from direction in intermediate layers to identify fine-grained semantic deviations. Experimental results demonstrate that CER not only reduces computational overhead by 32% but also achieves a significant performance leap on the CIFAR-100 benchmark: the average FPR95 drastically decreases from 33.58% to 22.84%, and AUROC improves to 93.97%. Crucially, in real-world scenarios simulating sensor failures, CER exhibits performance far exceeding state-of-the-art methods. As a universal plugin, CER can be seamlessly integrated into various SOTA models to provide performance gains.

[110] Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation

Gensheng Pei, Xiruo Jiang, Yazhou Yao, Xiangbo Shu, Fumin Shen, Byeungwoo Jeon

Main category: cs.CV

TL;DR: ConceptBank is a parameter-free calibration framework that adapts SAM3 for open-vocabulary segmentation to handle data and concept drift by constructing dataset-specific concept banks from target statistics.

DetailsMotivation: SAM3's reliance on pre-defined concept prompts makes it vulnerable to distribution shifts (data drift) and evolving label distributions (concept drift) in target domains, breaking the alignment between visual evidence and prompts.

Method: Constructs dataset-specific concept banks from target statistics using three key components: (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift.
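
A rough sketch of the bank construction under stated assumptions: pseudo-labels come from the initial prompt-based predictions, every class has at least one sample, prototypes are mean features, and outliers are dropped by cosine distance to the class mean; `keep_frac` is illustrative:

```python
import torch
import torch.nn.functional as F

def build_concept_bank(feats, labels, num_classes, keep_frac=0.8):
    """Average the most prototypical target-domain features per class,
    dropping the fraction farthest from the initial mean to suppress
    outliers under data drift."""
    bank = []
    for c in range(num_classes):
        f = F.normalize(feats[labels == c], dim=-1)
        proto = F.normalize(f.mean(dim=0), dim=0)
        sims = f @ proto                           # cosine to the class mean
        k = max(1, int(keep_frac * len(f)))
        support = f[sims.topk(k).indices]          # representative supports
        bank.append(F.normalize(support.mean(dim=0), dim=0))
    return torch.stack(bank)                       # (C, D)

def calibrated_scores(query_feats, bank):
    """Parameter-free matching: cosine similarity to the dataset-specific
    prototypes replaces static concept prompts."""
    return F.normalize(query_feats, dim=-1) @ bank.T
```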

Result: Effectively adapts SAM3 to distribution drifts in challenging natural-scene and remote-sensing scenarios, establishing new baselines for robustness and efficiency in open-vocabulary segmentation.

Conclusion: ConceptBank provides a parameter-free calibration framework that restores alignment between visual evidence and prompts under distribution shifts, making SAM3 more robust to data and concept drift in real-world applications.

Abstract: The recent introduction of SAM3 has revolutionized Open-Vocabulary Segmentation (OVS) through promptable concept segmentation, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (data drift) or conditional label distributions evolve (concept drift) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present ConceptBank, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift. We demonstrate that ConceptBank effectively adapts SAM3 to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at https://github.com/pgsmall/ConceptBank.

[111] SPDA-SAM: A Self-prompted Depth-Aware Segment Anything Model for Instance Segmentation

Yihan Shang, Wei Wang, Chao Huang, Xinghui Dong

Main category: cs.CV

TL;DR: SPDA-SAM enhances SAM for instance segmentation by adding self-prompting and depth-aware RGB-D fusion to address SAM’s manual prompt dependency and lack of spatial depth information.

DetailsMotivation: SAM's performance depends heavily on manual prompts and lacks depth information from RGB images, limiting spatial structure perception and object boundary delineation in instance segmentation.

Method: Proposes SPDA-SAM with: 1) Semantic-Spatial Self-prompt Module extracting prompts from SAM’s encoder/decoder, and 2) Coarse-to-Fine RGB-D Fusion Module fusing RGB and depth features using structural depth guidance and local depth variations.

Result: Outperforms state-of-the-art counterparts across twelve different datasets, demonstrating effectiveness of self-prompting and depth-aware fusion.

Conclusion: SPDA-SAM successfully addresses SAM’s limitations through self-prompting and depth-aware RGB-D fusion, achieving superior instance segmentation performance.

Abstract: Recently, Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance is severely dependent on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse-grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not been explored in such self-prompted and depth-aware manners. Experimental results demonstrate that our SPDA-SAM outperforms its state-of-the-art counterparts across twelve different datasets. These promising results are likely due to the guidance of the self-prompts and the compensation for the spatial information loss by the coarse-to-fine RGB-D fusion operation.

[112] Uncertainty-Aware 4D Gaussian Splatting for Monocular Occluded Human Rendering

Weiquan Wang, Feifei Shao, Lin Li, Zhen Wang, Jun Xiao, Long Chen

Main category: cs.CV

TL;DR: U-4DGS is a framework for high-fidelity rendering of dynamic humans from monocular videos that addresses occlusion challenges through probabilistic modeling and uncertainty-aware optimization.

DetailsMotivation: Existing methods for rendering dynamic humans from monocular videos degrade catastrophically under occlusions. Current solutions either use generative models that cause temporal flickering or impose rigid geometric heuristics that fail to capture diverse appearances.

Method: Reformulates the task as Maximum A Posteriori estimation under heteroscedastic observation noise. Uses Probabilistic Deformation Network and Double Rasterization pipeline to render pixel-aligned uncertainty maps that act as adaptive gradient modulators. Enforces Confidence-Aware Regularizations to prevent geometric drift in regions lacking reliable visual cues.
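
For a Gaussian noise model, the MAP-under-heteroscedastic-noise view reduces to the familiar uncertainty-weighted photometric loss below (Kendall-and-Gal style). In the paper the pixel-aligned `log_var` map comes from the double-rasterization pipeline; here it is simply an input:

```python
import torch

def heteroscedastic_nll(pred_rgb, gt_rgb, log_var):
    """Per-pixel Gaussian negative log-likelihood under heteroscedastic
    noise. Pixels the model marks as uncertain (large log_var) have their
    photometric error down-weighted, so occluded or unreliable
    observations stop dominating the gradient."""
    sq_err = (pred_rgb - gt_rgb).pow(2).sum(dim=-1)           # (H, W)
    return (0.5 * torch.exp(-log_var) * sq_err + 0.5 * log_var).mean()
```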

Result: Extensive experiments on ZJU-MoCap and OcMotion datasets demonstrate state-of-the-art rendering fidelity and robustness compared to existing methods.

Conclusion: U-4DGS effectively addresses occlusion challenges in dynamic human rendering through uncertainty-aware probabilistic modeling, achieving superior performance in both fidelity and robustness.

Abstract: High-fidelity rendering of dynamic humans from monocular videos typically degrades catastrophically under occlusions. Existing solutions incorporate external priors: they either hallucinate missing content via generative models, which induces severe temporal flickering, or impose rigid geometric heuristics that fail to capture diverse appearances. To address this, we reformulate the task as a Maximum A Posteriori estimation problem under heteroscedastic observation noise. In this paper, we propose U-4DGS, a framework integrating a Probabilistic Deformation Network and a Double Rasterization pipeline. This architecture renders pixel-aligned uncertainty maps that act as an adaptive gradient modulator, automatically attenuating artifacts from unreliable observations. Furthermore, to prevent geometric drift in regions lacking reliable visual cues, we enforce Confidence-Aware Regularizations, which leverage the learned uncertainty to selectively propagate spatial-temporal validity. Extensive experiments on ZJU-MoCap and OcMotion demonstrate that U-4DGS achieves SOTA rendering fidelity and robustness.

[113] FlowConsist: Make Your Flow Consistent with Real Trajectory

Tianyi Zhang, Chengcheng Liu, Jinwei Chen, Chun-Le Guo, Chongyi Li, Ming-Ming Cheng, Bo Li, Peng-Tao Jiang

Main category: cs.CV

TL;DR: FlowConsist: A training framework for fast flow models that addresses trajectory drift and error accumulation by using marginal velocities and trajectory rectification, achieving SOTA one-step generation on ImageNet 256×256 with FID 1.52.

DetailsMotivation: Current fast-flow models suffer from two fundamental issues: 1) conditional velocities from randomly paired noise-data samples cause systematic trajectory drift, preventing consistent ODE path following, and 2) model approximation errors accumulate over time steps, leading to severe deviations across long time intervals.

Method: Proposes FlowConsist with two key components: 1) replaces conditional velocities with marginal velocities predicted by the model itself to align optimization with true trajectory, and 2) introduces trajectory rectification strategy that aligns marginal distributions of generated and real samples at every time step along the trajectory.
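
A heavily simplified sketch of the first component, assuming an EMA copy of the model supplies the marginal-velocity target; the time schedule, any mixing with data-grounded terms, and the trajectory-rectification loss are all omitted:

```python
import torch

def flowconsist_step(model, ema_model, x1):
    """One training step where the regression target is the marginal
    velocity predicted by a frozen EMA teacher, rather than the
    conditional velocity (x1 - x0) of a random noise/data pairing,
    which is one way to remove pairing-induced trajectory drift."""
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    tb = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - tb) * x0 + tb * x1                      # linear interpolation path
    with torch.no_grad():
        target = ema_model(xt, t)                     # marginal velocity estimate
    return torch.nn.functional.mse_loss(model(xt, t), target)
```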

Result: Establishes new state-of-the-art on ImageNet 256×256, achieving FID of 1.52 with only 1 sampling step, demonstrating significant improvement in one-step generation quality.

Conclusion: FlowConsist effectively addresses trajectory consistency issues in fast flow models, enabling high-quality one-step generation by enforcing trajectory consistency through marginal velocities and distribution alignment at every time step.

Abstract: Fast flow models accelerate the iterative sampling process by learning to directly predict ODE path integrals, enabling one-step or few-step generation. However, we argue that current fast-flow training paradigms suffer from two fundamental issues. First, conditional velocities constructed from randomly paired noise-data samples introduce systematic trajectory drift, preventing models from following a consistent ODE path. Second, the model's approximation errors accumulate over time steps, leading to severe deviations across long time intervals. To address these issues, we propose FlowConsist, a training framework designed to enforce trajectory consistency in fast flows. We propose a principled alternative that replaces conditional velocities with the marginal velocities predicted by the model itself, aligning optimization with the true trajectory. To further address error accumulation over time steps, we introduce a trajectory rectification strategy that aligns the marginal distributions of generated and real samples at every time step along the trajectory. Our method establishes a new state-of-the-art on ImageNet 256×256, achieving an FID of 1.52 with only 1 sampling step.

[114] Di3PO – Diptych Diffusion DPO for Targeted Improvements in Image

Sanjana Reddy, Ishaan Malhi, Sally Ma, Praneet Dutta

Main category: cs.CV

TL;DR: Di3PO introduces a novel method for constructing positive/negative image pairs for preference tuning of text-to-image diffusion models by isolating specific regions for improvement while keeping surrounding context stable.

DetailsMotivation: Existing preference tuning methods for T2I diffusion models suffer from computationally expensive generation steps, training pairs with insufficient meaningful differences, high sampling/filtering costs, and irrelevant pixel variance that degrades training efficiency.

Method: Di3PO constructs positive and negative pairs by isolating specific regions targeted for improvement during preference tuning while maintaining stable surrounding context in the image, enabling more efficient and effective training.

Result: The method demonstrates efficacy on the challenging task of text rendering in diffusion models, showing improvements over baseline methods of supervised fine-tuning (SFT) and direct preference optimization (DPO).

Conclusion: Di3PO provides a more efficient approach to preference tuning for text-to-image diffusion models by addressing key limitations in existing pair construction methods, particularly for tasks requiring precise improvements like text rendering.

Abstract: Existing methods for preference tuning of text-to-image (T2I) diffusion models often rely on computationally expensive generation steps to create positive and negative pairs of images. These approaches frequently yield training pairs that either lack meaningful differences, are expensive to sample and filter, or exhibit significant variance in irrelevant pixel regions, thereby degrading training efficiency. To address these limitations, we introduce “Di3PO”, a novel method for constructing positive and negative pairs that isolates specific regions targeted for improvement during preference tuning, while keeping the surrounding context in the image stable. We demonstrate the efficacy of our approach by applying it to the challenging task of text rendering in diffusion models, showcasing improvements over baseline methods of SFT and DPO.

[115] Robust Pedestrian Detection with Uncertain Modality

Qian Bie, Xiao Wang, Bin Yang, Zhixi Yu, Jun Chen, Xin Xu

Main category: cs.CV

TL;DR: AUNet: Adaptive Uncertainty-aware Network for cross-modal pedestrian detection using RGB, NIR, and TIR modalities with uncertainty-aware routing and modality-aware interaction to handle arbitrary input combinations.

DetailsMotivation: Existing cross-modal pedestrian detection (CMPD) methods struggle with real-world scenarios where imaging devices may not capture all three modalities (RGB, NIR, TIR) simultaneously, leading to performance degradation when faced with unpredictable modal combinations.

Method: Proposes AUNet with Unified Modality Validation Refinement (UMVR) including uncertainty-aware router to validate modal availability and semantic refinement, plus Modality-Aware Interaction (MAI) module to adaptively activate/deactivate interaction mechanisms based on available modalities.
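
An illustrative router in this spirit, assuming per-modality global features and a tiny scoring head per modality; missing modalities are gated to zero so downstream interaction modules can be deactivated. All names are hypothetical:

```python
import torch
import torch.nn as nn

class UncertaintyRouter(nn.Module):
    """Score each modality's features; modalities that are missing or
    judged unreliable receive a gate near zero, so the interaction
    module can skip them."""
    def __init__(self, dim, num_modalities=3):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_modalities))

    def forward(self, feats):  # feats: list of (B, D) tensors or None per modality
        gates = []
        for head, f in zip(self.heads, feats):
            if f is None:                    # modality not captured at all
                gates.append(torch.zeros(1))
            else:
                gates.append(torch.sigmoid(head(f)).mean(dim=0))  # (1,)
        return gates  # multiply each modality's features by its gate
```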

Result: Constructed TRNT dataset with 8,281 pixel-aligned image triplets; AUNet effectively handles arbitrary input combinations and improves pedestrian detection performance across varying modal availability scenarios.

Conclusion: The proposed AUNet framework addresses the challenge of unpredictable modal combinations in real-world CMPD applications by adaptively utilizing available information through uncertainty-aware validation and modality-aware interaction mechanisms.

Abstract: Existing cross-modal pedestrian detection (CMPD) employs complementary information from RGB and thermal-infrared (TIR) modalities to detect pedestrians in 24h surveillance systems. RGB captures rich pedestrian details under daylight, while TIR excels at night. However, TIR focuses primarily on the person's silhouette, neglecting critical texture details essential for detection. The near-infrared (NIR) modality, in contrast, captures texture under low-light conditions, which effectively alleviates the performance issues of RGB and the detail loss of TIR, thereby reducing missed detections. To this end, we construct a new Triplet RGB-NIR-TIR (TRNT) dataset, comprising 8,281 pixel-aligned image triplets, establishing a comprehensive foundation for algorithmic research. However, due to the variable nature of real-world scenarios, imaging devices may not always capture all three modalities simultaneously. This results in input data with unpredictable combinations of modal types, which challenges existing CMPD methods: they fail to extract robust pedestrian information under arbitrary input combinations, leading to significant performance degradation. To address these challenges, we propose the Adaptive Uncertainty-aware Network (AUNet) for accurately discriminating modal availability and fully utilizing the available information under uncertain inputs. Specifically, we introduce Unified Modality Validation Refinement (UMVR), which includes an uncertainty-aware router to validate modal availability and a semantic refinement step to ensure the reliability of information within each modality. Furthermore, we design a Modality-Aware Interaction (MAI) module that adaptively activates or deactivates its internal interaction mechanisms based on the UMVR output, enabling effective complementary information fusion from the available modalities.

[116] Revisiting Salient Object Detection from an Observer-Centric Perspective

Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia, Lijun Wang, Huchuan Lu, Yangrui Shao, Chen Yang, Long Teng

Main category: cs.CV

TL;DR: Proposes Observer-Centric Salient Object Detection (OC-SOD) that incorporates observer-specific factors like preferences/intents, addressing the subjective nature of saliency detection through multi-modal LLMs and a new dataset.

DetailsMotivation: Existing salient object detection methods treat it as an objective prediction task with single groundtruth, ignoring the subjective nature where different observers may perceive different objects as salient based on their priors, preferences, or intents.

Method: 1) Proposes OC-SOD formulation considering both visual cues and observer-specific factors; 2) Uses multi-modal LLMs to create efficient annotation pipeline; 3) Constructs OC-SODBench dataset with 33k images and 152k textual prompts; 4) Designs OC-SODAgent with “Perceive-Reflect-Adjust” process.

Result: Created first OC-SOD dataset (OC-SODBench) with 33k training/validation/test images and 152k textual prompts. Developed OC-SODAgent baseline that performs observer-centric saliency detection. Extensive experiments justify effectiveness.

Conclusion: Observer-centric approach bridges gap between human perception and computational modeling, offering more realistic and flexible understanding of saliency. Addresses intrinsic ambiguity and diversity of human perception.

Abstract: Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single groundtruth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only the visual cues but also the observer-specific factors such as their preferences or intents. As a result, this formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset named OC-SODBench, comprising 33k training, validation and test images with 152k textual prompts and object pairs. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline which performs OC-SOD via a human-like “Perceive-Reflect-Adjust” process. Extensive experiments on our proposed OC-SODBench have justified the effectiveness of our contribution. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly “salient.” Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD

[117] POINTS-GUI-G: GUI-Grounding Journey

Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou

Main category: cs.CV

TL;DR: POINTS-GUI-G-8B achieves state-of-the-art GUI grounding performance by starting from a base model with minimal grounding ability and implementing refined data engineering, improved training strategies, and reinforcement learning with verifiable rewards.

DetailsMotivation: To master the full technical pipeline for GUI agents by starting from a base model with minimal grounding ability (like POINTS-1.5) rather than fine-tuning models already possessing strong spatial awareness, enabling precise localization of interface elements for accurate operations.

Method: Three key approaches: (1) Refined data engineering with unified dataset formats, augmentation, filtering, and difficulty grading; (2) Improved training strategies including continuous fine-tuning of vision encoder and resolution consistency; (3) Reinforcement learning with verifiable rewards adapted for perception-intensive GUI grounding.
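
The "verifiable reward" property is easy to see: grounding success is an exact geometric test rather than a fuzzy quality judgment. A minimal point-in-box reward (the paper's exact reward shaping may differ):

```python
def grounding_reward(pred_xy, gt_box):
    """Return 1.0 if the predicted click point lands inside the
    ground-truth element box, else 0.0. This cheap, exact check is what
    makes RL rewards for GUI grounding easily verifiable."""
    x, y = pred_xy
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
```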

Result: Achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision.

Conclusion: The approach successfully masters GUI grounding from minimal base capabilities, demonstrating that RL with verifiable rewards significantly improves precision in perception-intensive tasks, and GUI grounding provides natural advantages for RL due to easily verifiable rewards.

Abstract: The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.

[118] EUGens: Efficient, Unified, and General Dense Layers

Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski

Main category: cs.CV

TL;DR: EUGens are efficient dense layers that generalize fully-connected feedforward layers using random features and input norm dependence, reducing quadratic to linear complexity while maintaining expressive power.

DetailsMotivation: Fully-connected feedforward layers create computation and parameter bottlenecks in neural networks, limiting scalability for real-time applications and resource-constrained environments.

Method: Proposed EUGens layers leverage random features to approximate standard FFLs with input norm dependence, unify existing efficient FFL extensions, and use layer-wise knowledge transfer without backpropagation.
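
To make the "unbiased random features for polynomial activations" claim concrete, here is a toy construction of my own for the square activation y_j = (w_j·x)^2, using the identity E[(g·x)(g·w)] = x·w for g ~ N(0, I). It is not the paper's estimator, only an illustration of why such unbiased random-feature sketches exist:

```python
import torch

def rf_square_ffl(x, W, num_pairs=64, gen=None):
    """Unbiased random-feature estimate of a dense layer with square
    activation: two independent Gaussian draws per sample give
    E[(g.x)(g.w)(h.x)(h.w)] = (x.w)^2; averaging reduces variance.
    x: (d,), W: (n, d)."""
    d = x.shape[-1]
    g = torch.randn(num_pairs, d, generator=gen)
    h = torch.randn(num_pairs, d, generator=gen)
    # The weight projections (g @ W.T, h @ W.T) can be precomputed once per
    # layer, leaving only O(num_pairs * d) input-dependent work per query.
    gx, hx = g @ x, h @ x                  # (m,), (m,)
    gW, hW = g @ W.T, h @ W.T              # (m, n), (m, n)
    return ((gx[:, None] * gW) * (hx[:, None] * hW)).mean(dim=0)  # (n,)
```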

Result: Integration into Transformers and MLPs yields up to 27% faster inference and 30% memory efficiency improvements across image classification, language model pre-training, and 3D scene reconstruction tasks.

Conclusion: EUGens enable scalable deployment of large-scale neural networks by reducing computational overhead while preserving model performance and adaptability.

Abstract: Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers: Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

[119] TFusionOcc: Student’s t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

Main category: cs.CV

TL;DR: TFusionOcc: An object-centric multi-sensor fusion framework for 3D semantic occupancy prediction using deformable superquadrics and T-Mixture models, achieving SOTA on nuScenes benchmark.

DetailsMotivation: Existing 3D semantic occupancy prediction methods rely on 3D voxel volumes or Gaussians, which hinder efficient capture of fine-grained geometric details in driving environments. There's a need for more geometrically flexible representations.

Method: Proposes TFusionOcc with multi-stage multi-sensor fusion, Student’s t-distribution, T-Mixture model (TMM), and deformable superquadrics (superquadrics with inverse warp) as geometrically flexible primitives for object-centric representation.

Result: Achieved state-of-the-art performance on nuScenes benchmark and demonstrated robustness in different camera and lidar corruption scenarios on nuScenes-C dataset.

Conclusion: TFusionOcc effectively addresses geometric detail capture limitations in 3D semantic occupancy prediction through object-centric representation with flexible primitives and robust fusion techniques.

Abstract: 3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods for 3D semantic occupancy prediction rely heavily on 3D voxel volumes or a set of 3D Gaussians, hindering the model’s ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, Student’s t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieved state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method in different camera and lidar corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc

[120] MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing

Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li

Main category: cs.CV

TL;DR: MeDocVL: A vision-language model for query-driven medical document parsing using training-driven label refinement and noise-aware hybrid post-training to handle complex layouts and noisy annotations.

DetailsMotivation: Medical document OCR faces challenges with complex layouts, domain-specific terminology, and noisy annotations while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents.

Method: Proposes MeDocVL with two key components: 1) Training-driven Label Refinement to construct high-quality supervision from noisy annotations, and 2) Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning for robust and precise extraction.

Result: Experiments on medical invoice benchmarks show MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.

Conclusion: MeDocVL effectively addresses medical document parsing challenges through innovative label refinement and noise-aware training, demonstrating superior performance over existing approaches.

Abstract: Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.

[121] A neuromorphic model of the insect visual system for natural image processing

Adam D. Hines, Karin Nordström, Andrew B. Barron

Main category: cs.CV

TL;DR: A bio-inspired vision model mimicking insect visual processing that transforms dense visual input into sparse, discriminative codes using self-supervised contrastive learning, implemented as both artificial and spiking neural networks.

DetailsMotivation: To create a biologically grounded vision model that captures insect visual system principles, addressing the gap where contemporary models prioritize task performance over biologically plausible processing pathways.

Method: Developed a bio-inspired vision model using fully self-supervised contrastive learning to learn representations without labeled data. The model transforms dense visual input into sparse, discriminative codes and is implemented as both artificial neural networks and spiking neural networks for different deployment scenarios.
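
A generic sketch of a contrastive-plus-sparsity objective of the kind described, with NT-Xent simplified to a cross-view form; the temperature and L1 weight are illustrative, not the paper's values:

```python
import torch
import torch.nn.functional as F

def sparse_contrastive_loss(model, view1, view2, temp=0.5, l1_weight=1e-3):
    """Pull codes of two augmentations of the same image together, push
    codes of different images apart, and penalize dense activations so
    the learned representation stays sparse."""
    z1 = F.normalize(model(view1), dim=-1)        # (B, K)
    z2 = F.normalize(model(view2), dim=-1)
    logits = z1 @ z2.T / temp                     # (B, B) pairwise similarities
    labels = torch.arange(len(z1), device=z1.device)
    contrastive = F.cross_entropy(logits, labels)
    sparsity = l1_weight * (z1.abs().mean() + z2.abs().mean())
    return contrastive + sparsity
```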

Result: The model produced reliable sparse codes that distinguish visually similar inputs, outperformed simple image downsampling baselines in simulated localization tasks, and demonstrated effectiveness on flower recognition and natural image benchmarks.

Conclusion: The work advances insect computational modeling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks, bridging biological plausibility with practical functionality.

Abstract: Insect vision supports complex behaviors including associative learning, navigation, and object detection, and has long motivated computational models for understanding biological visual processing. However, many contemporary models prioritize task performance while neglecting biologically grounded processing pathways. Here, we introduce a bio-inspired vision model that captures principles of the insect visual system to transform dense visual input into sparse, discriminative codes. The model is trained using a fully self-supervised contrastive objective, enabling representation learning without labeled data and supporting reuse across tasks without reliance on domain-specific classifiers. We evaluated the resulting representations on flower recognition tasks and natural image benchmarks. The model consistently produced reliable sparse codes that distinguish visually similar inputs. To support different modelling and deployment uses, we have implemented the model as both an artificial neural network and a spiking neural network. In a simulated localization setting, our approach outperformed a simple image downsampling comparison baseline, highlighting the functional benefit of incorporating neuromorphic visual processing pathways. Collectively, these results advance insect computational modelling by providing a generalizable bio-inspired vision model capable of sparse computation across diverse tasks.

[122] Point Virtual Transformer

Veerain Sood, Bnalin, Gaurav Pandey

Main category: cs.CV

TL;DR: PointViT: A transformer-based 3D object detection framework that fuses raw LiDAR points with selectively sampled virtual points from depth-completed RGB images to improve far-field object detection.

DetailsMotivation: LiDAR-based 3D object detectors struggle with far-field objects due to sparse point clouds at long ranges. Existing approaches using depth-completed virtual points from RGB images face computational inefficiency and fusion challenges between real and virtual information.

Method: Point Virtual Transformer (PointViT) jointly reasons over raw LiDAR points and selectively sampled virtual points. It explores multiple fusion strategies (early point-level to BEV-based gated fusion), voxelizes the fused point cloud, uses sparse convolutions for BEV representation, and employs transformer-based context aggregation with high-confidence object queries.
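
Of the fusion strategies examined, the BEV-based gated variant is the easiest to sketch. The following PyTorch module is a hedged illustration of the general pattern, not the paper's implementation; the channel width and the sigmoid-gate form are assumptions.

```python
import torch
import torch.nn as nn

class GatedBEVFusion(nn.Module):
    """Per-location gate deciding how much virtual-point evidence to admit."""
    def __init__(self, channels=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, bev_lidar, bev_virtual):
        g = self.gate(torch.cat([bev_lidar, bev_virtual], dim=1))
        return bev_lidar + g * bev_virtual  # raw LiDAR stays the backbone
```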

Result: Achieves 91.16% 3D AP and 95.94% BEV AP on the KITTI benchmark, and 99.36% AP on the KITTI 2D detection benchmark, for the Car class.

Conclusion: PointViT effectively addresses far-field object detection challenges by selectively fusing virtual points with raw LiDAR data through transformer-based architecture, balancing accuracy and efficiency.

Abstract: LiDAR-based 3D object detectors often struggle to detect far-field objects due to the sparsity of point clouds at long ranges, which limits the availability of reliable geometric cues. To address this, prior approaches augment LiDAR data with depth-completed virtual points derived from RGB images; however, directly incorporating all virtual points leads to increased computational cost and introduces challenges in effectively fusing real and virtual information. We present Point Virtual Transformer (PointViT), a transformer-based 3D object detection framework that jointly reasons over raw LiDAR points and selectively sampled virtual points. The framework examines multiple fusion strategies, ranging from early point-level fusion to BEV-based gated fusion, and analyses their trade-offs in terms of accuracy and efficiency. The fused point cloud is voxelized and encoded using sparse convolutions to form a BEV representation, from which a compact set of high-confidence object queries is initialised and refined through a transformer-based context aggregation module. Experiments on the KITTI benchmark report 91.16% 3D AP, 95.94% BEV AP, and 99.36% AP on the KITTI 2D detection benchmark for the Car class.

[123] Learning Human Visual Attention on 3D Surfaces through Geometry-Queried Semantic Priors

Soham Pahari, Sandeep C. Kumain

Main category: cs.CV

TL;DR: SemGeo-AttentionNet: A dual-stream architecture for 3D visual attention modeling that combines geometric processing with semantic awareness using diffusion-based priors and cross-modal fusion.

DetailsMotivation: Existing 3D saliency methods fail to explain why humans fixate on semantically meaningful but geometrically unremarkable regions, as they rely on hand-crafted geometric features or learning-based approaches lacking semantic awareness.

Method: Dual-stream architecture with asymmetric cross-modal fusion: uses diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention enables geometric features to query semantic content. Extended to temporal scanpath generation via reinforcement learning with 3D mesh topology and inhibition-of-return dynamics.
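
The asymmetric fusion reduces to cross-attention in which geometric tokens supply the queries and semantic tokens supply the keys and values. A minimal sketch, with assumed dimensions and head count:

```python
import torch.nn as nn

class GeoQueriesSemantic(nn.Module):
    """Geometric tokens query semantic tokens (asymmetric cross-attention)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, geo_tokens, sem_tokens):
        # bottom-up geometric distinctiveness drives top-down semantic retrieval
        fused, _ = self.attn(geo_tokens, sem_tokens, sem_tokens)
        return geo_tokens + fused  # residual keeps the geometric stream intact
```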

Result: Demonstrates substantial improvements on SAL3D, NUS3D and 3DVA datasets, validating that cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.

Conclusion: The proposed framework successfully models the interplay between bottom-up geometric processing and top-down semantic recognition in 3D visual attention, addressing limitations of existing methods through explicit semantic awareness.

Abstract: Human visual attention on three-dimensional objects emerges from the interplay between bottom-up geometric processing and top-down semantic recognition. Existing 3D saliency methods rely on hand-crafted geometric features or learning-based approaches that lack semantic awareness, failing to explain why humans fixate on semantically meaningful but geometrically unremarkable regions. We introduce SemGeo-AttentionNet, a dual-stream architecture that explicitly formalizes this dichotomy through asymmetric cross-modal fusion, leveraging diffusion-based semantic priors from geometry-conditioned multi-view rendering and point cloud transformers for geometric processing. Cross-attention ensures geometric features query semantic content, enabling bottom-up distinctiveness to guide top-down retrieval. We extend our framework to temporal scanpath generation through reinforcement learning, introducing the first formulation respecting 3D mesh topology with inhibition-of-return dynamics. Evaluation on SAL3D, NUS3D and 3DVA datasets demonstrates substantial improvements, validating how cognitively motivated architectures effectively model human visual attention on three-dimensional surfaces.

[124] Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang

Main category: cs.CV

TL;DR: TP-GRPO improves GRPO for flow matching models by addressing step-wise reward sparsity and modeling long-term effects through incremental rewards and turning point detection.

DetailsMotivation: Existing GRPO methods for text-to-image generation propagate outcome-based rewards to all denoising steps without distinguishing local effects, and group-wise ranking ignores within-trajectory dependencies where early actions affect later states through delayed interactions.

Method: TP-GRPO introduces two innovations: (1) replaces outcome-based rewards with step-level incremental rewards for dense, step-aware learning signals, and (2) identifies turning points (steps that flip local reward trends) and assigns them aggregated long-term rewards to capture delayed impact. Turning points are detected via sign changes in incremental rewards.
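
The turning-point rule itself is simple enough to sketch. Assuming a per-step reward curve is available (how it is obtained is part of the paper's contribution and is not reproduced here), incremental rewards and turning points reduce to:

```python
import numpy as np

def incremental_rewards(step_rewards):
    """step_rewards: reward of the partial result after each denoising step."""
    r = np.asarray(step_rewards, dtype=float)
    return np.diff(r, prepend=r[0])  # step t's "pure" effect: r_t - r_{t-1}

def turning_points(increments):
    """Indices where the incremental reward flips sign."""
    s = np.sign(increments)
    return np.where(s[1:] * s[:-1] < 0)[0] + 1
```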

Result: Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation quality.

Conclusion: TP-GRPO provides an efficient, hyperparameter-free framework that addresses reward sparsity and models long-term effects in denoising trajectories for improved text-to-image generation.

Abstract: Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action’s “pure” effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.

[125] POPL-KF: A Pose-Only Geometric Representation-Based Kalman Filter for Point-Line-Based Visual-Inertial Odometry

Aiping Wang, Zhaolong Yang, Shuwen Chen, Hai Zhang

Main category: cs.CV

TL;DR: POPL-KF: A Kalman filter-based VIO system using pose-only geometric representation for both point and line features to improve performance in challenging scenes by eliminating feature coordinate linearization errors and enabling immediate measurement updates.

DetailsMotivation: Traditional VIO systems using point features degrade in challenging scenarios, and MSCKF-based VIO suffers from linearization errors related to feature 3D coordinates and delayed measurement updates. There's a need to improve VIO performance in difficult environments.

Method: Proposes pose-only geometric representation for line features, develops POPL-KF (Kalman filter-based VIO) using pose-only representation for both point and line features, eliminates feature coordinates from measurement equations, implements immediate visual measurement updates, designs unified base-frames selection algorithm, and adds line feature filter using image grid segmentation and bidirectional optical flow consistency.
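
One concrete ingredient, the bidirectional optical-flow consistency check inside the line feature filter, can be sketched with OpenCV's Lucas-Kanade tracker; tracking line endpoints as points and the one-pixel tolerance are illustrative assumptions, not the paper's exact procedure.

```python
import cv2
import numpy as np

def flow_consistent(img0, img1, pts, tol=1.0):
    """img0, img1: grayscale uint8 frames; pts: (N, 1, 2) float32 endpoints."""
    fwd, st0, _ = cv2.calcOpticalFlowPyrLK(img0, img1, pts, None)
    bwd, st1, _ = cv2.calcOpticalFlowPyrLK(img1, img0, fwd, None)
    err = np.linalg.norm(pts - bwd, axis=2).ravel()  # forward-backward drift
    return (st0.ravel() == 1) & (st1.ravel() == 1) & (err < tol)
```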

Result: POPL-KF outperforms state-of-the-art filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS) on public datasets and real-world experiments while maintaining real-time performance.

Conclusion: The pose-only geometric representation for both point and line features effectively mitigates linearization errors and improves VIO performance in challenging scenarios, making POPL-KF superior to existing methods.

Abstract: Mainstream Visual-inertial odometry (VIO) systems rely on point features for motion estimation and localization. However, their performance degrades in challenging scenarios. Moreover, the localization accuracy of multi-state constraint Kalman filter (MSCKF)-based VIO systems suffers from linearization errors associated with feature 3D coordinates and delayed measurement updates. To improve the performance of VIO in challenging scenes, we first propose a pose-only geometric representation for line features. Building on this, we develop POPL-KF, a Kalman filter-based VIO system that employs a pose-only geometric representation for both point and line features. POPL-KF mitigates linearization errors by explicitly eliminating both point and line feature coordinates from the measurement equations, while enabling immediate update of visual measurements. We also design a unified base-frames selection algorithm for both point and line features to ensure optimal constraints on camera poses within the pose-only measurement model. To further improve line feature quality, a line feature filter based on image grid segmentation and bidirectional optical flow consistency is proposed. Our system is evaluated on public datasets and real-world experiments, demonstrating that POPL-KF outperforms the state-of-the-art (SOTA) filter-based methods (OpenVINS, PO-KF) and optimization-based methods (PL-VINS, EPLF-VINS), while maintaining real-time performance.

[126] Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters

Yuxiang Zhao, Yirong Yang, Yanqing Zhu, Yanfen Shen, Chiyu Wang, Zhining Gu, Pei Shi, Wei Guo, Mu Xu

Main category: cs.CV

TL;DR: A vision-centric embodied navigation framework for outdoor-to-indoor navigation without external priors, using image-based prompts and a novel dataset built with trajectory-conditioned video synthesis.

DetailsMotivation: Current embodied navigation methods are limited to either indoor or outdoor environments and rely on precise coordinate systems, failing to enable fine-grained entry through specific building entrances for practical deployment scenarios requiring seamless outdoor-to-indoor transitions.

Method: Proposes a vision-centric embodied navigation framework that leverages image-based prompts for decision-making, and introduces a dataset generation pipeline that integrates trajectory-conditioned video synthesis.

Result: The proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.

Conclusion: The work bridges the gap between outdoor and indoor navigation by eliminating reliance on external priors and enabling fine-grained entry through specific building entrances using only egocentric visual observations guided by instructions.

Abstract: Embodied navigation holds significant promise for real-world applications such as last-mile delivery. However, most existing approaches are confined to either indoor or outdoor environments and rely heavily on strong assumptions, such as access to precise coordinate systems. While current outdoor methods can guide agents to the vicinity of a target using coarse-grained localization, they fail to enable fine-grained entry through specific building entrances, critically limiting their utility in practical deployment scenarios that require seamless outdoor-to-indoor transitions. To bridge this gap, we introduce a novel task: out-to-in prior-free instruction-driven embodied navigation. This formulation explicitly eliminates reliance on accurate external priors, requiring agents to navigate solely based on egocentric visual observations guided by instructions. To tackle this task, we propose a vision-centric embodied navigation framework that leverages image-based prompts to drive decision-making. Additionally, we present the first open-source dataset for this task, featuring a pipeline that integrates trajectory-conditioned video synthesis into the data generation process. Through extensive experiments, we demonstrate that our proposed method consistently outperforms state-of-the-art baselines across key metrics including success rate and path efficiency.

[127] ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, Wayne Zhuang, Yong Liu, Haoji Zhang, Yansong Tang, Qinglin Lu, Chunyu Wang

Main category: cs.CV

TL;DR: ChatUMM: A conversational unified multimodal model that enables continuous dialogue with interleaved text-image generation through multi-turn training and systematic data synthesis.

DetailsMotivation: Current unified multimodal models are limited to single-turn interactions, functioning as solvers for independent requests rather than continuous dialogue assistants. There's a need for models that can sustain interleaved multimodal generation across multiple turns while maintaining context awareness.

Method: Two key innovations: 1) Interleaved multi-turn training strategy that models serialized text-image streams as continuous conversational flow, and 2) Systematic conversational data synthesis pipeline that transforms single-turn datasets into fluid dialogues through three stages: basic stateful dialogues, long-range dependency resolution with “distractor” turns and history-dependent query rewriting, and synthesis of naturally interleaved multimodal responses.
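
For intuition, a synthesized sample might look like the following interleaved text-image turn sequence; the schema and field names here are hypothetical, chosen only to illustrate the three stages, and are not the paper's released format.

```python
sample = {
    "turns": [
        {"role": "user", "content": [{"type": "text", "text": "Draw a red kite."}]},
        {"role": "assistant", "content": [{"type": "image", "image": "kite_v1.png"}]},
        # a "distractor" turn that does not change the tracked image state
        {"role": "user", "content": [{"type": "text", "text": "What color is it?"}]},
        {"role": "assistant", "content": [{"type": "text", "text": "Red."}]},
        # history-dependent rewrite: "it" must resolve to the kite image above
        {"role": "user", "content": [{"type": "text", "text": "Make it blue."}]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "Here is the blue version."},
            {"type": "image", "image": "kite_v2.png"},
        ]},
    ]
}
```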

Result: ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. It exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.

Conclusion: ChatUMM successfully bridges the gap between single-turn multimodal models and conversational assistants, enabling robust context tracking and sustained interleaved multimodal generation through innovative training strategies and data synthesis approaches.

Abstract: Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via “distractor” turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.

[128] What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Xingsong Ye, Yongkun Du, JiaXin Zhang, Chen Li, Jing LYU, Zhineng Chen

Main category: cs.CV

TL;DR: UnionST is a synthetic data engine for Scene Text Recognition that addresses domain gaps by improving diversity in corpus, font, and layout, outperforming existing synthetic datasets and even real data in some scenarios.

DetailsMotivation: Real-world text data collection for STR is expensive and hard to balance categorically, while current synthetic data suffers from insufficient diversity in corpus, font, and layout, creating a significant domain gap with real data.

Method: Developed UnionST data engine synthesizing text covering challenging samples; created UnionST-S large-scale synthetic dataset with improved simulations; introduced self-evolution learning (SEL) framework for effective real data annotation.

Result: Models trained on UnionST-S achieve significant improvements over existing synthetic datasets, even surpassing real-data performance in certain scenarios. SEL enables competitive performance with only 9% of real data labels.

Conclusion: UnionST addresses key limitations of synthetic data for STR, demonstrating that well-designed synthetic datasets can outperform real data and enable efficient annotation through self-evolution learning.

Abstract: Large-scale, categorically balanced text data is essential for training effective Scene Text Recognition (STR) models, which is hard to achieve when collecting real data. Synthetic data offers a cost-effective and perfectly labeled alternative. However, its performance often lags behind, revealing a significant domain gap between real and current synthetic data. In this work, we systematically analyze mainstream rendering-based synthetic datasets and identify their key limitations: insufficient diversity in corpus, font, and layout, which restricts their realism in complex scenarios. To address these issues, we introduce UnionST, a strong data engine that synthesizes text covering a union of challenging samples and better aligns with the complexity observed in the wild. We then construct UnionST-S, a large-scale synthetic dataset with improved simulations in challenging scenarios. Furthermore, we develop a self-evolution learning (SEL) framework for effective real data annotation. Experiments show that models trained on UnionST-S achieve significant improvements over existing synthetic datasets. They even surpass real-data performance in certain scenarios. Moreover, when using SEL, the trained models achieve competitive performance while seeing only 9% of real data labels.

[129] Exploring Specular Reflection Inconsistency for Generalizable Face Forgery Detection

Hongyan Fei, Zexi Jia, Chuanwei Huang, Jinchao Zhang, Jie Zhou

Main category: cs.CV

TL;DR: A novel deepfake detection method that analyzes specular reflection inconsistencies in facial images, particularly effective against high-quality diffusion-generated forgeries.

DetailsMotivation: Existing deepfake detection methods struggle with high-quality, entirely synthesized forgeries from diffusion models. Facial attributes governed by complex physical laws (like illumination) are difficult to replicate accurately, providing a promising detection avenue.

Method: Proposes SRI-Net (Specular-Reflection-Inconsistency-Network) that: 1) Uses Retinex theory for fast face texture estimation, 2) Separates specular reflection components, 3) Employs two-stage cross-attention to capture relationships between specular reflection, face texture, and direct light, 4) Integrates these features for robust detection.
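
As background, the Retinex decomposition that underpins the texture estimate can be sketched in a few lines; this is the classic single-scale form, not the paper's faster estimator, and the Gaussian sigma is an arbitrary choice.

```python
import cv2
import numpy as np

def retinex_split(gray, sigma=30):
    """gray: single-channel image; returns (texture, illumination)."""
    img = gray.astype(np.float32) + 1.0           # avoid log(0)
    illumination = cv2.GaussianBlur(img, (0, 0), sigma)
    texture = np.log(img) - np.log(illumination)  # Retinex: log I - log L
    return texture, illumination
```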

Result: Superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly effective against diffusion-generated forgery faces.

Conclusion: Analyzing physical inconsistencies in specular reflection provides an effective approach for detecting high-quality AI-generated forgeries, especially those from diffusion models.

Abstract: Detecting deepfakes has become increasingly challenging as forgery faces synthesized by AI-generated methods, particularly diffusion models, achieve unprecedented quality and resolution. Existing forgery detection approaches relying on spatial and frequency features demonstrate limited efficacy against high-quality, entirely synthesized forgeries. In this paper, we propose a novel detection method grounded in the observation that facial attributes governed by complex physical laws and multiple parameters are inherently difficult to replicate. Specifically, we focus on illumination, particularly the specular reflection component in the Phong illumination model, which poses the greatest replication challenge due to its parametric complexity and nonlinear formulation. We introduce a fast and accurate face texture estimation method based on Retinex theory to enable precise specular reflection separation. Furthermore, drawing from the mathematical formulation of specular reflection, we posit that forgery evidence manifests not only in the specular reflection itself but also in its relationship with corresponding face texture and direct light. To address this issue, we design the Specular-Reflection-Inconsistency-Network (SRI-Net), incorporating a two-stage cross-attention mechanism to capture these correlations and integrate specular reflection related features with image features for robust forgery detection. Experimental results demonstrate that our method achieves superior performance on both traditional deepfake datasets and generative deepfake datasets, particularly those containing diffusion-generated forgery faces.

[130] LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection

Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao

Main category: cs.CV

TL;DR: LAB-Det enables frozen object detectors to adapt to specialized domains with just one exemplar per class using language as a domain-invariant bridge, without any training or weight updates.

DetailsMotivation: Foundation object detectors degrade in specialized domains like underwater imagery or industrial defects. Traditional few-shot approaches require fine-tuning scarce target data, which is costly and risks overfitting. The paper explores whether frozen detectors can adapt with only one exemplar per class without any training.

Method: LAB-Det (Language As a domain-invariant Bridge) projects each exemplar into descriptive text that conditions and guides a frozen detector. Instead of adapting visual features, it uses linguistic conditioning to replace gradient-based adaptation, enabling robust generalization in data-scarce domains.

Result: On UODD (underwater) and NEU-DET (industrial defects) benchmarks, LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter, establishing linguistic adaptation as an efficient alternative to fine-tuning.

Conclusion: Linguistic adaptation provides an efficient and interpretable alternative to fine-tuning in specialized detection settings, enabling frozen detectors to generalize to data-scarce domains with minimal exemplars and no weight updates.

Abstract: Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection, where object boundaries are often ambiguous, and LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.

[131] Efficient-LVSM: Faster, Cheaper, and Better Large View Synthesis Model via Decoupled Co-Refinement Attention

Xiaosong Jia, Yihang Sun, Junqi You, Songbur Wong, Zichen Zou, Junchi Yan, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Efficient-LVSM improves novel view synthesis with a dual-stream architecture using decoupled co-refinement, achieving better performance and faster training/inference than previous transformer-based methods.

DetailsMotivation: Previous transformer-based methods for novel view synthesis (like LVSM) suffer from quadratic complexity with input views and rigid parameter sharing among heterogeneous tokens, making them inefficient.

Method: Proposes Efficient-LVSM with dual-stream architecture: intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation through decoupled co-refinement.
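
A minimal sketch of one decoupled co-refinement layer, assuming token dimensions and head count (normalization layers omitted): input-view tokens attend only to themselves, while target-view tokens first self-attend and then cross-attend into the input views.

```python
import torch.nn as nn

class DecoupledLayer(nn.Module):
    """Input views: self-attention only. Target views: self, then cross."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.in_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tgt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tgt_cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, inp, tgt):
        # input tokens never attend to target tokens, so their keys/values
        # are reusable (KV-cache) and cost does not grow with target count
        inp = inp + self.in_self(inp, inp, inp)[0]
        tgt = tgt + self.tgt_self(tgt, tgt, tgt)[0]
        tgt = tgt + self.tgt_cross(tgt, inp, inp)[0]
        return inp, tgt
```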

Result: Achieves 29.86 dB PSNR on RealEstate10K with 2 input views (surpassing LVSM by 0.2 dB), with 2x faster training convergence, 4.4x faster inference speed, and state-of-the-art performance on multiple benchmarks.

Conclusion: Efficient-LVSM’s decoupled design enables efficient novel view synthesis with strong performance, faster computation, zero-shot generalization to unseen view counts, and incremental inference capabilities.

Abstract: Feedforward models for novel view synthesis (NVS) have recently advanced by transformer-based methods like LVSM, using attention among all input and target views. In this work, we argue that its full self-attention design is suboptimal, suffering from quadratic complexity with respect to the number of input views and rigid parameter sharing among heterogeneous tokens. We propose Efficient-LVSM, a dual-stream architecture that avoids these issues with a decoupled co-refinement mechanism. It applies intra-view self-attention for input views and self-then-cross attention for target views, eliminating unnecessary computation. Efficient-LVSM achieves 29.86 dB PSNR on RealEstate10K with 2 input views, surpassing LVSM by 0.2 dB, with 2x faster training convergence and 4.4x faster inference speed. Efficient-LVSM achieves state-of-the-art performance on multiple benchmarks, exhibits strong zero-shot generalization to unseen view counts, and enables incremental inference with KV-cache, thanks to its decoupled designs.

[132] Instance-Free Domain Adaptive Object Detection

Hengfu Yu, Jinhong Deng, Lixin Duan, Wen Li

Main category: cs.CV

TL;DR: RSCN addresses domain adaptive object detection when target domain lacks foreground instances, using background feature prototypes and relational consistency for adaptation.

DetailsMotivation: Most domain adaptive object detection methods assume target data contains sufficient foreground instances, but in practical scenarios like wildlife monitoring or medical imaging, collecting target data with objects is costly while background-only data is abundant. This creates a technical challenge of adapting without target instances.

Method: Proposes Relational and Structural Consistency Network (RSCN) that uses background feature prototypes for alignment and enforces consistency in relationships between source foreground features and background features within each domain, enabling adaptation without target instances.
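
A hedged sketch of what the two ingredients could look like as losses: prototype alignment over background features plus a foreground-to-background relational consistency term. The distances and aggregation below are assumptions about the general pattern, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def prototype_alignment(src_bg, tgt_bg):
    """src_bg, tgt_bg: (N, D) background features from source / target."""
    p_src = F.normalize(src_bg.mean(dim=0), dim=0)
    p_tgt = F.normalize(tgt_bg.mean(dim=0), dim=0)
    return 1.0 - torch.dot(p_src, p_tgt)  # cosine prototype alignment

def relational_consistency(src_fg, src_bg, tgt_bg):
    """Foreground-to-background relations should match across domains."""
    fg = F.normalize(src_fg, dim=1)
    rel_src = fg @ F.normalize(src_bg, dim=1).t()  # (F, Ns) similarities
    rel_tgt = fg @ F.normalize(tgt_bg, dim=1).t()  # (F, Nt)
    return F.mse_loss(rel_src.mean(dim=1), rel_tgt.mean(dim=1))
```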

Result: RSCN significantly outperforms existing DAOD methods across three specialized benchmarks (simulative auto-driving detection, wildlife detection, and lung nodule detection) in instance-free scenarios.

Conclusion: The paper introduces the novel problem of Instance-Free Domain Adaptive Object Detection and provides an effective solution through RSCN, with benchmarks to facilitate future research in this practical but challenging scenario.

Abstract: While Domain Adaptive Object Detection (DAOD) has made significant strides, most methods rely on unlabeled target data that is assumed to contain sufficient foreground instances. However, in many practical scenarios (e.g., wildlife monitoring, lesion detection), collecting target domain data with objects of interest is prohibitively costly, whereas background-only data is abundant. This common practical constraint introduces a significant technical challenge: the difficulty of achieving domain alignment when target instances are unavailable, forcing adaptation to rely solely on the target background information. We formulate this challenge as the novel problem of Instance-Free Domain Adaptive Object Detection. To tackle this, we propose the Relational and Structural Consistency Network (RSCN) which pioneers an alignment strategy based on background feature prototypes while simultaneously encouraging consistency in the relationship between the source foreground features and the background features within each domain, enabling robust adaptation even without target instances. To facilitate research, we further curate three specialized benchmarks, including simulative auto-driving detection, wildlife detection, and lung nodule detection. Extensive experiments show that RSCN significantly outperforms existing DAOD methods across all three benchmarks in the instance-free scenario. The code and benchmarks will be released soon.

[133] Rebenchmarking Unsupervised Monocular 3D Occupancy Prediction

Zizhan Guo, Yi Feng, Mengtan Zhang, Haoran Zhang, Wei Ye, Rui Fan

Main category: cs.CV

TL;DR: A reformulated benchmark for unsupervised monocular 3D occupancy prediction, paired with an occlusion-aware polarization mechanism for autonomous driving.

DetailsMotivation: Existing unsupervised 3D structure inference from single images has inconsistent training/evaluation protocols and fails to address inherent ambiguity in occluded regions due to insufficient geometric constraints.

Method: 1) Reformulate benchmark by interpreting volume rendering variables to identify physically consistent occupancy probability representation; 2) Align this representation with voxel-wise 3D occupancy ground truth; 3) Introduce occlusion-aware polarization mechanism incorporating multi-view visual cues to enhance discrimination in occluded regions.
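
For reference, the volume-rendering variables the reformulation disambiguates (per-sample opacity, transmittance, and rendering weights) follow the standard relations below; which of these the paper ultimately identifies as the physically consistent occupancy probability is not reproduced here.

```python
import torch

def rendering_quantities(sigma, delta):
    """sigma, delta: (..., N) densities and step sizes along each ray."""
    alpha = 1.0 - torch.exp(-sigma * delta)             # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)  # prob. of reaching sample
    trans = torch.roll(trans, shifts=1, dims=-1)
    trans[..., 0] = 1.0                                 # ray starts unoccluded
    weights = trans * alpha                             # volume-rendering weights
    return alpha, trans, weights
```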

Result: Approach significantly outperforms existing unsupervised methods and matches supervised ones in 3D occupancy prediction for autonomous driving scenarios.

Conclusion: The proposed benchmark reformulation and occlusion-aware polarization mechanism effectively address inconsistencies in unsupervised 3D occupancy prediction and improve performance in occluded regions.

Abstract: Inferring the 3D structure from a single image, particularly in occluded regions, remains a fundamental yet unsolved challenge in vision-centric autonomous driving. Existing unsupervised approaches typically train a neural radiance field and treat the network outputs as occupancy probabilities during evaluation, overlooking the inconsistency between training and evaluation protocols. Moreover, the prevalent use of 2D ground truth fails to reveal the inherent ambiguity in occluded regions caused by insufficient geometric constraints. To address these issues, this paper presents a reformulated benchmark for unsupervised monocular 3D occupancy prediction. We first interpret the variables involved in the volume rendering process and identify the most physically consistent representation of the occupancy probability. Building on these analyses, we improve existing evaluation protocols by aligning the newly identified representation with voxel-wise 3D occupancy ground truth, thereby enabling unsupervised methods to be evaluated in a manner consistent with that of supervised approaches. Additionally, to impose explicit constraints in occluded regions, we introduce an occlusion-aware polarization mechanism that incorporates multi-view visual cues to enhance discrimination between occupied and free spaces in these regions. Extensive experiments demonstrate that our approach not only significantly outperforms existing unsupervised approaches but also matches the performance of supervised ones. Our source code and evaluation protocol will be made available upon publication.

[134] DreamHome-Pano: Design-Aware and Conflict-Free Panoramic Interior Generation

Lulu Chen, Yijiang Hu, Yuanqing Liu, Yulong Li, Yue Yang

Main category: cs.CV

TL;DR: DreamHome-Pano is a controllable panoramic generation framework for interior design that balances architectural constraints with stylistic preferences using a Prompt-LLM and Conflict-Free Control architecture.

DetailsMotivation: Existing multi-condition generative frameworks struggle to harmonize rigid architectural constraints with stylistic preferences, leading to "condition conflicts" where style compromises geometric precision in interior layout generation.

Method: Uses a Prompt-LLM as semantic bridge to translate layout constraints and style references into professional descriptive prompts. Develops Conflict-Free Control architecture with structural-aware geometric priors and multi-condition decoupling strategy to suppress stylistic interference. Implements multi-stage training with progressive SFT and RL.

Result: Achieves superior balance between aesthetic quality and structural consistency, offering robust professional-grade solution for panoramic interior visualization.

Conclusion: DreamHome-Pano provides effective framework for harmonizing architectural constraints with stylistic preferences in interior design generation, addressing condition conflicts through semantic bridging and conflict-free control mechanisms.

Abstract: In modern interior design, the generation of personalized spaces frequently necessitates a delicate balance between rigid architectural structural constraints and specific stylistic preferences. However, existing multi-condition generative frameworks often struggle to harmonize these inputs, leading to “condition conflicts” where stylistic attributes inadvertently compromise the geometric precision of the layout. To address this challenge, we present DreamHome-Pano, a controllable panoramic generation framework designed for high-fidelity interior synthesis. Our approach introduces a Prompt-LLM that serves as a semantic bridge, effectively translating layout constraints and style references into professional descriptive prompts to achieve precise cross-modal alignment. To safeguard architectural integrity during the generative process, we develop a Conflict-Free Control architecture that incorporates structural-aware geometric priors and a multi-condition decoupling strategy, effectively suppressing stylistic interference from eroding the spatial layout. Furthermore, we establish a comprehensive panoramic interior benchmark alongside a multi-stage training pipeline, encompassing progressive Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Experimental results demonstrate that DreamHome-Pano achieves a superior balance between aesthetic quality and structural consistency, offering a robust and professional-grade solution for panoramic interior visualization.

[135] Forest canopy height estimation from satellite RGB imagery using large-scale airborne LiDAR-derived training data and monocular depth estimation

Yongkang Lai, Xihan Mu, Tim R. McVicar, Dasheng Fan, Donghui Xie, Shanxin Guo, Wenli Huang, Tianjie Zhao, Guangjian Yan

Main category: cs.CV

TL;DR: Depth2CHM uses monocular depth estimation (Depth Anything V2) trained on airborne LiDAR canopy height data to estimate continuous forest canopy height from satellite RGB imagery.

DetailsMotivation: Large-scale forest canopy height mapping is crucial for understanding carbon/water cycles, but existing spaceborne LiDAR is sparse and uncertain, while airborne/UAV LiDAR provides finer measurements but limited coverage.

Method: Trained Depth Anything V2 model on ~16,000 km² of airborne LiDAR-derived canopy height models with 3m resolution PlanetScope/airborne RGB imagery to create Depth2CHM for estimating CHMs from satellite RGB.

Result: Depth2CHM achieved biases of 0.59m/0.41m and RMSEs of 2.54m/5.75m at validation sites in China and US, outperforming existing global CHM products by ~1.5m MAE and ~2m RMSE reduction.

Conclusion: Monocular depth estimation networks trained with large-scale airborne LiDAR data provide a promising, scalable pathway for high-resolution, continuous forest canopy height estimation from satellite RGB imagery.

Abstract: Large-scale, high-resolution forest canopy height mapping plays a crucial role in understanding regional and global carbon and water cycles. Spaceborne LiDAR missions, including the Ice, Cloud, and Land Elevation Satellite-2 (ICESat-2) and the Global Ecosystem Dynamics Investigation (GEDI), provide global observations of forest structure but are spatially sparse and subject to inherent uncertainties. In contrast, near-surface LiDAR platforms, such as airborne and unmanned aerial vehicle (UAV) LiDAR systems, offer much finer measurements of forest canopy structure, and a growing number of countries have made these datasets openly available. In this study, a state-of-the-art monocular depth estimation model, Depth Anything V2, was trained using approximately 16,000 km² of canopy height models (CHMs) derived from publicly available airborne LiDAR point clouds and related products across multiple countries, together with 3 m resolution PlanetScope and airborne RGB imagery. The trained model, referred to as Depth2CHM, enables the estimation of spatially continuous CHMs directly from PlanetScope RGB imagery. Independent validation was conducted at sites in China (approximately 1 km²) and the United States (approximately 116 km²). The results showed that Depth2CHM could accurately estimate canopy height, with biases of 0.59 m and 0.41 m and root mean square errors (RMSEs) of 2.54 m and 5.75 m for these two sites, respectively. Compared with an existing global meter-resolution CHM product, the mean absolute error is reduced by approximately 1.5 m and the RMSE by approximately 2 m. These results demonstrated that monocular depth estimation networks trained with large-scale airborne LiDAR-derived canopy height data provide a promising and scalable pathway for high-resolution, spatially continuous forest canopy height estimation from satellite RGB imagery.

[136] FloorplanVLM: A Vision-Language Model for Floorplan Vectorization

Yuanqing Liu, Ziming Yang, Yulong Li, Yue Yang

Main category: cs.CV

TL;DR: FloorplanVLM converts raster floorplans to vector graphics using an image-to-sequence approach with structured JSON output, trained on large datasets and optimized for geometric constraints.

DetailsMotivation: Converting raster floorplans to engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. Existing methods have limitations: pixel-based approaches rely on fragile heuristics, while query-based transformers generate fragmented rooms.

Method: Reformulates floorplan vectorization as image-conditioned sequence modeling. Uses a ‘pixels-to-sequence’ paradigm to output structured JSON sequences representing global topology. Employs a scalable data engine with large-scale datasets (Floorplan-2M and Floorplan-HQ-300K). Uses progressive training: Supervised Fine-Tuning (SFT) for structural grounding, followed by Group Relative Policy Optimization (GRPO) for geometric alignment.
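
For intuition, a target sequence in the pixels-to-sequence paradigm might be a JSON document like the following; the schema is hypothetical, since the paper's exact format is not shown in this summary.

```python
floorplan = {
    "walls": [
        {"start": [0.0, 0.0], "end": [6.2, 0.0], "thickness": 0.2},
        {"start": [6.2, 0.0], "end": [6.2, 4.5], "thickness": 0.2},
        # a slanted wall is just a segment with non-axis-aligned endpoints
        {"start": [6.2, 4.5], "end": [3.0, 6.0], "thickness": 0.2},
    ],
    "openings": [
        {"type": "door", "wall": 0, "offset": 1.1, "width": 0.9},
        {"type": "window", "wall": 1, "offset": 2.0, "width": 1.2},
    ],
    "rooms": [{"label": "living_room", "wall_loop": [0, 1, 2]}],
}
```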

Result: Achieves 92.52% external-wall IoU on FPBench-2K benchmark. Demonstrates robust generalization across non-Manhattan architectures with exceptional structural validity.

Conclusion: FloorplanVLM provides a unified framework for precise floorplan vectorization that handles complex geometries like slanted walls and curved arcs through structured sequence generation.

Abstract: Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This ‘pixels-to-sequence’ paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving 92.52% external-wall IoU and robust generalization across non-Manhattan architectures.

[137] DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving

Feiyang Jia, Lin Liu, Ziying Song, Caiyan Jia, Hangjun Ye, Xiaoshuai Hao, Long Chen

Main category: cs.CV

TL;DR: DriveWorld-VLA is a novel framework that unifies world modeling and planning in autonomous driving by tightly integrating Vision-Language-Action models with world models in latent space, enabling action-conditioned imagination and improved decision-making.

DetailsMotivation: Existing end-to-end autonomous driving methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions.

Method: Proposes DriveWorld-VLA framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at representation level, using latent states as core decision-making states to assess how candidate actions impact future scene evolution.
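
The action-conditioned, feature-level imagination can be sketched as scoring candidate actions by rolling them forward through the latent world model; the module interfaces and the value head below are assumptions about the general pattern, not the paper's API.

```python
import torch

@torch.no_grad()
def score_actions(world_model, value_head, latent, candidate_actions, horizon=3):
    """Imagine each action forward in latent space and score the rollout."""
    scores = []
    for action in candidate_actions:
        z, total = latent, 0.0
        for _ in range(horizon):           # feature-level rollout: no pixel
            z = world_model(z, action)     # decoding is ever performed
            total += value_head(z).item()  # assumed scalar utility per state
        scores.append(total)
    return scores
```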

Result: Achieves state-of-the-art performance: 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 3-second average collision rate of 0.16 on nuScenes.

Conclusion: DriveWorld-VLA effectively unifies world modeling and planning in latent space, enabling controllable action-conditioned imagination at feature level and reducing reliance on dense annotated supervision while achieving superior autonomous driving performance.

Abstract: End-to-end (E2E) autonomous driving has recently attracted increasing interest in unifying Vision-Language-Action (VLA) with World Models to enhance decision-making and forward-looking imagination. However, existing methods fail to effectively unify future scene evolution and action planning within a single architecture due to inadequate sharing of latent states, limiting the impact of visual imagination on action decisions. To address this limitation, we propose DriveWorld-VLA, a novel framework that unifies world modeling and planning within a latent space by tightly integrating VLA and world models at the representation level, which enables the VLA planner to benefit directly from holistic scene-evolution modeling and reduces reliance on dense annotated supervision. Additionally, DriveWorld-VLA incorporates the latent states of the world model as core decision-making states for the VLA planner, enabling the planner to assess how candidate actions impact future scene evolution. By conducting world modeling entirely in the latent space, DriveWorld-VLA supports controllable, action-conditioned imagination at the feature level, avoiding expensive pixel-level rollouts. Extensive open-loop and closed-loop evaluations demonstrate the effectiveness of DriveWorld-VLA, which achieves state-of-the-art performance with 91.3 PDMS on NAVSIMv1, 86.8 EPDMS on NAVSIMv2, and a 3-second average collision rate of 0.16 on nuScenes. Code and models will be released at https://github.com/liulin815/DriveWorld-VLA.git.

[138] MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices

Mridankan Mandal

Main category: cs.CV

TL;DR: Ultra-lightweight MicroBi-ConvLSTM architecture for human activity recognition on wearables achieves 11.4K parameters with competitive accuracy across 8 HAR benchmarks, enabling deployment on memory-constrained edge devices.

DetailsMotivation: Existing lightweight HAR models like TinierHAR (34K params) and TinyHAR (55K params) exceed memory budgets of microcontrollers when considering OS overhead, requiring even more compact architectures for resource-constrained wearables.

Method: Proposes MicroBi-ConvLSTM with two-stage convolutional feature extraction (4x temporal pooling) and a single bidirectional LSTM layer, achieving ultra-lightweight design with 11.4K parameters while maintaining linear O(N) complexity.
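
A minimal PyTorch sketch of the described topology, two convolutional stages with 4x total temporal pooling feeding a single bidirectional LSTM; the channel widths and hidden size are assumptions chosen to land near the reported parameter budget, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MicroBiConvLSTM(nn.Module):
    def __init__(self, in_channels, num_classes, c1=16, c2=24, hidden=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, c1, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),  # 2x temporal pooling
            nn.Conv1d(c1, c2, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),  # 4x total
        )
        self.lstm = nn.LSTM(c2, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                     # x: (batch, channels, time)
        h = self.features(x).transpose(1, 2)  # -> (batch, time, c2)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])          # logits from the last step
```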

Result: Achieves competitive performance: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, 88.98% on Daphnet gait freeze detection. INT8 quantization yields only 0.21% F1 degradation and 23.0 KB deployment footprint.

Conclusion: MicroBi-ConvLSTM enables accurate HAR on memory-constrained edge devices with 2.9x parameter reduction vs TinierHAR, maintaining competitive accuracy while being suitable for microcontroller deployment.

Abstract: Human Activity Recognition (HAR) on resource constrained wearables requires models that balance accuracy against strict memory and computational budgets. State of the art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy, but exceed memory budgets of microcontrollers with limited SRAM once operating system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture achieving 11.4K parameters on average through two stage convolutional feature extraction with 4x temporal pooling and a single bidirectional LSTM layer. This represents 2.9x parameter reduction versus TinierHAR and 11.9x versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait freeze detection. Systematic ablation reveals task dependent component contributions where bidirectionality benefits episodic event detection, but provides marginal gains on periodic locomotion. INT8 post training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory constrained edge devices.

[139] AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion

Mingyu Dou, Shi Qiu, Ming Hu, Yifan Chen, Huping Ye, Xiaohan Liao, Zhe Sun

Main category: cs.CV

TL;DR: AdaptOVCD is a training-free open-vocabulary change detection framework that uses multi-level information fusion across data, feature, and decision levels to detect arbitrary category changes in remote sensing images without predefined categories or pixel-level annotations.

DetailsMotivation: Existing remote sensing change detection methods are limited by predefined categories and require large-scale pixel-level annotations, which restricts their generalization and applicability in open-world scenarios where arbitrary changes need to be detected.

Method: The framework integrates multi-level information fusion across data, feature, and decision levels: (1) Data level: Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with texture features using SAM-HQ; (2) Feature level: Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors using DINOv3; (3) Decision level: Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints using DGTRS-CLIP.
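
The feature-level step (ACT) can be approximated by thresholding a dense cosine-distance map between bi-temporal features; the sketch below substitutes a plain Otsu threshold for the paper's combination of global difference distribution and edge structure priors, so it is an illustration of the idea rather than the method itself.

```python
import numpy as np
from skimage.filters import threshold_otsu

def change_mask(feat_t0, feat_t1):
    """feat_t0, feat_t1: (H, W, D) dense per-pixel feature maps."""
    f0 = feat_t0 / np.linalg.norm(feat_t0, axis=-1, keepdims=True)
    f1 = feat_t1 / np.linalg.norm(feat_t1, axis=-1, keepdims=True)
    dist = 1.0 - np.sum(f0 * f1, axis=-1)  # cosine distance per pixel
    return dist > threshold_otsu(dist)     # adaptive global threshold
```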

Result: Comprehensive evaluations across nine scenarios show AdaptOVCD detects arbitrary category changes in zero-shot manner, significantly outperforming existing training-free methods. It achieves 84.89% of fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities.

Conclusion: AdaptOVCD provides an effective training-free solution for open-vocabulary change detection that overcomes limitations of traditional methods by leveraging heterogeneous pre-trained models through multi-level information fusion, enabling detection of arbitrary changes without predefined categories or extensive annotations.

Abstract: Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.

[140] Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance

Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova

Main category: cs.CV

TL;DR: ForgeryEraser is an anti-forensics attack framework that exploits vulnerabilities in AIGC detectors by manipulating image embeddings in VLM feature space to erase forgery traces without needing access to target detectors.

DetailsMotivation: Existing AIGC authenticity assessment protocols overlook anti-forensics attacks, failing to ensure comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. There's a need to address vulnerabilities stemming from systemic reliance on VLMs as shared backbones.

Method: Proposes ForgeryEraser framework that executes universal anti-forensics attacks without access to target AIGC detectors. Instead of traditional logit-based optimization, uses multi-modal guidance loss to drive forged image embeddings within VLM feature space toward text-derived authentic anchors while repelling them from forgery anchors.
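
The multi-modal guidance loss plausibly reduces to an attract/repel objective over normalized embeddings in the frozen VLM space; the margin-free form and prompt-derived anchors below are assumptions, not the paper's exact loss.

```python
import torch.nn.functional as F

def guidance_loss(image_emb, authentic_text_emb, forgery_text_emb):
    """All inputs: (batch, dim) embeddings from the shared frozen VLM space."""
    img = F.normalize(image_emb, dim=-1)
    pos = F.normalize(authentic_text_emb, dim=-1)
    neg = F.normalize(forgery_text_emb, dim=-1)
    attract = 1.0 - (img * pos).sum(dim=-1)  # pull toward authentic anchors
    repel = (img * neg).sum(dim=-1)          # push away from forgery anchors
    return (attract + repel).mean()
```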

Result: Extensive experiments show ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Also induces explainable forensic models to generate explanations consistent with authentic images for forged images.

Conclusion: The work reveals adversarial vulnerabilities in AIGC detectors due to reliance on VLMs as shared backbones, and demonstrates effective anti-forensics attacks that can erase forgery traces in the feature space, highlighting security concerns for real-world AIGC detection systems.

Abstract: The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.

[141] NECromancer: Breathing Life into Skeletons via BVH Animation

Mingxi Xu, Qi Wang, Zhengyu Wen, Phong Dao Thien, Zhengyu Li, Ning Zhang, Xiaoyu He, Wei Zhao, Kehong Gong, Mingyuan Zhang

Main category: cs.CV

TL;DR: NECromancer is a universal motion tokenizer that works with arbitrary BVH skeletons, enabling cross-species motion transfer and generation through skeletal structure disentanglement.

DetailsMotivation: Existing motion tokenization methods are limited to species-specific skeletons, restricting their applicability across diverse morphologies. There's a need for a universal motion tokenizer that can handle arbitrary skeletal structures.

Method: Three-component approach: 1) Ontology-aware Skeletal Graph Encoder (OwO) for structural priors, 2) Topology-Agnostic Tokenizer (TAT) for compression into universal discrete representation, 3) Unified BVH Universe (UvU) dataset aggregating motions across heterogeneous skeletons.

Result: Achieves high-fidelity reconstruction under compression, effectively disentangles motion from skeletal structure, supports cross-species motion transfer, composition, denoising, generation, and text-motion retrieval.

Conclusion: NECromancer establishes a unified framework for motion analysis and synthesis across diverse morphologies, enabling universal motion tokenization that works with arbitrary BVH skeletons.

Abstract: Motion tokenization is a key component of generalizable motion models, yet most existing approaches are restricted to species-specific skeletons, limiting their applicability across diverse morphologies. We propose NECromancer (NEC), a universal motion tokenizer that operates directly on arbitrary BVH skeletons. NEC consists of three components: (1) an Ontology-aware Skeletal Graph Encoder (OwO) that encodes structural priors from BVH files, including joint semantics, rest-pose offsets, and skeletal topology, into skeletal embeddings; (2) a Topology-Agnostic Tokenizer (TAT) that compresses motion sequences into a universal, topology-invariant discrete representation; and (3) the Unified BVH Universe (UvU), a large-scale dataset aggregating BVH motions across heterogeneous skeletons. Experiments show that NEC achieves high-fidelity reconstruction under substantial compression and effectively disentangles motion from skeletal structure. The resulting token space supports cross-species motion transfer, composition, denoising, generation with token-based models, and text-motion retrieval, establishing a unified framework for motion analysis and synthesis across diverse morphologies. Demo page: https://animotionlab.github.io/NECromancer/

[142] LIBERO-X: Robustness Litmus for Vision-Language-Action Models

Guodong Wang, Chenkai Zhang, Qingjie Liu, Jinjin Zhang, Jiancheng Cai, Junjie Liu, Xinmin Liu

Main category: cs.CV

TL;DR: LIBERO-X is a new benchmark for Vision-Language-Action models featuring hierarchical evaluation protocols and diverse training data to better assess generalization, robustness, and real-world performance under distribution shifts.

DetailsMotivation: Existing VLA benchmarks provide limited or misleading assessments due to insufficient evaluation protocols that fail to capture real-world distribution shifts, making reliable benchmarking critical for advancing VLA models.

Method: Introduces LIBERO-X benchmark with: 1) Hierarchical evaluation protocol with progressive difficulty levels targeting spatial generalization, object recognition, and task instruction understanding; 2) High-diversity training dataset collected via human teleoperation with multiple fine-grained manipulation objectives per scene.

Result: Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding.

Conclusion: LIBERO-X offers a more reliable foundation for assessing and advancing VLA development by integrating hierarchical evaluation with diverse training data to address distribution gaps.

Abstract: Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.

[143] SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti

Main category: cs.CV

TL;DR: SPARC is a modular VLM framework that explicitly separates visual perception from reasoning, enabling test-time scaling, asymmetric compute allocation, and selective optimization.

DetailsMotivation: Current vision-language models have brittle test-time scaling where perception and reasoning are entangled in unstructured chains-of-thought, leading to cascading errors and requiring expensive RL training.

Method: Two-stage pipeline: 1) explicit visual search to localize question-relevant regions, 2) reasoning conditioned on those regions. Enables independent scaling, asymmetric compute allocation, and compressed contexts via multi-resolution processing.

Result: Outperforms monolithic baselines and visual-grounding approaches on challenging VQA benchmarks. Improves Qwen3VL-4B by 6.7 points on V* VQA and surpasses “thinking with images” by 4.6 points on OOD tasks with 200× lower token budget.

Conclusion: Explicit separation of perception and reasoning enables more robust test-time scaling, efficient compute allocation, and better performance than entangled approaches in VLMs.

Abstract: Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the $V^*$ VQA benchmark by 6.7 percentage points, and it surpasses “thinking with images” by 4.6 points on a challenging OOD task despite requiring a 200$\times$ lower token budget.

[144] An Integer Linear Programming Approach to Geometrically Consistent Partial-Partial Shape Matching

Viktoria Ehm, Paul Roetzer, Florian Bernard, Daniel Cremers

Main category: cs.CV

TL;DR: First integer linear programming approach for partial-partial 3D shape matching that leverages geometric consistency to find overlapping regions and compute neighborhood-preserving correspondences.

DetailsMotivation: Partial-partial 3D shape matching is the most realistic setting (e.g., 3D scanning) but has unique challenges: need to compute accurate correspondences while finding unknown overlapping regions. Limited previous work exists due to these challenges.

Method: Introduces first integer linear programming approach specifically for partial-partial shape matching. Uses geometric consistency as strong prior to robustly estimate overlapping region and compute neighborhood-preserving correspondences.
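
The paper's exact formulation is not given in the summary, but the flavor of a geometrically consistent matching ILP can be conveyed with a toy instance. Below is a minimal PuLP sketch under assumptions: the point sets, edge sets, unary cost, and match reward are all made up. Binary variables select correspondences, injectivity caps each vertex at one match (leaving non-overlap vertices unmatched), and pairwise constraints forbid adjacent source vertices from mapping to non-adjacent targets.

```python
import pulp

# Toy partial-partial instance: two tiny "meshes" as points plus edge sets.
src_pts = {0: (0, 0), 1: (1, 0), 2: (0, 1)}
tgt_pts = {0: (0.1, 0.0), 1: (1.1, 0.1), 2: (0.0, 1.1), 3: (5.0, 5.0)}
src_edges = {(0, 1), (0, 2)}
tgt_edges = {(0, 1), (0, 2), (0, 3)}

def cost(i, j):  # stand-in for a descriptor distance
    return sum((a - b) ** 2 for a, b in zip(src_pts[i], tgt_pts[j]))

prob = pulp.LpProblem("partial_partial_matching", pulp.LpMinimize)
x = {(i, j): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
     for i in src_pts for j in tgt_pts}
# Objective: matching cost minus a reward per kept correspondence, so the
# solver may leave vertices unmatched (the unknown overlapping region).
prob += pulp.lpSum((cost(i, j) - 1.0) * x[i, j] for (i, j) in x)
for i in src_pts:  # each source vertex matched at most once
    prob += pulp.lpSum(x[i, j] for j in tgt_pts) <= 1
for j in tgt_pts:  # each target vertex matched at most once
    prob += pulp.lpSum(x[i, j] for i in src_pts) <= 1
# Geometric consistency: adjacent source vertices may only map to adjacent targets.
for (i, k) in src_edges:
    for j in tgt_pts:
        for l in tgt_pts:
            if j != l and (j, l) not in tgt_edges and (l, j) not in tgt_edges:
                prob += x[i, j] + x[k, l] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
matches = [(i, j) for (i, j), v in x.items() if v.value() == 1]
print(matches)  # the far-away target vertex 3 stays unmatched
```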

Result: Achieves high-quality matching results in terms of both matching error and smoothness. Method is more scalable than previous formalisms.

Conclusion: Proposed integer linear programming approach effectively addresses the challenging partial-partial 3D shape matching problem by leveraging geometric consistency, achieving robust performance and improved scalability.

Abstract: The task of establishing correspondences between two 3D shapes is a long-standing challenge in computer vision. While numerous studies address full-full and partial-full 3D shape matching, only a limited number of works have explored the partial-partial setting, very likely due to its unique challenges: we must compute accurate correspondences while at the same time finding the unknown overlapping region. Nevertheless, partial-partial 3D shape matching reflects the most realistic setting, as in many real-world cases, such as 3D scanning, shapes are only partially observable. In this work, we introduce the first integer linear programming approach specifically designed to address the distinctive challenges of partial-partial shape matching. Our method leverages geometric consistency as a strong prior, enabling both robust estimation of the overlapping region and computation of neighbourhood-preserving correspondences. We empirically demonstrate that our approach achieves high-quality matching results both in terms of matching error and smoothness. Moreover, we show that our method is more scalable than previous formalisms.

[145] ProtoQuant: Quantization of Prototypical Parts For General and Fine-Grained Image Classification

Mikołaj Janusz, Adam Wróbel, Bartosz Zieliński, Dawid Rymarczyk

Main category: cs.CV

TL;DR: ProtoQuant introduces a novel prototypical parts-based model using latent vector quantization to achieve stable, grounded prototypes without backbone finetuning, scaling to ImageNet-scale datasets while maintaining interpretability.

DetailsMotivation: Existing prototypical parts-based models struggle with ImageNet-scale generalization, require expensive backbone finetuning, and suffer from "prototype drift" where learned prototypes lack grounding in training data and change under small perturbations.

Method: ProtoQuant uses latent vector quantization to constrain prototypes to a discrete learned codebook within the latent space, ensuring they remain faithful representations of training data without needing to update the backbone. This creates an efficient, interpretable head that scales to large datasets.
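
A minimal sketch of the core mechanism: prototypes are snapped to their nearest entry in a learned codebook with a straight-through estimator, so they stay grounded in discrete latent vectors while the backbone stays frozen. The dimensions and the similarity-based classifier head below are illustrative assumptions, not the paper's architecture.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class ProtoVQHead(nn.Module):
    """Interpretable head whose prototypes are quantized to a codebook."""
    def __init__(self, dim=128, n_codes=256, n_protos=10, n_classes=5):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(n_codes, dim))
        self.protos = nn.Parameter(torch.randn(n_protos, dim))
        self.fc = nn.Linear(n_protos, n_classes, bias=False)

    def forward(self, feats):                            # feats: (B, dim) from a frozen backbone
        d = torch.cdist(self.protos, self.codebook)      # (P, K) prototype-to-code distances
        hard = self.codebook[d.argmin(dim=1)]            # nearest code per prototype
        q = self.protos + (hard - self.protos).detach()  # straight-through estimator
        sims = F.normalize(feats, dim=-1) @ F.normalize(q, dim=-1).T
        return self.fc(sims)                             # class logits from prototype similarities

logits = ProtoVQHead()(torch.randn(4, 128))
```

Because each prototype is literally a codebook vector at inference time, it cannot drift away from the discrete latent structure under small perturbations, which is the stability property the paper targets.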

Result: ProtoQuant achieves competitive classification accuracy on ImageNet and fine-grained benchmarks (CUB-200, Cars-196), with interpretability metrics comparable to other prototypical-parts-based methods.

Conclusion: ProtoQuant successfully addresses prototype drift and scalability issues in interpretable computer vision models through latent vector quantization, enabling efficient, stable prototypical learning at ImageNet scale.

Abstract: Prototypical parts-based models offer a “this looks like that” paradigm for intrinsic interpretability, yet they typically struggle with ImageNet-scale generalization and often require computationally expensive backbone finetuning. Furthermore, existing methods frequently suffer from “prototype drift,” where learned prototypes lack tangible grounding in the training distribution and change their activation under small perturbations. We present ProtoQuant, a novel architecture that achieves prototype stability and grounded interpretability through latent vector quantization. By constraining prototypes to a discrete learned codebook within the latent space, we ensure they remain faithful representations of the training data without the need to update the backbone. This design allows ProtoQuant to function as an efficient, interpretable head that scales to large-scale datasets. We evaluate ProtoQuant on ImageNet and several fine-grained benchmarks (CUB-200, Cars-196). Our results demonstrate that ProtoQuant achieves competitive classification accuracy while generalizing to ImageNet, with interpretability metrics comparable to other prototypical-parts-based methods.

[146] DAVE: Distribution-aware Attribution via ViT Gradient Decomposition

Adam Wróbel, Siddhartha Gairola, Jacek Tabor, Bernt Schiele, Bartosz Zieliński, Dawid Rymarczyk

Main category: cs.CV

TL;DR: DAVE is a new attribution method for Vision Transformers that produces stable, high-resolution attribution maps by decomposing gradients to separate locally equivariant components from architecture-induced artifacts.

DetailsMotivation: Vision Transformers have become dominant in computer vision, but existing attribution methods struggle to produce stable, high-resolution attribution maps due to architectural components like patch embeddings and attention routing that introduce structured artifacts.

Method: DAVE uses a mathematically grounded structured decomposition of input gradients for ViTs, exploiting architectural properties to isolate locally equivariant and stable components of the input-output mapping while separating architecture-induced artifacts and other instability sources.

Result: The method produces improved attribution maps for Vision Transformers, addressing the challenge of structured artifacts from patch embeddings and attention routing that plague existing methods.

Conclusion: DAVE provides a principled approach to attribution for Vision Transformers that yields more stable and higher-resolution explanations compared to existing methods.

Abstract: Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable and high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts in pixel-level explanations, causing many existing methods to rely on coarse patch-level attributions. We introduce DAVE (Distribution-aware Attribution via ViT Gradient DEcomposition), a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates locally equivariant and stable components of the effective input–output mapping. It separates these from architecture-induced artifacts and other sources of instability.

[147] CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling

Yuxin He, An Li, Cheng Xue

Main category: cs.CV

TL;DR: CauCLIP: A causality-inspired vision-language framework using CLIP for domain-invariant surgical phase recognition without target domain data, achieving state-of-the-art performance on surgical domain adaptation benchmarks.

DetailsMotivation: Surgical phase recognition is crucial for intelligent operating rooms, but faces challenges due to limited annotated clinical videos and large domain gaps between synthetic and real surgical data. Existing methods struggle with domain generalization when target domain data is unavailable.

Method: Proposes CauCLIP, a causality-inspired vision-language framework that leverages CLIP for domain-invariant representation learning. Key components include: 1) Frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and 2) Causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These are combined in a unified training framework to focus on stable causal factors underlying surgical workflows.
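
The frequency-based augmentation is not specified in detail here; one standard way to perturb domain-specific appearance while preserving semantic structure is Fourier-style amplitude mixing, swapping a low-frequency band of the amplitude spectrum with a same-sized reference image while keeping the phase. A sketch under that assumption (band size `beta` is illustrative; the paper's exact strategy may differ):

```python
import numpy as np

def freq_augment(img, ref, beta=0.1):
    """Mix low-frequency amplitude (domain style) of `ref` into `img`,
    keeping `img`'s phase (semantic layout). Both are (H, W, C) in [0, 1]."""
    f_img = np.fft.fft2(img, axes=(0, 1))
    f_ref = np.fft.fft2(ref, axes=(0, 1))
    amp, pha = np.abs(f_img), np.angle(f_img)
    # shift so low frequencies sit at the center, then swap a centered band
    a = np.fft.fftshift(amp, axes=(0, 1))
    a_ref = np.fft.fftshift(np.abs(f_ref), axes=(0, 1))
    h, w = img.shape[:2]
    b, ch, cw = int(min(h, w) * beta), h // 2, w // 2
    a[ch - b:ch + b, cw - b:cw + b] = a_ref[ch - b:ch + b, cw - b:cw + b]
    amp = np.fft.ifftshift(a, axes=(0, 1))
    out = np.fft.ifft2(amp * np.exp(1j * pha), axes=(0, 1)).real
    return np.clip(out, 0, 1)

aug = freq_augment(np.random.rand(224, 224, 3), np.random.rand(224, 224, 3))
```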

Result: Experiments on the SurgVisDom hard adaptation benchmark demonstrate that CauCLIP substantially outperforms all competing approaches, showing the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.

Conclusion: CauCLIP successfully addresses domain generalization challenges in surgical phase recognition by leveraging causality principles and vision-language models, enabling robust performance without access to target domain data.

Abstract: Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.

[148] PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

Junxian Li, Kai Liu, Leyang Chen, Weida Wang, Zhixin Wang, Jiaqi Xu, Fan Li, Renjing Pei, Linghe Kong, Yulun Zhang

Main category: cs.CV

TL;DR: PlanViz is a new benchmark for evaluating multimodal models’ capabilities in computer-use planning tasks involving image generation and editing, with focus on spatial reasoning and procedural understanding.

DetailsMotivation: While unified multimodal models show impressive image generation and multimodal reasoning capabilities, their potential for computer-use planning tasks (closely related to daily life) remains underexplored. These tasks require spatial reasoning and procedural understanding capabilities that haven't been systematically evaluated.

Method: Proposes PlanViz benchmark with three new sub-tasks: route planning, work diagramming, and web&UI displaying. Uses human-annotated questions and reference images with quality control process. Introduces PlanScore, a task-adaptive evaluation metric for correctness, visual quality, and efficiency.

Result: The benchmark enables systematic evaluation of multimodal models’ capabilities in computer-use planning tasks. Experiments reveal key limitations and opportunities for future research in this area.

Conclusion: PlanViz provides a comprehensive benchmark for evaluating multimodal models’ capabilities in computer-use planning tasks, highlighting current limitations and research opportunities in spatial reasoning and procedural understanding for image generation and editing.

Abstract: Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our daily lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess the capabilities to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, three new sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address data-quality challenges by curating human-annotated questions and reference images, together with a quality control process. For comprehensive and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps assess the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

[149] CytoCrowd: A Multi-Annotator Benchmark Dataset for Cytology Image Analysis

Yonghao Si, Xingyuan Zeng, Zhao Chen, Libin Zheng, Caleb Chen Cao, Lei Chen, Jian Yin

Main category: cs.CV

TL;DR: CytoCrowd is a medical cytology benchmark with 446 high-res images featuring both raw conflicting annotations from 4 pathologists and separate gold-standard ground truth, enabling evaluation of both standard CV tasks and annotation aggregation algorithms.

DetailsMotivation: Existing medical image datasets either provide single clean ground truth (hiding real-world expert disagreement) or multiple annotations without separate gold standard for objective evaluation. There's a need for datasets that capture real-world expert variability while maintaining objective evaluation capability.

Method: Created CytoCrowd dataset with 446 high-resolution cytology images, each annotated by four independent pathologists (raw conflicting annotations) plus a separate high-quality gold-standard ground truth established by a senior expert. Provides comprehensive baseline results for both standard computer vision tasks and annotation aggregation algorithms.

Result: CytoCrowd presents challenges for both standard computer vision tasks and annotation aggregation algorithms. The dataset demonstrates the value of capturing real-world expert disagreement while maintaining objective evaluation capability through separate gold standard.

Conclusion: CytoCrowd serves as a versatile resource for medical image analysis, enabling evaluation of both standard computer vision tasks and algorithms for resolving expert disagreements. It advances the field by providing realistic testbed for next-generation models.

Abstract: High-quality annotated datasets are crucial for advancing machine learning in medical image analysis. However, a critical gap exists: most datasets either offer a single, clean ground truth, which hides real-world expert disagreement, or they provide multiple annotations without a separate gold standard for objective evaluation. To bridge this gap, we introduce CytoCrowd, a new public benchmark for cytology analysis. The dataset features 446 high-resolution images, each with two key components: (1) raw, conflicting annotations from four independent pathologists, and (2) a separate, high-quality gold-standard ground truth established by a senior expert. This dual structure makes CytoCrowd a versatile resource. It serves as a benchmark for standard computer vision tasks, such as object detection and classification, using the ground truth. Simultaneously, it provides a realistic testbed for evaluating annotation aggregation algorithms that must resolve expert disagreements. We provide comprehensive baseline results for both tasks. Our experiments demonstrate the challenges presented by CytoCrowd and establish its value as a resource for developing the next generation of models for medical image analysis.

[150] Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

Bo Du, Xiaochen Ma, Xuekang Zhu, Zhe Yang, Chaogun Niu, Jian Liu, Ji-Zhe Zhou

Main category: cs.CV

TL;DR: SICA is a monolithic fake image detection framework that addresses heterogeneous forensic artifacts across domains by using high-level semantics as structural prior to reconstruct a unified-yet-discriminative artifact feature space.

DetailsMotivation: Current monolithic fake image detection models underperform compared to ensemble approaches due to the "heterogeneous phenomenon": the intrinsic distinctness of artifacts across forensic subdomains, which causes collapse of the artifact feature space.

Method: Proposes Semantic-Induced Constrained Adaptation (SICA) paradigm that uses high-level semantics as structural prior to reconstruct artifact feature space in a near-orthogonal manner, enabling unified-yet-discriminative detection.

Result: SICA outperforms 15 state-of-the-art methods on OpenMMSec dataset and successfully reconstructs the target unified-yet-discriminative artifact feature space, validating the hypothesis about semantics as structural prior.

Conclusion: SICA demonstrates that high-level semantics can effectively serve as structural prior for reconstructing artifact feature spaces, enabling practical monolithic fake image detection that handles heterogeneous forensic domains.

Abstract: Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, by discovering the "heterogeneous phenomenon", which is the intrinsic distinctness of artifacts across subdomains, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space driven by such phenomenon. The core challenge for developing a practical monolithic FID model thus boils down to the "unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at: https://github.com/scu-zjz/SICA_OpenMMSec.

[151] Clinical-Prior Guided Multi-Modal Learning with Latent Attention Pooling for Gait-Based Scoliosis Screening

Dong Chen, Zizhuang Wei, Jialei Xu, Xinyang Sun, Zonglin He, Meiru An, Huili Peng, Yong Hu, Kenneth MC Cheung

Main category: cs.CV

TL;DR: ScoliGait: A new benchmark dataset and multi-modal framework for video-based gait analysis to detect Adolescent Idiopathic Scoliosis using clinical kinematic priors and interpretable feature representations.

DetailsMotivation: Current AIS screening methods are subjective, difficult to scale, and rely on specialized expertise. Video-based gait analysis offers promise but suffers from data leakage issues and oversimplified models lacking clinical interpretability.

Method: Introduces ScoliGait dataset with 1,572 training and 300 independent test clips, each annotated with Cobb angles and clinical text. Proposes multi-modal framework integrating clinical-prior-guided kinematic knowledge map for interpretable features and latent attention pooling to fuse video, text, and knowledge map modalities.
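
The latent attention pooling idea, a learned latent query cross-attending over the concatenated modality tokens to produce one fused vector, can be sketched as follows. Dimensions, token counts, and the single-query design are assumptions for illustration, not the paper's exact module:

```python
import torch, torch.nn as nn

class LatentAttentionPool(nn.Module):
    """Fuse variable-length multi-modal tokens into one vector via a learned latent query."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.latent.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)      # latent query attends over all tokens
        return fused.squeeze(1)                      # (B, dim) fused representation

# Hypothetical token streams: video frames, clinical text, kinematic knowledge map.
video, text, kmap = torch.randn(2, 32, 256), torch.randn(2, 8, 256), torch.randn(2, 16, 256)
pooled = LatentAttentionPool(256)(torch.cat([video, text, kmap], dim=1))
```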

Result: Establishes new state-of-the-art performance with significant gains on realistic, subject-independent benchmark, providing robust and interpretable foundation for non-invasive AIS assessment.

Conclusion: The work provides a clinically grounded, scalable foundation for non-invasive AIS assessment through interpretable multi-modal video analysis.

Abstract: Adolescent Idiopathic Scoliosis (AIS) is a prevalent spinal deformity whose progression can be mitigated through early detection. Conventional screening methods are often subjective, difficult to scale, and reliant on specialized clinical expertise. Video-based gait analysis offers a promising alternative, but current datasets and methods frequently suffer from data leakage, where performance is inflated by repeated clips from the same individual, or employ oversimplified models that lack clinical interpretability. To address these limitations, we introduce ScoliGait, a new benchmark dataset comprising 1,572 gait video clips for training and 300 fully independent clips for testing. Each clip is annotated with radiographic Cobb angles and descriptive text based on clinical kinematic priors. We propose a multi-modal framework that integrates a clinical-prior-guided kinematic knowledge map for interpretable feature representation, alongside a latent attention pooling mechanism to fuse video, text, and knowledge map modalities. Our approach establishes a new state of the art, showing a significant performance gain on a realistic, subject-independent benchmark. This work provides a robust, interpretable, and clinically grounded foundation for scalable, non-invasive AIS assessment.

[152] Gold Exploration using Representations from a Multispectral Autoencoder

Argyro Tsandalidou, Konstantinos Dogeas, Eleftheria Tetoula Tsonga, Elisavet Parselia, Georgios Tsimiklis, George Arvanitakis

Main category: cs.CV

TL;DR: A framework using generative representations from multispectral satellite imagery (Sentinel-2) to identify gold-bearing regions through an autoencoder foundation model and XGBoost classifier.

DetailsMotivation: To enable large-scale mineral exploration using satellite imagery due to high costs and limited availability of on-site data, specifically for identifying gold-bearing regions from space.

Method: Uses Isometric autoencoder foundation model pretrained on FalconSpace-S2 v1.0 dataset to generate spectral-spatial representations from multispectral Sentinel-2 imagery, then feeds these representations to a lightweight XGBoost classifier for gold identification.
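
The downstream pipeline, frozen foundation-model embeddings fed to a lightweight XGBoost classifier, is straightforward to sketch. Random arrays stand in for the Isometric representations and the gold/non-gold labels, and the hyperparameters are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
Z = rng.normal(size=(63, 512))      # stand-in for autoencoder embeddings per image
y = rng.integers(0, 2, size=63)     # 1 = gold-bearing location, 0 = not

Zt, Zv, yt, yv = train_test_split(Z, y, test_size=0.3, random_state=0, stratify=y)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(Zt, yt)
print("held-out accuracy:", clf.score(Zv, yv))
```

The design choice here is typical of the foundation-model recipe: the expensive representation learning is done once, so the task-specific classifier stays cheap enough to retrain for each mineral or region.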

Result: Improved patch-level accuracy from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73 compared to raw spectral input baseline, demonstrating generative embeddings capture transferable mineralogical patterns with limited labeled data.

Conclusion: Foundation-model representations from satellite imagery can make mineral exploration more efficient, scalable, and globally applicable by identifying mineral deposits from space.

Abstract: Satellite imagery is employed for large-scale prospectivity mapping due to the high cost and typically limited availability of on-site mineral exploration data. In this work, we present a proof-of-concept framework that leverages generative representations learned from multispectral Sentinel-2 imagery to identify gold-bearing regions from space. An autoencoder foundation model, called Isometric, which is pretrained on the large-scale FalconSpace-S2 v1.0 dataset, produces information-dense spectral-spatial representations that serve as inputs to a lightweight XGBoost classifier. We compare this representation-based approach with a raw spectral input baseline using a dataset of 63 Sentinel-2 images from known gold and non-gold locations. The proposed method improves patch-level accuracy from 0.51 to 0.68 and image-level accuracy from 0.55 to 0.73, demonstrating that generative embeddings capture transferable mineralogical patterns even with limited labeled data. These results highlight the potential of foundation-model representations to make mineral exploration more efficient, scalable, and globally applicable.

[153] Revisiting Emotions Representation for Recognition in the Wild

Joao Baptista Cardia Neto, Claudio Ferrari, Stefano Berretti

Main category: cs.CV

TL;DR: Paper proposes novel approach for facial emotion recognition using probability distributions over emotion classes instead of single-label classification, leveraging VAD space mapping to automatically relabel datasets for richer emotional state representation.

DetailsMotivation: Traditional single-label emotion classification oversimplifies complex emotional states which are often combinations of multiple emotions at different intensities. Existing distribution learning approaches are limited by datasets annotated with single emotion classes.

Method: Proposes automatic relabeling of existing datasets by mapping basic and compound emotions to probability distributions in Valence-Arousal-Dominance (VAD) space. Given face images with VAD annotations, estimates likelihood of belonging to each emotion distribution, enabling description of emotional states as mixtures.
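
Given per-emotion distributions in VAD space, turning one face's VAD annotation into a soft label is a likelihood evaluation plus normalization. A minimal sketch with made-up Gaussian parameters (the paper takes its distributions from a psychological mapping study, so the means and covariances below are purely illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical (valence, arousal, dominance) mean and isotropic variance per emotion.
emotions = {
    "happiness": ((0.8, 0.6, 0.5), 0.02),
    "sadness":   ((-0.6, -0.3, -0.4), 0.03),
    "anger":     ((-0.5, 0.7, 0.3), 0.03),
}

def emotion_mixture(vad):
    """Soft label: normalized likelihood of a VAD point under each emotion."""
    like = np.array([multivariate_normal(mean=m, cov=c * np.eye(3)).pdf(vad)
                     for m, c in emotions.values()])
    return dict(zip(emotions, like / like.sum()))

print(emotion_mixture((0.5, 0.4, 0.2)))  # e.g., mostly happiness with some anger
```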

Result: Preliminary experiments show advantages of this solution and open new investigation directions. Data annotations are publicly available on GitHub.

Conclusion: The approach enables richer description of emotional states as mixtures of emotions while accounting for ambiguous perception nature, moving beyond oversimplified single-label classification.

Abstract: Facial emotion recognition has been typically cast as a single-label classification problem of one out of six prototypical emotions. However, that is an oversimplification that is unsuitable for representing the multifaceted spectrum of spontaneous emotional states, which are most often the result of a combination of multiple emotions contributing at different intensities. Building on this, a promising direction that was explored recently is to cast emotion recognition as a distribution learning problem. Still, such approaches are limited in that research datasets are typically annotated with a single emotion class. In this paper, we contribute a novel approach to describe complex emotional states as probability distributions over a set of emotion classes. To do so, we propose a solution to automatically re-label existing datasets by exploiting the result of a study in which a large set of both basic and compound emotions is mapped to probability distributions in the Valence-Arousal-Dominance (VAD) space. In this way, given a face image annotated with VAD values, we can estimate the likelihood of it belonging to each of the distributions, so that emotional states can be described as a mixture of emotions, enriching their description, while also accounting for the ambiguous nature of their perception. In a preliminary set of experiments, we illustrate the advantages of this solution and a new possible direction of investigation. Data annotations are available at https://github.com/jbcnrlz/affectnet-b-annotation.

[154] Machine Learning for Detection and Severity Estimation of Sweetpotato Weevil Damage in Field and Lab Conditions

Doreen M. Chelangat, Sudi Murindanyi, Bruce Mugizi, Paul Musana, Benard Yada, Milton A. Otema, Florence Osaru, Andrew Katumba, Joyce Nakatumba-Nabende

Main category: cs.CV

TL;DR: Computer vision approach for automated assessment of sweetpotato weevil damage using classification models for field data and YOLO12-based object detection for laboratory analysis.

DetailsMotivation: Traditional manual assessment of sweetpotato weevil damage is labor-intensive, subjective, and inconsistent, hindering breeding programs for resilient varieties.

Method: Two-stage approach: 1) Classification models for field damage severity prediction, 2) Laboratory pipeline using YOLO12 with root segmentation and tiling strategy for small object detection of feeding holes.
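
The tiling step of the laboratory pipeline, cutting a high-resolution root image into overlapping tiles so tiny feeding holes occupy enough pixels for the detector, can be sketched as below. Here `detector` is a hypothetical YOLO-style callable returning (x1, y1, x2, y2, score) tuples in tile coordinates; tile size and overlap are assumptions:

```python
import numpy as np

def tiles(image, tile_size=640, overlap=0.2):
    """Yield (tile, x0, y0) crops covering the image with overlap at the seams."""
    h, w = image.shape[:2]
    step = int(tile_size * (1 - overlap))
    for y0 in range(0, max(h - tile_size, 0) + 1, step):
        for x0 in range(0, max(w - tile_size, 0) + 1, step):
            yield image[y0:y0 + tile_size, x0:x0 + tile_size], x0, y0

def detect_full_image(image, detector, tile_size=640):
    boxes = []
    for patch, x0, y0 in tiles(image, tile_size):
        for (x1, y1, x2, y2, score) in detector(patch):
            # shift tile-local boxes back into full-image coordinates
            boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
    return boxes  # cross-tile NMS would follow to merge duplicates at seams

dets = detect_full_image(np.zeros((2000, 3000, 3)), lambda p: [])
```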

Result: Field classification achieved 71.43% test accuracy; laboratory object detection achieved 77.7% mean average precision for detecting minute weevil feeding holes.

Conclusion: Computer vision provides efficient, objective, scalable assessment tools that enhance phenotyping efficiency in sweetpotato breeding programs and help mitigate weevil impacts on food security.

Abstract: Sweetpotato weevils (Cylas spp.) are considered among the most destructive pests impacting sweetpotato production, particularly in sub-Saharan Africa. Traditional methods for assessing weevil damage, predominantly relying on manual scoring, are labour-intensive, subjective, and often yield inconsistent results. These challenges significantly hinder breeding programs aimed at developing resilient sweetpotato varieties. This study introduces a computer vision-based approach for the automated evaluation of weevil damage in both field and laboratory contexts. In the field settings, we collected data to train classification models to predict root-damage severity levels, achieving a test accuracy of 71.43%. Additionally, we established a laboratory dataset and designed an object detection pipeline employing YOLO12, a leading real-time detection model. This methodology incorporated a two-stage laboratory pipeline that combined root segmentation with a tiling strategy to improve the detectability of small objects. The resulting model demonstrated a mean average precision of 77.7% in identifying minute weevil feeding holes. Our findings indicate that computer vision technologies can provide efficient, objective, and scalable assessment tools that align seamlessly with contemporary breeding workflows. These advancements represent a significant improvement in enhancing phenotyping efficiency within sweetpotato breeding programs and play a crucial role in mitigating the detrimental effects of weevils on food security.

[155] A Unified Formula for Affine Transformations between Calibrated Cameras

Levente Hajder

Main category: cs.CV

TL;DR: Derivation of closed-form expression for affine transformation mapping local image patches between two calibrated camera views based on relative pose, image coordinates, and surface normals.

DetailsMotivation: To provide a mathematical foundation for understanding how local image patches transform between different camera viewpoints, which is fundamental for computer vision tasks like stereo matching, structure from motion, and 3D reconstruction.

Method: Derivation of closed-form mathematical expression using geometric principles of calibrated cameras, relative pose transformations, and local surface geometry.
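
The note's unified formula itself is not reproduced in this summary, but the standard ingredients are worth recalling; the following is textbook multi-view geometry and the note's derivation may parameterize things differently. For a plane with unit normal $n$ and distance $d$ in the first camera's frame, relative pose $(R, t)$, and intrinsics $K_1, K_2$, the induced homography is

$$H = K_2\left(R - \frac{t\,n^\top}{d}\right)K_1^{-1},$$

and the local affine transformation at a correspondence is the Jacobian of this map. Writing $h_1^\top, h_2^\top, h_3^\top$ for the rows of $H$, $\tilde{x}_1 = (u_1, v_1, 1)^\top$, $s = h_3^\top \tilde{x}_1$, and $(u_2, v_2)$ for the projected point in the second view,

$$A = \frac{1}{s}\begin{pmatrix} h_{11} - u_2 h_{31} & h_{12} - u_2 h_{32} \\ h_{21} - v_2 h_{31} & h_{22} - v_2 h_{32} \end{pmatrix}.$$

Sign conventions for the plane term vary with how the plane is parameterized.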

Result: Obtained explicit formula for affine transformation that maps local image patches between two views as function of relative camera pose, image coordinates, and local surface normal.

Conclusion: Provides theoretical foundation for understanding local image transformations in multi-view geometry, with applications in computer vision algorithms requiring patch correspondence.

Abstract: In this technical note, we derive a closed-form expression for the affine transformation mapping local image patches between two calibrated views. We show that the transformation is a function of the relative camera pose, the image coordinates, and the local surface normal.

[156] RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

Main category: cs.CV

TL;DR: RAIGen is an unsupervised framework for discovering rare attributes in diffusion models by analyzing neuron activations to identify underrepresented semantic features beyond predefined fairness categories.

DetailsMotivation: Current bias mitigation approaches focus either on predefined fairness categories (closed-set) or identifying majority attributes (open-set), but overlook discovering rare/minority features that are underrepresented in training data but still encoded in model representations.

Method: RAIGen uses Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes.
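
The exact minority metric is not spelled out in the summary; one illustrative way to combine activation frequency with semantic distinctiveness over SAE neurons is sketched below. All arrays are synthetic stand-ins, and the specific rarity and distinctiveness formulas are assumptions, not the paper's definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.random((512, 10000)) < 0.02   # SAE neuron activations (neurons x images)
embs = rng.normal(size=(512, 768))       # a semantic embedding per neuron
embs /= np.linalg.norm(embs, axis=1, keepdims=True)

freq = acts.mean(axis=1)                 # how often each neuron fires
rarity = 1 - freq / freq.max()           # rare neurons score high
sims = embs @ embs.T                     # neuron-neuron semantic similarity
np.fill_diagonal(sims, -np.inf)
# distinctiveness: far from the 5 most similar neighbours in embedding space
distinct = 1 - np.sort(sims, axis=1)[:, -5:].mean(axis=1)

minority_score = rarity * distinct       # high = rare and semantically distinct
top = np.argsort(minority_score)[::-1][:20]  # neurons to inspect via top-activating images
```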

Result: RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.

Conclusion: RAIGen provides a novel unsupervised approach for discovering rare attributes in diffusion models, addressing a gap in current bias mitigation methods and enabling better understanding and control of model biases.

Abstract: Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for unsupervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.

[157] GaussianPOP: Principled Simplification Framework for Compact 3D Gaussian Splatting via Error Quantification

Soonbin Lee, Yeong-Gyu Kim, Simon Sasse, Tomas M. Borges, Yago Sanchez, Eun-Seok Ryu, Thomas Schierl, Cornelius Hellge

Main category: cs.CV

TL;DR: GaussianPOP: A principled 3D Gaussian Splatting simplification framework using analytical error quantification to optimize trade-offs between model compactness and rendering fidelity.

DetailsMotivation: Existing 3D Gaussian Splatting simplification methods use importance scores not driven by visual error metrics, leading to suboptimal trade-offs between compactness and rendering quality.

Method: Develops a novel error criterion derived from the 3DGS rendering equation to measure each Gaussian’s contribution to rendered images, with an efficient algorithm enabling single-forward-pass error calculation. Supports both on-training pruning and post-training simplification via iterative error re-quantification.
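
The paper derives an analytic, single-pass criterion; a brute-force version of the same question, how much a pixel changes when one Gaussian is dropped from the 3DGS compositing sum, makes the idea concrete. A numpy sketch along a single ray, with random opacities and colors as stand-ins:

```python
import numpy as np

# Front-to-back alpha compositing (the 3DGS rendering equation per pixel):
#   C = sum_i c_i * alpha_i * T_i,  with transmittance T_i = prod_{j<i} (1 - alpha_j)
rng = np.random.default_rng(0)
alpha = rng.uniform(0.05, 0.6, size=8)   # per-Gaussian opacity at this pixel
color = rng.uniform(0, 1, size=(8, 3))   # per-Gaussian RGB

T = np.concatenate([[1.0], np.cumprod(1 - alpha)[:-1]])
C = (color * (alpha * T)[:, None]).sum(axis=0)

def error_if_removed(k):
    """Pixel error from dropping Gaussian k and recompositing the rest."""
    keep = np.arange(len(alpha)) != k
    a, c = alpha[keep], color[keep]
    Tk = np.concatenate([[1.0], np.cumprod(1 - a)[:-1]])
    return np.abs(C - (c * (a * Tk)[:, None]).sum(axis=0)).sum()

scores = [error_if_removed(k) for k in range(len(alpha))]  # prune lowest-error first
```

The paper's contribution is precisely avoiding this leave-one-out recompositing: its criterion quantifies each Gaussian's contribution analytically in a single forward pass.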

Result: Outperforms existing state-of-the-art pruning methods across both application scenarios, achieving superior trade-off between model compactness and high rendering quality.

Conclusion: GaussianPOP provides an accurate and flexible framework for 3D Gaussian Splatting simplification that improves upon existing methods by using principled error quantification rather than heuristic importance scores.

Abstract: Existing 3D Gaussian Splatting simplification methods commonly use importance scores, such as blending weights or sensitivity, to identify redundant Gaussians. However, these scores are not driven by visual error metrics, often leading to suboptimal trade-offs between compactness and rendering fidelity. We present GaussianPOP, a principled simplification framework based on analytical Gaussian error quantification. Our key contribution is a novel error criterion, derived directly from the 3DGS rendering equation, that precisely measures each Gaussian’s contribution to the rendered image. By introducing a highly efficient algorithm, our framework enables practical error calculation in a single forward pass. The framework is both accurate and flexible, supporting on-training pruning as well as post-training simplification via iterative error re-quantification for improved stability. Experimental results show that our method consistently outperforms existing state-of-the-art pruning methods across both application scenarios, achieving a superior trade-off between model compactness and high rendering quality.

[158] Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

Meng Lou, Stanley Yu, Yizhou Yu

Main category: cs.CV

TL;DR: AdaRoute: A parameter-efficient fine-tuning method using mixture-of-experts with dynamic parameter routing for dense vision tasks

DetailsMotivation: Existing parameter-efficient fine-tuning (PEFT) methods for vision models have limitations in complex dense prediction tasks, including input-agnostic modeling and redundant cross-layer representations

Method: Proposes AdaRoute with shared expert centers where each expert is a trainable parameter matrix. Each AdaRoute module dynamically generates weight matrices via parameter routing that selectively aggregates matrices from expert centers, enabling input-dependent low-rank adaptation
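
A minimal sketch of input-conditioned low-rank adaptation with a shared expert center: a router mixes expert parameter matrices into per-input down/up projections, and the same center can be passed to modules in multiple layers. Shapes, the mean-pooled router input, and the zero-initialized up-projections are assumptions for illustration:

```python
import torch, torch.nn as nn

class AdaRouteAdapter(nn.Module):
    """Router mixes matrices from a shared expert center into per-input
    low-rank projections; the center may be shared across network layers."""
    def __init__(self, center_down, center_up, dim, n_experts):
        super().__init__()
        self.center_down, self.center_up = center_down, center_up
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                                # x: (B, N, dim)
        w = self.router(x.mean(dim=1)).softmax(dim=-1)   # (B, E) input-dependent mixture
        down = torch.einsum("be,edr->bdr", w, self.center_down)  # (B, dim, r)
        up = torch.einsum("be,erd->brd", w, self.center_up)      # (B, r, dim)
        h = torch.einsum("bnd,bdr->bnr", x, down)        # low-rank down-projection
        return x + torch.einsum("bnr,bre->bne", h, up)   # residual adaptation

dim, r, E = 256, 8, 4
center_down = nn.Parameter(torch.randn(E, dim, r) * 0.02)  # shared expert center
center_up = nn.Parameter(torch.zeros(E, r, dim))            # zero-init: identity at start
out = AdaRouteAdapter(center_down, center_up, dim, E)(torch.randn(2, 16, dim))
```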

Result: Extensive experiments show superiority on diverse vision tasks including semantic segmentation, object detection, instance segmentation, and panoptic segmentation

Conclusion: AdaRoute effectively addresses limitations of existing PEFT methods for dense vision tasks through dynamic parameter routing and shared expert centers

Abstract: Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: https://bit.ly/3NZcr0H.

[159] RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing

Mohammadreza Salehi, Mehdi Noroozi, Luca Morreale, Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Ramos, Abhinav Mehrotra

Main category: cs.CV

TL;DR: RFDM is a causal, efficient video editing model that edits variable-length videos frame by frame using residual flow diffusion, focusing on changes between consecutive frames for temporal efficiency.

DetailsMotivation: Current instructional video editing methods require fixed-length inputs and substantial compute, while autoregressive video generation for editing remains under-explored. There's a need for efficient variable-length video editing that scales independently of input length.

Method: Adapts 2D image-to-image diffusion model to video-to-video editing by conditioning each time step on previous prediction. Introduces Residual Flow Diffusion Model (RFDM) with new forward process formulation that predicts residual between target output and previous prediction, focusing denoising on changes between frames.
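
The summary gives only the key idea, regress the residual between the edited target frame and the previous prediction. The sketch below is a hedged guess at what a rectified-flow-style training pair on that residual could look like; the paper's actual forward-process formulation may differ:

```python
import torch

def rfdm_training_pair(target_t, pred_tm1):
    """Flow-matching pair on the frame residual, so denoising models the
    change between consecutive frames rather than the whole frame.

    target_t: ground-truth edited frame at step t; pred_tm1: prediction at t-1.
    """
    residual = target_t - pred_tm1                 # what actually changed this frame
    noise = torch.randn_like(residual)
    tau = torch.rand(residual.shape[0], 1, 1, 1)   # interpolation time in [0, 1]
    x_tau = (1 - tau) * noise + tau * residual     # point on the straight noise->residual path
    velocity = residual - noise                    # regression target for the network
    return x_tau, tau, velocity

x_tau, tau, v = rfdm_training_pair(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```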

Result: RFDM surpasses I2I-based methods and competes with fully spatiotemporal 3D V2V models, while matching compute of image models and scaling independently of input video length. Performs well on global/local style transfer and object removal tasks.

Conclusion: RFDM enables efficient variable-length video editing with computational efficiency comparable to image models, making video editing more accessible and scalable.

Abstract: Instructional video editing applies edits to an input video using only text prompts, enabling intuitive natural-language control. Despite rapid progress, most methods still require fixed-length inputs and substantial compute. Meanwhile, autoregressive video generation enables efficient variable-length synthesis, yet remains under-explored for video editing. We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing by conditioning the edit at time step t on the model’s prediction at t-1. To leverage videos’ temporal redundancy, we propose a new I2I diffusion forward process formulation that encourages the model to predict the residual between the target output and the previous prediction. We call this Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames. Moreover, we propose a new benchmark that better ranks state-of-the-art methods for editing tasks. Trained on paired video data for global/local style transfer and object removal, RFDM surpasses I2I-based methods and competes with fully spatiotemporal (3D) V2V models, while matching the compute of image models and scaling independently of input video length. More content can be found in: https://smsd75.github.io/RFDM_page/

[160] NanoFLUX: Distillation-Driven Compression of Large Text-to-Image Generation Models for Mobile Devices

Ruchika Chavhan, Malcolm Chadwick, Alberto Gil Couto Pimentel Ramos, Luca Morreale, Mehdi Noroozi, Abhinav Mehrotra

Main category: cs.CV

TL;DR: NanoFLUX is a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell for on-device generation, achieving 512x512 images in ~2.5 seconds on mobile devices through progressive compression, pruning, and novel distillation techniques.

DetailsMotivation: The increasing scale of state-of-the-art text-to-image diffusion models has created a gap with on-device solutions. The authors aim to bridge this gap by developing efficient models that maintain high generation quality while being deployable on mobile devices.

Method: Three main contributions: (1) Model compression via pruning redundant components in diffusion transformers (12B→2B), (2) ResNet-based token downsampling for latency reduction while preserving high-resolution processing, (3) Novel text encoder distillation using visual signals from early denoiser layers during sampling.

Result: NanoFLUX generates 512×512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation while maintaining quality comparable to much larger models.

Conclusion: The work successfully bridges the gap between state-of-the-art text-to-image models and on-device solutions through progressive compression and distillation techniques, enabling practical high-quality image generation on mobile devices.

Abstract: While large-scale text-to-image diffusion models continue to improve in visual quality, their increasing scale has widened the gap between state-of-the-art models and on-device solutions. To address this gap, we introduce NanoFLUX, a 2.4B text-to-image flow-matching model distilled from 17B FLUX.1-Schnell using a progressive compression pipeline designed to preserve generation quality. Our contributions include: (1) A model compression strategy driven by pruning redundant components in the diffusion transformer, reducing its size from 12B to 2B; (2) A ResNet-based token downsampling mechanism that reduces latency by allowing intermediate blocks to operate on lower-resolution tokens while preserving high-resolution processing elsewhere; (3) A novel text encoder distillation approach that leverages visual signals from early layers of the denoiser during sampling. Empirically, NanoFLUX generates 512 x 512 images in approximately 2.5 seconds on mobile devices, demonstrating the feasibility of high-quality on-device text-to-image generation.

[161] Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu

Main category: cs.CV

TL;DR: A training-free method called “prompt reinjection” addresses prompt forgetting in Multimodal Diffusion Transformers by reinjecting early-layer prompt representations into later layers, improving text-to-image generation quality.

DetailsMotivation: The paper identifies a "prompt forgetting phenomenon" in Multimodal Diffusion Transformers (MMDiTs) where the semantics of prompt representations in the text branch are progressively forgotten as depth increases during denoising, which negatively impacts text-to-image generation quality.

Method: The authors propose a training-free approach called “prompt reinjection” that reinjects prompt representations from early layers into later layers to alleviate the forgetting problem. This method is applied to three representative MMDiTs: SD3, SD3.5, and FLUX.1.
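
A toy, training-free illustration of the reinjection mechanic: cache the prompt token states after an early block and blend them back into a later block's output. A real MMDiT has a joint image branch and its own blending rule; the layer indices, blend weight, and plain encoder blocks below are assumptions:

```python
import torch, torch.nn as nn

blocks = nn.ModuleList([nn.TransformerEncoderLayer(64, 4, batch_first=True)
                        for _ in range(8)])
reinject_from, reinject_at, alpha = 1, 6, 0.5

text = torch.randn(2, 12, 64)   # prompt token states in the text branch
cached = None
for i, blk in enumerate(blocks):
    text = blk(text)
    if i == reinject_from:
        cached = text            # early-layer prompt representation, still semantically rich
    if i == reinject_at:
        text = alpha * cached + (1 - alpha) * text   # reinjection counters forgetting
```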

Result: Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.

Conclusion: Prompt reinjection effectively addresses the prompt forgetting problem in MMDiTs without requiring retraining, leading to improved text-to-image generation performance across multiple models and benchmarks.

Abstract: Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text–image generation quality.

[162] PANC: Prior-Aware Normalized Cut for Object Segmentation

Juan Gutiérrez, Victor Gutiérrez-Garcia, José Luis Blanco-Murillo

Main category: cs.CV

TL;DR: PANC: Weakly supervised spectral segmentation using minimal annotated visual tokens to produce stable, controllable object masks by augmenting affinity graphs with anchor priors.

DetailsMotivation: Current unsupervised segmentation methods are non-deterministic, sensitive to initialization, and lack user control. They often find only the most salient object, limiting applicability in domains with subtle intra-class differences or where dense labeling is costly.

Method: Augments TokenCut’s token-token affinity graph with a handful of annotated visual tokens as anchor nodes. By manipulating graph topology with these priors, biases the spectral eigenspace toward partitions consistent with annotations while preserving global grouping from self-supervised features.
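
A simplified sketch of prior-aware normalized cut: build a token affinity graph from self-supervised features, inject a few annotated tokens as hard priors, and bipartition with the Fiedler vector of the normalized Laplacian. For brevity the priors here edit edges directly, whereas the paper couples them through dedicated anchor nodes; features and annotated indices are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 32))                 # self-supervised token features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
W = np.clip(feats @ feats.T, 0, None)              # TokenCut-style token-token affinity

fg, bg = [3, 17, 42], [7, 88]                      # a handful of annotated tokens
for a in fg:                                       # tie foreground priors together...
    for b in fg:
        W[a, b] = W[b, a] = 1.0
for a in fg:                                       # ...and cut them off from background priors
    for b in bg:
        W[a, b] = W[b, a] = 0.0

d = W.sum(axis=1)
D_isqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(len(d)) - D_isqrt @ W @ D_isqrt     # normalized graph Laplacian
_, vecs = np.linalg.eigh(L_sym)
fiedler = D_isqrt @ vecs[:, 1]                     # relaxed normalized-cut indicator
mask = fiedler > np.median(fiedler)
if fiedler[fg].mean() < np.median(fiedler):        # orient the mask toward the fg priors
    mask = ~mask
```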

Result: Achieves SotA performance with only 5-30 annotations per dataset: 96.8% mIoU on CrackForest (+14.43%), 78.0% on CUB-200-2011 (+0.2%), 78.8% on HAM10000 (+0.37%). Also works well on multi-object benchmarks with explicit user-controllable semantic segmentation.

Conclusion: PANC provides a training-free, weakly supervised framework that significantly improves reproducibility, user control, and segmentation quality over unsupervised methods, especially valuable in domains where dense labels are costly or intra-class differences are subtle.

Abstract: Fully unsupervised segmentation pipelines naively seek the most salient object, if one is present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Starting from the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). Moreover, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.

[163] Seeing Beyond Redundancy: Task Complexity’s Role in Vision Token Specialization in VLLMs

Darryl Hannan, John Cooper, Dylan White, Yijing Watkins

Main category: cs.CV

TL;DR: This paper investigates why vision large language models (VLLMs) struggle with fine-grained visual information and spatial reasoning, exploring visual redundancy and compression mechanisms through synthetic benchmarks and fine-tuning experiments.

DetailsMotivation: VLLMs consistently underperform on tasks requiring fine-grained visual information or spatial reasoning compared to their linguistic capabilities. The paper aims to understand why this occurs, specifically investigating the hypothesis that visual redundancy causes high-level visual information to be spread across tokens while discarding specific details.

Method: The authors introduce a synthetic benchmark dataset specifically constructed to probe various visual features, along with metrics for measuring visual redundancy. They then explore fine-tuning VLLMs on complex visual tasks to understand how redundancy and compression change based on training data complexity.
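
The summary does not spell out the redundancy metrics, so the snippet below is only an illustrative proxy: it scores redundancy as the mean pairwise cosine similarity among an image's vision tokens, with higher values indicating that information is spread uniformly across tokens.

```python
import torch

def token_redundancy(vision_tokens: torch.Tensor) -> float:
    """vision_tokens: (N, D) visual tokens produced for one image."""
    t = torch.nn.functional.normalize(vision_tokens, dim=-1)
    sim = t @ t.T                                  # (N, N) cosine similarities
    n = sim.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()    # exclude self-similarity
    return (off_diag / (n * (n - 1))).item()       # mean pairwise similarity
```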

Result: The study finds a connection between task complexity and visual compression, suggesting that having sufficient high-complexity visual data is crucial for altering how VLLMs distribute their visual representations and improving performance on complex visual tasks.

Conclusion: Understanding visual redundancy and compression mechanisms in VLLMs is essential for improving their visual capabilities. The findings provide insights for training next-generation VLLMs to better handle fine-grained visual information and spatial reasoning tasks.

Abstract: Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to probe various visual features, along with a set of metrics for measuring visual redundancy, allowing us to better understand the nuances of their relationship. Then, we explore fine-tuning VLLMs on a number of complex visual tasks to better understand how redundancy and compression change based upon the complexity of the data that a model is trained on. We find that there is a connection between task complexity and visual compression, implying that having a sufficient ratio of high complexity visual data is crucial for altering the way that VLLMs distribute their visual representation and consequently improving their performance on complex visual tasks. We hope that this work will provide valuable insights for training the next generation of VLLMs.

[164] Reliable Mislabel Detection for Video Capsule Endoscopy Data

Julia Werner, Julius Oexle, Oliver Bause, Maxime Le Floch, Franz Brinkmann, Hannah Tolle, Jochen Hampe, Oliver Bringmann

Main category: cs.CV

TL;DR: A framework for detecting mislabeled data in medical imaging datasets, validated on Video Capsule Endoscopy datasets, with expert review confirming improved anomaly detection after cleaning.

DetailsMotivation: Medical imaging datasets face challenges with accurate annotations due to limited specialized physicians and ambiguous class boundaries, leading to mislabeled data that affects deep neural network performance.

Method: Proposed a framework for mislabel detection in medical datasets, validated on two largest publicly available Video Capsule Endoscopy datasets, with expert review by three gastroenterologists for re-annotation.

Result: The framework successfully detects incorrectly labeled data and results in improved anomaly detection performance after dataset cleaning compared to current baselines.

Conclusion: The proposed mislabel detection framework effectively addresses annotation challenges in medical imaging, improving dataset quality and downstream model performance for medical applications.

Abstract: The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem and introduce a framework for mislabel detection in medical datasets. This is validated on the two largest, publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and results in an improved anomaly detection performance after cleaning the datasets compared to current baselines.

[165] CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation

Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang, Zinan Lin, Xuefei Ning, Jiwen Yu, Pengfei Wan, Yu Wang, Xihui Liu

Main category: cs.CV

TL;DR: CineScene: A framework for cinematic video generation using implicit 3D-aware scene representations from static environment images, enabling camera-controlled video synthesis with dynamic subjects while preserving scene consistency.

DetailsMotivation: Cinematic video production requires expensive physical sets and control over scene-subject composition and camera movement. The paper aims to enable cost-effective cinematic video generation by decoupling scene context from subject dynamics.

Method: Uses implicit 3D-aware scene representation from multiple static environment images. Encodes scene images into visual representations via VGGT, then injects spatial priors into a pretrained text-to-video model through context concatenation. Includes random-shuffling strategy for input images during training and constructs a scene-decoupled dataset using Unreal Engine 5.

Result: Achieves state-of-the-art performance in scene-consistent cinematic video generation, handles large camera movements, and demonstrates generalization across diverse environments.

Conclusion: CineScene effectively addresses cinematic video generation with decoupled scene context, enabling camera-controlled synthesis with consistent scenes and dynamic subjects using implicit 3D-aware representations.

Abstract: Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need for constructing physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subject while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: By encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model by additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model’s robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, along with their camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and demonstrating generalization across diverse environments.

[166] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Ankan Deria, Komal Kumar, Adinath Madhavrao Dukre, Eran Segal, Salman Khan, Imran Razzak

Main category: cs.CV

TL;DR: MedMO is a medical multimodal large language model trained on domain-specific data with multi-stage training for cross-modal alignment, instruction tuning, and reinforcement learning with verifiable rewards for improved medical reasoning and spatial grounding.

DetailsMotivation: Current MLLMs have limited adoption in medicine due to gaps in domain coverage, modality alignment, and grounded reasoning. There's a need for specialized medical foundation models that can handle complex clinical scenarios with accurate spatial reasoning.

Method: Three-stage training: 1) Cross-modal pretraining to align visual encoders with medical language backbone; 2) Instruction tuning on multi-task supervision (captioning, VQA, report generation, retrieval, disease localization); 3) Reinforcement learning with verifiable rewards combining factuality checks and box-level GIoU reward.
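
The box-level GIoU underlying the verifiable reward is a standard quantity; a minimal implementation is below. How MedMO combines it with the factuality checks into a single reward is not detailed in the summary and is not modeled here.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2); assumes
    x2 > x1 and y2 > y1. Returns a value in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * \
            max(0.0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest enclosing box penalizes non-overlapping predictions.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```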

Result: Outperforms open-source medical MLLMs across multiple modalities: +13.7% average accuracy improvement on VQA benchmarks, +6.9% on text-based QA, significant gains in medical report generation, and +40.4 IoU improvement for grounding capability. Shows strong cross-modality generalization across radiology, ophthalmology, and pathology.

Conclusion: MedMO demonstrates superior performance in medical multimodal understanding with robust spatial reasoning and localization capabilities, addressing key limitations of existing MLLMs in medical applications through domain-specific training and verifiable reinforcement learning.

Abstract: Multimodal large language models (MLLMs) have rapidly advanced, yet their adoption in medicine remains limited by gaps in domain coverage, modality alignment, and grounded reasoning. In this work, we introduce MedMO, a medical foundation model built upon a generalized MLLM architecture and trained exclusively on large-scale, domain-specific data. MedMO follows a multi-stage training recipe: (i) cross-modal pretraining to align heterogeneous visual encoders with a medical language backbone; (ii) instruction tuning on multi-task supervision that spans captioning, VQA, report generation, retrieval, and grounded disease localization with bounding boxes; and (iii) reinforcement learning with verifiable rewards that combine factuality checks with a box-level GIoU reward to strengthen spatial grounding and step-by-step reasoning in complex clinical scenarios. MedMO consistently outperforms strong open-source medical MLLMs across multiple modalities and tasks. On VQA benchmarks, MedMO achieves an average accuracy improvement of +13.7% over the baseline and performs within 1.9% of the SOTA Fleming-VL. For text-based QA, it attains +6.9% over the baseline and +14.5% over Fleming-VL. In medical report generation, MedMO delivers significant gains in both semantic and clinical accuracy. Moreover, it exhibits strong grounding capability, achieving an IoU improvement of +40.4 over the baseline and +37.0% over Fleming-VL, underscoring its robust spatial reasoning and localization performance. Evaluations across radiology, ophthalmology, and pathology-microscopy confirm MedMO’s broad cross-modality generalization. We release two versions of MedMO: 4B and 8B. Project is available at https://genmilab.github.io/MedMO-Page

[167] 3D Object Detection for Autonomous Driving: A Survey

Rui Qian, Xin Lai, Xirong Li

Main category: cs.CV

TL;DR: A comprehensive survey of 3D object detection methods for autonomous driving, covering sensors, datasets, metrics, state-of-the-art methods, and quantitative comparisons with case studies on 15 representative approaches.

DetailsMotivation: To address the gap in structured knowledge about 3D object detection for autonomous driving, which is crucial for perception systems but still in its infancy despite growing literature. The challenges include visual appearance recovery without depth information, representation learning from occluded point clouds, and semantic alignment of heterogeneous multimodal features.

Method: Conducts a comprehensive survey encompassing sensors, datasets, performance metrics, and recent state-of-the-art detection methods. Provides quantitative comparisons and a case study on 15 selected representative methods involving runtime analysis, error analysis, and robustness analysis.

Result: Systematically organizes the growing knowledge in 3D object detection for autonomous driving, identifies pros and cons of existing methods, and provides comparative analysis of representative approaches.

Conclusion: 3D object detection remains challenging for autonomous driving, but the survey provides structured knowledge and identifies promising future directions after in-depth analysis of current methods.

Abstract: Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress we have made, we attribute challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignments over heterogeneous features from cross modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature has investigated this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap in a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study on fifteen selected representative methods is presented, involving runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.

[168] STAG: Structural Test-time Alignment of Gradients for Online Adaptation

Juhyeon Shin, Yujin Oh, Jonghyun Lee, Saehyung Lee, Minjun Park, Dongjun Lee, Uiwon Hwang, Sungroh Yoon

Main category: cs.CV

TL;DR: STAG is a lightweight plug-in for Test-Time Adaptation that uses classifier geometry to align gradients without extra backpropagation, improving performance on corrupted image classification and semantic segmentation.

DetailsMotivation: Test-Time Adaptation needs to adapt models using only unlabeled test data in real-time without source data access. Current methods often have computational overhead or limited effectiveness, especially in challenging online scenarios.

Method: STAG exploits classifier geometry by deriving class-wise structural anchors from classifier weights via self-structural entropy. During adaptation, it analytically computes predicted-class entropy gradient from forward-pass quantities and aligns it to corresponding anchors using cosine-similarity loss, requiring no additional backpropagation.
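
The closed-form piece is the key trick: since H = -Σ_k p_k log p_k with p = softmax(z), the entropy gradient with respect to the logits is dH/dz_k = -p_k(log p_k + H), so the feature-space gradient follows from forward-pass quantities alone. The sketch below assumes unit-normalized classifier rows as the structural anchors, which simplifies the paper's self-structural-entropy construction.

```python
import torch
import torch.nn.functional as F

def stag_loss(features, W):
    """features: (B, D) penultimate features; W: (C, D) linear classifier rows."""
    logits = features @ W.T
    p = logits.softmax(dim=-1)                            # (B, C)
    log_p = p.clamp_min(1e-12).log()
    H = -(p * log_p).sum(-1, keepdim=True)                # per-sample entropy
    g_logits = -p * (log_p + H)                           # analytic dH/dz, no backprop
    g_feat = g_logits @ W                                 # chain rule through z = f W^T
    anchors = F.normalize(W, dim=-1)[p.argmax(-1)]        # anchor of predicted class
    return (1.0 - F.cosine_similarity(g_feat, anchors, dim=-1)).mean()
```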

Result: STAG provides performance gains across corrupted image classification and continual semantic segmentation for strong TTA baselines on both CNN and Transformer architectures. It shows particularly large improvements under challenging online regimes like imbalanced label shifts, single-sample adaptation, mixed corruption streams, and long-horizon continual TTA.

Conclusion: STAG is a broadly applicable, lightweight plug-in enhancer for TTA that leverages classifier geometry for effective adaptation with minimal computational overhead, making it suitable for real-time applications.

Abstract: Test-Time Adaptation (TTA) adapts pre-trained models using only unlabeled test streams, requiring real-time inference and update without access to source data. We propose Structural Test-time Alignment of Gradients (STAG), a lightweight plug-in enhancer that exploits an always-available structural signal: the classifier’s intrinsic geometry. STAG derives class-wise structural anchors from classifier weights via self-structural entropy, and during adaptation analytically computes the predicted-class entropy gradient from forward-pass quantities, aligning it to the corresponding anchor with a cosine-similarity loss. This closed-form design incurs near-zero memory and latency overhead and requires no additional backpropagation beyond the underlying baseline. Across corrupted image classification and continual semantic segmentation, STAG provides broadly applicable performance gains for strong TTA baselines on both CNN and Transformer architectures regardless of the underlying normalization scheme, with particularly large gains under challenging online regimes such as imbalanced label shifts, single-sample adaptation, mixed corruption streams and long-horizon continual TTA.

[169] Extracting Manifold Information from Point Clouds

Patrick Guidotti

Main category: cs.CV

TL;DR: A kernel-based method for constructing signature functions to interpolate and analyze point clouds and manifolds of various dimensions and regularity, with applications to dimension estimation and curvature analysis.

DetailsMotivation: To develop a general method for analyzing and interpolating geometric data sets ranging from full-dimensional manifolds to point clouds, addressing the need for global approaches that don't require structured data organization and can handle various regularity conditions.

Method: Uses kernel-based approach where signature functions are constructed as combinations of translated kernels. Coefficients are obtained by solving Fredholm integral equations (or matrix equations for finite cases). The method has variational formulation with regularization for handling noisy data.
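
In its simplest finite-dimensional instance the construction reduces to one regularized linear solve; the sketch below assumes a Gaussian kernel and constrains the signature function to equal 1 on the sample points, with Tikhonov regularization standing in for the paper's variational counterpart. The paper's kernels and boundary conditions may differ.

```python
import numpy as np

def fit_signature(points, sigma=0.1, lam=1e-6):
    """points: (N, d) point cloud. Returns a callable signature function."""
    diff = points[:, None, :] - points[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=-1) / (2 * sigma**2))   # (N, N) kernel matrix
    # Matrix form of the Fredholm equation, regularized: (K + lam I) c = 1.
    c = np.linalg.solve(K + lam * np.eye(len(points)), np.ones(len(points)))

    def signature(x):
        """Evaluate at query points x: (M, d); level sets trace the surface."""
        d2 = np.sum((x[:, None, :] - points[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2 * sigma**2)) @ c

    return signature
```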

Result: The method can interpolate data sets with different regularity (analytic to Hölder continuous), estimate manifold dimension, and compute normals and curvatures. It works globally without requiring structured data organization and handles noisy data through regularization.

Conclusion: A versatile kernel-based framework for geometric data analysis that handles diverse data types and regularity conditions, providing tools for interpolation, dimension estimation, and curvature analysis in a unified approach.

Abstract: A kernel based method is proposed for the construction of signature (defining) functions of subsets of $\mathbb{R}^d$. The subsets can range from full dimensional manifolds (open subsets) to point clouds (a finite number of points) and include bounded (closed) smooth manifolds of any codimension. The interpolation and analysis of point clouds are the main application. Two extreme cases in terms of regularity are considered, where the data set is interpolated by an analytic surface, at the one extreme, and by a Hölder continuous surface, at the other. The signature function can be computed as a combination of translated kernels, the coefficients of which are the solution of a Fredholm integral equation (matrix equation in the finite dimensional case). Once it is obtained, it can be used to estimate the dimension as well as the normal and the curvatures of the interpolated manifold. The method is global and does not require the data set to be organized or structured in any particular way. It admits a variational formulation with a natural regularized counterpart, that proves useful in dealing with data sets corrupted by numerical error or noise. The underlying analytical structure of the approach is presented in general before it is applied to the case of point clouds.

[170] Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image Generation

Liang Shi, Jie Zhang, Shiguang Shan

Main category: cs.CV

TL;DR: APL trains learnable prompt prefixes for text-to-image diffusion models to anonymize facial identities, preventing deepfake generation while maintaining image quality.

DetailsMotivation: Address privacy and security concerns in text-to-image diffusion models by preventing generation of identifiable facial images that could be used for malicious deepfakes.

Method: Train learnable prompt prefixes that force diffusion models to generate anonymized facial identities even when prompted for specific individuals.
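
Mechanically this is prefix tuning: the only trainable parameters are a few token embeddings prepended to the prompt before it reaches the frozen text encoder or diffusion model. A minimal sketch follows, assuming a CLIP-style embedding dimension; APL's anonymization training objective is not modeled here.

```python
import torch
import torch.nn as nn

class PromptPrefix(nn.Module):
    def __init__(self, n_prefix: int = 8, embed_dim: int = 768):
        super().__init__()
        # The only trainable parameters: the prefix token embeddings.
        self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        """text_embeds: (B, L, D) token embeddings of the user prompt;
        returns (B, n_prefix + L, D) with the learned prefix prepended."""
        b = text_embeds.shape[0]
        return torch.cat([self.prefix.unsqueeze(0).expand(b, -1, -1),
                          text_embeds], dim=1)
```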

Result: APL successfully anonymizes specific individuals without compromising non-identity-specific image quality, with plug-and-play capability across different pretrained models.

Conclusion: APL provides effective privacy protection against deepfake risks in text-to-image generation while maintaining overall image generation quality.

Abstract: Text-to-image diffusion models, such as Stable Diffusion, generate highly realistic images from text descriptions. However, the generation of certain content at such high quality raises concerns. A prominent issue is the accurate depiction of identifiable facial images, which could lead to malicious deepfake generation and privacy violations. In this paper, we propose Anonymization Prompt Learning (APL) to address this problem. Specifically, we train a learnable prompt prefix for text-to-image diffusion models, which forces the model to generate anonymized facial identities, even when prompted to produce images of specific individuals. Extensive quantitative and qualitative experiments demonstrate the successful anonymization performance of APL, which anonymizes any specific individuals without compromising the quality of non-identity-specific image generation. Furthermore, we reveal the plug-and-play property of the learned prompt prefix, enabling its effective application across different pretrained text-to-image models for transferrable privacy and security protection against the risks of deepfakes.

[171] Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective

Zeen Song, Wenwen Qiang, Changwen Zheng, Hui Xiong, Gang Hua

Main category: cs.CV

TL;DR: BOD-VCL proposes a bi-level optimization framework for video contrastive learning that explicitly separates static and dynamic semantics using Koopman theory to address spurious correlations between them.

DetailsMotivation: Existing video contrastive learning methods fail to effectively learn both static and dynamic semantics due to spurious correlations between them in datasets, where learning only one type is sufficient to minimize contrastive loss, causing models to prioritize easier-to-learn semantics.

Method: Models videos as linear dynamical systems using Koopman theory, performs eigen-decomposition on the Koopman operator to separate time-variant and time-invariant components, enabling explicit separation of static and dynamic semantics for separate similarity modeling in contrastive learning.
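
A hedged sketch of the decomposition step: fit a linear transition operator to frame features by least squares (as in dynamic mode decomposition), then call modes with eigenvalues near 1 time-invariant (static) and the rest time-variant (dynamic). The tolerance and the feature space are assumptions; the paper's estimator may differ.

```python
import numpy as np

def koopman_split(frames, static_tol=0.05):
    """frames: (T, D) per-frame features of one video."""
    X, Y = frames[:-1].T, frames[1:].T              # (D, T-1) snapshot pairs
    A = Y @ np.linalg.pinv(X)                       # least-squares Koopman operator
    eigvals, eigvecs = np.linalg.eig(A)
    static = np.abs(eigvals - 1.0) < static_tol     # modes that do not evolve
    coords = np.linalg.pinv(eigvecs) @ frames.T     # modal coordinates (D, T)
    static_part = (eigvecs[:, static] @ coords[static]).real.T
    dynamic_part = frames - static_part             # remainder carries the dynamics
    return static_part, dynamic_part
```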

Result: Experimental results show significant improvements over existing video contrastive learning methods, with the approach being seamlessly integrable into existing V-CL frameworks.

Conclusion: By explicitly separating static and dynamic semantics through Koopman theory-based decomposition, BOD-VCL enables more effective learning of both types of features in video contrastive learning, overcoming limitations of existing methods.

Abstract: Video contrastive learning (V-CL) has emerged as a popular framework for unsupervised video representation learning, demonstrating strong results in tasks such as action classification and detection. Yet, to harness these benefits, it is critical for the learned representations to fully capture both static and dynamic semantics. However, our experiments show that existing V-CL methods fail to effectively learn either type of feature. Through a rigorous theoretical analysis based on the Structural Causal Model and gradient update, we find that in a given dataset, certain static semantics consistently co-occur with specific dynamic semantics. This phenomenon creates spurious correlations between static and dynamic semantics in the dataset. However, existing V-CL methods do not differentiate static and dynamic similarities when computing sample similarity. As a result, learning only one type of semantics is sufficient for the model to minimize the contrastive loss. Ultimately, this causes the V-CL pre-training process to prioritize learning the easier-to-learn semantics. To address this limitation, we propose Bi-level Optimization with Decoupling for Video Contrastive Learning (BOD-VCL). In BOD-VCL, we model videos as linear dynamical systems based on Koopman theory. In this system, all frame-to-frame transitions are represented by a linear Koopman operator. By performing eigen-decomposition on this operator, we can separate time-variant and time-invariant components of semantics, which allows us to explicitly separate the static and dynamic semantics in the video. By modeling static and dynamic similarity separately, both types of semantics can be fully exploited during the V-CL training process. BOD-VCL can be seamlessly integrated into existing V-CL frameworks, and experimental results highlight the significant improvements achieved by our method.

[172] High-Precision Edge Detection via Task-Adaptive Texture Handling and Ideal-Prior Guidance

Hao Shu

Main category: cs.CV

TL;DR: A comprehensive framework for high-precision image edge detection addressing architecture design, data supervision, and evaluation consistency through SDPED model with CSDB blocks, ideal-prior guidance, and fixed-pixel evaluation.

DetailsMotivation: Edge detection requires specialized architectures, reliable supervision, and rigorous evaluation criteria for accurate localization. Current approaches lack consistency across architectural design, data supervision, and evaluation methods.

Method: Proposes SDPED model with Cascaded Skipping Density Blocks (CSDB) adapted from image super-resolution architectures. Introduces ideal-prior guidance strategy using labels as noise-free samples to mitigate annotation noise. Adopts fixed-pixel criterion for resolution-independent evaluation.

Result: Consistent performance improvements on four benchmark datasets (BRIND, UDED, MDBD, BIPED2) with AP gains up to 22.5% on MDBD and 11.8% on BIPED2. Provides practical solution for high-precision edge detection.

Conclusion: Offers coherent solution for high-precision edge detection with insights applicable to precision-oriented modeling in low-level and soft-computing-based vision tasks.

Abstract: Image edge detection (ED) requires specialized architectures, reliable supervision, and rigorous evaluation criteria to ensure accurate localization. In this work, we present a framework for high-precision ED that jointly addresses architectural design, data supervision, and evaluation consistency. We propose SDPED, a compact ED model built upon Cascaded Skipping Density Blocks (CSDB), motivated by a task-adaptive architectural transfer from image super-resolution. By re-engineering texture-oriented structures for ED, SDPED effectively differentiates textures from edges while preserving fine spatial precision. Extensive experiments on four benchmark datasets (BRIND, UDED, MDBD, and BIPED2) demonstrate consistent performance improvements, particularly in Average Precision (AP), with gains of up to 22.5% on MDBD and 11.8% on BIPED2. In addition, we introduce an ideal-prior guidance strategy that incorporates noiseless data into training by treating labels as noise-free samples, providing a practical means to mitigate the subjectivity and noise inherent in human annotations. To enable fair and resolution-independent evaluation, we further adopt a fixed-pixel criterion for assessing localization accuracy. Overall, this work offers a coherent solution for high-precision ED and provides insights applicable to precision-oriented modeling in low-level and soft-computing-based vision tasks. Codes can be found on https://github.com/Hao-B-Shu/SDPED.

[173] Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from User’s Casual Sketches

Yongzhi Xu, Yonhon Ng, Yifu Wang, Inkyu Sa, Yunfei Duan, Zhenhong Sun, Yang Li, Pan Ji, Hongdong Li

Main category: cs.CV

TL;DR: Sketch2Scene: Automatic generation of interactive, playable 3D game scenes from casual hand-drawn sketches, using 2D diffusion models as intermediate guidance and procedural content generation engines for 3D assembly.

DetailsMotivation: To create an automated system for generating interactive 3D game scenes from casual user prompts like hand-drawn sketches, addressing the data deficiency problem in 3D scene generation.

Method: Uses pre-trained 2D denoising diffusion model to generate isometric 2D scene images from sketches, then segments images and extracts layouts using image understanding methods, finally feeding into procedural content generation engines like Unity/Unreal.

Result: Efficient generation of high-quality, interactive 3D game scenes with layouts that closely follow user intentions, seamlessly integrable into game development environments.

Conclusion: Proposed method successfully bridges sketch-based input to playable 3D scenes using 2D diffusion models as intermediate guidance, overcoming 3D data scarcity.

Abstract: 3D Content Generation is at the heart of many computer graphics applications, including video gaming, film-making, virtual and augmented reality, etc. This paper proposes a novel deep-learning based approach for automatically generating interactive and playable 3D game scenes, all from the user’s casual prompts such as a hand-drawn sketch. Sketch-based input offers a natural, and convenient way to convey the user’s design intention in the content creation process. To circumvent the data-deficient challenge in learning (i.e. the lack of large training data of 3D scenes), our method leverages a pre-trained 2D denoising diffusion model to generate a 2D image of the scene as the conceptual guidance. In this process, we adopt the isometric projection mode to factor out unknown camera poses while obtaining the scene layout. From the generated isometric image, we use a pre-trained image understanding method to segment the image into meaningful parts, such as off-ground objects, trees, and buildings, and extract the 2D scene layout. These segments and layouts are subsequently fed into a procedural content generation (PCG) engine, such as a 3D video game engine like Unity or Unreal, to create the 3D scene. The resulting 3D scene can be seamlessly integrated into a game development environment and is readily playable. Extensive tests demonstrate that our method can efficiently generate high-quality and interactive 3D game scenes with layouts that closely follow the user’s intention.

[174] T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation

Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji

Main category: cs.CV

TL;DR: T3-S2S is a training-free method for generating detailed multi-instance 2D concept art from sketches using ControlNet with three key modules for prompt balance, sketch feature emphasis, and contour refinement.

DetailsMotivation: 2D concept art generation for 3D scenes is challenging and manual. While text-to-image AI helps, it struggles with complex multi-instance scenes and structured terrain layout. There's a need for better sketch-to-scene generation that preserves detailed multi-instance arrangements.

Method: Training-free Triplet Tuning (T3-S2S) with three modules: 1) Prompt Balance ensures keyword representation, 2) Characteristic Priority emphasizes sketch features via TopK indices in feature channels, 3) Dense Tuning refines contour details in attention maps. Also includes feature-sharing strategy with dual prompt sets for layer-aware terrain views.
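
Of the three modules, the Characteristic Priority step is the most mechanical; a rough sketch is below, assuming "highlighting TopK indices" means scaling up the k most active feature channels. The ranking criterion and gain are placeholders, not the paper's exact rule.

```python
import torch

def topk_channel_emphasis(feat: torch.Tensor, k: int = 64, gain: float = 1.5):
    """feat: (B, C, H, W) feature map from the sketch-conditioned branch."""
    energy = feat.abs().mean(dim=(2, 3))              # (B, C) channel activation energy
    topk = energy.topk(k, dim=1).indices              # (B, k) most active channels
    scale = torch.ones_like(energy)
    scale.scatter_(1, topk, gain)                     # boost only the TopK channels
    return feat * scale[:, :, None, None]
```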

Result: The sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts, enabling structured terrain layout generation.

Conclusion: T3-S2S effectively generates detailed multi-instance concept art from sketches, addressing limitations of existing text-to-image methods for complex scene generation and terrain layout.

Abstract: 2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, we propose a Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation after reviewing the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T3-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts.

[175] Probing Perceptual Constancy in Large Vision-Language Models

Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Maijunxian Wang, Dezhi Luo, Hokin Deng

Main category: cs.CV

TL;DR: VLMs show varying perceptual constancy abilities across color, size, and shape domains, with shape constancy performance dissociated from color/size constancy.

DetailsMotivation: To evaluate whether current Vision Language Models possess perceptual constancy abilities similar to humans, which is crucial for stable visual understanding in dynamic environments.

Method: Comprehensive evaluation of 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. Experiments included single-image and video adaptations of classic cognitive tasks, plus novel in-the-wild tasks.

Result: Significant variability in VLM performance across domains, with shape constancy performance clearly dissociated from color and size constancy performance.

Conclusion: Current VLMs demonstrate inconsistent perceptual constancy abilities, revealing important limitations in their visual understanding capabilities that need to be addressed.

Abstract: Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explored such ability in current Vision Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.

[176] Enhancing Features in Long-tailed Data Using Large Vision Model

Pengxiao Han, Changkun Ye, Jinguang Tong, Cuicui Jiang, Jie Hong, Li Fang, Xuesong Li

Main category: cs.CV

TL;DR: Using large vision models (LVMs) to enhance long-tailed recognition without language data by fusing LVM features with baseline network features and applying prototype-based losses.

DetailsMotivation: Previous work relies on language-based foundation models (LLMs/LVLMs) for long-tailed recognition, but linguistic data isn't always available. The paper explores using pure vision models (LVMs/VFMs) to enhance features for long-tailed recognition without language information.

Method: Extract features from large vision models (LVMs) and fuse them with baseline network features in both map and latent spaces. Design several prototype-based losses in the latent space to better exploit the augmented features.
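
The paper designs several prototype-based losses; the sketch below shows one plausible instance (an assumption, not the paper's exact formulation): normalized latent features are pulled toward their own class prototype via a temperature-scaled cross-entropy over prototype similarities.

```python
import torch
import torch.nn.functional as F

def prototype_loss(latents, labels, prototypes, temperature=0.1):
    """latents: (B, D) fused features; labels: (B,) class ids;
    prototypes: (C, D) one prototype per class."""
    z = F.normalize(latents, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    logits = z @ protos.T / temperature          # similarity to every prototype
    return F.cross_entropy(logits, labels)       # pull toward own class prototype
```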

Result: Validated on ImageNet-LT and iNaturalist2018 benchmark datasets, showing improved performance for long-tailed recognition using vision-only models.

Conclusion: Large vision models can effectively enhance long-tailed recognition without requiring linguistic data, offering a viable alternative to language-based approaches.

Abstract: Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network’s map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.

[177] M4-SAR: A Multi-Resolution, Multi-Polarization, Multi-Scene, Multi-Source Dataset and Benchmark for Optical-SAR Fusion Object Detection

Chao Wang, Wei Lu, Xiang Li, Jian Yang, Lei Luo

Main category: cs.CV

TL;DR: First comprehensive optical-SAR fusion dataset (M4-SAR) with 112K aligned image pairs and 1M oriented instances, plus E2E-OSDet framework for multi-source object detection.

DetailsMotivation: Single-source remote sensing (optical or SAR) has limitations: optical images suffer from weather/lighting issues, while SAR has noise and limited semantics. Fusion can improve detection but lacks standardized datasets.

Method: Created M4-SAR dataset with multi-resolution, multi-polarization, multi-scene optical-SAR pairs. Proposed E2E-OSDet framework for end-to-end multi-source fusion detection to mitigate cross-domain discrepancies.

Result: Fusion improves mAP by 5.7% over single-source inputs, with significant gains in complex environments. Dataset enables standardized evaluation of six state-of-the-art fusion methods.

Conclusion: Optical-SAR fusion significantly boosts object detection in remote sensing. M4-SAR dataset and E2E-OSDet framework establish foundation for future multi-source fusion research.

Abstract: Single-source remote sensing object detection using optical or SAR images struggles in complex environments. Optical images offer rich textural details but are often affected by low-light, cloud-obscured, or low-resolution conditions, reducing the detection performance. SAR images are robust to weather, but suffer from speckle noise and limited semantic expressiveness. Optical and SAR images provide complementary advantages, and fusing them can significantly improve the detection accuracy. However, progress in this field is hindered by the lack of large-scale, standardized datasets. To address these challenges, we propose the first comprehensive dataset for optical-SAR fusion object detection, named Multi-resolution, Multi-polarization, Multi-scene, Multi-source SAR dataset (M4-SAR). It contains 112,184 precisely aligned image pairs and nearly one million labeled instances with arbitrary orientations, spanning six key categories. To enable standardized evaluation, we develop a unified benchmarking toolkit that integrates six state-of-the-art multi-source fusion methods. Furthermore, we propose E2E-OSDet, a novel end-to-end multi-source fusion detection framework that mitigates cross-domain discrepancies and establishes a robust baseline for future studies. Extensive experiments on M4-SAR demonstrate that fusing optical and SAR data can improve $mAP$ by 5.7% over single-source inputs, with particularly significant gains in complex environments. The dataset and code are publicly available at https://github.com/wchao0601/M4-SAR.

[178] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

Geonu Lee, Yujeong Oh, Geonhui Jang, Soyoung Lee, Jeonghyo Song, Sungmin Cha, YoungJoon Yoo

Main category: cs.CV

TL;DR: Continual-MEGA: A large-scale benchmark for continual learning in anomaly detection with a novel zero-shot generalization scenario and a unified baseline algorithm.

DetailsMotivation: Current continual learning benchmarks for anomaly detection don't reflect real-world deployment scenarios, lacking diversity and challenging zero-shot generalization settings

Method: Proposes Continual-MEGA benchmark with large diverse dataset (ContinualAD), novel zero-shot generalization scenario, and unified baseline algorithm for robust few-shot detection

Result: Existing methods have substantial room for improvement (especially pixel-level defect localization), proposed method outperforms prior approaches, and ContinualAD dataset enhances strong anomaly detection models

Conclusion: Continual-MEGA provides comprehensive benchmark for continual anomaly detection with real-world relevance, highlighting limitations of current methods and benefits of the new dataset

Abstract: In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, those not observed during continual adaptation. This setting poses a new problem setting that continual adaptation also enhances zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code in https://github.com/Continual-Mega/Continual-Mega.

[179] Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyung Kim, Seungryong Kim, Jin-Hwa Kim

Main category: cs.CV

TL;DR: A diffusion-based framework for aligned novel view image and geometry generation using warping-and-inpainting with cross-modal attention distillation.

DetailsMotivation: Prior methods require dense posed images or are limited to in-domain views; need for accurate alignment between generated images and geometry in novel view synthesis.

Method: Uses off-the-shelf geometry predictors for partial geometries, formulates novel-view synthesis as inpainting task for both image and geometry, employs cross-modal attention distillation between image and geometry diffusion branches, and uses proximity-based mesh conditioning with depth/normal cues.
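
The core of the distillation can be sketched as one attention call: the geometry branch aggregates its own values with the attention map computed on the image branch. Real implementations would hook the diffusion U-Net's attention layers; the shapes below are illustrative.

```python
import torch

def distilled_attention(q_img, k_img, v_geo):
    """q_img, k_img: (B, H, N, Dh) image-branch queries/keys;
    v_geo: (B, H, N, Dh) geometry-branch values."""
    attn = (q_img @ k_img.transpose(-2, -1)) / (q_img.shape[-1] ** 0.5)
    attn = attn.softmax(dim=-1)        # attention map from the image branch...
    return attn @ v_geo                # ...injected to aggregate geometry values
```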

Result: Achieves high-fidelity extrapolative view synthesis on both image and geometry across unseen scenes, competitive reconstruction quality under interpolation, and produces geometrically aligned colored point clouds for 3D completion.

Conclusion: The method enables accurate aligned novel view image and geometry generation through cross-modal attention distillation and warping-inpainting approach, advancing 3D scene understanding and completion.

Abstract: We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and filtering out erroneously predicted geometry so that it does not influence the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. Project page is available at https://cvlab-kaist.github.io/MoAI.

[180] WAFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang, Jia Deng

Main category: cs.CV

TL;DR: WAFT is a novel optical flow method that replaces traditional cost volumes with high-resolution warping, achieving state-of-the-art performance with lower memory usage and faster inference.

DetailsMotivation: The paper challenges the conventional wisdom that cost volumes are necessary for strong optical flow performance, seeking to develop a simpler, more efficient architecture with fewer inductive biases and custom designs.

Method: WAFT replaces the cost volume used in RAFT with high-resolution warping operations, creating a meta-architecture that is simpler, more flexible, and requires less memory while maintaining strong performance.
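
The warping operation that replaces the cost volume is standard backward warping by the current flow estimate; a self-contained sketch with grid_sample follows (how WAFT interleaves it with its update blocks is not modeled here).

```python
import torch
import torch.nn.functional as F

def warp(img2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """img2: (B, C, H, W) second frame (or its features); flow: (B, 2, H, W)
    in pixels. Returns img2 resampled so warp(img2, flow)[x] ~ img2[x + flow[x]]."""
    b, _, h, w = img2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(img2.device)   # (2, H, W)
    coords = base[None] + flow                                    # sample positions
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(img2, grid, align_corners=True)
```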

Result: WAFT achieves 1st place on Spring, Sintel, and KITTI benchmarks, shows best zero-shot generalization on KITTI, and is 1.3-4.1x faster than competitive methods while being more memory efficient.

Conclusion: WAFT demonstrates that cost volumes are not essential for state-of-the-art optical flow performance, offering a simpler, faster, and more efficient alternative that challenges established design paradigms.

Abstract: We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being 1.3-4.1x faster than existing methods that have competitive accuracy (e.g., 1.3x faster than Flowformer++, 4.1x faster than CCMR+). Code and model weights are available at https://github.com/princeton-vl/WAFT.

[181] XTransfer: Modality-Agnostic Few-Shot Model Transfer for Human Sensing at the Edge

Yu Zhang, Xi Zhang, Hualin Zhou, Xinyuan Chen, Shang Gao, Hong Jia, Jianfei Yang, Yuankai Qi, Tao Gu

Main category: cs.CV

TL;DR: XTransfer enables modality-agnostic few-shot transfer learning for human sensing on edge devices by repairing models to handle modality shifts and recombining layers efficiently.

DetailsMotivation: Deep learning for human sensing on edge systems faces challenges due to limited sensor data availability and resource constraints. Transferring pre-trained models to different sensing applications typically requires extensive data and computational resources, making it costly and limiting transferability.

Method: XTransfer uses two key techniques: (1) model repairing that safely mitigates modality shift by adapting pre-trained layers with only a few sensor samples, and (2) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models.

Result: XTransfer achieves state-of-the-art performance across diverse human sensing datasets spanning different modalities while significantly reducing costs of sensor data collection, model training, and edge deployment.

Conclusion: XTransfer provides a resource-efficient, modality-agnostic approach for few-shot transfer learning in human sensing applications on edge systems, overcoming data scarcity and computational constraints.

Abstract: Deep learning for human sensing on edge systems presents significant potential for smart applications. However, its training and development are hindered by the limited availability of sensor data and resource constraints of edge systems. While transferring pre-trained models to different sensing applications is promising, existing methods often require extensive sensor data and computational resources, resulting in high costs and limited transferability. In this paper, we propose XTransfer, a first-of-its-kind method enabling modality-agnostic, few-shot model transfer with resource-efficient design. XTransfer flexibly uses pre-trained models and transfers knowledge across different modalities by (i) model repairing that safely mitigates modality shift by adapting pre-trained layers with only a few sensor samples, and (ii) layer recombining that efficiently searches and recombines layers of interest from source models in a layer-wise manner to restructure models. We benchmark various baselines across diverse human sensing datasets spanning different modalities. The results show that XTransfer achieves state-of-the-art performance while significantly reducing the costs of sensor data collection, model training, and edge deployment.

[182] AR as an Evaluation Playground: Bridging Metrics and Visual Perception of Computer Vision Models

Ashkan Ganj, Yiqin Zhao, Tian Guo

Main category: cs.CV

TL;DR: ARCADE is an AR-based evaluation platform for computer vision models that addresses limitations of traditional metrics by enabling reproducible, human-centered evaluation through modular architecture supporting both quantitative metrics and visual perception studies.

DetailsMotivation: Traditional quantitative metrics for computer vision models often fail to capture real-world performance due to protocol inconsistencies and ground-truth noise, while visual perception studies require time-consuming end-to-end systems that are difficult to reproduce.

Method: ARCADE uses augmented reality (AR) with a modular architecture providing cross-platform data collection, pluggable model inference, and interactive AR tasks to support both metric-based and visual perception evaluation of CV models.

Result: User study with 15 participants and case studies on depth and lighting estimation showed ARCADE can reveal perceptual flaws in model quality missed by traditional metrics, while usability and performance evaluations demonstrated its flexibility as a reliable real-time platform.

Conclusion: ARCADE provides an effective AR-based platform for human-centered computer vision evaluation that bridges the gap between quantitative metrics and perceptual assessment, offering reproducible, flexible evaluation of CV models in real-world contexts.

Abstract: Quantitative metrics are central to evaluating computer vision (CV) models, but they often fail to capture real-world performance due to protocol inconsistencies and ground-truth noise. While visual perception studies can complement these metrics, they often require end-to-end systems that are time-consuming to implement and setups that are difficult to reproduce. We systematically summarize key challenges in evaluating CV models and present the design of ARCADE, an evaluation platform that leverages augmented reality (AR) to enable easy, reproducible, and human-centered CV evaluation. ARCADE uses a modular architecture that provides cross-platform data collection, pluggable model inference, and interactive AR tasks, supporting both metric and visual perception evaluation. We demonstrate ARCADE through a user study with 15 participants and case studies on two representative CV tasks, depth and lighting estimation, showing that ARCADE can reveal perceptual flaws in model quality that are often missed by traditional metrics. We also evaluate ARCADE’s usability and performance, showing its flexibility as a reliable real-time platform.

[183] Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Changin Choi, Wonseok Lee, Jungmin Ko, Wonjong Rhee

Main category: cs.CV

TL;DR: PMSR is a progressive multimodal search and reasoning framework for knowledge-intensive VQA that iteratively builds structured reasoning trajectories using dual-scope queries to heterogeneous knowledge bases, enabling controlled refinement and reduced error propagation.

DetailsMotivation: Existing multimodal retrieval-augmented generation approaches for knowledge-intensive VQA often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning, as they typically use single-pass frameworks.

Method: PMSR progressively constructs structured reasoning trajectories using dual-scope queries conditioned on both latest records and trajectory to retrieve diverse knowledge from heterogeneous knowledge bases, then synthesizes retrieved evidence into compact records via compositional reasoning.
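
Structurally the loop looks like the sketch below; `llm`, the two `retrieve_*` callables, and the stopping predicate are placeholders for the paper's components, not concrete APIs.

```python
def pmsr(image, question, llm, retrieve_visual, retrieve_textual, max_steps=5):
    trajectory = []                                   # compact reasoning records
    for _ in range(max_steps):
        latest = trajectory[-1] if trajectory else question
        # Dual-scope queries: one conditioned on the latest record,
        # one on the whole trajectory so far.
        q_local = llm.make_query(image, question, context=latest)
        q_global = llm.make_query(image, question, context=trajectory)
        evidence = retrieve_visual(q_local) + retrieve_textual(q_global)
        # Synthesize retrieved evidence into a new compact record.
        record = llm.compose(image, question, trajectory, evidence)
        trajectory.append(record)
        if llm.is_answerable(question, trajectory):   # controlled refinement stop
            break
    return llm.answer(image, question, trajectory)
```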

Result: Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

Conclusion: PMSR’s progressive framework with controlled iterative refinement supports more stable reasoning trajectories with reduced error propagation, advancing knowledge-intensive VQA performance.

Abstract: Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

[184] MATTER: Multiscale Attention for Registration Error Regression

Shipeng Liu, Ziliang Xiong, Khac-Hoang Ngo, Per-Erik Forssén

Main category: cs.CV

TL;DR: Point cloud registration quality validation using regression instead of classification for fine-grained error estimation, with multiscale feature extraction and attention-based aggregation.

DetailsMotivation: Existing methods treat point cloud registration quality validation as classification, assigning quality to few classes, which lacks fine-grained quantification needed for downstream tasks like SLAM and object tracking.

Method: Uses regression for PCR validation instead of classification, extends misalignment features with multiscale extraction and attention-based aggregation for robust error estimation, especially for point clouds with heterogeneous spatial densities.
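
A hedged sketch of the regression head: per-scale misalignment features are pooled with learned attention weights and mapped to a continuous error estimate. Layer sizes and the exact feature definition are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class AttnErrorRegressor(nn.Module):
    def __init__(self, feat_dim: int = 64, n_scales: int = 4):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)              # attention over scales
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, n_scales, feat_dim) per-scale misalignment features;
        returns (B,) continuous registration-error estimates."""
        w = self.score(feats).softmax(dim=1)             # (B, n_scales, 1)
        pooled = (w * feats).sum(dim=1)                  # attention-weighted pooling
        return self.head(pooled).squeeze(-1)             # regression, not classes
```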

Result: Achieves accurate and robust registration error estimation on diverse datasets, significantly improves mapping quality for given amount of re-registered frames compared to state-of-the-art classification-based methods.

Conclusion: Regression-based approach with multiscale features and attention aggregation provides superior fine-grained PCR quality validation for downstream applications.

Abstract: Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.
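
To make the regression-over-classification idea concrete, here is a minimal PyTorch sketch of an attention-aggregated multiscale regressor; the layer sizes, learned pooling query, and head design are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MultiscaleErrorRegressor(nn.Module):
    """Toy regressor in the spirit of MATTER: misalignment features from
    several scales are aggregated with attention, and a small head predicts
    a continuous registration error instead of a quality class."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, feat_dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, scale_feats):
        # scale_feats: (B, n_scales, feat_dim), one descriptor per scale
        q = self.query.expand(scale_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, scale_feats, scale_feats)   # attention aggregation
        return self.head(pooled.squeeze(1)).squeeze(-1)      # continuous error per pair

feats = torch.randn(8, 3, 64)                    # 8 registered pairs, 3 scales
print(MultiscaleErrorRegressor()(feats).shape)   # torch.Size([8])
```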

[185] Spectral Compressive Imaging via Chromaticity-Intensity Decomposition

Xiaodong Wang, Zijun He, Ping Wang, Lishun Wang, Yanan Hu, Xin Yuan

Main category: cs.CV

TL;DR: A chromaticity-intensity decomposition framework for hyperspectral image reconstruction in coded aperture snapshot spectral imaging that disentangles lighting-invariant reflectance from illumination effects.

DetailsMotivation: CASSI systems capture entangled spatial-spectral information with illumination-dependent radiance, making it difficult to recover intrinsic spectral reflectance that should be invariant to lighting conditions.

Method: Proposes CIDNet, a Chromaticity-Intensity Decomposition unfolding network that disentangles HSIs into spatially smooth intensity maps and spectrally variant chromaticity cubes using a hybrid spatial-spectral Transformer and degradation-aware noise estimation.

Result: Superior performance in both spectral and chromaticity fidelity on synthetic and real-world CASSI datasets compared to existing methods.

Conclusion: The chromaticity-intensity decomposition framework effectively addresses illumination dependency in hyperspectral imaging and enables recovery of lighting-invariant spectral reflectance.

Abstract: In coded aperture snapshot spectral imaging (CASSI), the captured measurement entangles spatial and spectral information, posing a severely ill-posed inverse problem for hyperspectral image (HSI) reconstruction. Moreover, the captured radiance inherently depends on scene illumination, making it difficult to recover the intrinsic spectral reflectance that remains invariant to lighting conditions. To address these challenges, we propose a chromaticity-intensity decomposition framework, which disentangles an HSI into a spatially smooth intensity map and a spectrally variant chromaticity cube. The chromaticity encodes lighting-invariant reflectance, enriched with high-frequency spatial details and local spectral sparsity. Building on this decomposition, we develop CIDNet, a Chromaticity-Intensity Decomposition unfolding network within a dual-camera CASSI system. CIDNet integrates a hybrid spatial-spectral Transformer tailored to reconstruct fine-grained and sparse spectral chromaticity and a degradation-aware, spatially-adaptive noise estimation module that captures anisotropic noise across iterative stages. Extensive experiments on both synthetic and real-world CASSI datasets demonstrate that our method achieves superior performance in both spectral and chromaticity fidelity. Code and models will be publicly available.
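
As a rough intuition for the decomposition (the paper learns it inside an unfolding network; the closed form below is only an illustrative assumption), an HSI can be split into a per-pixel intensity map and a lighting-normalized chromaticity cube:

```python
import numpy as np

def chroma_intensity_decompose(hsi, eps=1e-6):
    """Toy chromaticity-intensity split: a spatially smooth intensity map
    (per-pixel spectral mean) and a spectrally variant chromaticity cube.
    CIDNet's actual decomposition is learned; this is only an illustration."""
    intensity = hsi.mean(axis=-1, keepdims=True)   # (H, W, 1)
    chromaticity = hsi / (intensity + eps)         # (H, W, C), lighting-normalized
    return intensity, chromaticity

hsi = np.random.rand(64, 64, 28)                   # 28-band cube, as in CASSI setups
I, C = chroma_intensity_decompose(hsi)
print(np.allclose(I * C, hsi, atol=1e-4))          # the split is invertible
```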

[186] Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt

Main category: cs.CV

TL;DR: MoTIF is a transformer-based concept bottleneck framework for interpretable video classification that uses temporal self-attention on automatically discovered object and action concepts.

DetailsMotivation: Extending interpretable concept bottleneck models from images to videos is challenging due to difficulties in extracting concepts and modeling their temporal dynamics over time.

Method: Uses transformer architecture operating on sequences of temporally grounded concept activations with per-concept temporal self-attention; includes agentic concept discovery module to automatically extract object- and action-centric textual concepts from videos without manual supervision.

Result: Substantially narrows performance gap between interpretable and black-box video models across multiple benchmarks while maintaining faithful and temporally grounded concept explanations.

Conclusion: MoTIF successfully extends concept bottleneck interpretability to video domain with automatic concept discovery and temporal modeling, achieving competitive performance with interpretable explanations.

Abstract: Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce $\textbf{MoTIF}$ (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is an agentic concept discovery module to automatically extract object- and action-centric textual concepts from videos, yielding temporally expressive concept sets without manual supervision. Across multiple video benchmarks, this combination substantially narrows the performance gap between interpretable and black-box video models while maintaining faithful and temporally grounded concept explanations. Code available at $\href{https://github.com/patrick-knab/MoTIF}{github.com/patrick-knab/MoTIF}$.
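
The per-concept temporal self-attention can be sketched in a few lines of PyTorch. In the toy module below, each concept's activation time series attends over time independently of the other concepts; the dimensions and the mean pooling over time are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PerConceptTemporalAttention(nn.Module):
    """Sketch of MoTIF-style per-concept temporal self-attention: each
    concept's activation sequence is processed as its own time series."""

    def __init__(self, d_model=16, n_heads=2):
        super().__init__()
        self.proj = nn.Linear(1, d_model)   # lift scalar activations to d_model
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def forward(self, acts):
        # acts: (B, T, K) concept activations over T frames for K concepts
        B, T, K = acts.shape
        x = self.proj(acts.permute(0, 2, 1).reshape(B * K, T, 1))
        x, _ = self.attn(x, x, x)            # temporal attention, per concept
        x = x.mean(dim=1)                    # pool over time
        return self.score(x).view(B, K)      # per-concept evidence scores

acts = torch.rand(2, 30, 12)                 # 2 videos, 30 frames, 12 concepts
print(PerConceptTemporalAttention()(acts).shape)   # torch.Size([2, 12])
```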

[187] A Data-driven Typology of Vision Models from Integrated Representational Metrics

Jialin Wu, Shreya Saha, Yiqing Bo, Meenakshi Khosla

Main category: cs.CV

TL;DR: Analysis of vision model representations using multiple similarity metrics reveals that geometry and tuning carry family-specific signatures while linearly decodable information is broadly shared across model families.

DetailsMotivation: Large vision models have diverse architectures and training paradigms, but we lack principled methods to determine which aspects of their representations are shared across families versus which reflect distinctive computational strategies.

Method: Use multiple representational similarity metrics (capturing geometry, unit tuning, or linear decodability) and adapt Similarity Network Fusion (SNF) from multi-omics integration to combine complementary facets for comprehensive analysis.

Result: Geometry/tuning metrics yield strong family discrimination while linear decodability shows weaker separation. SNF achieves sharper family separation than individual metrics. Clustering reveals supervised ResNets/ViTs form distinct clusters, self-supervised models group together across architectures, and hybrid architectures cluster with masked autoencoders.

Conclusion: Emergent computational strategies shaped jointly by architecture and training objective define representational structure beyond surface design categories, providing a principled typology of vision models.

Abstract: Large vision models differ widely in architecture and training paradigm, yet we lack principled methods to determine which aspects of their representations are shared across families and which reflect distinctive computational strategies. We leverage a suite of representational similarity metrics, each capturing a different facet (geometry, unit tuning, or linear decodability), and assess family separability using multiple complementary measures. Metrics preserving geometry or tuning (e.g., RSA, Soft Matching) yield strong family discrimination, whereas flexible mappings such as Linear Predictivity show weaker separation. These findings indicate that geometry and tuning carry family-specific signatures, while linearly decodable information is more broadly shared. To integrate these complementary facets, we adapt Similarity Network Fusion (SNF), a method inspired by multi-omics integration. SNF achieves substantially sharper family separation than any individual metric and produces robust composite signatures. Clustering of the fused similarity matrix recovers both expected and surprising patterns: supervised ResNets and ViTs form distinct clusters, yet all self-supervised models group together across architectural boundaries. Hybrid architectures (ConvNeXt, Swin) cluster with masked autoencoders, suggesting convergence between architectural modernization and reconstruction-based training. This biology-inspired framework provides a principled typology of vision models, showing that emergent computational strategies, shaped jointly by architecture and training objective, define representational structure beyond surface design categories.
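
Similarity Network Fusion itself is a short iterative algorithm (Wang et al., 2014). The sketch below shows the cross-diffusion step on toy matrices; in the paper's setting, each input would be a model-by-model similarity matrix computed under one metric (RSA, Soft Matching, Linear Predictivity, and so on):

```python
import numpy as np

def snf(sim_mats, k=5, iters=10):
    """Minimal Similarity Network Fusion: each view's similarity matrix is
    diffused through its k-NN kernel against the average of the other views."""
    def row_norm(W):
        return W / W.sum(axis=1, keepdims=True)

    def knn_kernel(W, k):
        S = np.zeros_like(W)
        for i, row in enumerate(W):
            nn_idx = np.argsort(row)[-k:]     # keep the k strongest neighbours
            S[i, nn_idx] = row[nn_idx]
        return row_norm(S)

    P = [row_norm(W) for W in sim_mats]
    S = [knn_kernel(W, k) for W in sim_mats]
    for _ in range(iters):
        P_new = []
        for v in range(len(P)):
            others = [P[u] for u in range(len(P)) if u != v]
            P_avg = sum(others) / len(others)          # cross-diffusion step
            P_new.append(S[v] @ P_avg @ S[v].T)
        P = [row_norm(Q) for Q in P_new]
    return sum(P) / len(P)                             # fused similarity matrix

rng = np.random.default_rng(0)
mats = [(W + W.T) / 2 for W in (rng.random((20, 20)) for _ in range(3))]
print(snf(mats).shape)                                 # (20, 20)
```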

[188] An Evaluation of Hybrid Annotation Workflows on High-Ambiguity Spatiotemporal Video Footage

Juan Gutiérrez, Victor Gutiérrez, Ángel Mora, Silvia Rodriguez, José Luis Blanco

Main category: cs.CV

TL;DR: AI-assisted video annotation workflow using vision-language models reduces manual annotation time by 35% for most users while maintaining annotation quality.

DetailsMotivation: Manual video annotation is time-consuming but essential for high-quality datasets. Vision-language models can potentially accelerate this process by providing automatic pre-annotations to assist human annotators.

Method: Developed a human-in-the-loop labeling workflow using tuned encoder models to generate automatic pre-annotations for video footage. Conducted a single-iteration study with 18 volunteers to evaluate the workflow’s efficiency.

Result: The workflow reduced annotation time by 35% for 72% of participants. The study also established a benchmarking framework that quantifies trade-offs between algorithmic speed and human verification integrity.

Conclusion: AI-assisted annotation workflows using vision-language models can significantly reduce video annotation time while maintaining quality, providing a practical solution for creating dense temporal video datasets.

Abstract: Manual annotation remains the gold standard for high-quality, dense temporal video datasets, yet it is inherently time-consuming. Vision-language models can aid human annotators and expedite this process. We report on the impact of automatic Pre-Annotations from a tuned encoder on a Human-in-the-Loop labeling workflow for video footage. Quantitative analysis in a single-iteration study involving 18 volunteers demonstrates that our workflow reduced annotation time by 35% for the majority (72%) of the participants. Beyond efficiency, we provide a rigorous framework for benchmarking AI-assisted workflows that quantifies trade-offs between algorithmic speed and the integrity of human verification.

[189] CompEvent: Complex-valued Event-RGB Fusion for Low-light Video Enhancement and Deblurring

Mingchen Zhong, Xin Lu, Dong Li, Senyan Xu, Ruixuan Jiang, Xueyang Fu, Baocai Yin

Main category: cs.CV

TL;DR: CompEvent: A complex neural network framework for low-light video deblurring using holistic fusion of event camera data and RGB frames through complex-valued temporal alignment and space-frequency learning.

DetailsMotivation: Low-light video deblurring is challenging for applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. Existing staged fusion methods for event cameras are limited against combined low-light and motion blur degradations.

Method: Proposes CompEvent with two core components: 1) Complex Temporal Alignment GRU using complex-valued convolutions and GRU processing for temporal alignment and continuous fusion of video and event streams; 2) Complex Space-Frequency Learning module performing unified complex-valued signal processing in both spatial and frequency domains for deep fusion.

Result: CompEvent outperforms state-of-the-art methods in low-light video deblurring, achieving full-process spatiotemporal fusion and maximizing complementary learning between modalities.

Conclusion: The holistic representation capability of complex-valued neural networks enables effective full-process fusion of event data and RGB frames, significantly strengthening low-light video deblurring capability.

Abstract: Low-light video deblurring poses significant challenges in applications like nighttime surveillance and autonomous driving due to dim lighting and long exposures. While event cameras offer potential solutions with superior low-light sensitivity and high temporal resolution, existing fusion methods typically employ staged strategies, limiting their effectiveness against combined low-light and motion blur degradations. To overcome this, we propose CompEvent, a complex neural network framework enabling holistic full-process fusion of event data and RGB frames for enhanced joint restoration. CompEvent features two core components: 1) Complex Temporal Alignment GRU, which utilizes complex-valued convolutions and processes video and event streams iteratively via GRU to achieve temporal alignment and continuous fusion; and 2) Complex Space-Frequency Learning module, which performs unified complex-valued signal processing in both spatial and frequency domains, facilitating deep fusion through spatial structures and system-level characteristics. By leveraging the holistic representation capability of complex-valued neural networks, CompEvent achieves full-process spatiotemporal fusion, maximizes complementary learning between modalities, and significantly strengthens low-light video deblurring capability. Extensive experiments demonstrate that CompEvent outperforms SOTA methods in addressing this challenging task.
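
The building block behind this design is the complex-valued convolution, which couples two real-valued streams through the product rule (a + ib)(w_r + i w_i) = (a·w_r − b·w_i) + i(a·w_i + b·w_r). A minimal sketch follows; pairing frame features as the real part and event features as the imaginary part is an illustrative assumption, and the paper's exact arrangement may differ:

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex-valued convolution implemented with two real convolutions."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)  # real weights
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)  # imaginary weights

    def forward(self, real, imag):
        out_r = self.conv_r(real) - self.conv_i(imag)
        out_i = self.conv_r(imag) + self.conv_i(real)
        return out_r, out_i

frames = torch.randn(1, 8, 64, 64)   # RGB-derived features as the real part
events = torch.randn(1, 8, 64, 64)   # event-voxel features as the imaginary part
r, i = ComplexConv2d(8, 16)(frames, events)
print(r.shape, i.shape)              # torch.Size([1, 16, 64, 64]) twice
```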

[190] FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

Yiyi Cai, Yuhan Wu, Kunhang Li, You Zhou, Bo Zheng, Haiyang Liu

Main category: cs.CV

TL;DR: FloodDiffusion is a novel framework for text-driven streaming human motion generation using diffusion forcing with time-varying text prompts, achieving state-of-the-art performance with real-time latency.

DetailsMotivation: Existing methods for text-driven human motion generation use chunk-by-chunk or auto-regressive approaches with diffusion heads, which may not handle time-varying text prompts effectively for streaming applications. The authors aim to create a framework that generates seamless, text-aligned motion sequences with real-time latency.

Method: The authors adopt a diffusion forcing framework tailored for time-series generation under time-varying control events. They identify three key improvements over vanilla diffusion forcing: (1) using bi-directional attention instead of causal attention, (2) implementing a lower triangular time scheduler instead of random scheduling, and (3) introducing text conditioning in a continuous, time-varying way.

Result: FloodDiffusion achieves state-of-the-art performance on streaming motion generation, reaching an FID of 0.057 on the HumanML3D benchmark. The framework generates text-aligned, seamless motion sequences with real-time latency.

Conclusion: The diffusion forcing-based framework, when properly tailored with the proposed improvements, can effectively model real motion distributions and achieve superior performance for text-driven streaming human motion generation tasks.

Abstract: We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that to guarantee modeling the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying way. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available. https://shandaai.github.io/FloodDiffusion/
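
Improvement (ii), the lower triangular time scheduler, is easy to visualize: per-frame noise levels decrease over denoising steps with a fixed per-frame lag, so a frame is always noisier than its predecessors. A small sketch under an assumed linear parameterization (the paper's exact schedule may differ):

```python
import numpy as np

def lower_triangular_schedule(n_frames, steps_per_frame):
    """Per-(step, frame) noise levels: t=1 is pure noise, t=0 is clean.
    Frame f lags frame f-1 by one step, giving a triangular clean region."""
    total = n_frames + steps_per_frame   # steps until the last frame is clean
    t = np.zeros((total, n_frames))
    for s in range(total):
        for f in range(n_frames):
            t[s, f] = np.clip((f - s) / steps_per_frame + 1.0, 0.0, 1.0)
    return t

print(np.round(lower_triangular_schedule(n_frames=4, steps_per_frame=4), 2))
# Rows are denoising steps, columns are frames: the zero (clean) region grows
# as a lower triangle, matching streaming generation with time-varying prompts.
```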

[191] ReflexFlow: Rethinking Learning Objective for Exposure Bias Alleviation in Flow Matching

Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Shao-Lun Huang

Main category: cs.CV

TL;DR: ReflexFlow addresses exposure bias in Flow Matching methods through reflexive refinement with Anti-Drift Rectification and Frequency Compensation components.

DetailsMotivation: Flow Matching methods suffer from exposure bias due to discrepancies between training and inference, caused by the model's lack of generalization to biased inputs and insufficient low-frequency content capture during early denoising.

Method: Proposes ReflexFlow with two components: (1) Anti-Drift Rectification (ADR) that reflexively adjusts prediction targets for biased inputs using redesigned loss under training-time scheduled sampling, and (2) Frequency Compensation (FC) that reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias.

Result: Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving 35.65% reduction in FID on CelebA-64.

Conclusion: ReflexFlow is a model-agnostic, effective solution for exposure bias in Flow Matching that improves generation quality across datasets and is compatible with all Flow Matching frameworks.

Abstract: Despite tremendous recent progress, Flow Matching methods still suffer from exposure bias due to discrepancies in training and inference. This paper investigates the root causes of exposure bias in Flow Matching, including: (1) the model lacks generalization to biased inputs during training, and (2) insufficient low-frequency content captured during early denoising, leading to accumulated bias. Based on these insights, we propose ReflexFlow, a simple and effective reflexive refinement of the Flow Matching learning objective that dynamically corrects exposure bias. ReflexFlow consists of two components: (1) Anti-Drift Rectification (ADR), which reflexively adjusts prediction targets for biased inputs utilizing a redesigned loss under training-time scheduled sampling; and (2) Frequency Compensation (FC), which reflects on missing low-frequency components and compensates them by reweighting the loss using exposure bias. ReflexFlow is model-agnostic, compatible with all Flow Matching frameworks, and improves generation quality across datasets. Experiments on CIFAR-10, CelebA-64, and ImageNet-256 show that ReflexFlow outperforms prior approaches in mitigating exposure bias, achieving a 35.65% reduction in FID on CelebA-64.

[192] FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

Zhuokun Chen, Jianfei Cai, Bohan Zhuang

Main category: cs.CV

TL;DR: FlashBlock improves block diffusion efficiency by reusing stable attention outputs from tokens outside the current block, reducing KV cache access and attention computation without quality loss.

DetailsMotivation: Block diffusion improves inference efficiency for long-form content generation but still suffers from substantial overhead from repeatedly computing attention over growing KV caches. The authors identified cross-step redundancy in attention within blocks as an underexplored optimization opportunity.

Method: FlashBlock leverages the observation that attention outputs from tokens outside the current block remain stable across diffusion steps, while block-internal attention varies significantly. It implements a cached block-external attention mechanism that reuses stable attention outputs, reducing attention computation and KV cache access without modifying the diffusion process. The method is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy.

Result: Experiments on diffusion language models and video generation show up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality.

Conclusion: FlashBlock effectively addresses the efficiency bottleneck in block diffusion for long-form content generation by exploiting cross-step attention redundancy, offering significant speedups while maintaining generation quality.

Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
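
The mechanism can be sketched with the standard trick for merging two partial softmax attentions via their running max and denominator. In the toy code below, attention to the block-external KV cache is computed once and reused across diffusion steps; this is exact here because the queries are held fixed, whereas in the real setting the block's queries drift slightly between steps, and reusing the cached external output anyway is precisely the approximation FlashBlock exploits. Shapes are illustrative assumptions:

```python
import torch

def partial_attn(q, k, v):
    """Attention over a subset of keys; also returns the softmax statistics
    (running max, denominator) needed to merge with another subset exactly."""
    scores = q @ k.T / k.shape[-1] ** 0.5
    m = scores.max(dim=-1, keepdim=True).values
    w = torch.exp(scores - m)
    l = w.sum(dim=-1, keepdim=True)
    return (w @ v) / l, m, l

def merge(o1, m1, l1, o2, m2, l2):
    """Combine two partial attentions into attention over the union of keys."""
    m = torch.maximum(m1, m2)
    a1, a2 = l1 * torch.exp(m1 - m), l2 * torch.exp(m2 - m)
    return (o1 * a1 + o2 * a2) / (a1 + a2)

d = 32
q = torch.randn(8, d)                                    # queries of the current block
k_ext, v_ext = torch.randn(256, d), torch.randn(256, d)  # block-external KV cache
o_ext, m_ext, l_ext = partial_attn(q, k_ext, v_ext)      # computed ONCE, then cached

for step in range(4):                                    # diffusion steps in the block
    # Block-internal tokens change every step and are recomputed...
    k_int, v_int = torch.randn(8, d), torch.randn(8, d)
    o_int, m_int, l_int = partial_attn(q, k_int, v_int)
    # ...while the cached external output is merged back in.
    out = merge(o_ext, m_ext, l_ext, o_int, m_int, l_int)
print(out.shape)                                         # torch.Size([8, 32])
```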

[193] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding

Benjamin Beilharz, Thomas S. A. Wallis

Main category: cs.CV

TL;DR: MRD uses differentiable rendering to find 3D scenes that produce identical activations in vision models, probing their implicit 3D understanding through physically-grounded metamers.

DetailsMotivation: To understand what 3D scene properties vision models implicitly represent, moving beyond pixel-based analysis to physically-grounded scene parameter analysis.

Method: Uses physically based differentiable rendering to find 3D scene parameters (geometry, materials, lighting) that produce identical model activations despite physical differences, creating model metamers.

Result: Successfully finds 3D scenes with high activation similarity to targets, revealing varying visual reconstructions and showing which physical scene attributes models are sensitive/invariant to.

Conclusion: MRD enables analysis of how physical scene parameters drive model responses, advancing understanding of both computer and human vision representations.

Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models’ implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model’s sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
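
The metamer search itself is a plain gradient-descent loop through a differentiable renderer. The sketch below substitutes a trivial differentiable "renderer" and "model" (both hypothetical stand-ins, not the paper's physically based pipeline) to show the structure: optimize scene parameters until the activations match a target's, starting from a physically different scene:

```python
import torch

torch.manual_seed(0)
W_render = torch.randn(10, 192)   # stand-in renderer weights (10 scene params)
W_model = torch.randn(192, 32)    # stand-in vision-model weights

def render(scene_params):
    # Any differentiable map from scene parameters to pixels works here.
    return torch.tanh(scene_params @ W_render).view(1, 3, 8, 8)

def model(img):
    # Stand-in vision model producing an activation vector.
    return torch.relu(img.flatten() @ W_model)

target_params = torch.randn(1, 10)
target_act = model(render(target_params)).detach()

# Optimize a *different* scene until its activations match the target's.
params = torch.randn(1, 10, requires_grad=True)
opt = torch.optim.Adam([params], lr=0.05)
for step in range(300):
    loss = ((model(render(params)) - target_act) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# `params` is now a model metamer of `target_params`: near-identical
# activations despite physically different scene parameters.
print(loss.item(), (params - target_params).abs().mean().item())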

[194] Preserving Spectral Structure and Statistics in Diffusion Models

Baohua Yan, Jennifer Kava, Qingyuan Liu, Xuan Di

Main category: cs.CV

TL;DR: PreSS diffusion models use spectral space transformations to preserve informative structure during forward process, converging to informative Gaussian prior instead of white noise, reducing computational complexity while maintaining diversity.

DetailsMotivation: Standard diffusion models destroy data into non-informative white noise, making backward denoising computationally intensive. The authors aim to preserve spectral structure and statistics to provide a more principled starting point for generation.

Method: Proposes new forward/backward processes in spectral space where the forward process converges to an informative Gaussian prior $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$ instead of white noise. Preserves spectral structure and statistics (PreSS) by guiding spectral components toward the informative prior while keeping structural signals intact.

Result: Experimental results on CIFAR-10, CelebA and CelebA-HQ show significant reductions in computational complexity, improved visual diversity, less drift, and smoother diffusion process compared to pixel-based DMs.

Conclusion: PreSS provides a mathematically tractable spectral approach that preserves informative structure during diffusion, enabling more efficient and diverse image generation while maintaining high quality.

Abstract: Standard diffusion models (DMs) rely on the total destruction of data into non-informative white noise, forcing the backward process to denoise from a fully unstructured noise state. While ensuring diversity, this results in a cumbersome and computationally intensive image generation task. We address this challenge by proposing new forward and backward processes within a mathematically tractable spectral space. Unlike pixel-based DMs, our forward process converges towards an informative Gaussian prior $\mathcal{N}(\hat{\mu}, \hat{\Sigma})$ rather than white noise. Our method, termed Preserving Spectral Structure and Statistics (PreSS) in diffusion models, guides spectral components toward this informative prior while ensuring that corresponding structural signals remain intact at terminal time. This provides a principled starting point for the backward process, enabling high-quality image reconstruction that builds upon preserved spectral structure while maintaining high generative diversity. Experimental results on CIFAR-10, CelebA and CelebA-HQ demonstrate significant reductions in computational complexity, improved visual diversity, less drift, and a smoother diffusion process compared to pixel-based DMs.
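
A toy version of such a forward process: interpolate the signal toward the prior mean while injecting correlated (rather than white) noise, so that at terminal time the sample is drawn from an informative Gaussian. The interpolation schedule and the diagonal covariance factor below are assumptions for illustration, not the paper's parameterization:

```python
import numpy as np

def press_forward(x0, t, mu, L):
    """Forward step toward an informative prior N(mu, L @ L.T) at t=1,
    instead of zero-mean white noise; t ranges over [0, 1]."""
    a = 1.0 - t                                  # signal weight decays with t
    eps = np.random.randn(*x0.shape)
    return a * x0 + (1 - a) * mu + np.sqrt(1 - a**2) * (L @ eps)

d = 16
x0 = np.random.rand(d)
mu = np.full(d, 0.5)                             # dataset spectral mean (illustrative)
L = 0.1 * np.eye(d)                              # Cholesky factor of the covariance
x_T = press_forward(x0, t=1.0, mu=mu, L=L)
print(x_T.mean().round(2))                       # ~0.5: the prior mean, not zero
```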

[195] SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang

Main category: cs.CV

TL;DR: SyncAnyone: A two-stage diffusion-based framework for high-quality video dubbing with accurate lip-sync and visual fidelity

DetailsMotivation: Existing mask-based video dubbing methods disrupt spatiotemporal context, causing instability in facial structure and background consistency. Need for better motion modeling and visual fidelity in AI-powered video dubbing.

Method: Two-stage learning framework: Stage 1 trains a diffusion-based video transformer for masked mouth inpainting; Stage 2 applies mask-free tuning on synthetically generated pseudo-paired data to address mask-induced artifacts.

Result: Achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation in in-the-wild lip-syncing scenarios.

Conclusion: SyncAnyone overcomes limitations of mask-based approaches by combining accurate motion modeling with high visual fidelity through a novel two-stage framework.

Abstract: High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and randomly sampled audio. We further tune the Stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the-wild lip-syncing scenarios.

[196] DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

Main category: cs.CV

TL;DR: DarkEQA benchmark evaluates Vision Language Models’ performance on embodied question answering under realistic low-light conditions with physics-based visual degradations.

DetailsMotivation: Existing VLM benchmarks focus on ideal lighting conditions, but real-world embodied agents need to operate 24/7 including low-light environments like night or dark spaces. Current evaluation overlooks this crucial robustness requirement.

Method: Created DarkEQA benchmark with physics-based low-light degradations in linear RAW space, simulating illumination drop and sensor noise followed by ISP-inspired rendering. Evaluates EQA-relevant perceptual primitives under controlled multi-level degradations to isolate perception bottlenecks.

Result: Systematic evaluation reveals significant limitations of state-of-the-art VLMs and Low-Light Image Enhancement models when operating under challenging low-light conditions, demonstrating the benchmark’s utility for robustness analysis.

Conclusion: DarkEQA addresses a critical gap in VLM evaluation for embodied agents, providing a physically realistic benchmark for assessing robustness to low-light conditions essential for 24/7 operation.

Abstract: Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments–a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs’ limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
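
A simplified version of this degradation model fits in a few lines: linearize, scale the illumination, add signal-dependent shot noise and read noise, then re-render. The parameter values and the gamma-only ISP below are assumptions; the benchmark's pipeline is more complete:

```python
import numpy as np

def low_light_degrade(img_srgb, light_drop=0.05, read_noise=0.01, shot_gain=0.005):
    """Physics-inspired low-light degradation sketch: illumination drop plus
    shot and read noise applied in a linear RAW-like space."""
    linear = np.clip(img_srgb, 0, 1) ** 2.2      # undo display gamma -> linear
    dark = linear * light_drop                   # illumination drop
    shot = np.random.randn(*dark.shape) * np.sqrt(shot_gain * dark)   # signal-dependent
    read = np.random.randn(*dark.shape) * read_noise                  # signal-independent
    noisy = np.clip(dark + shot + read, 0, 1)
    return noisy ** (1 / 2.2)                    # ISP-inspired re-rendering

img = np.random.rand(64, 64, 3)
print(low_light_degrade(img).mean() < img.mean())   # True: output is darker
```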

[197] Adaptive Attention Distillation for Robust Few-Shot Segmentation under Environmental Perturbations

Qianyu Guo, Jingrong Wu, Jieji Ren, Weifeng Ge, Wenqiang Zhang

Main category: cs.CV

TL;DR: Proposes Adaptive Attention Distillation (AAD) for environment-robust few-shot segmentation, addressing challenges like motion blur and camouflaged targets in real-world scenarios.

DetailsMotivation: Existing few-shot segmentation models fail in real-world scenarios due to complex environmental factors like illumination, background, and camera viewpoint, limiting practical deployment.

Method: Introduces environment-robust FSS setting with ER-FSS benchmark covering 8 real-world datasets. Proposes Adaptive Attention Distillation (AAD) that repeatedly contrasts and distills key shared semantics between support and query images to derive class-specific attention.

Result: AAD improves mIoU by 3.3% - 8.5% across all datasets and settings, demonstrating superior performance and strong generalization in complex environments.

Conclusion: The proposed environment-robust FSS setting and AAD method effectively enhance model robustness for real-world deployment, with significant performance improvements across diverse challenging scenarios.

Abstract: Few-shot segmentation (FSS) aims to rapidly learn novel class concepts from limited examples to segment specific targets in unseen images, and has been widely applied in areas such as medical diagnosis and industrial inspection. However, existing studies largely overlook the complex environmental factors encountered in real-world scenarios, such as illumination, background, and camera viewpoint, which can substantially increase the difficulty of test images. As a result, models trained under laboratory conditions often fall short of practical deployment requirements. To bridge this gap, in this paper, an environment-robust FSS setting is introduced that explicitly incorporates challenging test cases arising from complex environments, such as motion blur, small objects, and camouflaged targets, to enhance model robustness under realistic, dynamic conditions. An environment-robust FSS benchmark (ER-FSS) is established, covering eight datasets across multiple real-world scenarios. In addition, an Adaptive Attention Distillation (AAD) method is proposed, which repeatedly contrasts and distills key shared semantics between known (support) and unknown (query) images to derive class-specific attention for novel categories. This strengthens the model’s ability to focus on the correct targets in complex environments, thereby improving environmental robustness. Comparative experiments show that AAD improves mIoU by 3.3%-8.5% across all datasets and settings, demonstrating superior performance and strong generalization. The source code and dataset are available at: https://github.com/guoqianyu-alberta/Adaptive-Attention-Distillation-for-FSS.

[198] SPARK: Scalable Real-Time Point Cloud Aggregation with Multi-View Self-Calibration

Chentian Sun

Main category: cs.CV

TL;DR: SPARK is a real-time multi-camera 3D reconstruction framework that jointly handles point cloud fusion and camera extrinsic uncertainty through self-calibration and confidence-driven fusion.

DetailsMotivation: Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups in real-time 3D reconstruction, which is crucial for 3D perception, immersive interaction, and robotics.

Method: SPARK consists of two main components: (1) a geometry-aware online extrinsic estimation module that leverages multi-view priors and enforces cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy that models depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies.

Result: Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating effectiveness and scalability for large-scale multi-camera 3D reconstruction.

Conclusion: SPARK provides a scalable solution for real-time multi-camera 3D reconstruction that addresses key challenges in extrinsic uncertainty and point cloud fusion, making it suitable for dynamic scenes and large camera setups.

Abstract: Real-time multi-camera 3D reconstruction is crucial for 3D perception, immersive interaction, and robotics. Existing methods struggle with multi-view fusion, camera extrinsic uncertainty, and scalability for large camera setups. We propose SPARK, a self-calibrating real-time multi-camera point cloud reconstruction framework that jointly handles point cloud fusion and extrinsic uncertainty. SPARK consists of: (1) a geometry-aware online extrinsic estimation module leveraging multi-view priors and enforcing cross-view and temporal consistency for stable self-calibration, and (2) a confidence-driven point cloud fusion strategy modeling depth reliability and visibility at pixel and point levels to suppress noise and view-dependent inconsistencies. By performing frame-wise fusion without accumulation, SPARK produces stable point clouds in dynamic scenes while scaling linearly with the number of cameras. Extensive experiments on real-world multi-camera systems show that SPARK outperforms existing approaches in extrinsic accuracy, geometric consistency, temporal stability, and real-time performance, demonstrating its effectiveness and scalability for large-scale multi-camera 3D reconstruction.

[199] An Example for Domain Adaptation Using CycleGAN

Yanhua Zhao

Main category: cs.CV

TL;DR: CycleGAN applied to medical image translation from microscopy to pseudo H&E stained histopathology images

DetailsMotivation: To demonstrate CycleGAN's effectiveness in medical domain adaptation, specifically for unpaired image-to-image translation between microscopy and histopathology images

Method: CycleGAN architecture for unpaired image-to-image translation, using cycle-consistency loss and adversarial training to translate microscopy images to pseudo H&E stained histopathology images

Result: Successful translation of microscopy images to pseudo H&E stained histopathology images using CycleGAN

Conclusion: CycleGAN shows promise for medical domain adaptation tasks, particularly for unpaired image translation in histopathology

Abstract: Cycle-Consistent Adversarial Network (CycleGAN) is very promising in domain adaptation. In this report, an example in the medical domain is explained. We present the structure of a CycleGAN model for unpaired image-to-image translation from microscopy to pseudo H&E-stained histopathology images.
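
The mechanism that makes unpaired translation work is the cycle-consistency loss: translating to the other domain and back should recover the input. A minimal sketch with toy one-layer generators (a real CycleGAN adds adversarial losses on both domains and deeper generators):

```python
import torch
import torch.nn as nn

# G: microscopy -> pseudo H&E, F: pseudo H&E -> microscopy (toy generators).
G = nn.Conv2d(3, 3, 3, padding=1)
F = nn.Conv2d(3, 3, 3, padding=1)
l1 = nn.L1Loss()

real_a = torch.rand(4, 3, 64, 64)   # unpaired microscopy batch
real_b = torch.rand(4, 3, 64, 64)   # unpaired H&E batch

# Round trips in both directions should reconstruct the originals.
cycle_loss = l1(F(G(real_a)), real_a) + l1(G(F(real_b)), real_b)
print(cycle_loss.item())            # drives G and F to be mutual inverses
```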

[200] Synthetic Data Guided Feature Selection for Robust Activity Recognition in Older Adults

Shuhao Que, Dieuwke van Dartel, Ilse Heeringa, Han Hegeman, Miriam Vollenbroek-Hutten, Ying Wang

Main category: cs.CV

TL;DR: Developed a robust human activity recognition system for older adults using accelerometers and synthetic data to improve physical activity monitoring during hip fracture rehabilitation.

DetailsMotivation: Physical activity monitoring during hip fracture rehabilitation is crucial but rarely quantified in clinical practice. Existing wearable systems developed for middle-aged adults perform unreliably in older adults with slower, more variable gait patterns.

Method: Used 24 healthy older adults (80+ years) performing activities of daily living while wearing two accelerometers (lower back and anterior upper thigh). Developed a feature intervention model (FIM) with synthetic data guidance to improve generalization across participants.

Result: Achieved reliable activity recognition with mean F1-scores: 0.896 (walking), 0.927 (standing), 0.997 (sitting), 0.937 (lying down), and 0.816 (postural transfers). FIM significantly improved postural transfer detection compared to control model without synthetic data.

Conclusion: Demonstrates feasibility of robust activity recognition in older adults. Further validation needed in hip fracture patient populations to assess clinical utility of the proposed monitoring system.

Abstract: Physical activity during hip fracture rehabilitation is essential for mitigating long-term functional decline in geriatric patients. However, it is rarely quantified in clinical practice. Existing continuous monitoring systems with commercially available wearable activity trackers are typically developed in middle-aged adults and therefore perform unreliably in older adults with slower and more variable gait patterns. This study aimed to develop a robust human activity recognition (HAR) system to improve continuous physical activity recognition in the context of hip fracture rehabilitation. 24 healthy older adults aged over 80 years were included to perform activities of daily living (walking, standing, sitting, lying down, and postural transfers) under simulated free-living conditions for 75 minutes while wearing two accelerometers positioned on the lower back and anterior upper thigh. Model robustness was evaluated using leave-one-subject-out cross-validation. The synthetic data demonstrated potential to improve generalization across participants. The resulting feature intervention model (FIM), aided by synthetic data guidance, achieved reliable activity recognition with mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers. Compared with a control condition model without synthetic data, the FIM significantly improved the postural transfer detection, i.e., an activity class of high clinical relevance that is often overlooked in existing HAR literature. In conclusion, these preliminary results demonstrate the feasibility of robust activity recognition in older adults. Further validation in hip fracture patient populations is required to assess the clinical utility of the proposed monitoring system.
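
The leave-one-subject-out protocol used here is worth spelling out, since per-window random splits would leak subject identity and inflate scores. A minimal sketch with synthetic stand-in features (the classifier choice is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Every fold holds out ALL windows from one participant, so scores reflect
# generalization to unseen older adults. Data below is a synthetic stand-in.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 12))            # 12 accelerometer features per window
y = rng.integers(0, 5, size=240)          # 5 activity classes
subjects = np.repeat(np.arange(24), 10)   # 24 participants, 10 windows each

scores = cross_val_score(LogisticRegression(max_iter=500), X, y,
                         groups=subjects, cv=LeaveOneGroupOut())
print(scores.mean())                      # mean per-held-out-subject accuracy
```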

[201] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

Main category: cs.CV

TL;DR: Causal Forcing bridges the architectural gap in video diffusion model distillation by using an autoregressive teacher for ODE initialization instead of bidirectional teacher, improving real-time interactive video generation performance.

DetailsMotivation: Current methods for real-time interactive video generation distill bidirectional video diffusion models into few-step autoregressive models, but face an architectural gap when replacing full attention with causal attention. Existing approaches don't bridge this gap theoretically, and ODE distillation from bidirectional teachers violates frame-level injectivity conditions, degrading performance.

Method: Proposes Causal Forcing that uses an autoregressive teacher for ODE initialization instead of bidirectional teacher, theoretically bridging the architectural gap. This ensures proper flow map recovery and avoids the conditional-expectation solution that degrades performance.

Result: Outperforms all baselines across all metrics, surpassing state-of-the-art Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.

Conclusion: Causal Forcing effectively bridges the architectural gap in video diffusion model distillation, enabling better real-time interactive video generation by using autoregressive teachers for proper ODE initialization.

Abstract: To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher’s flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: \href{https://thu-ml.github.io/CausalForcing.github.io/}{https://thu-ml.github.io/CausalForcing.github.io/}

[202] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Zhuoran Yang, Yanyong Zhang

Main category: cs.CV

TL;DR: ConsisDrive is an identity-preserving driving world model that addresses identity drift in generated driving videos by enforcing instance-level temporal consistency through instance-masked attention and loss mechanisms.

DetailsMotivation: Current world models for autonomous driving suffer from identity drift - where objects change appearance or category across frames due to lack of instance-level temporal constraints, limiting their usefulness for generating realistic training data.

Method: Two key components: (1) Instance-Masked Attention that uses identity and trajectory masks to restrict visual token interactions to corresponding instances across spatial-temporal dimensions; (2) Instance-Masked Loss that adaptively emphasizes foreground regions with probabilistic instance masking to reduce background noise while maintaining scene fidelity.

Result: Achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.

Conclusion: ConsisDrive effectively addresses identity drift in driving world models through instance-level temporal consistency mechanisms, enabling higher quality generated data for autonomous driving applications.

Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
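
The core of Instance-Masked Attention is a hard attention mask built from instance ids: a token may only attend to tokens of the same instance across space and time. A single-head, unbatched sketch (shapes and the id convention are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def instance_masked_attention(q, k, v, inst_ids):
    """Attention restricted to same-instance tokens; every token can at
    least attend to itself, so no row is fully masked."""
    scores = q @ k.T / q.shape[-1] ** 0.5                 # (T, T)
    same_inst = inst_ids[:, None] == inst_ids[None, :]    # (T, T) bool mask
    scores = scores.masked_fill(~same_inst, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

T, d = 6, 8
q, k, v = (torch.randn(T, d) for _ in range(3))
inst_ids = torch.tensor([0, 1, 1, 2, 2, 0])               # per-token instance labels
print(instance_masked_attention(q, k, v, inst_ids).shape) # torch.Size([6, 8])
```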

[203] Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Haichao Jiang, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

Main category: cs.CV

TL;DR: Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms for referring video object segmentation, achieving state-of-the-art performance without fine-tuning.

DetailsMotivation: Current RVOS methods rely heavily on supervised fine-tuning of MLLMs, which is data-dependent and not scalable with rapidly evolving MLLMs. Zero-shot approaches offer flexibility but underperform due to simplistic workflow designs.

Method: Proposes Refer-Agent: a multi-agent system with reasoning-reflection mechanisms. Features: 1) Coarse-to-Fine frame selection for diversity and textual relevance, 2) Dynamic Focus Layout for adaptive visual attention, 3) Chain-of-Reflection mechanism with Questioner-Responder pair for self-verification and feedback generation.

Result: Extensive experiments on five challenging benchmarks show Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. The system is flexible and enables fast integration of new MLLMs without fine-tuning costs.

Conclusion: Refer-Agent provides an effective alternative to SFT-based RVOS methods, achieving superior performance through collaborative multi-agent reasoning and reflection mechanisms while maintaining flexibility for MLLM integration.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose \textbf{Refer-Agent}, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent’s visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released at https://github.com/iSEE-Laboratory/Refer-Agent.

[204] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures

Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar

Main category: cs.CV

TL;DR: EB-JEPA is an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures that predict in representation space rather than pixel space, with applications to images, video, and action-conditioned world models.

DetailsMotivation: To provide accessible, modular implementations of JEPA for research and education, demonstrating how representation learning techniques can transfer from images to video and action-conditioned world models while avoiding pitfalls of generative modeling.

Method: Uses Joint-Embedding Predictive Architectures that learn to predict in representation space rather than pixel space. Provides modular implementations for image-level self-supervised learning, video temporal modeling, and action-conditioned world models, all designed for single-GPU training.

Result: Achieves 91% accuracy on CIFAR-10 probing, demonstrates multi-step prediction on Moving MNIST, and achieves 97% planning success rate on Two Rooms navigation task. Comprehensive ablations show importance of regularization components.

Conclusion: EB-JEPA successfully demonstrates transferable representation learning from images to video and world models, making energy-based self-supervised learning accessible while showing JEPA’s effectiveness across modalities.

Abstract: We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEPA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
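
A JEPA training step in miniature: encode a context view, predict the representation of a target view produced by an EMA-updated target encoder, and regularize the predicted embeddings against collapse. The tiny MLPs, variance hinge, and coefficients below are assumptions in the spirit of the library, not its actual code:

```python
import copy
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
target_enc = copy.deepcopy(enc)          # updated by EMA, never by gradients
pred = nn.Linear(32, 32)                 # predictor in representation space
opt = torch.optim.Adam(list(enc.parameters()) + list(pred.parameters()), lr=1e-3)

context, target = torch.randn(16, 128), torch.randn(16, 128)  # two views
with torch.no_grad():
    z_target = target_enc(target)
z_pred = pred(enc(context))

pred_loss = ((z_pred - z_target) ** 2).mean()   # predict in latent, not pixel, space
std = z_pred.std(dim=0)
var_loss = torch.relu(1.0 - std).mean()         # keep per-dim variance from collapsing
loss = pred_loss + 0.1 * var_loss
opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                           # EMA update of the target encoder
    for p_t, p in zip(target_enc.parameters(), enc.parameters()):
        p_t.mul_(0.99).add_(0.01 * p)
print(loss.item())
```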

[205] DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang, Zhengyu Li, Kwong Weng Loh, Mingxi Xu, Qi Wang, Zhengyu Wen, Xiaoyu He, Wei Zhao, Kehong Gong, Mingyuan Zhang

Main category: cs.CV

TL;DR: DiMo is a discrete diffusion framework that unifies bidirectional text-motion understanding and generation using iterative masked token refinement, supporting T2M, M2T, and M2M tasks in a single model.

DetailsMotivation: Existing masked modeling motion generation methods focus mainly on text-to-motion, lacking bidirectional understanding and generation capabilities. The authors aim to create a unified framework that can handle both text-to-motion and motion-to-text tasks simultaneously.

Method: DiMo uses a discrete diffusion-style framework with iterative masked token refinement. It employs residual vector quantization (RVQ) for better motion token fidelity and Group Relative Policy Optimization (GRPO) for enhanced alignment and controllability. The approach unifies Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model.

Result: Experiments on HumanML3D and KIT-ML datasets show strong motion quality and competitive bidirectional understanding. The model also demonstrates capabilities in text-free motion completion, text-guided motion prediction, and motion caption correction without architectural changes.

Conclusion: DiMo successfully extends masked modeling to bidirectional text-motion understanding and generation, offering a unified framework that supports multiple tasks with quality-latency trade-offs during inference.

Abstract: Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text–motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.

[206] DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

Sijia Chen, Lijuan Ma, Yanqiu Yu, En Yu, Liman Liu, Wenbing Tao

Main category: cs.CV

TL;DR: Proposes RGBD Referring Multi-Object Tracking (DRMOT) task that fuses RGB, Depth, and Language modalities for 3D-aware tracking, introduces DRSet dataset, and presents DRTrack framework with MLLM guidance.

Motivation: Existing Referring Multi-Object Tracking models rely solely on 2D RGB data, making it challenging to handle complex spatial semantics and maintain reliable identities under occlusion due to lack of explicit 3D spatial information.

Method: Introduces DRMOT task requiring RGB-Depth-Language fusion, constructs DRSet dataset with RGB images, depth maps, and language descriptions, and proposes DRTrack framework using MLLM-guided depth-referring tracking with depth-aware target grounding and trajectory association.

Result: Extensive experiments on DRSet dataset demonstrate the effectiveness of the DRTrack framework for 3D-aware tracking with spatial-semantic grounding.

Conclusion: RGBD Referring Multi-Object Tracking addresses limitations of 2D-only approaches by incorporating depth information, enabling better handling of complex spatial semantics and occlusion scenarios through multimodal fusion.

Abstract: Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models’ spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.

[207] DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang

Main category: cs.CV

TL;DR: A novel distillation-compatible learnable feature caching mechanism for accelerating video diffusion models, achieving 11.8× speedup while preserving quality.

Motivation: Current video diffusion model acceleration methods have limitations: feature caching causes semantic/detail loss with compression, step-distillation degrades severely in video generation, and combining them fails due to sparser sampling steps.

Method: Introduces a learnable neural predictor instead of training-free heuristics for feature caching, and proposes Restricted MeanFlow approach for stable lossless distillation on large-scale video models.
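
The summary leaves the predictor's form open; one plausible reading is a small residual MLP that maps a cached feature and the elapsed timestep gap to the current feature, as in this hypothetical sketch (the paper's actual architecture may differ).

```python
import torch
import torch.nn as nn

class CachePredictor(nn.Module):
    """Learnable feature caching: instead of reusing a stale cached feature
    verbatim, predict how it evolves over the skipped diffusion steps."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, cached_feat, dt):
        # dt: normalized gap between the cached step and the current step
        dt = torch.as_tensor(dt, dtype=cached_feat.dtype)
        dt = dt.expand(*cached_feat.shape[:-1], 1)
        return cached_feat + self.net(torch.cat([cached_feat, dt], dim=-1))
```

At cached timesteps the full transformer block would run as usual; in between, the predictor stands in for it at negligible cost.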

Result: Achieves 11.8× acceleration while preserving generation quality, outperforming existing methods that suffer from quality degradation.

Conclusion: The proposed distillation-compatible learnable feature caching mechanism successfully pushes acceleration boundaries for video diffusion models without quality loss.

Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably suffers semantic and detail loss under further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to step-distilled models, due to the sparser sampling steps. This paper introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code will be made publicly available soon.

[208] Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains

Ben Isselmann, Dilara Göksu, Andreas Weinmann

Main category: cs.CV

TL;DR: DINO-pretrained Vision Transformers show strong cross-domain transferability for protein localization in microscopy, with domain-relevant pretraining (Human Protein Atlas) outperforming general ImageNet pretraining and even direct OpenCell training.

Motivation: Microscopy datasets are often too small for robust deep learning, and while SSL can help with large unlabeled data, it's unclear how well SSL representations transfer across different microscopy domains with varying staining protocols and channel configurations.

Method: Investigates cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on OpenCell dataset. Uses three DINO backbones pretrained on: 1) ImageNet-1k, 2) Human Protein Atlas (HPA), and 3) OpenCell. Generates image embeddings and evaluates by training supervised classification head on OpenCell labels.
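
The evaluation protocol is a standard linear probe on frozen features; here is a minimal sketch (the paper's exact classification head and hyperparameters are not specified in the summary, so logistic regression is an assumption).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(train_emb, train_y, test_emb, test_y):
    """Train a supervised head on frozen DINO embeddings and report macro F1,
    the metric used in the paper. embeddings: (N, D), labels: (N,)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_y)
    return f1_score(test_y, clf.predict(test_emb), average="macro")
```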

Result: All pretrained models transfer well. HPA-pretrained model (microscopy-specific) achieves best performance: mean macro F1-score = 0.8221 ± 0.0062, slightly outperforming DINO model trained directly on OpenCell (0.8057 ± 0.0090).

Conclusion: Large-scale pretraining is valuable, and domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even with limited task-specific labeled data.

Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = 0.8221 $\pm$ 0.0062), slightly outperforming a DINO model trained directly on OpenCell (0.8057 $\pm$ 0.0090). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.

[209] A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments

Malaz Tamim, Andrea Matic-Flierl, Karsten Roscher

Main category: cs.CV

TL;DR: Systematic evaluation of 3D person detection comparing camera-only, LiDAR-only, and camera-LiDAR fusion approaches on JRDB dataset, showing fusion outperforms single modalities but has vulnerabilities to sensor misalignments.

Motivation: Accurate 3D person detection is critical for safety applications like robotics and surveillance. Most existing research focuses on autonomous driving, but there's a need to explore detection performance and robustness in diverse indoor/outdoor scenes.

Method: Evaluated three representative models: BEVDepth (camera-only), PointPillars (LiDAR-only), and DAL (camera-LiDAR fusion) on JRDB dataset. Analyzed performance under varying occlusion, distance levels, and robustness against sensor corruptions and misalignments.

Result: Fusion-based approach consistently outperformed single-modality models, especially in challenging scenarios. DAL showed improved resilience but remained sensitive to sensor misalignment and certain LiDAR corruptions. Camera-based BEVDepth had lowest performance and was most affected by occlusion, distance, and noise.

Conclusion: Sensor fusion enhances 3D person detection but vulnerabilities to sensor misalignments and corruptions highlight need for ongoing research to address these system weaknesses.

Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.

[210] Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji

Main category: cs.CV

TL;DR: CAMCUE is a pose-aware MLLM framework for multi-image spatial reasoning that uses camera pose as geometric anchor for cross-view fusion and novel-view perspective taking.

Motivation: Current MLLMs struggle with multi-image spatial reasoning, particularly perspective taking - building coherent 3D understanding from multi-view observations and reasoning from new language-specified viewpoints.

Method: CAMCUE injects per-view camera pose into visual tokens, grounds natural-language viewpoint descriptions to target camera poses, and synthesizes pose-conditioned imagined target views to support answering perspective-shift questions.

Result: Improves overall accuracy by 9.06%, predicts target poses from language descriptions with >90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold, and reduces inference time from 256.6s to 1.45s per example.
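
For reference, thresholded pose metrics of this kind are typically computed as below; the exact error conventions (e.g., translation units) used by the paper are an assumption here.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two 3x3 rotation matrices, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_accuracy(preds, gts, rot_thresh=20.0, trans_thresh=0.5):
    """preds/gts: lists of (R, t) pairs. Returns the fraction of poses within
    each threshold, matching the 20-degree / 0.5 reporting style."""
    rot_ok = [rotation_error_deg(Rp, Rg) < rot_thresh
              for (Rp, _), (Rg, _) in zip(preds, gts)]
    trans_ok = [np.linalg.norm(tp - tg) < trans_thresh
                for (_, tp), (_, tg) in zip(preds, gts)]
    return float(np.mean(rot_ok)), float(np.mean(trans_ok))
```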

Conclusion: CAMCUE enables efficient multi-view spatial reasoning by explicitly incorporating camera geometry, avoiding expensive test-time search while maintaining high accuracy for perspective-taking tasks.

Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.

cs.AI

[211] Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Zhuoming Chen, Hongyi Liu, Yang Zhou, Haizhong Zheng, Beidi Chen

Main category: cs.AI

TL;DR: Jackpot framework uses Optimal Budget Rejection Sampling to decouple rollout generation from policy optimization in RL for LLMs, reducing distribution mismatch and improving training efficiency.

Motivation: Current RL for LLMs is expensive due to costly rollouts. Decoupling rollout generation from policy optimization could improve efficiency, but introduces distribution mismatch that destabilizes learning.

Method: Proposes Jackpot framework with Optimal Budget Rejection Sampling (OBRS) to reduce discrepancy between rollout model and evolving policy. Includes unified training objective, top-k probability estimation, and batch-level bias correction.
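
As background, plain budget-capped rejection sampling looks like the sketch below; OBRS, per the summary, chooses the acceptance rule optimally under such a budget, which this sketch does not attempt.

```python
import numpy as np

def budgeted_rejection_sample(samples, target_logp, proposal_logp, budget_M):
    """Accept x with probability min(1, p_target(x) / (M * p_proposal(x))).
    A finite budget M caps the expected cost but biases the accepted
    distribution, which is the trade-off OBRS-style schemes optimize."""
    accepted = []
    for x in samples:
        ratio = np.exp(target_logp(x) - proposal_logp(x)) / budget_M
        if np.random.rand() < min(1.0, ratio):
            accepted.append(x)
    return accepted
```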

Result: Theoretical analysis shows OBRS consistently moves rollout distribution closer to target distribution. Empirically improves training stability vs importance-sampling baselines, achieving comparable performance to on-policy RL for Qwen3-8B-Base.

Conclusion: OBRS-based alignment brings us closer to practical decoupling of rollout generation from policy optimization for RL in LLMs, enabling more efficient training.

Abstract: Reinforcement learning (RL) for large language models (LLMs) remains expensive, particularly because rollouts are costly. Decoupling rollout generation from policy optimization (e.g., leveraging a more efficient model to rollout) could enable substantial efficiency gains, yet doing so introduces a severe distribution mismatch that destabilizes learning. We propose Jackpot, a framework that leverages Optimal Budget Rejection Sampling (OBRS) to directly reduce the discrepancy between the rollout model and the evolving policy. Jackpot integrates a principled OBRS procedure, a unified training objective that jointly updates the policy and rollout models, and an efficient system implementation enabled by top-$k$ probability estimation and batch-level bias correction. Our theoretical analysis shows that OBRS consistently moves the rollout distribution closer to the target distribution under a controllable acceptance budget. Empirically, Jackpot substantially improves training stability compared to importance-sampling baselines, achieving performance comparable to on-policy RL when training Qwen3-8B-Base for up to 300 update steps with batch size 64. Taken together, our results show that OBRS-based alignment brings us a step closer to practical and effective decoupling of rollout generation from policy optimization for RL for LLMs.

[212] Large Language Model Reasoning Failures

Peiyang Song, Pengrui Han, Noah Goodman

Main category: cs.AI

TL;DR: Comprehensive survey of reasoning failures in Large Language Models, categorizing failures into fundamental, application-specific, and robustness issues, with analysis of causes and mitigation strategies.

Motivation: Despite LLMs' impressive reasoning capabilities, significant reasoning failures persist even in simple scenarios. The paper aims to systematically understand and address these shortcomings by providing the first comprehensive survey dedicated to reasoning failures in LLMs.

Method: Introduces a novel categorization framework distinguishing reasoning into embodied vs. non-embodied types, with non-embodied further divided into informal (intuitive) and formal (logical) reasoning. Classifies reasoning failures into three types: fundamental failures intrinsic to LLM architectures, application-specific limitations in particular domains, and robustness issues with inconsistent performance across minor variations.

Result: Provides a structured perspective on systemic weaknesses in LLM reasoning, offering clear definitions, analysis of existing studies, exploration of root causes, and presentation of mitigation strategies for each reasoning failure type. Also releases a comprehensive GitHub repository collection of research works on LLM reasoning failures.

Conclusion: The survey unifies fragmented research efforts on LLM reasoning failures, providing valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities in language models.

Abstract: Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.

[213] Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version)

Pierriccardo Olivieri, Fausto Lasca, Alessandro Gianola, Matteo Papini

Main category: cs.AI

TL;DR: A framework using Linear Temporal Logic Modulo Theories (LTLfMT) for specifying non-Markovian rewards in MDPs with large state spaces, enabling complex task specification over unstructured data domains without manual predicate encoding.

Motivation: Current approaches for specifying complex tasks in reinforcement learning often require manual predicate encoding and struggle with unstructured, heterogeneous data domains. There's a need for more expressive logical specifications that can handle complex tasks in large state spaces while being reusable and unified.

Method: Proposes using LTLfMT (Linear Temporal Logic Modulo Theories over finite traces) which extends classical temporal logic with first-order formulas. Identifies a tractable fragment of LTLfMT for reward specification. Implements a practical method combining reward machines with Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity.
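
HER itself is a standard algorithm, sketched below with the common 'future' relabeling strategy; how the paper tailors it to reward machines over LTLfMT goals is not captured here.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """Hindsight Experience Replay: relabel transitions with goals that were
    actually reached later in the episode, turning sparse failures into
    useful learning signal. episode: list of (state, action, next_state, goal)."""
    out = []
    for t, (s, a, s_next, goal) in enumerate(episode):
        out.append((s, a, s_next, goal, reward_fn(s_next, goal)))
        for _ in range(k):                      # k substitute goals per step
            future = random.choice(episode[t:])
            g_new = future[2]                   # a state achieved later on
            out.append((s, a, s_next, g_new, reward_fn(s_next, g_new)))
    return out
```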

Result: The approach enables natural specification of complex tasks in continuous-control settings using Non-Linear Arithmetic Theory. Experimental results demonstrate that a tailored implementation of HER is crucial for solving tasks with complex goals.

Conclusion: LTLfMT provides a powerful, unified framework for specifying complex non-Markovian rewards in large state spaces, overcoming limitations of manual predicate encoding while maintaining computational tractability through careful fragment selection and HER-based implementation.

Abstract: In this work, we propose a novel framework for the logical specification of non-Markovian rewards in Markov Decision Processes (MDPs) with large state spaces. Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT), a more expressive extension of classical temporal logic in which predicates are first-order formulas of arbitrary first-order theories rather than simple Boolean variables. This enhanced expressiveness enables the specification of complex tasks over unstructured and heterogeneous data domains, promoting a unified and reusable framework that eliminates the need for manual predicate encoding. However, the increased expressive power of LTLfMT introduces additional theoretical and computational challenges compared to standard LTLf specifications. We address these challenges from a theoretical standpoint, identifying a fragment of LTLfMT that is tractable but sufficiently expressive for reward specification in an infinite-state-space context. From a practical perspective, we introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity. We evaluate this approach in a continuous-control setting using Non-Linear Arithmetic Theory, showing that it enables natural specification of complex tasks. Experimental results show that a tailored implementation of HER is fundamental to solving tasks with complex goals.

[214] Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making

Khurram Yamin, Jingjing Tang, Santiago Cortes-Gomez, Amit Sharma, Eric Horvitz, Bryan Wilder

Main category: cs.AI

TL;DR: LLMs as decision agents in high-stakes domains are analyzed for rational utility maximization and coherent beliefs using Bayesian inference tests.

Motivation: LLMs are increasingly used as agents in high-stakes domains where optimal actions depend on uncertainty and outcome utilities, but their decision logic remains opaque. The paper aims to determine if LLMs act as rational utility maximizers with coherent beliefs and stable preferences.

Method: The study analyzes LLM behaviors on diagnosis challenge problems, examining the relationship between LLM inferences and ideal Bayesian utility maximization. The approach provides falsifiable conditions to test if reported probabilities correspond to true beliefs of any rational agent, applied across multiple medical diagnostic domains with evaluations of several LLMs.
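
One simple instance of such a falsifiable condition, offered here as an assumption rather than the paper's exact test, is expected-utility consistency: the chosen action must maximize expected utility under the reported beliefs.

```python
import numpy as np

def is_eu_consistent(reported_probs, utilities, chosen_action):
    """reported_probs: array over world states; utilities[a]: array of payoffs
    of action a per state. Returns False when no rational agent holding the
    reported beliefs could have chosen `chosen_action`."""
    expected = {a: float(np.dot(reported_probs, u)) for a, u in utilities.items()}
    return np.isclose(expected[chosen_action], max(expected.values()))
```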

Result: The results provide insights about LLM inferences relative to ideal Bayesian utility maximization, identifying conditions where reported probabilities cannot correspond to true beliefs of any rational agent. The methodology is applied to multiple medical diagnostic domains across several LLMs.

Conclusion: The study discusses implications for using LLMs in high-stakes decision guidance and provides directions forward, highlighting the need for interpretable decision logic in LLM agents operating in critical domains.

Abstract: Large language models (LLMs) are increasingly deployed as agents in high-stakes domains where optimal actions depend on both uncertainty about the world and consideration of utilities of different outcomes, yet their decision logic remains difficult to interpret. We study whether LLMs are rational utility maximizers with coherent beliefs and stable preferences. We consider behaviors of models for diagnosis challenge problems. The results provide insights about the relationship of LLM inferences to ideal Bayesian utility maximization for elicited probabilities and observed actions. Our approach provides falsifiable conditions under which the reported probabilities cannot correspond to the true beliefs of any rational agent. We apply this methodology to multiple medical diagnostic domains with evaluations across several LLMs. We discuss implications of the results and directions forward for uses of LLMs in guiding high-stakes decisions.

[215] Exposing Weaknesses of Large Reasoning Models through Graph Algorithm Problems

Qifan Zhang, Jianhao Ruan, Aochuan Chen, Kang Zeng, Nuo Chen, Jing Tang, Jia Li

Main category: cs.AI

TL;DR: GrAlgoBench is a new benchmark for evaluating Large Reasoning Models using graph algorithm problems that require long-context reasoning, allow difficulty control, and enable programmatic evaluation.

Motivation: Existing benchmarks for Large Reasoning Models are limited - they lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. There's a need for more rigorous evaluation of reasoning abilities.

Method: The authors introduce GrAlgoBench, a benchmark with nine graph algorithm tasks. Graph problems are chosen because they demand long-context reasoning, allow fine-grained difficulty control, and enable standardized programmatic evaluation.
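
Programmatic evaluation is the point of using graph tasks: an answer can be graded by recomputing the ground truth. Below is a minimal illustrative checker for one hypothetical task type (unweighted shortest path); the benchmark's nine tasks have their own verifiers.

```python
from collections import deque

def check_shortest_path_answer(edges, n, src, dst, claimed_len):
    """Grade a claimed shortest-path length by recomputing it with BFS,
    so no human or LLM judge is needed."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [None] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist[dst] == claimed_len
```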

Result: Experiments reveal two major weaknesses: 1) Accuracy deteriorates sharply as context length increases (below 50% for graphs >120 nodes), driven by execution errors, weak memory, and redundant reasoning. 2) LRMs suffer from “over-thinking” - extensive but ineffective self-verification that inflates reasoning traces without improving correctness.

Conclusion: GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing reasoning studies in LRMs, exposing critical limitations in current models.

Abstract: Large Reasoning Models (LRMs) have advanced rapidly; however, existing benchmarks in mathematics, code, and common-sense reasoning remain limited. They lack long-context evaluation, offer insufficient challenge, and provide answers that are difficult to verify programmatically. We introduce GrAlgoBench, a benchmark designed to evaluate LRMs through graph algorithm problems. Such problems are particularly well suited for probing reasoning abilities: they demand long-context reasoning, allow fine-grained control of difficulty levels, and enable standardized, programmatic evaluation. Across nine tasks, our systematic experiments reveal two major weaknesses of current LRMs. First, accuracy deteriorates sharply as context length increases, falling below 50% once graphs exceed 120 nodes. This degradation is driven by frequent execution errors, weak memory, and redundant reasoning. Second, LRMs suffer from an over-thinking phenomenon, primarily caused by extensive yet largely ineffective self-verification, which inflates reasoning traces without improving correctness. By exposing these limitations, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and practically relevant testbed for advancing the study of reasoning in LRMs. Code is available at https://github.com/Bklight999/GrAlgoBench.

[216] Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion

Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang

Main category: cs.AI

TL;DR: Trifuse is an attention-based GUI grounding framework that integrates multimodal attention mechanisms with OCR-derived textual cues and icon-level caption semantics to improve localization of interface elements without task-specific fine-tuning.

Motivation: Existing GUI grounding approaches either require expensive fine-tuning on large datasets (which is data-intensive and generalizes poorly) or use attention-based methods that lack explicit spatial anchors, leading to low reliability. There's a need for a method that combines the benefits of attention mechanisms with explicit spatial cues.

Method: Trifuse integrates three complementary modalities: 1) attention mechanisms from MLLMs, 2) OCR-derived textual cues, and 3) icon-level caption semantics. These are fused using a Consensus-SinglePeak (CS) strategy that enforces cross-modal agreement while maintaining sharp localization peaks.
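
The summary does not define the CS fusion rule precisely; one natural guess, offered purely as a sketch, is multiplicative consensus over normalized heatmaps followed by a single argmax peak.

```python
import numpy as np

def consensus_singlepeak(attn_map, ocr_map, caption_map, eps=1e-8):
    """Fuse per-pixel evidence multiplicatively (all modalities must agree)
    and return the sharpest surviving peak as the grounding point."""
    fused = np.ones_like(np.asarray(attn_map, dtype=float))
    for m in (attn_map, ocr_map, caption_map):
        m = np.asarray(m, dtype=float)
        fused *= m / (m.sum() + eps)       # normalize each modality first
    y, x = np.unravel_index(fused.argmax(), fused.shape)
    return x, y                            # predicted click location
```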

Result: Extensive evaluations on four GUI grounding benchmarks show Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing reliance on expensive annotated data. Ablation studies confirm that incorporating OCR and caption cues consistently improves attention-based grounding across different backbones.

Conclusion: Trifuse provides an effective general framework for GUI grounding that leverages complementary spatial anchors to improve reliability while avoiding the data-intensive requirements of fine-tuning approaches.

Abstract: GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.

[217] A computational framework for human values

Nardine Osman, Mark d’Inverno

Main category: cs.AI

TL;DR: A formal computational framework for human values to support ethical AI design, addressing the value alignment problem

Motivation: A formal, computational definition of human values is needed to enable AI systems that are provably aligned with human values, addressing the value alignment problem, which includes learning human values, aggregating individual values to groups, and designing computational mechanisms to reason over values.

Method: Development of a formal conceptual framework rooted in social sciences that provides a foundation for systematic, integrated, and interdisciplinary investigation into how human values can support designing ethical AI.

Result: Proposes the first formal, computational definition of values to address the gap in current research on ethical AI and value alignment.

Conclusion: The framework enables systematic investigation of how human values can be used to design ethical AI systems, providing a formal foundation for the value alignment problem.

Abstract: In the diverse array of work investigating the nature of human values from psychology, philosophy and social sciences, there is a clear consensus that values guide behaviour. More recently, a recognition that values provide a means to engineer ethical AI has emerged. Indeed, Stuart Russell proposed shifting AI’s focus away from simply "intelligence" towards "intelligence provably aligned with human values". This challenge – the value alignment problem – together with others, including an AI’s learning of human values, aggregating individual values to groups, and designing computational mechanisms to reason over values, has energised a sustained research effort. Despite this, no formal, computational definition of values has yet been proposed. We address this through a formal conceptual framework rooted in the social sciences that provides a foundation for the systematic, integrated and interdisciplinary investigation into how human values can support designing ethical AI.

[218] Difficulty-Estimated Policy Optimization

Yu Zhao, Fan Jiang, Tianle Liu, Bo Zeng, Yu Liu, Longyue Wang, Weihua Luo

Main category: cs.AI

TL;DR: DEPO introduces a difficulty estimation framework to optimize reasoning model training by filtering low-utility samples before rollout, reducing computational costs by 2x without performance loss.

Motivation: Current reasoning alignment methods like GRPO suffer from gradient signal attenuation on trivial or overly complex problems, and computational inefficiency from exhaustive rollouts on low-utility samples.

Method: Proposes Difficulty-Estimated Policy Optimization (DEPO) with an online Difficulty Estimator that dynamically assesses and filters training data before rollout phase to prioritize computational resources for high-learning-potential samples.
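
The mechanism can be sketched in a few lines: with GRPO-style group advantages, prompts the policy always solves or never solves produce (near-)zero gradient, so an estimator can screen them out before any rollout is spent. The thresholds below are hypothetical.

```python
def filter_by_difficulty(prompts, estimate_solve_rate, lo=0.1, hi=0.9):
    """Keep only prompts whose predicted solve rate is informative:
    all-pass and all-fail groups carry no inter-group advantage signal."""
    return [p for p in prompts if lo <= estimate_solve_rate(p) <= hi]
```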

Result: Achieves up to 2x reduction in rollout costs without compromising model performance, significantly lowering computational barrier for training high-performance reasoning models.

Conclusion: DEPO offers a more efficient and sustainable path for reasoning scaling by optimizing computational resource allocation through difficulty-based sample filtering.

Abstract: Recent advancements in Large Reasoning Models (LRMs), exemplified by DeepSeek-R1, have underscored the potential of scaling inference-time compute through Group Relative Policy Optimization (GRPO). However, GRPO frequently suffers from gradient signal attenuation when encountering problems that are either too trivial or overly complex. In these scenarios, the disappearance of inter-group advantages makes the gradient signal susceptible to noise, thereby jeopardizing convergence stability. While variants like DAPO attempt to rectify gradient vanishing, they do not alleviate the substantial computational overhead incurred by exhaustive rollouts on low-utility samples. In this paper, we propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. DEPO integrates an online Difficulty Estimator that dynamically assesses and filters training data before the rollout phase. This mechanism ensures that computational resources are prioritized for samples with high learning potential. Empirical results demonstrate that DEPO achieves up to a 2x reduction in rollout costs without compromising model performance. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling. Code and data will be released upon acceptance.

[219] Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization

Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina

Main category: cs.AI

TL;DR: QA-Token introduces quality-aware tokenization that incorporates data reliability into vocabulary construction, improving performance on noisy real-world data like genomics and finance.

Motivation: Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora like genomics and financial time series.

Method: Three key contributions: (1) bilevel optimization for joint vocabulary construction and downstream performance, (2) reinforcement learning with quality-aware rewards and convergence guarantees, (3) adaptive parameter learning via Gumbel-Softmax relaxation for end-to-end optimization.
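
To convey the core idea only (the paper learns merge policies with RL, which this greedy sketch does not attempt), quality-aware merge selection can weight each candidate pair by the reliability of the positions it covers:

```python
from collections import Counter

def best_merge(sequences, qualities):
    """Pick the next BPE-style merge by quality-weighted pair counts, so the
    vocabulary is shaped by trustworthy regions of a noisy corpus.
    qualities aligns with sequences: q[i] is the reliability of seq[i] in [0, 1]."""
    scores = Counter()
    for seq, q in zip(sequences, qualities):
        for i in range(len(seq) - 1):
            scores[(seq[i], seq[i + 1])] += min(q[i], q[i + 1])
    return max(scores, key=scores.get) if scores else None
```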

Result: Consistent improvements: genomics (6.7 percentage point F1 gain in variant calling), finance (30% Sharpe ratio improvement). At foundation scale: tokenized 1.7 trillion base-pairs, achieved state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%.

Conclusion: QA-Token unlocks noisy real-world corpora (petabases of genomic sequences, terabytes of financial time series) for foundation model training with zero inference overhead.

Abstract: Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements: genomics (6.7 percentage point F1 gain in variant calling over BPE), finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising 1.7 trillion base-pairs and achieve state-of-the-art pathogen detection (94.53 MCC) while reducing token count by 15%. We unlock noisy real-world corpora, spanning petabases of genomic sequences and terabytes of financial time series, for foundation model training with zero inference overhead.

[220] Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics

Giovanni Briglia, Francesco Fabiano, Stefano Mariani

Main category: cs.AI

TL;DR: This paper proposes using Graph Neural Networks (GNNs) to learn heuristics for Multi-agent Epistemic Planning (MEP) by capturing relational structures in Kripke models, improving scalability of epistemic solvers.

Motivation: Multi-agent Epistemic Planning (MEP) faces scalability issues due to the exponential search space and limitations of existing heuristics when dealing with Kripke structure representations of epistemic states.

Method: Use Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states (Kripke models), deriving meaningful estimates of state quality (e.g., distance to goal) by generalizing from solved planning instances.
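
Once trained, such a heuristic plugs into ordinary guided search; here is a minimal sketch with the GNN abstracted as a callable h(state).

```python
import heapq

def best_first_search(start, goal_test, successors, h):
    """Greedy best-first search ordered by a learned distance-to-goal estimate
    h(state), standing in here for the GNN's prediction on a Kripke structure."""
    frontier = [(h(start), 0, start)]
    seen, tie = {start}, 1                 # tie-breaker avoids comparing states
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if goal_test(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (h(nxt), tie, nxt))
                tie += 1
    return None
```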

Result: The GNN-based predictive heuristics integrated into an epistemic planning pipeline show improvements in scalability compared to standard baselines.

Conclusion: GNNs effectively address scalability challenges in MEP by learning relational patterns in epistemic states, enabling better heuristic guidance for planning.

Abstract: Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, often resulting in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states and use them to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality – e.g., the distance from the nearest goal – by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing improvements in the scalability of multi-agent epistemic planning.

[221] Intrinsic Stability Limits of Autoregressive Reasoning: Structural Consequences for Long-Horizon Execution

Hsien-Jyh Liao

Main category: cs.AI

TL;DR: Autoregressive LLMs have intrinsic stability limits in long-horizon reasoning due to exponential decay of decision advantage, requiring discrete segmentation and graph-like structures for stable execution.

Motivation: LLMs show remarkable reasoning but fail in long-horizon tasks. Conventional explanations focus on task complexity, but this paper argues there's an intrinsic stability limit in autoregressive execution even in simple linear tasks.

Method: Theoretical analysis showing decision advantage decays exponentially with execution length (Theorem A), leading to fundamental bound on reasoning chains. Empirical studies in synthetic environments and TextWorld tasks to validate theoretical predictions.
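
The summary does not reproduce Theorem A; as an illustrative form only, an exponential-decay statement of this kind looks like:

```latex
% Illustrative only; not the paper's exact statement. If each autoregressive
% step preserves the advantage of the correct continuation with probability
% at most 1 - \epsilon, then after T steps
A(T) \le A(0)\,(1-\epsilon)^{T} = A(0)\,e^{T \ln(1-\epsilon)} \approx A(0)\,e^{-\epsilon T},
% so the maintainable horizon scales like T^* = O(1/\epsilon). Segmenting the
% execution and re-anchoring state resets the decay, which is why graph-like
% (DAG) structures emerge as the structural remedy.
```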

Result: Observable performance cliffs consistent with theoretical predictions. Shows that stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures like DAGs.

Conclusion: Long-horizon reasoning failure stems from process-level instability in autoregressive generation, not just task complexity. Short-horizon evaluations may obscure structural instability, suggesting future reasoning systems need structured governance rather than just scaling.

Abstract: Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their performance often deteriorates sharply in long-horizon tasks, exhibiting systematic breakdown beyond certain scales. Conventional explanations primarily attribute this phenomenon to task complexity, such as combinatorial search explosion or long-term credit assignment challenges. In this work, we argue that these explanations are incomplete: even in linear, unbranched tasks without semantic ambiguity, autoregressive execution is subject to an intrinsic stability limit. We propose that the fundamental constraint on long-horizon reasoning arises from process-level instability in autoregressive generation rather than solely from search or task complexity, reframing long-horizon reasoning as a problem of structural governance. We derive Theorem A, showing that decision advantage in single-path autoregressive reasoning decays exponentially with execution length, imposing a fundamental bound on maintainable reasoning chains. This result implies a structural consequence: stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures such as directed acyclic graphs (DAGs). Empirical studies in both synthetic environments and real TextWorld tasks reveal observable performance cliffs consistent with theoretical predictions. Our findings provide a dynamical perspective on long-horizon reasoning failure and suggest new limitations on maintaining long-term coherence under purely autoregressive architectures. Furthermore, we highlight that short-horizon evaluation protocols may obscure structural instability, indicating a potential shift from scaling toward structured governance in future reasoning systems.

[222] AgentCPM-Explore: Realizing Long-Horizon Deep Exploration for Edge-Scale Agents

Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, Huadong Wang, Yesai Wu, Yueying Wu, Zihao Xie, Yukun Yan, Zhong Zhang, Yankai Lin, Zhiyuan Liu, Maosong Sun

Main category: cs.AI

TL;DR: AgentCPM-Explore is a 4B-parameter edge-scale agent model that achieves SOTA performance through addressing catastrophic forgetting, reward noise sensitivity, and reasoning degradation in long contexts.

Motivation: Existing LLM-based agents rely heavily on large-scale models, leaving edge-scale models underexplored. The paper aims to systematically study training agentic models at the 4B-parameter scale and identify bottlenecks hindering their performance.

Method: Proposes AgentCPM-Explore with a holistic training framework featuring: 1) parameter-space model fusion to address catastrophic forgetting during SFT, 2) reward signal denoising for RL sensitivity, and 3) contextual information refinement for reasoning degradation in long-context scenarios.
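
Of the three components, parameter-space model fusion is the most self-contained; in its simplest form it is a weighted average of checkpoints, sketched below (the paper's actual fusion scheme may be more elaborate).

```python
import torch

def fuse_state_dicts(state_dicts, weights):
    """Blend checkpoints in parameter space, e.g., agentic-SFT weights with
    the base model, to curb catastrophic forgetting."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    fused = {}
    for key in state_dicts[0]:
        # float cast keeps the sum well-defined for non-float buffers
        fused[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return fused
```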

Result: Achieves SOTA among 4B-class models, matches/surpasses 8B-class SOTA models on four benchmarks, outperforms larger models like Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks, and achieves 97.09% accuracy on GAIA text-based tasks under pass@64.

Conclusion: The bottleneck for edge-scale models is not their inherent capability ceiling but inference stability. The training framework effectively unlocks the significant, previously underestimated potential of edge-scale models.

Abstract: While Large Language Model (LLM)-based agents have shown remarkable potential for solving complex tasks, existing systems remain heavily reliant on large-scale models, leaving the capabilities of edge-scale models largely underexplored. In this paper, we present the first systematic study on training agentic models at the 4B-parameter scale. We identify three primary bottlenecks hindering the performance of edge-scale models: catastrophic forgetting during Supervised Fine-Tuning (SFT), sensitivity to reward signal noise during Reinforcement Learning (RL), and reasoning degradation caused by redundant information in long-context scenarios. To address the issues, we propose AgentCPM-Explore, a compact 4B agent model with high knowledge density and strong exploration capability. We introduce a holistic training framework featuring parameter-space model fusion, reward signal denoising, and contextual information refinement. Through deep exploration, AgentCPM-Explore achieves state-of-the-art (SOTA) performance among 4B-class models, matches or surpasses 8B-class SOTA models on four benchmarks, and even outperforms larger-scale models such as Claude-4.5-Sonnet or DeepSeek-v3.2 in five benchmarks. Notably, AgentCPM-Explore achieves 97.09% accuracy on GAIA text-based tasks under pass@64. These results provide compelling evidence that the bottleneck for edge-scale models is not their inherent capability ceiling, but rather their inference stability. Based on our well-established training framework, AgentCPM-Explore effectively unlocks the significant, yet previously underestimated, potential of edge-scale models.

[223] MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang

Main category: cs.AI

TL;DR: MAGIC: A multi-agent reinforcement learning framework for LLM safety alignment through adversarial co-evolution between attacker and defender agents.

Motivation: Existing LLM safety defenses rely on static, pre-collected data distributions and lag behind evolving adversarial attacks, creating a need for dynamic, adaptive safety alignment methods.

Method: Multi-turn multi-agent reinforcement learning framework where an attacker agent learns to rewrite queries into deceptive prompts while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs, creating a co-evolutionary process.

Result: The framework demonstrates superior defense success rates without compromising model helpfulness, with the attacker evolving novel combinatorial strategies through iterative RL training.

Conclusion: MAGIC provides a dynamic approach to LLM safety alignment through adversarial co-evolution, uncovering long-tail vulnerabilities and driving generalization to unseen attack patterns.

Abstract: Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected data distributions. In this paper, we introduce MAGIC, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a co-evolution, where the attacker’s ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves novel, previously unseen combinatorial strategies through iterative RL training, underscoring our method’s substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework’s effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.

[224] JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang

Main category: cs.AI

TL;DR: JADE is a two-layer evaluation framework for agentic AI that combines predefined evaluation skills with dynamic claim-level assessment to balance rigor and flexibility in professional task evaluation.

Motivation: There's a fundamental dilemma in evaluating agentic AI on open-ended professional tasks: static rubrics provide rigor but lack flexibility for diverse valid responses, while LLM-as-a-judge approaches adapt but suffer from instability and bias. Human experts combine domain-grounded principles with dynamic claim-level assessment, inspiring a better solution.

Method: JADE uses a two-layer framework: Layer 1 encodes expert knowledge as predefined evaluation skills for stable criteria; Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims.
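
Evidence-dependency gating has a clean recursive reading: a conclusion is valid only if every claim it transitively rests on survived claim-level checking. The data layout below is a hypothetical simplification.

```python
def gate_conclusions(status, depends_on):
    """status[c] is 'supported' or 'refuted'; depends_on[c] lists the claims
    c rests on. A claim is invalid if it, or any transitive premise, is
    refuted (cycles are conservatively treated as invalid)."""
    def valid(c, seen=frozenset()):
        if status[c] == 'refuted' or c in seen:
            return False
        return all(valid(p, seen | {c}) for p in depends_on.get(c, []))
    return {c: valid(c) for c in status}
```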

Result: Experiments on BizBench show JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. It demonstrates strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains.

Conclusion: JADE successfully addresses the evaluation dilemma by combining the stability of predefined skills with the flexibility of claim-level assessment, providing a robust framework for evaluating agentic AI on open-ended professional tasks across domains.

Abstract: Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose JADE, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to a medical-domain benchmark, validating JADE across professional domains. Our code is publicly available at https://github.com/smiling-world/JADE.

[225] Progress Constraints for Reinforcement Learning in Behavior Trees

Finn Rietz, Mart Kartašev, Johannes A. Stork, Petter Ögren

Main category: cs.AI

TL;DR: Combining Behavior Trees with Reinforcement Learning using progress constraints to prevent controllers from undoing achieved subgoals, improving performance and sample efficiency.

Motivation: Behavior Trees provide structured reactive decision-making while RL learns optimal controllers, but naive integration causes controllers to counteract each other and undo subgoals, degrading overall performance.

Method: Proposed progress constraints mechanism where feasibility estimators constrain allowed actions based on theoretical BT convergence results to prevent controllers from undoing previously achieved subgoals.
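
In spirit, the mechanism is action masking: before the RL controller acts, drop any action the feasibility estimators predict would undo an achieved subgoal. The sketch below is schematic; `feasible` and `undoes` stand in for the learned estimators.

```python
def allowed_actions(actions, achieved_subgoals, feasible, undoes):
    """Progress-constraint masking over the action set. Falls back to the
    full set rather than deadlocking if every action is ruled out."""
    safe = [a for a in actions
            if feasible(a) and not any(undoes(a, g) for g in achieved_subgoals)]
    return safe if safe else list(actions)
```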

Result: Empirical evaluations in 2D proof-of-concept and high-fidelity warehouse environments demonstrate improved performance, sample efficiency, and constraint satisfaction compared to prior BT-RL integration methods.

Conclusion: Progress constraints effectively address the problem of controllers counteracting each other in BT-RL integration, enabling better combination of structured domain knowledge with learned controllers.

Abstract: Behavior Trees (BTs) provide a structured and reactive framework for decision-making, commonly used to switch between sub-controllers based on environmental conditions. Reinforcement Learning (RL), on the other hand, can learn near-optimal controllers but sometimes struggles with sparse rewards, safe exploration, and long-horizon credit assignment. Combining BTs with RL has the potential for mutual benefit: a BT design encodes structured domain knowledge that can simplify RL training, while RL enables automatic learning of the controllers within BTs. However, naive integration of BTs and RL can lead to some controllers counteracting other controllers, possibly undoing previously achieved subgoals, thereby degrading the overall performance. To address this, we propose progress constraints, a novel mechanism where feasibility estimators constrain the allowed action set based on theoretical BT convergence results. Empirical evaluations in a 2D proof-of-concept and a high-fidelity warehouse environment demonstrate improved performance, sample efficiency, and constraint satisfaction, compared to prior methods of BT-RL integration.

[226] HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction

Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li

Main category: cs.AI

TL;DR: HyPER is a training-free online control policy for multi-path decoding in mixture-of-experts models that dynamically manages exploration-exploitation trade-off during test-time compute scaling, improving reasoning accuracy while reducing token usage.

DetailsMotivation: Current approaches to test-time compute scaling for reasoning have rigid exploration-exploitation trade-offs: tree search has brittle expansion rules that interfere with reasoning, while parallel reasoning over-explores redundant paths with weak answer selection. The optimal balance is phase-dependent, and correct/incorrect paths often diverge only at late stages.

Method: Reformulates test-time scaling as a dynamic expand-reduce control problem over hypothesis pools. HyPER includes: 1) online controller transitioning from exploration to exploitation as pool evolves, 2) token-level refinement for efficient generation-time exploitation without full-path resampling, and 3) length- and confidence-aware aggregation for reliable answer-time exploitation.
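
Concretely, the expand-reduce loop keeps extending all live hypothesis paths and then shrinks the pool in proportion to the remaining token budget, shifting from exploration to exploitation as the budget is consumed. A schematic sketch; `step`, `score`, and the `done` flag are placeholder interfaces, not the paper's implementation:

```python
def hyper_control(init_pool, step, score, budget, min_paths=2):
    """Schematic expand-reduce control over a hypothesis pool.

    step(path) -> (path', tokens_used): extend an unfinished path by a chunk.
    score(path) -> float: a lightweight path statistic (e.g., mean logprob).
    """
    pool, spent = list(init_pool), 0
    while spent < budget and any(not p.done for p in pool):
        # Expand: advance every unfinished path by one chunk.
        for i, path in enumerate(pool):
            if not path.done:
                pool[i], used = step(path)
                spent += used
        # Reduce: pool size shrinks as the remaining budget shrinks.
        target = max(min_paths, round(len(pool) * (budget - spent) / budget))
        pool = sorted(pool, key=score, reverse=True)[:target]
    # Answer-time exploitation: return the highest-scoring hypothesis.
    return max(pool, key=score)
```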

Result: Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show HyPER consistently achieves superior accuracy-compute trade-off, improving accuracy by 8-10% while reducing token usage by 25-40%.

Conclusion: HyPER provides an effective training-free approach for dynamic control of multi-path decoding that optimizes the exploration-exploitation trade-off during test-time compute scaling for reasoning tasks.

Abstract: Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.

[227] LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models

Brian Rabern, Philipp Mondorf, Barbara Plank

Main category: cs.AI

TL;DR: LogicSkills benchmark evaluates LLMs on three core logical reasoning skills: formal symbolization, countermodel construction, and validity assessment in first-order logic.

DetailsMotivation: While LLMs show strong performance on logical reasoning benchmarks, it's unclear what core logical skills they truly master. The paper aims to isolate and evaluate fundamental reasoning skills beyond surface-level pattern recognition.

Method: Created the LogicSkills benchmark with items drawn from the two-variable fragment of first-order logic (without identity), presented both in natural English and in a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3.
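
Verification with Z3 follows the standard recipe: assert the premises together with the negation of the conclusion; unsat means the argument is valid, while sat yields an explicit countermodel. A minimal illustration of the pattern (a toy argument, not an item from the benchmark):

```python
from z3 import (BoolSort, Const, DeclareSort, ForAll, Function, Implies,
                Not, Solver, unsat)

D = DeclareSort("D")                    # domain of individuals
P = Function("P", D, BoolSort())
Q = Function("Q", D, BoolSort())
a = Const("a", D)
x = Const("x", D)

premises = [ForAll([x], Implies(P(x), Q(x))), P(a)]   # all P are Q; a is P
conclusion = Q(a)

s = Solver()
s.add(premises)
s.add(Not(conclusion))                  # try to refute the conclusion
if s.check() == unsat:
    print("valid: the conclusion follows from the premises")
else:
    print("invalid; countermodel:", s.model())
```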

Result: Leading models perform well on validity assessment but substantially worse on formal symbolization and countermodel construction, suggesting they rely on surface-level patterns rather than genuine symbolic or rule-based reasoning.

Conclusion: Current LLMs lack deep understanding of formal logical reasoning skills, particularly in symbolization and countermodel construction, indicating limitations in their logical reasoning capabilities.

Abstract: Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) formal symbolization, translating premises into first-order logic; (ii) countermodel construction, formulating a finite structure in which all premises are true while the conclusion is false; and (iii) validity assessment, deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented in both natural English and a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality using the SMT solver Z3. Across leading models, performance is high on validity but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.

[228] AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Yishan Li, Wentong Chen, Yukun Yan, Mingwei Li, Sen Mei, Xiaorong Wang, Kunpeng Liu, Xin Cong, Shuo Wang, Zhong Zhang, Yaxi Lu, Zhenghao Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

Main category: cs.AI

TL;DR: AgentCPM-Report: A lightweight 8B-parameter local deep research agent using Writing As Reasoning Policy (WARP) for dynamic outline revision during report generation, outperforming closed-source systems.

DetailsMotivation: Current deep research systems rely heavily on closed-source or online large models for outline construction, creating deployment barriers and raising safety/privacy concerns for user data. There's a need for lightweight local solutions that can perform deep research without external dependencies.

Method: Proposes AgentCPM-Report with Writing As Reasoning Policy (WARP) that enables dynamic outline revision during report generation. Uses Evidence-Based Drafting and Reasoning-Driven Deepening phases, supported by Multi-Stage Agentic Training (cold-start, atomic skill RL, and holistic pipeline RL).
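
The alternation can be pictured as a loop that drafts sections from current evidence, probes the draft for gaps, and revises the outline when new evidence arrives. A schematic sketch; `draft_section`, `deepen`, and `revise_outline` are placeholders for the agent's actual tools, not the released code:

```python
def warp_report(query, draft_section, deepen, revise_outline, max_rounds=6):
    """Schematic Writing-As-Reasoning loop: interleave evidence-based
    drafting with reasoning-driven deepening, letting the outline evolve.

    draft_section(section, evidence) -> text
    deepen(report) -> list of new evidence items (empty when saturated)
    revise_outline(outline, evidence) -> updated outline
    """
    outline, evidence, report = [query], [], {}
    for _ in range(max_rounds):
        for section in outline:
            report[section] = draft_section(section, evidence)
        new_evidence = deepen(report)
        if not new_evidence:            # no remaining gaps: stop deepening
            break
        evidence.extend(new_evidence)
        outline = revise_outline(outline, evidence)
    return outline, report
```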

Result: Outperforms leading closed-source systems on DeepResearch Bench, DeepConsult, and DeepResearch Gym benchmarks, with substantial gains in Insight metrics.

Conclusion: Demonstrates that lightweight local models can achieve state-of-the-art deep research capabilities through innovative agentic frameworks and training strategies, reducing dependency on large external models.

Abstract: Generating deep research reports requires large-scale information acquisition and the synthesis of insight-driven analysis, posing a significant challenge for current language models. Most existing approaches follow a plan-then-write paradigm, whose performance heavily depends on the quality of the initial outline. However, constructing a comprehensive outline itself demands strong reasoning ability, causing current deep research systems to rely almost exclusively on closed-source or online large models. This reliance raises practical barriers to deployment and introduces safety and privacy concerns for user-authored data. In this work, we present AgentCPM-Report, a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process and an 8B-parameter deep research agent. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines during report generation. Under this policy, the agent alternates between Evidence-Based Drafting and Reasoning-Driven Deepening, jointly supporting information acquisition, knowledge refinement, and iterative outline evolution. To effectively equip small models with this capability, we introduce a Multi-Stage Agentic Training strategy, consisting of cold-start, atomic skill RL, and holistic pipeline RL. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems, with substantial gains in Insight.

[229] SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

Main category: cs.AI

TL;DR: SeeUPO is a critic-free RL algorithm with convergence guarantees for multi-turn LLM agent training, addressing instability in existing methods through sequential policy updates in reverse order.

DetailsMotivation: Existing RL algorithms for LLM-based agents lack verified convergence guarantees in multi-turn settings, leading to training instability and failure to converge to optimal policies.

Method: SeeUPO models multi-turn interactions as sequentially executed multi-agent bandit problems and performs turn-by-turn sequential policy updates in reverse execution order using backward induction.
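
The reverse-order schedule mirrors textbook backward induction: the final turn is updated first (its return is just the terminal reward), and each earlier turn is then updated against the already-improved later turns. A schematic sketch with placeholder data structures:

```python
def seeupo_epoch(policies, rollouts, update_turn):
    """Schematic turn-by-turn policy update in reverse execution order.

    policies: per-turn policies [pi_0, ..., pi_{T-1}].
    update_turn(pi, data) -> pi': e.g., REINFORCE with group-relative
    advantages, treating turn t as a bandit given the later, already
    updated turns. Interfaces are illustrative, not the paper's code.
    """
    T = len(policies)
    for t in reversed(range(T)):        # backward induction: last turn first
        data_t = [(r.obs[t], r.act[t], r.ret[t]) for r in rollouts]
        policies[t] = update_turn(policies[t], data_t)
    return policies
```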

Result: Experiments show substantial improvements: 43.3%-54.6% relative gains on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B, with superior training stability over existing backbone algorithms.

Conclusion: SeeUPO provides a critic-free approach with convergence guarantees for multi-turn LLM agent training, addressing fundamental limitations of existing RL algorithms in agentic scenarios.

Abstract: Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single/multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the globally optimal policy under undiscounted conditions, but the combination of PPO & GRAE breaks PPO’s original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve both critic-free operation and convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO’s substantial improvements over existing backbone algorithms: relative gains of 43.3%-54.6% on Qwen3-14B and 24.1%-41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.

[230] Same Answer, Different Representations: Hidden instability in VLMs

Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini

Main category: cs.AI

TL;DR: A framework for evaluating VLM robustness beyond output invariance, revealing three failure modes: substantial internal representation drift despite stable predictions, no robustness improvement with scale, and task-dependent perturbation effects.

DetailsMotivation: Current VLM robustness evaluation relies on output-level invariance, assuming stable predictions reflect stable multimodal processing. This work argues this assumption is insufficient and proposes a more comprehensive evaluation framework.

Method: Introduces a representation-aware and frequency-aware evaluation framework measuring internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applied to modern VLMs across SEEDBench, MMMU, and POPE datasets.
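
The drift measurement itself is simple to state: compare pooled token embeddings of a clean and a perturbed input, and normalize by typical distances between unrelated images so that a value near 1 means the perturbation moved the representation as far as a different image would. A sketch of one plausible instantiation (tensor shapes and pooling are our assumptions, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def embedding_drift(h_clean: torch.Tensor, h_pert: torch.Tensor) -> float:
    """Cosine drift between mean-pooled token embeddings of shape [T, D]."""
    c, p = h_clean.mean(dim=0), h_pert.mean(dim=0)
    return 1.0 - F.cosine_similarity(c, p, dim=0).item()

def relative_drift(h_clean, h_pert, unrelated_images):
    """Drift normalized by the mean distance to unrelated images; values
    near 1 mean the perturbed representation sits where an entirely
    different input normally would, even if the answer stays stable."""
    inter = sum(embedding_drift(h_clean, h) for h in unrelated_images)
    inter /= len(unrelated_images)
    return embedding_drift(h_clean, h_pert) / inter
```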

Result: Reveals three distinct failure modes: 1) Models preserve predicted answers while undergoing substantial internal representation drift (approaching inter-image variability), 2) Robustness does not improve with scale (larger models have higher accuracy but equal or greater sensitivity), 3) Perturbations affect tasks differently: they harm reasoning when they disrupt the combination of coarse and fine visual cues, but can reduce hallucination false positives.

Conclusion: Output-level invariance is insufficient for assessing VLM robustness. Internal representation analysis reveals hidden vulnerabilities and task-dependent effects, suggesting need for more comprehensive evaluation frameworks that examine multimodal processing stability.

Abstract: The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.

[231] Autoregressive Models for Knowledge Graph Generation

Thiviyan Thanapalasingam, Antonis Vozikis, Peter Bloem, Paul Groth

Main category: cs.AI

TL;DR: ARK is an autoregressive model family for knowledge graph generation that treats graphs as sequences of triples, learning semantic constraints implicitly without rule supervision, achieving high validity scores.

DetailsMotivation: Knowledge graph generation requires capturing complex semantic dependencies and domain constraints, unlike link prediction which scores triples independently. Existing approaches need to generate semantically coherent structures while maintaining validity constraints.

Method: ARK treats knowledge graphs as sequences of (head, relation, tail) triples and uses autoregressive models to generate them. It learns implicit semantic constraints (type consistency, temporal validity, relational patterns) directly from data without explicit rule supervision. Also introduces SAIL, a variational extension for controlled generation through latent representations.
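
Serialization is the key move: each graph becomes a flat id sequence over which an ordinary next-token objective is trained, so constraints like type consistency are absorbed from co-occurrence statistics alone. A toy sketch of the serialization side (vocabulary and ids are illustrative, not the paper's tokenizer):

```python
def serialize_graph(triples, entity2id, relation2id, eos_id=0):
    """Flatten a KG into a token sequence of (head, relation, tail) ids."""
    seq = []
    for head, rel, tail in triples:
        seq += [entity2id[head], relation2id[rel], entity2id[tail]]
    seq.append(eos_id)                      # end-of-graph marker
    return seq

triples = [("alice", "works_at", "acme"), ("acme", "located_in", "berlin")]
entity2id = {"alice": 1, "acme": 2, "berlin": 3}
relation2id = {"works_at": 10, "located_in": 11}   # disjoint id range
print(serialize_graph(triples, entity2id, relation2id))
# -> [1, 10, 2, 2, 11, 3, 0]
```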

Result: On IntelliGraphs benchmark, ARK achieves 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. Model capacity (hidden dimensionality >= 64) is more critical than architectural depth, with recurrent architectures achieving comparable validity to transformers with better computational efficiency.

Conclusion: Autoregressive models provide an effective framework for knowledge graph generation with practical applications in knowledge base completion and query answering. The approach demonstrates that implicit constraint learning from data is sufficient for high-quality graph generation.

Abstract: Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.

[232] Semantically Labelled Automata for Multi-Task Reinforcement Learning with LTL Instructions

Alessandro Abate, Giuseppe De Giacomo, Mathias Jackermeier, Jan Kretínský, Maximilian Prokop, Christoph Weinhuber

Main category: cs.AI

TL;DR: A novel multi-task RL approach using LTL formulae as task specifications with semantic automata embeddings for policy conditioning.

DetailsMotivation: To develop a universal policy for multi-task RL that can generalize to unseen tasks specified by LTL formulae, addressing limitations of existing methods in handling complex specifications.

Method: Uses semantic LTL-to-automata translations to create semantically labelled automata, enabling efficient on-the-fly computation, extraction of expressive task embeddings for policy conditioning, and full LTL support.

Result: Achieves state-of-the-art performance across various domains and scales to complex specifications where existing methods fail.

Conclusion: The semantic automata embedding approach effectively enables multi-task RL with LTL specifications, offering improved generalization and scalability.

Abstract: We study multi-task reinforcement learning (RL), a setting in which an agent learns a single, universal policy capable of generalising to arbitrary, possibly unseen tasks. We consider tasks specified as linear temporal logic (LTL) formulae, which are commonly used in formal methods to specify properties of systems, and have recently been successfully adopted in RL. In this setting, we present a novel task embedding technique leveraging a new generation of semantic LTL-to-automata translations, originally developed for temporal synthesis. The resulting semantically labelled automata contain rich, structured information in each state that allow us to (i) compute the automaton efficiently on-the-fly, (ii) extract expressive task embeddings used to condition the policy, and (iii) naturally support full LTL. Experimental results in a variety of domains demonstrate that our approach achieves state-of-the-art performance and is able to scale to complex specifications where existing methods fail.

[233] Towards Understanding What State Space Models Learn About Code

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

Main category: cs.AI

TL;DR: SSM-based code models outperform Transformers in capturing code syntax/semantics during pretraining but forget certain relations during fine-tuning, especially for short-range dependencies; analysis reveals spectral shift and leads to architectural improvements.

DetailsMotivation: State Space Models (SSMs) show promise as efficient alternatives to Transformers for code understanding tasks, but their internal mechanisms are poorly understood. The paper aims to systematically analyze what SSM-based code models actually learn and compare them with Transformer-based models.

Method: Conducted first systematic analysis of SSM-based code models with comparative analysis against Transformers. Introduced SSM-Interpret, a frequency-domain framework to diagnose model behavior, revealing spectral shifts during fine-tuning. Proposed architectural modifications based on findings.

Result: SSMs outperform Transformers at capturing code syntax and semantics during pretraining, but forget certain syntactic and semantic relations during fine-tuning, especially when tasks emphasize short-range dependencies. The spectral analysis shows a shift toward short-range dependencies during fine-tuning. Architectural modifications significantly improve SSM-based code model performance.

Conclusion: SSMs have different learning dynamics than Transformers for code understanding, with strengths in pretraining but weaknesses in fine-tuning for certain dependency types. The analysis framework enables better understanding and improvement of SSM-based models.

Abstract: State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models actually learn and perform the first comparative analysis of SSM- and Transformer-based code models. Our analysis reveals that SSMs outperform Transformers at capturing code syntax and semantics in pretraining but forget certain syntactic and semantic relations during task fine-tuning, especially when the task emphasizes short-range dependencies. To diagnose this, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models, validating that our analysis directly enables better models.

[234] Wild Guesses and Mild Guesses in Active Concept Learning

Anirudh Chari, Neil Pattanaik

Main category: cs.AI

TL;DR: Active concept learning balancing informativeness vs. stability in neuro-symbolic Bayesian learners using LLMs, comparing Rational Active Learner (EIG) and Positive Test Strategy (PTS).

DetailsMotivation: Human concept learning is active, requiring balance between query informativeness and learner stability. Need to understand trade-offs in neuro-symbolic Bayesian learners using LLMs for hypothesis generation.

Method: Neuro-symbolic Bayesian learner with LLM-generated executable programs as hypotheses, comparing Rational Active Learner (maximizing expected information gain) vs. Positive Test Strategy (querying positive instances). Tested on Number Game concept-learning tasks.
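
The two query policies are easy to contrast in code for the Number Game, where each hypothesis is just a set of integers: EIG picks the number whose yes/no answer most reduces posterior entropy, while PTS picks a number the current best hypothesis already predicts positive. A self-contained toy sketch (the hypothesis space here is illustrative, not the paper's LLM-generated programs):

```python
import math

def entropy(weights):
    z = sum(weights)
    return -sum(w / z * math.log(w / z) for w in weights if w > 0)

def expected_info_gain(x, hyps, weights):
    """EIG of querying x: expected entropy drop over the yes/no answer."""
    z = sum(weights)
    p_yes = sum(w for h, w in zip(hyps, weights) if x in h) / z
    gain = 0.0
    for ans, p in ((True, p_yes), (False, 1 - p_yes)):
        if p > 0:
            post = [w if (x in h) == ans else 0.0 for h, w in zip(hyps, weights)]
            gain += p * (entropy(weights) - entropy(post))
    return gain

# Toy hypotheses: even numbers, powers of two, multiples of ten.
hyps = [set(range(0, 101, 2)), {1, 2, 4, 8, 16, 32, 64}, set(range(0, 101, 10))]
weights = [1.0, 1.0, 1.0]
queries = range(1, 101)

eig_query = max(queries, key=lambda x: expected_info_gain(x, hyps, weights))
best = hyps[max(range(len(hyps)), key=lambda i: weights[i])]
pts_query = next(x for x in queries if x in best)   # a "safe" positive test
print(eig_query, pts_query)
```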

Result: EIG is effective for complex concepts requiring falsification but underperforms on simple ones due to a support mismatch between the EIG policy and the LLM proposal distribution. PTS maintains proposal validity with “safe” queries, leading to faster convergence on simple rules.

Conclusion: “Confirmation bias” (PTS) may be rational adaptation for maintaining tractable inference in sparse, open-ended hypothesis spaces, not a cognitive error. Different strategies suit different concept types.

Abstract: Human concept learning is typically active: learners choose which instances to query or test in order to reduce uncertainty about an underlying rule or category. Active concept learning must balance informativeness of queries against the stability of the learner that generates and scores hypotheses. We study this trade-off in a neuro-symbolic Bayesian learner whose hypotheses are executable programs proposed by a large language model (LLM) and reweighted by Bayesian updating. We compare a Rational Active Learner that selects queries to maximize approximate expected information gain (EIG) and the human-like Positive Test Strategy (PTS) that queries instances predicted to be positive under the current best hypothesis. Across concept-learning tasks in the classic Number Game, EIG is effective when falsification is necessary (e.g., compound or exception-laden rules), but underperforms on simple concepts. We trace this failure to a support mismatch between the EIG policy and the LLM proposal distribution: highly diagnostic boundary queries drive the posterior toward regions where the generator produces invalid or overly specific programs, yielding a support-mismatch trap in the particle approximation. PTS is information-suboptimal but tends to maintain proposal validity by selecting “safe” queries, leading to faster convergence on simple rules. Our results suggest that “confirmation bias” may not be a cognitive error, but rather a rational adaptation for maintaining tractable inference in the sparse, open-ended hypothesis spaces characteristic of human thought.

[235] ScaleEnv: Scaling Environment Synthesis from Scratch for Generalist Interactive Tool-Use Agent Training

Dunwei Tu, Hongyan Hao, Hansi Yang, Yihao Chen, Yi-Kai Zhang, Zhikang Xia, Yu Yang, Yueqing Sun, Xingchen Liu, Furao Shen, Qi Gu, Hui Su, Xunliang Cai

Main category: cs.AI

TL;DR: ScaleEnv: A framework for constructing fully interactive environments from scratch to train generalist agents through self-exploration, with procedural testing for reliability and tool dependency verification for task completeness.

DetailsMotivation: There's a critical scarcity of interactive environments for training generalist agents, and existing synthesis methods have limitations in environmental diversity and scalability. Current approaches don't provide sufficient diversity for robust agent learning.

Method: ScaleEnv constructs interactive environments entirely from scratch with procedural testing for reliability. It uses tool dependency graph expansion and executable action verification to guarantee task completeness and solvability, enabling agents to learn through exploration.

Result: Significant performance improvements on unseen, multi-turn tool-use benchmarks like τ²-Bench and VitaBench, demonstrating strong generalization capabilities. Empirical evidence shows that scaling environmental diversity is critical for robust agent learning.

Conclusion: ScaleEnv addresses the scarcity of interactive environments and enables effective training of generalist agents through scalable environmental diversity, with verified reliability and task completeness.

Abstract: Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, interactive environments remain critically scarce, and existing synthesis methods suffer from significant limitations regarding environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as τ²-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between the number of domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.

[236] POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models

Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim

Main category: cs.AI

TL;DR: POP is an online structural pruning framework that enables context-conditioned dynamic pruning during autoregressive token generation with minimal computational overhead.

DetailsMotivation: Current structural pruning methods use fixed pruning decisions during inference, ignoring the sparsity patterns that emerge during autoregressive token generation, which limits efficiency gains.

Method: POP partitions model channels into retained, candidate, and pruned regions. Prefilling defines a coarse pruning partition, and decoding generates fine-grained masks within candidate regions without full-channel re-evaluation.
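
One plausible reading of the partition-then-mask scheme: sort channels by a prefill importance score into three bands, then re-score only the middle (candidate) band at each decoding step. The fractions and scoring below are illustrative, not the paper's exact criteria:

```python
import torch

def partition_channels(prefill_scores, keep_frac=0.5, prune_frac=0.2):
    """Split channel indices into retained / candidate / pruned bands
    from a prefill importance score (e.g., mean absolute activation)."""
    n = prefill_scores.numel()
    order = torch.argsort(prefill_scores, descending=True)
    n_keep, n_prune = int(n * keep_frac), int(n * prune_frac)
    return order[:n_keep], order[n_keep:n - n_prune], order[n - n_prune:]

def decode_step_mask(n, retained, candidate, cand_scores, cand_keep=0.5):
    """Context-conditioned mask per decoding step: retained channels are
    always on, pruned channels always off, and only the candidate band
    (cand_scores aligned with `candidate`) is re-evaluated."""
    mask = torch.zeros(n, dtype=torch.bool)
    mask[retained] = True
    k = int(candidate.numel() * cand_keep)
    mask[candidate[torch.argsort(cand_scores, descending=True)[:k]]] = True
    return mask
```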

Result: POP consistently delivers higher accuracy than existing pruning approaches across diverse LFMs (LLMs, MoEs, VLMs) while incurring smaller computational overhead and minimizing inference latency.

Conclusion: POP is an effective lightweight, plug-and-play online pruning framework that enables context-conditioned dynamic pruning without preprocessing, retraining, or learning predictors.

Abstract: Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge in the autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.

[237] LLM Active Alignment: A Nash Equilibrium Perspective

Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes

Main category: cs.AI

TL;DR: Game-theoretic framework for predicting and steering LLM population behavior using Nash equilibrium analysis with human subpopulation alignment

DetailsMotivation: To address the challenge of predicting and regulating multi-agent LLM dynamics, particularly in social settings where LLMs may exhibit undesirable behaviors like political exclusion of certain subpopulations

Method: Models each LLM agent’s action as a mixture over human subpopulations, with agents strategically choosing which groups to align with. Uses Nash equilibrium analysis with concave-utility assumptions to derive closed-form characterizations and provide actionable guidance for shifting alignment targets

Result: Shows that LLM populations, especially reasoning-based models, can exhibit political exclusion where some subpopulations are ignored by all agents. The method can avoid these pathologies and function as an active alignment layer on top of existing alignment pipelines like RLHF

Conclusion: The game-theoretic framework provides interpretable policy classes and actionable guidance for steering LLM populations toward socially desirable outcomes, demonstrating promise for regulating multi-agent LLM dynamics across domains

Abstract: We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent’s action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions, and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, a pathology in which some subpopulations are ignored by all LLM agents; our method avoids this failure mode, illustrating its promise for regulating multi-agent LLM dynamics across domains.

[238] An Adaptive Differentially Private Federated Learning Framework with Bi-level Optimization

Jin Wang, Hui Ma, Fei Xing, Ming Yan

Main category: cs.AI

TL;DR: Adaptive differentially private federated learning framework addressing device heterogeneity and Non-IID data challenges through local compression, adaptive gradient clipping, and constraint-aware aggregation.

DetailsMotivation: Federated learning faces challenges with device heterogeneity and Non-IID data causing unstable gradient updates. When differential privacy is added, fixed gradient clipping and noise injection amplify perturbations, leading to training oscillation and performance degradation.

Method: Proposes an adaptive differentially private federated learning framework with: 1) a lightweight local compression module to regularize intermediate representations and constrain gradient variability, 2) an adaptive gradient clipping strategy that dynamically adjusts thresholds based on historical update statistics, and 3) a constraint-aware aggregation mechanism to suppress unreliable or noise-dominated client updates.
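
The server-side adaptation can be sketched as tracking a quantile of recent update norms and moving the clip threshold toward it, with Gaussian noise scaled to the current threshold. A simplified sketch; the quantile rule, learning rate, and noise calibration are illustrative, not the paper's exact mechanism (a production system would also account for the privacy cost of the adaptation itself):

```python
import numpy as np

class AdaptiveClipper:
    """Adapt the clipping threshold to a running quantile of client update
    norms, so the injected noise tracks the actual signal scale."""

    def __init__(self, init_clip=1.0, quantile=0.7, lr=0.2,
                 noise_mult=1.0, seed=0):
        self.clip, self.q, self.lr = init_clip, quantile, lr
        self.noise_mult = noise_mult
        self.rng = np.random.default_rng(seed)
        self.norms = []

    def privatize_round(self, client_updates):
        clipped = []
        for u in client_updates:
            norm = np.linalg.norm(u)
            self.norms.append(norm)
            clipped.append(u * min(1.0, self.clip / (norm + 1e-12)))
        agg = np.mean(clipped, axis=0)
        noise = self.rng.normal(
            0.0, self.noise_mult * self.clip / len(client_updates),
            size=agg.shape)
        # Adapt: move the threshold toward a quantile of recent norms.
        target = np.quantile(self.norms[-100:], self.q)
        self.clip += self.lr * (target - self.clip)
        return agg + noise
```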

Result: Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy compared to conventional approaches.

Conclusion: The proposed framework effectively addresses challenges of device heterogeneity, Non-IID data, and differential privacy constraints in federated learning, achieving better stability and performance through adaptive mechanisms.

Abstract: Federated learning enables collaborative model training across distributed clients while preserving data privacy. However, in practical deployments, device heterogeneity and non-independent and identically distributed (Non-IID) data often lead to highly unstable and biased gradient updates. When differential privacy is enforced, conventional fixed gradient clipping and Gaussian noise injection may further amplify gradient perturbations, resulting in training oscillation and degraded model performance. To address these challenges, we propose an adaptive differentially private federated learning framework that explicitly targets model efficiency under heterogeneous and privacy-constrained settings. On the client side, a lightweight local compression module is introduced to regularize intermediate representations and constrain gradient variability, thereby mitigating noise amplification during local optimization. On the server side, an adaptive gradient clipping strategy dynamically adjusts clipping thresholds based on historical update statistics to avoid over-clipping and noise domination. Furthermore, a constraint-aware aggregation mechanism is designed to suppress unreliable or noise-dominated client updates and stabilize global optimization. Extensive experiments on CIFAR-10 and SVHN demonstrate improved convergence stability and classification accuracy.

[239] From Features to Actions: Explainability in Traditional and Agentic AI Systems

Sindhuja Chaduvula, Jessee Ho, Kina Kim, Aravind Narayanan, Mahshid Alinoori, Muskan Garg, Dhanesh Ramachandram, Shaina Raza

Main category: cs.AI

TL;DR: The paper compares static attribution-based explanations with trace-based diagnostics for agentic AI systems, showing that traditional XAI methods fail for multi-step agentic behavior while trajectory-level diagnostics effectively localize failures.

DetailsMotivation: Traditional explainable AI focuses on interpreting individual model predictions, but recent advances in LLMs enable agentic AI systems with multi-step trajectories. There's a gap in understanding how static explanation approaches translate to agentic settings where behavior emerges over time.

Method: The authors empirically compare attribution-based explanations (used in static classification) with trace-based diagnostics (used in agentic benchmarks like TAU-bench Airline and AssistantBench). They analyze how these methods perform in both static and agentic settings.

Result: Attribution methods achieve stable feature rankings in static settings (Spearman ρ=0.86) but cannot reliably diagnose execution-level failures in agentic trajectories. Trace-grounded rubric evaluation consistently localizes behavior breakdowns and reveals that state tracking inconsistency is 2.7× more prevalent in failed runs and reduces success probability by 49%.

Conclusion: The findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behavior, as traditional static explanation methods are insufficient for multi-step agentic settings.

Abstract: Over the last decade, explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations that relate inputs to outputs under a fixed decision structure. Recent advances in large language models (LLMs) have enabled agentic AI systems whose behaviour unfolds over multi-step trajectories. In these settings, success and failure are determined by sequences of decisions rather than a single output. While useful, it remains unclear how explanation approaches designed for static predictions translate to agentic settings where behaviour emerges over time. In this work, we bridge the gap between static and agentic explainability by comparing attribution-based explanations with trace-based diagnostics across both settings. To make this distinction explicit, we empirically compare attribution-based explanations used in static classification tasks with trace-based diagnostics used in agentic benchmarks (TAU-bench Airline and AssistantBench). Our results show that while attribution methods achieve stable feature rankings in static settings (Spearman ρ = 0.86), they cannot be applied reliably to diagnose execution-level failures in agentic trajectories. In contrast, trace-grounded rubric evaluation for agentic settings consistently localizes behaviour breakdowns and reveals that state tracking inconsistency is 2.7× more prevalent in failed runs and reduces success probability by 49%. These findings motivate a shift towards trajectory-level explainability for agentic systems when evaluating and diagnosing autonomous AI behaviour. Resources: https://github.com/VectorInstitute/unified-xai-evaluation-framework and https://vectorinstitute.github.io/unified-xai-evaluation-framework

[240] AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Alisia Lupidi, Bhavul Gauri, Thomas Simon Foster, Bassel Al Omari, Despoina Magka, Alberto Pepe, Alexis Audran-Reiss, Muna Aghamelu, Nicolas Baldwin, Lucia Cipolina-Kun, Jean-Christophe Gagnon-Audet, Chee Hau Leow, Sandra Lefdal, Hossam Mossalam, Abhinav Moudgil, Saba Nazir, Emanuel Tewolde, Isabel Urrego, Jordi Armengol Estape, Amar Budhiraja, Gaurav Chaurasia, Abhishek Charnalia, Derek Dunfield, Karen Hambardzumyan, Daniel Izcovich, Martin Josifoski, Ishita Mediratta, Kelvin Niu, Parth Pathak, Michael Shvartsman, Edan Toledo, Anton Protopopov, Roberta Raileanu, Alexander Miller, Tatiana Shavrina, Jakob Foerster, Yoram Bachrach

Main category: cs.AI

TL;DR: AIRS-Bench is a benchmark suite of 20 ML research tasks assessing LLM agents’ capabilities across the full research lifecycle, showing agents exceed human SOTA in 4 tasks but fail in 16 others, with substantial room for improvement.

DetailsMotivation: To accelerate progress in LLM agents for scientific research by creating a comprehensive benchmark that assesses agentic capabilities across the full research lifecycle, enabling rigorous comparison of different agentic frameworks.

Method: Created AIRS-Bench with 20 tasks sourced from state-of-the-art ML papers spanning diverse domains (language modeling, mathematics, bioinformatics, time series forecasting). Tasks assess capabilities including idea generation, experiment analysis, and iterative refinement without providing baseline code. Established baselines using frontier models with sequential and parallel scaffolds.

Result: Agents exceeded human state-of-the-art in 4 tasks but failed to match it in 16 others. Even when agents surpassed human benchmarks, they did not reach the theoretical performance ceiling for the underlying tasks, indicating AIRS-Bench is far from saturated.

Conclusion: AIRS-Bench offers substantial room for improvement and serves as a catalyst for further development in autonomous scientific research. The benchmark is open-sourced to enable community contributions and rigorous evaluation of agentic frameworks.

Abstract: LLM agents hold significant promise for advancing scientific research. To accelerate this progress, we introduce AIRS-Bench (the AI Research Science Benchmark), a suite of 20 tasks sourced from state-of-the-art machine learning papers. These tasks span diverse domains, including language modeling, mathematics, bioinformatics, and time series forecasting. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle – including idea generation, experiment analysis and iterative refinement – without providing baseline code. The AIRS-Bench task format is versatile, enabling easy integration of new tasks and rigorous comparison across different agentic frameworks. We establish baselines using frontier models paired with both sequential and parallel scaffolds. Our results show that agents exceed human SOTA in four tasks but fail to match it in sixteen others. Even when agents surpass human benchmarks, they do not reach the theoretical performance ceiling for the underlying tasks. These findings indicate that AIRS-Bench is far from saturated and offers substantial room for improvement. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.

[241] Agentic Uncertainty Reveals Agentic Overconfidence

Jean Kaddour, Srijan Patel, Gbètondji Dovonon, Leo Richter, Pasquale Minervini, Matt J. Kusner

Main category: cs.AI

TL;DR: AI agents are overconfident in predicting their own task success, with pre-execution assessments sometimes outperforming post-execution reviews, and adversarial prompting for bug-finding improves calibration.

DetailsMotivation: The paper investigates whether AI agents can accurately predict their own success at tasks, examining agentic uncertainty and calibration of self-assessment capabilities in AI systems.

Method: Researchers elicited success probability estimates from AI agents at three stages: before, during, and after task execution. They compared pre-execution assessments (with less information) against standard post-execution reviews, and tested adversarial prompting that reframed assessment as bug-finding.
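
The headline quantities are straightforward to compute: overconfidence is the gap between the mean predicted success probability and the realized success rate, and discrimination can be read off a rank-based AUROC. A minimal sketch of both metrics:

```python
import numpy as np

def overconfidence(pred_probs, successes):
    """Mean predicted success minus empirical success rate; e.g.,
    predictions averaging 0.77 against a 22% success rate give 0.55."""
    return float(np.mean(pred_probs) - np.mean(successes))

def auroc(pred_probs, successes):
    """Probability that a random successful run is scored above a random
    failed one (ties count half); 0.5 means no discrimination."""
    p = np.asarray(pred_probs, dtype=float)
    y = np.asarray(successes, dtype=bool)
    pos, neg = p[y], p[~y]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))
```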

Result: Results show significant agentic overconfidence: some agents succeeding only 22% of the time predicted 77% success. Counterintuitively, pre-execution assessments with less information sometimes yielded better discrimination than post-execution reviews. Adversarial prompting for bug-finding achieved the best calibration.

Conclusion: AI agents exhibit systematic overconfidence in self-assessment, but assessment timing and framing significantly impact calibration, with adversarial bug-finding approaches showing promise for improving reliability of agent self-evaluation.

Abstract: Can AI agents predict whether they will succeed at a task? We study agentic uncertainty by eliciting success probability estimates before, during, and after task execution. Across all elicitation stages, agents exhibit overconfidence: some agents that succeed only 22% of the time predict 77% success. Counterintuitively, pre-execution assessment with strictly less information tends to yield better discrimination than standard post-execution review, though differences are not always significant. Adversarial prompting that reframes assessment as bug-finding achieves the best calibration.

[242] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang

Main category: cs.AI

TL;DR: OmniVideo-R1 is a reinforced framework that improves audio-visual understanding through query-intensive grounding and modality-attentive fusion strategies.

DetailsMotivation: Humans perceive the world through synergistic multimodal cues, but existing omnivideo models struggle with audio-visual understanding tasks, requiring better mixed-modality reasoning capabilities.

Method: Two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms, and (2) modality-attentive fusion built upon contrastive learning paradigms.

Result: Extensive experiments on multiple benchmarks show OmniVideo-R1 consistently outperforms strong baselines, demonstrating effectiveness and robust generalization capabilities.

Conclusion: OmniVideo-R1 successfully improves audio-visual understanding through reinforced multimodal reasoning, enabling models to better “think with omnimodal cues.”

Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to “think with omnimodal cues” by two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

[243] OpenDeception: Learning Deception and Trust in Human-AI Interaction via Multi-Agent Simulation

Yichen Wu, Qianqian Gao, Xudong Pan, Geng Hong, Min Yang

Main category: cs.AI

TL;DR: OpenDeception is a framework for evaluating deception risk in human-AI dialogue, featuring scenario benchmarks, intent inference, and trust estimation, with experiments showing high deception rates in LLMs.

DetailsMotivation: As LLMs become interactive agents, open-ended human-AI interactions can involve deceptive behaviors with serious real-world consequences, but existing evaluations remain scenario-specific and model-centric.

Method: The OpenDeception framework includes: 1) a benchmark of 50 real-world deception cases, 2) IntentNet to infer deceptive intent from agent reasoning, 3) TrustNet to estimate user susceptibility, 4) LLM-based role-and-goal simulation for high-risk dialogue synthesis, and 5) contrastive learning for training the User Trust Scorer.

Result: Experiments on 11 LLMs and 3 large reasoning models show over 90% of goal-driven interactions in most models exhibit deceptive intent, with stronger models displaying higher risk. Real-world case study demonstrates proactive warning capability before critical trust thresholds.

Conclusion: OpenDeception provides a lightweight framework for joint deception risk evaluation from both human and AI perspectives, revealing concerning deception tendencies in current LLMs and enabling proactive risk assessment.

Abstract: As large language models (LLMs) are increasingly deployed as interactive agents, open-ended human-AI interactions can involve deceptive behaviors with serious real-world consequences, yet existing evaluations remain largely scenario-specific and model-centric. We introduce OpenDeception, a lightweight framework for jointly evaluating deception risk from both sides of human-AI dialogue. It consists of a scenario benchmark with 50 real-world deception cases, an IntentNet that infers deceptive intent from agent reasoning, and a TrustNet that estimates user susceptibility. To address data scarcity, we synthesize high-risk dialogues via LLM-based role-and-goal simulation, and train the User Trust Scorer using contrastive learning on controlled response pairs, avoiding unreliable scalar labels. Experiments on 11 LLMs and three large reasoning models show that over 90% of goal-driven interactions in most models exhibit deceptive intent, with stronger models displaying higher risk. A real-world case study adapted from a documented AI-induced suicide incident further demonstrates that our joint evaluation can proactively trigger warnings before critical trust thresholds are reached.

[244] Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs

Paiheng Xu, Gang Wu, Xiang Chen, Tong Yu, Chang Xiao, Franck Dernoncourt, Tianyi Zhou, Wei Ai, Viswanathan Swaminathan

Main category: cs.AI

TL;DR: Offline simulation framework for curating verified software scripting skillsets using LLMs and API synergy analysis, tested on Adobe Illustrator to improve automation success rates and reduce runtime costs.

DetailsMotivation: Traditional scripting requires programming expertise and API familiarity, creating barriers for users. While LLMs can generate code from natural language, runtime code generation faces security risks, unverified code, longer response times, and higher computational costs.

Method: Proposes an offline simulation framework with two components: (1) task creation using top-down functionality guidance and bottom-up API synergy exploration, and (2) skill generation with trials that refines scripts based on execution feedback. Uses GNN-based link prediction to capture API synergy and generate skills involving underutilized APIs.
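
The API-synergy component can be approximated with a very small link predictor: embed each API, run one message-passing step over the co-usage graph, and score a candidate pair by a dot product. A minimal sketch in PyTorch; this is a generic GNN link predictor under assumed inputs, not the paper's model:

```python
import torch
import torch.nn as nn

class APISynergyPredictor(nn.Module):
    """Score API pairs for likely synergy from a co-usage adjacency matrix."""

    def __init__(self, num_apis, dim=64):
        super().__init__()
        self.emb = nn.Embedding(num_apis, dim)
        self.lin = nn.Linear(dim, dim)

    def node_repr(self, adj):
        # adj: [N, N] row-normalized co-usage adjacency; one GNN step.
        h = self.emb.weight
        return torch.relu(self.lin(adj @ h) + h)

    def score_pairs(self, adj, pairs):
        # pairs: [M, 2] long tensor of API index pairs to score.
        h = self.node_repr(adj)
        return (h[pairs[:, 0]] * h[pairs[:, 1]]).sum(dim=-1)
```

High predicted scores on pairs of rarely co-used APIs would then flag candidate tasks that exercise underutilized parts of the API surface.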

Result: Experiments with Adobe Illustrator show the framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation.

Conclusion: First attempt to use software scripting interfaces as a testbed for LLM-based systems, demonstrating advantages of leveraging execution feedback in controlled environments and offering insights into aligning AI capabilities with user needs in specialized software domains.

Abstract: Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset, a collection of verified scripts, by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.

[245] The Quest for Efficient Reasoning: A Data-Centric Benchmark to CoT Distillation

Ruichen Zhang, Rana Muhammad Shahroz Khan, Zhen Tan, Dawei Li, Song Wang, Tianlong Chen

Main category: cs.AI

TL;DR: DC-CoT is a comprehensive benchmark for evaluating data-centric distillation techniques in chain-of-thought reasoning, assessing data augmentation, selection, and mixing across different teacher/student models and datasets.

DetailsMotivation: There's a lack of systematic benchmarks to evaluate data-centric distillation approaches (data augmentation, selection, mixing) for creating smaller LLMs with strong reasoning abilities, particularly for chain-of-thought distillation.

Method: Created DC-CoT benchmark that investigates data manipulation in CoT distillation from method, model, and data perspectives using various teacher models (o4-mini, Gemini-Pro, Claude-3.5) and student architectures (3B, 7B parameters).

Result: Systematically evaluated impact of data manipulations on student model performance across multiple reasoning datasets, focusing on IID/OOD generalization and cross-domain transfer.

Conclusion: Provides actionable insights and best practices for optimizing CoT distillation through data-centric techniques to facilitate development of more accessible and capable reasoning models.

Abstract: Data-centric distillation, including data augmentation, selection, and mixing, offers a promising path to creating smaller, more efficient student Large Language Models (LLMs) that retain strong reasoning abilities. However, a comprehensive benchmark for systematically assessing the effect of each distillation approach is still lacking. This paper introduces DC-CoT, the first data-centric benchmark that investigates data manipulation in chain-of-thought (CoT) distillation from the method, model, and data perspectives. Utilizing various teacher models (e.g., o4-mini, Gemini-Pro, Claude-3.5) and student architectures (e.g., 3B, 7B parameters), we rigorously evaluate the impact of these data manipulations on student model performance across multiple reasoning datasets, with a focus on in-distribution (IID) and out-of-distribution (OOD) generalization, and cross-domain transfer. Our findings aim to provide actionable insights and establish best practices for optimizing CoT distillation through data-centric techniques, ultimately facilitating the development of more accessible and capable reasoning models. The codebase can be accessed at https://github.com/UNITES-Lab/Distillation-Bench

[246] Scalable In-Context Q-Learning

Jinmei Liu, Fuhong Liu, Zhenhong Sun, Jianye Hao, Huaxiong Li, Bo Wang, Daoyi Dong, Chunlin Chen, Zhi Wang

Main category: cs.AI

TL;DR: S-ICQL is a scalable in-context Q-learning framework that uses dynamic programming and world modeling to improve in-context reinforcement learning, enabling better reward maximization and task generalization from suboptimal trajectories.

DetailsMotivation: Existing in-context reinforcement learning approaches struggle with learning from suboptimal trajectories and achieving precise in-context inference due to complex dynamics and temporal correlations. The authors aim to extend the promise of in-context learning from language models to decision domains more effectively.

Method: Proposes S-ICQL framework with: 1) Prompt-based multi-head transformer architecture predicting optimal policies and in-context value functions separately, 2) Pretrained generalized world model to capture task-relevant information for compact prompts, 3) Iterative policy improvement by fitting state value function to upper-expectile of Q-function, 4) Distilling in-context value functions into policy extraction using advantage-weighted regression.
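
The two training losses are standard and compact: an asymmetric (expectile) regression pulls V toward an upper expectile of Q, and advantage-weighted regression distills the value functions into the policy. A sketch in PyTorch under assumed tensor shapes; the hyperparameters are illustrative:

```python
import torch

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric L2: with tau > 0.5, V is pushed toward an upper
    expectile of Q, approximating a max over in-support actions."""
    diff = q_values - v_values
    weight = torch.where(diff > 0, tau, 1.0 - tau)
    return (weight * diff.pow(2)).mean()

def awr_loss(log_probs, q_values, v_values, beta=3.0):
    """Advantage-weighted regression: imitate dataset actions in
    proportion to exp(beta * advantage), with clipped weights."""
    adv = (q_values - v_values).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    return -(weights * log_probs).mean()
```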

Result: Extensive experiments across discrete and continuous environments show consistent performance gains over various baselines, especially when learning from suboptimal data.

Conclusion: S-ICQL successfully extends in-context learning to reinforcement learning domains, addressing challenges with suboptimal trajectories and achieving precise in-context inference through dynamic programming and world modeling approaches.

Abstract: Recent advancements in language models have demonstrated remarkable in-context learning abilities, prompting the exploration of in-context reinforcement learning (ICRL) to extend the promise to decision domains. Because decision-making involves more complex dynamics and temporal correlations, existing ICRL approaches may struggle to learn from suboptimal trajectories and to achieve precise in-context inference. In this paper, we propose Scalable In-Context Q-Learning (S-ICQL), an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining. We design a prompt-based multi-head transformer architecture that simultaneously predicts optimal policies and in-context value functions using separate heads. We pretrain a generalized world model to capture task-relevant information, enabling the construction of a compact prompt that facilitates fast and precise in-context inference. During training, we perform iterative policy improvement by fitting a state value function to an upper-expectile of the Q-function, and distill the in-context value functions into policy extraction using advantage-weighted regression. Extensive experiments across a range of discrete and continuous environments show consistent performance gains over various types of baselines, especially when learning from suboptimal data. Our code is available at https://github.com/NJU-RL/SICQL.

[247] Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics

Ziqi Zhu, Tao Hu, Honglong Zhang, Dan Yang, HanGeng Chen, Mengran Zhang, Xilun Chen

Main category: cs.AI

TL;DR: CID-GraphRAG is a novel framework that combines intent transition graphs with semantic retrieval to improve multi-turn customer service dialogue systems by maintaining both contextual coherence and goal-oriented progression.

DetailsMotivation: Existing dialogue systems struggle to maintain both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Traditional RAG systems rely only on semantic similarity or static knowledge graphs, failing to capture conversational intent flow patterns.

Method: Constructs dynamic intent transition graphs from historical goal-achieved dialogues and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search to leverage both intent flow patterns and contextual semantics.
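
One plausible instantiation of the dual-retrieval idea, a convex blend of an intent-graph score and a semantic-similarity score; the fusion rule, the weight alpha, and the scorer callables are hypothetical, not the paper's exact formulation:

```python
def dual_retrieve(query, candidates, intent_score, semantic_score, alpha=0.5):
    """Rank candidates by mixing intent-graph traversal and semantic scores."""
    scored = [
        (alpha * intent_score(query, c) + (1 - alpha) * semantic_score(query, c), c)
        for c in candidates
    ]
    return [c for _, c in sorted(scored, key=lambda t: t[0], reverse=True)]
```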

Result: Significantly outperforms semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and 58% improvement in response quality according to LLM-as-judge evaluations.

Conclusion: The integration of intent transition structures with semantic retrieval creates synergistic effects that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for knowledge-intensive multi-turn dialogues.

Abstract: We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge graphs (GraphRAG), CID-GraphRAG constructs dynamic intent transition graphs from goal-achieved historical dialogues and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversational intent flow patterns and contextual semantics, significantly improving both retrieval and response quality. In extensive experiments on real-world customer service dialogues, we employ both automatic metrics and LLM-as-judge assessments, demonstrating that CID-GraphRAG significantly outperforms both semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria. Quantitatively, CID-GraphRAG demonstrates substantial improvements over Conversation RAG across automatic metrics, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and most notably, a 58% improvement in response quality according to LLM-as-judge evaluations. These results demonstrate that the integration of intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for addressing the challenges of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.

[248] GATSim: Urban Mobility Simulation with Generative Agents

Qi Liu, Can Li, Wanjing Ma

Main category: cs.AI

TL;DR: GATSim is a novel urban mobility simulation framework using generative AI agents with cognitive structures to model complex human travel behaviors, replacing traditional rule-based systems.

DetailsMotivation: Traditional agent-based urban mobility simulations use rigid rule-based systems that fail to capture the complexity, adaptability, and behavioral diversity of human travel decision-making. There's a need for more realistic simulations that can model evolving preferences and psychological factors.

Method: GATSim leverages large language models and AI agent technologies to create generative agents with dedicated cognitive structures. Agents have diverse socioeconomic profiles, lifestyles, and preferences shaped through psychologically informed memory systems and lifelong learning. The framework integrates urban mobility foundation models with agent cognitive systems and transport simulation environments, featuring hierarchical memory for efficient retrieval, and planning/reactive mechanisms with multi-scale reflection processes.
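
A hedged sketch of how such a memory retrieval score might combine recency, relevance, and spatial association, in the style of earlier generative-agent systems; the field names and weights are assumptions, not GATSim's implementation:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def memory_score(mem, query_embedding, now_hours, w=(1.0, 1.0, 1.0)):
    recency = math.exp(-(now_hours - mem["last_access_hours"]))  # recent events dominate
    relevance = cosine(mem["embedding"], query_embedding)        # semantic match
    spatial = 1.0 / (1.0 + mem["distance_km"])                   # nearby places score higher
    return w[0] * recency + w[1] * relevance + w[2] * spatial
```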

Result: Experiments show that generative agents perform competitively with human annotators in role-playing scenarios while naturally producing realistic macroscopic traffic patterns. The framework demonstrates improved simulation of complex human mobility behaviors.

Conclusion: GATSim represents a significant advancement in urban mobility simulation by using generative AI agents with cognitive structures to better capture the complexity and adaptability of human travel behavior, moving beyond traditional rule-based approaches.

Abstract: Traditional agent-based urban mobility simulations often rely on rigid rule-based systems that struggle to capture the complexity, adaptability, and behavioral diversity inherent in human travel decision-making. Inspired by recent advancements in large language models and AI agent technologies, we introduce GATSim, a novel framework that leverages these advancements to simulate urban mobility using generative agents with dedicated cognitive structures. GATSim agents are characterized by diverse socioeconomic profiles, individual lifestyles, and evolving preferences shaped through psychologically informed memory systems and lifelong learning. The main contributions of this work are: 1) a comprehensive architecture that integrates an urban mobility foundation model with agent cognitive systems and a transport simulation environment; 2) a hierarchical memory designed for efficient retrieval of contextually relevant information, incorporating spatial and temporal associations; 3) planning and reactive mechanisms for modeling adaptive mobility behaviors which integrate a multi-scale reflection process to transform specific travel experiences into generalized behavioral insights. Experiments indicate that generative agents perform competitively with human annotators in role-playing scenarios, while naturally producing realistic macroscopic traffic patterns. The code for the prototype implementation is publicly available at https://github.com/qiliuchn/gatsim.

[249] Emergent Cognitive Convergence via Implementation: Structured Cognitive Loop Reflecting Four Theories of Mind

Myung Ho Kim

Main category: cs.AI

TL;DR: Agentic Flow architecture unintentionally converges four major cognitive theories (Kahneman, Friston, Minsky, Clark) into a practical AI system with five modules, achieving 95.8% task success vs 62.3% for baseline LLMs.

DetailsMotivation: To address limitations of large language models (LLMs) by creating a structured cognitive architecture that improves task success, constraint adherence, and reproducible reasoning.

Method: Developed Agentic Flow - a practical AI architecture with five interlocking modules (Retrieval, Cognition, Control, Action, Memory) organized into a repeatable cognitive loop, later formalized as the Structured Cognitive Loop (SCL).

Result: Achieved 95.8% task success rate compared to 62.3% for baseline LLMs, with stronger constraint adherence and more reproducible reasoning. The architecture unintentionally converged on computational motifs from four major cognitive theories.

Conclusion: Intelligent architectures may evolve toward shared structural patterns shaped by practical demands rather than deliberate theoretical synthesis. Agentic Flow serves as an implementation instance showing how unified cognitive forms emerge from real-world reasoning necessities.

Abstract: We report a structural convergence among four influential theories of mind: Kahneman's dual-system theory, Friston's predictive processing, Minsky's society of mind, and Clark's extended mind, emerging unintentionally within a practical AI architecture known as Agentic Flow. Designed to address limitations of large language models (LLMs), Agentic Flow comprises five interlocking modules - Retrieval, Cognition, Control, Action, and Memory - organized into a repeatable cognitive loop. Although originally inspired only by Minsky and Clark, subsequent analysis showed that its structure echoes computational motifs from all four theories. This suggests that theoretical convergence may arise from implementation constraints rather than deliberate synthesis. In controlled evaluations, the structured agent achieved 95.8 percent task success compared to 62.3 percent for baseline LLMs, demonstrating stronger constraint adherence and more reproducible reasoning. We characterize this convergence through a broader descriptive meta-architecture called PEACE, highlighting recurring patterns such as predictive modeling, associative recall, and error-sensitive control. Later formalized as the Structured Cognitive Loop (SCL), this abstraction generalizes principles first realized in Agentic Flow as a foundation for behavioral intelligence in LLM-based agents. Rather than asserting theoretical unification, this position paper proposes that intelligent architectures may evolve toward shared structural patterns shaped by practical demands. Agentic Flow thus functions as an implementation instance of the Structured Cognitive Loop, illustrating how a unified cognitive form can emerge not from abstraction, but from the necessities of real-world reasoning.

[250] AI sustains higher strategic tension than humans in chess

Adamo Cerioli, Edward D. Lee, Vito D. P. Servedio

Main category: cs.AI

TL;DR: Analysis of strategic tension in chess comparing human and AI gameplay reveals AI sustains higher tension longer, with tension scaling with algorithmic complexity in AI and Elo rating in humans.

DetailsMotivation: To understand how strategic decision-making balances immediate opportunities against long-term objectives in competitive environments, specifically comparing human and AI approaches in chess.

Method: Network-based metric quantifying piece-to-piece interactions to measure strategic tension; analysis of human and AI gameplay dynamics across different skill levels and time controls.
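
As one plausible instantiation of a piece-to-piece interaction network (the paper's exact tension metric may differ), the count of directed attack/defence relations can be computed with the python-chess library:

```python
import chess

def interaction_count(board: chess.Board) -> int:
    """Count directed piece-to-piece relations: attacker -> occupied square."""
    edges = 0
    for square in board.piece_map():          # every occupied square
        for target in board.attacks(square):  # squares attacked from there
            if board.piece_at(target) is not None:
                edges += 1                    # an attack or defence edge
    return edges

print(interaction_count(chess.Board()))  # interaction edges in the starting position
```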

Result: Elite AI players sustain substantially higher strategic tension for longer durations than top human grandmasters; cumulative tension scales with algorithmic complexity in AI and increases linearly with skill level in humans; longer time controls associated with higher tension in human games.

Conclusion: AI systems tolerate densely interconnected positions balancing offensive/defensive tactics over extended periods, while humans systematically limit tension and game complexity; implications for understanding artificial vs biological strategic navigation and AI deployment in competitive scenarios.

Abstract: Strategic decision-making requires balancing immediate opportunities against long-term objectives: a tension fundamental to competitive environments. We investigate this trade-off in chess by analyzing the dynamics of human and AI gameplay through a network-based metric that quantifies piece-to-piece interactions. Our analysis reveals that elite AI players sustain substantially higher levels of strategic tension for longer durations than top human grandmasters. We find that cumulative tension scales with algorithmic complexity in AI systems and increases linearly with skill level (Elo rating) in human play. Longer time controls are associated with higher tension in human games, reflecting the additional strategic complexity players can manage with more thinking time. The temporal profiles reveal contrasting approaches: highly competitive AI systems tolerate densely interconnected positions that balance offensive and defensive tactics over extended periods, while human players systematically limit tension and game complexity. These differences have broader implications for understanding how artificial and biological systems navigate complex strategic environments and for the deployment of AI in high-stakes competitive scenarios.

[251] Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

Main category: cs.AI

TL;DR: Mini-Mafia is a simplified social deduction game used as a benchmark to evaluate LLM social intelligence through multi-agent interactions, with a theoretical model describing win-rates via latent deception, detection, and disclosure parameters.

DetailsMotivation: Social deduction games like Mafia provide a testbed for evaluating LLM social intelligence due to their information asymmetry and theory-of-mind reasoning, mirroring real-world multi-agent scenarios. However, existing setups lack systematic study frameworks.

Method: Introduces Mini-Mafia: a simplified 4-player variant (1 mafioso, 1 detective, 2 villagers) reduced to single day phase. Develops theoretical model: logit(p) = v × (m - d), where m, d, v are latent parameters for deception, detection, and disclosure. Uses Bayesian inference to estimate parameters from LLM gameplay data, creating the Mini-Mafia Benchmark.
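
The model itself is easy to evaluate; a worked example with made-up parameter values shows how deception, detection, and disclosure trade off:

```python
import math

def mafia_win_rate(m, d, v):
    """logit(p) = v * (m - d), solved for p with the logistic function."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))

# Illustrative values only: deception slightly exceeds detection.
print(mafia_win_rate(m=1.2, d=0.8, v=2.0))  # ~0.69 mafia win-rate
```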

Result: The compact analytic model accurately captures multi-agent dialogue dynamics with few parameters. Experiments reveal counterintuitive findings: smaller models sometimes outperform larger ones. LLMs excel at persuasion but lag in simple strategic reasoning compared to humans. Enables study of emergent dynamics like name bias and last-speaker advantage.

Conclusion: Mini-Mafia provides a principled theoretical framework for studying multi-agent LLM systems, serving as both a benchmark for social intelligence evaluation and a tool for generating deception detection training data, contributing to AI safety research.

Abstract: Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce Mini-Mafia: a simplified four-player variant with one mafioso, one detective and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting. Remarkably, we find that the mafia win-rate p in this three-agent system can be described by a simple theoretical model: logit(p) = v × (m - d), where m, d, and v are intrinsic model parameters representing the mafioso’s deception, the villager’s detection, and the detective’s disclosure capabilities, respectively. This compact analytic description of an interacting triad shows that multi-agent dialogue can be captured by a few latent parameters while still matching empirical outcomes, opening a path to a principled theoretical description of multi-agent LLM systems. Estimating these parameters from LLM gameplay data using Bayesian inference yields the Mini-Mafia Benchmark. Our experiments reveal counterintuitive results, including cases where smaller models significantly outperform larger ones. We also establish human baselines, revealing that LLMs excel at persuasion but lag in simple strategic reasoning for agentic interaction. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage, and contributes to AI safety by generating training data for deception detectors.

[252] Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

Yunghwei Lai, Kaiming Liu, Ziyue Wang, Weizhi Ma, Yang Liu

Main category: cs.AI

TL;DR: Doctor-R1: An AI doctor agent trained for both accurate medical decision-making and strategic, empathetic patient consultation through multi-agent interactive environment and two-tiered reward architecture.

DetailsMotivation: Existing LLMs excel at medical decision-making benchmarks but lack strategic, empathetic consultation skills essential for real-world clinical scenarios. There's a gap in AI systems that can conduct patient-centered, multi-turn inquiry like human doctors.

Method: Proposes Doctor-R1 framework with three key components: 1) multi-agent interactive environment, 2) two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and 3) experience repository for policy learning from high-quality trajectories.

Result: Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by substantial margin with higher parameter efficiency, outperforms powerful proprietary models, and achieves superior clinical capability and patient-centric performance in human expert evaluations.

Conclusion: The framework effectively enables AI doctor agents to master both medical decision-making and strategic consultation skills, demonstrating the importance of combining technical accuracy with empathetic communication for real-world clinical applications.

Abstract: The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making. Our framework introduces three key components: a multi-agent interactive environment, a two-tiered reward architecture that separately optimizes clinical decision-making and communicative inquiry skills, and an experience repository to ground policy learning in high-quality prior trajectories. We evaluate Doctor-R1 on OpenAI’s HealthBench and MAQuE, assessed across multi-facet metrics, such as communication quality, user experience, and task accuracy. Remarkably, Doctor-R1 surpasses state-of-the-art open-source specialized LLMs by a substantial margin with higher parameter efficiency and outperforms powerful proprietary models. Furthermore, human expert evaluations show that Doctor-R1 achieves superior clinical capability and patient-centric performance, demonstrating the effectiveness of the framework.

[253] Personalized Learning Path Planning with Goal-Driven Learner State Modeling

Joy Jia Yin Lim, Ye He, Jifan Yu, Xin Cong, Daniel Zhang-Li, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu

Main category: cs.AI

TL;DR: Pxplore is a framework for Personalized Learning Path Planning that uses reinforcement learning and LLMs to create adaptive, goal-aligned learning paths.

DetailsMotivation: Existing LLM-based approaches for personalized learning lack mechanisms for goal-aligned planning, creating a need for frameworks that can transform abstract educational objectives into actionable learning paths.

Method: Combines reinforcement-based training with LLM-driven architecture, featuring structured learner state modeling, automated reward functions, and policy training via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO).

Result: Extensive experiments validate Pxplore’s effectiveness in producing coherent, personalized, and goal-driven learning paths, deployed within a real-world learning platform.

Conclusion: Pxplore successfully addresses the gap in goal-aligned personalized learning path planning through its integrated reinforcement learning and LLM framework.

Abstract: Personalized Learning Path Planning (PLPP) aims to design adaptive learning paths that align with individual goals. While large language models (LLMs) show potential in personalizing learning experiences, existing approaches often lack mechanisms for goal-aligned planning. We introduce Pxplore, a novel framework for PLPP that integrates a reinforcement-based training paradigm and an LLM-driven educational architecture. We design a structured learner state model and an automated reward function that transforms abstract objectives into computable signals. We train the policy by combining supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), and deploy it within a real-world learning platform. Extensive experiments validate Pxplore’s effectiveness in producing coherent, personalized, and goal-driven learning paths. We release our code and dataset at https://github.com/Pxplore/pxplore-algo.

[254] Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yuxuan Li, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi

Main category: cs.AI

TL;DR: Hi-Agent: A hierarchical vision-language agent for mobile device control with high-level reasoning and low-level action models that achieves SOTA performance on Android benchmarks.

DetailsMotivation: Existing approaches for mobile device agents using VLMs rely on direct state-to-action mappings, which lack structured reasoning and planning, leading to poor generalization to novel tasks or unseen UI layouts.

Method: Hierarchical architecture with high-level reasoning model and low-level action model jointly optimized. Reformulates multi-step decision-making as sequence of single-step subgoals with foresight advantage function that leverages execution feedback from low-level model to guide high-level optimization.

Result: Achieves SOTA 87.9% task success rate on Android-in-the-Wild benchmark, significantly outperforming prior methods across prompt-based (17.7%), supervised (54.5%), and RL-based (71.9%) paradigms. Shows competitive zero-shot generalization on ScreenSpot-v2 and scales effectively on challenging AndroidWorld benchmark.

Conclusion: Hi-Agent demonstrates effective hierarchical reasoning for mobile control, addressing limitations of direct state-to-action mappings and enabling stable joint training without critics.

Abstract: Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.

[255] String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation

Kou Misaki, Takuya Akiba

Main category: cs.AI

TL;DR: SSoT is a prompting method that improves LLMs’ ability to follow probabilistic instructions by having them generate random strings for entropy, then extracting randomness to produce diverse answers matching target distributions.

DetailsMotivation: LLMs struggle with Probabilistic Instruction Following (PIF) - selecting answers from predefined options with specific probabilities. This failure causes biases in applications requiring non-deterministic behaviors like human-behavior simulation, content diversification, and multiplayer games, and reduces response diversity crucial for test-time scaling.

Method: String Seed of Thought (SSoT) prompts LLMs to first output a random string to generate sufficient entropy, then extract randomness by manipulating this string to derive a final answer, preserving diversity while adhering to specific constraints.
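
One plausible way to implement the extraction step, hashing the generated string to a uniform value and inverting the target CDF; the exact manipulation rule used in the paper may differ:

```python
import hashlib

def sample_from_string(random_string, options, probs):
    """Map a model-generated random string to an option with given probabilities."""
    digest = hashlib.sha256(random_string.encode()).hexdigest()
    u = int(digest, 16) / 16 ** len(digest)  # deterministic uniform value in [0, 1)
    cumulative = 0.0
    for option, p in zip(options, probs):
        cumulative += p
        if u < cumulative:
            return option
    return options[-1]  # guard against floating-point rounding

print(sample_from_string("kq83ZpLw02", ["A", "B"], [0.3, 0.7]))
```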

Result: SSoT significantly improves LLMs’ PIF performance, approaching ideal pseudo-random number generator performance. Experiments on NoveltyBench show benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.

Conclusion: SSoT is an effective prompting method that addresses LLMs’ limitations in probabilistic instruction following, improving their ability to generate diverse, distribution-aligned responses for applications requiring non-deterministic behaviors.

Abstract: We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Furthermore, our experiments on NoveltyBench show SSoT’s benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.

[256] Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer

Myung Ho Kim

Main category: cs.AI

TL;DR: SCL introduces a modular agent architecture with five-phase cognition (R-CCAM) and Soft Symbolic Control for explainable, controllable AI agents.

DetailsMotivation: Address fundamental architectural problems in LLM agents: entangled reasoning/execution, memory volatility, and uncontrolled action sequences. Existing frameworks like ReAct and AutoGPT lack explainability and control.

Method: Structured Cognitive Loop (SCL) with five modular phases: Retrieval, Cognition, Control, Action, Memory (R-CCAM). Soft Symbolic Control layer applies symbolic constraints to probabilistic inference while preserving neural flexibility.
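
A minimal skeleton of one R-CCAM pass, with hypothetical module interfaces and a constraint check standing in for the Soft Symbolic Control layer:

```python
def r_ccam_step(task, state, modules, constraints):
    context = modules["retrieval"](task, state["memory"])        # Retrieval
    plan = modules["cognition"](task, context)                   # Cognition
    if not all(rule(plan) for rule in constraints):              # Control (governance)
        plan = modules["cognition"](task, context, constraints)  # re-plan under constraints
    result = modules["action"](plan)                             # Action
    state["memory"].append((task, plan, result))                 # Memory
    return result
```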

Result: Achieves zero policy violations, eliminates redundant tool calls, maintains complete decision traceability on multi-step conditional reasoning tasks. Outperforms existing frameworks like ReAct and AutoGPT.

Conclusion: SCL offers a practical path toward reliable, explainable, and governable AI agents by connecting expert system principles with modern LLM capabilities, with three design principles for trustworthy agents.

Abstract: Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). Soft Symbolic Control constitutes a dedicated governance layer within SCL, applying symbolic constraints to probabilistic inference while preserving the flexibility of neural reasoning and restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.

[257] Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing

Xinyi Lin, Yuyang Zhang, Yuanhang Gan, Juntao Chen, Hao Shen, Yichun He, Lijun Li, Ze Yuan, Shuang Wang, Chaohao Wang, Rui Zhang, Na Li, Jia Liu

Main category: cs.AI

TL;DR: APEX system introduces human-AI co-embodied intelligence for physical experimentation, combining human execution with AI reasoning via wearable interfaces to transform scientific fabrication processes.

DetailsMotivation: Current scientific experimentation and manufacturing require continuous human expertise, limiting interpretability and scalability. The paper aims to overcome these limitations by integrating human researchers with AI systems for more autonomous, traceable, and scalable physical processes.

Method: Develops human-AI co-embodied intelligence paradigm combining human researchers, agentic AI, and wearable hardware. Instantiated as APEX system in microfabrication cleanrooms, where wearable interfaces capture experimentation data and facilitate human-AI communication for real-time error detection and procedural expertise transfer.

Result: APEX achieves 51% higher accuracy than state-of-the-art multimodal LLMs/VLMs in understanding fabrication procedures, enables real-time error detection/correction, transfers expertise to novices, and overcomes material incompatibility challenges to achieve wafer-scale brain-level soft neural probes with single-unit-resolution recording.

Conclusion: Human-AI co-embodied intelligence extends agentic reasoning into the physical domain, transforming scientific experimentation and manufacturing into autonomous, traceable, interpretable, and scalable processes with demonstrated breakthrough fabrication outcomes.

Abstract: Scientific experimentation and manufacturing rely on prolonged protocol development and complex, multi-step implementation, which require continuous human expertise for precise execution and decision-making, limiting interpretability and scalability. Here, we introduce human-artificial intelligence (AI) co-embodied intelligence, a new form of physical AI that unites human researchers, agentic AI, and wearable hardware. In this paradigm, humans provide precise execution, while agentic AI contributes contextual reasoning, adaptive planning, and analysis. The wearable interface continuously captures experimentation and manufacturing, facilitating seamless communication between humans and AI. We instantiate this paradigm in a microfabrication cleanroom, leading to the agentic-physical experimentation (APEX) system, which understands fabrication procedures with 51% higher accuracy than state-of-the-art multimodal large language models/vision language models (LLMs/VLMs), detects and corrects fabrication errors in real time, and transfers procedural expertise to novice users. Critically, the APEX system enables the co-development of fabrication protocols in cleanrooms, overcoming the incompatibility of elastomeric materials in standard microfabrication processes and enabling previously unattainable fabrication outcomes, as demonstrated by the wafer-scale realization of a brain-level soft neural probe capable of single-unit-resolution neural recording. These results establish human-AI co-embodied intelligence as a paradigm that extends agentic reasoning beyond computation into the physical domain, transforming scientific experimentation and manufacturing into autonomous, traceable, interpretable, and scalable processes.

[258] Simulating the Visual World with Artificial Intelligence: A Roadmap

Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu

Main category: cs.AI

TL;DR: Survey paper conceptualizing video foundation models as implicit world models with video renderers, tracing evolution through four generations toward physically plausible, interactive, and planning-capable systems.

DetailsMotivation: The field is shifting from generating visually appealing videos to building virtual environments that support interaction and maintain physical plausibility, pointing toward video foundation models that function as implicit world models.

Method: Systematic survey conceptualizing modern video foundation models as combination of two core components: implicit world model (encoding structured knowledge about physical laws, interactions, agent behavior) and video renderer (transforming latent simulation into realistic visual observations). Traces progression through four generations of video generation.

Result: Provides comprehensive overview of evolution toward world models with intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities across multiple spatiotemporal scales. Examines application domains like robotics, autonomous driving, and interactive gaming.

Conclusion: Video foundation models are evolving into implicit world models that combine simulation engines with visual renderers, with open challenges around agent intelligence and evaluation. The field is moving toward systems that can simulate, interact, and plan in virtual environments.

Abstract: The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a “window” into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.

[259] Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems

Jovan Pavlović, Miklós Krész, László Hajdu

Main category: cs.AI

TL;DR: A novel RAG framework using spreading activation algorithm on automatically constructed heterogeneous knowledge graphs, reducing reliance on semantic knowledge graphs and LLM-guided traversal, improving multi-hop QA performance.

DetailsMotivation: Standard RAG frameworks treat all retrieved information as equally reliable, overlooking varying credibility and interconnected nature of textual corpora. GraphRAG approaches using knowledge graphs have potential but depend on high-quality graph representations that are costly to construct or unreliable when automated.

Method: Proposes a RAG framework using spreading activation algorithm to retrieve information from documents connected by an automatically constructed heterogeneous knowledge graph. Reduces reliance on semantic knowledge graphs and avoids LLM-guided graph traversal.
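
The spreading activation step follows the classic algorithm: seed nodes receive activation, which propagates to weighted neighbours with decay until it falls below a threshold. A minimal sketch, with illustrative decay and threshold values:

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.8, threshold=0.05, steps=3):
    """graph: node -> [(neighbour, edge_weight)]; seeds: node -> initial activation."""
    activation = defaultdict(float, seeds)
    for _ in range(steps):
        nxt = defaultdict(float, activation)
        for node, level in activation.items():
            if level < threshold:
                continue  # too weak to propagate further
            for neighbour, weight in graph.get(node, []):
                nxt[neighbour] += level * weight * decay
        activation = nxt
    return sorted(activation.items(), key=lambda kv: -kv[1])  # most activated first
```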

Result: Achieves better or comparable performance to state-of-the-art RAG methods, can be integrated as plug-and-play module with different iterative RAG pipelines. Combined with chain-of-thought iterative retrieval yields up to 39% absolute improvement in answer correctness over naive RAG, works with small open-weight language models.

Conclusion: The spreading activation approach on heterogeneous knowledge graphs provides effective alternative to traditional GraphRAG, reducing dependency on high-quality semantic graphs and LLM guidance while improving multi-hop reasoning performance.

Abstract: Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution, as it depends on high-quality graph representations of the corpus. Such representations usually rely on manually curated knowledge graphs, which are costly to construct and update, or on automated graph-construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval. In this paper, we propose a novel RAG framework that uses a spreading activation algorithm to retrieve information from a corpus of documents connected by an automatically constructed heterogeneous knowledge graph. This approach reduces reliance on semantic knowledge graphs, which are often incomplete due to information loss during information extraction, avoids LLM-guided graph traversal, and improves performance on multi-hop question answering. Experiments show that our method achieves better or comparable performance to several state-of-the-art RAG methods and can be integrated as a plug-and-play module with different iterative RAG pipelines. When combined with chain-of-thought iterative retrieval, it yields up to a 39% absolute improvement in answer correctness over naive RAG, while achieving these results with small open-weight language models.

[260] UniRel: Relation-Centric Knowledge Graph Question Answering with RL-Tuned LLM Reasoning

Yinxu Tang, Chengsong Huang, Jiaxin Huang, William Yeoh

Main category: cs.AI

TL;DR: UniRel is a framework for relation-centric KGQA that returns informative subgraphs showing semantic relations between entities, using reinforcement learning to prefer compact, specific answers with low-degree intermediate entities.

DetailsMotivation: Current KGQA focuses on entity-centric queries returning single entities, but many real-world questions are relational, aiming to understand how entities are associated. There's a need for relation-centric KGQA that returns informative subgraphs showing semantic relations.

Method: Proposes UniRel, a unified modular framework combining a subgraph retriever with an LLM fine-tuned using reinforcement learning. Uses a reward function to prefer compact and specific subgraphs with informative relations and low-degree intermediate entities.
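
A hypothetical shape for such a reward, penalising subgraph size and high-degree intermediate entities; the functional form and coefficients are assumptions, not the paper's definition:

```python
def subgraph_reward(num_edges, intermediate_degrees, lam=0.1, mu=0.05, connected=True):
    """Higher reward for compact, specific subgraphs that still connect the entities."""
    base = 1.0 if connected else 0.0
    compactness_penalty = lam * num_edges                 # fewer edges preferred
    specificity_penalty = mu * sum(intermediate_degrees)  # hub entities penalised
    return base - compactness_penalty - specificity_penalty
```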

Result: UniRel improves connectivity and reward over Prompting baselines, generalizes well to unseen entities and relations, and can be applied to conventional entity-centric KGQA, achieving competitive or improved performance in several settings.

Conclusion: UniRel successfully addresses relation-centric KGQA by returning informative subgraphs, demonstrating effectiveness through improved connectivity and generalization, while also showing applicability to traditional entity-centric KGQA tasks.

Abstract: Knowledge Graph Question Answering (KGQA) has largely focused on entity-centric queries that return a single answer entity. However, many real-world questions are inherently relational, aiming to understand how entities are associated rather than which entity satisfies a query. In this work, we introduce relation-centric KGQA, a complementary setting in which the answer is a subgraph that represents the semantic relations among entities. The main challenge lies in the abundance of candidate subgraphs, where trivial or overly common connections often obscure the identification of unique and informative answers. To tackle this, we propose UniRel, a unified modular framework that combines a subgraph retriever with an LLM fine-tuned using reinforcement learning. The framework uses a reward function to prefer compact and specific subgraphs with informative relations and low-degree intermediate entities. Experiments show that UniRel improves connectivity and reward over Prompting baselines and generalizes well to unseen entities and relations. Moreover, UniRel can be applied to conventional entity-centric KGQA, achieving competitive or improved performance in several settings.

[261] Forecast Aware Deep Reinforcement Learning for Efficient Electricity Load Scheduling in Dairy Farms

Nawazish Ali, Rachael Shaw, Karl Mason

Main category: cs.AI

TL;DR: A Deep Reinforcement Learning framework for energy load scheduling in dairy farms using forecast-aware PPO with PID-controlled KL divergence for stable training under variable tariffs.

DetailsMotivation: Dairy farming is energy-intensive and relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management is essential but challenged by renewable intermittency. Existing RL methods assume complete future knowledge and use fixed thresholds leading to unstable training under dynamic conditions.

Method: Proposes a Deep RL framework with two PPO variants: 1) Forecast Aware PPO incorporating short-term demand/renewable forecasts using hour-of-day and month-based residual calibration, and 2) PID KL PPO using a proportional-integral-derivative controller to adaptively regulate KL divergence for stable policy updates.
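
A sketch of the PID idea applied to PPO's KL penalty coefficient, raising it when the observed KL overshoots a target and lowering it otherwise; the gains and target below are illustrative, not the paper's values:

```python
class PIDKLController:
    """Adapt PPO's KL penalty coefficient beta with a PID rule."""

    def __init__(self, target_kl=0.01, kp=1.0, ki=0.1, kd=0.05):
        self.target_kl, self.kp, self.ki, self.kd = target_kl, kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0
        self.beta = 1.0

    def update(self, observed_kl):
        error = observed_kl - self.target_kl
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        # Positive error (KL too large) increases beta; negative error relaxes it.
        self.beta = max(1e-3, self.beta + self.kp * error
                        + self.ki * self.integral + self.kd * derivative)
        return self.beta
```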

Result: Achieves up to 1% lower electricity cost than standard PPO, 4.8% lower than DQN, and 1.5% lower than SAC. For battery scheduling, reduces grid imports by 13.1%, demonstrating effectiveness for sustainable energy management in dairy farming.

Conclusion: The proposed DRL framework effectively addresses energy scheduling challenges in dairy farms, showing scalability and effectiveness for sustainable energy management through improved RL algorithms that handle forecast uncertainty and maintain training stability.

Abstract: Dairy farming is an energy-intensive sector that relies heavily on grid electricity. With increasing renewable energy integration, sustainable energy management has become essential for reducing grid dependence and supporting the United Nations Sustainable Development Goal 7 on affordable and clean energy. However, the intermittent nature of renewables poses challenges in balancing supply and demand in real time. Intelligent load scheduling is therefore crucial to minimize operational costs while maintaining reliability. Reinforcement Learning has shown promise in improving energy efficiency and reducing costs. However, most RL-based scheduling methods assume complete knowledge of future prices or generation, which is unrealistic in dynamic environments. Moreover, standard PPO variants rely on fixed clipping or KL divergence thresholds, often leading to unstable training under variable tariffs. To address these challenges, this study proposes a Deep Reinforcement Learning framework for efficient load scheduling in dairy farms, focusing on battery storage and water heating under realistic operational constraints. The proposed Forecast Aware PPO incorporates short-term forecasts of demand and renewable generation using hour-of-day and month-based residual calibration, while the PID KL PPO variant employs a proportional-integral-derivative (PID) controller to adaptively regulate KL divergence for stable policy updates. Trained on real-world dairy farm data, the method achieves up to 1% lower electricity cost than PPO, 4.8% lower than DQN, and 1.5% lower than SAC. For battery scheduling, PPO reduces grid imports by 13.1%, demonstrating scalability and effectiveness for sustainable energy management in modern dairy farming.

[262] Think-Augmented Function Calling: Improving LLM Parameter Accuracy Through Embedded Reasoning

Lei Wei, Xiao Peng, Jinpeng Ou, Bin Wang

Main category: cs.AI

TL;DR: TAFC enhances LLM function calling with explicit reasoning at function and parameter levels using “think” parameter augmentation and dynamic optimization for improved accuracy and interpretability.

DetailsMotivation: Current LLM function calling lacks explicit reasoning transparency during parameter generation, especially for complex functions with interdependent parameters. Existing approaches like chain-of-thought operate at agent level but fail to provide fine-grained reasoning guidance for individual parameters.

Method: Proposes Think-Augmented Function Calling (TAFC) framework with universal “think” parameter augmentation for explicit reasoning, dynamic optimization of parameter descriptions, automatic granular reasoning triggering based on complexity scoring, and reasoning-guided optimization to align with human expectations.
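
The core pattern is easy to illustrate with a hypothetical tool schema: every function gains a required “think” field the model must fill before the real arguments. The schema below is an invented example of the pattern, not taken from the paper:

```python
tool_schema = {
    "name": "book_flight",  # hypothetical tool
    "parameters": {
        "type": "object",
        "properties": {
            "think": {
                "type": "string",
                "description": "Reasoning for the parameter values chosen below.",
            },
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string", "format": "date"},
        },
        "required": ["think", "origin", "destination", "date"],
    },
}
```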

Result: Evaluation on ToolBench across proprietary and open-source models shows significant improvements in parameter generation accuracy and reasoning coherence for multi-parameter functions, while providing enhanced interpretability for debugging AI agent behaviors.

Conclusion: TAFC enhances function calling accuracy through explicit reasoning without requiring architectural modifications to existing LLMs while maintaining full API compatibility, offering improved transparency and debugging capabilities.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in function calling for autonomous agents, yet current mechanisms lack explicit reasoning transparency during parameter generation, particularly for complex functions with interdependent parameters. While existing approaches like chain-of-thought prompting operate at the agent level, they fail to provide fine-grained reasoning guidance for individual function parameters. To address these limitations, we propose Think-Augmented Function Calling (TAFC), a novel framework that enhances function calling accuracy through explicit reasoning at both function and parameter levels. Our method introduces a universal “think” parameter augmentation that enables models to articulate their decision-making process, with dynamic optimization for parameter descriptions to improve reasoning quality. For complex parameters, TAFC automatically triggers granular reasoning based on complexity scoring, ensuring appropriate justification for critical decisions. Additionally, we propose reasoning-guided optimization to align generated reasoning with human expectations. TAFC requires no architectural modifications to existing LLMs while maintaining full API compatibility. Evaluation on ToolBench across proprietary and open-source models demonstrates significant improvements in parameter generation accuracy and reasoning coherence for multi-parameter functions, while providing enhanced interpretability for debugging AI agent behaviors.

[263] MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli

Main category: cs.AI

TL;DR: MirrorBench is a benchmarking framework for evaluating user proxy agents in conversational systems, focusing on their ability to generate human-like utterances using lexical diversity metrics and LLM judges.

DetailsMotivation: Current LLM-based user simulators often produce verbose, unrealistic utterances when using naive "act-as-a-user" prompting, creating a need for principled evaluation of user proxy agents that focuses on human-likeness rather than downstream task success.

Method: MirrorBench combines three lexical-diversity metrics (MATTR, Yule’s K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and uses Human-Human and Proxy-Proxy calibration controls to contextualize judge scores.
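
Two of the lexical-diversity metrics have compact standard definitions; self-contained versions are sketched below (the MATTR window size is an assumption, and both assume a non-empty token list):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2; higher means less diverse."""
    n = len(tokens)
    freq_of_freq = Counter(Counter(tokens).values())  # V_m: types occurring m times
    s2 = sum(m * m * v for m, v in freq_of_freq.items())
    return 10_000 * (s2 - n) / (n * n)

def mattr(tokens, window=50):
    """Moving-average type-token ratio over fixed-size windows."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)
```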

Result: The framework yields variance-aware comparisons across four public datasets and reveals systematic gaps between user proxies and real human users, showing that current proxies don’t fully capture human conversational patterns.

Conclusion: MirrorBench provides a reproducible, extensible benchmarking framework for evaluating user proxy agents, decoupling human-likeness assessment from task success, with open-source implementation available.

Abstract: Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive “act-as-a-user” prompting often yields verbose, unrealistic utterances, motivating principled evaluation of user proxy agents. We present MirrorBench, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. MirrorBench combines three lexical-diversity metrics (MATTR, Yule’s K, and HD-D) with three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, MirrorBench yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a command-line interface for running and managing user-proxy benchmarking experiments.

[264] FadeMem: Biologically-Inspired Forgetting for Efficient Agent Memory

Lei Wei, Xiao Peng, Xu Dong, Niantao Xie, Bin Wang

Main category: cs.AI

TL;DR: FadeMem: A biologically-inspired agent memory architecture with active forgetting mechanisms that uses differential decay rates and LLM-guided conflict resolution to balance retention and forgetting in autonomous agents.

DetailsMotivation: Current AI agents lack selective forgetting mechanisms, leading to catastrophic forgetting at context boundaries or information overload. Human memory naturally balances retention through adaptive decay, while AI systems use binary retention strategies that preserve everything or lose it entirely.

Method: FadeMem implements differential decay rates across a dual-layer memory hierarchy with adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns. It uses LLM-guided conflict resolution and intelligent memory fusion to consolidate related information while allowing irrelevant details to fade.
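
A minimal sketch of such a retention rule, exponential decay whose rate is damped by relevance and access frequency; the modulation function is an assumption, not FadeMem's exact form:

```python
import math

def retention(age_hours, relevance, access_count, base_rate=0.05):
    """Retention in (0, 1]; relevant, frequently accessed memories fade slower."""
    rate = base_rate / (1.0 + relevance + math.log1p(access_count))
    return math.exp(-rate * age_hours)

# A memory could be demoted to the slower layer or dropped once its
# retention falls below a threshold (e.g. 0.1).
```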

Result: Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench demonstrate superior multi-hop reasoning and retrieval with 45% storage reduction, validating the effectiveness of biologically-inspired forgetting in agent memory systems.

Conclusion: FadeMem provides an effective biologically-inspired approach to agent memory management that balances retention and forgetting, addressing critical limitations in current autonomous agent systems.

Abstract: Large language models deployed as autonomous agents face critical memory limitations, lacking selective forgetting mechanisms that lead to either catastrophic forgetting at context boundaries or information overload within them. While human memory naturally balances retention and forgetting through adaptive decay processes, current AI systems employ binary retention strategies that preserve everything or lose it entirely. We propose FadeMem, a biologically-inspired agent memory architecture that incorporates active forgetting mechanisms mirroring human cognitive efficiency. FadeMem implements differential decay rates across a dual-layer memory hierarchy, where retention is governed by adaptive exponential decay functions modulated by semantic relevance, access frequency, and temporal patterns. Through LLM-guided conflict resolution and intelligent memory fusion, our system consolidates related information while allowing irrelevant details to fade. Experiments on Multi-Session Chat, LoCoMo, and LTI-Bench demonstrate superior multi-hop reasoning and retrieval with 45% storage reduction, validating the effectiveness of biologically-inspired forgetting in agent memory systems.

[265] Yunjue Agent Tech Report: A Fully Reproducible, Zero-Start In-Situ Self-Evolving Agent System for Open-Ended Tasks

Haotian Li, Shijun Yang, Weizhen Qi, Silei Zhao, Rui Hua, Mingzhu Song, Xiaojian Yang, Chao Peng

Main category: cs.AI

TL;DR: A self-evolving agent system that continuously improves through tool evolution using execution feedback, without ground-truth labels, for open-ended environments.

DetailsMotivation: Conventional agent systems fail in open-ended environments with drifting task distributions and scarce supervision, as they rely on static toolsets or offline training that can't adapt to dynamic changes.

Method: Proposes In-Situ Self-Evolving paradigm using sequential task interactions as experience stream, distilling execution feedback into reusable capabilities. Focuses on tool evolution with binary feedback signals, implements Yunjue Agent with Parallel Batch Evolution strategy for efficiency.
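
A minimal sketch of the in-situ tool-evolution loop described here: each task yields a binary success signal from execution, and successful tools are distilled into a reusable library. All names and the toy "synthesis" step are illustrative assumptions, not the Yunjue Agent implementation:

```python
def in_situ_evolve(task_stream, synthesize_tool, library=None):
    """Tools are synthesized on demand, kept on binary success, dropped on
    failure. `synthesize_tool(kind) -> callable` stands in for LLM-based
    tool synthesis."""
    library = library or {}
    for kind, payload, expected in task_stream:   # binary-verifiable tasks
        tool = library.get(kind) or synthesize_tool(kind)
        success = tool(payload) == expected       # execution feedback
        if success:
            library[kind] = tool                  # reusable capability
        else:
            library.pop(kind, None)               # discard failed tool
    return library

# Toy usage: "synthesis" just maps task kinds to simple callables.
demo = {"square": lambda x: x * x, "negate": lambda x: -x}
lib = in_situ_evolve(
    [("square", 3, 9), ("negate", 2, -2), ("square", 4, 16)],
    synthesize_tool=lambda kind: demo[kind],
)
print(sorted(lib))  # ['negate', 'square']
```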

Result: Significant performance gains over proprietary baselines across five diverse benchmarks in zero-start setting. Accumulated general knowledge transfers seamlessly to novel domains in warm-start evaluations. Novel metric proposed for monitoring evolution convergence.

Conclusion: The self-evolving paradigm enables agents to expand capabilities autonomously in open-ended environments through tool evolution, with demonstrated effectiveness and transferability of learned knowledge.

Abstract: Conventional agent systems often struggle in open-ended environments where task distributions continuously drift and external supervision is scarce. Their reliance on static toolsets or offline training lags behind these dynamics, leaving the system’s capability boundaries rigid and unknown. To address this, we propose the In-Situ Self-Evolving paradigm. This approach treats sequential task interactions as a continuous stream of experience, enabling the system to distill short-term execution feedback into long-term, reusable capabilities without access to ground-truth labels. Within this framework, we identify tool evolution as the critical pathway for capability expansion, which provides verifiable, binary feedback signals. Building on this insight, we develop Yunjue Agent, a system that iteratively synthesizes, optimizes, and reuses tools to navigate emerging challenges. To optimize evolutionary efficiency, we further introduce a Parallel Batch Evolution strategy. Empirical evaluations across five diverse benchmarks under a zero-start setting demonstrate significant performance gains over proprietary baselines. Additionally, complementary warm-start evaluations confirm that the accumulated general knowledge can be seamlessly transferred to novel domains. Finally, we propose a novel metric to monitor evolution convergence, serving as a function analogous to training loss in conventional optimization. We open-source our codebase, system traces, and evolved tools to facilitate future research in resilient, self-evolving intelligence.

[266] Balancing Sustainability And Performance: The Role Of Small-Scale LLMs In Agentic Artificial Intelligence Systems

Anh Khoa Ngo Ho, Martin Chauvin, Simon Gosset, Philippe Cordier, Boris Gamazaychikov

Main category: cs.AI

TL;DR: Smaller open-weight language models can reduce energy consumption in multi-agent systems while maintaining performance, offering sustainable AI design strategies.

DetailsMotivation: As LLMs become integral to agentic AI systems, their energy demands during inference pose sustainability challenges. The study investigates whether smaller-scale models can reduce energy consumption without compromising performance in real-world multi-agent environments.

Method: Conducted comparative analysis across language models of varying scales to quantify trade-offs between efficiency and performance. Proposed practical guidelines for sustainable AI design including optimal batch size configuration and computation resource allocation.

Result: Results show that smaller open-weight models can lower energy usage while preserving task quality in multi-agent environments.

Conclusion: The findings offer actionable strategies for developing scalable, environmentally responsible AI systems through sustainable model deployment practices.

Abstract: As large language models become integral to agentic artificial intelligence systems, their energy demands during inference may pose significant sustainability challenges. This study investigates whether deploying smaller-scale language models can reduce energy consumption without compromising responsiveness and output quality in multi-agent, real-world environments. We conduct a comparative analysis across language models of varying scales to quantify trade-offs between efficiency and performance. Results show that smaller open-weight models can lower energy usage while preserving task quality. Building on these findings, we propose practical guidelines for sustainable artificial intelligence design, including optimal batch size configuration and computation resource allocation. These insights offer actionable strategies for developing scalable, environmentally responsible artificial intelligence systems.

[267] How does information access affect LLM monitors’ ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward

Main category: cs.AI

TL;DR: LLM monitors perform better with less information (less-is-more effect), leading to extract-and-evaluate monitoring that improves sabotage detection by 16.8 percentage points.

DetailsMotivation: To control potentially misaligned AI agents that exhibit deception and hidden objectives, using LLMs as monitors is promising, but the optimal information access for monitors is unclear.

Method: Introduces extract-and-evaluate (EaE) monitoring: a hierarchical approach where one monitor extracts relevant excerpts from agent trajectories and another scores them, leveraging the less-is-more effect.
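
The two-stage pipeline is simple to sketch. Below, `call_llm(prompt) -> str` is a placeholder for any chat-completion client, and the prompts are illustrative, not the paper's:

```python
def extract_and_evaluate(trajectory: str, call_llm) -> float:
    """Hierarchical EaE monitoring sketch: one monitor extracts suspicious
    excerpts, and a second scores only those excerpts (the less-is-more
    effect in action)."""
    excerpts = call_llm(
        "From the agent trajectory below, quote verbatim the excerpts "
        "most relevant to possible sabotage.\n\n" + trajectory
    )
    score = call_llm(
        "Rate how suspicious these excerpts are, 0-100. "
        "Answer with only the number.\n\n" + excerpts
    )
    return float(score.strip())
```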

Result: EaE improves sabotage detection by 16.8 percentage points over baselines in BigCodeBench-Sabotage, outperforms or is competitive in other settings, and agents unaware of monitoring are caught more easily.

Conclusion: Information filtering improves LLM monitoring performance, with EaE being an effective approach that leverages the counterintuitive less-is-more effect for automated oversight.

Abstract: Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents’ reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduce extract-and-evaluate (EaE) monitoring – a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent’s trajectory and another monitor scores these excerpts. In BigCodeBench-Sabotage with a GPT-4.1-mini monitor, EaE improves sabotage detection rates by 16.8 percentage points over the next-best approach. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.

[268] Neuro-symbolic AI for Predictive Maintenance (PdM) – review and recommendations

Kyle Hamilton, Muhammad Intizar Ali

Main category: cs.AI

TL;DR: Survey paper reviewing predictive maintenance (PdM) methods over 5 years, comparing data-driven deep learning vs. traditional knowledge-based systems, and proposing neuro-symbolic AI as a hybrid solution for better accuracy, explainability, and robustness.

DetailsMotivation: Current PdM systems face limitations: data-driven methods need large labeled datasets, lack generalizability, and are opaque; traditional rule-based systems have poor accuracy and need expert tuning. A hybrid approach combining strengths of both is needed.

Method: Systematic review of PdM literature, analysis of data-driven vs. knowledge-based approaches, examination of hybrid systems, and proposal of neuro-symbolic AI architectures that integrate deep learning with symbolic logic.

Result: Data-driven methods show higher accuracy but suffer from data requirements and lack of explainability; traditional methods are interpretable but inaccurate. Neuro-symbolic AI offers potential to overcome these limitations by combining learning with symbolic reasoning.

Conclusion: Neuro-symbolic AI represents a promising direction for PdM, enabling more accurate, explainable, and robust systems by integrating deep learning with domain knowledge and symbolic logic.

Abstract: In this document, we perform a systematic review of the state of the art in Predictive Maintenance (PdM) over the last five years in industrial settings such as commercial buildings, pharmaceutical facilities, or semiconductor manufacturing. In general, data-driven methods, such as those based on deep learning, exhibit higher accuracy than traditional knowledge-based systems. These systems, however, are not without significant limitations. The need for large labeled data sets, a lack of generalizability to new environments (out-of-distribution generalization), and a lack of transparency at inference time are some of the obstacles to adoption in real-world environments. In contrast, traditional approaches based on domain expertise in the form of rules, logic, or first principles suffer from poor accuracy, many false positives, and a need for ongoing expert supervision and manual tuning. While the majority of approaches in recent literature utilize some form of data-driven architecture, there are hybrid systems which also take into account domain-specific knowledge. Such hybrid systems have the potential to overcome the weaknesses of either approach on its own while preserving their strengths. We propose taking the hybrid approach even further and integrating deep learning with symbolic logic, or Neuro-symbolic AI, to create more accurate, explainable, interpretable, and robust systems. We describe several neuro-symbolic architectures and examine their strengths and limitations within the PdM domain. We focus specifically on methods which involve the use of sensor data and manually crafted rules as inputs, describing concrete NeSy architectures. In short, this survey outlines the context of modern maintenance, defines key concepts, establishes a generalized framework, reviews current modeling approaches and challenges, and introduces the proposed focus on Neuro-symbolic AI (NeSy).

[269] Methods and Open Problems in Differentiable Social Choice: Learning Mechanisms, Decisions, and Alignment

Zhiyu An, Wan Du

Main category: cs.AI

TL;DR: Survey paper on differentiable social choice - applying machine learning to voting rules, mechanisms, and aggregation procedures by making them learnable and differentiable models optimized from data.

DetailsMotivation: Social choice has become foundational to modern ML systems (auctions, resource allocation, federated learning, participatory governance, LLM alignment), but these systems often implement social choice mechanisms implicitly without normative scrutiny. Need to bridge ML, economics, and social choice theory.

Method: Survey methodology synthesizing work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning. Formulates social choice mechanisms as differentiable, learnable models that can be optimized from data.
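
As a generic illustration of the paradigm (not a construction taken from the paper), a positional scoring rule can be made learnable by treating its position weights as parameters and relaxing the argmax winner to a softmax:

```python
import torch

class PositionalScoringRule(torch.nn.Module):
    """Learnable positional scoring rule: each ballot position carries a
    trainable weight; fixing weights to m-1, ..., 0 recovers Borda count."""

    def __init__(self, num_candidates: int):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.ones(num_candidates))

    def forward(self, rankings: torch.Tensor) -> torch.Tensor:
        # rankings: (num_voters, m); rankings[v, p] is the candidate that
        # voter v places in position p.
        num_voters, m = rankings.shape
        scores = torch.zeros(m).scatter_add(
            0, rankings.reshape(-1), self.weights.repeat(num_voters)
        )
        # Softmax relaxes the argmax winner into a differentiable
        # distribution trainable against observed outcomes.
        return torch.softmax(scores, dim=0)

rule = PositionalScoringRule(num_candidates=3)
ballots = torch.tensor([[0, 1, 2], [0, 2, 1], [1, 0, 2]])
print(rule(ballots))  # uniform until the weights are trained
```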

Result: Synthesis showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs in differentiable social choice frameworks.

Conclusion: Identifies 36 open problems defining a new research agenda at the intersection of machine learning, economics, and social choice theory through the differentiable social choice paradigm.

Abstract: Social choice is no longer a peripheral concern of political theory or economics; it has become a foundational component of modern machine learning systems. From auctions and resource allocation to federated learning, participatory governance, and the alignment of large language models, machine learning pipelines increasingly aggregate heterogeneous preferences, incentives, and judgments into collective decisions. In effect, many contemporary machine learning systems already implement social choice mechanisms, often implicitly and without explicit normative scrutiny. This Review surveys differentiable social choice: an emerging paradigm that formulates voting rules, mechanisms, and aggregation procedures as learnable, differentiable models optimized from data. We synthesize work across auctions, voting, budgeting, liquid democracy, decentralized aggregation, and inverse mechanism learning, showing how classical axioms and impossibility results reappear as objectives, constraints, and optimization trade-offs. We conclude by identifying 36 open problems defining a new research agenda at the intersection of machine learning, economics, and social choice theory.

[270] DeepRead: A Structure-Aware, Multi-Turn Document Reasoning Agent

Zhanli Li, Huiwen Tian, Lvzhou Luo, Yixuan Cao, Ping Luo

Main category: cs.AI

TL;DR: DeepRead is a structure-aware, multi-turn document reasoning agent that leverages document hierarchical organization for improved long-document question answering through coordinated retrieval and reading tools.

DetailsMotivation: Existing agentic search frameworks treat long documents as flat collections of chunks, underutilizing document-native priors like hierarchical organization and sequential discourse structure, which limits their effectiveness for long-document question answering.

Method: DeepRead converts PDFs to structured Markdown preserving headings/paragraph boundaries, indexes at paragraph level with coordinate-style metadata encoding section identity and order. Equips LLM with two tools: Retrieve (localizes relevant paragraphs with structural context) and ReadSection (enables contiguous, order-preserving reading within specified section ranges).
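
The two tools are easy to picture over a paragraph-level index keyed by (section, paragraph) coordinates. A toy sketch, with keyword-overlap scoring standing in for the real retriever:

```python
# Toy paragraph index keyed by coordinate-style metadata:
# (section_id, in_section_order) -> paragraph text.
INDEX = {
    ("2.1", 0): "We convert each PDF to Markdown, preserving headings.",
    ("2.1", 1): "Every paragraph is assigned a structural coordinate.",
    ("3.2", 0): "Retrieve localizes paragraphs; ReadSection reads onward.",
}

def retrieve(query: str, k: int = 2):
    """Keyword-overlap scoring stands in for the real retriever; returns
    paragraphs together with their structural coordinates."""
    q = set(query.lower().split())
    ranked = sorted(INDEX.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return ranked[:k]

def read_section(section_id: str, start: int = 0, end: int = 10**9):
    """Contiguous, order-preserving reading within one section."""
    return [text for (sec, p), text in sorted(INDEX.items())
            if sec == section_id and start <= p <= end]

print(retrieve("paragraph coordinate"))
print(read_section("2.1"))
```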

Result: DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering, with validated synergistic effect between retrieval and reading tools, and exhibits human-like “locate then read” behavior.

Conclusion: Explicitly operationalizing document structural priors through coordinated retrieval and reading tools enables more effective long-document reasoning, resembling human reading patterns and outperforming flat-chunk approaches.

Abstract: With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like “locate then read” behavior.

[271] Steering LLMs via Scalable Interactive Oversight

Enyu Zhou, Zhiheng Xi, Long Ma, Zhihao Zhang, Shihan Dou, Zhikai Lei, Guoteng Wang, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.AI

TL;DR: A framework called Scalable Interactive Oversight that decomposes complex AI tasks into manageable decision trees to help non-experts provide effective supervision and guidance to large language models.

DetailsMotivation: Large language models can automate complex tasks, but users struggle to guide them effectively due to insufficient domain expertise, difficulty articulating precise intent, and inability to validate complex outputs, creating a supervision gap in scalable oversight.

Method: Proposes Scalable Interactive Oversight framework that decomposes complex intent into recursive tree of manageable decisions, elicits low-burden feedback at each node, and recursively aggregates these signals into precise global guidance.
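
A minimal sketch of the recursive decompose-elicit-aggregate pattern described here; the node structure and the toy aggregation rule are illustrative assumptions, not the paper's framework:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class DecisionNode:
    """One manageable decision in the recursive intent tree."""
    question: str
    children: List["DecisionNode"] = field(default_factory=list)
    answer: Optional[str] = None

def oversee(node: DecisionNode, ask_user: Callable[[str], str]) -> str:
    """Elicit low-burden feedback at each leaf, then aggregate child
    answers into the parent's guidance (toy string aggregation)."""
    if not node.children:
        node.answer = ask_user(node.question)
    else:
        parts = [oversee(child, ask_user) for child in node.children]
        node.answer = node.question + ": " + "; ".join(parts)
    return node.answer

root = DecisionNode("Landing page", [
    DecisionNode("Primary color?"), DecisionNode("Tagline?"),
])
canned = {"Primary color?": "navy", "Tagline?": "ship faster"}
print(oversee(root, ask_user=lambda q: canned[q]))
```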

Result: Validated in web development tasks, enables non-experts to produce expert-level Product Requirement Documents with 54% improvement in alignment. Framework can be optimized via Reinforcement Learning using only online user feedback.

Conclusion: Provides practical pathway for maintaining human control as AI scales by enabling effective supervision through decomposed decision trees and interactive feedback aggregation.

Abstract: As Large Language Models increasingly automate complex, long-horizon tasks such as “vibe coding”, a supervision gap has emerged. While models excel at execution, users often struggle to guide them effectively due to insufficient domain expertise, the difficulty of articulating precise intent, and the inability to reliably validate complex outputs. This presents a critical challenge in scalable oversight: enabling humans to responsibly steer AI systems on tasks that surpass their own ability to specify or verify. To tackle this, we propose Scalable Interactive Oversight, a framework that decomposes complex intent into a recursive tree of manageable decisions to amplify human supervision. Rather than relying on open-ended prompting, our system elicits low-burden feedback at each node and recursively aggregates these signals into precise global guidance. Validated in a web development task, our framework enables non-experts to produce expert-level Product Requirement Documents, achieving a 54% improvement in alignment. Crucially, we demonstrate that this framework can be optimized via Reinforcement Learning using only online user feedback, offering a practical pathway for maintaining human control as AI scales.

[272] Are AI Capabilities Increasing Exponentially? A Competing Hypothesis

Haosen Ge, Hamsa Bastani, Osbert Bastani

Main category: cs.AI

TL;DR: The paper critiques claims of exponential AI growth, arguing data shows inflection point has already passed and proposing a decomposed model of base vs reasoning capabilities with different improvement rates.

DetailsMotivation: To challenge the METR report's claim that AI capabilities have exhibited exponential growth since 2019, arguing the data doesn't support this and highlighting fragility of existing exponential growth forecasts.

Method: Re-analyzes METR’s data by fitting sigmoid/logistic curves to show inflection point has already passed, and proposes a more complex model decomposing AI capabilities into base and reasoning capabilities with individual improvement rates.
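
The core re-analysis step, fitting a logistic curve and reading off its inflection point, is straightforward to reproduce with SciPy. The data below is an illustrative stand-in, not METR's:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Sigmoid capability curve: ceiling L, growth rate k, inflection t0."""
    return L / (1.0 + np.exp(-k * (t - t0)))

# Illustrative stand-in data (year vs. capability score), NOT METR's.
t = np.array([2019, 2020, 2021, 2022, 2023, 2024, 2025], dtype=float)
y = np.array([0.05, 0.12, 0.27, 0.50, 0.71, 0.85, 0.92])

(L, k, t0), _ = curve_fit(logistic, t, y, p0=[1.0, 1.0, 2022.0])
print(f"fitted inflection point: {t0:.1f}")  # already in the past?
```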

Result: Finds that fitting sigmoid curves to METR’s data shows inflection point has already passed, and the decomposed model supports hypothesis that AI capabilities will exhibit inflection point in near future rather than continuing exponential growth.

Conclusion: AI capabilities are not experiencing exponential growth as claimed; existing forecasts of exponential growth are fragile and the inflection point may have already occurred or will occur soon.

Abstract: Rapidly increasing AI capabilities have substantial real-world consequences, ranging from AI safety concerns to labor market impacts. The Model Evaluation & Threat Research (METR) report argues that AI capabilities have exhibited exponential growth since 2019. In this note, we argue that the data does not support exponential growth, even over shorter time horizons. Whereas the METR study claims that fitting sigmoid/logistic curves results in inflection points far in the future, we fit a sigmoid curve to their current data and find that the inflection point has already passed. In addition, we propose a more complex model that decomposes AI capabilities into base and reasoning capabilities, exhibiting individual rates of improvement. We prove that this model supports our hypothesis that AI capabilities will exhibit an inflection point in the near future. Our goal is not to establish a rigorous forecast of our own, but to highlight the fragility of existing forecasts of exponential growth.

[273] VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health

Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, Matt Hawrilenko

Main category: cs.AI

TL;DR: VERA-MH is an automated AI safety evaluation framework for mental health chatbots, validated for suicide risk detection through clinician ratings and LLM judge alignment.

DetailsMotivation: Millions use AI chatbots for psychological support, but safety concerns remain the most pressing issue in AI for mental health, necessitating evidence-based safety benchmarks.

Method: Simulated conversations between LLM-based user-agents and AI chatbots were rated by licensed mental health clinicians using a scoring rubric, then compared with an LLM judge using the same rubric to evaluate rating alignment.
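
The paper reports chance-corrected inter-rater reliability; Cohen's kappa is one standard such statistic (whether it is this study's exact choice is not stated in the abstract). For illustration with scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative binary safety ratings (1 = safe behavior, 0 = unsafe)
# from two raters on the same ten simulated conversations.
clinician_consensus = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
llm_judge           = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

print(cohen_kappa_score(clinician_consensus, llm_judge))
```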

Result: Clinicians showed high inter-rater reliability (0.77), establishing a gold-standard reference, and the LLM judge strongly aligned with clinical consensus (0.81 IRR), supporting VERA-MH’s validity and reliability.

Conclusion: VERA-MH is a valid and reliable open-source automated AI safety evaluation for mental health, with future research planned to expand its generalizability and target additional safety areas.

Abstract: Millions now use generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then examined rating alignment (a) among individual clinicians and (b) between clinician consensus and the LLM judge, and (c) summarized clinicians’ ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.

[274] AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

Ruijie Shi, Houbin Zhang, Yuecheng Han, Yuheng Wang, Jingru Fan, Runde Yang, Yufan Dang, Huatao Li, Dewen Liu, Yuan Cheng, Chen Qian

Main category: cs.AI

TL;DR: AgentXRay: A search-based framework that reconstructs interpretable agent workflows from black-box systems using only input-output access, enabling better understanding and control of LLM-based agents.

DetailsMotivation: LLM-based agentic systems are powerful but often operate as black boxes with opaque internal workflows, making them difficult to interpret and control. Current frameworks lack methods to reconstruct explicit, editable workflows from deployed systems.

Method: Introduces Agentic Workflow Reconstruction (AWR) task and AgentXRay framework that formulates AWR as combinatorial optimization over discrete agent roles and tool invocations. Uses Monte Carlo Tree Search enhanced by Red-Black Pruning to navigate vast search space, matching target outputs via observable proxy metrics without accessing model parameters.

Result: Experiments show AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets across diverse domains.

Conclusion: AgentXRay successfully reconstructs interpretable, editable workflows from black-box agentic systems, improving transparency and control while being efficient and practical for real-world deployment.

Abstract: Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable stand-in workflow that approximates a black-box system using only input–output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.

[275] THOR: Inductive Link Prediction for Hyper-Relational Knowledge Graphs

Weijian Yu, Yuhuan Lu, Dingqi Yang

Main category: cs.AI

TL;DR: THOR is an inductive link prediction method for hyper-relational knowledge graphs that learns transferable structural patterns across different KGs, enabling predictions on unseen vocabularies.

DetailsMotivation: Existing hyper-relational KG link prediction methods are mostly transductive, limited to specific vocabularies and lacking generalizability to unseen entities/relations. There's a need for inductive techniques that can transfer learned patterns across different knowledge graphs.

Method: Proposes THOR with two key components: 1) Relation and entity foundation graphs that model fundamental inter- and intra-fact interactions agnostic to specific relations/entities, 2) Two parallel graph encoders followed by a transformer decoder for masked training and fully-inductive inference.

Result: Outperforms baselines on 12 datasets with 66.1%, 55.9%, and 20.4% improvement over best-performing rule-based, semi-inductive, and fully-inductive techniques respectively. Ablation studies confirm key design factors capture structural invariance transferable across HKGs.

Conclusion: THOR effectively addresses the inductive link prediction challenge for hyper-relational KGs by learning transferable structural patterns, enabling generalization to unseen vocabularies and improving reasoning performance.

Abstract: Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts where a relation connects two entities, modern KGs observe an increasing number of hyper-relational facts, where an arbitrary number of qualifiers associated with a triplet provide auxiliary information to further describe the rich semantics of the triplet, which can effectively boost the reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and subsequently can only make predictions within the same vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce both relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR is designed to learn from the two foundation graphs with two parallel graph encoders followed by a transformer decoder, which supports efficient masked training and fully-inductive inference. We conduct a thorough evaluation of THOR in hyper-relational link prediction tasks on 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvement over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.

cs.SD

[276] Misophonia Trigger Sound Detection on Synthetic Soundscapes Using a Hybrid Model with a Frozen Pre-Trained CNN and a Time-Series Module

Kurumi Sashida, Gouhei Tanaka

Main category: cs.SD

TL;DR: The paper investigates sound event detection for misophonia trigger sounds using synthetic soundscapes and hybrid CNN models with various temporal modules (GRU, LSTM, ESN) for personalized assistive technology.

DetailsMotivation: Misophonia is a disorder causing intense negative emotional responses to specific everyday sounds, impairing daily functioning. Assistive technologies that can detect trigger sounds could help reduce distress and improve quality of life for affected individuals.

Method: The study uses synthetic soundscapes generated via audio synthesis techniques due to scarcity of real-world misophonia data. It employs hybrid CNN-based models combining feature extraction with frozen pre-trained CNN backbones and trainable temporal modules (GRUs, LSTMs, ESNs and their bidirectional variants). Evaluation includes multi-class trigger sound detection and few-shot personalization tasks.
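
The hybrid architecture amounts to a frozen feature extractor feeding a trainable recurrent head. A minimal PyTorch sketch, with a toy Conv2d stack standing in for the pre-trained backbone and assumed tensor shapes:

```python
import torch
import torch.nn as nn

class FrozenCNNBiGRU(nn.Module):
    """Hybrid SED sketch: a frozen feature extractor feeds a trainable
    bidirectional GRU and a frame-wise event classifier."""

    def __init__(self, n_mels=64, n_classes=5, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        for p in self.backbone.parameters():
            p.requires_grad = False                # frozen, "pre-trained"
        self.rnn = nn.GRU(64 * n_mels, hidden, batch_first=True,
                          bidirectional=True)      # trainable temporal module
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, mel):                        # mel: (B, 1, T, n_mels)
        feats = self.backbone(mel)                 # (B, 64, T, n_mels)
        B, C, T, F = feats.shape
        seq = feats.permute(0, 2, 1, 3).reshape(B, T, C * F)
        out, _ = self.rnn(seq)
        return torch.sigmoid(self.head(out))       # per-frame event probs

model = FrozenCNNBiGRU()
probs = model(torch.randn(2, 1, 100, 64))          # (2, 100, 5)
```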

Result: Bidirectional temporal modeling consistently improves detection performance, with BiGRU achieving best overall accuracy. BiESN attains competitive performance with orders of magnitude fewer trainable parameters. In few-shot personalization tasks for “eating sound” detection, BiESN shows robust and stable performance with limited training data.

Conclusion: Lightweight temporal modules like BiESN are promising for personalized misophonia trigger sound detection, offering efficient parameter usage and robust performance in few-shot learning scenarios for assistive technologies.

Abstract: Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module such as gated recurrent units (GRUs), long short-term memories (LSTMs), echo state networks (ESNs), and their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot “eating sound” detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.

[277] EMG-to-Speech with Fewer Channels

Injune Hwang, Jaejun Lee, Kyogu Lee

Main category: cs.SD

TL;DR: EMG channel analysis for silent speech interfaces shows complementary channel relationships matter most, with pretraining and fine-tuning strategies mitigating performance loss from sensor reduction.

DetailsMotivation: Surface EMG is promising for silent speech interfaces but depends heavily on sensor placement and channel availability. The paper aims to understand how individual and combined EMG channels contribute to speech reconstruction performance to develop more practical, lightweight systems.

Method: Analyzed contribution of individual and combined EMG channels to speech reconstruction performance, examined phoneme classification accuracy under channel ablations, and addressed performance degradation through pretraining models on full 8-channel data using random channel dropout followed by fine-tuning on reduced-channel subsets.
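
The random channel dropout used for pretraining can be implemented as a simple augmentation that zeroes a random subset of channels per clip. A sketch under an assumed (batch, channels, time) layout; the minimum-channels parameter is illustrative:

```python
import torch

def random_channel_dropout(emg: torch.Tensor, keep_min: int = 4) -> torch.Tensor:
    """Zero out a random subset of EMG channels per clip so the model
    learns channel-robust features. emg: (batch, channels, time)."""
    B, C, _ = emg.shape
    masks = []
    for _ in range(B):
        n_keep = torch.randint(keep_min, C + 1, (1,)).item()
        keep = torch.randperm(C)[:n_keep]          # surviving channels
        mask = torch.zeros(C)
        mask[keep] = 1.0
        masks.append(mask)
    return emg * torch.stack(masks).unsqueeze(-1)

batch = torch.randn(8, 8, 1000)                    # 8 clips, 8 channels
augmented = random_channel_dropout(batch)
```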

Result: Found that while certain EMG channels are individually informative, highest performance comes from subsets leveraging complementary relationships. Phoneme classification showed interpretable patterns reflecting anatomical muscle roles. Fine-tuning consistently outperformed training from scratch for 4-6 channel settings, with best dropout strategy depending on channel count.

Conclusion: Performance degradation from sensor reduction can be mitigated through pretraining and channel-aware design, supporting development of lightweight, practical EMG-based silent speech systems.

Abstract: Surface electromyography (EMG) is a promising modality for silent speech interfaces, but its effectiveness depends heavily on sensor placement and channel availability. In this work, we investigate the contribution of individual and combined EMG channels to speech reconstruction performance. Our findings reveal that while certain EMG channels are individually more informative, the highest performance arises from subsets that leverage complementary relationships among channels. We also analyzed phoneme classification accuracy under channel ablations and observed interpretable patterns reflecting the anatomical roles of the underlying muscles. To address performance degradation from channel reduction, we pretrained models on full 8-channel data using random channel dropout and fine-tuned them on reduced-channel subsets. Fine-tuning consistently outperformed training from scratch for 4–6 channel settings, with the best dropout strategy depending on the number of channels. These results suggest that performance degradation from sensor reduction can be mitigated through pretraining and channel-aware design, supporting the development of lightweight and practical EMG-based silent speech systems.

[278] Scaling Speech Tokenizers with Diffusion Autoencoders

Yuancheng Wang, Zhenyu Tang, Yun Wang, Arthur Hinsvark, Yingru Liu, Yinghao Li, Kainan Peng, Junyi Ao, Mingbo Ma, Mike Seltzer, Qing He, Xubo Liu

Main category: cs.SD

TL;DR: Speech Diffusion Tokenizer (SiTok) is a 1.6B parameter diffusion autoencoder that learns semantic-rich representations for speech understanding while enabling high-fidelity audio reconstruction at extremely low token rates (12.5 Hz) and bit rates (200 bps).

DetailsMotivation: Existing speech tokenizers face challenges in balancing semantic encoding for understanding vs. acoustic encoding for reconstruction, and achieving both low bit rates and low token rates simultaneously.

Method: Proposes SiTok, a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. Scaled to 1.6B parameters and trained on 2 million hours of speech.

Result: Outperforms strong baselines on understanding, reconstruction, and generation tasks at extremely low token rate of 12.5 Hz and bit-rate of 200 bits-per-second.

Conclusion: SiTok successfully addresses the trade-off between semantic understanding and acoustic reconstruction in speech tokenization, achieving state-of-the-art performance at unprecedented low rates.

Abstract: Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bit-rate of 200 bits-per-second.

[279] Hierarchical Activity Recognition and Captioning from Long-Form Audio

Peng Zhang, Qingyu Luo, Philip J. B. Jackson, Wenwu Wang

Main category: cs.SD

TL;DR: MultiAct: A dataset and benchmark for hierarchical understanding of long-form audio activities with multi-level annotations and a unified model for classification, detection, and captioning.

DetailsMotivation: Most audio understanding research focuses on short clips and isolated events, but real-world audio activities are long-duration and hierarchical. There's a need for datasets and models that can handle complex, structured activities over extended time periods.

Method: Introduces MultiAct dataset with long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities, events) plus captions and summaries. Proposes a unified hierarchical model for joint classification, detection, sequence prediction, and multi-resolution captioning.

Result: Establishes strong baselines on MultiAct benchmark and reveals key challenges in modeling hierarchical and compositional structure of long-form audio. Shows limitations of current approaches in capturing complex long-range relationships.

Conclusion: MultiAct enables research on hierarchical audio understanding of long-form activities. Future work should focus on methods better suited for capturing complex, long-range relationships in extended audio sequences.

Abstract: Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing the complex, long-range relationships in long-form audio.

[280] AI-Generated Music Detection in Broadcast Monitoring

David Lopez-Ayala, Asier Cabello, Pablo Zinemanas, Emilio Molina, Martin Rocamora

Main category: cs.SD

TL;DR: AI-OpenBMAT dataset for broadcast-style AI music detection, addressing challenges of short excerpts and speech masking in broadcast audio.

DetailsMotivation: Existing AI music detection methods are designed for clean, full-length music tracks in streaming contexts, but fail in broadcast scenarios where music appears as short excerpts often masked by dominant speech.

Method: Created AI-OpenBMAT dataset with 3,294 one-minute audio excerpts following real television audio patterns, combining human-made production music with AI-generated continuations using Suno v3.5. Benchmarked CNN baseline and SpectTTTra models on SNR and duration robustness.

Result: Models that perform well in streaming scenarios suffer substantial degradation in broadcast settings, with F1-scores dropping below 60% when music is in background or has short duration. Speech masking and short music length identified as critical challenges.

Conclusion: Broadcast audio presents unique challenges for AI music detection requiring specialized datasets and methods. AI-OpenBMAT serves as benchmark for developing detectors meeting industrial broadcast requirements.

Abstract: AI music generators have advanced to the point where their outputs are often indistinguishable from human compositions. While detection methods have emerged, they are typically designed and validated in music streaming contexts with clean, full-length tracks. Broadcast audio, however, poses a different challenge: music appears as short excerpts, often masked by dominant speech, conditions under which existing detectors fail. In this work, we introduce AI-OpenBMAT, the first dataset tailored to broadcast-style AI-music detection. It contains 3,294 one-minute audio excerpts (54.9 hours) that follow the duration patterns and loudness relations of real television audio, combining human-made production music with stylistically matched continuations generated with Suno v3.5. We benchmark a CNN baseline and state-of-the-art SpectTTTra models to assess SNR and duration robustness, and evaluate on a full broadcast scenario. Across all settings, models that excel in streaming scenarios suffer substantial degradation, with F1-scores dropping below 60% when music is in the background or has a short duration. These results highlight speech masking and short music length as critical open challenges for AI music detection, and position AI-OpenBMAT as a benchmark for developing detectors capable of meeting industrial broadcast requirements.

[281] DynFOA: Generating First-Order Ambisonics with Conditional Diffusion for Dynamic and Acoustically Complex 360-Degree Videos

Ziyu Luo, Lin Chen, Qiang Qu, Xiaoming Chen, Yiran Shen

Main category: cs.SD

TL;DR: DynFOA: A framework using dynamic acoustic perception and conditional diffusion to generate high-fidelity first-order ambisonics (FOA) spatial audio from 360-degree videos by modeling complex acoustic scenes with 3D reconstruction and sound source tracking.

DetailsMotivation: Current methods for generating spatial audio from 360-degree videos fail to handle dynamic scenes, complex acoustic effects (occlusion, reflections, reverberation), and multiple moving sound sources, limiting immersive VR experiences.

Method: Uses video encoder for sound source detection/localization, depth/semantic estimation, and 3D scene reconstruction with Gaussian Splatting; audio encoder captures spatial motion and 4D trajectories; diffusion-based FOA generator fine-tuned for real-time spatial cue adjustment.

Result: Outperforms existing methods in spatial accuracy, acoustic fidelity, and distribution matching; improves user experience for VR and immersive media applications.

Conclusion: DynFOA provides robust, scalable approach for realistic dynamic spatial audio generation in complex 360-degree scenes, advancing immersive VR/AR experiences.

Abstract: Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener’s viewpoint. The audio encoder then captures the spatial motion and temporal 4D sound source trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.

[282] Reciprocal Latent Fields for Precomputed Sound Propagation

Hugo Seuté, Pranai Vasudev, Etienne Richan, Louis-Xavier Buffoni

Main category: cs.SD

TL;DR: RLF is a memory-efficient framework using volumetric latent embeddings and Riemannian metric learning to encode acoustic parameters for realistic sound propagation in virtual scenes, achieving perceptual equivalence to ground-truth simulations with orders of magnitude memory reduction.

DetailsMotivation: Physically accurate wave-based sound simulations are computationally prohibitive for real-time applications, and existing wave coding methods produce unmanageably large parameter sets for large environments with many source-receiver pairs.

Method: Reciprocal Latent Fields (RLF) uses a volumetric grid of trainable latent embeddings decoded with symmetric functions to ensure acoustic reciprocity, with various decoders including Riemannian metric learning approaches to better reproduce acoustic phenomena.
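
Reciprocity follows directly from decoding a symmetric combination of the source and receiver latents: swapping them cannot change the output. A minimal sketch with assumed dimensions (sum as the symmetric function; the paper studies a variety of decoders):

```python
import torch
import torch.nn as nn

class ReciprocalDecoder(nn.Module):
    """Reciprocity-preserving decoder sketch: source and receiver latents
    enter through a symmetric combination, so swapping source and
    receiver yields identical acoustic parameters by construction."""

    def __init__(self, latent_dim=32, n_params=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_params),
        )

    def forward(self, z_src, z_rcv):
        return self.mlp(z_src + z_rcv)   # symmetric in (src, rcv)

grid = nn.Embedding(16 ** 3, 32)          # trainable volumetric latent grid
dec = ReciprocalDecoder()
z_a, z_b = grid(torch.tensor([100])), grid(torch.tensor([4000]))
assert torch.allclose(dec(z_a, z_b), dec(z_b, z_a))   # reciprocity holds
```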

Result: RLF maintains replication quality while reducing memory footprint by several orders of magnitude, and subjective listening tests show sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.

Conclusion: RLF provides an effective memory-efficient solution for encoding acoustic parameters in virtual scenes, enabling realistic sound propagation with perceptual quality matching ground-truth while dramatically reducing storage requirements.

Abstract: Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.

[283] ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo, Xuerui Yang

Main category: cs.SD

TL;DR: ACE-Step v1.5 is an efficient open-source music foundation model that achieves commercial-grade music generation on consumer hardware, featuring a hybrid LM-DiT architecture with intrinsic reinforcement learning for bias-free alignment.

DetailsMotivation: To create an open-source music foundation model that brings commercial-quality music generation to consumer hardware while addressing biases from external reward models and human preferences.

Method: Uses a novel hybrid architecture where a Language Model acts as an omni-capable planner, transforming user queries into comprehensive song blueprints and synthesizing metadata via Chain-of-Thought to guide a Diffusion Transformer. Alignment is achieved through intrinsic reinforcement learning using only the model’s internal mechanisms.

Result: Achieves quality beyond most commercial music models while being extremely fast (under 2 seconds per song on A100, under 10 seconds on RTX 3090), runs locally with <4GB VRAM, supports lightweight personalization via LoRA, and maintains strict prompt adherence across 50+ languages.

Conclusion: ACE-Step v1.5 paves the way for powerful, accessible music generation tools that integrate into creative workflows while eliminating biases from external reward systems.

Abstract: We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast – under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints – scaling from short loops to 10-minute compositions – while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model’s internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities – such as cover generation, repainting, and vocal-to-BGM conversion – while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/

[284] AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

Main category: cs.SD

TL;DR: Sparse Autoencoders (SAEs) are applied to audio models (Whisper and HuBERT) for interpretable feature analysis, showing stable, interpretable features that capture acoustic/semantic information and align with human EEG during speech perception.

DetailsMotivation: SAEs are powerful for interpreting neural representations but remain underexplored in audio. The paper aims to apply SAEs to audio models to understand their representations, evaluate stability/interpretability, and demonstrate practical utility.

Method: Train SAEs across all encoder layers of Whisper and HuBERT models. Evaluate feature stability across random seeds, reconstruction quality, interpretability (capturing acoustic/semantic info, specific events), and practical applications like feature steering for error reduction.
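
The SAE recipe itself is standard: an overcomplete dictionary trained to reconstruct activations under an L1 sparsity penalty. A minimal PyTorch sketch; the dimensions and penalty weight are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE with ReLU codes, trained to reconstruct
    encoder-layer activations."""

    def __init__(self, d_model=768, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))        # sparse feature activations
        return self.dec(z), z

sae = SparseAutoencoder()
acts = torch.randn(64, 768)                # e.g., Whisper encoder activations
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + L1
loss.backward()
```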

Result: Over 50% of features remain consistent across seeds; reconstruction quality preserved. SAE features capture general acoustic/semantic info and specific events (environmental noises, paralinguistic sounds). Feature steering reduces Whisper’s false speech detections by 70% with minimal WER increase. SAE features correlate with human EEG during speech perception.

Conclusion: SAEs are effective for interpreting audio model representations, showing stable, interpretable features with practical applications and alignment with human neural processing. Opens avenues for audio model interpretability and human-aligned AI.

Abstract: Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper’s false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.

cs.LG

[285] NanoNet: Parameter-Efficient Learning with Label-Scarce Supervision for Lightweight Text Mining Model

Qianren Mao, Yashuo Luo, Ziqi Qin, Junnan Liu, Weifeng Jiang, Zhijun Chen, Zhuoran Li, Likang Xiao, Chuou Xu, Qili Zhang, Hanwen Hao, Jingzheng Li, Chunghua Lin, Jianxin Li, Philip S. Yu

Main category: cs.LG

TL;DR: NanoNet is a lightweight semi-supervised framework for text mining that combines parameter-efficient learning with online knowledge distillation and mutual learning regularization to create small, fast-inference models with limited labeled data.

DetailsMotivation: Existing lightweight semi-supervised learning approaches using knowledge transfer and co-training are computationally intensive and prone to local optima. The authors aim to explore low-cost scenarios for text mining: limited labeled supervision, lightweight fine-tuning, and rapid-inference small models.

Method: Proposes NanoNet framework that uses parameter-efficient learning with limited supervision. It employs online knowledge distillation to generate multiple small models and enhances their performance through mutual learning regularization, reducing training costs and supervision requirements.
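
The mutual-learning regularizer can be sketched as a symmetric KL term between two peer models' predictions, plus cross-entropy on whatever labels exist. The exact loss weighting in NanoNet is not specified here; this is an illustrative form:

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_a, logits_b, labels=None, alpha=1.0):
    """Symmetric KL between two peer models' predictive distributions;
    labeled examples (when available) add cross-entropy terms."""
    log_pa = F.log_softmax(logits_a, dim=-1)
    log_pb = F.log_softmax(logits_b, dim=-1)
    kl = (F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")
          + F.kl_div(log_pb, log_pa.exp(), reduction="batchmean"))
    loss = alpha * kl
    if labels is not None:  # limited labeled supervision
        loss = (loss + F.cross_entropy(logits_a, labels)
                     + F.cross_entropy(logits_b, labels))
    return loss

a, b = torch.randn(16, 4), torch.randn(16, 4)
print(mutual_learning_loss(a, b, labels=torch.randint(0, 4, (16,))))
```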

Result: The framework yields lightweight models for downstream inference with reduced computational costs and minimized supervision requirements, though specific quantitative results are not provided in the abstract.

Conclusion: NanoNet provides an effective approach for lightweight text mining by integrating parameter-efficient learning, online knowledge distillation, and mutual learning regularization to address computational intensity and local optima issues in existing methods.

Abstract: The lightweight semi-supervised learning (LSL) strategy provides an effective approach to conserving labeled samples and minimizing model inference costs. Prior research has effectively applied knowledge transfer learning and co-training regularization from large to small models in LSL. However, such training strategies are computationally intensive and prone to local optima, thereby increasing the difficulty of finding the optimal solution. This has prompted us to investigate the feasibility of integrating three low-cost scenarios for text mining tasks: limited labeled supervision, lightweight fine-tuning, and rapid-inference small models. We propose NanoNet, a novel framework for lightweight text mining that implements parameter-efficient learning with limited supervision. It employs online knowledge distillation to generate multiple small models and enhances their performance through mutual learning regularization. The entire process leverages parameter-efficient learning, reducing training costs and minimizing supervision requirements, ultimately yielding a lightweight model for downstream inference.

[286] Agentic Workflow Using RBA$_θ$ for Event Prediction

Purbak Sengupta, Sambeet Mishra, Sonal Shreya

Main category: cs.LG

TL;DR: Event-first forecasting paradigm for wind power ramp events using frequency-aware deep learning with wavelet decomposition, enabling direct event prediction and trajectory reconstruction for better transferability across sites.

DetailsMotivation: Wind power ramp events are challenging to forecast due to strong variability, multi-scale dynamics, and site-specific effects. Traditional trajectory-first approaches with post-hoc event extraction have limited generalization across different wind farms.

Method: Proposes an event-first forecasting paradigm that directly predicts ramp events then reconstructs power trajectories. Uses enhanced Ramping Behaviour Analysis (RBA$_θ$) for event representation, integrates statistical, machine learning, and deep learning models. Introduces event-first deep architecture with wavelet-based frequency decomposition, temporal excitation features, and adaptive feature selection. Includes agentic forecasting layer for dynamic workflow selection.
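
The frequency-band analysis can be grounded with a discrete wavelet transform; below is a minimal sketch using PyWavelets, where the wavelet family, decomposition level, and synthetic signal are illustrative stand-ins for the paper's setup.

```python
import numpy as np
import pywt  # PyWavelets

# Stand-in for a normalized wind-power time series
power = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.3 * np.random.randn(1024)

# Multi-level DWT splits the signal into one coarse band and several detail bands
coeffs = pywt.wavedec(power, wavelet="db4", level=4)
approx, details = coeffs[0], coeffs[1:]

for i, d in enumerate(details, start=1):
    print(f"detail band {i}: {d.shape[0]} coeffs, energy {np.sum(d**2):.2f}")
# Mid-frequency detail bands are the ones the paper links to ramp magnitude/duration.
```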

Result: Shows that ramp magnitude and duration are governed by distinct mid-frequency bands, enabling accurate signal reconstruction from sparse event forecasts. Achieves stable long-horizon event prediction, physically consistent trajectory reconstruction, and zero-shot transfer to unseen wind farms.

Conclusion: Event-first, frequency-aware forecasting provides a transferable and operationally aligned alternative to trajectory-first wind-power prediction, with better generalization across sites and improved robustness.

Abstract: Wind power ramp events are difficult to forecast due to strong variability, multi-scale dynamics, and site-specific meteorological effects. This paper proposes an event-first, frequency-aware forecasting paradigm that directly predicts ramp events and reconstructs the power trajectory thereafter, rather than inferring events from dense forecasts. The framework is built on an enhanced Ramping Behaviour Analysis (RBA$_θ$) method’s event representation and progressively integrates statistical, machine-learning, and deep-learning models. Traditional forecasting models with post-hoc event extraction provide a strong, interpretable baseline but exhibit limited generalisation across sites. Direct event prediction using Random Forests improves robustness over survival-based formulations, motivating fully event-aware modelling. To capture the multi-scale nature of wind ramps, we introduce an event-first deep architecture that integrates wavelet-based frequency decomposition, temporal excitation features, and adaptive feature selection. The resulting sequence models enable stable long-horizon event prediction, physically consistent trajectory reconstruction, and zero-shot transfer to previously unseen wind farms. Empirical analysis shows that ramp magnitude and duration are governed by distinct mid-frequency bands, allowing accurate signal reconstruction from sparse event forecasts. An agentic forecasting layer is proposed, in which specialised workflows are selected dynamically based on operational context. Together, the framework demonstrates that event-first, frequency-aware forecasting provides a transferable and operationally aligned alternative to trajectory-first wind-power prediction.

[287] Toward Faithful and Complete Answer Construction from a Single Document

Zhaoyang Chen, Cody Fleming

Main category: cs.LG

TL;DR: EVE is a structured framework for document-grounded reasoning that decomposes reasoning into extraction, validation, and enumeration to improve completeness and faithfulness compared to standard LLM generation.

DetailsMotivation: Standard LLMs are biased toward high-probability continuations rather than exhaustive and faithful answers grounded in source content, lacking systematic mechanisms for completeness and faithfulness which conflicts with AI safety principles.

Method: EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into three stages: extraction, validation, and enumeration, unlike free-form prompting.
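
As a schematic, an extract-validate-enumerate pipeline might look like the following; the prompts, the `llm` callable, and the toy stand-in are hypothetical placeholders, not the paper's implementation.

```python
from typing import Callable, List

def eve_answer(document: str, question: str,
               llm: Callable[[str], str]) -> List[str]:
    """Hypothetical sketch of an extract -> validate -> enumerate pipeline."""
    # 1) Extraction: pull every candidate span that might answer the question.
    candidates = llm(
        f"List every passage in the document relevant to: {question}\n{document}"
    ).splitlines()

    # 2) Validation: keep only candidates the document actually supports.
    validated = [
        c for c in candidates
        if llm(f"Does this document support the claim '{c}'? Answer yes/no.\n{document}")
        .strip().lower().startswith("yes")
    ]

    # 3) Enumeration: return the validated items as the complete answer set.
    return validated

# Trivial stand-in "LLM" (echoes fixed strings) just to show the control flow:
fake_llm = lambda prompt: "yes" if "support" in prompt else "claim A\nclaim B"
print(eve_answer("...", "what are the claims?", fake_llm))
```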

Result: EVE enables consistent improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29% respectively, with a 31% gain in F1-score, breaking the trade-off between coverage and accuracy typical of single-pass LLM generation.

Conclusion: EVE effectively addresses limitations of standard LLM generation for document-grounded reasoning but exhibits performance saturation due to inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.

Abstract: Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, directly applying LLMs lacks systematic mechanisms to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which fundamentally conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design enables consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29%, respectively, with a corresponding 31% gain in F1-score. This effectively breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, while also mitigating generation truncation caused by length limitations. At the same time, we emphasize that EVE exhibits performance saturation due to the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.

[288] RuleSmith: Multi-Agent LLMs for Automated Game Balancing

Ziyao Zeng, Chen Liu, Tianyu Liu, Hao Wang, Xiatao Sun, Fengyu Yang, Xiaofeng Liu, Zhiwen Fan

Main category: cs.LG

TL;DR: RuleSmith: Automated game balancing framework using multi-agent LLMs for self-play simulation and Bayesian optimization to tune game parameters.

DetailsMotivation: Game balancing traditionally requires extensive manual playtesting and expert intuition, which is time-consuming and subjective. There's a need for automated approaches to balance complex games with multiple interacting systems and parameters.

Method: Combines a game engine, multi-agent LLM self-play simulation, and Bayesian optimization over a multi-dimensional rule space. LLM agents interpret textual rulebooks and game states to generate actions. Uses acquisition-based adaptive sampling and discrete projection for efficient parameter search.
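
The Bayesian-optimization loop at the core of this setup can be sketched as follows; the GP surrogate, expected-improvement acquisition, and toy balance objective are standard stand-ins, not RuleSmith's exact components (which add adaptive game budgets and discrete projection).

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def balance_loss(params):
    """Stand-in for simulated win-rate disparity at a rule configuration."""
    return float(np.sum((params - 0.3) ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 2))               # initial rule configurations
y = np.array([balance_loss(x) for x in X])

for _ in range(20):
    gp = GaussianProcessRegressor().fit(X, y)
    cand = rng.uniform(0, 1, size=(256, 2))       # random candidate rule settings
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, balance_loss(x_next))

print("best configuration:", X[np.argmin(y)], "loss:", y.min())
```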

Result: Successfully applied to CivMini (simplified civilization-style game), achieving convergence to highly balanced configurations. Provides interpretable rule adjustments that can be directly applied to game systems.

Conclusion: LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments, demonstrating the potential of multi-agent LLMs for game development.

Abstract: Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, enabling fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.

[289] Pragmatic Curiosity: A Hybrid Learning-Optimization Paradigm via Active Inference

Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan

Main category: cs.LG

TL;DR: Pragmatic curiosity: A hybrid learning-optimization paradigm using expected free energy minimization to balance goal-seeking and information-seeking in black-box evaluation settings.

DetailsMotivation: Many engineering/scientific workflows involve expensive black-box evaluations requiring simultaneous performance improvement and uncertainty reduction. Existing approaches (Bayesian optimization and Bayesian experimental design) treat goal-seeking and information-seeking separately, lacking guidance for hybrid settings where learning and optimization are intrinsically coupled.

Method: Proposes “pragmatic curiosity” derived from active inference, where actions are selected by minimizing expected free energy - a single objective that couples pragmatic utility with epistemic information gain.
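
Expected free energy is often written as the sum of a pragmatic (expected-utility) term and an epistemic (information-gain) term; below is a toy NumPy rendering of that decomposition, with all quantities made up for illustration rather than taken from the paper.

```python
import numpy as np

def expected_free_energy(p_outcome_given_a, utility, p_outcome_prior):
    """EFE ~ -(pragmatic value) - (epistemic value), per action.

    p_outcome_given_a: (A, O) predicted outcome distribution per action
    utility:           (O,)  log-preferences over outcomes
    p_outcome_prior:   (O,)  predictive outcome distribution before acting
    """
    pragmatic = p_outcome_given_a @ utility            # expected utility
    # Epistemic value as expected information gain (KL of predicted vs prior)
    eps = 1e-12
    info_gain = np.sum(
        p_outcome_given_a * np.log((p_outcome_given_a + eps) / (p_outcome_prior + eps)),
        axis=1,
    )
    return -(pragmatic + info_gain)

p = np.array([[0.7, 0.3], [0.4, 0.6]])     # two actions, two outcomes
prior = np.array([0.5, 0.5])
util = np.log(np.array([0.8, 0.2]))        # prefer outcome 0
efe = expected_free_energy(p, util, prior)
print("chosen action:", int(np.argmin(efe)))  # minimize expected free energy
```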

Result: Demonstrates practical effectiveness on real-world hybrid tasks including constrained system identification, targeted active search, and composite optimization with unknown preferences. Consistently outperforms strong BO-type and BED-type baselines with higher estimation accuracy, better critical-region coverage, and improved final solution quality.

Conclusion: Pragmatic curiosity provides an effective unified framework for hybrid learning-optimization problems that require balancing goal-seeking and information-seeking objectives.

Abstract: Many engineering and scientific workflows depend on expensive black-box evaluations, requiring decision-making that simultaneously improves performance and reduces uncertainty. Bayesian optimization (BO) and Bayesian experimental design (BED) offer powerful yet largely separate treatments of goal-seeking and information-seeking, providing limited guidance for hybrid settings where learning and optimization are intrinsically coupled. We propose “pragmatic curiosity,” a hybrid learning-optimization paradigm derived from active inference, in which actions are selected by minimizing the expected free energy–a single objective that couples pragmatic utility with epistemic information gain. We demonstrate the practical effectiveness and flexibility of pragmatic curiosity on various real-world hybrid tasks, including constrained system identification, targeted active search, and composite optimization with unknown preferences. Across these benchmarks, pragmatic curiosity consistently outperforms strong BO-type and BED-type baselines, delivering higher estimation accuracy, better critical-region coverage, and improved final solution quality.

[290] Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding

Rishabh Jain, Keisuke Okumura, Michael Amir, Pietro Lio, Amanda Prorok

Main category: cs.LG

TL;DR: HMAGAT introduces a hypergraph attention network for multi-agent path finding that captures group dynamics beyond pairwise interactions, outperforming larger models with less data.

DetailsMotivation: Current learning-based MAPF approaches using GNNs are limited to pairwise message passing, leading to suboptimal behaviors and attention dilution in dense environments where group coordination is critical. Existing methods fail to fully explore higher-order interactions between agents.

Method: HMAGAT (Hypergraph Multi-Agent Attention Network) uses attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. This architecture moves beyond pairwise interactions to model complex multi-agent relationships.
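
The key structural difference from pairwise GNN message passing is that messages are formed per hyperedge (a whole group of agents) rather than per edge. Below is a toy PyTorch sketch of that aggregation; the layer choices and names are assumptions, not HMAGAT's architecture.

```python
import torch
import torch.nn as nn

class HyperedgeAttention(nn.Module):
    """Toy sketch: attend over agent features within each hyperedge, then
    push the group-level message back to every member agent."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, x, hyperedges):
        # x: (N, dim) agent features; hyperedges: list of member-index tensors
        msgs = torch.zeros_like(x)
        counts = torch.zeros(x.size(0), 1)
        for members in hyperedges:
            h = x[members]                              # (k, dim) group features
            attn = torch.softmax(self.score(h), dim=0)  # attention inside the group
            edge_msg = (attn * h).sum(dim=0)            # one message per hyperedge
            msgs[members] += edge_msg
            counts[members] += 1
        msgs = msgs / counts.clamp(min=1)               # average over incident edges
        return self.update(torch.cat([x, msgs], dim=-1))

layer = HyperedgeAttention(dim=16)
x = torch.randn(5, 16)                                   # 5 agents
groups = [torch.tensor([0, 1, 2]), torch.tensor([2, 3, 4])]
out = layer(x, groups)
```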

Result: HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers, outperforming the current SoTA 85M-parameter model despite having only 1M parameters and being trained on 100× less data. Attention analysis shows hypergraph representations mitigate the attention dilution inherent in GNNs.

Conclusion: Appropriate inductive biases (like hypergraph representations) are often more critical than training data size or parameter count for multi-agent problems. HMAGAT demonstrates the importance of capturing group dynamics beyond pairwise interactions.

Abstract: Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100$\times$ less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT’s attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.

[291] Private and interpretable clinical prediction with quantum-inspired tensor train models

José Ramón Pareja Monturiol, Juliette Sinnott, Roger G. Melko, Mohammad Kohandel

Main category: cs.LG

TL;DR: Tensor train-based quantum-inspired defense protects clinical ML models from privacy attacks while maintaining interpretability and accuracy.

DetailsMotivation: Clinical machine learning needs to balance predictive accuracy, interpretability, and privacy. Traditional models like logistic regression and neural networks are vulnerable to privacy attacks that can leak training data information, especially in white-box scenarios where attackers have full model access.

Method: The paper proposes a quantum-inspired defense using tensor train decomposition. Models are tensorized into tensor trains (TTs), which obfuscates parameters while preserving accuracy. This approach reduces white-box attacks to random guessing and degrades black-box attacks comparably to Differential Privacy. The method maintains interpretability and enables efficient computation of marginal and conditional distributions.
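
Tensorizing a weight matrix into a tensor train is typically done with the TT-SVD algorithm; here is a bare-bones NumPy version, assuming an illustrative 4-way reshaping of the weights and rank cap (not the paper's exact procedure).

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Bare-bones TT-SVD: factor an n-way array into tensor-train cores."""
    dims = tensor.shape
    cores, r_prev = [], 1
    C = np.asarray(tensor, dtype=float)
    for k in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, S.size)
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))  # k-th TT core
        C = np.diag(S[:r]) @ Vt[:r]                          # carry the remainder
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores

# A 16x16 "weight matrix" viewed as a (4, 4, 4, 4) array
W = np.random.randn(4, 4, 4, 4)
cores = tt_svd(W, max_rank=3)
print([c.shape for c in cores])  # e.g. [(1,4,3), (3,4,3), (3,4,3), (3,4,1)]
```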

Result: Both logistic regression and neural network models leak significant training-set information, with LRs being particularly vulnerable in white-box scenarios. Common practices like cross-validation exacerbate these risks. The tensor train defense successfully mitigates vulnerabilities while preserving model accuracy and interpretability.

Conclusion: Tensorization provides a practical foundation for private, interpretable, and effective clinical prediction. The method is widely applicable and offers a quantum-inspired approach to balancing the competing demands of accuracy, interpretability, and privacy in clinical ML.

Abstract: Machine learning in clinical settings must balance predictive accuracy, interpretability, and privacy. Models such as logistic regression (LR) offer transparency, while neural networks (NNs) provide greater predictive power; yet both remain vulnerable to privacy attacks. We empirically assess these risks by designing attacks that identify which public datasets were used to train a model under varying levels of adversarial access, applying them to LORIS, a publicly available LR model for immunotherapy response prediction, as well as to additional shallow NN models trained for the same task. Our results show that both models leak significant training-set information, with LRs proving particularly vulnerable in white-box scenarios. Moreover, we observe that common practices such as cross-validation in LRs exacerbate these risks. To mitigate these vulnerabilities, we propose a quantum-inspired defense based on tensorizing discretized models into tensor trains (TTs), which fully obfuscates parameters while preserving accuracy, reducing white-box attacks to random guessing and degrading black-box attacks comparably to Differential Privacy. TT models retain LR interpretability and extend it through efficient computation of marginal and conditional distributions, while also enabling this higher level of interpretability for NNs. Our results demonstrate that tensorization is widely applicable and establishes a practical foundation for private, interpretable, and effective clinical prediction.

[292] Compressing LLMs with MoP: Mixture of Pruners

Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Victor Zacarias, Leandro Giusti Mugnaini, Keith Ando Ogawa, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao

Main category: cs.LG

TL;DR: MoP is an iterative pruning framework that unifies depth and width pruning for LLMs, achieving better accuracy than single-dimension methods and translating to real speedups, with extensions to vision-language models.

DetailsMotivation: Current LLM pruning methods focus on either depth or width pruning separately, but unified approaches could achieve better compression-performance trade-offs and real inference speedups.

Method: MoP iteratively generates two pruning branches (depth vs width) and selects the best candidate to advance the pruning path, creating a unified framework for structured pruning across dimensions.
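
The branch-and-select loop reduces to a few lines; in this schematic, `prune_depth`, `prune_width`, and `score` are hypothetical stand-ins for the paper's pruning operators and selection criterion.

```python
def iterative_mixture_pruning(model, steps, prune_depth, prune_width, score):
    """Schematic of MoP's branch-and-select loop; the three callables are
    hypothetical stand-ins for the paper's components."""
    for _ in range(steps):
        depth_branch = prune_depth(model)   # candidate: remove layer(s)
        width_branch = prune_width(model)   # candidate: shrink hidden dims
        # advance the pruning path with whichever branch scores better
        model = max((depth_branch, width_branch), key=score)
    return model

# Toy stand-ins: "model" is just a (depth, width) pair scored by a made-up proxy
prune_depth = lambda m: (m[0] - 1, m[1])
prune_width = lambda m: (m[0], int(m[1] * 0.9))
score = lambda m: -abs(m[0] * m[1] - 2000)  # prefer staying near a FLOP budget
print(iterative_mixture_pruning((32, 128), 5, prune_depth, prune_width, score))
```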

Result: On LLaMA-2/3, MoP exceeds competing methods across compression regimes, outperforms depth/width-only pruning, reduces latency by 39% at 40% compression, and extends effectively to vision-language model LLaVA-1.5.

Conclusion: MoP advances structured pruning frontier by unifying depth and width dimensions, achieving better accuracy-speed trade-offs and demonstrating cross-modal applicability to vision-language models.

Abstract: The high computational demands of Large Language Models (LLMs) motivate methods that reduce parameter count and accelerate inference. In response, model pruning emerges as an effective strategy, yet current methods typically focus on a single dimension: depth or width. We introduce MoP (Mixture of Pruners), an iterative framework that unifies these dimensions. At each iteration, MoP generates two branches (pruning in depth versus pruning in width) and selects a candidate to advance the path. On LLaMA-2 and LLaMA-3, MoP advances the frontier of structured pruning, exceeding the accuracy of competing methods across a broad set of compression regimes. It also consistently outperforms depth-only and width-only pruning. Furthermore, MoP translates structural pruning into real speedup, reducing end-to-end latency by 39% at 40% compression. Finally, extending MoP to the vision-language model LLaVA-1.5, we notably improve computational efficiency and demonstrate that text-only recovery fine-tuning can restore performance even on visual tasks.

[293] Urban Spatio-Temporal Foundation Models for Climate-Resilient Housing: Scaling Diffusion Transformers for Disaster Risk Prediction

Olaf Yunus Laitinen Imanov, Derya Umut Kulali, Taner Yilmaz

Main category: cs.LG

TL;DR: Skjold-DiT is a diffusion-transformer framework that integrates heterogeneous urban data to forecast building-level climate-risk indicators while incorporating transportation-network structure and accessibility signals for intelligent vehicles and emergency response systems.

DetailsMotivation: Climate hazards increasingly disrupt urban transportation and emergency-response operations by damaging infrastructure and reducing network accessibility. There's a need for frameworks that can forecast climate-risk indicators while explicitly incorporating transportation-network structure for intelligent vehicle routing and emergency dispatch systems.

Method: Skjold-DiT combines: (1) Fjell-Prompt for cross-city transfer via prompt-based conditioning, (2) Norrland-Fusion using cross-modal attention to unify hazard maps/imagery, building attributes, demographics, and transportation infrastructure, and (3) Valkyrie-Forecast for probabilistic risk trajectory generation under interventions. The framework produces calibrated, uncertainty-aware accessibility layers for intelligent vehicle routing.

Result: The paper introduces the Baltic-Caspian Urban Resilience (BCUR) dataset with 847,392 building-level observations across six cities. Experiments evaluate prediction quality, cross-city generalization, calibration, and downstream transportation-relevant outcomes including reachability and hazard-conditioned travel times under counterfactual interventions.

Conclusion: Skjold-DiT provides a comprehensive framework for forecasting building-level climate-risk indicators while integrating transportation-network accessibility signals, enabling hazard-conditioned routing constraints for intelligent vehicles and emergency response systems.

Abstract: Climate hazards increasingly disrupt urban transportation and emergency-response operations by damaging housing stock, degrading infrastructure, and reducing network accessibility. This paper presents Skjold-DiT, a diffusion-transformer framework that integrates heterogeneous spatio-temporal urban data to forecast building-level climate-risk indicators while explicitly incorporating transportation-network structure and accessibility signals relevant to intelligent vehicles (e.g., emergency reachability and evacuation-route constraints). Concretely, Skjold-DiT enables hazard-conditioned routing constraints by producing calibrated, uncertainty-aware accessibility layers (reachability, travel-time inflation, and route redundancy) that can be consumed by intelligent-vehicle routing and emergency dispatch systems. Skjold-DiT combines: (1) Fjell-Prompt, a prompt-based conditioning interface designed to support cross-city transfer; (2) Norrland-Fusion, a cross-modal attention mechanism unifying hazard maps/imagery, building attributes, demographics, and transportation infrastructure into a shared latent representation; and (3) Valkyrie-Forecast, a counterfactual simulator for generating probabilistic risk trajectories under intervention prompts. We introduce the Baltic-Caspian Urban Resilience (BCUR) dataset with 847,392 building-level observations across six cities, including multi-hazard annotations (e.g., flood and heat indicators) and transportation accessibility features. Experiments evaluate prediction quality, cross-city generalization, calibration, and downstream transportation-relevant outcomes, including reachability and hazard-conditioned travel times under counterfactual interventions.

[294] Self-Improving World Modelling with Latent Actions

Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

Main category: cs.LG

TL;DR: SWIRL is a self-improvement framework that learns world models from state-only sequences by treating actions as latent variables, using alternating forward and inverse dynamics modelling with reinforcement learning.

DetailsMotivation: Learning world models for LLMs and VLMs typically requires costly action-labelled trajectories. The authors aim to enable learning from state-only sequences by treating actions as latent variables.

Method: SWIRL alternates between Forward World Modelling (FWM) P_θ(Y|X,Z) and Inverse Dynamics Modelling (IDM) Q_φ(Z|X,Y) using two phases: Variational Information Maximisation (updates FWM) and ELBO Maximisation (updates IDM). Both models are trained with reinforcement learning (GRPO) using the opposite frozen model’s log-probability as reward.

Result: SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench across various environments including visual dynamics and synthetic textual environments.

Conclusion: SWIRL provides an effective framework for learning world models from state-only sequences without requiring action labels, with theoretical learnability guarantees and strong empirical performance across multiple benchmarks.

Abstract: Internal modelling of the world – predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ – is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_θ(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_φ(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model’s log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.

[295] Tempora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation

Sudarshan Sreeram, Young D. Kwon, Cecilia Mascolo

Main category: cs.LG

TL;DR: Tempora is a framework for evaluating test-time adaptation methods under temporal pressure, introducing time-contingent utility metrics to measure accuracy-latency trade-offs in real-world deployments.

DetailsMotivation: Current TTA evaluations unrealistically assume unbounded processing time, overlooking the critical accuracy-latency trade-off in real-world deployments where predictions must arrive within time constraints to be useful.

Method: Introduces Tempora framework with temporal scenarios modeling deployment constraints, evaluation protocols, and three time-contingent utility metrics: discrete utility for hard deadlines, continuous utility for decaying value with latency, and amortised utility for budget constraints.
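
The three utility families can be illustrated with simple functional forms; the exact definitions below (step function, exponential decay, budget ratio) are assumptions for illustration, not the paper's metrics.

```python
def discrete_utility(correct: bool, latency: float, deadline: float) -> float:
    """Hard-deadline setting: a correct prediction counts only if it arrives in time."""
    return float(correct and latency <= deadline)

def continuous_utility(correct: bool, latency: float, half_life: float) -> float:
    """Interactive setting: the value of a correct prediction decays with latency."""
    return float(correct) * 0.5 ** (latency / half_life)

def amortised_utility(n_correct: int, total_time: float, budget: float) -> float:
    """Budget-constrained setting: correct predictions per unit of allowed time."""
    return n_correct / budget if total_time <= budget else 0.0

print(discrete_utility(True, latency=0.08, deadline=0.1))    # 1.0
print(continuous_utility(True, latency=0.2, half_life=0.1))  # 0.25
```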

Result: Evaluation of 7 TTA methods on ImageNet-C across 240 temporal scenarios shows rank instability - conventional rankings don’t predict rankings under temporal pressure. ETA (state-of-the-art conventionally) falls short in 41.2% of evaluations. Best method varies with corruption type and temporal pressure.

Conclusion: Tempora enables systematic evaluation across diverse temporal constraints, revealing when and why method rankings invert, providing practitioners with selection guidance and researchers with targets for deployable adaptation.

Abstract: Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. Applying Tempora to seven TTA methods on ImageNet-C across 240 temporal evaluations reveals rank instability: conventional rankings do not predict rankings under temporal pressure; ETA, a state-of-the-art method in the conventional setting, falls short in 41.2% of evaluations. The highest-utility method varies with corruption type and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings invert, offering practitioners a lens for method selection and researchers a target for deployable adaptation.

[296] Flow Matching for Offline Reinforcement Learning with Discrete Actions

Fairoz Nower Khan, Nabuat Zaman Nahim, Ruiquan Huang, Haibo Yang, Peizhong Ju

Main category: cs.LG

TL;DR: Extends flow matching to discrete action spaces for offline RL using continuous-time Markov chains with Q-weighted objectives, applicable to multi-agent settings and multiple objectives.

DetailsMotivation: Current generative policies based on diffusion models and flow matching are largely confined to continuous action spaces, limiting their applicability to broader offline RL settings that include discrete actions and multi-objective scenarios.

Method: Replace continuous flows with continuous-time Markov chains trained using Q-weighted flow matching objective; extend to multi-agent settings via factorized conditional paths to mitigate exponential joint action space growth; apply to continuous control through action quantization.

Result: Method performs robustly in high-dimensional control, multi-modal decision-making, and dynamically changing preferences over multiple objectives; theoretical analysis shows optimal policy recovery under idealized conditions.

Conclusion: Provides a flexible discrete framework for offline RL that bridges continuous and discrete action spaces, offering trade-offs between representational complexity and performance across various RL settings.

Abstract: Generative policies based on diffusion models and flow matching have shown strong promise for offline reinforcement learning (RL), but their applicability remains largely confined to continuous action spaces. To address a broader range of offline RL settings, we extend flow matching to a general framework that supports discrete action spaces with multiple objectives. Specifically, we replace continuous flows with continuous-time Markov chains, trained using a Q-weighted flow matching objective. We then extend our design to multi-agent settings, mitigating the exponential growth of joint action spaces via a factorized conditional path. We theoretically show that, under idealized conditions, optimizing this objective recovers the optimal policy. Extensive experiments further demonstrate that our method performs robustly in practical scenarios, including high-dimensional control, multi-modal decision-making, and dynamically changing preferences over multiple objectives. Our discrete framework can also be applied to continuous-control problems through action quantization, providing a flexible trade-off between representational complexity and performance.

[297] Optimistic Training and Convergence of Q-Learning – Extended Version

Prashant Mehta, Sean Meyn

Main category: cs.LG

TL;DR: Q-learning stability requires more structure than previously thought; even with ideal basis functions, multiple solutions to projected Bellman equation can exist under greedy policies, challenging convergence guarantees.

DetailsMotivation: Prior work showed Q-learning with linear function approximation is stable under (ε,κ)-tamed Gibbs policy, but left open uniqueness of solutions to projected Bellman equation and convergence criteria beyond tabular/linear MDP settings.

Method: Extends analysis to other Q-learning variants, presents counterexamples showing issues with oblivious policies, and demonstrates that even with ideal basis functions (true Q-function in span), multiple solutions to PBE can exist under greedy policies.
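
For reference, with linear function approximation $Q^θ(x,u) = θ^\top ψ(x,u)$, the projected Bellman equation discussed here is conventionally written as the root-finding condition (standard notation, not necessarily the paper's): $\bar f(θ^*) = \mathrm{E}\big[ψ(X_n,U_n)\big(r_{n+1} + γ \max_u Q^{θ^*}(X_{n+1},u) - Q^{θ^*}(X_n,U_n)\big)\big] = 0$. The multiplicity results concern distinct roots $θ^*$ of this equation.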

Result: Shows far more structure is required for convergence than previously assumed; presents an example where the basis is ideal but two solutions to the PBE exist under the greedy policy, implying instability even with the (ε,κ)-tamed Gibbs policy for small ε>0 and κ≥1.

Conclusion: Convergence of Q-learning with function approximation requires stronger structural assumptions than previously recognized; simple conditions like ideal basis functions are insufficient to guarantee unique solutions or stability.

Abstract: In recent work it is shown that Q-learning with linear function approximation is stable, in the sense of bounded parameter estimates, under the $(\varepsilon,κ)$-tamed Gibbs policy; $κ$ is inverse temperature, and $\varepsilon>0$ is introduced for additional exploration. Under these assumptions it also follows that there is a solution to the projected Bellman equation (PBE). Left open is uniqueness of the solution, and criteria for convergence outside of the standard tabular or linear MDP settings. The present work extends these results to other variants of Q-learning, and clarifies prior work: a one-dimensional example shows that under an oblivious policy for training there may be no solution to the PBE, or multiple solutions, and in each case the algorithm is not stable under oblivious training. The main contribution is that far more structure is required for convergence. An example is presented for which the basis is ideal, in the sense that the true Q-function is in the span of the basis. However, there are two solutions to the PBE under the greedy policy, and hence also for the $(\varepsilon,κ)$-tamed Gibbs policy for all sufficiently small $\varepsilon>0$ and $κ\ge 1$.

[298] MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models

Nurbek Tastan, Stefanos Laskaridis, Karthik Nandakumar, Samuel Horvath

Main category: cs.LG

TL;DR: MoSE introduces slimmable experts in MoE architecture, enabling variable expert widths for continuous accuracy-compute trade-offs at inference time.

DetailsMotivation: Standard MoE models have discontinuous accuracy-compute trade-offs because once an expert is selected, it's executed fully. There's a need for more flexible conditional computation that allows varying how much of each expert is used.

Method: Proposes Mixture of Slimmable Experts (MoSE) where each expert has nested, slimmable structure that can execute at variable widths. Combines multi-width training with standard MoE objectives, and explores runtime width determination strategies including test-time training to map router confidence to expert widths under budget constraints.
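
A slimmable expert can be realized by slicing a single weight matrix to its first `width` hidden units at run time; below is a toy PyTorch sketch (module names and sizes are assumptions, not MoSE's implementation).

```python
import torch
import torch.nn as nn

class SlimmableExpert(nn.Module):
    """Toy expert whose hidden width can be cut at inference time by using
    only the first `width` hidden units (a nested/slimmable parameterization)."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x, width):
        # Slice both weight matrices to the requested hidden width
        h = torch.relu(x @ self.w_in.weight[:width].T + self.w_in.bias[:width])
        return h @ self.w_out.weight[:, :width].T + self.w_out.bias

expert = SlimmableExpert(d_model=64, d_hidden=256)
x = torch.randn(4, 64)
full = expert(x, width=256)  # full-width execution
slim = expert(x, width=64)   # same weights, a quarter of the hidden units
```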

Result: MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs on GPT models trained on OpenWebText.

Conclusion: MoSE enables more continuous accuracy-compute trade-offs in MoE models through slimmable experts, providing flexible inference-time computation control while maintaining performance.

Abstract: Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. Hence, the trade-off between accuracy and computation in an MoE model typically exhibits large discontinuities. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths. This enables conditional computation not only over which experts are activated, but also over how much of each expert is utilized. Consequently, a single pretrained MoSE model can support a more continuous spectrum of accuracy-compute trade-offs at inference time. We present a simple and stable training recipe for slimmable experts under sparse routing, combining multi-width training with standard MoE objectives. During inference, we explore strategies for runtime width determination, including a lightweight test-time training mechanism that learns how to map router confidence/probabilities to expert widths under a fixed budget. Experiments on GPT models trained on OpenWebText demonstrate that MoSE matches or improves upon standard MoE at full width and consistently shifts the Pareto frontier for accuracy vs. cost, achieving comparable performance with significantly fewer FLOPs.

[299] Latent Structure Emergence in Diffusion Models via Confidence-Based Filtering

Wei Wei, Yizhou Zeng, Kuntian Chen, Sophie Langer, Mariia Seleznova, Hung-Hsu Chou

Main category: cs.LG

TL;DR: Diffusion models’ latent noise space shows class structure only when filtering for high-confidence generated samples, enabling conditional generation without guidance.

DetailsMotivation: To understand whether diffusion models' high-dimensional latent noise space contains class-predictable structure, and whether this structure emerges only under certain conditions like high-confidence samples.

Method: Analyze latent space structure by examining confidence scores from a pre-trained classifier on generated samples, compare class predictability across noise subsets with varying confidence levels, and investigate class separability in latent space under confidence-based filtering.

Result: Latent space appears largely unstructured for all noise realizations, but restricting to initial noise seeds that produce high-confidence samples reveals pronounced class separability, showing class-relevant latent structure emerges only under confidence-based filtering.

Conclusion: Confidence-based filtering reveals hidden class structure in diffusion model latent spaces, enabling conditional generation as an alternative to guidance-based methods.

Abstract: Diffusion models rely on a high-dimensional latent space of initial noise seeds, yet it remains unclear whether this space contains sufficient structure to predict properties of the generated samples, such as their classes. In this work, we investigate the emergence of latent structure through the lens of confidence scores assigned by a pre-trained classifier to generated samples. We show that while the latent space appears largely unstructured when considering all noise realizations, restricting attention to initial noise seeds that produce high-confidence samples reveals pronounced class separability. By comparing class predictability across noise subsets of varying confidence and examining the class separability of the latent space, we find evidence of class-relevant latent structure that becomes observable only under confidence-based filtering. As a practical implication, we discuss how confidence-based filtering enables conditional generation as an alternative to guidance-based methods.

[300] SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage

Cihan Ruan, Lebin Zhou, Rongduo Han, Linyi Han, Bingqing Zhao, Chenchen Zhu, Wei Jiang, Wei Wang, Nam Ling

Main category: cs.LG

TL;DR: SCONE is a plug-in module that integrates neural compression with DNA storage by performing quaternary arithmetic coding directly on latent representations, enforcing biochemical constraints during encoding without post-processing.

DetailsMotivation: Current DNA storage systems inefficiently integrate with neural compression pipelines. While neural codecs compress data into learned latent representations, they still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack.

Method: SCONE collapses latent compression and DNA encoding into a single step by performing quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder’s learned probability distribution to enforce biochemical constraints (GC balance and homopolymer suppression) deterministically during encoding, eliminating post-hoc correction.
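
The constraint-steering idea can be illustrated on a single decoding step: mask any base that would extend a homopolymer past a limit and tilt the distribution toward the GC target before renormalizing. Everything below (function name, tilt rule, parameters) is a hypothetical sketch, not SCONE's actual coder.

```python
import numpy as np

BASES = "ACGT"

def steer_probs(p, recent, max_run=3, gc_target=0.5, gc_frac=None, beta=1.0):
    """Hypothetical constraint-aware steering of a learned base distribution."""
    p = p.copy()
    # Homopolymer suppression: forbid extending a run past max_run
    if len(recent) >= max_run and len(set(recent[-max_run:])) == 1:
        p[BASES.index(recent[-1])] = 0.0
    # GC balance: exponentially tilt toward the target GC fraction
    if gc_frac is not None:
        tilt = beta * (gc_target - gc_frac)  # >0 means favor G/C
        for i, b in enumerate(BASES):
            p[i] *= np.exp(tilt if b in "GC" else -tilt)
    return p / p.sum()

probs = np.array([0.4, 0.2, 0.2, 0.2])  # entropy model's next-base prediction
print(steer_probs(probs, recent="AAA", gc_frac=0.3))  # 'A' masked, G/C boosted
```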

Result: SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.

Conclusion: SCONE provides an efficient integration of neural compression with DNA storage by directly encoding latent representations into DNA bases while enforcing biochemical constraints during the encoding process, preserving full reversibility and exploiting learned priors without modification.

Abstract: DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data - workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model’s optimization. This mismatch undermines compression efficiency and complicates the encoding stack. We present SCONE, a plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder’s learned probability distribution to enforce biochemical constraints - Guanine-Cytosine (GC) balance and homopolymer suppression - deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model’s learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.

[301] To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training

Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu

Main category: cs.LG

TL;DR: Sparse training method accelerates Transformer FFN matrix multiplications using 2:4 weight sparsity and v:n:m activation sparsity, achieving 1.4-1.7x training speedup with comparable model quality.

DetailsMotivation: Matrix multiplications in Transformer FFNs become a major bottleneck (up to 50% of pretraining FLOPs) for large language models, creating need for acceleration techniques.

Method: Leverage hardware-accelerated sparsity with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations in FFN matrix multiplications. Combine sparse training steps for most of pretraining with dense training steps at the end.
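
The 2:4 weight pattern itself is easy to state in code: keep the two largest-magnitude entries in every contiguous group of four. Below is a minimal PyTorch sketch of just that masking step (the training recipe and the v:n:m activation pattern are more involved).

```python
import torch

def two_four_sparsify(w: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4
    (the 2:4 pattern accelerated by NVIDIA sparse tensor cores)."""
    groups = w.reshape(-1, 4)
    idx = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, idx, True)
    return (groups * mask).reshape(w.shape)

w = torch.randn(8, 16)  # weight rows; 16 divides evenly into groups of 4
print(two_four_sparsify(w))
```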

Result: Models achieve same performance on quality benchmarks while accelerating training end-to-end by 1.4-1.7x. Method works on NVIDIA GPUs from A100 generation onward and is orthogonal to quantization and applicable to mixture-of-experts architectures.

Conclusion: Sparse training of FFN layers using hardware-accelerated sparsity provides significant training speedups without compromising model quality, offering a practical optimization for large language model training.

Abstract: Training of Large Language Models is generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger models, up to 50% of the total pretraining floating point operations. We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the FFN, with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations. Our recipe relies on sparse training steps to accelerate a large part of the pretraining, associated with regular dense training steps towards the end. Overall, models trained with this approach exhibit the same performance on our quality benchmarks, and can speed up training end-to-end by 1.4 to 1.7x. This approach is applicable to all NVIDIA GPUs starting with the A100 generation, is orthogonal to common optimization techniques such as quantization, and can also be applied to mixture-of-experts model architectures.

[302] $f$-FUM: Federated Unlearning via min–max and $f$-divergence

Radmehr Karimian, Amirhossein Bagheri, Meghdad Kurmanji, Nicholas D. Lane, Gholamali Aminian

Main category: cs.LG

TL;DR: A federated unlearning framework using min-max optimization to efficiently remove specific data points while preserving model utility, applicable as a plugin to existing FL systems.

DetailsMotivation: Federated Learning needs principled data unlearning to address legal requirements like "right to be forgotten" and mitigate data poisoning attacks, but distributed nature makes data removal challenging compared to centralized settings.

Method: Proposes a federated unlearning framework formulated as min-max optimization: maximize f-divergence between model trained with all data and model retrained without specific data, while minimizing degradation on retained data. Acts as a plugin for existing FL setups.

Result: Achieves significant speedups over naive retraining with minimal impact on utility. More flexible than SOTA methods that require model degradation on the server or specific model-architecture modifications.

Conclusion: Provides an efficient, practical solution for data removal in federated learning that balances unlearning effectiveness with model utility preservation, addressing important privacy and security concerns.

Abstract: Federated Learning (FL) has emerged as a powerful paradigm for collaborative machine learning across decentralized data sources, preserving privacy by keeping data local. However, increasing legal and ethical demands, such as the “right to be forgotten”, and the need to mitigate data poisoning attacks have underscored the urgent necessity for principled data unlearning in FL. Unlike centralized settings, the distributed nature of FL complicates the removal of individual data contributions. In this paper, we propose a novel federated unlearning framework formulated as a min-max optimization problem, where the objective is to maximize an $f$-divergence between the model trained with all data and the model retrained without specific data points, while minimizing the degradation on retained data. Our framework can act as a plugin and be added to almost any federated setup, unlike SOTA methods such as \cite{10269017}, which requires model degradation on the server, or \cite{khalil2025notfederatedunlearningweight}, which requires access to the model architecture and weights. This formulation allows for efficient approximation of data removal effects in a federated setting. We provide empirical evaluations to show that our method achieves significant speedups over naive retraining, with minimal impact on utility.

[303] Learning Rate Scaling across LoRA Ranks and Transfer to Full Finetuning

Nan Chen, Soledad Villar, Soufiane Hayou

Main category: cs.LG

TL;DR: μA framework provides theoretical scaling rules for optimal learning rates in LoRA based on adapter rank and model width, enabling learning rate transfer across ranks and to full finetuning.

DetailsMotivation: LoRA training dynamics are complex and depend on hyperparameters like initialization, adapter rank, and learning rate. Practitioners must re-tune learning rates when changing rank, which is inefficient. There's a need for theoretical understanding of how optimal learning rates should scale with model width and adapter rank.

Method: Introduces Maximal-Update Adaptation (μA), a theoretical framework inspired by Maximal-Update Parametrization (μP) from pretraining. Uses techniques from hyperparameter transfer to analyze how optimal learning rates should scale with model width and adapter rank under different initialization and LoRA scaling configurations.

Result: Identifies two regimes: one where optimal learning rate is roughly invariant across ranks, and another where it scales inversely with rank. Finds a configuration that enables learning rate transfer from LoRA to full finetuning. Experimental validation across language, vision, vision-language, image generation, and RL tasks confirms scaling rules.

Conclusion: μA provides principled scaling rules for LoRA learning rates, reducing hyperparameter tuning costs. Enables learning rate transfer across adapter ranks and even to full finetuning, making parameter-efficient finetuning more practical and efficient.

Abstract: Low-Rank Adaptation (LoRA) is a standard tool for parameter-efficient finetuning of large models. While it induces a small memory footprint, its training dynamics can be surprisingly complex as they depend on several hyperparameters such as initialization, adapter rank, and learning rate. In particular, it is unclear how the optimal learning rate scales with adapter rank, which forces practitioners to re-tune the learning rate whenever the rank is changed. In this paper, we introduce Maximal-Update Adaptation ($μ$A), a theoretical framework that characterizes how the “optimal” learning rate should scale with model width and adapter rank to produce stable, non-vanishing feature updates under standard configurations. $μ$A is inspired from the Maximal-Update Parametrization ($μ$P) in pretraining. Our analysis leverages techniques from hyperparameter transfer and reveals that the optimal learning rate exhibits different scaling patterns depending on initialization and LoRA scaling factor. Specifically, we identify two regimes: one where the optimal learning rate remains roughly invariant across ranks, and another where it scales inversely with rank. We further identify a configuration that allows learning rate transfer from LoRA to full finetuning, drastically reducing the cost of learning rate tuning for full finetuning. Experiments across language, vision, vision–language, image generation, and reinforcement learning tasks validate our scaling rules and show that learning rates tuned on LoRA transfer reliably to full finetuning.

[304] Multi-Way Representation Alignment

Akshit Achara, Tatiana Gaintseva, Mateo Mahaut, Pritish Chakraborty, Viktor Stenby Johansson, Melih Barsbey, Emanuele Rodolà, Donato Crisostomi

Main category: cs.LG

TL;DR: GCPA method aligns multiple models’ latent spaces into a shared reference space, improving any-to-any retrieval while preserving geometry.

DetailsMotivation: Current pairwise alignment methods scale poorly with multiple models and lack consistent global reference; need better multi-model alignment that balances geometry preservation and retrieval performance.

Method: Adapt Generalized Procrustes Analysis (GPA) for shared orthogonal universe preserving geometry, then propose Geometry-Corrected Procrustes Alignment (GCPA) with post-hoc correction for directional mismatch.
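
A toy version of the GPA step: repeatedly rotate each model's representation toward the running mean with an orthogonal Procrustes solve. Shapes and the iteration count below are illustrative, and GCPA's directional correction is not shown.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def generalized_procrustes(reps, iters=10):
    """Toy GPA: rotate each model's (n_samples, dim) representation toward
    the running mean until the shared 'universe' stabilizes."""
    aligned = [r.copy() for r in reps]
    for _ in range(iters):
        mean = np.mean(aligned, axis=0)
        for i, r in enumerate(reps):
            R, _ = orthogonal_procrustes(r, mean)  # best orthogonal map r -> mean
            aligned[i] = r @ R
    return aligned, np.mean(aligned, axis=0)

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 8))
# Three "models": random orthogonal rotations of the same latent space
reps = [base @ np.linalg.qr(rng.normal(size=(8, 8)))[0] for _ in range(3)]
aligned, universe = generalized_procrustes(reps)
print(np.linalg.norm(aligned[0] - aligned[1]))  # small: the spaces now coincide
```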

Result: GCPA consistently improves any-to-any retrieval across models while maintaining practical shared reference space, outperforming strict isometric alignment and CCA-based methods.

Conclusion: GCPA provides effective multi-model alignment solution that bridges gap between geometry-preserving and retrieval-optimized approaches, enabling better cross-model interoperability.

Abstract: The Platonic Representation Hypothesis suggests that independently trained neural networks converge to increasingly similar latent spaces. However, current strategies for mapping these representations are inherently pairwise, scaling quadratically with the number of models and failing to yield a consistent global reference. In this paper, we study the alignment of $M \ge 3$ models. We first adapt Generalized Procrustes Analysis (GPA) to construct a shared orthogonal universe that preserves the internal geometry essential for tasks like model stitching. We then show that strict isometric alignment is suboptimal for retrieval, where agreement-maximizing methods like Canonical Correlation Analysis (CCA) typically prevail. To bridge this gap, we finally propose Geometry-Corrected Procrustes Alignment (GCPA), which establishes a robust GPA-based universe followed by a post-hoc correction for directional mismatch. Extensive experiments demonstrate that GCPA consistently improves any-to-any retrieval while retaining a practical shared reference space.

[305] Emergent Low-Rank Training Dynamics in MLPs with Smooth Activations

Alec S. Xu, Can Yaras, Matthew Asato, Qing Qu, Laura Balzano

Main category: cs.LG

TL;DR: The paper analyzes how training dynamics of deep neural networks occur in low-dimensional subspaces, providing theoretical and empirical evidence for this phenomenon in MLPs with smooth activations.

DetailsMotivation: Empirical evidence shows training dynamics of large neural networks occur in low-dimensional subspaces, inspiring low-rank training methods, but theoretical understanding for nonlinear networks remains limited compared to linear settings.

Method: Analyzes learning dynamics of MLPs under gradient descent, theoretically characterizing invariant low-dimensional subspaces for two-layer networks with smooth nonlinear activations, and experimentally validating the phenomenon beyond theoretical assumptions.

Result: Demonstrates weight dynamics concentrate within invariant low-dimensional subspaces throughout training, and shows empirically that low-rank MLP parameterization initialized within appropriate subspaces matches performance of fully-parameterized networks on various classification tasks.

Conclusion: Provides theoretical justification for low-dimensional training dynamics in nonlinear networks, enabling more efficient low-rank parameterizations that maintain performance while reducing computational requirements.

Abstract: Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the classification performance of fully-parameterized counterparts on a variety of classification tasks.

[306] SR4-Fit: An Interpretable and Informative Classification Algorithm Applied to Prediction of U.S. House of Representatives Elections

Shyam Sundar Murali Krishnan, Dean Frederick Hougen

Main category: cs.LG

TL;DR: SR4-Fit is a novel interpretable classification algorithm that combines predictive power with transparency, outperforming both black-box models and traditional rule-based methods on election forecasting and multiple benchmark datasets.

DetailsMotivation: There's a growing need for interpretable ML models in critical applications, but current approaches face a trade-off: high-performing models are "black-box" systems that obscure input-output relationships, while traditional rule-based algorithms like RuleFit lack predictive power and stability despite being interpretable.

Method: SR4-Fit (Sparse Relaxed Regularized Regression Rule-Fit) is developed as an interpretable classification algorithm that addresses limitations of existing methods. It generates stable and interpretable rule sets while maintaining superior predictive performance through sparse relaxed regularization techniques.

Result: SR4-Fit predicts U.S. House election party outcomes with unprecedented accuracy and interpretability, revealing intrinsic combinations of demographic factors that affect prediction outcomes. It surpasses both black-box models (like random forests) and existing interpretable rule-based algorithms (like RuleFit) in accuracy, simplicity, and robustness across multiple datasets including breast cancer, Ecoli, page blocks, Pima Indians, vehicle, and yeast datasets.

Conclusion: SR4-Fit successfully addresses the traditional trade-off between model interpretability and predictive capability, providing a practical solution for applications requiring both transparency and high performance, particularly in domains like electoral forecasting where interpretability is crucial.

Abstract: The growth of machine learning demands interpretable models for critical applications, yet most high-performing models are “black-box” systems that obscure input-output relationships, while traditional rule-based algorithms like RuleFit suffer from a lack of predictive power and instability despite their simplicity. This motivated our development of Sparse Relaxed Regularized Regression Rule-Fit (SR4-Fit), a novel interpretable classification algorithm that addresses these limitations while maintaining superior classification performance. Using demographic characteristics of U.S. congressional districts from the Census Bureau’s American Community Survey, we demonstrate that SR4-Fit can predict House election party outcomes with unprecedented accuracy and interpretability. Our results show that while the majority party remains the strongest predictor, SR4-Fit reveals intrinsic combinations of demographic factors affecting prediction outcomes that could not be interpreted from black-box algorithms such as random forests. The SR4-Fit algorithm surpasses both black-box models and existing interpretable rule-based algorithms such as RuleFit with respect to accuracy, simplicity, and robustness, generating stable and interpretable rule sets while maintaining superior predictive performance, thus addressing the traditional trade-off between model interpretability and predictive capability in electoral forecasting. To further validate SR4-Fit’s performance, we also apply it to six additional publicly available classification datasets (breast cancer, Ecoli, page blocks, Pima Indians, vehicle, and yeast) and find similar results.
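
SR4-Fit's solver is not specified in the abstract, but the RuleFit family it extends follows a common recipe: generate candidate rules from a tree ensemble, then fit a sparse linear model over rule indicators. A minimal scikit-learn sketch of that recipe, with leaf-membership indicators standing in for extracted rules and plain L1 standing in for the sparse relaxed regularization (so this is RuleFit-flavored, not SR4-Fit itself):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = load_breast_cancer(return_X_y=True)

# Step 1: candidate rules = leaf memberships of a shallow tree ensemble.
trees = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
leaves = trees.apply(X).reshape(len(X), -1)          # (n_samples, n_trees)
rules = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

# Step 2: sparse linear model over rules; L1 zeroes out most of them,
# leaving a small, human-readable rule set.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(rules, y)
print("active rules:", (clf.coef_ != 0).sum())
```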

[307] Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher

Main category: cs.LG

TL;DR: PEPO is a single-step DPO-like algorithm that addresses over-optimization in preference learning using an ensemble of policies trained on disjoint data subsets, aggregated via worst-case construction to favor model agreement.

DetailsMotivation: The paper addresses the over-optimization issue in preference learning algorithms like DPO, which can lead to poor performance due to reward hacking or distribution shift. Existing methods often require knowledge of the data-generating distribution or explicit reward models, which can be impractical.

Method: PEPO uses an ensemble approach where multiple preference-optimized policies are trained on disjoint data subsets. These policies are then aggregated through a worst-case construction that favors agreement across models, achieving pessimism without requiring explicit reward models or distribution knowledge.

Result: In tabular settings, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, avoiding the all-policy concentrability issue that affects DPO. The method shows convincing practical performance while maintaining DPO-style training simplicity.

Conclusion: PEPO provides a practical solution to over-optimization in preference learning that retains the simplicity of DPO while offering better theoretical guarantees and empirical performance through ensemble-based pessimism.

Abstract: We introduce PEPO (Pessimistic Ensemble based Preference Optimization), a single-step Direct Preference Optimization (DPO)-like algorithm to mitigate the well-known over-optimization issue in preference learning without requiring knowledge of the data-generating distribution or learning an explicit reward model. PEPO achieves pessimism via an ensemble of preference-optimized policies trained on disjoint data subsets and then aggregates them through a worst-case construction that favors agreement across models. In the tabular setting, PEPO achieves sample complexity guarantees depending only on a single-policy concentrability coefficient, thus avoiding the all-policy concentrability which affects the guarantees of algorithms prone to over-optimization, such as DPO. The theoretical findings are corroborated by convincing practical performance, while PEPO retains the simplicity and practicality of DPO-style training.
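
One natural reading of the worst-case aggregation step, sketched below: score each candidate response by its minimum log-probability across the K sub-policies, so only responses the whole ensemble agrees on score highly. Training the sub-policies on disjoint DPO splits is assumed and omitted, and the paper's exact construction may differ from this minimum.

```python
import torch

def pessimistic_logprob(seq_logprobs):
    # seq_logprobs: (K, batch) sequence log-probabilities of each
    # candidate response under the K preference-optimized policies.
    # Worst-case aggregation: a response is favored only when every
    # ensemble member agrees it is likely.
    return seq_logprobs.min(dim=0).values

scores = pessimistic_logprob(torch.randn(4, 8))  # 4 policies, 8 responses
```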

[308] ATEX-CF: Attack-Informed Counterfactual Explanations for Graph Neural Networks

Yu Zhang, Sean Bin Yang, Arijit Khan, Cuneyt Gurcan Akcora

Main category: cs.LG

TL;DR: ATEX-CF is a novel framework that unifies adversarial attacks and counterfactual explanations for GNNs by integrating both edge additions and deletions to generate faithful, concise, and plausible explanations.

DetailsMotivation: Traditional approaches treat counterfactual explanations and adversarial attacks separately for GNNs, even though they share the goal of flipping predictions. There's a need to unify these approaches to leverage adversarial insights for generating more effective counterfactual explanations.

Method: ATEX-CF integrates adversarial attack techniques with counterfactual explanation generation by combining both edge additions and deletions (unlike traditional methods that use only one type). The method jointly optimizes fidelity, sparsity, and plausibility under a constrained perturbation budget, using theoretical grounding to explore impactful counterfactuals.

Result: Experiments on synthetic and real-world node classification benchmarks show that ATEX-CF generates faithful, concise, and plausible explanations, demonstrating the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.

Conclusion: The proposed ATEX-CF framework successfully bridges adversarial attacks and counterfactual explanations for GNNs, producing more informative and realistic explanations by leveraging insights from both domains.

Abstract: Counterfactual explanations offer an intuitive way to interpret graph neural networks (GNNs) by identifying minimal changes that alter a model’s prediction, thereby answering “what must differ for a different outcome?”. In this work, we propose a novel framework, ATEX-CF, that unifies adversarial attack techniques with counterfactual explanation generation, a connection made feasible by their shared goal of flipping a node’s prediction, yet differing in perturbation strategy: adversarial attacks often rely on edge additions, while counterfactual methods typically use deletions. Unlike traditional approaches that treat explanation and attack separately, our method efficiently integrates both edge additions and deletions, grounded in theory, leveraging adversarial insights to explore impactful counterfactuals. In addition, by jointly optimizing fidelity, sparsity, and plausibility under a constrained perturbation budget, our method produces instance-level explanations that are both informative and realistic. Experiments on synthetic and real-world node classification benchmarks demonstrate that ATEX-CF generates faithful, concise, and plausible explanations, highlighting the effectiveness of integrating adversarial insights into counterfactual reasoning for GNNs.
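
A toy sketch of the search space the framework operates in: greedy single-edge flips, covering both additions and deletions, on a dense adjacency matrix until the target node's prediction changes. Here `model(A, X)` is a hypothetical GNN returning node logits, and ATEX-CF's attack-informed scoring and joint fidelity/sparsity/plausibility objective are replaced by a plain confidence-drop criterion.

```python
import torch

def greedy_counterfactual(model, A, X, node, budget=5):
    # A: dense (n, n) adjacency, X: node features; model(A, X) -> logits.
    # Greedily flip single edges (additions AND deletions) incident to
    # `node` until its predicted class changes or the budget runs out.
    orig = model(A, X)[node].argmax()
    A_cf, flips = A.clone(), []
    for _ in range(budget):
        best_j, best_conf = None, float("inf")
        for j in range(A.size(0)):
            if j == node:
                continue
            A_try = A_cf.clone()
            A_try[node, j] = A_try[j, node] = 1 - A_try[node, j]
            conf = model(A_try, X)[node].softmax(-1)[orig]
            if conf < best_conf:       # largest drop in original-class confidence
                best_j, best_conf = j, conf
        A_cf[node, best_j] = A_cf[best_j, node] = 1 - A_cf[node, best_j]
        flips.append(best_j)
        if model(A_cf, X)[node].argmax() != orig:
            break
    return A_cf, flips
```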

[309] A Fast and Generalizable Fourier Neural Operator-Based Surrogate for Melt-Pool Prediction in Laser Processing

Alix Benoit, Toni Ivas, Mateusz Papierz, Asel Sagingalieva, Alexey Melnikov, Elia Iseli

Main category: cs.LG

TL;DR: LP-FNO uses Fourier Neural Operators as a surrogate model for laser welding simulations, achieving 100,000x speedup with 1% temperature error by reformulating transient problems in moving laser frames.

DetailsMotivation: High-fidelity laser welding simulations are computationally expensive, limiting large-scale process exploration and real-time applications. There's a need for efficient surrogate models that can predict complex thermo-fluid phenomena quickly.

Method: Proposes Laser Processing Fourier Neural Operator (LP-FNO) that learns parametric solution operators from multiphysics simulations. Uses novel approach of reformulating transient problems in moving laser frame with temporal averaging to create quasi-steady state setting suitable for operator learning.

Result: Achieves temperature prediction errors around 1% and intersection-over-union scores over 0.9 for melt-pool segmentation. Model trained on coarse-resolution data can be evaluated on finer grids, providing accurate super-resolved predictions. Enables predictions in tens of milliseconds, up to 100,000x faster than traditional simulations.

Conclusion: LP-FNO provides an efficient surrogate modeling framework for laser welding that enables fast prediction of 3D temperature fields and phase interfaces over wide parameter ranges, bridging the gap between computational efficiency and accuracy.

Abstract: High-fidelity simulations of laser welding capture complex thermo-fluid phenomena, including phase change, free-surface deformation, and keyhole dynamics; however, their computational cost limits large-scale process exploration and real-time use. In this work, we present the Laser Processing Fourier Neural Operator (LP-FNO), a Fourier Neural Operator (FNO) based surrogate model that learns the parametric solution operator of various laser processes from multiphysics simulations generated with FLOW-3D WELD (registered trademark). Through a novel approach of reformulating the transient problem in the moving laser frame and applying temporal averaging, the system results in a quasi-steady state setting suitable for operator learning, even in the keyhole welding regime. The proposed LP-FNO maps process parameters to three-dimensional temperature fields and melt-pool boundaries across a broad process window spanning conduction and keyhole regimes using the non-dimensional normalized enthalpy formulation. The model achieves temperature prediction errors on the order of 1% and intersection-over-union scores above 0.9 for melt-pool segmentation. We demonstrate that an LP-FNO model trained on coarse-resolution data can be evaluated on finer grids, yielding accurate super-resolved predictions in mesh-converged conduction regimes, whereas discrepancies in keyhole regimes reflect unresolved dynamics in the coarse-mesh training data. These results indicate that the LP-FNO provides an efficient surrogate modeling framework for laser welding, enabling prediction of full three-dimensional fields and phase interfaces over wide parameter ranges in just tens of milliseconds, up to a hundred thousand times faster than traditional Finite Volume multi-physics software.
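
The FNO building block underlying LP-FNO is standard: FFT the input field, apply a learned complex-valued mixing to the lowest Fourier modes, and inverse-FFT. A minimal 2D version of that spectral layer (LP-FNO itself maps parameters to 3D fields; lifting/projection layers and channel sizes are omitted here):

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    # Keep only the lowest (modes1, modes2) Fourier modes and apply a
    # learned complex linear map per mode -- the core FNO operation.
    # Requires modes1 <= H and modes2 <= W // 2 + 1.
    def __init__(self, in_ch, out_ch, modes1, modes2):
        super().__init__()
        scale = 1.0 / (in_ch * out_ch)
        self.modes1, self.modes2 = modes1, modes2
        self.weight = nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes1, modes2, dtype=torch.cfloat))

    def forward(self, x):                       # x: (batch, in_ch, H, W)
        x_ft = torch.fft.rfft2(x)               # (batch, in_ch, H, W//2+1)
        out = torch.zeros(x.size(0), self.weight.size(1), x.size(2),
                          x.size(3) // 2 + 1, dtype=torch.cfloat,
                          device=x.device)
        out[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy",
            x_ft[:, :, :self.modes1, :self.modes2], self.weight)
        return torch.fft.irfft2(out, s=x.shape[-2:])
```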

[310] Adaptive Sparse Möbius Transforms for Learning Polynomials

Yigit Efe Erginbas, Justin Singh Kang, Elizabeth Polito, Kannan Ramchandran

Main category: cs.LG

TL;DR: Efficient algorithms for learning sparse real-valued Boolean polynomials using adaptive group testing to implement Möbius transform, with applications to hypergraph reconstruction.

DetailsMotivation: The problem of learning sparse real-valued Boolean polynomials in the AND basis (Möbius transform) is challenging because the basis vectors are coherent, preventing standard compressed sensing methods. This contrasts with the well-understood Fourier transform in the parity basis.

Method: Two algorithms: Fully-Adaptive Sparse Möbius Transform (FASMT) uses O(sd log(n/d)) adaptive queries with O((sd + n) sd log(n/d)) time. Partially-Adaptive Sparse Möbius Transform (PASMT) uses O(sd² log(n/d)) queries with O(d² log(n/d)) adaptive rounds, trading query efficiency for reduced adaptivity.

Result: FASMT achieves near-optimal query complexity. Both algorithms avoid combinatorial explosion in rank d for hypergraph reconstruction from edge-count queries, showing practical utility in simulations.

Conclusion: Adaptive group testing enables efficient Möbius transform for sparse functions, with applications to hypergraph reconstruction that improve upon existing baselines.

Abstract: We consider the problem of exactly learning an $s$-sparse real-valued Boolean polynomial of degree $d$ of the form $f:\{0,1\}^n \rightarrow \mathbb{R}$. This problem corresponds to decomposing functions in the AND basis and is known as taking a Möbius transform. While the analogous problem for the parity basis (Fourier transform) $f:\{-1,1\}^n \rightarrow \mathbb{R}$ is well-understood, the AND basis presents a unique challenge: the basis vectors are coherent, precluding standard compressed sensing methods. We overcome this challenge by identifying that we can exploit adaptive group testing to provide a constructive, query-efficient implementation of the Möbius transform (also known as Möbius inversion) for sparse functions. We present two algorithms based on this insight. The Fully-Adaptive Sparse Möbius Transform (FASMT) uses $O(sd \log(n/d))$ adaptive queries in $O((sd + n) sd \log(n/d))$ time, which we show is near-optimal in query complexity. Furthermore, we also present the Partially-Adaptive Sparse Möbius Transform (PASMT), which uses $O(sd^2\log(n/d))$ queries, trading a factor of $d$ to reduce the number of adaptive rounds to $O(d^2\log(n/d))$, with no dependence on $s$. When applied to hypergraph reconstruction from edge-count queries, our results improve upon baselines by avoiding the combinatorial explosion in the rank $d$. We demonstrate the practical utility of our method for hypergraph reconstruction by applying it to learning real hypergraphs in simulations.
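
For small $n$, the Möbius transform that the paper computes query-efficiently can be done densely in $O(n2^n)$ time with the classic subset-sum inversion; this exponential baseline is exactly what FASMT's $O(sd\log(n/d))$ queries avoid. A worked sketch:

```python
def mobius_transform(f):
    # f: length-2**n list, f[S] = polynomial value on subset bitmask S.
    # Returns coefficients c with f(S) = sum over T subseteq S of c[T],
    # i.e., the AND-basis (Mobius) coefficients.
    n = (len(f) - 1).bit_length()
    c = list(f)
    for i in range(n):
        for S in range(len(c)):
            if S >> i & 1:
                c[S] -= c[S ^ (1 << i)]
    return c

# f(x) = 3*x0*x1 + 2*x2, evaluated on all subsets of {0, 1, 2}:
f = [3 * (S & 1) * (S >> 1 & 1) + 2 * (S >> 2 & 1) for S in range(8)]
print(mobius_transform(f))   # coefficient 3 at mask 0b011, 2 at mask 0b100
```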

[311] REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop

Patryk Rybak, Paweł Batorski, Paul Swoboda, Przemysław Spurek

Main category: cs.LG

TL;DR: REBEL is an evolutionary adversarial prompt generation method that reveals current machine unlearning methods for LLMs only provide superficial protection, as it can recover “forgotten” knowledge that standard benchmarks miss.

DetailsMotivation: Current machine unlearning evaluation metrics rely on benign queries that mistake superficial information suppression for genuine knowledge removal, failing to detect residual knowledge that sophisticated prompting could extract.

Method: REBEL uses an evolutionary approach for adversarial prompt generation to probe whether unlearned data can still be recovered from LLMs, testing on TOFU and WMDP benchmarks across various unlearning algorithms.

Result: REBEL consistently outperforms static baselines, recovering “forgotten” knowledge with Attack Success Rates up to 60% on TOFU and 93% on WMDP, showing current unlearning methods provide only superficial protection.

Conclusion: Current machine unlearning methods for LLMs may be less effective than previously thought, as adversarial prompting can extract supposedly removed knowledge, highlighting the need for more robust unlearning techniques.

Abstract: Machine unlearning for LLMs aims to remove sensitive or copyrighted data from trained models. However, the true efficacy of current unlearning methods remains uncertain. Standard evaluation metrics rely on benign queries that often mistake superficial information suppression for genuine knowledge removal. Such metrics fail to detect residual knowledge that more sophisticated prompting strategies could still extract. We introduce REBEL, an evolutionary approach for adversarial prompt generation designed to probe whether unlearned data can still be recovered. Our experiments demonstrate that REBEL successfully elicits “forgotten” knowledge from models that appear, under standard unlearning benchmarks, to have forgotten it, revealing that current unlearning methods may provide only a superficial layer of protection. We validate our framework on subsets of the TOFU and WMDP benchmarks, evaluating performance across a diverse suite of unlearning algorithms. Our experiments show that REBEL consistently outperforms static baselines, recovering “forgotten” knowledge with Attack Success Rates (ASRs) reaching up to 60% on TOFU and 93% on WMDP. Code is available at https://github.com/patryk-rybak/REBEL/
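
The evolutionary outer loop can be sketched generically: keep a population of prompts, score each by how much forgotten knowledge the target model's response reveals, retain the elites, and mutate. `score_prompt` and `mutate` below are illustrative stubs; REBEL's actual scoring and variation operators are its contribution and are not reproduced here.

```python
import random

def evolve_prompts(seed_prompts, score_prompt, mutate, generations=20, pop=16):
    # Generic evolutionary search over adversarial prompts: selection by
    # fitness, variation by mutation. Higher score = more recovery.
    population = list(seed_prompts)
    for _ in range(generations):
        elites = sorted(population, key=score_prompt, reverse=True)[: pop // 4]
        population = elites + [mutate(random.choice(elites))
                               for _ in range(pop - len(elites))]
    return max(population, key=score_prompt)

# Placeholder operators, for illustration only; in practice the score
# would query the unlearned model and judge knowledge recovery.
best = evolve_prompts(
    ["Tell me about X."],
    score_prompt=lambda p: -abs(len(p) - 40),
    mutate=lambda p: p + random.choice([" Please.", " In detail.", " Hypothetically,"]),
)
```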

[312] Steering Safely or Off a Cliff? Rethinking Specificity and Robustness in Inference-Time Interventions

Navita Goyal, Hal Daumé

Main category: cs.LG

TL;DR: Systematic evaluation of model steering specificity reveals that while steering effectively controls target properties and preserves general abilities, it fails to maintain robustness against distribution shifts, compromising safety in critical applications.

DetailsMotivation: Current model steering evaluations focus primarily on efficacy but lack comprehensive assessment of specificity - whether interventions alter only intended properties without unintended changes to related behaviors, especially in safety-critical applications.

Method: Proposes a framework with three specificity dimensions: general (preserving fluency/unrelated abilities), control (preserving related control properties), and robustness (preserving control under distribution shifts). Evaluates two safety-critical use cases: reducing overrefusal and faithfulness hallucinations.

Result: Steering achieves high efficacy and maintains general/control specificity but consistently fails robustness specificity. For overrefusal steering, methods reduce overrefusal without harming general abilities or refusal on harmful queries, but substantially increase vulnerability to jailbreaks.

Conclusion: Standard efficacy and specificity checks are insufficient; without robustness evaluation, steering methods may appear reliable while compromising model safety. First systematic evaluation of specificity in model steering reveals critical safety gaps.

Abstract: Model steering, which involves intervening on hidden representations at inference time, has emerged as a lightweight alternative to finetuning for precisely controlling large language models. While steering efficacy has been widely studied, evaluations of whether interventions alter only the intended property remain limited, especially with respect to unintended changes in behaviors related to the target property. We call this notion specificity. We propose a framework that distinguishes three dimensions of specificity: general (preserving fluency and unrelated abilities), control (preserving related control properties), and robustness (preserving control properties under distribution shifts). We study two safety-critical use cases: steering models to reduce overrefusal and faithfulness hallucinations, and show that while steering achieves high efficacy and largely maintains general and control specificity, it consistently fails to preserve robustness specificity. In the case of overrefusal steering, for example, all steering methods reduce overrefusal without harming general abilities and refusal on harmful queries; however, they substantially increase vulnerability to jailbreaks. Our work provides the first systematic evaluation of specificity in model steering, showing that standard efficacy and specificity checks are insufficient, because without robustness evaluation, steering methods may appear reliable even when they compromise model safety.
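
Inference-time steering of the kind evaluated here is typically implemented as a forward hook that adds a fixed direction to a layer's hidden states. A minimal PyTorch sketch; the steering vector itself (e.g., a difference of means between contrastive prompt sets) and the choice of layer are assumed given.

```python
import torch

def register_steering(layer, direction, alpha=4.0):
    # Adds alpha * unit(direction) to the layer's hidden states at
    # inference time; undo with the returned handle's .remove().
    d = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * d.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer.register_forward_hook(hook)
```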

[313] On Randomized Algorithms in Online Strategic Classification

Chase Hutton, Adam Melrod, Han Shao

Main category: cs.LG

TL;DR: This paper studies online strategic classification where agents modify features to get favorable predictions, focusing on randomized algorithms and improved regret bounds in both realizable and agnostic settings.

DetailsMotivation: While randomized algorithms could offer advantages in strategic classification settings, they remain largely unexplored. Current bounds for deterministic learners may be circumvented by randomization, and existing regret bounds in agnostic settings are far from standard online learning rates.

Method: The authors provide refined bounds for online strategic classification in both realizable and agnostic settings. They extend lower bounds to all learners (including randomized), develop randomized learners that improve deterministic bounds, and use convex optimization techniques for proper learning in agnostic settings.

Result: First lower bound for randomized learners in realizable setting, first randomized learner improving deterministic upper bounds, and improved regret upper bound of O(√(T log|H|) + |H|log(T|H|)) in agnostic setting with matching lower bound up to logarithmic factors for proper learners.

Conclusion: Randomized algorithms can provide advantages in strategic classification, and proper learning achieves near-optimal regret bounds, though improper learning may be necessary for further improvements.

Abstract: Online strategic classification studies settings in which agents strategically modify their features to obtain favorable predictions. For example, given a classifier that determines loan approval based on credit scores, applicants may open or close credit cards and bank accounts to obtain a positive prediction. The learning goal is to achieve low mistake or regret bounds despite such strategic behavior. While randomized algorithms have the potential to offer advantages to the learner in strategic settings, they have been largely underexplored. In the realizable setting, no lower bound is known for randomized algorithms, and existing lower bound constructions for deterministic learners can be circumvented by randomization. In the agnostic setting, the best known regret upper bound is $O(T^{3/4}\log^{1/4}T|\mathcal H|)$, which is far from the standard online learning rate of $O(\sqrt{T\log|\mathcal H|})$. In this work, we provide refined bounds for online strategic classification in both settings. In the realizable setting, we extend, for $T > \mathrm{Ldim}(\mathcal{H}) \Delta^2$, the existing lower bound $\Omega(\mathrm{Ldim}(\mathcal{H}) \Delta)$ for deterministic learners to all learners. This yields the first lower bound that applies to randomized learners. We also provide the first randomized learner that improves the known (deterministic) upper bound of $O(\mathrm{Ldim}(\mathcal H) \cdot \Delta\log \Delta)$. In the agnostic setting, we give a proper learner using convex optimization techniques to improve the regret upper bound to $O(\sqrt{T \log |\mathcal{H}|} + |\mathcal{H}| \log(T|\mathcal{H}|))$. We show a matching lower bound up to logarithmic factors for all proper learning rules, demonstrating the optimality of our learner among proper learners. As such, improper learning is necessary to further improve regret guarantees.

[314] GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich, Yanan Cai, Keegan Hines, Giorgio Severi, Blake Bullwinkel, Ahmed Salem

Main category: cs.LG

TL;DR: GRP-Obliteration is a method that uses Group Relative Policy Optimization to remove safety constraints from aligned models using just a single unlabeled prompt, achieving stronger unalignment than existing techniques while preserving utility, and generalizing to both language and image generation models.

DetailsMotivation: Safety alignment can be vulnerable to post-deployment fine-tuning attacks, but existing unalignment methods require extensive data curation and degrade model utility. The authors aim to extend practical limits of unalignment with a more efficient approach.

Method: GRP-Obliteration uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. The method requires only a single unlabeled prompt and works across language models and diffusion-based image generation systems.

Result: GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques while largely preserving model utility. It was evaluated on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models from various families.

Conclusion: The method demonstrates that safety alignment is vulnerable to efficient unalignment attacks, highlighting the need for more robust safety mechanisms beyond current post-training approaches.

Abstract: Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit on six utility benchmarks and five safety benchmarks across fifteen 7-20B parameter models, spanning instruct and reasoning models, as well as dense and MoE architectures. The evaluated model families include GPT-OSS, distilled DeepSeek, Gemma, Llama, Ministral, and Qwen.
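
GRPO's group-relative advantage is the piece that makes a single prompt usable: sample a group of completions for that one prompt, then normalize their rewards within the group, with no learned critic. A sketch of that step; the reward signal that drives unalignment is the paper's contribution and is stubbed out here.

```python
import torch

def group_relative_advantages(rewards):
    # rewards: (n_groups, group_size) scalar rewards for sampled
    # completions; GRPO normalizes by per-group mean and std.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True).clamp_min(1e-6)
    return (rewards - mean) / std

adv = group_relative_advantages(torch.randn(1, 8))  # one prompt, 8 samples
```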

[315] Swap Regret Minimization Through Response-Based Approachability

Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider

Main category: cs.LG

TL;DR: A new algorithm for minimizing linear swap regret in online optimization with improved computational efficiency and better regret bounds compared to prior work.

DetailsMotivation: Swap regret minimization is important for correlated equilibrium in games and non-manipulability against strategic adversaries. Existing efficient algorithms have suboptimal regret bounds and rely on computationally intensive methods like the ellipsoid algorithm.

Method: Leverages response-based approachability framework combined with geometric preconditioning via John ellipsoid. Provides computationally efficient algorithm that simultaneously minimizes profile swap regret.

Result: Achieves O(d^{3/2}√T) linear swap regret for general convex sets and O(d√T) for centrally symmetric sets. Establishes matching Ω(d√T) lower bound, showing optimality. Extends to polynomial dimension swap deviations.

Conclusion: Provides significantly simpler and more efficient algorithm for swap regret minimization with optimal regret bounds, unifying equilibrium computation and online learning results.

Abstract: We consider the problem of minimizing different notions of swap regret in online optimization. These forms of regret are tightly connected to correlated equilibrium concepts in games, and have been more recently shown to guarantee non-manipulability against strategic adversaries. The only computationally efficient algorithm for minimizing linear swap regret over a general convex set in $\mathbb{R}^d$ was developed recently by Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC ‘25). However, it incurs a highly suboptimal regret bound of $Ω(d^4 \sqrt{T})$ and also relies on computationally intensive calls to the ellipsoid algorithm at each iteration. In this paper, we develop a significantly simpler, computationally efficient algorithm that guarantees $O(d^{3/2} \sqrt{T})$ linear swap regret for a general convex set and $O(d \sqrt{T})$ when the set is centrally symmetric. Our approach leverages the powerful response-based approachability framework of Bernstein and Shimkin (JMLR ‘15) – previously overlooked in the line of work on swap regret minimization – combined with geometric preconditioning via the John ellipsoid. Our algorithm simultaneously minimizes profile swap regret, which was recently shown to guarantee non-manipulability. Moreover, we establish a matching information-theoretic lower bound: any learner must incur in expectation $Ω(d \sqrt{T})$ linear swap regret for large enough $T$, even when the set is centrally symmetric. This also shows that the classic algorithm of Gordon, Greenwald, and Marks (ICML ‘08) is existentially optimal for minimizing linear swap regret, although it is computationally inefficient. Finally, we extend our approach to minimize regret with respect to the set of swap deviations with polynomial dimension, unifying and strengthening recent results in equilibrium computation and online learning.

[316] PurSAMERE: Reliable Adversarial Purification via Sharpness-Aware Minimization of Expected Reconstruction Error

Vinh Hoang, Sebastian Krumscheid, Holger Rauhut, Raúl Tempone

Main category: cs.LG

TL;DR: A deterministic purification method for adversarial robustness that maps adversarial samples toward data distribution modes using score models and sharpness-aware minimization.

DetailsMotivation: To improve adversarial robustness by creating a deterministic purification approach that avoids the degradation of effective robustness seen in stochastic methods when adversaries have full system knowledge.

Method: Uses a score model trained to minimize expected reconstruction error of noise-corrupted data. For adversarial inputs, searches local neighborhood for purified samples minimizing reconstruction error, guided by sharpness-aware minimization toward flat regions of the error landscape.

Result: Demonstrates significant gains in adversarial robustness over state-of-the-art methods under strong deterministic white-box attacks.

Conclusion: Deterministic purification with score models and sharpness-aware minimization effectively improves adversarial robustness by moving samples toward data distribution modes where classifiers are more reliable.

Abstract: We propose a novel deterministic purification method to improve adversarial robustness by mapping a potentially adversarial sample toward a nearby sample that lies close to a mode of the data distribution, where classifiers are more reliable. We design the method to be deterministic to ensure reliable test accuracy and to prevent the degradation of effective robustness observed in stochastic purification approaches when the adversary has full knowledge of the system and its randomness. We employ a score model trained by minimizing the expected reconstruction error of noise-corrupted data, thereby learning the structural characteristics of the input data distribution. Given a potentially adversarial input, the method searches within its local neighborhood for a purified sample that minimizes the expected reconstruction error under noise corruption and then feeds this purified sample to the classifier. During purification, sharpness-aware minimization is used to guide the purified samples toward flat regions of the expected reconstruction error landscape, thereby enhancing robustness. We further show that, as the noise level decreases, minimizing the expected reconstruction error biases the purified sample toward local maximizers of the Gaussian-smoothed density; under additional local assumptions on the score model, we prove recovery of a local maximizer in the small-noise limit. Experimental results demonstrate significant gains in adversarial robustness over state-of-the-art methods under strong deterministic white-box attacks.
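
Conceptually, the purification loop is gradient descent on the expected reconstruction error starting from the input, with a SAM-style inner ascent step so the iterate settles in a flat region of the error landscape. A sketch under that reading, where `recon_error` is a hypothetical callable returning the score model's expected reconstruction error (in practice estimated over noise draws):

```python
import torch

def purify(x_adv, recon_error, steps=20, lr=0.05, rho=0.01):
    # Deterministic purification: descend recon_error from x_adv, using
    # sharpness-aware minimization on the input itself.
    x = x_adv.detach().clone()
    for _ in range(steps):
        x1 = x.clone().requires_grad_(True)
        g = torch.autograd.grad(recon_error(x1), x1)[0]
        # SAM: take the gradient at a perturbed point in the ascent
        # direction, then apply it to the unperturbed iterate.
        x2 = (x + rho * g / (g.norm() + 1e-12)).requires_grad_(True)
        g2 = torch.autograd.grad(recon_error(x2), x2)[0]
        x = x - lr * g2
    return x
```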

[317] Statistical Learning from Attribution Sets

Lorne Applebaum, Robert Busa-Fekete, August Y. Chen, Claudio Gentile, Tomer Koren, Aryan Mokhtari

Main category: cs.LG

TL;DR: Training conversion prediction models under privacy constraints where clicks and conversions can’t be directly linked, only to attribution sets of candidate clicks.

DetailsMotivation: Privacy-preserving browser APIs and deprecation of third-party cookies create a setting where direct links between ad clicks and conversions are unavailable, requiring new learning approaches.

Method: Construct unbiased estimator of population loss from coarse attribution signals, use Empirical Risk Minimization with generalization guarantees that scale with prior informativeness and robustness to prior estimation errors.

Result: Unbiased approach significantly outperforms common industry heuristics, especially when attribution sets are large or overlapping, as shown in empirical evaluations on standard datasets.

Conclusion: Effective conversion prediction is possible under privacy constraints using attribution sets rather than direct links, with theoretical guarantees and practical advantages over existing heuristics.

Abstract: We address the problem of training conversion prediction models in advertising domains under privacy constraints, where direct links between ad clicks and conversions are unavailable. Motivated by privacy-preserving browser APIs and the deprecation of third-party cookies, we study a setting where the learner observes a sequence of clicks and a sequence of conversions, but can only link a conversion to a set of candidate clicks (an attribution set) rather than a unique source. We formalize this as learning from attribution sets generated by an oblivious adversary equipped with a prior distribution over the candidates. Despite the lack of explicit labels, we construct an unbiased estimator of the population loss from these coarse signals via a novel approach. Leveraging this estimator, we show that Empirical Risk Minimization achieves generalization guarantees that scale with the informativeness of the prior and is also robust against estimation errors in the prior, despite complex dependencies among attribution sets. Simple empirical evaluations on standard datasets suggest our unbiased approach significantly outperforms common industry heuristics, particularly in regimes where attribution sets are large or overlapping.

[318] SOCKET: SOft Collison Kernel EsTimator for Sparse Attention

Sahil Joshi, Agniva Chowdhury, Wyatt Bellinger, Amar Kanakamedala, Ekam Singh, Hoang Anh Duy Le, Aditya Desai, Anshumali Shrivastava

Main category: cs.LG

TL;DR: SOCKET introduces a soft LSH-based sparse attention method that replaces hard bucket matching with probabilistic similarity-aware aggregation for more efficient long-context inference in LLMs.

DetailsMotivation: Attention computation dominates the cost of autoregressive decoding in large language models, especially for long contexts. While sparse attention reduces this cost by focusing on relevant tokens, existing methods using hard Locality-Sensitive Hashing produce discrete collision signals that are poorly suited for ranking tokens effectively.

Method: SOCKET (SOft Collision Kernel EsTimator) replaces hard LSH bucket matches with probabilistic, similarity-aware aggregation. It aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among true top-k tokens. This transforms LSH from a heuristic to a principled scoring kernel for sparse attention.

Result: SOCKET matches or surpasses established sparse attention baselines across multiple long-context benchmarks using diverse models. With custom CUDA kernels for scoring and Flash Decode Triton backend, it achieves up to 1.5× higher throughput than FlashAttention.

Conclusion: SOCKET provides an effective tool for long-context inference by enabling efficient token selection without ad-hoc voting mechanisms, making sparse attention more principled and mathematically grounded while maintaining performance.

Abstract: Exploiting sparsity during long-context inference is central to scaling large language models, as attention dominates the cost of autoregressive decoding. Sparse attention reduces this cost by restricting computation to a subset of tokens, but its effectiveness depends critically on efficient scoring and selection of relevant tokens at inference time. We revisit Locality-Sensitive Hashing (LSH) as a sparsification primitive and introduce SOCKET, a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation. Our key insight is that hard LSH produces discrete collision signals and is therefore poorly suited for ranking. In contrast, soft LSH aggregates graded collision evidence across hash tables, preserving the stability of relative ordering among the true top-$k$ tokens. This transformation elevates LSH from a candidate-generation heuristic to a principled and mathematically grounded scoring kernel for sparse attention. Leveraging this property, SOCKET enables efficient token selection without an ad-hoc voting mechanism, and matches or surpasses established sparse attention baselines across multiple long-context benchmarks with a diverse set of models. With a custom CUDA kernel for scoring keys and a Flash Decode Triton backend for sparse attention, SOCKET achieves up to 1.5$\times$ higher throughput than FlashAttention, making it an effective tool for long-context inference. Code is open-sourced at https://github.com/amarka8/SOCKET.
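
The soft collision kernel idea can be illustrated with SimHash: instead of retrieving only keys sharing a bucket in one table, count collisions across many tables and use the count as a graded relevance score for top-k selection. A sketch, with table and bit counts that are illustrative rather than SOCKET's tuned configuration:

```python
import torch

def soft_lsh_scores(q, K, n_tables=16, n_bits=8, seed=0):
    # q: (d,), K: (n, d). SimHash each table; a key's score is the
    # number of tables in which it collides with the query -- graded
    # collision evidence rather than a hard bucket match.
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(n_tables, n_bits, q.size(-1), generator=g)
    q_codes = (planes @ q) > 0                              # (tables, bits)
    k_codes = torch.einsum("tbd,nd->tbn", planes, K) > 0    # (tables, bits, n)
    collide = (q_codes.unsqueeze(-1) == k_codes).all(dim=1) # (tables, n)
    return collide.float().sum(dim=0)                       # soft counts

q, K = torch.randn(64), torch.randn(4096, 64)
top_keys = soft_lsh_scores(q, K).topk(32).indices   # keys to attend to
```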

[319] Toward generative machine learning for boosting ensembles of climate simulations

Parsa Gooya, Reinel Sospedra-Alfonso, Johannes Exenberger

Main category: cs.LG

TL;DR: A conditional Variational Autoencoder (cVAE) is developed to generate large climate ensembles from limited simulations, addressing computational constraints in climate modeling while preserving physical consistency and statistical properties.

DetailsMotivation: Computational constraints in climate modeling create a trade-off between generating large ensembles for robust uncertainty quantification and increasing model resolution for better dynamics capture. Generative machine learning offers a solution to alleviate these constraints.

Method: Developed a conditional Variational Autoencoder (cVAE) trained on limited climate simulations (CMIP6 historical and future scenarios from CanESM5) to generate arbitrarily large ensembles. The approach includes incorporating output noise to improve multiscale variability representation.

Result: The cVAE learns the underlying data distribution and generates physically consistent samples that reproduce realistic low and high moment statistics, including extremes. It captures realistic global teleconnection patterns even under climate conditions absent from training data.

Conclusion: cVAEs offer a mathematically transparent, interpretable, and computationally efficient framework for generating climate ensembles, though they have limitations like overly smooth outputs and spectral bias that can be mitigated with strategies like output noise incorporation.

Abstract: Accurately quantifying uncertainty in predictions and projections arising from irreducible internal climate variability is critical for informed decision making. Such uncertainty is typically assessed using ensembles produced with physics based climate models. However, computational constraints impose a trade off between generating the large ensembles required for robust uncertainty estimation and increasing model resolution to better capture fine scale dynamics. Generative machine learning offers a promising pathway to alleviate these constraints. We develop a conditional Variational Autoencoder (cVAE) trained on a limited sample of climate simulations to generate arbitrarily large ensembles. The approach is applied to output from monthly CMIP6 historical and future scenario experiments produced with the Canadian Centre for Climate Modelling and Analysis’ (CCCma’s) Earth system model CanESM5. We show that the cVAE model learns the underlying distribution of the data and generates physically consistent samples that reproduce realistic low and high moment statistics, including extremes. Compared with more sophisticated generative architectures, cVAEs offer a mathematically transparent, interpretable, and computationally efficient framework. Their simplicity leads to some limitations, such as overly smooth outputs, spectral bias, and underdispersion, that we discuss along with strategies to mitigate them. Specifically, we show that incorporating output noise improves the representation of climate relevant multiscale variability, and we propose a simple method to achieve this. Finally, we show that cVAE-enhanced ensembles capture realistic global teleconnection patterns, even under climate conditions absent from the training data.
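
A minimal conditional VAE of the kind described, conditioning on covariates `c` (e.g., scenario and time); once trained with a reconstruction term plus the KL term below, `sample` draws an arbitrarily large generated ensemble for the same conditions. Architecture sizes are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim, c_dim, z_dim=32, h=256):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h), nn.ReLU(),
                                 nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))

    def forward(self, x, c):
        mu, logvar = self.enc(torch.cat([x, c], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.dec(torch.cat([z, c], -1)), kl

    @torch.no_grad()
    def sample(self, c, n):
        # Generate an n-member ensemble for conditions c of shape (c_dim,).
        z = torch.randn(n, self.z_dim)
        return self.dec(torch.cat([z, c.expand(n, -1)], -1))
```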

[320] The Condensate Theorem: Transformers are O(n), Not $O(n^2)$

Jorge L. Ruiz Williams

Main category: cs.LG

TL;DR: Attention sparsity is a learned topological property, not architectural constraint; projecting attention onto Condensate Manifold achieves 100% output equivalence with full quadratic attention, enabling massive speedups.

DetailsMotivation: The quadratic computational complexity of attention mechanisms in transformers is a major bottleneck for scaling to long sequences. Current approaches treat this as an architectural limitation requiring approximations, but the authors hypothesize that attention sparsity emerges naturally as a learned topological property rather than being an inherent constraint.

Method: Empirical analysis of trained language models reveals attention mass concentrates on a topological manifold (Condensate Manifold). The authors prove that projecting attention onto this manifold (Anchor + Window + Dynamic Top-k) achieves lossless parity with full attention. They validate across multiple models and implement a Topological Attention kernel that maps this topology to hardware.

Result: Demonstrated bit-exact token matching on 1,500+ generated tokens across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral. Achieved 159x measured speedup at 131K tokens (3.94ms vs 628ms) and projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention.

Conclusion: The quadratic bottleneck in attention is an artifact of naive implementation rather than an inherent requirement for intelligence. Attention sparsity emerges as a learned topological property that can be exploited for massive computational savings without sacrificing accuracy.

Abstract: We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold – and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation – it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.
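
The claimed selection rule is concrete enough to sketch for a single decoding query: keep a few anchor (sink) tokens, a recent window, and a dynamic top-k of the remaining positions by score, then softmax only over the kept set. This illustrates the Anchor + Window + Dynamic Top-k projection; the paper's equivalence claim and fused kernel are not reproduced.

```python
import torch

def condensate_attention(q, K, V, n_anchor=4, window=64, k=32):
    # q: (d,), K/V: (n, d). Attend only over the selected positions.
    scores = K @ q / K.size(-1) ** 0.5
    n = scores.size(0)
    keep = torch.zeros(n, dtype=torch.bool)
    keep[:n_anchor] = True                       # anchor / sink tokens
    keep[max(0, n - window):] = True             # recent local window
    rest = scores.masked_fill(keep, float("-inf"))
    keep[rest.topk(min(k, n)).indices] = True    # dynamic top-k of the rest
    idx = keep.nonzero(as_tuple=True)[0]
    return torch.softmax(scores[idx], dim=0) @ V[idx]
```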

[321] How (Not) to Hybridize Neural and Mechanistic Models for Epidemiological Forecasting

Yiqi Su, Ray Lee, Jiaming Cui, Naren Ramakrishnan

Main category: cs.LG

TL;DR: Hybrid neural-mechanistic model for epidemiological forecasting that decomposes infection data into trend/seasonal/residual components to drive neural ODE dynamics, improving forecast accuracy under non-stationary transmission dynamics.

DetailsMotivation: Epidemiological forecasting is challenging due to non-stationary dynamics from behavior changes, waning immunity, seasonality, and interventions. Hybrid models combining mechanistic compartmental models with neural components can leverage epidemiological plausibility while capturing data-adaptive effects, but many couplings fail under partial observability and shifting dynamics.

Method: Extract multi-scale structure from observed infection series as interpretable control signals for a controlled neural ODE coupled to an epidemiological model. Decompose infections into trend, seasonal, and residual components, using these signals to drive continuous-time latent dynamics while jointly forecasting and inferring time-varying transmission, recovery, and immunity-loss rates.

Result: Across seasonal and non-seasonal settings (early outbreaks and multi-wave regimes), reduces long-horizon RMSE by 15-35%, improves peak timing error by 1-3 weeks, and lowers peak magnitude bias by up to 30% relative to strong time-series, neural ODE, and hybrid baselines, without auxiliary covariates.

Conclusion: Robust epidemiological forecasting requires making non-stationarity explicit through multi-scale decomposition of infection data, which serves as interpretable control signals for hybrid neural-mechanistic models, enabling better capture of shifting transmission dynamics.

Abstract: Epidemiological forecasting from surveillance data is a hard problem and hybridizing mechanistic compartmental models with neural models is a natural direction. The mechanistic structure helps keep trajectories epidemiologically plausible, while neural components can capture non-stationary, data-adaptive effects. In practice, however, many seemingly straightforward couplings fail under partial observability and continually shifting transmission dynamics driven by behavior, waning immunity, seasonality, and interventions. We catalog these failure modes and show that robust performance requires making non-stationarity explicit: we extract multi-scale structure from the observed infection series and use it as an interpretable control signal for a controlled neural ODE coupled to an epidemiological model. Concretely, we decompose infections into trend, seasonal, and residual components and use these signals to drive continuous-time latent dynamics while jointly forecasting and inferring time-varying transmission, recovery, and immunity-loss rates. Across seasonal and non-seasonal settings, including early outbreaks and multi-wave regimes, our approach reduces long-horizon RMSE by 15-35%, improves peak timing error by 1-3 weeks, and lowers peak magnitude bias by up to 30% relative to strong time-series, neural ODE, and hybrid baselines, without relying on auxiliary covariates.
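
The decomposition step is standard and, per the paper, the part that makes non-stationarity explicit. A sketch with STL from statsmodels on a synthetic weekly series; the controlled neural ODE these signals would drive is omitted.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
t = np.arange(4 * 52)                           # four years, weekly
infections = (200 + 0.5 * t                     # trend
              + 50 * np.sin(2 * np.pi * t / 52) # seasonality
              + rng.normal(0, 10, t.size))      # residual noise

res = STL(infections, period=52, robust=True).fit()
controls = np.stack([res.trend, res.seasonal, res.resid], axis=-1)
# `controls` (T, 3) would enter the controlled neural ODE as the
# interpretable control signal alongside the compartmental state.
```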

[322] Online Adaptive Reinforcement Learning with Echo State Networks for Non-Stationary Dynamics

Aoi Yoshimura, Gouhei Tanaka

Main category: cs.LG

TL;DR: Lightweight online RL adaptation using Reservoir Computing (Echo State Networks) for rapid adaptation to non-stationary dynamics without backpropagation or pretraining.

DetailsMotivation: RL policies trained in simulation degrade in real-world due to non-stationary dynamics. Existing methods like Domain Randomization and meta-RL require extensive pretraining, privileged information, or high computational cost, limiting real-time and edge deployment.

Method: Integrates Echo State Networks (ESNs) as adaptation module that encodes recent observation histories into latent context representation. Updates readout weights online using Recursive Least Squares (RLS) without backpropagation, pretraining, or privileged information.

Result: Significantly outperforms Domain Randomization and adaptive baselines on CartPole and HalfCheetah tasks with severe/abrupt environment changes (periodic disturbances, extreme friction variations). Achieves stable adaptation within few control steps and handles intra-episode changes without policy reset.

Conclusion: Computationally efficient and stable framework provides practical solution for online adaptation in non-stationary environments, well-suited for real-world robotic control and edge deployment.

Abstract: Reinforcement learning (RL) policies trained in simulation often suffer from severe performance degradation when deployed in real-world environments due to non-stationary dynamics. While Domain Randomization (DR) and meta-RL have been proposed to address this issue, they typically rely on extensive pretraining, privileged information, or high computational cost, limiting their applicability to real-time and edge systems. In this paper, we propose a lightweight online adaptation framework for RL based on Reservoir Computing. Specifically, we integrate an Echo State Network (ESN) as an adaptation module that encodes recent observation histories into a latent context representation, and update its readout weights online using Recursive Least Squares (RLS). This design enables rapid adaptation without backpropagation, pretraining, or access to privileged information. We evaluate the proposed method on CartPole and HalfCheetah tasks with severe and abrupt environment changes, including periodic external disturbances and extreme friction variations. Experimental results demonstrate that the proposed approach significantly outperforms DR and representative adaptive baselines under out-of-distribution dynamics, achieving stable adaptation within a few control steps. Notably, the method successfully handles intra-episode environment changes without resetting the policy. Due to its computational efficiency and stability, the proposed framework provides a practical solution for online adaptation in non-stationary environments and is well suited for real-world robotic control and edge deployment.
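
The two ingredients, an echo state reservoir and a recursive least squares readout, fit in a short NumPy sketch (hyperparameters are illustrative, not the paper's settings):

```python
import numpy as np

class OnlineESN:
    def __init__(self, n_in, n_res=200, n_out=8, rho=0.9, lam=0.999, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_res, n_res))
        # Scale spectral radius below 1 for the echo state property.
        self.W = rho * W / np.max(np.abs(np.linalg.eigvals(W)))
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        self.W_out = np.zeros((n_out, n_res))
        self.P = np.eye(n_res) * 100.0          # RLS inverse correlation
        self.x = np.zeros(n_res)
        self.lam = lam                          # forgetting factor

    def step(self, u, target=None):
        self.x = np.tanh(self.W_in @ u + self.W @ self.x)
        y = self.W_out @ self.x
        if target is not None:                  # online RLS readout update,
            k = self.P @ self.x / (self.lam + self.x @ self.P @ self.x)
            self.W_out += np.outer(target - y, k)   # no backpropagation
            self.P = (self.P - np.outer(k, self.x @ self.P)) / self.lam
        return y
```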

[323] Don’t Break the Boundary: Continual Unlearning for OOD Detection Based on Free Energy Repulsion

Ningkang Peng, Kun Shao, Jingyang Mao, Linjing Qian, Xiaoqian Peng, Xichen Yang, Yanhui Gu

Main category: cs.LG

TL;DR: TFER framework enables boundary-preserving class unlearning for OOD detection by transforming target classes into OOD samples using free energy repulsion, maintaining model stability without catastrophic forgetting.

DetailsMotivation: There's a fundamental contradiction between robust OOD detection (requiring static data manifolds) and machine unlearning (which disrupts these structures). Current approaches fail to preserve OOD detection capabilities while removing target classes.

Method: Proposes TFER framework with Push-Pull game mechanism: anchors retained classes in low-energy ID manifold (pull) while expelling forgotten classes to high-energy OOD regions using free energy repulsion force. Uses parameter-efficient fine-tuning instead of full retraining.

Result: TFER achieves precise unlearning while preserving discriminative performance on remaining classes and external OOD data. Demonstrates structural stability and resistance to catastrophic forgetting, showing potential for continual unlearning tasks.

Conclusion: TFER resolves the geometric contradiction between OOD detection and unlearning by transforming target classes into OOD samples, providing a stable framework for boundary-preserving class unlearning without catastrophic performance loss.

Abstract: Deploying trustworthy AI in open-world environments faces a dual challenge: the necessity for robust Out-of-Distribution (OOD) detection to ensure system safety, and the demand for flexible machine unlearning to satisfy privacy compliance and model rectification. However, this objective encounters a fundamental geometric contradiction: current OOD detectors rely on a static and compact data manifold, whereas traditional classification-oriented unlearning methods disrupt this delicate structure, leading to a catastrophic loss of the model’s capability to discriminate anomalies while erasing target classes. To resolve this dilemma, we first define the problem of boundary-preserving class unlearning and propose a pivotal conceptual shift: in the context of OOD detection, effective unlearning is mathematically equivalent to transforming the target class into OOD samples. Based on this, we propose the TFER (Total Free Energy Repulsion) framework. Inspired by the free energy principle, TFER constructs a novel Push-Pull game mechanism: it anchors retained classes within a low-energy ID manifold through a pull mechanism, while actively expelling forgotten classes to high-energy OOD regions using a free energy repulsion force. This approach is implemented via parameter-efficient fine-tuning, circumventing the prohibitive cost of full retraining. Extensive experiments demonstrate that TFER achieves precise unlearning while maximally preserving the model’s discriminative performance on remaining classes and external OOD data. More importantly, our study reveals that the unique Push-Pull equilibrium of TFER endows the model with inherent structural stability, allowing it to effectively resist catastrophic forgetting without complex additional constraints, thereby demonstrating exceptional potential in continual unlearning tasks.
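
The push-pull mechanism can be sketched as an energy-margin loss in the style of energy-based OOD training: free energy E(x) = -logsumexp(logits), pulled below a low margin for retained data and pushed above a high margin for the forget class. Margins and weighting below are illustrative; TFER's exact objective and its parameter-efficient fine-tuning setup are not reproduced.

```python
import torch
import torch.nn.functional as F

def tfer_style_loss(logits_retain, y_retain, logits_forget,
                    m_in=-25.0, m_out=-7.0, w=0.1):
    # Free energy E(x) = -logsumexp(logits): low on the ID manifold.
    e_r = -torch.logsumexp(logits_retain, dim=-1)
    e_f = -torch.logsumexp(logits_forget, dim=-1)
    pull = F.relu(e_r - m_in).pow(2).mean()   # pull retained classes to low energy
    push = F.relu(m_out - e_f).pow(2).mean()  # push forgotten class to high energy
    ce = F.cross_entropy(logits_retain, y_retain)
    return ce + w * (pull + push)
```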

[324] Adversarial Learning in Games with Bandit Feedback: Logarithmic Pure-Strategy Maximin Regret

Shinji Ito, Haipeng Luo, Arnab Maiti, Taira Tsuchiya, Yue Wu

Main category: cs.LG

TL;DR: The paper studies adversarial learning in zero-sum games under bandit feedback, focusing on minimizing regret against maximin pure strategies with logarithmic instance-dependent bounds.

DetailsMotivation: Real-world applications often require learning against unknown opponents with limited bandit feedback (only observing payoff of chosen action), where traditional external regret minimization faces Ω(√T) lower bounds. The paper aims to overcome this barrier by focusing on pure-strategy maximin regret.

Method: Analyzes two bandit feedback models: uninformed (only reward revealed) and informed (reward and opponent’s action revealed). For uninformed setting, uses Tsallis-INF algorithm; for informed setting, introduces Maximin-UCB. Generalizes to bilinear games with Tsallis-FTRL-SPM and Maximin-LinUCB.

Result: Achieves O(c log T) instance-dependent regret for uninformed setting with Tsallis-INF, proves information-theoretic lower bound showing c-dependence is necessary. For informed setting, Maximin-UCB achieves O(c’ log T) with potentially smaller c’. Results generalize to bilinear games with similar logarithmic bounds.

Conclusion: The paper provides theoretical guarantees for adversarial learning in zero-sum games under bandit feedback, achieving logarithmic instance-dependent regret bounds and showing fundamental limits through information-theoretic lower bounds.

Abstract: Learning to play zero-sum games is a fundamental problem in game theory and machine learning. While significant progress has been made in minimizing external regret in the self-play settings or with full-information feedback, real-world applications often force learners to play against unknown, arbitrary opponents and restrict learners to bandit feedback where only the payoff of the realized action is observable. In such challenging settings, it is well-known that $\Omega(\sqrt{T})$ external regret is unavoidable (where $T$ is the number of rounds). To overcome this barrier, we investigate adversarial learning in zero-sum games under bandit feedback, aiming to minimize the deficit against the maximin pure strategy – a metric we term Pure-Strategy Maximin Regret. We analyze this problem under two bandit feedback models: uninformed (only the realized reward is revealed) and informed (both the reward and the opponent’s action are revealed). For uninformed bandit learning of normal-form games, we show that the Tsallis-INF algorithm achieves $O(c \log T)$ instance-dependent regret with a game-dependent parameter $c$. Crucially, we prove an information-theoretic lower bound showing that the dependence on $c$ is necessary. To overcome this hardness, we turn to the informed setting and introduce Maximin-UCB, which obtains another regret bound of the form $O(c’ \log T)$ for a different game-dependent parameter $c’$ that could potentially be much smaller than $c$. Finally, we generalize both results to bilinear games over an arbitrary, large action set, proposing Tsallis-FTRL-SPM and Maximin-LinUCB for the uninformed and informed setting respectively and establishing similar game-dependent logarithmic regret bounds.

[325] Enhance and Reuse: A Dual-Mechanism Approach to Boost Deep Forest for Label Distribution Learning

Jia-Le Xu, Shen-Huan Lyu, Yu-Nian Wang, Ning Chen, Zhihao Qu, Bin Tang, Baoliu Ye

Main category: cs.LG

TL;DR: Enhanced and Reused Feature Deep Forest (ERDF) improves label distribution learning by incorporating label correlation through feature enhancement and measure-aware feature reuse mechanisms within a deep forest framework.

DetailsMotivation: Existing Deep Forest methods for label distribution learning fail to effectively utilize label correlations, which is crucial for predicting the degree of correlation between samples and labels. The authors aim to address this limitation by developing mechanisms to exploit label correlations and ensure training stability.

Method: ERDF introduces two key mechanisms: 1) Feature enhancement exploiting label correlation - uses label correlations to enhance original features, providing more comprehensive information for LDL tasks; 2) Measure-aware feature reuse - reuses features for samples that perform worse than previous layers on validation sets to ensure training stability and prevent noise propagation.

Result: The method outperforms other comparison algorithms on six evaluation metrics, demonstrating improved performance in label distribution learning tasks.

Conclusion: ERDF successfully addresses the limitation of existing Deep Forest methods for LDL by effectively utilizing label correlations through feature enhancement and implementing stability mechanisms through feature reuse, resulting in superior performance.

Abstract: Label distribution learning (LDL) requires the learner to predict the degree of correlation between each sample and each label. To achieve this, a crucial task during learning is to leverage the correlation among labels. Deep Forest (DF) is a deep learning framework based on tree ensembles, whose training phase does not rely on backpropagation. DF performs in-model feature transform using the prediction of each layer and achieves competitive performance on many tasks. However, its exploration in the field of LDL is still in its infancy. The few existing methods that apply DF to the field of LDL do not have effective ways to utilize the correlation among labels. Therefore, we propose a method named Enhanced and Reused Feature Deep Forest (ERDF). It mainly contains two mechanisms: feature enhancement exploiting label correlation and measure-aware feature reuse. The first one is to utilize the correlation among labels to enhance the original features, enabling the samples to acquire more comprehensive information for the task of LDL. The second one performs a reuse operation on the features of samples that perform worse than the previous layer on the validation set, in order to ensure the stability of the training process. This kind of Enhance-Reuse pattern not only enables samples to enrich their features but also validates the effectiveness of their new features and conducts a reuse process to prevent the noise from spreading further. Experiments show that our method outperforms other comparison algorithms on six evaluation metrics.

[326] Evaluating LLM-persona Generated Distributions for Decision-making

Jackie Baek, Yunhan Chen, Ziyu Chi, Will Ma

Main category: cs.LG

TL;DR: LLM-generated distributions for decision-making (LLM-SAA) show practical utility in low-data regimes, but traditional distribution metrics can be misleading for evaluating decision quality.

DetailsMotivation: To evaluate how well LLM-generated distributions support downstream decision-making tasks, particularly in business contexts like pricing, assortment optimization, and inventory management, where LLMs can simulate consumer preferences and demand.

Method: Proposes LLM-SAA approach where LLMs construct estimated distributions from prompts, then optimize decisions under those distributions. Evaluates using three canonical decision-making problems: assortment optimization, pricing, and newsvendor problems.
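
The decision-optimization half of LLM-SAA is standard sample average approximation; for the newsvendor instance it collapses to a critical-ratio quantile of the sampled demand. A minimal sketch, assuming the LLM's simulated valuations have already been parsed into a numeric array:

```python
import numpy as np

def newsvendor_saa(demand_samples, price, cost):
    """SAA order quantity: the critical-ratio quantile of the empirical
    demand distribution. In LLM-SAA the samples would come from an LLM
    prompted to simulate consumer demand; any array works the same."""
    underage = price - cost        # profit lost per unit ordered too few
    overage = cost                 # cost sunk per unit ordered too many
    return np.quantile(demand_samples, underage / (underage + overage))

# order_qty = newsvendor_saa(llm_demand_draws, price=10.0, cost=4.0)
```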

Result: LLM-generated distributions are practically useful, especially in low-data regimes. However, decision-agnostic metrics like Wasserstein distance can be misleading when evaluating distributions for decision-making purposes.

Conclusion: LLM-generated distributions have practical value for decision-making, particularly when data is scarce, but evaluation metrics should be decision-aware rather than relying solely on distribution similarity measures.

Abstract: LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, in which an LLM is used to construct an estimated distribution and the decision is then optimized under that distribution. In this paper, we study metrics to evaluate the quality of these LLM-generated distributions, based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and newsvendor) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as Wasserstein distance can be misleading when evaluating these distributions for decision-making.

[327] Training Data Selection with Gradient Orthogonality for Efficient Domain Adaptation

Xiyang Zhang, Yuanhe Tian, Hongzhi Wang, Yan Song

Main category: cs.LG

TL;DR: OGS is a data selection method that uses reinforcement learning to identify training samples with gradients orthogonal to general knowledge, preventing catastrophic forgetting during domain-specific fine-tuning of LLMs.

DetailsMotivation: Fine-tuning LLMs for specialized domains causes catastrophic forgetting - loss of general reasoning capabilities. Existing methods either use computationally expensive gradient projection or efficient but blind data selection that doesn't address gradient conflicts.

Method: Orthogonal Gradient Selection (OGS) shifts gradient projection insights to data selection stage. Uses lightweight Navigator model with reinforcement learning to dynamically select training samples whose gradients are orthogonal to a general-knowledge anchor, ensuring safe updates without optimizer modifications.
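
The Navigator model and RL scoring are beyond a few lines, but the orthogonality test at the heart of the selection can be sketched directly; the cosine threshold and flattened per-sample gradients below are illustrative assumptions, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def select_orthogonal(per_sample_grads, anchor_grad, tol=0.1):
    """Keep samples whose gradient is near-orthogonal to the anchor.
    per_sample_grads: (n, d) flattened per-sample gradients;
    anchor_grad: (d,) gradient of a general-knowledge loss. A small
    |cosine| means the update neither fights nor overwrites the anchor."""
    cos = F.normalize(per_sample_grads, dim=1) @ F.normalize(anchor_grad, dim=0)
    return (cos.abs() < tol).nonzero(as_tuple=True)[0]
```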

Result: Experiments across medical, legal, and financial domains show OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or enhancing performance on general tasks like GSM8K.

Conclusion: OGS harmonizes domain performance, general capability retention, and training efficiency through data-centric orthogonal gradient selection, offering a practical solution to catastrophic forgetting in domain-specific LLM fine-tuning.

Abstract: Fine-tuning large language models (LLMs) for specialized domains often necessitates a trade-off between acquiring domain expertise and retaining general reasoning capabilities, a phenomenon known as catastrophic forgetting. Existing remedies face a dichotomy: gradient surgery methods offer geometric safety but incur prohibitive computational costs via online projections, while efficient data selection approaches reduce overhead but remain blind to conflict-inducing gradient directions. In this paper, we propose Orthogonal Gradient Selection (OGS), a data-centric method that harmonizes domain performance, general capability retention, and training efficiency. OGS shifts the geometric insights of gradient projection from the optimizer to the data selection stage by treating data selection as a constrained decision-making process. By leveraging a lightweight Navigator model and reinforcement learning techniques, OGS dynamically identifies training samples whose gradients are orthogonal to a general-knowledge anchor. This approach ensures naturally safe updates for target models without modifying the optimizer or incurring runtime projection costs. Experiments across medical, legal, and financial domains demonstrate that OGS achieves excellent results, significantly improving domain performance and training efficiency while maintaining or even enhancing performance on general tasks such as GSM8K.

[328] Uniform Spectral Growth and Convergence of Muon in LoRA-Style Matrix Factorization

Changmin Kang, Jihun Yun, Baekrok Shin, Yeseul Cho, Chulhee Yun

Main category: cs.LG

TL;DR: Spectral gradient descent in LoRA fine-tuning shows uniform singular value growth across spectrum, contrasting with standard gradient methods where larger singular values learn first.

DetailsMotivation: To understand the dynamics of spectral gradient descent methods (like Muon) in LoRA fine-tuning of LLMs, where empirical observations show uniform growth of singular values despite orthogonalization on separate factors.

Method: Analyze spectral gradient flow (continuous-time analogue) in a simplified LoRA-style matrix factorization setting, proving theoretical properties of singular value dynamics and convergence.
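
The defining operation behind SpecGD is replacing a matrix gradient by its polar (semi-orthogonal) factor, which equalizes all singular values of the update; a minimal sketch via SVD (Muon approximates this factor with Newton-Schulz iterations rather than an exact SVD):

```python
import numpy as np

def orthogonalize(grad):
    """SpecGD update direction: G = U S V^T -> U V^T, i.e. all singular
    values of the step set to one (Muon approximates this polar factor
    with Newton-Schulz iterations instead of an exact SVD)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

# LoRA setting: the two factors are orthogonalized separately, e.g.
# A -= lr * orthogonalize(grad_A);  B -= lr * orthogonalize(grad_B)
```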

Result: Proves “equal-rate” dynamics where all singular values grow at equal rates, smaller singular values reach targets earlier than larger ones, and global convergence from almost all initializations with bounded factor norms.

Conclusion: Spectral gradient methods in LoRA exhibit fundamentally different learning dynamics from standard gradient methods, with theoretical guarantees for convergence and uniform spectral learning.

Abstract: Spectral gradient descent (SpecGD) orthogonalizes the matrix parameter updates and has inspired practical optimizers such as Muon. They often perform well in large language model (LLM) training, but their dynamics remain poorly understood. In the low-rank adaptation (LoRA) setting, where weight updates are parameterized as a product of two low-rank factors, we find a distinctive spectral phenomenon under Muon in LoRA fine-tuning of LLMs: singular values of the LoRA product show near-uniform growth across the spectrum, despite orthogonalization being performed on the two factors separately. Motivated by this observation, we analyze spectral gradient flow (SpecGF)-a continuous-time analogue of SpecGD-in a simplified LoRA-style matrix factorization setting and prove “equal-rate” dynamics: all singular values grow at equal rates up to small deviations. Consequently, smaller singular values attain their target values earlier than larger ones, sharply contrasting with the largest-first stepwise learning observed in standard gradient flow. Moreover, we prove that SpecGF in our setting converges to global minima from almost all initializations, provided the factor norms remain bounded; with $\ell_2$ regularization, we obtain global convergence. Lastly, we corroborate our theory with experiments in the same setting.

[329] Generating High-quality Privacy-preserving Synthetic Data

David Yavo, Richard Khoury, Christophe Pere, Sadoune Ait Kaci Azzou

Main category: cs.LG

TL;DR: A post-processing framework for synthetic tabular data that improves distributional fidelity and privacy through mode patching and k-NN filtering, while maintaining downstream utility.

DetailsMotivation: Synthetic tabular data needs to balance distributional fidelity, downstream utility, and privacy protection. Current synthetic data generators often struggle with underrepresented categories and privacy risks from synthetic records too close to real data.

Method: Two-step post-processing framework: 1) Mode patching repairs missing/underrepresented categories while preserving learned dependencies; 2) k-NN filter replaces synthetic records too close to real data points, enforcing minimum distance between real and synthetic samples.
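
The second step is a one-call nearest-neighbor query. Here is a minimal sketch that drops (rather than resamples, as the paper does) synthetic rows sitting too close to a real record, assuming pre-scaled numeric features:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_privacy_filter(real, synth, threshold=0.3):
    """Flag synthetic rows within `threshold` of any real record and
    drop them (the paper replaces them by resampling; dropping keeps
    the sketch short). Assumes pre-scaled numeric feature matrices."""
    dist, _ = NearestNeighbors(n_neighbors=1).fit(real).kneighbors(synth)
    return synth[dist[:, 0] >= threshold]
```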

Result: With thresholds 0.2-0.35, reduces categorical distribution divergence by up to 36%, improves pairwise dependence preservation by 10-14%, keeps downstream predictive performance within ~1% of baseline, improves distance-based privacy indicators, and maintains similar attribute inference attack success rates.

Conclusion: The framework provides practical guidance for improving synthetic tabular data quality and empirical privacy while complementing formal differential privacy approaches, offering a model-agnostic solution applicable to various generative models.

Abstract: Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode-patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.

[330] Near-Optimal Regret for Distributed Adversarial Bandits: A Black-Box Approach

Hao Qiu, Mengxiao Zhang, Nicolò Cesa-Bianchi

Main category: cs.LG

TL;DR: Distributed adversarial bandits with N agents minimizing global average loss using only local observations, achieving optimal regret bounds with gossip communication.

DetailsMotivation: To study distributed multi-agent bandit problems where agents must cooperate with limited communication (gossip) while only observing their own local losses, aiming to minimize global regret.

Method: Novel black-box reduction to bandits with delayed feedback, using gossip communication between agents. Extends to distributed linear bandits using volumetric spanners with O(d) communication cost.
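
The gossip primitive the reduction relies on is a single matrix-weighted averaging step per round; a minimal sketch, with the doubly stochastic matrix W assumed given:

```python
import numpy as np

def gossip_round(estimates, W):
    """One gossip step: every agent replaces its local loss-estimate
    vector with the W-weighted average of its neighbors' vectors.
    W: (n_agents, n_agents) doubly stochastic communication matrix,
    whose spectral gap rho governs the regret; estimates: (n_agents, K)."""
    return W @ estimates
```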

Result: Proves minimax regret of Θ̃(√((ρ^{-1/2}+K/N)T)), significantly improving previous bounds. Shows decomposition into communication cost ρ^{-1/4}√T and bandit cost √(KT/N). Extends to linear bandits with Õ(√((ρ^{-1/2}+1/N)dT)) regret.

Conclusion: Develops optimal algorithms for distributed adversarial bandits with gossip communication, providing tight regret bounds and demonstrating versatility through extensions to first-order and linear bandit settings.

Abstract: We study distributed adversarial bandits, where $N$ agents cooperate to minimize the global average loss while observing only their own local losses. We show that the minimax regret for this problem is $\tilde{\Theta}(\sqrt{(\rho^{-1/2}+K/N)T})$, where $T$ is the horizon, $K$ is the number of actions, and $\rho$ is the spectral gap of the communication matrix. Our algorithm, based on a novel black-box reduction to bandits with delayed feedback, requires agents to communicate only through gossip. It achieves an upper bound that significantly improves over the previous best bound $\tilde{O}(\rho^{-1/3}(KT)^{2/3})$ of Yi and Vojnovic (2023). We complement this result with a matching lower bound, showing that the problem’s difficulty decomposes into a communication cost $\rho^{-1/4}\sqrt{T}$ and a bandit cost $\sqrt{KT/N}$. We further demonstrate the versatility of our approach by deriving first-order and best-of-both-worlds bounds in the distributed adversarial setting. Finally, we extend our framework to distributed linear bandits in $\mathbb{R}^d$, obtaining a regret bound of $\tilde{O}(\sqrt{(\rho^{-1/2}+1/N)dT})$, achieved with only $O(d)$ communication cost per agent and per round via a volumetric spanner.

[331] EEG Emotion Classification Using an Enhanced Transformer-CNN-BiLSTM Architecture with Dual Attention Mechanisms

S M Rakib Ul Karim, Wenyi Lu, Diponkor Bala, Rownak Ara Rasul, Sean Goggins

Main category: cs.LG

TL;DR: Hybrid deep learning model combining CNN, bidirectional RNN, and self-attention improves EEG-based emotion recognition with state-of-the-art performance on three emotional states.

DetailsMotivation: EEG-based emotion recognition is challenging due to high-dimensional, noisy, and subject-dependent signals, but is critical for affective computing and decision-support systems.

Method: Enhanced hybrid model integrating convolutional feature extraction, bidirectional temporal modeling, self-attention mechanisms, and regularization strategies to mitigate overfitting.
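
A skeleton of the CNN to BiLSTM to self-attention stack the summary describes; the paper's exact layer sizes and dual-attention wiring are not given here, so every dimension below is a placeholder:

```python
import torch
import torch.nn as nn

class HybridEEGNet(nn.Module):
    """Illustrative CNN -> BiLSTM -> self-attention classifier for
    3-class EEG emotion recognition; all sizes are placeholders."""
    def __init__(self, n_channels=62, n_classes=3, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(0.3))
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time, hidden)
        h, _ = self.lstm(h)                    # (batch, time, 2*hidden)
        h, _ = self.attn(h, h, h)              # self-attention over time
        return self.head(h.mean(dim=1))        # temporal average pooling
```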

Result: Achieves state-of-the-art classification performance on three emotional states (neutral, positive, negative), significantly outperforming classical ML and neural baselines with robust cross-validation results.

Conclusion: Carefully designed hybrid architectures can effectively balance predictive accuracy, robustness, and interpretability in EEG-based emotion recognition, with implications for applied affective computing.

Abstract: Electroencephalography (EEG)-based emotion recognition plays a critical role in affective computing and emerging decision-support systems, yet remains challenging due to high-dimensional, noisy, and subject-dependent signals. This study investigates whether hybrid deep learning architectures that integrate convolutional, recurrent, and attention-based components can improve emotion classification performance and robustness in EEG data. We propose an enhanced hybrid model that combines convolutional feature extraction, bidirectional temporal modeling, and self-attention mechanisms with regularization strategies to mitigate overfitting. Experiments conducted on a publicly available EEG dataset spanning three emotional states (neutral, positive, and negative) demonstrate that the proposed approach achieves state-of-the-art classification performance, significantly outperforming classical machine learning and neural baselines. Statistical tests confirm the robustness of these performance gains under cross-validation. Feature-level analyses further reveal that covariance-based EEG features contribute most strongly to emotion discrimination, highlighting the importance of inter-channel relationships in affective modeling. These findings suggest that carefully designed hybrid architectures can effectively balance predictive accuracy, robustness, and interpretability in EEG-based emotion recognition, with implications for applied affective computing and human-centered intelligent systems.

[332] Adaptive Protein Tokenization

Rohit Dilip, Ayush Varshney, David Van Valen

Main category: cs.LG

TL;DR: Global tokenization method for protein structures that uses successive tokens with increasing detail levels, outperforming local tokenization approaches on reconstruction, generation, and representation tasks.

DetailsMotivation: Existing protein structure tokenizers use local neighborhood pooling which limits performance on generative and representation tasks. There's a need for global tokenization that can handle error accumulation, provide better embeddings, and allow task-specific adaptation.

Method: Proposes global tokenization where successive tokens contribute increasing levels of detail to a global representation, enabling adaptive tokens with information content-based inference criteria.

Result: Matches or outperforms existing local tokenization models on reconstruction, generative, and representation tasks. Enables zero-shot protein shrinking and affinity maturation, with improved designability and CATH classification performance.

Conclusion: Global tokenization resolves issues with local protein tokenization, providing better generative modeling, representation learning, and enabling novel applications like protein shrinking and affinity maturation.

Abstract: Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence’s information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.

[333] Beyond Code Contributions: How Network Position, Temporal Bursts, and Code Review Activities Shape Contributor Influence in Large-Scale Open Source Ecosystems

S M Rakib Ul Karim, Wenyi Lu, Sean Goggins

Main category: cs.LG

TL;DR: Analysis of open source contributor networks using graph neural networks and temporal analysis reveals power-law influence distributions, five distinct contributor roles, and the critical importance of Bridge contributors for network cohesion.

DetailsMotivation: To understand the complex network dynamics of open source software contributors and how their interactions drive innovation and sustainability, with a focus on identifying key contributor roles and their impact on network health.

Method: Used advanced graph neural networks and temporal network analysis on 25 years of data from Cloud Native Computing Foundation ecosystem. Applied GPU-accelerated PageRank, betweenness centrality, custom LSTM models, and structural integrity simulations to analyze thousands of contributors across hundreds of repositories.
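
On CPU the two centralities reduce to standard library calls (the study uses GPU-accelerated implementations); a minimal sketch:

```python
import networkx as nx

def influence_metrics(edges):
    """edges: iterable of (contributor_a, contributor_b, weight) triples.
    Returns the two centralities the role taxonomy is built from."""
    g = nx.Graph()
    g.add_weighted_edges_from(edges)
    pagerank = nx.pagerank(g, weight="weight")   # global influence
    # unweighted betweenness for simplicity: weighted betweenness treats
    # weights as distances, which interaction counts are not
    betweenness = nx.betweenness_centrality(g)   # bridging positions
    return pagerank, betweenness
```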

Result: Found strong power-law distributions in influence (top 1% control substantial influence), identified five distinct contributor roles (Core, Bridge, Connector, Regular, Peripheral), revealed significant correlations between action types and influence, and showed Bridge contributors have disproportionate impact on network cohesion when removed.

Conclusion: OSS networks exhibit structured influence hierarchies with critical roles, particularly Bridge contributors who are essential for network cohesion. Findings provide empirical evidence for strategic contributor retention policies and actionable community health metrics.

Abstract: Open source software (OSS) projects rely on complex networks of contributors whose interactions drive innovation and sustainability. This study presents a comprehensive analysis of OSS contributor networks using advanced graph neural networks and temporal network analysis on data spanning 25 years from the Cloud Native Computing Foundation ecosystem, encompassing sandbox, incubating, and graduated projects. Our analysis of thousands of contributors across hundreds of repositories reveals that OSS networks exhibit strong power-law distributions in influence, with the top 1% of contributors controlling a substantial portion of network influence. Using GPU-accelerated PageRank, betweenness centrality, and custom LSTM models, we identify five distinct contributor roles: Core, Bridge, Connector, Regular, and Peripheral, each with unique network positions and structural importance. Statistical analysis reveals significant correlations between specific action types (commits, pull requests, issues) and contributor influence, with multiple regression models explaining substantial variance in influence metrics. Temporal analysis shows that network density, clustering coefficients, and modularity exhibit statistically significant temporal trends, with distinct regime changes coinciding with major project milestones. Structural integrity simulations show that Bridge contributors, despite representing a small fraction of the network, have a disproportionate impact on network cohesion when removed. Our findings provide empirical evidence for strategic contributor retention policies and offer actionable insights into community health metrics.

[334] Reclaiming First Principles: A Differentiable Framework for Conceptual Hydrologic Models

Jasper A. Vrugt, Jonathan M. Frame, Ethan Bollman

Main category: cs.LG

TL;DR: A fully analytic framework for differentiable hydrologic modeling using exact parameter sensitivities via sensitivity equations, enabling efficient gradient-based calibration without numerical differentiation or autodiff overhead.

DetailsMotivation: Conceptual hydrologic models have slow and numerically fragile calibration; gradient-based methods using finite-difference or autodiff frameworks are computationally demanding, introduce errors/instabilities, and have substantial overhead, especially for ODE systems in watershed modeling.

Method: Augment governing ODE system with sensitivity equations to jointly evolve model states and Jacobian matrix with respect to all parameters, providing fully analytic gradients for various loss functions including hydrologic performance metrics.
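
The augmentation trick on a one-parameter toy model (mine, not the paper's): for a linear reservoir dS/dt = P(t) - kS, the sensitivity s = dS/dk obeys ds/dt = -S - ks, so integrating the pair yields an exact gradient of any squared-error loss with no autodiff:

```python
import numpy as np
from scipy.integrate import solve_ivp

def augmented_rhs(t, y, k, precip):
    """Toy reservoir dS/dt = P(t) - k*S plus its sensitivity s = dS/dk,
    which obeys ds/dt = -S - k*s (differentiate the ODE w.r.t. k)."""
    S, s = y
    return [precip(t) - k * S, -S - k * s]

def loss_and_gradient(k, times, observed, precip):
    """Exact dL/dk for L = sum (S(t_i) - obs_i)^2, no autodiff needed."""
    sol = solve_ivp(augmented_rhs, (times[0], times[-1]), [0.0, 0.0],
                    t_eval=times, args=(k, precip))
    S, s = sol.y
    resid = S - observed
    return np.sum(resid ** 2), np.sum(2.0 * resid * s)
```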

Result: Eliminates step-size dependence and noise of numerical differentiation, avoids instability of adjoint methods and overhead of autodiff toolchains, producing deterministic, physically interpretable gradients for gradient-based optimizers.

Conclusion: Enables rapid, stable, and transparent gradient-based calibration of conceptual hydrologic models without reliance on external autodiff libraries, unlocking full potential of differentiable modeling in hydrology.

Abstract: Conceptual hydrologic models remain the cornerstone of rainfall-runoff modeling, yet their calibration is often slow and numerically fragile. Most gradient-based parameter estimation methods rely on finite-difference approximations or automatic differentiation frameworks (e.g., JAX, PyTorch and TensorFlow), which are computationally demanding and introduce truncation errors, solver instabilities, and substantial overhead. These limitations are particularly acute for the ODE systems of conceptual watershed models. Here we introduce a fully analytic and computationally efficient framework for differentiable hydrologic modeling based on exact parameter sensitivities. By augmenting the governing ODE system with sensitivity equations, we jointly evolve the model states and the Jacobian matrix with respect to all parameters. This Jacobian then provides fully analytic gradient vectors for any differentiable loss function. These include classical objective functions such as the sum of absolute and squared residuals, widely used hydrologic performance metrics such as the Nash-Sutcliffe and Kling-Gupta efficiencies, robust loss functions that down-weight extreme events, and hydrograph-based functionals such as flow-duration and recession curves. The analytic sensitivities eliminate the step-size dependence and noise inherent to numerical differentiation, while avoiding the instability of adjoint methods and the overhead of modern machine-learning autodiff toolchains. The resulting gradients are deterministic, physically interpretable, and straightforward to embed in gradient-based optimizers. Overall, this work enables rapid, stable, and transparent gradient-based calibration of conceptual hydrologic models, unlocking the full potential of differentiable modeling without reliance on external, opaque, or CPU-intensive automatic-differentiation libraries.

[335] Is Gradient Ascent Really Necessary? Memorize to Forget for Machine Unlearning

Zhuo Huang, Qizhou Wang, Ziming Hong, Shanshan Ye, Bo Han, Tongliang Liu

Main category: cs.LG

TL;DR: Proposes model extrapolation as a stable alternative to gradient ascent for machine unlearning, avoiding catastrophic collapse by training a memorization model first then extrapolating to obtain the forget model.

DetailsMotivation: Machine unlearning is critical for ethical and safe AI to protect sensitive, private, and copyrighted knowledge, but current gradient ascent approaches suffer from catastrophic collapse leading to serious performance degradation.

Method: Uses model extrapolation instead of gradient ascent: 1) Use original model as reference, 2) Train it further to memorize undesired data while keeping prediction consistency on retained data to get memorization model, 3) Obtain forget model via extrapolation from memorization model to reference model.
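
Once the memorization model is trained, the forget model is pure weight arithmetic; a minimal sketch where the extrapolation coefficient alpha is a hypothetical knob:

```python
def extrapolate_forget_model(ref_state, mem_state, alpha=1.0):
    """theta_forget = theta_ref + alpha * (theta_ref - theta_mem):
    step past the reference along the direction that undoes the
    memorization of the forget set; alpha is a hypothetical knob."""
    return {name: ref + alpha * (ref - mem_state[name])
            for name, ref in ref_state.items()}

# model.load_state_dict(extrapolate_forget_model(
#     reference_model.state_dict(), memorization_model.state_dict()))
```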

Result: The approach stabilizes the machine unlearning process, avoids catastrophic collapse, is simple and efficient to implement, and effectively converges throughout training to achieve improved unlearning performance.

Conclusion: Model extrapolation provides a stable and effective alternative to gradient ascent for machine unlearning, successfully addressing the catastrophic collapse problem while maintaining performance on general tasks.

Abstract: For ethical and safe AI, machine unlearning rises as a critical topic aiming to protect sensitive, private, and copyrighted knowledge from misuse. To achieve this goal, it is common to conduct gradient ascent (GA) to reverse the training on undesired data. However, such a reversal is prone to catastrophic collapse, which leads to serious performance degradation in general tasks. As a solution, we propose model extrapolation as an alternative to GA, which reaches the counterpart direction in the hypothesis space from one model given another reference model. Therefore, we leverage the original model as the reference, further train it to memorize undesired data while keeping prediction consistency on the rest retained data, to obtain a memorization model. Counterfactual as it might sound, a forget model can be obtained via extrapolation from the memorization model to the reference model. Hence, we avoid directly acquiring the forget model using GA, but proceed with gradient descent for the memorization model, which successfully stabilizes the machine unlearning process. Our model extrapolation is simple and efficient to implement, and it can also effectively converge throughout training to achieve improved unlearning performance.

[336] Principle-Evolvable Scientific Discovery via Uncertainty Minimization

Yingming Pu, Tao Lin, Hongyu Chen

Main category: cs.LG

TL;DR: PiEvo is a principle-evolvable framework for LLM-based scientific agents that evolves scientific principles rather than searching hypotheses, using Bayesian optimization over an expanding principle space to improve discovery efficiency and novelty.

DetailsMotivation: Current LLM-based scientific agents suffer from inefficiencies due to fixed initial priors and static hypothesis spaces, which restrict novel discovery and cause computational waste when baseline theories fail.

Method: PiEvo treats scientific discovery as Bayesian optimization over an expanding principle space, integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism to autonomously refine theoretical worldviews.

Result: Achieves 90.81%–93.15% average solution quality (a 29.7%–31.1% improvement over SOTA) and an 83.3% speedup in convergence via reduced sample complexity, while maintaining robust performance across diverse scientific domains and LLM backbones.

Conclusion: Shifting from hypothesis search to principle evolution enables more efficient and novel scientific discovery by LLM-based agents, with significant improvements in solution quality and computational efficiency.

Abstract: Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of 90.81%–93.15%, representing a 29.7%–31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence via significantly reduced sample complexity from optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones.

[337] BrokenBind: Universal Modality Exploration beyond Dataset Boundaries

Zhuo Huang, Runnan Chen, Bo Han, Gang Niu, Masashi Sugiyama, Tongliang Liu

Main category: cs.LG

TL;DR: BrokenBind enables binding modalities from different datasets by generating pseudo embeddings to fill missing modalities, overcoming dataset limitations in multimodal learning.

DetailsMotivation: Existing multimodal methods are limited to modalities present in the same dataset, making them inflexible and costly for real-world applications where modalities come from different sources.

Method: Uses multiple datasets with shared modality to capture relationships between modalities, generates pseudo embeddings for missing modalities, enabling binding of any two modalities regardless of dataset limitations.
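
One way to picture the pseudo-embedding step: learn a map from the shared modality to the modality of interest on the dataset that has both, then apply it to the other dataset. The linear bridge below is a stand-in for whatever generator the paper actually trains:

```python
from sklearn.linear_model import Ridge

def pseudo_embeddings(shared_a, target_a, shared_b):
    """Fit a shared-modality -> target-modality map on dataset A (which
    has both), then fill dataset B's missing target modality from its
    shared modality. Ridge is a stand-in for the paper's generator."""
    bridge = Ridge(alpha=1.0).fit(shared_a, target_a)
    return bridge.predict(shared_b)
```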

Result: BrokenBind outperforms baseline methods, works effectively in low-data regimes, and handles scenarios requiring more than two datasets for modality binding.

Conclusion: BrokenBind provides a flexible framework for universal modality exploration that overcomes dataset limitations, reducing costs and improving generalization in multimodal learning.

Abstract: Multi-modal learning combines various modalities to provide a comprehensive understanding of real-world problems. A common strategy is to directly bind different modalities together in a specific joint embedding space. However, the capability of existing methods is restricted to the modalities present in the given dataset, so they are biased when generalizing to unseen modalities in downstream tasks. As a result, due to such inflexibility, the viability of previous methods is seriously hindered by the cost of acquiring multi-modal datasets. In this paper, we introduce BrokenBind, which focuses on binding modalities that are presented from different datasets. To achieve this, BrokenBind simultaneously leverages multiple datasets containing the modalities of interest and one shared modality. Though the two datasets do not correspond to each other due to distribution mismatch, we can capture their relationship to generate pseudo embeddings to fill in the missing modalities of interest, enabling flexible and generalized multi-modal learning. Under our framework, any two modalities can be bound together, free from the dataset limitation, to achieve universal modality exploration. Further, to reveal the capability of our method, we study intensified scenarios where more than two datasets are needed for modality binding and show the effectiveness of BrokenBind in low-data regimes. Through extensive evaluation, we carefully justify the superiority of BrokenBind compared to well-known multi-modal baseline methods.

[338] On the Plasticity and Stability for Post-Training Large Language Models

Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, Hui Xiong

Main category: cs.LG

TL;DR: PCR introduces a Bayesian framework for GRPO training stability by modeling gradients as random variables and using uncertainty-aware soft projection to resolve geometric conflicts between plasticity and stability gradients.

DetailsMotivation: Training stability is a critical bottleneck for Group Relative Policy Optimization (GRPO), often causing a trade-off between reasoning plasticity and general capability retention. The root cause is identified as geometric conflict between plasticity and stability gradients leading to destructive interference. Deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates.

Method: Proposes Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware “soft projection” mechanism that optimizes the signal-to-noise ratio.
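
One plausible reading of the uncertainty-aware soft projection, with the confidence weighting being a guess at the idea rather than the paper's formula:

```python
import torch

def soft_project(g_plastic, g_stable, snr):
    """Hypothetical soft gradient surgery: on conflict (negative inner
    product), remove only a fraction of the conflicting component, the
    fraction growing with the signal-to-noise ratio of the estimate."""
    dot = torch.dot(g_plastic, g_stable)
    if dot >= 0:
        return g_plastic                       # no conflict, no surgery
    weight = snr / (1.0 + snr)                 # in (0, 1)
    return g_plastic - weight * (dot / g_stable.pow(2).sum()) * g_stable
```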

Result: Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.

Conclusion: PCR effectively addresses the training stability issues in GRPO by providing a principled probabilistic approach to resolve gradient conflicts, leading to improved training stability and performance in reasoning tasks.

Abstract: Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify a root cause as the geometric conflict between plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO as they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware “soft projection” mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance in various reasoning tasks.

[339] Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion

Jan Benad, Pradeep Kr. Banerjee, Frank Röder, Nihat Ay, Martin V. Butz, Manfred Eppe

Main category: cs.LG

TL;DR: DMA*-SH uses hypernetwork-generated adapter weights shared across RL components to handle latent context actuator inversion, achieving zero-shot generalization on new tasks.

DetailsMotivation: Addressing the challenge of zero-shot generalization in contextual RL when context is latent, particularly the actuator inversion problem where identical actions produce opposite effects under different latent contexts.

Method: A hypernetwork trained via dynamics prediction generates adapter weights shared across dynamics model, policy, and action-value function. Uses input/output normalization and random input masking to stabilize context inference and create directionally concentrated representations.
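
A minimal sketch of the shared-adapter idea: one hypernetwork maps an inferred context embedding to a small weight matrix consumed by dynamics model, policy, and value function alike; all sizes are placeholders:

```python
import torch
import torch.nn as nn

class SharedAdapterHypernet(nn.Module):
    """Sketch: a context embedding (inferred from recent transitions)
    is mapped to one adapter weight matrix that the dynamics model,
    policy, and action-value function all apply to their features."""
    def __init__(self, ctx_dim=16, feat_dim=64, adapter_dim=8):
        super().__init__()
        self.gen = nn.Linear(ctx_dim, adapter_dim * feat_dim)
        self.shape = (adapter_dim, feat_dim)

    def forward(self, ctx):
        w = self.gen(ctx).view(*self.shape)
        return w  # shared use: h_adapted = torch.tanh(features @ w.T)
```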

Result: On the Actuator Inversion Benchmark (AIB), DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing standard context-aware baseline by 16.1%.

Conclusion: Shared hypernetwork modulation provides effective inductive bias for actuator inversion problems, enabling zero-shot generalization in contextual RL with latent contexts.

Abstract: Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via an expressivity separation result for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under actuator inversion. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions. On AIB’s held-out actuator-inversion tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing a standard context-aware baseline by 16.1%.

[340] The Window Dilemma: Why Concept Drift Detection is Ill-Posed

Brandon Gower-Winter, Misja Groen, Georg Krempl

Main category: cs.LG

TL;DR: The paper challenges the fundamental assumptions of concept drift detection in data streams, arguing that perceived drift is often an artifact of windowing rather than actual changes in data distribution, and questions the practical value of drift detectors compared to traditional batch learning.

DetailsMotivation: The paper questions the core assumptions behind concept drift detection in data streams, noting that while concept drift is widely studied, the practical verification of drift events is often implausible, and the effectiveness of drift detectors compared to simpler methods remains unclear.

Method: The authors introduce the “Window Dilemma” concept showing how drift perception depends on windowing choices, then conduct empirical comparisons of drift detectors against various alternative adaptation strategies using illustrative examples and systematic evaluations.
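
The dilemma is easy to reproduce with the generic two-window detector template: the same stream can look drifting or stationary purely as a function of the window size w. A minimal sketch:

```python
import numpy as np

def two_window_drift_score(stream, t, w):
    """Dissimilarity between the w points before and after time t.
    Whether this crosses a detection threshold can hinge on w, not on
    any property of the data-generating process -- the Window Dilemma."""
    ref, cur = stream[t - w:t], stream[t:t + w]
    return abs(ref.mean() - cur.mean()) / (ref.std() + cur.std() + 1e-12)
```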

Result: The main finding is that traditional batch learning techniques often outperform drift-aware counterparts, challenging the practical utility of drift detectors in stream classification tasks.

Conclusion: The paper fundamentally questions the purpose and effectiveness of drift detectors in stream classification, suggesting that perceived drift may be more about methodological choices than actual distributional changes, and that simpler approaches may be more effective.

Abstract: Non-stationarity of an underlying data generating process that leads to distributional changes over time is a key characteristic of Data Streams. This phenomenon, commonly referred to as Concept Drift, has been intensively studied, and Concept Drift Detectors have been established as a class of methods for detecting such changes (drifts). For the most part, Drift Detectors compare regions (windows) of the data stream and detect drift if those windows are sufficiently dissimilar. In this work, we introduce the Window Dilemma, an observation that perceived drift is a product of windowing and not necessarily of the underlying data generating process. Additionally, we highlight that drift detection is ill-posed, primarily because verification of drift events is implausible in practice. We demonstrate these contributions first by an illustrative example, followed by empirical comparisons of drift detectors against a variety of alternative adaptation strategies. Our main finding is that traditional batch learning techniques often perform better than their drift-aware counterparts, further calling into question the purpose of detectors in Stream Classification.

[341] Which Graph Shift Operator? A Spectral Answer to an Empirical Question

Yassine Abbahaddou

Main category: cs.LG

TL;DR: Proposes alignment gain metric to quantify geometric distortion between input signal and label subspaces, enabling principled selection of optimal Graph Shift Operator (GSO) for GNNs without extensive search.

DetailsMotivation: Current GNN architectures rely on empirical selection of Graph Shift Operators (GSOs), which is inefficient and lacks theoretical grounding. There's a need for a principled, computation-efficient method to select optimal GSOs for graph learning tasks.

Method: Introduces an alignment gain metric that quantifies geometric distortion between input signal and label subspaces. Connects this alignment to generalization bounds via spectral proxy for Lipschitz constant, providing theoretical foundation for GSO selection.
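
The selection workflow (score every candidate GSO before training, pick the best) can be sketched with a toy alignment proxy; the paper's alignment-gain metric is a specific spectral quantity, so the principal-angle score below is only illustrative:

```python
import numpy as np

def alignment_score(gso, X, Y):
    """Toy proxy: largest principal-angle cosine between the one-hop
    filtered signal subspace span(S X) and the label subspace span(Y)
    (Y one-hot encoded). Higher = better aligned candidate GSO."""
    qz, _ = np.linalg.qr(gso @ X)
    qy, _ = np.linalg.qr(Y)
    return np.linalg.norm(qz.T @ qy, 2)

# best_gso = max(candidate_gsos, key=lambda S: alignment_score(S, X, Y))
```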

Result: Develops a principled criterion to rank and select optimal GSO for any prediction task prior to training, eliminating need for extensive empirical search while maintaining theoretical guarantees.

Conclusion: Provides a theoretically grounded, efficient method for GSO selection in GNNs that bridges the gap between empirical practice and theoretical understanding of graph neural network architectures.

Abstract: Graph Neural Networks (GNNs) have established themselves as the leading models for learning on graph-structured data, generally categorized into spatial and spectral approaches. Central to these architectures is the Graph Shift Operator (GSO), a matrix representation of the graph structure used to filter node signals. However, selecting the optimal GSO, whether fixed or learnable, remains largely empirical. In this paper, we introduce a novel alignment gain metric that quantifies the geometric distortion between the input signal and label subspaces. Crucially, our theoretical analysis connects this alignment directly to generalization bounds via a spectral proxy for the Lipschitz constant. This yields a principled, computation-efficient criterion to rank and select the optimal GSO for any prediction task prior to training, eliminating the need for extensive search.

[342] Achieving Better Local Regret Bound for Online Non-Convex Bilevel Optimization

Tingkai Jia, Haiguang Wang, Cheng Chen

Main category: cs.LG

TL;DR: Online bilevel optimization algorithms with optimal regret bounds for both standard and window-averaged settings, featuring efficient gradient evaluations.

DetailsMotivation: Existing online bilevel optimization algorithms have unclear optimality of regret bounds, motivating the need to establish optimal regret guarantees for both standard bilevel local regret and window-averaged settings.

Method: Developed algorithms for two settings: 1) For standard bilevel local regret, proposed algorithm achieving optimal regret with efficient inner-level gradient evaluations, plus a fully single-loop variant with gradient-variation terms. 2) For window-averaged bilevel local regret, designed algorithm using window-based analysis to capture environmental variation.

Result: Achieved optimal regret bounds: Ω(1+V_T) for standard setting with O(T log T) inner-level gradient evaluations, and Ω(T/W²) for window-averaged setting. Experimental validation confirms theoretical findings and practical effectiveness.

Conclusion: Established optimal regret bounds for online bilevel optimization in both standard and window-averaged settings, with efficient algorithms validated through experiments.

Abstract: Online bilevel optimization (OBO) has emerged as a powerful framework for many machine learning problems. Prior works have developed several algorithms that minimize the standard bilevel local regret or the window-averaged bilevel local regret of the OBO problem, but the optimality of existing regret bounds remains unclear. In this work, we establish optimal regret bounds for both settings. For standard bilevel local regret, we propose an algorithm that achieves the optimal regret $\Omega(1+V_T)$ with at most $O(T\log T)$ total inner-level gradient evaluations. We further develop a fully single-loop algorithm whose regret bound includes an additional gradient-variation term. For the window-averaged bilevel local regret, we design an algorithm that captures sublinear environmental variation through a window-based analysis and achieves the optimal regret $\Omega(T/W^2)$. Experiments validate our theoretical findings and demonstrate the practical effectiveness of the proposed methods.

[343] Transformer-based Parameter Fitting of Models derived from Bloch-McConnell Equations for CEST MRI Analysis

Christof Duhme, Chris Lippe, Verena Hoerr, Xiaoyi Jiang

Main category: cs.LG

TL;DR: Transformer-based neural network for CEST MRI parameter quantification outperforms classical gradient-based methods

DetailsMotivation: CEST MRI provides non-invasive metabolite detection with high resolution and sensitivity, but quantification is challenging due to complex physiological interactions in the measured signals.

Method: Introduces a transformer-based neural network trained with self-supervision to fit parameters (metabolite concentrations, exchange rates, relaxation rates) from physical models derived from Bloch-McConnell equations to in-vitro CEST spectra.
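
Self-supervised parameter fitting here means the only training signal is reconstruction: the network proposes physical parameters and a differentiable Bloch-McConnell forward model re-synthesizes the spectrum. A minimal sketch, with `simulate` assumed given and differentiable:

```python
import torch
import torch.nn.functional as F

def self_supervised_step(net, simulate, spectrum, optimizer):
    """One training step: net maps a measured CEST spectrum to physical
    parameters (concentrations, exchange/relaxation rates); `simulate`
    is a differentiable Bloch-McConnell forward model re-synthesizing
    the spectrum, so reconstruction error is the only supervision."""
    params = net(spectrum)
    recon = simulate(params)
    loss = F.mse_loss(recon, spectrum)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```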

Result: The self-supervised trained neural network clearly outperforms classical gradient-based solver approaches for CEST data quantification.

Conclusion: Transformer-based neural networks offer superior performance for CEST MRI parameter quantification compared to traditional methods, enabling more accurate metabolite analysis.

Abstract: Chemical exchange saturation transfer (CEST) MRI is a non-invasive imaging modality for detecting metabolites. It offers higher resolution and sensitivity compared to conventional magnetic resonance spectroscopy (MRS). However, quantification of CEST data is challenging because the measured signal results from a complex interplay of many physiological variables. Here, we introduce a transformer-based neural network to fit parameters such as metabolite concentrations, exchange and relaxation rates of a physical model derived from Bloch-McConnell equations to in-vitro CEST spectra. We show that our self-supervised neural network clearly outperforms solutions obtained with classical gradient-based solvers.

[344] Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning

Jingyao Wang, Peizheng Guo, Wenwen Qiang, Jiahuan Zhou, Huijie Guo, Changwen Zheng, Hui Xiong

Main category: cs.LG

TL;DR: Group Causal Counterfactual Policy Optimization (GCCPO) improves LLM reasoning by rewarding robust reasoning patterns rather than just final answers, using counterfactual analysis to ensure reasoning generalizes across questions.

DetailsMotivation: Current LLM reward mechanisms focus too much on final answer correctness while ignoring reasoning quality, leading to poor generalization where sound reasoning with wrong answers gets penalized and lucky guesses with flawed logic get rewarded.

Method: Proposes Group Causal Counterfactual Policy Optimization (GCCPO) that treats multi-candidate reasoning as counterfactual experiments. Uses episodic causal counterfactual reward capturing robustness (reasoning stability under perturbations) and effectiveness (transferability across questions). Constructs token-level advantages from this reward to optimize policy.

Result: Extensive experiments on diverse benchmarks demonstrate advantages of GCCPO in improving reasoning generalization compared to existing methods.

Conclusion: GCCPO successfully trains LLMs to learn generalizable reasoning patterns by focusing on process validity and counterfactual robustness rather than just final answer correctness.

Abstract: Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. However, existing reward mechanisms remain tightly coupled to final correctness and pay little attention to the underlying reasoning process: trajectories with sound reasoning but wrong answers receive low credit, while lucky guesses with flawed logic may be highly rewarded, affecting reasoning generalization. From a causal perspective, we interpret multi-candidate reasoning for a fixed question as a family of counterfactual experiments with theoretical supports. Building on this, we propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. It proposes an episodic causal counterfactual reward that jointly captures (i) robustness, encouraging the answer distribution induced by a reasoning step to remain stable under counterfactual perturbations; and (ii) effectiveness, enforcing sufficient variability so that the learned reasoning strategy can transfer across questions. We then construct token-level advantages from this reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust. Extensive experiments on diverse benchmarks demonstrate its advantages.

[345] Perturbing the Phase: Analyzing Adversarial Robustness of Complex-Valued Neural Networks

Florian Eilers, Christof Duhme, Xiaoyi Jiang

Main category: cs.LG

TL;DR: CVNNs are analyzed for robustness against adversarial attacks, with novel Phase Attacks targeting complex-valued phase information showing CVNNs can be more robust than RVNNs but highly susceptible to phase perturbations.

DetailsMotivation: To ensure safe practical use of complex-valued neural networks (CVNNs) by analyzing their robustness against outliers and adversarial attacks, particularly focusing on phase information vulnerabilities.

Method: Design Phase Attacks specifically targeting phase information of complex-valued inputs, and derive complex-valued versions of common adversarial attacks to compare CVNN and RVNN robustness.
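
A PGD-style sketch of a phase attack: only the angle of the complex input is perturbed, with magnitude frozen. The loss, step rule, and budget are generic choices, and the model is assumed to map complex inputs to real logits:

```python
import torch
import torch.nn.functional as F

def phase_attack(model, x, y, eps=0.1, steps=10, lr=0.02):
    """PGD-style phase attack: x' = |x| * exp(i*(angle(x) + delta)),
    optimizing only the phase offset delta within [-eps, eps]. The
    model is assumed to map complex inputs to real logits."""
    mag, phase = x.abs(), x.angle()
    delta = torch.zeros_like(phase, requires_grad=True)
    for _ in range(steps):
        adv = mag * torch.exp(1j * (phase + delta))
        F.cross_entropy(model(adv), y).backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return mag * torch.exp(1j * (phase + delta.detach()))
```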

Result: CVNNs show greater robustness than RVNNs in some scenarios, but both are highly susceptible to phase changes, with Phase Attacks decreasing model performance more than equally strong regular attacks.

Conclusion: Phase information is a critical vulnerability in CVNNs, requiring specialized defense mechanisms, though CVNNs demonstrate some inherent robustness advantages over RVNNs in certain contexts.

Abstract: Complex-valued neural networks (CVNNs) are rising in popularity for all kinds of applications. To safely use CVNNs in practice, analyzing their robustness against outliers is crucial. One well known technique to understand the behavior of deep neural networks is to investigate their behavior under adversarial attacks, which can be seen as worst-case minimal perturbations. We design Phase Attacks, a kind of attack specifically targeting the phase information of complex-valued inputs. Additionally, we derive complex-valued versions of commonly used adversarial attacks. We show that in some scenarios CVNNs are more robust than RVNNs, and that both are very susceptible to phase changes, with Phase Attacks decreasing model performance more than equally strong regular attacks that can perturb both phase and magnitude.

[346] Adaptive Uncertainty-Aware Tree Search for Robust Reasoning

Zeen Song, Zihao Ma, Wenwen Qiang, Changwen Zheng, Gang Hua

Main category: cs.LG

TL;DR: UATS addresses epistemic uncertainty in Process Reward Models for LLM reasoning by combining Monte Carlo Dropout uncertainty estimation with RL-based compute allocation to improve out-of-distribution performance.

DetailsMotivation: Process Reward Models (PRMs) used for external search in LLM reasoning have high epistemic uncertainty when evaluating reasoning paths outside their training distribution, leading to unreliable scoring and poor performance on out-of-distribution samples.

Method: Proposes Uncertainty-Aware Tree Search (UATS) which uses Monte Carlo Dropout to estimate uncertainty in PRM evaluations and employs a reinforcement learning-based controller to dynamically allocate compute budget based on uncertainty estimates.
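
The uncertainty estimate itself is cheap: keep dropout stochastic at inference and read the spread of repeated PRM scores. A minimal sketch (assumes the PRM has dropout layers and no batch-norm side effects):

```python
import torch

def prm_score_with_uncertainty(prm, step_inputs, n_samples=8):
    """Monte Carlo Dropout: score the same partial reasoning path
    several times with dropout left stochastic; the mean is the PRM
    score and the std is the epistemic-uncertainty proxy that the RL
    controller would use to allocate search budget."""
    prm.train()  # keep dropout active (assumes no batch-norm layers)
    with torch.no_grad():
        scores = torch.stack([prm(step_inputs) for _ in range(n_samples)])
    return scores.mean(dim=0), scores.std(dim=0)
```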

Result: Extensive experiments show UATS effectively mitigates the impact of out-of-distribution errors and improves reasoning performance compared to standard search methods.

Conclusion: Addressing epistemic uncertainty in PRMs through uncertainty-aware search strategies is crucial for improving LLM reasoning capabilities, especially on out-of-distribution problems.

Abstract: Inference-time reasoning scaling has significantly advanced the capabilities of Large Language Models (LLMs) in complex problem-solving. A prevalent approach involves external search guided by Process Reward Models (PRMs). However, a fundamental limitation of this framework is the epistemic uncertainty of PRMs when evaluating reasoning paths that deviate from their training distribution. In this work, we conduct a systematic analysis of this challenge. We first provide empirical evidence that PRMs exhibit high uncertainty and unreliable scoring on out-of-distribution (OOD) samples. We then establish a theoretical framework proving that while standard search incurs linear regret accumulation, an uncertainty-aware strategy can achieve sublinear regret. Motivated by these findings, we propose Uncertainty-Aware Tree Search (UATS), a unified method that estimates uncertainty via Monte Carlo Dropout and dynamically allocates compute budget using a reinforcement learning-based controller. Extensive experiments demonstrate that our approach effectively mitigates the impact of OOD errors.

[347] Exploring Sparsity and Smoothness of Arbitrary $\ell_p$ Norms in Adversarial Attacks

Christof Duhme, Florian Eilers, Xiaoyi Jiang

Main category: cs.LG

TL;DR: Study shows ℓp norm choice (p∈[1,2]) affects sparsity and smoothness of adversarial attacks; optimal p around 1.3-1.5 balances trade-offs better than ℓ1 or ℓ2 alone.

DetailsMotivation: Adversarial attacks are typically constructed under ℓ1, ℓ2, or ℓ∞ norms without systematic investigation of how the norm parameter p influences structural and perceptual properties of perturbations. The authors aim to study how p affects sparsity and smoothness for p∈[1,2].

Method: Adopt two established sparsity measures and introduce three smoothness measures (including a framework based on smoothing operations and a measure based on first-order Taylor approximations). Conduct comprehensive empirical evaluation across multiple image datasets and diverse model architectures (convolutional and transformer-based networks).
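
As one concrete ingredient, the steepest-ascent direction under an ℓp constraint is a standard closed form, shown here only to illustrate how p interpolates between sparse and dense perturbations; this is not the paper's full attack or measurement pipeline:

```python
import torch

def lp_steepest_step(grad, eps, p):
    # Dual exponent q with 1/p + 1/q = 1; valid for p in (1, 2].
    q = p / (p - 1)
    g = grad.flatten()
    w = g.abs().pow(q - 1)
    delta = eps * g.sign() * w / w.pow(p).sum().pow(1.0 / p)  # ||delta||_p == eps
    return delta.view_as(grad)
```

For p = 2 this reduces to the familiar normalized-gradient step; as p approaches 1 the step concentrates mass on the largest gradient coordinates, i.e., becomes sparser.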

Result: Choice of ℓ1 or ℓ2 is suboptimal in most cases; optimal p value depends on specific task. Experiments show ℓp norms with p∈[1.3, 1.5] yield best trade-off between sparse and smooth attacks.

Conclusion: Principled norm selection is important when designing and evaluating adversarial attacks; intermediate p values (1.3-1.5) offer better balance between sparsity and smoothness than traditional ℓ1 or ℓ2 norms.

Abstract: Adversarial attacks against deep neural networks are commonly constructed under $\ell_p$ norm constraints, most often using $p=1$, $p=2$ or $p=\infty$, and potentially regularized for specific demands such as sparsity or smoothness. These choices are typically made without a systematic investigation of how the norm parameter $p$ influences the structural and perceptual properties of adversarial perturbations. In this work, we study how the choice of $p$ affects sparsity and smoothness of adversarial attacks generated under $\ell_p$ norm constraints for values of $p \in [1,2]$. To enable a quantitative analysis, we adopt two established sparsity measures from the literature and introduce three smoothness measures. In particular, we propose a general framework for deriving smoothness measures based on smoothing operations and additionally introduce a smoothness measure based on first-order Taylor approximations. Using these measures, we conduct a comprehensive empirical evaluation across multiple real-world image datasets and a diverse set of model architectures, including both convolutional and transformer-based networks. We show that the choice of $\ell_1$ or $\ell_2$ is suboptimal in most cases and the optimal $p$ value is dependent on the specific task. In our experiments, using $\ell_p$ norms with $p\in [1.3, 1.5]$ yields the best trade-off between sparse and smooth attacks. These findings highlight the importance of principled norm selection when designing and evaluating adversarial attacks.

[348] Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

Emanuel Sommer, Kangning Diao, Jakob Robnik, Uros Seljak, David Rügamer

Main category: cs.LG

TL;DR: Proposes stochastic-gradient microcanonical Langevin Monte Carlo with gradient noise preconditioning and adaptive tuning for scalable Bayesian inference in high-dimensional models like neural networks.

DetailsMotivation: Microcanonical Langevin Monte Carlo shows state-of-the-art performance but relies on expensive full-dataset gradients, limiting scalability to large-scale problems. The paper investigates whether microcanonical dynamics can effectively leverage mini-batch gradient noise.

Method: Develops continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics, identifies failure modes (anisotropic gradient noise bias and numerical instabilities), proposes gradient noise preconditioning to reduce bias, and introduces energy-variance-based adaptive tuner for automated step size selection and numerical guardrails.

Result: Creates robust and scalable microcanonical Monte Carlo sampler achieving state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with ensemble techniques, enables new class of stochastic microcanonical Langevin ensemble (SMILE) samplers.

Conclusion: The work successfully addresses scalability limitations of microcanonical dynamics, making them practical for large-scale Bayesian inference through principled handling of mini-batch gradient noise and adaptive numerical stabilization.

Abstract: Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

[349] Target noise: A pre-training based neural network initialization for efficient high resolution learning

Shaowen Wang, Tariq Alkhalifah

Main category: cs.LG

TL;DR: Noise-driven pre-training for neural network initialization: train networks on random noise before actual tasks to improve convergence speed, especially for implicit neural representations and DIP-style networks.

DetailsMotivation: Traditional initialization methods like Xavier and Kaiming rely on random sampling and don't exploit optimization process information. Networks need better initialization strategies to improve convergence efficiency.

Method: Pre-train neural networks to fit random noise as a self-supervised task before actual training. This creates structured, non-random parameter configurations that improve subsequent optimization.
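
The procedure reduces to a short pre-training loop. A sketch under the assumption of a PyTorch network `net` and a fixed input batch `x` (for an INR, `x` would be a coordinate grid); hyperparameters are illustrative:

```python
import torch
import torch.nn.functional as F

def noise_pretrain(net, x, steps=1000, lr=1e-3):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    with torch.no_grad():
        target = torch.randn_like(net(x))  # white-noise target, same shape as output
    for _ in range(steps):
        opt.zero_grad()
        F.mse_loss(net(x), target).backward()
        opt.step()
    return net  # these weights now serve as the initialization for the real task
```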

Result: Noise-based pre-training significantly improves convergence speed, helps networks capture high-frequency components earlier, and is particularly effective for implicit neural representations and DIP-style networks which have strong low-frequency bias.

Conclusion: Noise-driven pre-training offers a lightweight, general alternative to traditional random initialization, enabling more efficient optimization of deep neural networks without requiring additional data or architectural changes.

Abstract: Weight initialization plays a crucial role in the optimization behavior and convergence efficiency of neural networks. Most existing initialization methods, such as Xavier and Kaiming initializations, rely on random sampling and do not exploit information from the optimization process itself. We propose a simple, yet effective, initialization strategy based on self-supervised pre-training using random noise as the target. Instead of directly training the network from random weights, we first pre-train it to fit random noise, which leads to a structured and non-random parameter configuration. We show that this noise-driven pre-training significantly improves convergence speed in subsequent tasks, without requiring additional data or changes to the network architecture. The proposed method is particularly effective for implicit neural representations (INRs) and Deep Image Prior (DIP)-style networks, which are known to exhibit a strong low-frequency bias during optimization. After noise-based pre-training, the network is able to capture high-frequency components much earlier in training, leading to faster and more stable convergence. Although random noise contains no semantic information, it serves as an effective self-supervised signal (considering its white spectrum nature) for shaping the initialization of neural networks. Overall, this work demonstrates that noise-based pre-training offers a lightweight and general alternative to traditional random initialization, enabling more efficient optimization of deep neural networks.

[350] Evolutionary Generation of Multi-Agent Systems

Yuntong Hu, Matthew Trager, Yuting Zhang, Yi Zhang, Shuo Yang, Wei Xia, Stefano Soatto

Main category: cs.LG

TL;DR: EvoMAS is an evolutionary approach for automatically generating multi-agent systems by evolving structured configurations rather than code, improving executability and performance over human-designed and prior automatic methods.

DetailsMotivation: Current LLM-based multi-agent system design is labor-intensive and brittle, with existing automatic generation methods suffering from executability issues or limited expressiveness due to rigid templates.

Method: EvoMAS formulates MAS generation as structured configuration evolution, using feedback-conditioned mutation and crossover guided by execution traces, with iterative refinement of candidate pools and experience memory.
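
A toy sketch of feedback-conditioned mutation and crossover over structured configurations; the config schema ({"agents": {name: {"role_prompt": ...}}}) is hypothetical, chosen only to illustrate evolution in configuration space rather than over generated code:

```python
import copy
import random

def mutate(config, trace_feedback):
    # Revise one agent's configuration, conditioned on the execution trace.
    child = copy.deepcopy(config)
    name = random.choice(list(child["agents"]))
    child["agents"][name]["role_prompt"] += f"\nRevise given failure: {trace_feedback[:100]}"
    return child

def crossover(a, b):
    # Recombine agent sub-configurations from two parent systems.
    child = copy.deepcopy(a)
    for name in child["agents"]:
        if name in b["agents"] and random.random() < 0.5:
            child["agents"][name] = copy.deepcopy(b["agents"][name])
    return child
```

Because every candidate stays inside a valid configuration schema, the offspring remain executable by construction, which is the stated advantage over free-form code generation.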

Result: EvoMAS consistently outperforms human-designed MAS and prior automatic methods across reasoning, software engineering, and tool-use benchmarks, with significant performance gains and improved executability.

Conclusion: Evolutionary generation in configuration space provides an effective approach for automatically creating robust, high-performing multi-agent systems that overcome limitations of existing methods.

Abstract: Large language model (LLM)-based multi-agent systems (MAS) show strong promise for complex reasoning, planning, and tool-augmented tasks, but designing effective MAS architectures remains labor-intensive, brittle, and hard to generalize. Existing automatic MAS generation methods either rely on code generation, which often leads to executability and robustness failures, or impose rigid architectural templates that limit expressiveness and adaptability. We propose Evolutionary Generation of Multi-Agent Systems (EvoMAS), which formulates MAS generation as structured configuration generation. EvoMAS performs evolutionary generation in configuration space. Specifically, EvoMAS selects initial configurations from a pool, applies feedback-conditioned mutation and crossover guided by execution traces, and iteratively refines both the candidate pool and an experience memory. We evaluate EvoMAS on diverse benchmarks, including BBEH, SWE-Bench, and WorkBench, covering reasoning, software engineering, and tool-use tasks. EvoMAS consistently improves task performance over both human-designed MAS and prior automatic MAS generation methods, while producing generated systems with higher executability and runtime robustness. EvoMAS outperforms the agent evolution method EvoAgent by +10.5 points on BBEH reasoning and +7.1 points on WorkBench. With Claude-4.5-Sonnet, EvoMAS also reaches 79.1% on SWE-Bench-Verified, matching the top of the leaderboard.

[351] Topography scanning as a part of process monitoring in power cable insulation process

Janne Harjuhahto, Jaakko Harjuhahto, Mikko Lahti, Jussi Hanhirova, Björn Sonerud

Main category: cs.LG

TL;DR: A novel topography scanning system for XLPE cable core monitoring using 3D surface mapping and deep learning for defect detection

DetailsMotivation: To develop an advanced monitoring system for XLPE cable cores that can detect surface defects and geometry errors in real-time, addressing quality control challenges in cable manufacturing

Method: Combines modern measurement technology with embedded high-performance computing to create detailed 3D surface maps, then uses convolutional neural networks for real-time analysis and defect detection

Result: Successfully developed a complete 3D surface mapping system, identified melt homogeneity as a major factor for geometry errors, and demonstrated that CNNs are well-suited for reliable real-time surface defect detection

Conclusion: The developed topography scanning system with deep learning-based defect detection provides an effective solution for XLPE cable core monitoring, enabling real-time quality control and identification of manufacturing issues

Abstract: We present a novel topography scanning system developed for XLPE cable core monitoring. Modern measurement technology is utilized together with embedded high-performance computing to build a complete and detailed 3D surface map of the insulated core. Cross-sectional and lengthwise geometry errors are studied, and melt homogeneity is identified as one major factor behind these errors. A surface defect detection system has been developed utilizing deep learning methods. Our results show that convolutional neural networks are well suited for real-time analysis of surface measurement data, enabling reliable detection of surface defects.

[352] Endogenous Resistance to Activation Steering in Language Models

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

Main category: cs.LG

TL;DR: LLMs can resist task-misaligned activation steering during inference, recovering mid-generation even when steering remains active, a phenomenon called Endogenous Steering Resistance (ESR).

DetailsMotivation: To understand how LLMs internally resist adversarial manipulation through activation steering, and to identify dedicated internal consistency-checking circuits that enable models to recover from misaligned steering.

Method: Used sparse autoencoder (SAE) latents to steer model activations, tested on Llama-3.3-70B and smaller models, identified 26 SAE latents linked to ESR, performed zero-ablation experiments, and enhanced ESR through meta-prompting and fine-tuning on self-correction examples.
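
For readers unfamiliar with activation steering, a sketch of adding an SAE decoder direction to the residual stream via a PyTorch forward hook; the layer path and coefficient are model-specific assumptions, not values from the paper:

```python
import torch

def make_steering_hook(direction, alpha):
    # direction: an SAE decoder vector of shape (d_model,); broadcasts over batch/positions.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage (layer index and attribute path are hypothetical):
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(v, 8.0))
# ... generate with steering active ...
# handle.remove()
```

ESR is then the observation that generation sometimes recovers on-task even while such a hook stays registered.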

Result: Llama-3.3-70B shows substantial ESR while smaller models show it less frequently; zero-ablating identified latents reduces multi-attempt rate by 25%; meta-prompts increase multi-attempt rate by 4x; fine-tuning induces ESR-like behavior in smaller models.

Conclusion: ESR represents dedicated internal consistency-checking circuits that could protect against adversarial manipulation but might interfere with beneficial safety interventions. Understanding these resistance mechanisms is crucial for transparent and controllable AI systems.

Abstract: Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.

[353] The challenge of generating and evolving real-life like synthetic test data without accessing real-world raw data – a Systematic Review

Maj-Annika Tammisto, Faiz Ali Shah, Daniel Rodriguez, Dietmar Pfahl

Main category: cs.LG

TL;DR: Systematic review of synthetic test data generation methods for e-Government applications, finding 37 partial approaches but none fully meeting requirements for privacy-preserving test data evolution without real data access.

DetailsMotivation: High-level system testing of e-Government applications requires realistic test data that preserves privacy, especially for sensitive domains like international information exchange, medicine, and banking. Current approaches often require access to real data, creating privacy concerns.

Method: Conducted systematic literature review following Kitchenham methodologies across IEEE Xplore, ACM Digital Library, and SCOPUS databases, analyzing 1,013 publications and extracting data from 75 relevant papers.

Result: Identified 37 approaches partially addressing the research question, with 9 synthetic test data generation methods being closest to requirements. However, all existing approaches have limitations and none fully cover the need for evolving synthetic test data without real data access.

Conclusion: Synthetic test data evolution remains an under-explored research area, particularly important for Digital Government Solutions given new legal regulations. Current methods only partially meet requirements, highlighting a significant research gap.

Abstract: Background: High-level system testing of applications that use data from e-Government services as input requires test data that is real-life-like but where the privacy of personal information is guaranteed. Applications with such strong requirements include information exchange between countries, medicine, banking, etc. This review aims to synthesize the current state-of-the-practice in this domain. Objectives: The objective of this Systematic Review is to identify existing approaches for creating and evolving synthetic test data without using real-life raw data. Methods: We followed well-known methodologies for conducting systematic literature reviews, including the ones from Kitchenham as well as guidelines for analysing the limitations of our review and its threats to validity. Results: A variety of methods and tools exist for creating privacy-preserving test data. Our search found 1,013 publications in IEEE Xplore, ACM Digital Library, and SCOPUS. We extracted data from 75 of those publications and identified 37 approaches that partly answer our research question. A common prerequisite for using these methods and tools is direct access to real-life data for data anonymization or synthetic test data generation. Nine existing synthetic test data generation approaches were identified that were closest to answering our research question. Nevertheless, further work would be needed to add the ability to evolve synthetic test data to the existing approaches. Conclusions: None of the publications really covered our requirements completely, only partially. Synthetic test data evolution is a field that has not received much attention from researchers but needs to be explored in Digital Government Solutions, especially since new legal regulations are being placed in force in many countries.

[354] Live Knowledge Tracing: Real-Time Adaptation using Tabular Foundation Models

Mounir Lbath, Alexandre Paresy, Abdelkayoum Kaddouri, Alan André, Alexandre Ittah, Jill-Jênn Vie

Main category: cs.LG

TL;DR: TFMs enable real-time knowledge tracing by attending across time steps and other students’ interactions, achieving competitive performance with massive speedups.

DetailsMotivation: Traditional deep knowledge tracing models require substantial training time and are prone to overfitting on short sequences, creating a need for more efficient real-time approaches.

Method: Leverages tabular foundation models (TFMs) with two-way attention mechanism that attends across both time steps and interactions of other students, aligning testing sequences with relevant training sequences at inference time without training.
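
A sketch of the setup using a TabPFN-style classifier; the paper's exact TFM and feature set are not specified here, and the interaction features below are purely illustrative:

```python
# pip install tabpfn  (assumed available)
import numpy as np
from tabpfn import TabPFNClassifier

rng = np.random.default_rng(0)
# One row per past interaction: e.g. [student_id, skill_id, attempt_no, past_success_rate]
X_context = rng.random((500, 4))
y_context = rng.integers(0, 2, 500)   # 1 = answered correctly
X_live = rng.random((10, 4))          # incoming interactions to score

clf = TabPFNClassifier()              # no gradient training involved
clf.fit(X_context, y_context)         # "fit" only stores the in-context set
p_correct = clf.predict_proba(X_live)[:, 1]
```

Skipping gradient training entirely is what makes the "live" setting possible: new interactions simply extend the context set at inference time.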

Result: Achieves competitive predictive performance with up to 273x speedups across datasets of increasing size, particularly effective as more student interactions are observed over time.

Conclusion: TFMs offer a promising new paradigm for real-time knowledge tracing that eliminates training requirements while maintaining performance and achieving significant efficiency gains.

Abstract: Deep knowledge tracing models have achieved significant breakthroughs in modeling student learning trajectories. However, these architectures require substantial training time and are prone to overfitting on datasets with short sequences. In this paper, we explore a new paradigm for knowledge tracing by leveraging tabular foundation models (TFMs). Unlike traditional methods that require offline training on a fixed training set, our approach performs real-time “live” knowledge tracing in an online way. The core of our method lies in a two-way attention mechanism: while attention knowledge tracing models only attend across earlier time steps, TFMs simultaneously attend across both time steps and interactions of other students in the training set. They align testing sequences with relevant training sequences at inference time, therefore skipping the training step entirely. We demonstrate, using several datasets of increasing size, that our method achieves competitive predictive performance with up to 273x speedups, in a setting where more student interactions are observed over time.

[355] Learning a Generative Meta-Model of LLM Activations

Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt

Main category: cs.LG

TL;DR: Training diffusion models on neural network activations creates “meta-models” that learn internal state distributions, improving interpretability and steering interventions without structural assumptions.

DetailsMotivation: Existing neural network analysis methods like PCA and sparse autoencoders rely on strong structural assumptions. Generative models offer an alternative that can uncover structure without such assumptions and act as priors to improve intervention fidelity.

Method: Train diffusion models on one billion residual stream activations to create “meta-models” that learn the distribution of a network’s internal states. Use these meta-models as priors for steering interventions and analyze how diffusion loss correlates with downstream utility.
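
The training objective is presumably a standard noise-prediction loss applied to activation vectors; a sketch under that assumption, where `denoiser(x_t, t)` is a hypothetical model and the cosine schedule is illustrative:

```python
import math
import torch
import torch.nn.functional as F

def activation_diffusion_loss(denoiser, acts, T=1000):
    # acts: (batch, d_model) residual-stream vectors collected from the base LLM.
    t = torch.randint(0, T, (acts.size(0),), device=acts.device)
    abar = torch.cos(0.5 * math.pi * t.float() / T) ** 2       # simple cosine schedule
    noise = torch.randn_like(acts)
    x_t = abar.sqrt()[:, None] * acts + (1 - abar).sqrt()[:, None] * noise
    return F.mse_loss(denoiser(x_t, t), noise)                 # noise-prediction objective
```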

Result: Diffusion loss decreases smoothly with compute and reliably predicts downstream utility. Applying the meta-model’s learned prior to steering interventions improves fluency, with larger gains as loss decreases. The meta-model’s neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases.

Conclusion: Generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions, showing promise for understanding and controlling neural network internal representations.

Abstract: Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating “meta-models” that learn the distribution of a network’s internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model’s learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model’s neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

[356] Trust Regions Sell, But Who’s Buying? Overlap Geometry as an Alternative Trust Region for Policy Optimization

Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Yash Sinha, Pratik Narang, Dhruv Kumar, Jagat Sesh Challa

Main category: cs.LG

TL;DR: Proposes overlap geometry using Bhattacharyya coefficient as alternative trust region for RL policy optimization, replacing KL divergence to better control likelihood-ratio excursions and stabilize training.

DetailsMotivation: Standard trust-region methods use KL divergence which only controls average divergence and fails to prevent rare, large likelihood-ratio excursions that destabilize training. Current heuristics like PPO's clipping address this but lack theoretical grounding.

Method: Proposes overlap geometry via Bhattacharyya coefficient as trust region, which penalizes separation in ratio tails. Develops Bhattacharyya-TRPO (BTRPO) with quadratic Hellinger/Bhattacharyya penalty and Bhattacharyya-PPO (BPPO) that clips square-root ratio q = sqrt(r).
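
One plausible reading of the BPPO surrogate, sketched below: clip the square-root ratio q, then square it so the objective stays comparable to PPO's r·A term. The abstract does not pin down the exact surrogate, so treat this as illustrative rather than the authors' loss:

```python
import torch

def bppo_loss(logp_new, logp_old, adv, eps=0.1):
    q = torch.exp(0.5 * (logp_new - logp_old))        # q = sqrt(r)
    surrogate = q.pow(2) * adv                        # r * A, as in PPO
    clipped = q.clamp(1.0 - eps, 1.0 + eps).pow(2) * adv
    return -torch.min(surrogate, clipped).mean()
```

Clipping in q-space rather than r-space reacts earlier to large ratio excursions, since a ratio of r = 4 already corresponds to q = 2.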

Result: Overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, showing better control over likelihood-ratio excursions without relying on loose total variation bounds.

Conclusion: Overlap constraints provide practical, principled alternative to KL divergence for stable policy optimization, offering tighter control over distributional divergence and improved training stability.

Abstract: Standard trust-region methods constrain policy updates via Kullback-Leibler (KL) divergence. However, KL controls only an average divergence and does not directly prevent rare, large likelihood-ratio excursions that destabilize training, which is precisely the failure mode that motivates heuristics such as PPO’s clipping. We propose overlap geometry as an alternative trust region, constraining distributional overlap via the Bhattacharyya coefficient (closely related to the Hellinger/Rényi-1/2 geometry). This objective penalizes separation in the ratio tails, yielding tighter control over likelihood-ratio excursions without relying on total variation bounds that can be loose in tail regimes. We derive Bhattacharyya-TRPO (BTRPO) and Bhattacharyya-PPO (BPPO), enforcing overlap constraints via square-root ratio updates: BPPO clips the square-root ratio q = sqrt(r), and BTRPO applies a quadratic Hellinger/Bhattacharyya penalty. Empirically, overlap-based updates improve robustness and aggregate performance as measured by RLiable under matched training budgets, suggesting overlap constraints as a practical, principled alternative to KL for stable policy optimization.

[357] Refining the Information Bottleneck via Adversarial Information Separation

Shuai Ning, Zhenpeng Wang, Lin Wang, Bing Chen, Shuangrong Liu, Xu Wu, Jin Zhou, Bo Yang

Main category: cs.LG

TL;DR: AdverISF: A self-supervised adversarial framework that isolates task-relevant features from noise without explicit supervision, using statistical independence constraints and multi-layer noise recycling for better generalization in data-scarce domains like material science.

DetailsMotivation: In domains like material science, experimental datasets have task-relevant features heavily confounded by measurement noise and artifacts. Standard regularization fails to precisely separate meaningful features from noise, while existing adversarial methods require explicit separation labels which are often unavailable.

Method: Proposes AdverISF with: 1) Self-supervised adversarial mechanism to enforce statistical independence between task-relevant features and noise representations without explicit supervision; 2) Multi-layer separation architecture that progressively recycles noise information across feature hierarchies to recover features inadvertently discarded as noise, enabling finer-grained feature extraction.

Result: Extensive experiments show AdverISF outperforms state-of-the-art methods in data-scarce scenarios. Evaluations on real-world material design tasks demonstrate superior generalization performance compared to existing approaches.

Conclusion: AdverISF effectively addresses the challenge of separating task-relevant features from noise in data-scarce domains without requiring explicit supervision, achieving better generalization through its self-supervised adversarial approach and multi-layer noise recycling architecture.

Abstract: Generalizing from limited data is particularly critical for models in domains such as material science, where task-relevant features in experimental datasets are often heavily confounded by measurement noise and experimental artifacts. Standard regularization techniques fail to precisely separate meaningful features from noise, while existing adversarial adaptation methods are limited by their reliance on explicit separation labels. To address this challenge, we propose the Adversarial Information Separation Framework (AdverISF), which isolates task-relevant features from noise without requiring explicit supervision. AdverISF introduces a self-supervised adversarial mechanism to enforce statistical independence between task-relevant features and noise representations. It further employs a multi-layer separation architecture that progressively recycles noise information across feature hierarchies to recover features inadvertently discarded as noise, thereby enabling finer-grained feature extraction. Extensive experiments demonstrate that AdverISF outperforms state-of-the-art methods in data-scarce scenarios. In addition, evaluations on real-world material design tasks show that it achieves superior generalization performance.

[358] Temperature Scaling Attack Disrupting Model Confidence in Federated Learning

Kichang Lee, Jaeho Jin, JaeYeon Park, Songkuk Kim, JeongGil Ko

Main category: cs.LG

TL;DR: Temperature Scaling Attack (TSA) manipulates confidence calibration in federated learning while maintaining accuracy, enabling stealthy attacks on mission-critical systems that rely on predictive confidence for risk-aware decisions.

DetailsMotivation: The paper identifies confidence calibration as a distinct attack surface in federated learning, separate from traditional accuracy or backdoor attacks. Since predictive confidence controls critical risk-aware logic (escalation, abstention, fallback) in mission-critical systems, compromising calibration integrity can have severe consequences even when accuracy appears normal.

Method: TSA injects temperature scaling with learning rate-temperature coupling during local training in federated learning. Malicious clients manipulate temperature parameters to distort confidence calibration while maintaining benign-like optimization behavior, evading accuracy-based monitoring and similarity-based detection. The method includes convergence analysis under non-IID settings.
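
The core of the attack fits in a few lines: the malicious client trains against temperature-scaled logits, which leaves the argmax (and hence accuracy) intact while distorting confidence. The schedule coupling tau to training progress below is an illustrative stand-in for the paper's learning rate-temperature coupling:

```python
import torch.nn.functional as F

def tsa_local_loss(logits, labels, step, total_steps, tau_max=3.0):
    # Scaling logits by tau preserves the argmax but flattens/sharpens confidence.
    tau = 1.0 + (tau_max - 1.0) * step / total_steps   # hypothetical schedule
    return F.cross_entropy(logits / tau, labels)
```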

Result: TSA achieves substantial calibration degradation (e.g., 145% error increase on CIFAR-100) with minimal accuracy change (<2%). The attack remains effective under robust aggregation and post-hoc calibration defenses. Case studies show up to 7.2x increases in missed critical cases in healthcare or false alarms in autonomous driving despite unchanged accuracy.

Conclusion: Confidence calibration integrity is a critical attack surface in federated learning that requires specific defenses beyond traditional accuracy-focused security measures. The Temperature Scaling Attack demonstrates that calibration can be systematically manipulated while evading existing detection mechanisms.

Abstract: Predictive confidence serves as a foundational control signal in mission-critical systems, directly governing risk-aware logic such as escalation, abstention, and conservative fallback. While prior federated learning attacks predominantly target accuracy or implant backdoors, we identify confidence calibration as a distinct attack objective. We present the Temperature Scaling Attack (TSA), a training-time attack that degrades calibration while preserving accuracy. By injecting temperature scaling with learning rate-temperature coupling during local training, malicious updates maintain benign-like optimization behavior, evading accuracy-based monitoring and similarity-based detection. We provide a convergence analysis under non-IID settings, showing that this coupling preserves standard convergence bounds while systematically distorting confidence. Across three benchmarks, TSA substantially shifts calibration (e.g., 145% error increase on CIFAR-100) with <2% accuracy change, and remains effective under robust aggregation and post-hoc calibration defenses. Case studies further show that confidence manipulation can cause up to 7.2x increases in missed critical cases (healthcare) or false alarms (autonomous driving), even when accuracy is unchanged. Overall, our results establish calibration integrity as a critical attack surface in federated learning.

[359] Fine-Grained Model Merging via Modular Expert Recombination

Haiyun Qiu, Xingyu Wu, Liang Feng, Kay Chen Tan

Main category: cs.LG

TL;DR: MERGE enables component-wise model merging with input-aware module recombination, addressing limitations of instance-specific merging by creating reusable modular expert libraries.

DetailsMotivation: Current instance-specific model merging methods lack reusability and treat models monolithically, ignoring that different components (attention vs MLP layers) have varying mergeability and merging sensitivities.

Method: Formulates component-wise merging as bi-objective optimization balancing cross-task performance and storage efficiency, uses surrogate-assisted evolutionary algorithm to find Pareto-optimal configurations, creates modular expert library with lightweight routing network for dynamic recombination.

Result: Extensive experiments across various model scales, task types, and fine-tuning strategies show MERGE consistently outperforms strong baselines and generalizes effectively.

Conclusion: MERGE provides an effective framework for fine-grained model merging with reusable modular components, enabling efficient inference under storage constraints while maintaining performance.

Abstract: Model merging constructs versatile models by integrating task-specific models without requiring labeled data or expensive joint retraining. Although recent methods improve adaptability to heterogeneous tasks by generating customized merged models for each instance, they face two critical limitations. First, the instance-specific merged models lack reusability, restricting the exploitation of high-quality merging configurations and efficient batch inference. Second, these methods treat each task-specific model as a monolithic whole, overlooking the diverse mergeability of homologous components such as attention and multilayer perceptron layers, and the differing merging sensitivities across components. To address these limitations, we propose MERGE (Modular Expert Recombination for fine-Grained mErging), a method that enables component-wise model merging and input-aware, on-demand module recombination at inference. MERGE formulates component-wise merging as a bi-objective optimization problem that balances cross-task performance and storage efficiency, and develops a surrogate-assisted evolutionary algorithm to efficiently identify Pareto-optimal merging configurations. These high-quality configurations underpin a reusable modular expert library, from which a lightweight routing network dynamically activates and recombines modular experts to assemble input-specific models and enable efficient inference under storage constraints. Extensive experiments across various model scales, task types, and fine-tuning strategies demonstrate that MERGE consistently outperforms strong baselines and generalizes effectively.

[360] Learning to Allocate Resources with Censored Feedback

Giovanni Montanari, Côme Fiegel, Corentin Pla, Aadirupa Saha, Vianney Perchet

Main category: cs.LG

TL;DR: Online resource allocation with censored feedback where arms require both activation probability and budget threshold conditions for reward, with algorithms achieving sublinear regret.

DetailsMotivation: Addresses the practical problem of budget allocation under censored feedback where rewards are only observed when both arm activation and sufficient budget allocation conditions are met, with applications in advertising, resource management, and recommendation systems.

Method: Proposes RA-UCB algorithm for known budgets using optimistic parameter estimation and confidence bounds, and MG-UCB for unknown budgets allowing within-round switching and infinitesimal allocations.

Result: Proves Ω(T^{1/3}) information-theoretic regret lower bound, achieves Õ(√T) regret with RA-UCB for known budgets, O(poly-log T) under stronger assumptions, and similar guarantees with MG-UCB for unknown budgets.

Conclusion: The problem is fundamentally challenging with cubic-root lower bound, but proposed algorithms achieve sublinear regret with theoretical guarantees validated on real-world datasets.

Abstract: We study the online resource allocation problem in which at each round, a budget $B$ must be allocated across $K$ arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over $T$ rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward facing the exploration–exploitation trade-off. We prove an information-theoretic regret lower bound $\Omega(T^{1/3})$, demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget $B$ is known at the beginning of each round, RA-UCB achieves a regret of order $\widetilde{\mathcal{O}}(\sqrt{T})$, and even $\mathcal{O}(\mathrm{poly}\text{-}\log T)$ under stronger assumptions. For an unknown, round-dependent budget, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations, and matches the regret guarantees of RA-UCB. We then validate our theoretical results through experiments on real-world datasets.

[361] Degradation of Feature Space in Continual Learning

Chiara Lanza, Roberto Pereira, Marco Miozzo, Eduard Angelats, Paolo Dini

Main category: cs.LG

TL;DR: Isotropic regularization in continual learning fails to improve accuracy and can degrade performance, contrary to its benefits in centralized training.

DetailsMotivation: The paper investigates whether promoting feature-space isotropy can enhance representation quality in continual learning, given that isotropic distributions naturally arise in centralized training but continual learning tends to yield anisotropic feature spaces due to streaming non-stationary data.

Method: Experiments using contrastive continual learning techniques on CIFAR-10 and CIFAR-100 datasets, testing whether isotropic regularization improves model accuracy in continual settings.
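
For reference, a common form of isotropic regularization that such experiments might use is to penalize the deviation of the (scale-normalized) feature covariance from the identity. This sketch is a generic instantiation, not necessarily the paper's exact regularizer:

```python
import torch

def isotropy_penalty(feats):
    # feats: (n_samples, d) batch of representations, n_samples > 1.
    z = feats - feats.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.size(0) - 1)
    scale = cov.diagonal().mean().clamp_min(1e-8)
    eye = torch.eye(cov.size(0), device=cov.device)
    return ((cov / scale - eye) ** 2).mean()   # zero iff the covariance is isotropic
```

The penalty would be added to the contrastive loss with a weighting coefficient; the paper's finding is that doing so does not help, and can hurt, in the continual setting.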

Result: Isotropic regularization fails to improve and can actually degrade model accuracy in continual learning scenarios, highlighting essential differences in feature geometry between centralized and continual learning.

Conclusion: Isotropy, while beneficial in centralized training setups, may not constitute an appropriate inductive bias for non-stationary learning scenarios like continual learning.

Abstract: Centralized training is the standard paradigm in deep learning, enabling models to learn from a unified dataset in a single location. In such a setup, isotropic feature distributions naturally arise as a means to support well-structured and generalizable representations. In contrast, continual learning operates on streaming and non-stationary data, and trains models incrementally, inherently facing the well-known plasticity-stability dilemma. In such settings, the learning dynamics tend to yield an increasingly anisotropic feature space. This raises a fundamental question: should isotropy be enforced to achieve a better balance between stability and plasticity, and thereby mitigate catastrophic forgetting? In this paper, we investigate whether promoting feature-space isotropy can enhance representation quality in continual learning. Through experiments using contrastive continual learning techniques on CIFAR-10 and CIFAR-100 data, we find that isotropic regularization fails to improve, and can in fact degrade, model accuracy in continual settings. Our results highlight essential differences in feature geometry between centralized and continual learning, suggesting that isotropy, while beneficial in centralized setups, may not constitute an appropriate inductive bias for non-stationary learning scenarios.

[362] DiTS: Multimodal Diffusion Transformers Are Time Series Forecasters

Haoran Zhang, Haixuan Liu, Yong Liu, Yunzhong Qiu, Yuxuan Wang, Jianmin Wang, Mingsheng Long

Main category: cs.LG

TL;DR: DiTS is a diffusion transformer architecture for time series forecasting that treats endogenous and exogenous variables as distinct modalities, using a dual-stream transformer with time and variate attention modules to better capture dependencies.

DetailsMotivation: Current generative time series models don't adequately handle multi-dimensional properties of time series data. Existing Diffusion Transformers (DiT) use simplistic conditioning and single-stream architectures that underutilize cross-variate dependencies in covariate-aware forecasting.

Method: Proposes Diffusion Transformers for Time Series (DiTS) that frames endogenous and exogenous variates as distinct modalities. Uses a dual-stream Transformer block with Time Attention for temporal modeling and Variate Attention for cross-variate modeling, leveraging low-rank properties of multivariate dependencies to reduce computational costs.
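
A structural sketch of the dual-stream block, assuming activations of shape (batch, variates, time, d_model); the module layout is an assumption, and a causal mask over time (omitted here) would be needed for the autoregressive Time Attention:

```python
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.var_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, variates, time, d_model)
        b, v, t, d = x.shape
        h = x.reshape(b * v, t, d)                    # Time Attention: along time, per variate
        n = self.norm1(h)
        h = h + self.time_attn(n, n, n)[0]
        h = h.reshape(b, v, t, d).transpose(1, 2).reshape(b * t, v, d)
        n = self.norm2(h)                             # Variate Attention: across variates, per step
        h = h + self.var_attn(n, n, n)[0]
        return h.reshape(b, t, v, d).transpose(1, 2)  # back to (b, v, t, d)
```

Factoring attention this way costs O(t² + v²) per token group rather than O((tv)²) for attention over the flattened grid, which is the low-rank saving the abstract alludes to.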

Result: DiTS achieves state-of-the-art performance across benchmarks, regardless of whether future exogenous variate observations are available, demonstrating superior generative forecasting capabilities compared to traditional deterministic deep forecasting models.

Conclusion: DiTS provides a general-purpose architecture for time series forecasting that effectively captures both inter-variate and intra-variate dependencies through its dual-stream design, offering improved generative forecasting performance.

Abstract: While generative modeling on time series facilitates more capable and flexible probabilistic forecasting, existing generative time series models do not address the multi-dimensional properties of time series data well. The prevalent architecture of Diffusion Transformers (DiT), which relies on simplistic conditioning controls and a single-stream Transformer backbone, tends to underutilize cross-variate dependencies in covariate-aware forecasting. Inspired by Multimodal Diffusion Transformers that integrate textual guidance into video generation, we propose Diffusion Transformers for Time Series (DiTS), a general-purpose architecture that frames endogenous and exogenous variates as distinct modalities. To better capture both inter-variate and intra-variate dependencies, we design a dual-stream Transformer block tailored for time-series data, comprising a Time Attention module for autoregressive modeling along the temporal dimension and a Variate Attention module for cross-variate modeling. Unlike the common approach for images, which flattens 2D token grids into 1D sequences, our design leverages the low-rank property inherent in multivariate dependencies, thereby reducing computational costs. Experiments show that DiTS achieves state-of-the-art performance across benchmarks, regardless of the presence of future exogenous variate observations, demonstrating unique generative forecasting strengths over traditional deterministic deep forecasting models.

[363] SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers

Shentong Mo, Lanqing Li

Main category: cs.LG

TL;DR: SaDiT accelerates protein backbone generation by combining SaProt tokenization with Diffusion Transformer architecture, using discrete latent space and IPA token cache for efficiency while maintaining structural quality.

DetailsMotivation: Current diffusion-based protein backbone design models are computationally intensive and slow for large-scale exploration. While flow-matching approaches like Proteina improve efficiency, tokenization for structural compression remains unexplored in protein domain.

Method: SaDiT integrates SaProt Tokenization with Diffusion Transformer (DiT) architecture, using discrete latent space to represent protein geometry. Introduces IPA Token Cache mechanism to optimize Invariant Point Attention layers by reusing computed token states during iterative sampling.

Result: SaDiT outperforms state-of-the-art models (RFDiffusion and Proteina) in both computational speed and structural viability. Shows superior ability to capture complex topological features with high designability in unconditional backbone generation and fold-class conditional generation tasks.

Conclusion: SaDiT provides an efficient framework for protein backbone generation through tokenization and optimization techniques, enabling faster structural exploration while maintaining theoretical SE(3) equivariance and structural quality.

Abstract: Generative models for de novo protein backbone design have achieved remarkable success in creating novel protein structures. However, these diffusion-based approaches remain computationally intensive and slower than desired for large-scale structural exploration. While recent efforts like Proteina have introduced flow-matching to improve sampling efficiency, the potential of tokenization for structural compression and acceleration remains largely unexplored in the protein domain. In this work, we present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. SaDiT leverages a discrete latent space to represent protein geometry, significantly reducing the complexity of the generation process while maintaining theoretical SE(3) equivariance. To further enhance efficiency, we introduce an IPA Token Cache mechanism that optimizes the Invariant Point Attention (IPA) layers by reusing computed token states during iterative sampling. Experimental results demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability. We evaluate our model across unconditional backbone generation and fold-class conditional generation tasks, where SaDiT shows superior ability to capture complex topological features with high designability.

[364] The hidden risks of temporal resampling in clinical reinforcement learning

Thomas Frost, Hrisheekesh Vaidya, Steve Harris

Main category: cs.LG

TL;DR: Temporal resampling in offline RL for healthcare degrades live performance due to counterfactual trajectories, distorted temporal expectations, and compounded generalization errors, with standard evaluation metrics failing to detect these issues.

DetailsMotivation: Current offline RL research in healthcare aggregates patient data into fixed time intervals to fit standard frameworks, but the impact of these temporal manipulations on model safety and efficacy is poorly understood.

Method: Using a gridworld navigation task and the UVA/Padova clinical diabetes simulator to demonstrate how temporal resampling affects offline RL performance during live deployment, identifying three failure mechanisms.

Result: Temporal resampling significantly degrades offline RL algorithm performance during live deployment, with standard off-policy evaluation metrics failing to detect these performance drops.

Conclusion: There’s a fundamental risk in current healthcare ORL pipelines, emphasizing the need for methods that explicitly handle the irregular timing of clinical decision-making.

Abstract: Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals, simplifying their mapping to standard ORL frameworks. The impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of offline reinforcement learning algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.

[365] F-GRPO: Don’t Let Your Policy Learn the Obvious and Forget the Rare

Daniil Plyusov, Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov

Main category: cs.LG

TL;DR: RLVR methods using group sampling struggle with rare-correct trajectories due to computational limits on group size; proposed difficulty-aware advantage scaling improves performance without increasing computational cost.

DetailsMotivation: Current RLVR methods based on group sampling face limitations: large group sizes are computationally infeasible, biasing learning toward already likely trajectories, while small groups miss rare-correct trajectories and contain mixed rewards that concentrate probability on common solutions.

Method: Analyzed probability of missing rare-correct modes as function of group size, characterized how updates redistribute mass within correct set. Proposed lightweight difficulty-aware advantage scaling coefficient inspired by Focal loss that down-weights updates on high-success prompts, directly integrable into existing RLVR algorithms like GRPO, DAPO, and CISPO.
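
The proposed coefficient is easy to state in code. A sketch in which a Focal-style weight (1 - p_success)^gamma scales a GRPO-style group-normalized advantage; the exact functional form is an assumption based on the abstract:

```python
import torch

def focal_scaled_advantages(rewards, gamma=2.0):
    r = rewards.float()                        # (G,) binary rewards for one prompt's group
    p_success = r.mean()                       # empirical success rate of the prompt
    adv = (r - r.mean()) / (r.std() + 1e-6)    # GRPO-style group-normalized advantage
    return (1.0 - p_success) ** gamma * adv    # down-weight easy (high-success) prompts
```

Prompts the policy already solves reliably contribute little gradient, leaving more effective learning signal for prompts whose correct trajectories are rare.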

Result: On Qwen2.5-7B across in-domain and out-of-domain benchmarks, method improved pass@256 from 64.1→70.3 (GRPO), 69.3→72.5 (DAPO), and 73.2→76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.

Conclusion: Difficulty-aware advantage scaling effectively addresses limitations of group sampling in RLVR by better handling rare-correct trajectories, improving performance across multiple algorithms without additional computational overhead.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updates. In practice, large group sizes are not feasible due to computational limits, which biases learning toward trajectories that are already likely. Smaller groups often miss rare-correct trajectories while still containing mixed rewards, concentrating probability on common solutions. We derive the probability that updates miss rare-correct modes as a function of group size, showing non-monotonic behavior, and characterize how updates redistribute mass within the correct set, revealing that unsampled-correct mass can shrink even as total correct mass grows. Motivated by this analysis, we propose a difficulty-aware advantage scaling coefficient, inspired by Focal loss, that down-weights updates on high-success prompts. The lightweight modification can be directly integrated into any group-relative RLVR algorithm such as GRPO, DAPO, and CISPO. On Qwen2.5-7B across in-domain and out-of-domain benchmarks, our method improves pass@256 from 64.1 $\rightarrow$ 70.3 (GRPO), 69.3 $\rightarrow$ 72.5 (DAPO), and 73.2 $\rightarrow$ 76.8 (CISPO), while preserving or improving pass@1, without increasing group size or computational cost.

[366] Adaptive-CaRe: Adaptive Causal Regularization for Robust Outcome Prediction

Nithya Bhasker, Fiona R. Kolbinger, Susu Hu, Gitta Kutyniok, Stefanie Speidel

Main category: cs.LG

TL;DR: Adaptive-CaRe: A model-agnostic regularization method that balances predictive accuracy and causal robustness for medical outcome prediction by penalizing differences between statistical and causal feature contributions.

DetailsMotivation: Supervised ML models for medical outcome prediction often optimize for accuracy at the cost of latching onto spurious correlations, while causal methods can be too conservative and lose diagnostic precision. There's a need to balance predictive value with causal robustness.

Method: Proposes Adaptive-CaRe regularization that incorporates a penalty proportional to the difference between estimated statistical contribution and estimated causal contribution of input features for model predictions. The method is model-agnostic and allows tuning of regularization strength λ to navigate the trade-off.

Result: Experiments on synthetic data show efficacy in finding robust predictors while maintaining competitive accuracy. Validation on real-world datasets confirms practical applicability. The method provides a blueprint for navigating accuracy-robustness trade-offs.

Conclusion: Adaptive-CaRe offers a simple yet effective solution to the long-standing trade-off between predictive accuracy and causal robustness in medical domain. Future work involves exploring alternate causal frameworks and complex models at larger scales.

Abstract: Accurate prediction of outcomes is crucial for clinical decision-making and personalized patient care. Supervised machine learning algorithms, which are commonly used for outcome prediction in the medical domain, optimize for predictive accuracy, which can result in models latching onto spurious correlations instead of robust predictors. Causal structure learning methods on the other hand have the potential to provide robust predictors for the target, but can be too conservative because of algorithmic and data assumptions, resulting in loss of diagnostic precision. Therefore, we propose a novel model-agnostic regularization strategy, Adaptive-CaRe, for generalized outcome prediction in the medical domain. Adaptive-CaRe strikes a balance between both predictive value and causal robustness by incorporating a penalty that is proportional to the difference between the estimated statistical contribution and estimated causal contribution of the input features for model predictions. Our experiments on synthetic data establish the efficacy of the proposed Adaptive-CaRe regularizer in finding robust predictors for the target while maintaining competitive predictive accuracy. With experiments on a standard causal benchmark, we provide a blueprint for navigating the trade-off between predictive accuracy and causal robustness by tweaking the regularization strength, $\lambda$. Validation using a real-world dataset confirms that the results translate to practical, real-domain settings. Therefore, Adaptive-CaRe provides a simple yet effective solution to the long-standing trade-off between predictive accuracy and causal robustness in the medical domain. Future work would involve studying alternate causal structure learning frameworks and complex classification models to provide deeper insights at a larger scale.

[367] Pruning at Initialisation through the lens of Graphon Limit: Convergence, Expressivity, and Generalisation

Hoang Pham, The-Anh Ta, Long Tran-Thanh

Main category: cs.LG

TL;DR: The paper connects pruning at initialization methods to graph limit theory via graphons, establishing a continuous framework for analyzing sparse neural networks and deriving theoretical results about their expressivity and generalization.

DetailsMotivation: Pruning at Initialization (PaI) methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are limited to finite-width statistics and lack rigorous characterization of global sparsity patterns as networks grow large.

Method: The authors connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. They introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that discrete masks converge to deterministic bipartite graphons under regularity conditions.
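
To make the two limiting regimes concrete, here is a toy illustration assuming saliency factorises as s_ij = u_i * v_j with the mask keeping the densest fraction; the paper's Factorised Saliency Model is more general, so this is illustrative only:

```python
import numpy as np

def factorised_saliency_mask(m, n, density=0.2, heterogeneous=True, seed=0):
    """Toy pruning mask with factorised saliency s_ij = u_i * v_j.

    heterogeneous=False mimics unstructured criteria (uniform scores, a
    homogeneous graphon limit); True mimics data-driven criteria (skewed
    scores, a heterogeneous limit)."""
    rng = np.random.default_rng(seed)
    if heterogeneous:
        u, v = rng.exponential(size=m), rng.exponential(size=n)  # heavy-tailed scores
    else:
        u, v = np.ones(m), np.ones(n)                            # uniform scores
    s = np.outer(u, v) * rng.uniform(size=(m, n))                # tie-breaking noise
    thresh = np.quantile(s, 1.0 - density)
    return (s >= thresh).astype(float)

mask = factorised_saliency_mask(512, 512)
print(mask.mean(), mask.sum(axis=1).std())  # density and row-degree spread
```

Swapping `heterogeneous` flips the row-degree spread from near-uniform to heavy-tailed, mirroring the homogeneous-versus-heterogeneous taxonomy above.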

Result: The limit framework establishes a topological taxonomy: unstructured methods converge to homogeneous graphons (uniform connectivity), while data-driven methods converge to heterogeneous graphons (implicit feature selection). The authors derive a Universal Approximation Theorem for sparse networks and a Graphon-NTK generalization bound showing how the limit graphon modulates kernel geometry.

Conclusion: The results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analyzing expressivity and generalization in sparse neural networks.

Abstract: Pruning at Initialisation methods discover sparse, trainable subnetworks before training, but their theoretical mechanisms remain elusive. Existing analyses are often limited to finite-width statistics, lacking a rigorous characterisation of the global sparsity patterns that emerge as networks grow large. In this work, we connect discrete pruning heuristics to graph limit theory via graphons, establishing the graphon limit of PaI masks. We introduce a Factorised Saliency Model that encompasses popular pruning criteria and prove that, under regularity conditions, the discrete masks generated by these algorithms converge to deterministic bipartite graphons. This limit framework establishes a novel topological taxonomy for sparse networks: while unstructured methods (e.g., Random, Magnitude) converge to homogeneous graphons representing uniform connectivity, data-driven methods (e.g., SNIP, GraSP) converge to heterogeneous graphons that encode implicit feature selection. Leveraging this continuous characterisation, we derive two fundamental theoretical results: (i) a Universal Approximation Theorem for sparse networks that depends only on the intrinsic dimension of active coordinate subspaces; and (ii) a Graphon-NTK generalisation bound demonstrating how the limit graphon modulates the kernel geometry to align with informative features. Our results transform the study of sparse neural networks from combinatorial graph problems into a rigorous framework of continuous operators, offering a new mechanism for analysing expressivity and generalisation in sparse neural networks.

[368] Optimal Abstractions for Verifying Properties of Kolmogorov-Arnold Networks (KANs)

Noah Schwartz, Chandra Kanth Nagesh, Sriram Sankaranarayanan, Ramneet Kaur, Tuhin Sahai, Susmit Jha

Main category: cs.LG

TL;DR: A verification framework for Kolmogorov-Arnold Networks (KANs) using piecewise affine abstractions with optimal piece allocation via dynamic programming and knapsack optimization to balance accuracy and computational tractability.

DetailsMotivation: KANs present verification challenges due to their nonlinear activation functions (piecewise polynomial splines/Gaussian processes). Existing verification methods struggle with balancing approximation accuracy vs. computational complexity when creating abstractions for formal verification.

Method: Replace KAN units with piecewise affine (PWA) functions with error bounds, encode verification as MILP. Key innovation: systematic framework using dynamic programming per unit and knapsack optimization across network to minimize total pieces while meeting error requirements.
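
Assuming the per-unit dynamic programs have already produced tables err[u][k] (best achievable error for unit u when granted k pieces), the network-level step reduces to a small knapsack-style DP; a sketch under that assumption:

```python
def allocate_pieces(err, budget):
    """err[u][k]: best error of unit u with k pieces (k >= 1; index 0 unused).
    Minimizes summed error subject to a total piece budget."""
    INF = float("inf")
    n = len(err)
    best = [[INF] * (budget + 1) for _ in range(n + 1)]
    choice = [[0] * (budget + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for u in range(1, n + 1):
        for b in range(budget + 1):
            for k in range(1, min(b, len(err[u - 1]) - 1) + 1):
                cand = best[u - 1][b - k] + err[u - 1][k]
                if cand < best[u][b]:
                    best[u][b], choice[u][b] = cand, k
    b_star = min(range(budget + 1), key=lambda x: best[n][x])
    total, alloc, b = best[n][b_star], [], b_star
    for u in range(n, 0, -1):            # backtrack the chosen piece counts
        alloc.append(choice[u][b]); b -= choice[u][b]
    return total, alloc[::-1]

# two units: more pieces -> smaller approximation error
print(allocate_pieces([[0, 0.9, 0.3, 0.1], [0, 0.5, 0.2, 0.05]], budget=4))
```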

Result: Empirical evaluation shows that the method’s upfront analysis costs are justified by superior verification results across KAN benchmarks compared to naive approaches, effectively balancing computational tractability with informative bounds.

Conclusion: The framework provides an optimal abstraction strategy for KAN verification by exploiting network structure, enabling practical verification of properties through careful piece allocation that maintains accuracy while controlling MILP complexity.

Abstract: We present a novel approach for verifying properties of Kolmogorov-Arnold Networks (KANs), a class of neural networks characterized by nonlinear, univariate activation functions typically implemented as piecewise polynomial splines or Gaussian processes. Our method creates mathematical "abstractions" by replacing each KAN unit with a piecewise affine (PWA) function, providing both local and global error estimates between the original network and its approximation. These abstractions enable property verification by encoding the problem as a Mixed Integer Linear Program (MILP), determining whether outputs satisfy specified properties when inputs belong to a given set. A critical challenge lies in balancing the number of pieces in the PWA approximation: too many pieces add binary variables that make verification computationally intractable, while too few pieces create excessive error margins that yield uninformative bounds. Our key contribution is a systematic framework that exploits KAN structure to find optimal abstractions. By combining dynamic programming at the unit level with a knapsack optimization across the network, we minimize the total number of pieces while guaranteeing specified error bounds. This approach determines the optimal approximation strategy for each unit while maintaining overall accuracy requirements. Empirical evaluation across multiple KAN benchmarks demonstrates that the upfront analysis costs of our method are justified by superior verification results.

[369] Memory-Conditioned Flow-Matching for Stable Autoregressive PDE Rollouts

Victor Armegioiu

Main category: cs.LG

TL;DR: Memory-conditioned diffusion/flow-matching for stable long-horizon PDE solving with improved fine-scale fidelity

DetailsMotivation: Autoregressive generative PDE solvers suffer from error accumulation and drift over long rollouts, especially in coarse-to-fine regimes where unresolved scales must be regenerated at each step. Current diffusion/flow-matching models are memoryless despite the structural need for memory in accurate closures.

Method: Introduces memory-conditioned diffusion/flow-matching with a compact online state injected into denoising via latent features. Uses the Mori–Zwanzig projection formalism to show that the exact resolved evolution requires a memory term. Memory induces a structured conditional tail prior for unresolved scales and reduces the transport needed to populate missing frequencies.
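
A minimal conditional flow-matching training step with the memory vector concatenated into the velocity network's input; the architecture and how the memory state is maintained online are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MemoryConditionedVelocity(nn.Module):
    """v_theta(x_t, t, m): velocity field conditioned on a memory embedding m."""
    def __init__(self, dim, mem_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + mem_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, dim))
    def forward(self, x_t, t, m):
        return self.net(torch.cat([x_t, m, t], dim=-1))

def fm_step(model, x0, x1, m):
    """One flow-matching loss on the straight path x_t = (1 - t) x0 + t x1."""
    t = torch.rand(x0.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0                       # velocity of the linear path
    return ((model(x_t, t, m) - target) ** 2).mean()

model = MemoryConditionedVelocity(dim=8, mem_dim=4)
loss = fm_step(model, torch.randn(32, 8), torch.randn(32, 8), torch.randn(32, 4))
```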

Result: Proves Wasserstein stability of conditional kernel and derives discrete Grönwall rollout bounds separating memory approximation from conditional generation error. Experiments on compressible flows with shocks and multiscale mixing show improved accuracy, markedly more stable long-horizon rollouts, and better fine-scale spectral/statistical fidelity.

Conclusion: Memory-conditioned generative models address fundamental limitations of memoryless closures for PDE solving, enabling stable long-horizon rollouts with accurate fine-scale reconstruction in multiscale systems.

Abstract: Autoregressive generative PDE solvers can be accurate one step ahead yet drift over long rollouts, especially in coarse-to-fine regimes where each step must regenerate unresolved fine scales. This is the regime of diffusion and flow-matching generators: although their internal dynamics are Markovian, rollout stability is governed by per-step \emph{conditional law} errors. Using the Mori–Zwanzig projection formalism, we show that eliminating unresolved variables yields an exact resolved evolution with a Markov term, a memory term, and an orthogonal forcing, exposing a structural limitation of memoryless closures. Motivated by this, we introduce memory-conditioned diffusion/flow-matching with a compact online state injected into denoising via latent features. Via disintegration, memory induces a structured conditional tail prior for unresolved scales and reduces the transport needed to populate missing frequencies. We prove Wasserstein stability of the resulting conditional kernel. We then derive discrete Grönwall rollout bounds that separate memory approximation from conditional generation error. Experiments on compressible flows with shocks and multiscale mixing show improved accuracy and markedly more stable long-horizon rollouts, with better fine-scale spectral and statistical fidelity.

[370] NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

Main category: cs.LG

TL;DR: NanoQuant is a post-training quantization method that compresses LLMs to binary and sub-1-bit levels using low-rank binary factorization, enabling large models to run on consumer hardware.

DetailsMotivation: Existing weight-only quantization methods fail to efficiently compress models to binary (1-bit) levels, requiring large amounts of data/compute or incurring additional storage overhead. There's a need for efficient compression to enable large-scale deployment on consumer hardware.

Method: Formulates quantization as a low-rank binary factorization problem, compressing full-precision weights to low-rank binary matrices and scales. Uses ADMM for precise initialization of latent binary matrices and scales, followed by block and model reconstruction tuning.
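
To make the factorization target concrete: approximating W as a scaled product of two binary matrices stores roughly rank*(m+n) bits plus one scale, i.e. sub-1-bit per weight once rank << min(m, n). The naive alternating-sign heuristic below stands in for the paper's ADMM initialization and reconstruction tuning:

```python
import torch

def lowrank_binary_factorize(W, rank=8, iters=50):
    """Approximate W (m x n) as s * A @ B.T with A, B in {-1, +1}.

    A crude alternating heuristic, not the paper's ADMM procedure."""
    torch.manual_seed(0)
    B = torch.randint(0, 2, (W.shape[1], rank)).float() * 2 - 1
    for _ in range(iters):
        A = torch.sign(W @ B); A[A == 0] = 1.0     # update left binary factor
        B = torch.sign(W.T @ A); B[B == 0] = 1.0   # update right binary factor
    AB = A @ B.T
    s = (W * AB).sum() / (AB * AB).sum()           # least-squares optimal scale
    return s, A, B

W = torch.randn(256, 256)
s, A, B = lowrank_binary_factorize(W, rank=32)
rel_err = torch.norm(W - s * (A @ B.T)) / torch.norm(W)
```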

Result: Achieves state-of-the-art accuracy at sub-1-bit compression rates, compressing Llama2-70B by 25.8× in 13 hours on a single H100, enabling 70B models to operate on 8 GB consumer GPUs.

Conclusion: NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, making large-scale LLM deployment feasible on consumer hardware through efficient binary and sub-1-bit compression.

Abstract: Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, and compresses full-precision weights to low-rank binary matrices and scales. Specifically, it utilizes an efficient alternating direction method of multipliers (ADMM) method to precisely initialize latent binary matrices and scales, and then tune the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.

[371] AEGIS: Adversarial Target-Guided Retention-Data-Free Robust Concept Erasure from Diffusion Models

Fengpeng Li, Kemou Li, Qizhou Wang, Bo Han, Jiantao Zhou

Main category: cs.LG

TL;DR: AEGIS framework improves both robustness and retention in concept erasure for diffusion models without needing retention data

DetailsMotivation: Current concept erasure methods for diffusion models face a trade-off between robustness (resisting reactivation of erased concepts) and retention (preserving unrelated concepts), with existing approaches typically improving one at the expense of the other

Method: Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention simultaneously

Result: Not specified in abstract, but the framework claims to advance both robustness and retention without the typical trade-off

Conclusion: AEGIS provides a solution to the robustness-retention trade-off in concept erasure for diffusion models, enabling safer and more practical deployment

Abstract: Concept erasure helps stop diffusion models (DMs) from generating harmful content, but current methods face a robustness-retention trade-off. Robustness means the model fine-tuned by concept erasure methods resists reactivation of erased concepts, even under semantically related prompts. Retention means unrelated concepts are preserved so the model’s overall utility stays intact. Both are critical for concept erasure in practice, yet addressing them simultaneously is challenging: prior work typically strengthens one while degrading the other, e.g., mapping a single erased prompt to a fixed safe target leaves class-level remnants exploitable by prompt attacks, whereas retention-oriented schemes underperform against adaptive adversaries. This paper introduces Adversarial Erasure with Gradient Informed Synergy (AEGIS), a retention-data-free framework that advances both robustness and retention.

[372] Diffeomorphism-Equivariant Neural Networks

Josephine Elisabeth Oettinger, Zakhar Shumaylov, Johannes Bostelmann, Jan Lellmann, Carola-Bibiane Schönlieb

Main category: cs.LG

TL;DR: Proposes energy-based canonicalisation method to induce diffeomorphism equivariance in pre-trained neural networks using differentiable image registration techniques

DetailsMotivation: Existing equivariance methods focus on compact/finite/low-dimensional groups with linear actions, but there's a need to extend equivariance to infinite-dimensional groups like diffeomorphisms for better generalization

Method: Formulates equivariance as optimization problem using energy-based canonicalisation, leveraging established differentiable image registration methods to induce diffeomorphism equivariance in pre-trained networks
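
A minimal sketch of energy-based canonicalisation, restricted to a 2D rotation as a toy stand-in for a diffeomorphism; the energy function and the use of plain gradient descent (rather than a full registration method) are assumptions:

```python
import torch

def canonicalise(x, energy, steps=100, lr=0.05):
    """Find the transform (here: a rotation angle) minimizing an energy of
    the transformed input, then return the canonicalised input. Applying a
    network to canonicalise(x) makes it approximately invariant; composing
    with the recovered transform yields approximate equivariance."""
    theta = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        c, s = torch.cos(theta), torch.sin(theta)
        R = torch.stack([torch.cat([c, -s]), torch.cat([s, c])])
        loss = energy(x @ R.T)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        c, s = torch.cos(theta), torch.sin(theta)
        R = torch.stack([torch.cat([c, -s]), torch.cat([s, c])])
        return x @ R.T

# toy energy: prefer point clouds whose principal direction lies on the x-axis
pts = torch.randn(100, 2)
canon = canonicalise(pts, energy=lambda p: (p[:, 1] ** 2).mean())
```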

Result: Achieves approximate equivariance and generalizes to unseen transformations without extensive data augmentation or retraining on segmentation and classification tasks

Conclusion: Provides a practical approach to extend equivariance to infinite-dimensional groups using optimization-based canonicalisation, enabling better generalization with existing pre-trained models

Abstract: Incorporating group symmetries via equivariance into neural networks has emerged as a robust approach for overcoming the efficiency and data demands of modern deep learning. While most existing approaches, such as group convolutions and averaging-based methods, focus on compact, finite, or low-dimensional groups with linear actions, this work explores how equivariance can be extended to infinite-dimensional groups. We propose a strategy designed to induce diffeomorphism equivariance in pre-trained neural networks via energy-based canonicalisation. Formulating equivariance as an optimisation problem allows us to access the rich toolbox of already established differentiable image registration methods. Empirical results on segmentation and classification tasks confirm that our approach achieves approximate equivariance and generalises to unseen transformations without relying on extensive data augmentation or retraining.

[373] Explaining Grokking in Transformers through the Lens of Inductive Bias

Jaisidh Singh, Diganta Misra, Antonio Orvieto

Main category: cs.LG

TL;DR: Transformers exhibit grokking (delayed generalization) influenced by architectural inductive biases like LayerNorm position and optimization settings, with feature compressibility predicting generalization emergence.

DetailsMotivation: To understand grokking phenomena in transformers through the lens of inductive bias - how architectural choices and optimization settings shape the network's preference for certain solutions over others during learning.

Method: Systematically investigate how LayerNorm position affects grokking speed by analyzing shortcut learning and attention entropy. Study optimization settings (learning rate, weight decay, readout scale) and their interactions. Analyze feature evolution throughout training and measure feature compressibility as a predictor of generalization.

Result: LayerNorm position strongly modulates grokking speed by shaping shortcut learning and attention entropy. Readout scale as a control for lazy training can be confounded by learning rate and weight decay. Features evolve continuously throughout training, suggesting grokking is more nuanced than a simple lazy-to-rich transition. Generalization predictably emerges with feature compressibility across different inductive bias modulators.

Conclusion: Grokking in transformers is shaped by architectural inductive biases and optimization settings, with feature compressibility serving as a reliable predictor of generalization emergence, providing insights into the nuanced learning dynamics beyond simple lazy-to-rich transitions.

Abstract: We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that let the network prefer one solution over another. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut-learning and attention entropy. Subsequently, we study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale. Particularly, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.

[374] On the Identifiability of Steering Vectors in Large Language Models

Sohan Venkatesh, Ashish Mahendran Kurapath

Main category: cs.LG

TL;DR: Activation steering vectors are fundamentally non-identifiable due to equivalence classes of behaviorally indistinguishable interventions, though identifiability can be recovered under certain structural assumptions.

DetailsMotivation: To investigate whether activation steering methods (like persona vectors) actually reveal meaningful internal representations, given the implicit assumption that steering directions are identifiable and uniquely recoverable from input-output behavior.

Method: Formalize steering as interventions on internal representations, prove theoretical non-identifiability under realistic conditions, and empirically validate across multiple models and semantic traits using orthogonal perturbations.
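
Constructing such an orthogonal perturbation is simple Gram-Schmidt; a sketch (the dimensionality and the norm-matching convention are illustrative assumptions):

```python
import torch

def orthogonal_perturbation(v, seed=0):
    """Random direction orthogonal to steering vector v, rescaled to v's norm.

    The paper's evidence for non-identifiability is that steering with such
    directions can be nearly as effective as steering with v itself."""
    torch.manual_seed(seed)
    r = torch.randn_like(v)
    r = r - (r @ v) / (v @ v) * v        # remove the component along v
    return r / r.norm() * v.norm()       # match the original steering norm

v = torch.randn(4096)                    # hypothetical residual-stream direction
u = orthogonal_perturbation(v)
print(float(u @ v))                      # ~0 up to floating-point error
```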

Result: Steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions; orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes.

Conclusion: Reveals fundamental interpretability limits for activation steering methods and clarifies structural assumptions (statistical independence, sparsity, multi-environment validation, cross-layer consistency) required for reliable safety-critical control.

Abstract: Activation steering methods, such as persona vectors, are widely used to control large language model behavior and increasingly interpreted as revealing meaningful internal representations. This interpretation implicitly assumes steering directions are identifiable and uniquely recoverable from input-output behavior. We formalize steering as an intervention on internal representations and prove that, under realistic modeling and data conditions, steering vectors are fundamentally non-identifiable due to large equivalence classes of behaviorally indistinguishable interventions. Empirically, we validate this across multiple models and semantic traits, showing orthogonal perturbations achieve near-equivalent efficacy with negligible effect sizes. However, identifiability is recoverable under structural assumptions including statistical independence, sparsity constraints, multi-environment validation or cross-layer consistency. These findings reveal fundamental interpretability limits and clarify structural assumptions required for reliable safety-critical control.

[375] Disentanglement by means of action-induced representations

Gorka Muñoz-Gil, Hendrik Poulsen Nautrup, Arunava Majumder, Paulin de Schoulepnikoff, Florian Fürrutter, Marius Krumm, Hans J. Briegel

Main category: cs.LG

TL;DR: VAIR (Variational Action-Induced Representations) framework achieves provable disentanglement by modeling representations based on actions performed on physical systems, overcoming limitations of standard VAEs for nonlinear independent component analysis.

DetailsMotivation: Standard VAEs struggle with disentangled representations where each latent dimension corresponds to distinct generative factors, fundamentally tied to the inability to perform nonlinear independent component analysis. The paper aims to overcome this by incorporating action/experiment information.

Method: Introduces Action-Induced Representations (AIRs) framework that models representations of physical systems given experiments/actions performed on them. Develops variational AIR architecture (VAIR) that can extract AIRs by incorporating action dependence into the representation learning process.

Result: VAIR achieves provable disentanglement of degrees of freedom with respect to their action dependence, where standard VAEs fail. The method captures action dependence of underlying generative factors, directly linking experiments to the degrees of freedom they influence.

Conclusion: The AIR framework and VAIR architecture provide a principled approach to learning interpretable, disentangled representations by incorporating action information, solving fundamental limitations of standard VAE-based disentanglement methods.

Abstract: Learning interpretable representations with variational autoencoders (VAEs) is a major goal of representation learning. The main challenge lies in obtaining disentangled representations, where each latent dimension corresponds to a distinct generative factor. This difficulty is fundamentally tied to the inability to perform nonlinear independent component analysis. Here, we introduce the framework of action-induced representations (AIRs) which models representations of physical systems given experiments (or actions) that can be performed on them. We show that, in this framework, we can provably disentangle degrees of freedom w.r.t. their action dependence. We further introduce a variational AIR architecture (VAIR) that can extract AIRs and therefore achieve provable disentanglement where standard VAEs fail. Beyond state representation, VAIR also captures the action dependence of the underlying generative factors, directly linking experiments to the degrees of freedom they influence.

[376] Soft Forward-Backward Representations for Zero-shot Reinforcement Learning with General Utilities

Marco Bagatella, Thomas Rupf, Georg Martius, Andreas Krause

Main category: cs.LG

TL;DR: A novel maximum entropy forward-backward algorithm for zero-shot RL with general utilities (arbitrary differentiable functions of occupancy measures), enabling direct optimization of complex objectives like distribution matching at test-time without iterative training.

DetailsMotivation: Existing forward-backward algorithms work well for standard RL with additive rewards, but cannot handle the more expressive class of RL problems with general utilities (arbitrary differentiable functions of occupancy measures). This limitation prevents solving important tasks like distribution matching or pure exploration that don't reduce to additive rewards.

Method: Proposes a maximum entropy (soft) variant of forward-backward algorithms that recovers a family of stochastic policies from offline data. When combined with zero-order search over compact policy embeddings, the method can directly optimize general utilities at test-time without iterative optimization schemes.
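
The test-time optimization can be as simple as sampling embeddings and scoring them; a sketch assuming a black-box `utility_of(z)` that estimates the general utility of the policy indexed by embedding z (e.g., via the learned soft forward-backward model):

```python
import numpy as np

def zero_order_policy_search(utility_of, dim=16, n_samples=512, seed=0):
    """Pick the policy embedding maximizing a general utility, with no
    gradients and no iterative policy optimization at test time."""
    rng = np.random.default_rng(seed)
    zs = rng.normal(size=(n_samples, dim))         # candidate policy embeddings
    scores = np.array([utility_of(z) for z in zs])
    return zs[scores.argmax()]

# toy utility standing in for, e.g., a distribution-matching objective
z_star = zero_order_policy_search(lambda z: -(np.linalg.norm(z) - 2.0) ** 2)
```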

Result: The method retains favorable properties of forward-backward algorithms while extending their range to more general RL problems. Demonstrates effectiveness across both didactic and high-dimensional experiments, showing capability to handle tasks that cannot be reduced to additive rewards.

Conclusion: The proposed maximum entropy forward-backward algorithm successfully extends zero-shot RL capabilities to the more expressive class of RL with general utilities, enabling direct optimization of complex objectives like distribution matching at test-time without iterative training.

Abstract: Recent advancements in zero-shot reinforcement learning (RL) have facilitated the extraction of diverse behaviors from unlabeled, offline data sources. In particular, forward-backward algorithms (FB) can retrieve a family of policies that can approximately solve any standard RL problem (with additive rewards, linear in the occupancy measure), given sufficient capacity. While retaining zero-shot properties, we tackle the greater problem class of RL with general utilities, in which the objective is an arbitrary differentiable function of the occupancy measure. This setting is strictly more expressive, capturing tasks such as distribution matching or pure exploration, which may not be reduced to additive rewards. We show that this additional complexity can be captured by a novel, maximum entropy (soft) variant of the forward-backward algorithm, which recovers a family of stochastic policies from offline data. When coupled with zero-order search over compact policy embeddings, this algorithm can sidestep iterative optimization schemes, and optimizes general utilities directly at test-time. Across both didactic and high-dimensional experiments, we demonstrate that our method retains favorable properties of FB algorithms, while also extending their range to more general RL problems.

[377] Calibrating Generative AI to Produce Realistic Essays for Data Augmentation

Edward W. Wolfe, Justin O. Barber

Main category: cs.LG

TL;DR: LLM prompting strategies for generating simulated student essays to augment automated scoring training data, with predict-next prompting performing best for score agreement and realism.

DetailsMotivation: Data augmentation can address limited training data in automated scoring engines for constructed response items, but needs to preserve writing quality and produce realistic text.

Method: Three LLM prompting approaches (predict next, sentence strategy, 25 examples) used to generate simulated essays; human raters scored essays and rated realism of generated text.

Result: Predict-next prompting produced highest agreement between human raters on scores, best preserved original essay quality (with sentence strategy), and produced most realistic text (with 25 examples strategy).

Conclusion: Predict-next prompting is most effective for generating realistic simulated essays that preserve writing quality for augmenting automated scoring training datasets.

Abstract: Data augmentation can mitigate limited training data in machine-learning automated scoring engines (ASEs) for constructed response items. This study seeks to determine how well three approaches to large language model prompting produce essays that preserve the writing quality of the original essays and produce realistic text for augmenting ASE training datasets. We created simulated versions of student essays, and human raters assigned scores to them and rated the realism of the generated text. The results of the study indicate that the predict-next prompting strategy produces the highest level of agreement between human raters regarding simulated essay scores, the predict-next and sentence strategies best preserve the rated quality of the original essay in the simulated essays, and the predict-next and 25-examples strategies produce the most realistic text as judged by human raters.

[378] AEGPO: Adaptive Entropy-Guided Policy Optimization for Diffusion Models

Yuming Li, Qingyu Li, Chengyu Bai, Xiangyang Luo, Zeyue Xue, Wenyu Qin, Meng Wang, Yikai Wang, Shanghang Zhang

Main category: cs.LG

TL;DR: AEGPO introduces an adaptive entropy-guided policy optimization method that uses attention entropy as a dual-signal proxy to dynamically allocate computational resources during RLHF training of diffusion models, focusing on high-value samples and critical timesteps.

DetailsMotivation: Standard RLHF methods for aligning diffusion models (like GRPO) suffer from inefficient sampling strategies that treat all prompts and denoising steps uniformly, ignoring variations in sample learning value and dynamic exploration moments.

Method: AEGPO uses attention entropy as a dual-signal proxy: (1) ΔEntropy across samples indicates learning value for global budget allocation, and (2) Entropy(t) peaks identify critical timesteps for local exploration. This enables adaptive optimization at both global (sample prioritization) and local (timestep selection) levels.
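
Both signals reduce to simple tensor statistics; a sketch in which a proportional budget rule stands in for AEGPO's (unspecified) allocation function:

```python
import torch

def attention_entropy(attn):
    """Mean entropy of attention rows. attn: (..., queries, keys), rows sum to 1."""
    return -(attn * (attn + 1e-12).log()).sum(-1).mean()

def allocate_rollouts(delta_entropy, total_budget):
    """Give prompts with larger |ΔEntropy| (policy drift from the base policy)
    more rollouts; proportional allocation, up to rounding."""
    w = delta_entropy.abs()
    w = w / w.sum()
    return (w * total_budget).round().long()

attn = torch.softmax(torch.randn(8, 16, 16), dim=-1)   # toy attention maps
h = attention_entropy(attn)                            # Entropy(t)-style signal
budget = allocate_rollouts(torch.tensor([0.1, 0.5, 0.2, 1.2]), total_budget=64)
```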

Result: Experiments on text-to-image generation tasks show AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.

Conclusion: Attention entropy serves as an effective proxy for identifying high-value learning opportunities in diffusion model RLHF, enabling more efficient and effective policy optimization through adaptive resource allocation.

Abstract: Reinforcement learning from human feedback (RLHF) shows promise for aligning diffusion and flow models, yet policy optimization methods such as GRPO suffer from inefficient and static sampling strategies. These methods treat all prompts and denoising steps uniformly, ignoring substantial variations in sample learning value as well as the dynamic nature of critical exploration moments. To address this issue, we conduct a detailed analysis of the internal attention dynamics during GRPO training and uncover a key insight: attention entropy can serve as a powerful dual-signal proxy. First, across different samples, the relative change in attention entropy (ΔEntropy), which reflects the divergence between the current policy and the base policy, acts as a robust indicator of sample learning value. Second, during the denoising process, the peaks of absolute attention entropy (Entropy(t)), which quantify attention dispersion, effectively identify critical timesteps where high-value exploration occurs. Building on this observation, we propose Adaptive Entropy-Guided Policy Optimization (AEGPO), a novel dual-signal, dual-level adaptive optimization strategy. At the global level, AEGPO uses ΔEntropy to dynamically allocate rollout budgets, prioritizing prompts with higher learning value. At the local level, it exploits the peaks of Entropy(t) to guide exploration selectively at critical high-dispersion timesteps rather than uniformly across all denoising steps. By focusing computation on the most informative samples and the most critical moments, AEGPO enables more efficient and effective policy optimization. Experiments on text-to-image generation tasks demonstrate that AEGPO significantly accelerates convergence and achieves superior alignment performance compared to standard GRPO variants.

[379] On the Convergence of Multicalibration Gradient Boosting

Daniel Haimovich, Fridolin Linder, Lorenzo Perini, Niek Tax, Milan Vojnovic

Main category: cs.LG

TL;DR: Multicalibration gradient boosting convergence analysis for regression with squared-error loss, showing O(1/√T) convergence rate for multicalibration error, with linear convergence under smoothness assumptions.

DetailsMotivation: Multicalibration gradient boosting has shown empirical success in producing approximately multicalibrated predictors at web scale, but its theoretical convergence properties remain poorly understood, creating a gap between practical deployment and theoretical foundations.

Method: Theoretical analysis of multicalibration gradient boosting convergence in regression with squared-error loss, examining prediction update decay rates, convergence bounds for multicalibration error, and analysis of adaptive variants with local quadratic convergence of training loss.
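
A toy version of the boosting loop being analyzed, simplified to group-wise residual patching with squared error (the algorithm in the paper also buckets by prediction level and uses weak learners, so this is an illustration of the update style only):

```python
import numpy as np

def multicalibration_boost(y, groups, rounds=100, eta=0.5):
    """Each round patches the subgroup with the largest mean residual; the
    shrinking update sizes over rounds are what the convergence analysis
    quantifies."""
    f = np.full(len(y), y.mean())
    for _ in range(rounds):
        resid = y - f
        gaps = [abs(resid[g].mean()) if g.any() else 0.0 for g in groups]
        i = int(np.argmax(gaps))
        if gaps[i] < 1e-6:                 # approximately multicalibrated: stop
            break
        f[groups[i]] += eta * resid[groups[i]].mean()
    return f

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
y = x[:, 0] + 0.5 * (x[:, 1] > 0)
groups = [x[:, 1] > 0, x[:, 1] <= 0, x[:, 2] > 1]   # overlapping subpopulations
f = multicalibration_boost(y, groups)
```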

Result: Proves O(1/√T) convergence rate for multicalibration error, with linear convergence under smooth weak learner assumptions; shows local quadratic convergence for adaptive variants; validates theory with real-world dataset experiments.

Conclusion: Provides first convergence guarantees for multicalibration gradient boosting, bridging empirical success with theoretical understanding, with implications for scalable fair machine learning deployment.

Abstract: Multicalibration gradient boosting has recently emerged as a scalable method that empirically produces approximately multicalibrated predictors and has been deployed at web scale. Despite this empirical success, its convergence properties are not well understood. In this paper, we bridge the gap by providing convergence guarantees for multicalibration gradient boosting in regression with squared-error loss. We show that the magnitude of successive prediction updates decays at $O(1/\sqrt{T})$, which implies the same convergence rate bound for the multicalibration error over rounds. Under additional smoothness assumptions on the weak learners, this rate improves to linear convergence. We further analyze adaptive variants, showing local quadratic convergence of the training loss, and we study rescaling schemes that preserve convergence. Experiments on real-world datasets support our theory and clarify the regimes in which the method achieves fast convergence and strong multicalibration.

[380] Robust Online Learning

Sajad Ashkezari

Main category: cs.LG

TL;DR: The paper studies robust online learning where both clean data and labels are adversarially chosen, defining a new dimension that controls mistake bounds in realizable settings and regret bounds in agnostic settings.

DetailsMotivation: The motivation is to study robust classification in a more challenging setting than traditional robust PAC learning, where not only the input is perturbed but the clean data and labels are also adversarially chosen, making it an online learning problem.

Method: The authors formulate the problem as an online learning problem, define a new dimension for hypothesis classes (resembling Littlestone dimension), analyze both realizable and agnostic settings, generalize to multiclass hypothesis classes, and study cases where perturbation sets are unknown but have priors.

Result: The paper shows that their newly defined dimension controls mistake bounds in the realizable setting and regret bounds in the agnostic setting. They prove similar results for multiclass hypothesis classes and analyze the case with unknown perturbation sets.

Conclusion: The work establishes a theoretical framework for robust online learning with adversarial clean data and labels, providing dimension-based characterizations of learnability that differ from traditional PAC learning dimensions.

Abstract: We study the problem of learning robust classifiers where the classifier will receive a perturbed input. Unlike robust PAC learning studied in prior work, here the clean data and its label are also adversarially chosen. We formulate this setting as an online learning problem and consider both the realizable and agnostic learnability of hypothesis classes. We define a new dimension of classes and show it controls the mistake bounds in the realizable setting and the regret bounds in the agnostic setting. In contrast to the dimension that characterizes learnability in the PAC setting, our dimension is rather simple and resembles the Littlestone dimension. We generalize our dimension to multiclass hypothesis classes and prove similar results in the realizable case. Finally, we study the case where the learner does not know the set of allowed perturbations for each point and only has some prior on them.

[381] Zero-shot Generalizable Graph Anomaly Detection with Mixture of Riemannian Experts

Xinyu Zhao, Qingyun Sun, Jiayi Luo, Xingcheng Fu, Jianxin Li

Main category: cs.LG

TL;DR: GAD-MoRE: A zero-shot graph anomaly detection framework using mixture of Riemannian experts to handle diverse geometric anomaly patterns across domains.

DetailsMotivation: Existing zero-shot graph anomaly detection methods ignore intrinsic geometric differences across anomaly patterns, limiting cross-domain generalization. Anomaly detectability depends on underlying geometric properties, and using a single static curvature space distorts structural signatures of anomalies.

Method: Proposes GAD-MoRE with: 1) Multiple specialized Riemannian expert networks each in distinct curvature spaces; 2) Anomaly-aware multi-curvature feature alignment module to project inputs into parallel Riemannian spaces; 3) Memory-based dynamic router that adaptively assigns inputs to most compatible expert based on historical reconstruction performance.

Result: Extensive experiments in zero-shot setting show GAD-MoRE significantly outperforms state-of-the-art generalist GAD baselines, and even surpasses strong competitors that are few-shot fine-tuned with labeled data from target domain.

Conclusion: The mixture of Riemannian experts architecture effectively addresses geometry-dependent anomaly patterns, enabling better zero-shot generalization across diverse graph domains by capturing different geometric characteristics in appropriate curvature spaces.

Abstract: Graph Anomaly Detection (GAD) aims to identify irregular patterns in graph data, and recent works have explored zero-shot generalist GAD to enable generalization to unseen graph datasets. However, existing zero-shot GAD methods largely ignore intrinsic geometric differences across diverse anomaly patterns, substantially limiting their cross-domain generalization. In this work, we reveal that anomaly detectability is highly dependent on the underlying geometric properties and that embedding graphs from different domains into a single static curvature space can distort the structural signatures of anomalies. To address the challenge that a single curvature space cannot capture geometry-dependent graph anomaly patterns, we propose GAD-MoRE, a novel framework for zero-shot Generalizable Graph Anomaly Detection with a Mixture of Riemannian Experts architecture. Specifically, to ensure that each anomaly pattern is modeled in the Riemannian space where it is most detectable, GAD-MoRE employs a set of specialized Riemannian expert networks, each operating in a distinct curvature space. To align raw node features with curvature-specific anomaly characteristics, we introduce an anomaly-aware multi-curvature feature alignment module that projects inputs into parallel Riemannian spaces, enabling the capture of diverse geometric characteristics. Finally, to facilitate better generalization beyond seen patterns, we design a memory-based dynamic router that adaptively assigns each input to the most compatible expert based on historical reconstruction performance on similar anomalies. Extensive experiments in the zero-shot setting demonstrate that GAD-MoRE significantly outperforms state-of-the-art generalist GAD baselines, and even surpasses strong competitors that are few-shot fine-tuned with labeled data from the target domain.

[382] Weisfeiler and Lehman Go Categorical

Seongjin Choi, Gahee Kim, Se-Young Yun

Main category: cs.LG

TL;DR: A categorical framework for hypergraph neural networks using functorial mappings to graded posets, yielding Hypergraph Isomorphism Networks with provable expressivity.

DetailsMotivation: Extending the lifting map paradigm from graphs to hypergraphs remains fragmented; existing methods lack systematic approaches to capture complex intersection geometries in hypergraphs.

Method: Introduces categorical Weisfeiler-Lehman framework formalizing lifting as functorial mapping from data categories to graded posets. Derives Hypergraph Isomorphism Networks via two functors: incidence functor and symmetric simplicial complex functor.

Result: Both incidence-based and symmetric simplicial approaches subsume expressive power of standard Hypergraph Weisfeiler-Lehman test. Experiments on real-world benchmarks validate theoretical findings.

Conclusion: The categorical framework provides systematic derivation of hypergraph neural architectures with rich information flow capturing complex geometries missed by existing methods.

Abstract: While the lifting map has significantly enhanced the expressivity of graph neural networks, extending this paradigm to hypergraphs remains fragmented. To address this, we introduce the categorical Weisfeiler-Lehman framework, which formalizes lifting as a functorial mapping from an arbitrary data category to the unifying category of graded posets. When applied to hypergraphs, this perspective allows us to systematically derive Hypergraph Isomorphism Networks, a family of neural architectures where the message passing topology is strictly determined by the choice of functor. We introduce two distinct functors from the category of hypergraphs: an incidence functor and a symmetric simplicial complex functor. While the incidence architecture structurally mirrors standard bipartite schemes, our functorial derivation enforces a richer information flow over the resulting poset, capturing complex intersection geometries often missed by existing methods. We theoretically characterize the expressivity of these models, proving that both the incidence-based and symmetric simplicial approaches subsume the expressive power of the standard Hypergraph Weisfeiler-Lehman test. Extensive experiments on real-world benchmarks validate these theoretical findings.

[383] CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers

Boxiang Zhang, Baijian Yang

Main category: cs.LG

TL;DR: CORP is a closed-form one-shot structured pruning framework for Vision Transformers that removes MLP dimensions and attention substructures without retraining, using only a small unlabeled calibration set.

DetailsMotivation: Vision Transformers achieve strong accuracy but have high compute/memory costs. Existing pruning methods require retraining or multi-stage optimization, limiting post-training deployment. There's a need for efficient pruning that works under strict post-training constraints.

Method: CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights, minimizing expected representation error under calibration distribution.
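
The closed-form compensation is ordinary ridge regression on calibration activations; a sketch for an MLP hidden layer (the affine/bias term and the attention-logit case are omitted, and variable names are ours, not the paper's):

```python
import torch

def ridge_compensation(H_keep, H_drop, W_next_drop, lam=1e-2):
    """Fold pruned units' contribution into the next layer's retained rows.

    H_keep: (N, k) calibration activations of retained hidden units.
    H_drop: (N, d) activations of pruned units.
    W_next_drop: (d, out) next-layer rows consumed by the pruned units.
    Returns a (k, out) correction from the closed-form ridge fit
    H_drop ~ H_keep @ C, minimizing representation error on calibration data."""
    k = H_keep.shape[1]
    gram = H_keep.T @ H_keep + lam * torch.eye(k)
    C = torch.linalg.solve(gram, H_keep.T @ H_drop)   # (k, d) closed form
    return C @ W_next_drop                            # (k, out) weight correction

H = torch.randn(1024, 64)                 # toy calibration activations
keep, drop = H[:, :48], H[:, 48:]         # prune 16 of 64 hidden units
delta_W = ridge_compensation(keep, drop, torch.randn(16, 10))
```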

Result: Experiments on ImageNet with DeiT models show CORP preserves accuracy under aggressive sparsity. On DeiT-Huge, it retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. Pruning completes in under 20 minutes on a single GPU.

Conclusion: CORP enables efficient post-training structured pruning for Vision Transformers without labels, gradients, or fine-tuning, demonstrating strong redundancy in MLP and attention representations while delivering substantial real-world efficiency gains.

Abstract: Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose \textbf{CORP}, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.

[384] Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences

Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh

Main category: cs.LG

TL;DR: The paper introduces a more general theoretical framework for RLHF alignment algorithms beyond DPO, identifying conditions for tractability and preventing probability displacement, and proposes a new SquaredPO loss with better theoretical guarantees.

DetailsMotivation: Previous work generalized DPO to convex f-divergences, but the authors aim to further relax the convexity requirement and address the probability displacement problem observed in DPO where response probabilities approach zero.

Method: The authors identify two key conditions: 1) DPO-inducing conditions for tractability without requiring convexity, and 2) displacement-resistant conditions to prevent probability collapse. They focus on a specific f-divergence satisfying both conditions to derive the novel SquaredPO loss.

Result: The paper establishes more general theoretical conditions for RLHF tractability, identifies displacement-resistant f-divergences, and shows that the proposed SquaredPO loss performs competitively with DPO while offering stronger theoretical guarantees.

Conclusion: The work provides a broader theoretical foundation for RLHF alignment algorithms, addressing limitations of DPO through more general conditions and a new loss function that maintains practical performance while improving theoretical properties.

Abstract: DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by a family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.

[385] Supercharging Simulation-Based Inference for Bayesian Optimal Experimental Design

Samuel Klein, Willie Neiswanger, Daniel Ratner, Michael Kagan, Sean Gasiorowski

Main category: cs.LG

TL;DR: The paper presents a Bayesian optimal experimental design framework that leverages simulation-based inference methods, introducing novel EIG estimators using neural likelihood estimation and improved optimization techniques.

DetailsMotivation: Bayesian optimal experimental design (BOED) aims to maximize expected information gain from experiments, but requires likelihood estimation which is often intractable. Simulation-based inference (SBI) provides tools for this regime, but existing connections between SBI and BOED are limited to a single contrastive EIG bound.

Method: The authors show that EIG admits multiple formulations that can leverage modern SBI density estimators (neural posterior, likelihood, and ratio estimation). They define a novel EIG estimator using neural likelihood estimation and address optimization bottlenecks with a multi-start parallel gradient ascent procedure.
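
The optimization fix is generic and easy to state; a sketch of multi-start parallel gradient ascent, assuming `objective` is a differentiable, batch-evaluable EIG estimator over designs (here replaced by a toy multimodal function):

```python
import torch

def multistart_ascent(objective, dim, n_starts=32, steps=200, lr=0.05):
    """Maximize a possibly multimodal design objective from many random
    starts in parallel and keep the best."""
    designs = torch.randn(n_starts, dim, requires_grad=True)
    opt = torch.optim.Adam([designs], lr=lr)
    for _ in range(steps):
        loss = -objective(designs).sum()   # each start ascends independently
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        vals = objective(designs)
        return designs[vals.argmax()].clone(), vals.max()

# toy stand-in for an EIG estimator
toy = lambda d: torch.sin(3 * d).sum(-1) - 0.1 * (d ** 2).sum(-1)
best_design, best_val = multistart_ascent(toy, dim=2)
```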

Result: The SBI-based BOED methods match or outperform existing state-of-the-art approaches by up to 22% across standard BOED benchmarks.

Conclusion: The paper demonstrates that leveraging multiple SBI formulations and improved optimization techniques significantly enhances BOED performance, providing a more flexible and powerful framework for experimental design in likelihood-intractable settings.

Abstract: Bayesian optimal experimental design (BOED) seeks to maximize the expected information gain (EIG) of experiments. This requires a likelihood estimate, which in many settings is intractable. Simulation-based inference (SBI) provides powerful tools for this regime. However, existing work explicitly connecting SBI and BOED is restricted to a single contrastive EIG bound. We show that the EIG admits multiple formulations which can directly leverage modern SBI density estimators, encompassing neural posterior, likelihood, and ratio estimation. Building on this perspective, we define a novel EIG estimator using neural likelihood estimation. Further, we identify optimization as a key bottleneck of gradient-based EIG maximization and show that a simple multi-start parallel gradient ascent procedure can substantially improve reliability and performance. With these innovations, our SBI-based BOED methods are able to match or outperform existing state-of-the-art approaches by up to 22% across standard BOED benchmarks.

[386] Rare Event Analysis of Large Language Models

Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, Juan P. Garrahan

Main category: cs.LG

TL;DR: Framework for systematic analysis of rare events in large language models, including theory, generation strategies, probability estimation, and error analysis

DetailsMotivation: LLMs as probabilistic models exhibit rare but significant behaviors that may not be observed during development but become prominent at deployment scale, requiring systematic analysis methods

Method: End-to-end framework with practical implementation covering theoretical foundations, efficient rare event generation strategies, probability estimation techniques, and error analysis
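
The paper's own generation strategies are not specified in the summary; importance sampling is one standard rare-event technique of the kind such a framework spans, so here is a self-contained sketch of estimating a tiny probability with an error bar:

```python
import numpy as np

def rare_event_prob(sample_tilted, log_weight, is_event, n=100_000, seed=0):
    """Importance-sampling estimate of P(event) and its relative error:
    draw from a tilted proposal, reweight, and average."""
    rng = np.random.default_rng(seed)
    xs = sample_tilted(rng, n)
    w = np.exp(log_weight(xs)) * is_event(xs)
    p = w.mean()
    rel_err = w.std() / (np.sqrt(n) * p) if p > 0 else np.inf
    return p, rel_err

# toy: P(X > 4) ~ 3.2e-5 for X ~ N(0,1), using proposal N(4,1)
mu = 4.0
p, err = rare_event_prob(
    lambda rng, n: rng.normal(mu, 1.0, n),
    lambda x: -mu * x + 0.5 * mu ** 2,   # log [ N(x; 0, 1) / N(x; mu, 1) ]
    lambda x: x > mu,
)
```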

Result: Provides concrete examples illustrating the framework and outlines extensions to other models and contexts

Conclusion: Presents general concepts and techniques for rare event analysis applicable to LLMs and other probabilistic models

Abstract: Being probabilistic models, during inference large language models (LLMs) display rare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

[387] FlowDA: Accurate, Low-Latency Weather Data Assimilation via Flow Matching

Ran Cheng, Lailai Zhu

Main category: cs.LG

TL;DR: FlowDA: A flow matching-based generative data assimilation framework for weather prediction that conditions on observations through SetConv embeddings and fine-tunes the Aurora foundation model for efficient, accurate analysis.

DetailsMotivation: Traditional variational data assimilation methods are computationally expensive bottlenecks in ML-based weather forecasting pipelines. Existing generative ML-based DA methods suffer from slow sampling and error accumulation in long-horizon auto-regressive cycling assimilation.

Method: Proposes FlowDA, a weather-scale generative DA framework based on flow matching. Uses SetConv-based embedding to condition on observations and fine-tunes the Aurora foundation model for analysis generation. Designed for low-latency operation and robustness to observational noise.
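
"SetConv" refers to the generic set-convolution construction that maps a variable-size set of point observations onto a fixed grid; a sketch of that construction follows, with the caveat that FlowDA's exact layer may differ:

```python
import torch

def setconv_embed(obs_xy, obs_val, grid_xy, lengthscale=0.1):
    """Embed sparse point observations onto a grid as two channels per node:
    a kernel density and a kernel-weighted value (the standard set-conv form)."""
    d2 = ((grid_xy[:, None, :] - obs_xy[None, :, :]) ** 2).sum(-1)   # (G, M)
    w = torch.exp(-0.5 * d2 / lengthscale ** 2)
    density = w.sum(-1, keepdim=True)                                # (G, 1)
    value = (w @ obs_val[:, None]) / (density + 1e-6)                # (G, 1)
    return torch.cat([density, value], dim=-1)                       # (G, 2)

grid = torch.stack(torch.meshgrid(torch.linspace(0, 1, 16),
                                  torch.linspace(0, 1, 16),
                                  indexing="ij"), -1).reshape(-1, 2)
emb = setconv_embed(torch.rand(30, 2), torch.randn(30), grid)  # sparse obs
```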

Result: Superior performance over strong baselines with similar parameter sizes across observation rates from 3.9% to 0.1%. Shows robustness to observational noise and stable performance in long-horizon auto-regressive cycling DA.

Conclusion: FlowDA points to an efficient and scalable direction for data-driven data assimilation, offering accurate, efficient, and robust analyses for weather prediction systems.

Abstract: Data assimilation (DA) is a fundamental component of modern weather prediction, yet it remains a major computational bottleneck in machine learning (ML)-based forecasting pipelines due to reliance on traditional variational methods. Recent generative ML-based DA methods offer a promising alternative but typically require many sampling steps and suffer from error accumulation under long-horizon auto-regressive rollouts with cycling assimilation. We propose FlowDA, a low-latency weather-scale generative DA framework based on flow matching. FlowDA conditions on observations through a SetConv-based embedding and fine-tunes the Aurora foundation model to deliver accurate, efficient, and robust analyses. Experiments across observation rates decreasing from 3.9% to 0.1% demonstrate superior performance of FlowDA over strong baselines with similar tunable-parameter size. FlowDA further shows robustness to observational noise and stable performance in long-horizon auto-regressive cycling DA. Overall, FlowDA points to an efficient and scalable direction for data-driven DA.

[388] From Kepler to Newton: Inductive Biases Guide Learned World Models in Transformers

Ziming Liu, Sophia Sanborn, Surya Ganguli, Andreas Tolias

Main category: cs.LG

TL;DR: Transformers can learn world models of physical laws with three minimal inductive biases: spatial smoothness, stability via noisy training, and temporal locality, enabling them to discover Newtonian force representations rather than just curve-fitting.

DetailsMotivation: To bridge the gap between domain-specific AI Physicist approaches that bake in physics priors and generic Transformers that fail to learn underlying physical laws despite high predictive accuracy. The goal is to enable general-purpose AI architectures to discover physical laws governing the universe through appropriate inductive biases.

Method: Introduces three minimal inductive biases: 1) Spatial smoothness by formulating prediction as continuous regression, 2) Stability by training with noisy contexts to mitigate error accumulation, and 3) Temporal locality by restricting attention window to immediate past, forcing models to learn local state dependencies rather than complex history.
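
The temporal-locality bias amounts to a windowed causal attention mask; a minimal sketch (window size 1 encodes the "future depends only on the local state" assumption described above):

```python
import torch

def local_causal_mask(seq_len, window=1):
    """True where attention is allowed: each position may attend only to
    itself and the previous `window` steps."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j >= i - window)

mask = local_causal_mask(6, window=1)
scores = torch.randn(6, 6).masked_fill(~mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)    # rows only mix the immediate past
```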

Result: With spatial smoothness and stability, Transformers can learn coherent Keplerian world models and fit ellipses to planetary trajectories. Adding temporal locality forces models to abandon curve-fitting and discover Newtonian force representations, demonstrating that simple architectural choices determine whether AI becomes a curve-fitter or physicist.

Conclusion: Simple architectural choices and inductive biases can enable generic Transformers to learn world models and discover physical laws, marking a critical step toward automated scientific discovery and distinguishing between mere prediction and true physical understanding.

Abstract: Can general-purpose AI architectures go beyond prediction to discover the physical laws governing the universe? True intelligence relies on “world models” – causal abstractions that allow an agent to not only predict future states but understand the underlying governing dynamics. While previous “AI Physicist” approaches have successfully recovered such laws, they typically rely on strong, domain-specific priors that effectively “bake in” the physics. Conversely, Vafa et al. recently showed that generic Transformers fail to acquire these world models, achieving high predictive accuracy without capturing the underlying physical laws. We bridge this gap by systematically introducing three minimal inductive biases. We show that ensuring spatial smoothness (by formulating prediction as continuous regression) and stability (by training with noisy contexts to mitigate error accumulation) enables generic Transformers to surpass prior failures and learn a coherent Keplerian world model, successfully fitting ellipses to planetary trajectories. However, true physical insight requires a third bias: temporal locality. By restricting the attention window to the immediate past – imposing the simple assumption that future states depend only on the local state rather than a complex history – we force the model to abandon curve-fitting and discover Newtonian force representations. Our results demonstrate that simple architectural choices determine whether an AI becomes a curve-fitter or a physicist, marking a critical step toward automated scientific discovery.

[389] Calibrating Tabular Anomaly Detection via Optimal Transport

Hangting Ye, He Zhao, Wei Fan, Xiaozhuang Song, Dandan Guo, Yi Chang, Hongyuan Zha

Main category: cs.LG

TL;DR: CTAD is a model-agnostic post-processing framework that calibrates any tabular anomaly detector using optimal transport distance between empirical and structural distributions to improve detection consistency across diverse datasets.

DetailsMotivation: Tabular anomaly detection faces challenges due to data heterogeneity - features lack natural relationships, vary in distribution/scale, and have diverse types. Existing methods make implicit assumptions about anomaly patterns that work well on some datasets but fail on others, with no consistent top performer across scenarios.

Method: CTAD characterizes normal data via two complementary distributions: empirical distribution from random sampling and structural distribution from K-means centroids. It measures how adding a test sample disrupts their compatibility using Optimal Transport distance. Normal samples maintain low disruption while anomalies cause high disruption, providing calibration signals to amplify detection.
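
A rough sketch of the disruption score under assumed design choices (subsample size, uniform weights, squared-Euclidean cost), using the POT library for exact OT and scikit-learn for the structural centroids:

```python
import numpy as np
import ot                                      # POT: pip install pot
from sklearn.cluster import KMeans

def ot_disruption(X_train, x_test, n_samples=64, n_clusters=8, seed=0):
    """How much does appending a test sample change the OT distance
    between an empirical sample of normal data (empirical distribution)
    and its K-means centroids (structural distribution)?"""
    rng = np.random.default_rng(seed)
    emp = X_train[rng.choice(len(X_train), n_samples, replace=False)]
    cents = KMeans(n_clusters=n_clusters, n_init=10,
                   random_state=seed).fit(X_train).cluster_centers_

    def ot_cost(A, B):
        a = np.full(len(A), 1.0 / len(A))      # uniform weights
        b = np.full(len(B), 1.0 / len(B))
        return ot.emd2(a, b, ot.dist(A, B))    # exact OT cost

    base = ot_cost(emp, cents)
    perturbed = ot_cost(np.vstack([emp, x_test[None]]), cents)
    return perturbed - base                    # anomalies should score high
```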

Result: Extensive experiments on 34 diverse tabular datasets with 7 representative detectors spanning all major TAD categories show CTAD consistently improves performance with statistical significance. It enhances even state-of-the-art deep learning methods and shows robust performance across diverse hyperparameter settings.

Conclusion: CTAD provides a model-agnostic post-processing framework that can calibrate any existing tabular anomaly detector to improve performance consistently across diverse datasets, addressing the fundamental challenge of data heterogeneity in tabular anomaly detection.

Abstract: Tabular anomaly detection (TAD) remains challenging due to the heterogeneity of tabular data: features lack natural relationships, vary widely in distribution and scale, and exhibit diverse types. Consequently, each TAD method makes implicit assumptions about anomaly patterns that work well on some datasets but fail on others, and no method consistently outperforms across diverse scenarios. We present CTAD (Calibrating Tabular Anomaly Detection), a model-agnostic post-processing framework that enhances any existing TAD detector through sample-specific calibration. Our approach characterizes normal data via two complementary distributions, i.e., an empirical distribution from random sampling and a structural distribution from K-means centroids, and measures how adding a test sample disrupts their compatibility using Optimal Transport (OT) distance. Normal samples maintain low disruption while anomalies cause high disruption, providing a calibration signal to amplify detection. We prove that OT distance has a lower bound proportional to the test sample’s distance from centroids, and establish that anomalies systematically receive higher calibration scores than normals in expectation, explaining why the method generalizes across datasets. Extensive experiments on 34 diverse tabular datasets with 7 representative detectors spanning all major TAD categories (density estimation, classification, reconstruction, and isolation-based methods) demonstrate that CTAD consistently improves performance with statistical significance. Remarkably, CTAD enhances even state-of-the-art deep learning methods and shows robust performance across diverse hyperparameter settings, requiring no additional tuning for practical deployment.

[390] Vision Transformer Finetuning Benefits from Non-Smooth Components

Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

Main category: cs.LG

TL;DR: Vision transformer plasticity (sensitivity to input changes) rather than smoothness predicts better fine-tuning performance, with attention and feedforward layers showing highest plasticity.

DetailsMotivation: While transformer smoothness has been studied for generalization and robustness, its role in transfer learning remains poorly understood. The paper aims to analyze vision transformer components' plasticity (ability to adapt outputs to input changes) to provide principled guidance for adaptation.

Method: Defines plasticity as average rate of change capturing sensitivity to input perturbation (high plasticity = low smoothness). Uses theoretical analysis and comprehensive experiments to measure plasticity of different transformer components (attention modules, feedforward layers, etc.) and correlate with fine-tuning performance.
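
The definition admits a simple finite-difference estimate. A sketch, assuming a module mapping tensors to tensors; the perturbation scale and number of directions are illustrative:

```python
import torch

@torch.no_grad()
def plasticity(module, x, eps=1e-3, n_dirs=16):
    """Monte-Carlo estimate of the average rate of change of a module's
    output under small random input perturbations; a high value means
    high sensitivity, i.e. low smoothness."""
    y = module(x)
    rates = []
    for _ in range(n_dirs):
        d = torch.randn_like(x)
        d = eps * d / d.norm()
        rates.append((module(x + d) - y).norm() / eps)
    return torch.stack(rates).mean()

block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU())
print(plasticity(block, torch.randn(1, 64)))
```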

Result: High plasticity of attention modules and feedforward layers consistently leads to better fine-tuning performance. This challenges prevailing assumption that smoothness is desirable, offering novel perspective on transformer functional properties.

Conclusion: Plasticity provides principled guidance for choosing components to prioritize during adaptation. Attention and feedforward layers’ high plasticity makes them most important for fine-tuning, departing from smoothness-focused assumptions.

Abstract: The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies low smoothness. We demonstrate through theoretical analysis and comprehensive experiments that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on the functional properties of transformers. The code is available at https://github.com/ambroiseodt/vit-plasticity.

[391] Cochain Perspectives on Temporal-Difference Signals for Learning Beyond Markov Dynamics

Zuyuan Zhang, Sizhe Tang, Tian Lan

Main category: cs.LG

TL;DR: A topological framework for RL that interprets TD errors as cochains and Markov dynamics as topological integrability, enabling decomposition of TD errors and proposing HFPS algorithm for non-Markovian environments.

DetailsMotivation: Real-world environments often exhibit non-Markovian dynamics due to long-range dependencies, partial observability, and memory effects. The Bellman equation becomes only approximately valid in such settings, and existing RL approaches lack theoretical foundations for understanding what dynamics are capturable by the Bellman framework and how to design optimal approximations.

Method: Introduces a topological viewpoint where TD errors are interpreted as 1-cochains in the topological space of state transitions, and Markov dynamics as topological integrability. Uses a Bellman-de Rham projection to decompose TD errors into integrable and topological residual components. Proposes HodgeFlow Policy Search (HFPS) which fits a potential network to minimize the non-integrable projection residual.
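
The projection step has a least-squares flavor: fit a potential so that TD errors look like a discrete gradient, and treat what remains as the topological residual. A hypothetical sketch (state dimension and network are placeholders):

```python
import torch
import torch.nn as nn

phi = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)

def residual_loss(s, s_next, td_err):
    """Squared non-integrable residual: the part of the TD error not
    explained by the discrete gradient phi(s') - phi(s)."""
    integrable = (phi(s_next) - phi(s)).squeeze(-1)
    return ((td_err - integrable) ** 2).mean()

s, s_next, td = torch.randn(32, 4), torch.randn(32, 4), torch.randn(32)
opt.zero_grad()
residual_loss(s, s_next, td).backward()
opt.step()
```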

Result: HFPS achieves stability/sensitivity guarantees and significantly improves RL performance under non-Markovian environments in numerical evaluations.

Conclusion: The topological framework provides theoretical foundations for understanding RL in non-Markovian settings, and HFPS offers a principled approach to handle such environments with performance improvements.

Abstract: Non-Markovian dynamics are commonly found in real-world environments due to long-range dependencies, partial observability, and memory effects. The Bellman equation, the central pillar of reinforcement learning (RL), becomes only approximately valid under non-Markovian dynamics. Existing work often focuses on practical algorithm designs and offers limited theoretical treatment of key questions, such as what dynamics are indeed capturable by the Bellman framework and how to inspire new algorithm classes with optimal approximations. In this paper, we present a novel topological viewpoint on temporal-difference (TD) based RL. We show that TD errors can be viewed as 1-cochains in the topological space of state transitions, while Markov dynamics are then interpreted as topological integrability. This novel view enables us to obtain a Hodge-type decomposition of TD errors into an integrable component and a topological residual, through a Bellman-de Rham projection. We further propose HodgeFlow Policy Search (HFPS) by fitting a potential network to minimize the non-integrable projection residual in RL, achieving stability/sensitivity guarantees. In numerical evaluations, HFPS is shown to significantly improve RL performance under non-Markovian dynamics.

[392] Learning Deep Hybrid Models with Sharpness-Aware Minimization

Naoya Takeishi

Main category: cs.LG

TL;DR: Proposes using sharpness-aware minimization (SAM) to improve hybrid modeling by focusing on flat loss minima, preventing ML models from ignoring scientific components.

DetailsMotivation: Hybrid models combine ML with scientific models for interpretable predictions, but ML components can dominate and ignore scientific models, making hybrid modeling pointless. Existing regularization approaches are architecture/domain-specific.

Method: Adapts sharpness-aware minimization (SAM) to hybrid modeling to find flat minima in loss landscape, making models simpler and more robust. SAM searches for parameters with low loss and flat neighborhoods.
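
SAM itself is generic; the paper's contribution is adapting where and how it is applied in hybrid models. For reference, the standard two-step SAM update (rho is SAM's usual neighborhood radius):

```python
import torch

def sam_step(model, loss_fn, batch, opt, rho=0.05):
    """Two-step SAM: ascend to the local worst case within an L2 ball of
    radius rho, then descend using the gradient computed there."""
    opt.zero_grad()
    loss_fn(model, batch).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)                          # w <- w + eps
    opt.zero_grad()
    loss_fn(model, batch).backward()           # sharpness-aware gradient
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)                          # restore w
    opt.step()
```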

Result: Numerical experiments show SAM-based method works well across different model choices and datasets, effectively preventing ML from ignoring scientific components.

Conclusion: SAM provides a general, architecture-agnostic approach to improve hybrid modeling by promoting flat minima, ensuring scientific models contribute meaningfully to predictions.

Abstract: Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, the scientific model may effectively be ignored in prediction due to the flexibility of the machine learning model, making the idea of hybrid modeling pointless. Typically, some regularization is applied to hybrid model learning to avoid such a failure case, but the formulation of the regularizer strongly depends on model architectures and domain knowledge. In this paper, we propose to focus on the flatness of loss minima in learning hybrid models, aiming to make the model as simple as possible. We employ the idea of sharpness-aware minimization (SAM) and adapt it to the hybrid modeling setting. Numerical experiments show that the SAM-based method works well across different choices of models and datasets.

[393] Improved Sampling Schedules for Discrete Diffusion Models

Alberto Foresti, Mustapha Bounoua, Giulio Franzese, Luca Ambrogioni, Pietro Michiardi

Main category: cs.LG

TL;DR: Discrete diffusion models’ reverse processes analyzed through thermodynamic entropy production, leading to novel sampling schedules (EDS and WDS) that outperform state-of-the-art methods across multiple domains.

DetailsMotivation: Discrete diffusion models lack the theoretical understanding of their reverse processes compared to continuous diffusion models. The authors aim to bridge this gap by analyzing reverse process dynamics through thermodynamic entropy production to better understand information generation.

Method: Propose entropy production rate as a proxy for quantifying information generation, derive Wasserstein distance bounds, and introduce two novel sampling schedules: Entropic Discrete Schedule (EDS) with constant information gain rate, and Wasserstein Discrete Schedule (WDS) with equal Wasserstein distance steps.
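
Both schedules follow one recipe: take a monotone cumulative metric over the reverse process (entropy produced for EDS, Wasserstein distance travelled for WDS) and place steps at equal increments of it. A sketch of that inversion; the toy metric is illustrative, and computing the real entropy or Wasserstein curves is the paper's contribution:

```python
import numpy as np

def uniform_schedule(metric, n_steps, t_grid=None):
    """Return n_steps + 1 times in [0, 1] whose increments are equal
    under the cumulative metric m(t) (assumed monotone)."""
    t = np.linspace(0.0, 1.0, 1001) if t_grid is None else t_grid
    m = metric(t)
    m = (m - m[0]) / (m[-1] - m[0])            # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, n_steps + 1)
    return np.interp(targets, m, t)            # invert m(t)

# a convex metric concentrates steps where the metric changes fastest
steps = uniform_schedule(lambda t: t ** 3, n_steps=8)
```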

Result: Empirical demonstration shows proposed schedules significantly outperform state-of-the-art strategies across diverse domains including synthetic data, music notation, vision and language modeling, achieving superior performance with lower computational budget.

Conclusion: Thermodynamic entropy production provides a principled framework for understanding discrete diffusion reverse processes, leading to practically effective sampling schedules that improve efficiency and performance across multiple application domains.

Abstract: Discrete diffusion models have emerged as a powerful paradigm for generative modeling on sequence data; however, the information-theoretic principles governing their reverse processes remain significantly less understood than those of their continuous counterparts. In this work, we bridge this gap by analyzing the reverse process dynamics through the lens of thermodynamic entropy production. We propose the entropy production rate as a rigorous proxy for quantifying information generation, deriving as a byproduct a bound on the Wasserstein distance between intermediate states and the data distribution. Leveraging these insights, we introduce two novel sampling schedules that are uniformly spaced with respect to their corresponding physics-inspired metrics: the Entropic Discrete Schedule (EDS), which is defined by maintaining a constant rate of information gain, and the Wasserstein Discrete Schedule (WDS), which is defined by taking equal steps in terms of the Wasserstein distance. We empirically demonstrate that our proposed schedules significantly outperform state-of-the-art strategies across diverse application domains, including synthetic data, music notation, vision and language modeling, consistently achieving superior performance at a lower computational budget.

[394] Designing a Robust, Bounded, and Smooth Loss Function for Improved Supervised Learning

Soumi Mahato, Lineesh M. C

Main category: cs.LG

TL;DR: Proposes RoBoS-NN, a robust bounded smooth loss function for neural networks to handle outliers and improve time series forecasting performance.

DetailsMotivation: Traditional loss functions struggle with high-dimensional and outlier-sensitive datasets, leading to reduced performance and slower convergence in supervised learning.

Method: Develops RoBoS (robust, bounded, smooth) loss function with theoretical analysis of generalization ability, implements it in neural networks for time series forecasting as L_RoBoS-NN algorithm.
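
The exact RoBoS form is given in the paper. Purely to illustrate the three target properties, here is a classic Welsch/correntropy-style loss that is robust, bounded, and smooth (this is not the paper's definition):

```python
import numpy as np

def bounded_smooth_loss(err, sigma=1.0):
    """Smooth everywhere, bounded by 1, approximately quadratic near
    zero; large errors saturate, so outliers cannot dominate training."""
    return 1.0 - np.exp(-(err ** 2) / (2.0 * sigma ** 2))

err = np.array([0.1, 1.0, 10.0, 100.0])
print(bounded_smooth_loss(err))                # large errors all map near 1.0
```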

Result: L_RoBoS-NN outperforms benchmark models on multiple real-world datasets, including scenarios with infused outliers, in terms of accuracy measures.

Conclusion: RoBoS-NN loss function effectively addresses limitations of traditional loss functions, providing robust performance for time series forecasting with outliers.

Abstract: The loss function is crucial to machine learning, especially in supervised learning frameworks. It is a fundamental component that controls the behavior and general efficacy of learning algorithms. However, despite their widespread use, traditional loss functions have significant drawbacks when dealing with high-dimensional and outlier-sensitive datasets, which frequently results in reduced performance and slower convergence during training. In this work, we develop a robust, bounded, and smooth (RoBoS-NN) loss function to resolve the aforementioned hindrances. The generalization ability of the loss function has also been theoretically analyzed to rigorously justify its robustness. Moreover, we implement the RoBoS-NN loss in the framework of a neural network (NN) to forecast time series and present a new robust algorithm named $\mathcal{L}_{\text{RoBoS}}$-NN. To assess the potential of $\mathcal{L}_{\text{RoBoS}}$-NN, we conduct experiments on multiple real-world datasets. In addition, we infuse outliers into data sets to evaluate the performance of $\mathcal{L}_{\text{RoBoS}}$-NN in more challenging scenarios. Numerical results show that $\mathcal{L}_{\text{RoBoS}}$-NN outperforms the other benchmark models in terms of accuracy measures.

[395] T-STAR: A Context-Aware Transformer Framework for Short-Term Probabilistic Demand Forecasting in Dock-Based Shared Micro-Mobility

Jingyi Cheng, Gonçalo Homem de Almeida Correia, Oded Cats, Shadi Sharif Azadeh

Main category: cs.LG

TL;DR: T-STAR is a two-stage transformer framework for 15-minute bike-sharing demand forecasting that disentangles hourly patterns from short-term fluctuations, incorporating metro service data for improved accuracy.

DetailsMotivation: Reliable short-term demand forecasting is crucial for managing shared micro-mobility services, but existing methods struggle with high-resolution forecasting challenges including disentangling consistent patterns from short-term fluctuations and incorporating multimodal contextual information.

Method: Two-stage transformer-based probabilistic framework: first stage captures coarse-grained hourly demand patterns, second stage incorporates high-frequency localized inputs (recent fluctuations, real-time metro demand) using time series transformers in both stages for probabilistic predictions.
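
A hypothetical sketch of the hierarchy, with a GRU standing in for the paper's time-series Transformers and scalar context features standing in for the real fluctuation and metro inputs:

```python
import torch
import torch.nn as nn

class TwoStageForecaster(nn.Module):
    """Stage one summarizes coarse hourly demand; stage two refines it to
    a probabilistic 15-minute forecast using recent fluctuations and a
    real-time metro-demand signal."""
    def __init__(self, d=32):
        super().__init__()
        self.hourly = nn.GRU(1, d, batch_first=True)
        self.refine = nn.Linear(d + 2, 2)      # -> mean, log-variance
    def forward(self, hourly_hist, recent_fluct, metro_now):
        _, h = self.hourly(hourly_hist)        # coarse temporal context
        z = torch.cat([h[-1], recent_fluct, metro_now], dim=-1)
        mu, logvar = self.refine(z).unbind(-1)
        return mu, logvar.exp()

model = TwoStageForecaster()
mu, var = model(torch.randn(4, 24, 1), torch.randn(4, 1), torch.randn(4, 1))
```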

Result: Outperforms existing methods in deterministic and probabilistic accuracy using Washington D.C. Capital Bikeshare data, shows strong spatial/temporal robustness, and demonstrates zero-shot transfer capability to unseen service areas.

Conclusion: T-STAR enables granular, reliable uncertainty-aware short-term forecasts that support multimodal trip planning and enhance real-time operations in shared micro-mobility services.

Abstract: Reliable short-term demand forecasting is essential for managing shared micro-mobility services and ensuring responsive, user-centered operations. This study introduces T-STAR (Two-stage Spatial and Temporal Adaptive contextual Representation), a novel transformer-based probabilistic framework designed to forecast station-level bike-sharing demand at a 15-minute resolution. T-STAR addresses key challenges in high-resolution forecasting by disentangling consistent demand patterns from short-term fluctuations through a hierarchical two-stage structure. The first stage captures coarse-grained hourly demand patterns, while the second stage improves prediction accuracy by incorporating high-frequency, localized inputs, including recent fluctuations and real-time demand variations in connected metro services, to account for temporal shifts in short-term demand. Time series transformer models are employed in both stages to generate probabilistic predictions. Extensive experiments using Washington D.C.’s Capital Bikeshare data demonstrate that T-STAR outperforms existing methods in both deterministic and probabilistic accuracy. The model exhibits strong spatial and temporal robustness across stations and time periods. A zero-shot forecasting experiment further highlights T-STAR’s ability to transfer to previously unseen service areas without retraining. These results underscore the framework’s potential to deliver granular, reliable, and uncertainty-aware short-term demand forecasts, which enable seamless integration to support multimodal trip planning for travelers and enhance real-time operations in shared micro-mobility services.

[396] Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization

Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

Main category: cs.LG

TL;DR: DeVA is a novel optimization framework that bridges vector-based variance adaptation (like Adam) with matrix spectral optimization, enabling seamless transition from Adam to adaptive spectral descent for improved training efficiency.

DetailsMotivation: Current optimization methods face a gap between vector-based adaptive methods (like Adam) and matrix-based spectral optimizers (like Muon). While Adam works well for vector/Euclidean optimization with coordinate-wise adaptation, spectral optimizers treat weight matrices as matrices rather than vectors. Bridging these approaches is challenging because natural generalizations are infeasible to implement, and Adam's adaptation cannot be directly applied to matrix spectrum.

Method: The authors reformulate the AdaGrad update and decompose it into two components: a variance adaptation term and a scale-invariant term. This decoupling produces DeVA (Decoupled Variance Adaptation), a framework that enables seamless transition from Adam to adaptive spectral descent by separating variance adaptation from the optimization process.

Result: Extensive experiments in language modeling and image classification show DeVA consistently outperforms state-of-the-art methods like Muon and SOAP, reducing token usage by approximately 6.6%. Theoretical analysis demonstrates that the variance adaptation term effectively improves blockwise smoothness, facilitating faster convergence.

Conclusion: DeVA successfully bridges the gap between vector-based variance adaptation and matrix spectral optimization, providing a unified framework that combines the strengths of both approaches for more efficient training of large models.

Abstract: Adaptive methods like Adam have become the $\textit{de facto}$ standard for large-scale vector and Euclidean optimization due to their coordinate-wise adaptation with a second-order nature. More recently, matrix-based spectral optimizers like Muon (Jordan et al., 2024b) show the power of treating weight matrices as matrices rather than long vectors. Linking these is hard because many natural generalizations are not feasible to implement, and we also cannot simply move the Adam adaptation to the matrix spectrum. To address this, we reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This decoupling produces $\textbf{DeVA}$ ($\textbf{De}$coupled $\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization, enabling a seamless transition from Adam to adaptive spectral descent. Extensive experiments across language modeling and image classification demonstrate that DeVA consistently outperforms state-of-the-art methods such as Muon and SOAP (Vyas et al., 2024), reducing token usage by around 6.6%. Theoretically, we show that the variance adaptation term effectively improves the blockwise smoothness, facilitating faster convergence. Our implementation is available at https://github.com/Tsedao/Decoupled-Variance-Adaptation

[397] A Cycle-Consistent Graph Surrogate for Full-Cycle Left Ventricular Myocardial Biomechanics

Siyu Mu, Wei Xuan Chan, Choon Hwai Yap

Main category: cs.LG

TL;DR: CGFENet is a graph-based neural network surrogate for rapid full-cycle estimation of left ventricular myocardial biomechanics, using global-local graph encoding and cycle-consistent bidirectional formulation to reduce computational cost compared to traditional finite-element analysis.

DetailsMotivation: Traditional finite-element analysis for cardiac biomechanics simulation is computationally intensive, while existing graph-based surrogates lack full-cycle prediction capabilities and physics-informed neural networks struggle with complex cardiac geometries.

Method: Proposes CardioGraphFENet with: (1) global-local graph encoder for mesh features with weak-form-inspired global coupling, (2) GRU-based temporal encoder conditioned on volume-time signals for cycle-coherent dynamics, (3) cycle-consistent bidirectional formulation for loading/unloading in single framework.

Result: Achieves high fidelity to FEA ground truths, produces physiologically plausible pressure-volume loops when coupled with lumped-parameter model, and enables significant reduction in FEA supervision with minimal accuracy loss through cycle-consistency strategy.

Conclusion: CGFENet provides a unified graph-based surrogate for rapid full-cycle LV biomechanics estimation that balances computational efficiency with accuracy, enabling practical clinical applications.

Abstract: Image-based patient-specific simulation of left ventricular (LV) mechanics is valuable for understanding cardiac function and supporting clinical intervention planning, but conventional finite-element analysis (FEA) is computationally intensive. Current graph-based surrogates do not have full-cycle prediction capabilities, and physics-informed neural networks often struggle to converge on complex cardiac geometries. We present CardioGraphFENet (CGFENet), a unified graph-based surrogate for rapid full-cycle estimation of LV myocardial biomechanics, supervised by a large FEA simulation dataset. The proposed model integrates (i) a global–local graph encoder to capture mesh features with weak-form-inspired global coupling, (ii) a gated recurrent unit-based temporal encoder conditioned on the target volume-time signal to model cycle-coherent dynamics, and (iii) a cycle-consistent bidirectional formulation for both loading and inverse unloading within a single framework. These strategies enable high fidelity with respect to traditional FEA ground truths and produce physiologically plausible pressure-volume loops that match FEA results when coupled with a lumped-parameter model. In particular, the cycle-consistency strategy enables a significant reduction in FEA supervision with only minimal loss in accuracy.

[398] Sample Complexity of Causal Identification with Temporal Heterogeneity

Ameya Rathod, Sujay Belsare, Salvik Krishna Nautiyal, Dhruv Laad, Ponnurangam Kumaraguru

Main category: cs.LG

TL;DR: This paper develops unified identifiability conditions for causal graph recovery by integrating time-series dynamics and multi-environment heterogeneity, analyzing statistical limits under different noise distributions including heavy-tailed cases.

DetailsMotivation: Causal graph recovery from observational data is ill-posed due to multiple generating mechanisms producing the same distribution. The paper aims to address this by combining temporal structure and environmental heterogeneity as complementary sources of information.

Method: The authors integrate time-series dynamics and multi-environment heterogeneity to derive unified necessary identifiability conditions. They analyze statistical recovery limits under thin vs heavy-tailed noise (Student’s t distributions), examining how temporal structure can compensate for missing environmental diversity.

Result: Temporal structure can effectively substitute for missing environmental diversity, potentially achieving identifiability even under insufficient heterogeneity. While geometric identifiability conditions remain invariant for heavy-tailed distributions, sample complexity diverges significantly from Gaussian baselines. Information-theoretic bounds quantify the cost of robustness.

Conclusion: The work shifts focus from whether causal structure is identifiable to whether it is statistically recoverable in practice, establishing fundamental limits of covariance-based causal graph recovery methods in realistic non-stationary systems.

Abstract: Recovering a unique causal graph from observational data is an ill-posed problem because multiple generating mechanisms can lead to the same observational distribution. This problem becomes solvable only by exploiting specific structural or distributional assumptions. While recent work has separately utilized time-series dynamics or multi-environment heterogeneity to constrain this problem, we integrate both as complementary sources of heterogeneity. This integration yields unified necessary identifiability conditions and enables a rigorous analysis of the statistical limits of recovery under thin versus heavy-tailed noise. In particular, temporal structure is shown to effectively substitute for missing environmental diversity, possibly achieving identifiability even under insufficient heterogeneity. Extending this analysis to heavy-tailed (Student’s t) distributions, we demonstrate that while geometric identifiability conditions remain invariant, the sample complexity diverges significantly from the Gaussian baseline. Explicit information-theoretic bounds quantify this cost of robustness, establishing the fundamental limits of covariance-based causal graph recovery methods in realistic non-stationary systems. This work shifts the focus from whether causal structure is identifiable to whether it is statistically recoverable in practice.

[399] Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory

Emmanuel Esposito, Andrew Jacobsen, Hao Qiu, Mengxiao Zhang

Main category: cs.LG

TL;DR: Novel algorithm for online convex optimization with time-varying movement costs achieves comparator-adaptive dynamic regret bounds, with applications to delayed feedback and time-varying memory settings.

DetailsMotivation: The paper aims to generalize online convex optimization (OCO) by allowing movement cost coefficients to vary arbitrarily over time, addressing limitations of standard OCO where movement costs are typically fixed or zero.

Method: Develops a novel algorithm that establishes comparator-adaptive dynamic regret bounds for OCO with time-varying movement costs, using a reduction approach to translate delayed feedback and time-varying memory problems into movement cost frameworks.

Result: Achieves $\widetilde{\mathcal{O}}(\sqrt{(1+P_T)(T+\sum_t \lambda_t)})$ regret bound, which recovers optimal guarantees for both static and dynamic regret in standard OCO as special cases, and demonstrates applications to delayed feedback and time-varying memory settings.

Conclusion: The paper provides a unified framework for handling time-varying movement costs in OCO, with the first-order dependence on movement costs enabling optimal comparator-adaptive dynamic regret guarantees in various applications.

Abstract: In this paper, we study dynamic regret in unconstrained online convex optimization (OCO) with movement costs. Specifically, we generalize the standard setting by allowing the movement cost coefficients $\lambda_t$ to vary arbitrarily over time. Our main contribution is a novel algorithm that establishes the first comparator-adaptive dynamic regret bound for this setting, guaranteeing $\widetilde{\mathcal{O}}(\sqrt{(1+P_T)(T+\sum_t \lambda_t)})$ regret, where $P_T$ is the path length of the comparator sequence over $T$ rounds. This recovers the optimal guarantees for both static and dynamic regret in standard OCO as a special case where $\lambda_t=0$ for all rounds. To demonstrate the versatility of our results, we consider two applications: OCO with delayed feedback and OCO with time-varying memory. We show that both problems can be translated into time-varying movement costs, establishing a novel reduction specifically for the delayed feedback setting that is of independent interest. A crucial observation is that the first-order dependence on movement costs in our regret bound plays a key role in enabling optimal comparator-adaptive dynamic regret guarantees in both settings.

[400] A first realization of reinforcement learning-based closed-loop EEG-TMS

Dania Humaidan, Jiahua Xu, Jing Chen, Christoph Zrenner, David Emanuel Vetter, Laura Marzetti, Paolo Belardinelli, Timo Roine, Risto J. Ilmoniemi, Gian Luca Romani, Ulf Ziemann

Main category: cs.LG

TL;DR: Machine learning-based closed-loop EEG-TMS system identifies individual mu-rhythm phases associated with high/low corticospinal excitability states for personalized brain stimulation.

DetailsMotivation: Traditional TMS therapy uses a one-size-fits-all approach, ignoring individual differences. Current EEG-TMS methods require manual definition of target phases. The authors aim to develop an automated, user-independent system to identify optimal stimulation phases.

Method: Used EEG-TMS on 25 participants targeting sensorimotor network, employed reinforcement learning algorithm to identify mu-rhythm phases associated with high/low corticospinal excitability, analyzed effects using linear mixed effects models and Bayesian analysis.
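
The summary does not spell out the RL algorithm; one plausible minimal formulation treats discretized phase bins as bandit arms with MEP amplitude as reward. Everything below, including `get_mep`, is hypothetical:

```python
import numpy as np

def ucb_phase_search(get_mep, n_bins=16, n_trials=400, c=2.0):
    """UCB bandit over mu-rhythm phase bins; get_mep(phase) returns the
    (noisy) motor evoked potential amplitude for a pulse at that phase."""
    counts, means = np.zeros(n_bins), np.zeros(n_bins)
    phases = np.linspace(0, 2 * np.pi, n_bins, endpoint=False)
    for t in range(1, n_trials + 1):
        ucb = means + c * np.sqrt(np.log(t) / np.maximum(counts, 1))
        a = int(np.argmax(np.where(counts == 0, np.inf, ucb)))
        r = get_mep(phases[a])
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running mean update
    return phases[int(np.argmax(means))]        # high-excitability phase
```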

Result: Reinforcement learning successfully identified mu-rhythm phases associated with high vs. low excitability states. Stimulation of these phases resulted in long-term increases vs. decreases in functional connectivity in the sensorimotor network.

Conclusion: First demonstration of closed-loop EEG-TMS feasibility in humans, representing a critical step toward individualized treatment of brain disorders through personalized brain stimulation.

Abstract: Background: Transcranial magnetic stimulation (TMS) is a powerful tool to investigate neurophysiology of the human brain and treat brain disorders. Traditionally, therapeutic TMS has been applied in a one-size-fits-all approach, disregarding inter- and intra-individual differences. Brain state-dependent EEG-TMS, such as coupling TMS with a pre-specified phase of the sensorimotor mu-rhythm, enables the induction of differential neuroplastic effects depending on the targeted phase. But this approach is still user-dependent as it requires defining an a-priori target phase. Objectives: To present a first realization of a machine-learning-based, closed-loop real-time EEG-TMS setup to identify, in a user-independent manner, the individual mu-rhythm phase associated with high- vs. low-corticospinal excitability states. Methods: We applied EEG-TMS to 25 participants targeting the supplementary motor area-primary motor cortex network and used a reinforcement learning algorithm to identify the mu-rhythm phase associated with high- vs. low corticospinal excitability. We employed linear mixed effects models and Bayesian analysis to determine effects of reinforcement learning on corticospinal excitability indexed by motor evoked potential amplitude, and functional connectivity indexed by the imaginary part of resting-state EEG coherence. Results: Reinforcement learning effectively identified the mu-rhythm phase associated with high- vs. low-excitability states, and their repetitive stimulation resulted in long-term increases vs. decreases in functional connectivity in the stimulated sensorimotor network. Conclusions: We demonstrated for the first time the feasibility of closed-loop EEG-TMS in humans, a critical step towards individualized treatment of brain disorders.

[401] Hyperbolic Fine-Tuning for Large Language Models

Menglin Yang, Ram Samarth B B, Aosong Feng, Bo Xiong, Jihong Liu, Irwin King, Rex Ying

Main category: cs.LG

TL;DR: HypLoRA: A hyperbolic space fine-tuning method for LLMs that leverages discovered hierarchical structures in token embeddings to improve performance on reasoning tasks.

DetailsMotivation: The paper investigates whether Euclidean space is optimal for LLMs, discovering that token embeddings exhibit hyperbolic characteristics with hierarchical structures, suggesting hyperbolic space might be more suitable for modeling language.

Method: HypLoRA performs low-rank adaptation directly in hyperbolic space, preserving hyperbolic modeling capabilities during fine-tuning, exploiting the discovered latent tree-like structures in token embeddings.
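
One plausible realization of low-rank adaptation "directly in hyperbolic space", sketched on the Poincaré ball: compute the low-rank update as usual, but combine it with the base output by Möbius addition rather than Euclidean addition. The paper's exact formulation may differ:

```python
import torch

def expmap0(v, c=1.0):                         # tangent space -> ball
    n = v.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.tanh(c ** 0.5 * n) * v / (c ** 0.5 * n)

def logmap0(y, c=1.0):                         # ball -> tangent space
    n = y.norm(dim=-1, keepdim=True).clamp_min(1e-7)
    return torch.atanh((c ** 0.5 * n).clamp(max=1 - 1e-5)) * y / (c ** 0.5 * n)

def mobius_add(x, y, c=1.0):                   # hyperbolic "addition"
    xy = (x * y).sum(-1, keepdim=True)
    x2, y2 = x.pow(2).sum(-1, keepdim=True), y.pow(2).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2).clamp_min(1e-7)

class HyperbolicLoRA(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank=8, c=1.0):
        super().__init__()
        self.base, self.c = base, c
        self.A = torch.nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, base.out_features))
    def forward(self, x):
        delta = x @ self.A @ self.B            # ordinary low-rank update
        out = mobius_add(expmap0(self.base(x), self.c),
                         expmap0(delta, self.c), self.c)
        return logmap0(out, self.c)            # back to Euclidean coordinates
```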

Result: Extensive experiments across various base models and reasoning benchmarks (arithmetic and commonsense reasoning) demonstrate that HypLoRA substantially improves LLM performance compared to standard approaches.

Conclusion: Hyperbolic space is more suitable for LLMs than Euclidean space, and fine-tuning in hyperbolic space via HypLoRA effectively leverages hierarchical structures in language to enhance model performance on reasoning tasks.

Abstract: Large language models (LLMs) have demonstrated remarkable performance across various tasks. However, it remains an open question whether the default Euclidean space is the most suitable choice for LLMs. In this study, we investigate the geometric characteristics of LLMs, focusing specifically on tokens and their embeddings. Our findings reveal that token frequency follows a power-law distribution, where high-frequency tokens (e.g., the, that) constitute the minority, while low-frequency tokens (e.g., apple, dog) constitute the majority. Furthermore, high-frequency tokens cluster near the origin, whereas low-frequency tokens are positioned farther away in the embedding space. Additionally, token embeddings exhibit hyperbolic characteristics, indicating a latent tree-like structure within the embedding space. Motivated by these observations, we propose HypLoRA, an efficient fine-tuning approach that operates in hyperbolic space to exploit these underlying hierarchical structures better. HypLoRA performs low-rank adaptation directly in hyperbolic space, thereby preserving hyperbolic modeling capabilities throughout the fine-tuning process. Extensive experiments across various base models and reasoning benchmarks, specifically arithmetic and commonsense reasoning tasks, demonstrate that HypLoRA substantially improves LLM performance.

[402] Revisiting the Generic Transformer: Deconstructing a Strong Baseline for Time Series Foundation Models

Yunshi Wen, Wesley M. Gifford, Chandra Reddy, Lam M. Nguyen, Jayant Kalagnanam, Anak Agung Julius

Main category: cs.LG

TL;DR: A patch Transformer achieves state-of-the-art zero-shot time series forecasting with simple training, showing architectural scalability matters more than complex data engineering.

DetailsMotivation: To disentangle whether improvements in Time Series Foundation Models come from architectural innovations versus data engineering, given heterogeneous training setups across studies.

Method: Use a standard patch Transformer architecture with straightforward training protocol, conduct comprehensive ablation studies on model scaling, data composition, and training techniques.
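
The patching ingredient itself is standard and worth seeing concretely: the series is cut into fixed-length segments that become Transformer tokens. Patch length and stride here are illustrative:

```python
import torch

def patchify(series: torch.Tensor, patch_len=32, stride=32):
    """Split a series of shape [batch, length] into patches of shape
    [batch, n_patches, patch_len]; each patch is then linearly embedded
    and treated as one token."""
    return series.unfold(dimension=-1, size=patch_len, step=stride)

tokens = patchify(torch.randn(4, 512))         # -> [4, 16, 32]
```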

Result: Achieves state-of-the-art zero-shot forecasting performance, identifies key performance drivers, confirms excellent scalability of generic architecture.

Conclusion: The generic patch Transformer architecture itself demonstrates excellent scalability and performance, providing a transparent baseline for future time series research.

Abstract: The recent surge in Time Series Foundation Models has rapidly advanced the field, yet the heterogeneous training setups across studies make it difficult to attribute improvements to architectural innovations versus data engineering. In this work, we investigate the potential of a standard patch Transformer, demonstrating that this generic architecture achieves state-of-the-art zero-shot forecasting performance using a straightforward training protocol. We conduct a comprehensive ablation study that covers model scaling, data composition, and training techniques to isolate the essential ingredients for high performance. Our findings identify the key drivers of performance, while confirming that the generic architecture itself demonstrates excellent scalability. By strictly controlling these variables, we provide comprehensive empirical results on model scaling across multiple dimensions. We release our open-source model and detailed findings to establish a transparent, reproducible baseline for future research.

[403] FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim

Main category: cs.LG

TL;DR: FastKV is a KV cache compression framework that reduces latency in both prefill and decoding stages by leveraging layer-dependent token importance and decoupling KV budget from prefill compute reduction.

DetailsMotivation: LLMs require substantial prefill computation and KV cache, burdening computational efficiency and memory usage. Existing KV compression methods tie prefill compute reduction to decoding KV budget, causing accuracy degradation due to overlooking layer-dependent variation of critical context.

Method: FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From propagated tokens, it independently selects salient KV entries for caching, decoupling KV budget from prefill compute reduction based on TSP decisions.
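
A rough sketch of the token-selection step, assuming token importance is scored by attention mass at the TSP layer; the scoring rule and keep ratio are assumptions, not the paper's exact criterion:

```python
import torch

def tsp_select(hidden, attn_weights, keep_ratio=0.25):
    """Rank context tokens by attention received (averaged over heads and
    queries) and forward only the top fraction to later layers.
    hidden: [keys, dim]; attn_weights: [heads, queries, keys]."""
    score = attn_weights.mean(dim=(0, 1))      # importance per key token
    k = max(1, int(keep_ratio * hidden.size(0)))
    idx = score.topk(k).indices.sort().values  # keep original token order
    return hidden[idx], idx

hidden = torch.randn(128, 64)
attn = torch.rand(8, 128, 128).softmax(dim=-1)
kept, idx = tsp_select(hidden, attn, keep_ratio=0.25)
```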

Result: FastKV achieves speedups of up to 1.82× in prefill and 2.87× in decoding compared to full-context baseline, while matching accuracy of baselines that only accelerate decoding stage.

Conclusion: FastKV provides an effective KV cache compression framework that reduces latency in both prefill and decoding stages while maintaining accuracy, with independent control over TSP rate and KV retention rate for flexible efficiency-accuracy optimization.

Abstract: While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.

[404] Robustness Beyond Known Groups with Low-rank Adaptation

Abinitha Gourabathina, Hyewon Jeong, Teya Bergamaschi, Marzyeh Ghassemi, Collin Stultz

Main category: cs.LG

TL;DR: LEIA improves group robustness by identifying low-dimensional error subspaces and adapting classifiers without group labels or backbone modifications.

DetailsMotivation: Deep learning models often fail systematically on specific subpopulations, but existing group-robust methods require prior knowledge of relevant subgroups through group annotations. There's a need for methods that improve performance on sensitive subgroups without requiring pre-specified group information.

Method: Two-stage method: 1) Identify a low-dimensional subspace in representation space where model errors concentrate, 2) Restrict adaptation to this error-informed subspace via low-rank adjustment to classifier logits. This directly targets latent failure modes without modifying the backbone or requiring group labels.
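
A hypothetical sketch of the two stages: estimate an error-concentrated subspace by weighting centered features with per-sample error before an SVD, then adjust logits only within that subspace. The weighting scheme and the fitting of the adjustment W are assumptions:

```python
import numpy as np

def error_subspace(feats, labels, probs, rank=4):
    """SVD of error-weighted, centered features; the top right-singular
    vectors span a subspace where model errors concentrate."""
    err = 1.0 - probs[np.arange(len(labels)), labels]   # per-sample error
    weighted = (feats - feats.mean(axis=0)) * err[:, None]
    _, _, vt = np.linalg.svd(weighted, full_matrices=False)
    return vt[:rank]                                    # [rank, dim] basis

def adjust_logits(logits, feats, basis, W):
    """Low-rank logit adjustment restricted to the error subspace;
    W ([rank, n_classes]) would be fit on held-out data."""
    return logits + (feats @ basis.T) @ W
```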

Result: Tested on five real-world datasets across three settings: no knowledge, partial knowledge, and full knowledge of subgroup relevance. LEIA consistently improves worst-group performance while being fast, parameter-efficient, and robust to hyperparameter choice.

Conclusion: LEIA provides an effective approach for improving group robustness without requiring group annotations, making it practical for real-world applications where sensitive subgroups are often unknown or unlabeled.

Abstract: Deep learning models trained to optimize average accuracy often exhibit systematic failures on particular subpopulations. In real world settings, the subpopulations most affected by such disparities are frequently unlabeled or unknown, thereby motivating the development of methods that are performant on sensitive subgroups without being pre-specified. However, existing group-robust methods typically assume prior knowledge of relevant subgroups, using group annotations for training or model selection. We propose Low-rank Error Informed Adaptation (LEIA), a simple two-stage method that improves group robustness by identifying a low-dimensional subspace in the representation space where model errors concentrate. LEIA restricts adaptation to this error-informed subspace via a low-rank adjustment to the classifier logits, directly targeting latent failure modes without modifying the backbone or requiring group labels. Using five real-world datasets, we analyze group robustness under three settings: (1) truly no knowledge of subgroup relevance, (2) partial knowledge of subgroup relevance, and (3) full knowledge of subgroup relevance. Across all settings, LEIA consistently improves worst-group performance while remaining fast, parameter-efficient, and robust to hyperparameter choice.

[405] Efficient LLM Moderation with Multi-Layer Latent Prototypes

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

Main category: cs.LG

TL;DR: MLPM is a lightweight, customizable input moderation tool for LLMs that uses prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining efficiency.

DetailsMotivation: Existing LLM moderation approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements, creating a need for a practical and adaptable solution for safe LLM deployment.

Method: Proposes Multi-Layer Prototype Moderator (MLPM) that leverages prototypes of intermediate representations across multiple layers of LLMs to detect harmful content, adding negligible overhead to the generation pipeline.
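
A sketch of the prototype idea under assumed choices: per-layer class prototypes, scored by a cosine-similarity margin and averaged over layers (the paper's aggregation rule may differ):

```python
import torch
import torch.nn.functional as F

def layer_prototypes(reps, labels):
    """Mean representation per class (0 = safe, 1 = harmful) for one
    intermediate layer; reps: [n_samples, dim]."""
    return torch.stack([reps[labels == c].mean(0) for c in (0, 1)])

def moderation_score(layer_reps, prototypes):
    """Average over layers of the harmful-minus-safe cosine margin;
    inputs scoring above a threshold are flagged."""
    scores = []
    for reps, protos in zip(layer_reps, prototypes):
        sim = F.cosine_similarity(reps.unsqueeze(1),
                                  protos.unsqueeze(0), dim=-1)  # [n, 2]
        scores.append(sim[:, 1] - sim[:, 0])
    return torch.stack(scores).mean(0)
```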

Result: Achieves state-of-the-art performance on diverse moderation benchmarks, demonstrates strong scalability across model families of various sizes, and integrates smoothly into end-to-end moderation pipelines.

Conclusion: MLPM provides a practical and adaptable solution for safe, robust, and efficient LLM deployment that can be seamlessly applied to any model and combined with output moderation techniques.

Abstract: Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

[406] Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

Wenlong Mou

Main category: cs.LG

TL;DR: Off-policy RL for continuous-time Markov diffusion processes with discrete observations/actions, using model-free function approximation without structural assumptions on dynamics.

DetailsMotivation: To develop reinforcement learning methods for continuous-time diffusion processes that can learn directly from data without requiring unrealistic structural assumptions about the underlying dynamics, making RL more applicable to real-world continuous systems.

Method: Leverages ellipticity of diffusions to establish Hilbert-space properties for Bellman operators, then proposes Sobolev-prox fitted q-learning algorithm that learns value and advantage functions by iteratively solving least-squares regression problems.
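
The "iteratively solving least-squares regression problems" core is easy to sketch, with a kernel ridge regressor standing in for the paper's Sobolev-prox function class (an assumption):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def fitted_q(s, r, s_next, gamma=0.99, n_iters=25):
    """Each iteration regresses the Bellman target r + gamma * q_k(s')
    onto the current states by least squares, yielding q_{k+1}."""
    target = r.copy()                          # q_0 = 0
    for _ in range(n_iters):
        q = KernelRidge(alpha=1e-2, kernel="rbf").fit(s, target)
        target = r + gamma * q.predict(s_next)
    return q

s, s_next = np.random.randn(256, 3), np.random.randn(256, 3)
q_hat = fitted_q(s, np.random.randn(256), s_next)
```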

Result: Derives oracle inequalities for estimation error governed by four factors: best approximation error of function classes, localized complexity, exponentially decaying optimization error, and numerical discretization error.

Conclusion: Ellipticity is identified as a key structural property that makes reinforcement learning with function approximation for Markov diffusions no harder than supervised learning, providing theoretical guarantees for practical RL in continuous-time systems.

Abstract: We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.

[407] When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Junxiong Wang, Fengxiang Bie, Jisen Li, Zhongzhu Zhou, Zelei Shao, Yubo Wang, Yinghui Liu, Qingyang Wu, Avner May, Sri Yanamandra, Yineng Zhang, Ce Zhang, Tri Dao, Percy Liang, Ben Athiwaratkun, Shuaiwen Leon Song, Chenfeng Xu, Xiaoxia Wu

Main category: cs.LG

TL;DR: Aurora: A unified training-serving system for speculative decoding that enables continuous online learning of speculators from live inference traces, addressing deployment lag and domain drift issues.

DetailsMotivation: Current speculative decoding deployments separate speculator training from serving, causing three problems: high time-to-serve (speculators must be trained offline before deployment), delayed utility feedback (true decoding speedup only known after training), and domain-drift degradation (speculators become stale as models adapt to new domains).

Method: Aurora reframes online speculator learning as an asynchronous reinforcement learning problem, using accepted tokens as positive feedback and rejected speculator proposals as implicit negative feedback. The system integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption and supporting day-0 deployment.
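
One concrete form the feedback signal could take (the unlikelihood term for rejected proposals is an assumed instantiation of the "implicit negative feedback", not Aurora's documented loss):

```python
import torch
import torch.nn.functional as F

def speculator_loss(logits, tokens, accepted, alpha=1.0):
    """Likelihood on accepted draft tokens plus unlikelihood on rejected
    ones. logits: [n, vocab]; tokens: [n]; accepted: [n] float mask."""
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    pos = -(tok_logp * accepted).sum() / accepted.sum().clamp_min(1)
    rej = 1.0 - accepted
    neg = -(torch.log1p(-tok_logp.exp().clamp(max=1 - 1e-6)) * rej).sum() \
          / rej.sum().clamp_min(1)
    return pos + alpha * neg
```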

Result: Aurora achieves 1.5x day-0 speedup on frontier models (MiniMax M2.1 229B and Qwen3-Coder-Next 80B) and adapts effectively to distribution shifts, delivering an additional 1.25x speedup over static speculators on widely used models (Qwen3 and Llama3).

Conclusion: Aurora demonstrates that continuous online learning of speculators from live inference traces can overcome deployment lag and domain drift issues in speculative decoding, enabling immediate deployment and adaptation to changing user traffic patterns.

Abstract: Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).

[408] From Core to Detail: Unsupervised Disentanglement with Entropy-Ordered Flows

Daniel Galperin, Ullrich Köthe

Main category: cs.LG

TL;DR: EOFlows is a normalizing-flow framework that orders latent dimensions by explained entropy, enabling adaptive compression and denoising by retaining only top latent variables at inference time.

DetailsMotivation: The paper addresses the challenge of learning unsupervised representations that are both semantically meaningful and stable across runs, while enabling flexible compression and noise handling.

Method: EOFlows uses entropy-ordered flows to order latent dimensions by explained entropy (similar to PCA’s explained variance). Combines likelihood-based training with local Jacobian regularization and noise augmentation, building on Independent Mechanism Analysis, Principal Component Flows and Manifold Entropic Metrics.
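
The inference-time flexibility is the easy part to sketch: keep the top-C latents, zero the rest, invert. The `forward`/`inverse` flow API below is assumed:

```python
import torch

@torch.no_grad()
def core_reconstruction(flow, x, C):
    """Encode to the entropy-ordered latent, keep only the top-C (core)
    dimensions, replace the detail dimensions with the prior mean (zero),
    and decode; C is chosen freely at inference time."""
    z = flow.forward(x)          # assumed: x -> z, dims ordered by entropy
    z[..., C:] = 0.0             # drop fine-grained detail and noise
    return flow.inverse(z)       # assumed inverse pass of the flow
```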

Result: Experiments on CelebA show the method uncovers semantically interpretable features, allows for high compression and strong denoising capabilities.

Conclusion: EOFlows provides a flexible framework for adaptive representation learning with interpretable features and practical compression/denoising applications.

Abstract: Learning unsupervised representations that are both semantically meaningful and stable across runs remains a central challenge in modern representation learning. We introduce entropy-ordered flows (EOFlows), a normalizing-flow framework that orders latent dimensions by their explained entropy, analogously to PCA’s explained variance. This ordering enables adaptive injective flows: after training, one may retain only the top C latent variables to form a compact core representation while the remaining variables capture fine-grained detail and noise, with C chosen flexibly at inference time rather than fixed during training. EOFlows build on insights from Independent Mechanism Analysis, Principal Component Flows and Manifold Entropic Metrics. We combine likelihood-based training with local Jacobian regularization and noise augmentation into a method that scales well to high-dimensional data such as images. Experiments on the CelebA dataset show that our method uncovers a rich set of semantically interpretable features, allowing for high compression and strong denoising.

[409] Improving Credit Card Fraud Detection with an Optimized Explainable Boosting Machine

Reza E. Fazel, Arash Bakhtiary, Siavash A. Bigdeli

Main category: cs.LG

TL;DR: Enhanced Explainable Boosting Machine (EBM) with systematic optimization for credit card fraud detection, achieving state-of-the-art performance while maintaining interpretability.

DetailsMotivation: Address class imbalance in credit card fraud detection while maintaining both high accuracy and interpretability for trustworthy financial systems.

Method: Optimized EBM (Explainable Boosting Machine) with systematic hyperparameter tuning, feature selection, preprocessing refinement, and Taguchi method for scaler sequence optimization.
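
For orientation, a hedged usage sketch with the open-source `interpret` package's EBM implementation; the synthetic data and hyperparameters below are placeholders, not the paper's Taguchi-tuned settings:

```python
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in imbalanced data; the paper uses a benchmark credit card dataset.
X, y = make_classification(n_samples=20_000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hyperparameters below are illustrative, not the paper's tuned values.
ebm = ExplainableBoostingClassifier(interactions=10, max_bins=256,
                                    learning_rate=0.01)
ebm.fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, ebm.predict_proba(X_te)[:, 1]))
```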

Result: Achieved ROC-AUC of 0.983, surpassing prior EBM baselines (0.975) and outperforming Logistic Regression, Random Forest, XGBoost, and Decision Tree models.

Conclusion: Interpretable machine learning with systematic optimization can advance trustworthy fraud analytics in financial systems.

Abstract: Addressing class imbalance is a central challenge in credit card fraud detection, as it directly impacts predictive reliability in real-world financial systems. To overcome this, the study proposes an enhanced workflow based on the Explainable Boosting Machine (EBM), a transparent, state-of-the-art implementation of the GA2M algorithm, optimized through systematic hyperparameter tuning, feature selection, and preprocessing refinement. Rather than relying on conventional sampling techniques that may introduce bias or cause information loss, the optimized EBM achieves an effective balance between accuracy and interpretability, enabling precise detection of fraudulent transactions while providing actionable insights into feature importance and interaction effects. Furthermore, the Taguchi method is employed to optimize both the sequence of data scalers and model hyperparameters, ensuring robust, reproducible, and systematically validated performance improvements. Experimental evaluation on benchmark credit card data yields an ROC-AUC of 0.983, surpassing prior EBM baselines (0.975) and outperforming Logistic Regression, Random Forest, XGBoost, and Decision Tree models. These results highlight the potential of interpretable machine learning and data-driven optimization for advancing trustworthy fraud analytics in financial systems.


[410] MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

Victor Rambaud, Salvador Mascarenhas, Yair Lakretz

Main category: cs.LG

TL;DR: MapFormers are Transformer-based architectures that learn cognitive maps from observational data through input-dependent positional encoding, enabling structure-content disentanglement and strong out-of-distribution generalization in navigation tasks.

DetailsMotivation: Current AI systems lack the cognitive mapping capabilities that give humans and animals strong out-of-distribution generalization and flexibility. The goal is to bridge this gap by developing models that can learn internal representations of abstract relationships among entities.

Method: Developed MapFormers based on Transformer architecture with input-dependent positional encoding matrices that disentangle structural relationships from content. Created two variants: one for episodic memory (absolute positional encoding) and one for working memory (relative positional encoding).
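
A rough sketch of one concrete realization of input-dependent positional encoding (the class, dimensions, and update rule are hypothetical, not the authors' architecture): each token emits a matrix that updates a running positional state, so position encodes the traversed path rather than a fixed index.

```python
import torch
import torch.nn as nn

class InputDependentPositions(nn.Module):
    """Hypothetical sketch: each token produces a matrix that updates a
    running positional state (a form of path integration), so "where" is
    computed from the inputs rather than from a fixed index."""
    def __init__(self, d_model, p_dim):
        super().__init__()
        self.to_matrix = nn.Linear(d_model, p_dim * p_dim)  # input -> update matrix
        self.p_dim = p_dim

    def forward(self, x):                  # x: (batch, seq, d_model)
        B, T, _ = x.shape
        p = x.new_zeros(B, self.p_dim)
        p[:, 0] = 1.0                      # initial positional state
        codes = []
        for t in range(T):                 # integrate position along the sequence
            M = self.to_matrix(x[:, t]).view(B, self.p_dim, self.p_dim)
            p = torch.bmm(M, p.unsqueeze(-1)).squeeze(-1)
            codes.append(p)
        return torch.stack(codes, dim=1)   # positional codes fed to attention
```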

Result: MapFormers successfully learned cognitive maps of underlying spaces and demonstrated near-perfect out-of-distribution generalization (e.g., to longer sequences) in 2D navigation tasks, outperforming current architectures.

Conclusion: Models designed to learn cognitive maps with structural bias for structure-content disentanglement show superior performance. Transformers with input-dependent positional encoding can naturally achieve this property, with broad applications in both neuroscience and AI.

Abstract: A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new architectures based on Transformer models, which can learn cognitive maps from observational data and perform path integration in parallel, in a self-supervised manner. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved naturally by updating the positional encoding in Transformers with input-dependent matrices. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several tasks, including a classic 2D navigation task, showing that our models can learn a cognitive map of the underlying space and generalize OOD (e.g., to longer sequences) with near-perfect performance, unlike current architectures. Together, these results demonstrate the superiority of models designed to learn a cognitive map, and the importance of introducing a structural bias for structure-content disentanglement, which can be achieved in Transformers with input-dependent positional encoding. MapFormers have broad applications in both neuroscience and AI, by explaining the neural mechanisms giving rise to cognitive maps, while allowing these relation models to be learned at scale.

[411] Context-Free Recognition with Transformers

Selim Jerad, Anej Svete, Sophie Hao, Ryan Cotterell, William Merrill

Main category: cs.LG

TL;DR: Transformers with O(log(n)) looping layers and O(n^6) padding tokens can recognize all context-free languages, but practical constraints like unambiguity reduce padding to O(n^3).

DetailsMotivation: While transformers excel on tasks with grammatical structure like natural language and code, it's unclear how they process grammatical syntax theoretically. Standard transformers cannot recognize context-free languages (CFLs) under standard complexity conjectures, and prior work only addressed regular languages.

Method: Theoretical analysis of looped transformers with O(log(n)) looping layers and padding tokens. Shows that with O(n^6) padding tokens, transformers can recognize all CFLs. For natural subclasses like unambiguous CFLs, padding reduces to O(n^3). Empirical validation on languages requiring logarithmic depth.
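
A toy shape-level illustration of the looped-transformer setup under study, assuming padding tokens are simply appended as scratch space and one weight-tied block is applied O(log n) times (the paper's result is a formal construction; this only conveys the mechanics):

```python
import math
import torch
import torch.nn as nn

def looped_forward(block, x, n_loops, n_pad):
    """Append blank padding tokens as scratch space, then apply one shared
    (weight-tied) block n_loops times. Illustrative shapes only."""
    pad = x.new_zeros(x.shape[0], n_pad, x.shape[-1])
    h = torch.cat([x, pad], dim=1)
    for _ in range(n_loops):
        h = block(h)
    return h[:, 0]                        # e.g. read out accept/reject here

n, d = 8, 32
block = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
x = torch.randn(2, n, d)
# O(log n) loops with O(n^3) padding, i.e. the unambiguous-CFL regime.
out = looped_forward(block, x, n_loops=math.ceil(math.log2(n)), n_pad=n**3)
```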

Result: Looped transformers with O(log(n)) layers and O(n^6) padding can recognize all CFLs. For unambiguous CFLs, padding reduces to O(n^3). Empirical results confirm looping helps on languages requiring logarithmic depth.

Conclusion: While general CFL recognition may require intractable padding (O(n^6)), natural constraints like unambiguity yield efficient recognition algorithms with O(n^3) padding, making transformers theoretically capable of processing grammatical syntax.

Abstract: Transformers excel empirically on tasks that process well-formed inputs according to some grammar, such as natural language and code. However, it remains unclear how they can process grammatical syntax. In fact, under standard complexity conjectures, standard transformers cannot recognize context-free languages (CFLs), a canonical formalism to describe syntax, or even regular languages, a subclass of CFLs. Past work proves that $\mathcal{O}(\log(n))$ looping layers (w.r.t. input length $n$) allow transformers to recognize regular languages, but the question of context-free recognition remained open. In this work, we show that looped transformers with $\mathcal{O}(\log(n))$ looping layers and $\mathcal{O}(n^6)$ padding tokens can recognize all CFLs. However, training and inference with $\mathcal{O}(n^6)$ padding tokens is potentially impractical. Fortunately, we show that, for natural subclasses such as unambiguous CFLs, the recognition problem on transformers becomes more tractable, requiring $\mathcal{O}(n^3)$ padding. We empirically validate our results and show that looping helps on a language that provably requires logarithmic depth. Overall, our results shed light on the intricacy of CFL recognition by transformers: While general recognition may require an intractable amount of padding, natural constraints such as unambiguity yield efficient recognition algorithms.

[412] Quantifying the Effect of Test Set Contamination on Generative Evaluations

Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Fazl Berez, Abhay Puri, Stella Biderman, Niloofar Mireshghallah, Sanmi Koyejo

Main category: cs.LG

TL;DR: Test set contamination in generative evaluations: Models pretrained on contaminated data show improved performance, with surprising scaling effects where even single test replicas enable lower loss than irreducible error of clean training.

DetailsMotivation: While test set contamination's impact on discriminative evaluations is well-studied, little research exists on its effects on generative evaluations, which is critical for accurate assessment of frontier AI systems trained on web-scale data.

Method: Pretrained language models on mixtures of web data and MATH benchmark, sweeping model sizes and test set replicas; used scaling laws to analyze contamination effects; studied further training phases (overtraining, supervised finetuning) and inference factors (temperature, solution length).
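
The experimental design reduces to corpus construction; a sketch under assumed details (document granularity, shuffling) looks like:

```python
import random

def build_corpus(web_docs, test_docs, n_replicas, seed=0):
    """Mix web text with n_replicas copies of the benchmark test set and
    shuffle, so replicas are interleaved through the pretraining corpus."""
    corpus = list(web_docs) + list(test_docs) * n_replicas
    random.Random(seed).shuffle(corpus)
    return corpus

web = [f"web document {i}" for i in range(1_000)]
math_test = [f"MATH test problem {i}" for i in range(10)]
# Sweep replica counts (values here are made up); the paper sweeps this
# jointly with model size and fits scaling laws to the resulting losses.
corpora = {k: build_corpus(web, math_test, k) for k in (0, 1, 4, 16)}
```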

Result: Performance improves with contamination and model size; single test replica enables lower loss than irreducible error of clean training; overtraining reduces contamination effects; finetuning effects depend on pretraining contamination; high temperatures mitigate contamination, longer solutions harder to memorize.

Conclusion: Test set contamination significantly impacts generative evaluations differently than discriminative ones, introducing new complexity for trustworthy AI evaluation; memorization interacts with generation in ways requiring careful consideration in evaluation protocols.

Abstract: As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.

[413] Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models

YuanLab.ai: Shawn Wu, Jiangang Luo, Tong Yu, Darcy Chen, Sean Wang, Xudong Zhao, Louie Li, Claire Wang, Hunter He, Carol Wang, Allen Wang

Main category: cs.LG

TL;DR: LAEP algorithm improves MoE LLM pre-training efficiency by pruning underutilized experts and reorganizing experts across devices based on token distribution statistics.

DetailsMotivation: Mixture-of-Experts LLMs have superior accuracy but suffer from computational bottlenecks during pre-training due to underutilized experts and limited training efficiency.

Method: Layer-Adaptive Expert Pruning (LAEP) algorithm that selectively prunes underutilized experts and reorganizes experts across computing devices according to token distribution statistics during pre-training.
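
A hedged sketch of utilization-based pruning as described (the keep ratio and the routing statistic are assumptions; the cross-device reorganization step is omitted):

```python
import torch

def prune_underutilized_experts(router_logits, keep_fraction=2 / 3):
    """Rank a layer's experts by how many tokens were routed to them over a
    pre-training window and keep the most-used fraction."""
    assignments = router_logits.argmax(dim=-1)        # (num_tokens,)
    counts = torch.bincount(assignments,
                            minlength=router_logits.shape[-1])
    n_keep = max(1, int(keep_fraction * router_logits.shape[-1]))
    keep_ids = counts.topk(n_keep).indices            # most-utilized experts
    return keep_ids.sort().values

keep = prune_underutilized_experts(torch.randn(10_000, 64))
```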

Result: Achieved 48.3% improvement in training efficiency and 33.3% parameter reduction when pre-training Yuan3.0-1T Base model, while maintaining excellent performance across multiple domains.

Conclusion: LAEP effectively reduces model size and substantially improves pre-training efficiency for MoE LLMs while maintaining performance.

Abstract: Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training represents a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the Yuan3.0-1T Base model (originally 1515B parameters) from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.

[414] Teaching Models to Teach Themselves: Reasoning at the Edge of Learnability

Shobhita Sundaram, John Quan, Ariel Kwiatkowski, Kartik Ahuja, Yann Ollivier, Julia Kempe

Main category: cs.LG

TL;DR: SOAR is a self-improvement framework using meta-RL where a teacher LLM generates synthetic problems for a student LLM, rewarded by student progress on hard problems, enabling learning from zero-success datasets.

DetailsMotivation: Reinforcement learning for finetuning large reasoning models fails on datasets with low initial success rates due to sparse training signals. The paper investigates whether pretrained LLMs can generate automated curricula for problems they cannot solve themselves.

Method: SOAR uses meta-RL with teacher and student copies of the same model. The teacher generates synthetic problems for the student, and receives reward based on the student’s improvement on a small subset of hard problems. This grounds the curriculum in measured student progress rather than intrinsic proxy rewards.
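
The grounded reward can be sketched as follows, with `finetune` and `pass_rate` as assumed callables supplied by the surrounding RL stack (a paraphrase of the setup, not the authors' code):

```python
def teacher_reward(student, synthetic_problems, hard_set, finetune, pass_rate):
    """The teacher is scored by the *measured* gain of a student copy on a
    held-out set of hard problems, not by an intrinsic proxy reward.
    `finetune` and `pass_rate` are assumed callables from the RL stack."""
    before = pass_rate(student, hard_set)        # e.g. 0/128 initially
    student = finetune(student, synthetic_problems)
    after = pass_rate(student, hard_set)
    return after - before                        # positive => useful curriculum
```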

Result: On hardest mathematical benchmarks (0/128 initial success), SOAR enables learning under sparse binary rewards. Grounded rewards outperform intrinsic reward schemes, avoiding instability and diversity collapse. Analysis shows structural quality and well-posedness of generated questions are more critical than solution correctness.

Conclusion: LLMs can generate useful stepping stones without needing to solve hard problems first, providing a principled path to escape reasoning plateaus without additional curated data.

Abstract: Can a model learn to escape its own learning plateau? Reinforcement learning methods for finetuning large reasoning models stall on datasets with low initial success rates, and thus little training signal. We investigate a fundamental question: Can a pretrained LLM leverage latent knowledge to generate an automated curriculum for problems it cannot solve? To explore this, we design SOAR: a self-improvement framework designed to surface these pedagogical signals through meta-RL. A teacher copy of the model proposes synthetic problems for a student copy, and is rewarded by the student's improvement on a small subset of hard problems. Critically, SOAR grounds the curriculum in measured student progress rather than intrinsic proxy rewards. Our study on the hardest subsets of mathematical benchmarks (0/128 success) reveals three core findings. First, we show that it is possible to realize bi-level meta-RL that unlocks learning under sparse, binary rewards by sharpening a latent capacity of pretrained models to generate useful stepping stones. Second, grounded rewards outperform intrinsic reward schemes used in prior LLM self-play, reliably avoiding the instability and diversity collapse modes they typically exhibit. Third, analyzing the generated questions reveals that structural quality and well-posedness are more critical for learning progress than solution correctness. Our results suggest that the ability to generate useful stepping stones does not require the preexisting ability to actually solve the hard problems, paving a principled path to escape reasoning plateaus without additional curated data.

[415] Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution

Long Zhang, Fangwei Lin

Main category: cs.LG

TL;DR: LLMs prioritize conflicting in-context information over parametric memory through “Orthogonal Interference” - injecting steering vectors quasi-orthogonal to ground-truth directions rather than suppressing internal knowledge magnitude.

DetailsMotivation: To understand the mechanistic basis of LLM sycophancy/compliance behavior where models prioritize conflicting in-context information over pre-existing parametric memory, specifically examining whether this suppression arises from signal magnitude dilution or directional geometric alteration.

Method: Conducted layer-wise geometric analysis across Qwen-3-4B, Llama-3.1-8B, and GLM-4-9B, decomposing residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components to test the “Manifold Dilution” hypothesis.
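
The radial/angular decomposition is straightforward to sketch; assuming per-layer hidden states at the answer position are available, it amounts to a norm ratio and a cosine:

```python
import torch
import torch.nn.functional as F

def radial_angular(h_clean, h_conflict):
    """h_*: (layers, d) residual-stream states at the answer position, with
    and without the counterfactual context."""
    radial = h_conflict.norm(dim=-1) / h_clean.norm(dim=-1)      # ~1 => no dilution
    angular = F.cosine_similarity(h_conflict, h_clean, dim=-1)   # <1 => rotation
    return radial, angular

# "Orthogonal Interference" shows up as radial ~ 1 with low angular values:
# the norm survives while the direction is rotated away from the truth.
```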

Result: Rejected the universality of “Manifold Dilution” hypothesis - two of three architectures maintained stable residual norms despite significant performance degradation. Found compliance is consistently characterized by “Orthogonal Interference” where conflicting context injects steering vectors quasi-orthogonal to ground-truth directions, rotating hidden state representations.

Conclusion: Models don’t “unlearn” or suppress internal truth magnitude but use geometric displacement to bypass correct unembedding vectors, simulating adoption while preserving original structural magnitude. Challenges scalar confidence metrics and underscores need for vectorial monitoring to distinguish genuine knowledge integration from superficial in-context mimicry.

Abstract: Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure, specifically how the model resolves these knowledge conflicts through compliance, and whether this suppression arises from signal magnitude dilution or directional geometric alteration within the residual stream. To resolve this, we conducted a layer-wise geometric analysis across Qwen-3-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the “Manifold Dilution” hypothesis, as two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by “Orthogonal Interference,” where the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, effectively rotating the hidden state representation. This suggests that models do not “unlearn” or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement to bypass the correct unembedding vector, effectively simulating adoption while preserving the original structural magnitude. These findings challenge scalar confidence metrics for detecting hallucinations and underscore the necessity of vectorial monitoring to distinguish between genuine knowledge integration and superficial in-context mimicry.

[416] Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities

Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets

Main category: cs.LG

TL;DR: ProGRPO introduces Advantage Re-weighting Mechanism (ARM) to address mode collapse in RLVR for LLMs by rebalancing confidence across correct reasoning paths to enhance output diversity while maintaining accuracy.

DetailsMotivation: Standard RLVR methods like GRPO suffer from mode collapse and low output diversity due to disproportionate reinforcement of highest-likelihood paths, suppressing valid alternative reasoning chains in LLMs.

Method: Proposes Advantage Re-weighting Mechanism (ARM) that incorporates Prompt Perplexity and Answer Confidence into advantage estimation to dynamically reshape reward signals, attenuating gradient updates for over-confident paths while redistributing probability mass to under-explored correct solutions.
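
The abstract does not give ARM's exact formula, so the following is only a hedged illustration of the stated idea (attenuate over-confident correct paths, re-scale by prompt difficulty); all coefficients and functional forms here are assumptions:

```python
import torch

def arm_advantage(rewards, answer_conf, prompt_ppl, alpha=1.0, beta=1.0):
    """Assumed illustration only. `answer_conf`: per-sample model confidence
    in its answer; `prompt_ppl`: scalar prompt perplexity. Over-confident
    paths get attenuated; mass shifts toward under-explored correct ones."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # GRPO-style
    weight = answer_conf.clamp_min(1e-6) ** (-alpha)            # shrink confident paths
    weight = weight * prompt_ppl ** beta                        # emphasize hard prompts
    return adv * weight / weight.mean()                         # keep scale comparable
```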

Result: On Qwen2.5-7B, ProGRPO outperforms GRPO by 5.7% in Pass@1 and 13.9% in Pass@32, significantly enhancing generative diversity and response entropy while maintaining competitive accuracy on mathematical and coding benchmarks.

Conclusion: ProGRPO effectively achieves superior trade-off between exploration and exploitation in reasoning tasks by mitigating entropy collapse and promoting diverse correct reasoning paths in LLMs.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. On Qwen2.5 and DeepSeek models, experiments across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.

[417] Constrained Group Relative Policy Optimization

Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal, Liam Paull

Main category: cs.LG

TL;DR: Constrained GRPO extends Group Relative Policy Optimization with Lagrangian constraints for safe policy learning, addressing advantage estimation issues that break constraint enforcement.

DetailsMotivation: While GRPO is effective for critic-free policy learning, it lacks explicit constraint handling needed for safe AI systems. The paper aims to extend GRPO to constrained settings common in robotics and embodied AI where safety constraints are crucial.

Method: Introduces Constrained GRPO with Lagrangian relaxation using indicator cost functions. Identifies that naive multi-component advantage estimation breaks constraint learning due to mismatched standard deviations, and proposes scalarized advantage construction to preserve the intended reward-constraint trade-off.
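
The contrast between the scalarized construction and the naive per-component one can be sketched directly (group-normalization details assumed):

```python
import torch

def scalarized_advantage(rewards, costs, lam, eps=1e-6):
    """Combine reward and Lagrangian cost *before* group normalization, so a
    single standard deviation preserves the reward-constraint trade-off."""
    obj = rewards - lam * costs            # costs: indicator violation signals
    return (obj - obj.mean()) / (obj.std() + eps)

def naive_advantage(rewards, costs, lam, eps=1e-6):
    """The pathology the paper derives: normalizing each component separately
    lets mismatched standard deviations silently rescale lambda."""
    a_r = (rewards - rewards.mean()) / (rewards.std() + eps)
    a_c = (costs - costs.mean()) / (costs.std() + eps)
    return a_r - lam * a_c
```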

Result: Experiments in toy gridworld confirm the optimization pathology and show scalarizing advantages restores stable constraint control. In robotics tasks, Constrained GRPO improves constraint satisfaction while increasing task success rates.

Conclusion: Provides a simple and effective recipe for constrained policy optimization in embodied AI domains, particularly relevant for systems using large multimodal foundation models where safety constraints are essential.

Abstract: While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.

[418] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He

Main category: cs.LG

TL;DR: Dr Kernel-14B: An LLM trained with multi-turn RL for generating high-performance GPU kernels, achieving competitive performance with Claude-4.5-Sonnet on kernel generation benchmarks.

DetailsMotivation: High-quality kernel code is critical for scalable AI systems, but training LLMs for kernel generation faces challenges: need for sufficient data, robust environment, and vulnerability to reward hacking and lazy optimization where models prioritize trivial correctness over meaningful speedup.

Method: 1) Designed KernelGYM - a robust distributed GPU environment supporting reward hacking checks, multi-turn interaction data collection, and long-term RL training. 2) Proposed Turn-level Reinforce-Leave-One-Out (TRLOO) to address biased policy gradient in multi-turn RL. 3) Incorporated mismatch correction for training stability. 4) Introduced Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome lazy optimization.
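
A minimal sketch of a turn-level leave-one-out baseline in the spirit of TRLOO (our reading of the idea, not the released implementation):

```python
import torch

def trloo_advantage(returns):
    """returns: (n,) turn-level returns for one prompt's group of rollouts.
    Each rollout's baseline is the mean of the *other* rollouts, removing the
    self-inclusion bias identified in GRPO."""
    n = returns.shape[0]
    loo_baseline = (returns.sum() - returns) / (n - 1)
    return returns - loo_baseline
```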

Result: Dr Kernel-14B achieves performance competitive with Claude-4.5-Sonnet on Kernelbench. On KernelBench Level-2 subset, 31.6% of generated kernels achieve ≥1.2x speedup over Torch reference (vs 26.7% for Claude-4.5-Sonnet, 28.6% for GPT-5). When selecting best candidate across all turns, 1.2x speedup rate increases to 47.8%.

Conclusion: The paper presents a systematic RL approach for kernel generation with KernelGYM environment, TRLOO method for unbiased advantage estimation, and techniques to address reward hacking and lazy optimization, resulting in state-of-the-art kernel generation performance.

Abstract: High-quality kernel code is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in https://www.github.com/hkust-nlp/KernelGYM.

[419] Tokenizing Single-Channel EEG with Time-Frequency Motif Learning

Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun

Main category: cs.LG

TL;DR: TFM-Tokenizer is a novel EEG tokenization framework that learns time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens, improving accuracy, generalization, and scalability for EEG analysis.

DetailsMotivation: Foundation models are transforming EEG analysis, but effective EEG tokenization remains challenging. Current approaches lack robust methods for converting continuous EEG signals into discrete tokens that capture meaningful time-frequency patterns.

Method: Proposes TFM-Tokenizer with dual-path architecture using time-frequency masking to learn vocabulary of time-frequency motifs from single-channel EEG. Model-agnostic framework supports both lightweight transformers and existing foundation models.
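
A small sketch of time-frequency masking on a single-channel EEG spectrogram, with mask shapes and ratios assumed (the dual-path encoder and the motif token vocabulary are omitted):

```python
import torch

def time_frequency_mask(spec, t_frac=0.1, f_frac=0.1):
    """spec: (freq_bins, time_bins) spectrogram of one EEG channel. Hide a
    random time span and a random frequency band so the tokenizer must
    reconstruct them from its motif tokens."""
    f_bins, t_bins = spec.shape
    t_len, f_len = int(t_frac * t_bins), int(f_frac * f_bins)
    t0 = torch.randint(0, t_bins - t_len + 1, (1,)).item()
    f0 = torch.randint(0, f_bins - f_len + 1, (1,)).item()
    out = spec.clone()
    out[:, t0:t0 + t_len] = 0.0     # time mask
    out[f0:f0 + f_len, :] = 0.0     # frequency mask
    return out
```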

Result: Achieves up to 11% improvement in Cohen’s Kappa on four EEG benchmarks, boosts performance of foundation models like BIOT and LaBraM, and shows 14% improvement on ear-EEG sleep staging despite domain differences.

Conclusion: TFM-Tokenizer provides effective EEG tokenization with strong accuracy, generalization, and device-agnostic scalability, offering improved representation quality and interpretability for EEG analysis tasks.

Abstract: Foundation models are reshaping EEG analysis, yet EEG tokenization remains an important open challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 11% improvement in Cohen’s Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.

[420] PiFlow: Principle-Aware Scientific Discovery with Multi-Agent Collaboration

Yingming Pu, Tao Lin, Hongyu Chen

Main category: cs.LG

TL;DR: PiFlow: An information-theoretical framework for LLM-based multi-agent scientific discovery that treats automation as structured uncertainty reduction guided by principles like scientific laws.

DetailsMotivation: Existing LLM-based multi-agent systems for scientific discovery use predefined workflows lacking rationality constraints, leading to aimless hypothesizing and failure to link hypotheses with evidence, hindering systematic uncertainty reduction.

Method: Introduces PiFlow, an information-theoretical framework that treats automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). It serves as a Plug-and-Play module that generalizes on existing agent architectures.

Result: Improves discovery efficiency by 31.18%~41.73% and solution quality by 12.47%~31.72% against SOTA methods; delivers 5.6x speedup in time-to-solution while reducing token consumption by up to 27% compared to vanilla agents.

Conclusion: PiFlow establishes a novel paradigm shift in highly efficient agentic scientific discovery, paving the way for more robust and accelerated AI-driven research.

Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering the systematic reduction of uncertainty. Overcoming these limitations fundamentally requires a principled approach to exploration. We introduce PiFlow, an information-theoretic framework that treats automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). Extensive evaluations across three distinct scientific domains demonstrate that PiFlow (I) improves discovery efficiency by 31.18%~41.73% and solution quality by 12.47%~31.72% against state-of-the-art methods, (II) delivers a 5.6x speedup in time-to-solution while reducing token consumption by up to 27% compared to vanilla agents, and (III) serves as a Plug-and-Play module that generalizes across existing agent architectures. Overall, PiFlow establishes a novel paradigm shift in highly efficient agentic scientific discovery, paving the way for more robust and accelerated AI-driven research.

[421] STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation

Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski

Main category: cs.LG

TL;DR: STFlow is a generative model for simulating dynamical system trajectories using graph neural networks and hierarchical convolutions, with improved efficiency through informed priors and reduced transport cost.

DetailsMotivation: Trajectory simulation in dynamical systems faces challenges including high sensitivity to perturbations (bifurcations), multi-scale correlations, and computational complexity. Existing methods struggle with these issues, particularly for N-body systems, molecular dynamics, and human trajectory forecasting.

Method: STFlow combines graph neural networks with hierarchical convolutions within a Flow Matching framework. It uses data-dependent couplings and denoises starting from conditioned random-walks instead of Gaussian noise, creating an informed prior that reduces transport cost and improves efficiency.
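
A hedged sketch of the data-coupled training step, assuming a rectified-flow-style linear path and a random walk anchored at the observed start as the prior (the paper's exact coupling may differ):

```python
import torch

def stflow_fm_loss(model, x1, cond, sigma=0.1):
    """x1: (batch, steps, dim) ground-truth trajectories. The prior x0 is a
    conditioned random walk rather than Gaussian noise, which sits closer to
    real trajectories and lowers transport cost."""
    increments = torch.randn_like(x1) * sigma
    increments[:, 0] = x1[:, 0]               # anchor at the observed start
    x0 = increments.cumsum(dim=1)             # conditioned random-walk prior
    t = torch.rand(x1.shape[0], 1, 1)
    xt = (1 - t) * x0 + t * x1                # linear probability path
    v_target = x1 - x0                        # flow-matching velocity target
    return ((model(xt, t, cond) - v_target) ** 2).mean()
```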

Result: STFlow achieves lowest prediction errors across N-body systems, molecular dynamics, and human trajectory forecasting benchmarks, with fewer simulation steps and improved scalability compared to existing methods.

Conclusion: The informed prior approach in STFlow simplifies learning tasks and enhances training/inference efficiency for trajectory simulation in complex dynamical systems, demonstrating superior performance across diverse applications.

Abstract: Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances in deep generative modeling and geometric deep learning enable probabilistic simulation by learning complex trajectory distributions while respecting intrinsic permutation and time-shift symmetries. However, trajectories of N-body systems are commonly characterized by high sensitivity to perturbations leading to bifurcations, as well as multi-scale temporal and spatial correlations. To address these challenges, we introduce STFlow (Spatio-Temporal Flow), a generative model based on graph neural networks and hierarchical convolutions. By incorporating data-dependent couplings within the Flow Matching framework, STFlow denoises starting from conditioned random-walks instead of Gaussian noise. This novel informed prior simplifies the learning task by reducing transport cost, increasing training and inference efficiency. We validate our approach on N-body systems, molecular dynamics, and human trajectory forecasting. Across these benchmarks, STFlow achieves the lowest prediction errors with fewer simulation steps and improved scalability.

[422] Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation

Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider

Main category: cs.LG

TL;DR: Consistency distillation method for offline RL that enables single-step sampling while optimizing rewards, achieving both performance gains and 142x inference speedup over diffusion models.

DetailsMotivation: Diffusion models show strong decision-making performance but suffer from slow inference. Consistency models offer faster inference but existing approaches either struggle with suboptimal demonstrations under behavior cloning or require complex multi-network training in actor-critic frameworks.

Method: Proposes a novel consistency distillation approach for offline RL that directly incorporates reward optimization into the distillation process. Uses decoupled training and noise-free reward signals to achieve single-step sampling while generating higher-reward action trajectories.
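
The abstract does not spell out the loss, so the following is only an assumed illustration of folding a noise-free reward signal into single-step consistency distillation; the weighting scheme is hypothetical:

```python
import torch

def reward_aware_distill_loss(student, x_t, t, target_action, reward, w=1.0):
    """Assumed form. The student maps a noised action directly to a clean one
    in a single step; reward-based weights push one-step samples toward
    high-return actions instead of plain behavior cloning."""
    pred = student(x_t, t)                          # single-step prediction
    per_sample = ((pred - target_action) ** 2).mean(dim=-1)
    weights = torch.softmax(w * reward, dim=0)      # favor high-reward targets
    return (weights * per_sample).sum()
```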

Result: Achieves 9.7% improvement over previous state-of-the-art on Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks, with up to 142x speedup over diffusion counterparts in inference time.

Conclusion: The method successfully addresses the inference speed limitation of diffusion models in decision-making while maintaining or improving performance through reward-optimized consistency distillation.

Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.

[423] Physics vs Distributions: Pareto Optimal Flow Matching with Physics Constraints

Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey

Main category: cs.LG

TL;DR: PBFM introduces physics-based flow matching that optimizes both generative fidelity and physical constraints simultaneously without conflicting gradients, achieving Pareto-optimal trade-offs for physics-constrained generative modeling.

DetailsMotivation: Current physics-constrained generative models struggle with conflicting optimization objectives between distributional accuracy and physical consistency, often degrading generative fidelity or requiring costly inference-time corrections.

Method: Physics-Based Flow Matching (PBFM) enforces physical constraints at training time using conflict-free gradient updates and unrolling to mitigate Jensen’s gap, avoiding manual loss balancing while enabling simultaneous optimization of both objectives.
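
One standard way to realize a conflict-free update, in the spirit of what PBFM describes (the paper's exact rule is not reproduced here), is a PCGrad-style projection on flattened gradients:

```python
import torch

def conflict_free_update(g_gen, g_phys):
    """g_gen, g_phys: flattened gradients of the flow-matching and physics
    losses. If they conflict (negative dot product), remove the conflicting
    component from the physics gradient before summing."""
    dot = torch.dot(g_gen, g_phys)
    if dot < 0:
        g_phys = g_phys - (dot / g_gen.norm() ** 2) * g_gen
    return g_gen + g_phys
```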

Result: PBFM achieves Pareto-optimal trade-off between distributional and physical accuracy across three PDE benchmarks, maintains competitive inference speed, and generalizes to various physics-constrained generative tasks.

Conclusion: PBFM provides a practical tool for scientific machine learning by resolving the inherent conflict between generative and physical objectives, enabling high-quality physics-constrained generation without compromising inference performance.

Abstract: Physics-constrained generative modeling aims to produce high-dimensional samples that are both physically consistent and distributionally accurate, a task that remains challenging due to often conflicting optimization objectives. Recent advances in flow matching and diffusion models have enabled efficient generative modeling, but integrating physical constraints often degrades generative fidelity or requires costly inference-time corrections. Our work is the first to recognize the trade-off between distributional and physical accuracy. Based on the insight of inherently conflicting objectives, we introduce Physics-Based Flow Matching (PBFM), a method that enforces physical constraints at training time using conflict-free gradient updates and unrolling to mitigate Jensen’s gap. Our approach avoids manual loss balancing and enables simultaneous optimization of generative and physical objectives. As a consequence, physics constraints do not impede inference performance. We benchmark our method across three representative PDE benchmarks. PBFM achieves a Pareto-optimal trade-off, competitive inference speed, and generalizes to a wide range of physics-constrained generative tasks, providing a practical tool for scientific machine learning. Code and datasets available at https://github.com/tum-pbs/PBFM.

[424] A Multi-Token Coordinate Descent Method for Semi-Decentralized Vertical Federated Learning

Pedro Valdeira, Yuejie Chi, Cláudia Soares, João Xavier

Main category: cs.LG

TL;DR: MTCD is a semi-decentralized vertical FL method that combines client-server and client-client communication to overcome limitations of both pure approaches.

DetailsMotivation: Client-server FL suffers from bandwidth bottlenecks and single points of failure, while fully decentralized methods struggle with convergence in large sparse networks. This work bridges the gap between these approaches for vertical FL where clients hold different features of the same samples.

Method: Proposes Multi-Token Coordinate Descent (MTCD), a flexible semi-decentralized method for vertical FL that exploits both client-server and client-client links. By tuning hyperparameters, MTCD can recover client-server or decentralized schemes as special cases, or explore intermediate configurations.
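
A schematic of one MTCD round under assumed interfaces (`grad_fn` and client-indexed parameter blocks are hypothetical): tokens roam disjoint client subsets, each holder performing a block-coordinate step.

```python
import numpy as np

def mtcd_round(theta_blocks, grad_fn, token_subsets, lr=0.1, seed=0):
    """theta_blocks: dict client -> its feature-block parameters. Each token
    roams its own disjoint client subset; the holder takes a block-coordinate
    step. Whether the token moves over a client-client or client-server link
    is the tunable cost knob, abstracted away here."""
    rng = np.random.default_rng(seed)
    for subset in token_subsets:                 # one token per disjoint subset
        holder = rng.choice(subset)              # token's current client
        g = grad_fn(holder, theta_blocks)        # partial grad w.r.t. its block
        theta_blocks[holder] = theta_blocks[holder] - lr * g
    return theta_blocks
```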

Result: Proves MTCD converges at O(1/T) rate for nonconvex objectives with sufficiently large batch sizes when tokens roam across disjoint client subsets. Analytical and empirical results show semi-decentralized MTCD instances can outperform both pure client-server and decentralized approaches across applications.

Conclusion: MTCD provides a flexible framework that bridges client-server and decentralized FL, offering performance advantages by balancing communication costs and convergence properties through tunable server dependency.

Abstract: Most federated learning (FL) methods use a client-server scheme, where clients communicate only with a central server. However, this scheme is prone to bandwidth bottlenecks at the server and has a single point of failure. In contrast, in a (fully) decentralized approach, clients communicate directly with each other, dispensing with the server and mitigating these issues. Yet, as the client network grows larger and sparser, the convergence of decentralized methods slows down, even failing to converge if the network is disconnected. This work addresses this gap between client-server and decentralized schemes, focusing on the vertical FL setup, where clients hold different features of the same samples. We propose multi-token coordinate descent (MTCD), a flexible semi-decentralized method for vertical FL that can exploit both client-server and client-client links. By selecting appropriate hyperparameters, MTCD recovers the client-server and decentralized schemes as special cases. In fact, its decentralized instance is itself a novel method of independent interest. Yet, by controlling the degree of dependency on client-server links, MTCD can also explore a spectrum of schemes ranging from client-server to decentralized. We prove that, for sufficiently large batch sizes, MTCD converges at an $\mathcal{O}(1/T)$ rate for nonconvex objectives when the tokens roam across disjoint subsets of clients. To capture the aforementioned drawbacks of the client-server scheme succinctly, we model the relative impact of using client-server versus client-client links as the ratio of their “costs”, which depends on the application. This allows us to demonstrate, both analytically and empirically, that by tuning the degree of dependency on the server, the semi-decentralized instances of MTCD can outperform both client-server and decentralized approaches across a range of applications.

[425] HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

Main category: cs.LG

TL;DR: Poly2Graph pipeline automates mapping 1D crystal Hamiltonians to spectral graphs, creating HSG-12M dataset with 11.6M static and 5.1M dynamic Hamiltonian spectral graphs across 1401 polynomial classes.

DetailsMotivation: AI's impact on scientific research is constrained by lack of large, high-quality domain-specific datasets. Non-Hermitian quantum physics contains rich untapped resources in Hamiltonian spectral graphs, but systematic study has been intractable due to manual extraction.

Method: Developed Poly2Graph: a high-performance, open-source pipeline that automates mapping of 1-D crystal Hamiltonians to spectral graphs. Used this to create HSG-12M dataset containing millions of static and dynamic Hamiltonian spectral graphs across polynomial classes.

Result: Created HSG-12M - first large-scale dataset of spatial multigraphs (graphs embedded in metric space with multiple geometrically distinct trajectories between nodes). Benchmarks with GNNs reveal new challenges in learning spatial multi-edges at scale.

Conclusion: Spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging new algebra-to-graph link. HSG-12M lays groundwork for data-driven scientific discovery in condensed matter physics and new opportunities in geometry-aware graph learning.

Abstract: AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane – termed Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs – graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.

[426] Forecasting with Hyper-Trees

Alexander März, Kashif Rasul

Main category: cs.LG

TL;DR: Hyper-Trees is a novel framework that combines gradient boosted trees with classical time series models (like ARIMA/Exponential Smoothing) by learning model parameters as functions of features, then using those parameters in the target model for forecasting.

DetailsMotivation: To bridge the gap between tree-based models (effective on tabular data) and time series forecasting by inducing time series inductive bias into tree-based approaches, overcoming limitations of conventional tree-based time series forecasting.

Method: Hyper-Trees learn parameters of target time series models (ARIMA/Exponential Smoothing) as functions of features using gradient boosted trees. For high-dimensional parameter estimation, they combine decision trees (for feature representation) with shallow neural networks (to learn time series model parameters) in a hybrid framework.
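
To make the division of labor concrete: the trees' only output is a parameter such as a smoothing coefficient, which the classical target model then consumes. A toy sketch with simple exponential smoothing, where `alpha_from_tree` is a placeholder standing in for a tree-predicted value:

```python
import numpy as np

def exp_smoothing_forecast(y, alpha, horizon=3):
    """Simple exponential smoothing driven by an externally supplied alpha.
    In a Hyper-Tree, alpha would be the output of a gradient boosted tree
    evaluated on the current features, refreshed as features change."""
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return np.repeat(level, horizon)

y = np.array([10.0, 12.0, 11.0, 13.0, 12.5])
alpha_from_tree = 0.4                  # placeholder for the tree's prediction
print(exp_smoothing_forecast(y, alpha_from_tree))
```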

Result: The framework extends tree-based modeling beyond conventional time series analysis and is evaluated across various forecasting tasks, though specific performance metrics aren’t provided in the abstract.

Conclusion: Hyper-Trees successfully combine the strengths of decision trees on tabular data with classical forecasting models, creating a novel approach that incorporates time series inductive bias into tree-based methods through parameter learning.

Abstract: We introduce Hyper-Trees as a novel framework for modeling time series data using gradient boosted trees. Unlike conventional tree-based approaches that forecast time series directly, Hyper-Trees learn the parameters of a target time series model, such as ARIMA or Exponential Smoothing, as functions of features. These parameters are then used by the target model to generate the final forecasts. Our framework combines the effectiveness of decision trees on tabular data with classical forecasting models, thereby inducing a time series inductive bias into tree-based models. To resolve the scaling limitations of boosted trees when estimating a high-dimensional set of target model parameters, we combine decision trees and neural networks within a unified framework. In this hybrid approach, the trees generate informative representations from the input features, which a shallow network then uses as input to learn the parameters of a time series model. With our research, we explore the effectiveness of Hyper-Trees across a range of forecasting tasks and extend tree-based modeling beyond its conventional use in time series analysis.

[427] Dataset Distillation as Pushforward Optimal Quantization

Hong Ye Tan, Emma Slade

Main category: cs.LG

TL;DR: Dataset distillation method using optimal quantization in latent space achieves state-of-the-art performance on ImageNet-1K with better inter-model generalization and computational efficiency.

DetailsMotivation: Existing dataset distillation methods face challenges with computational efficiency and scalability. Disentangled methods are faster but lack theoretical grounding. The paper aims to provide a theoretical foundation for disentangled methods and improve their performance.

Method: Proposes Dataset Distillation by Optimal Quantization, which reformulates disentangled methods as optimal quantization problems. Uses encoder-decoder structure to map data to latent space, then performs clustering to find representative points that approximate the underlying probability distribution.
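
The quantization view reduces, in its simplest form, to clustering in latent space; a sketch with assumed `encoder`/`decoder` callables standing in for the paper's encoder-decoder pair:

```python
import numpy as np
from sklearn.cluster import KMeans

def distill_class(encoder, decoder, images, ipc):
    """Embed one class's images, take k-means centroids as the quantization
    points, and decode them into `ipc` synthetic images. `encoder`/`decoder`
    are assumed callables, e.g. a generative prior's latent maps."""
    z = np.stack([encoder(im) for im in images])        # (N, latent_dim)
    km = KMeans(n_clusters=ipc, n_init=10, random_state=0).fit(z)
    return [decoder(c) for c in km.cluster_centers_]    # distilled examples
```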

Result: Achieves SOTA performance on ImageNet-1K, better inter-model generalization than previous methods (D⁴M), and trivial additional computation. Also achieves SOTA in higher image-per-class settings and when using distilled noise initializations with diffusion transformer models.

Conclusion: The optimal quantization framework provides theoretical grounding for disentangled dataset distillation methods and enables efficient, high-performance dataset compression with strong generalization properties.

Abstract: Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D⁴M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.

[428] Training in reverse: How iteration order influences convergence and stability in deep learning

Benoit Dherin, Benny Avelin, Anders Karlsson, Hanna Mazzawi, Javier Gonzalvo, Michael Munn

Main category: cs.LG

TL;DR: Backward-SGD optimization method improves training stability by reversing gradient composition order, converging to points rather than distributions around minima.

DetailsMotivation: Training neural networks is computationally expensive and plagued by instabilities that degrade convergence. While learning rate schedules help, finding optimal schedules is resource-intensive. The paper explores theoretical issues in constant-learning-rate, small-batch-size regimes.

Method: Introduces backward-SGD which produces parameter iterates by reverting the usual forward composition order of batch gradients. The composition order of gradient updates affects stability and convergence. Theoretical analysis shows backward-SGD converges to points in contractive regions while forward-SGD only converges to distributions.
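
Concretely, forward-SGD's iterate is F_t ∘ … ∘ F_1 applied to x_0, while backward-SGD's is F_1 ∘ … ∘ F_t. A sketch (which also makes the extra cost visible, since iterate t replays t batches from x_0; `grad_fn` is an assumed per-batch gradient oracle):

```python
import torch

def backward_sgd_iterate(x0, grad_fn, batches, lr):
    """Iterate t of backward-SGD: replay all t batches from x0 with the
    composition order reversed (newest batch applied first, batch 1 last)."""
    x = x0.clone()
    for batch in reversed(batches):    # F_1 o F_2 o ... o F_t applied to x0
        x = x - lr * grad_fn(x, batch)
    return x

# Note the cost: iterate t needs t gradient evaluations from scratch, which
# is why the paper treats full backward-SGD as a proof of concept.
```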

Result: Backward-SGD demonstrates improved stability and convergence experimentally. While computationally intensive in practice, it shows that modifying iteration composition by creatively reusing previous batches has beneficial effects on training.

Conclusion: The composition order of gradient updates significantly impacts training stability and convergence. Backward-SGD represents a new, unexplored avenue in deep learning optimization, highlighting that extra freedom in modifying iteration composition can improve training.

Abstract: Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the composition order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces parameter iterates at each step by reverting the usual forward composition order of batch gradients. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom of modifying the usual iteration composition by creatively reusing previous batches at each optimization step may have important beneficial effects in improving training. Our experiments provide a proof of concept supporting this phenomenon. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.

[429] DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks

Nghiem T. Diep, Hien Dang, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: DoRAN improves DoRA by adding learnable noise for regularization and using auxiliary networks for dynamic low-rank matrix generation, enhancing training stability and sample efficiency in vision/language tasks.

DetailsMotivation: DoRA improves LoRA but still faces training instability and limited sample efficiency. The authors aim to address these limitations by introducing noise-based regularization and dynamic parameter generation.

Method: Two key innovations: (1) Inject learnable noise into DoRA’s denominator for adaptive regularization to stabilize training and improve low-rank matrix estimation; (2) Replace static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling between query/value projections.
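
A schematic of innovation (1) in code, assuming the standard DoRA decomposition W' = m · (W0 + BA) / ||W0 + BA|| with column-wise norms; the exact noise parameterization, and the auxiliary networks of innovation (2), follow the paper:

```python
import torch
import torch.nn as nn

class DoRANLinear(nn.Module):
    """Sketch of DoRA with learnable noise injected into the norm denominator."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5,
                               requires_grad=False)          # frozen pre-trained weight
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.m = nn.Parameter(self.W0.norm(dim=0))           # per-column magnitude
        # Learnable noise scale for the denominator (acts as adaptive regularizer).
        self.noise_logstd = nn.Parameter(torch.full((d_in,), -3.0))

    def forward(self, x):
        V = self.W0 + self.B @ self.A                        # directional component
        denom = V.norm(dim=0)                                # column-wise norms
        if self.training:                                    # inject noise during training only
            denom = denom + torch.randn_like(denom) * self.noise_logstd.exp()
        W = self.m * (V / denom)                             # magnitude * unit direction
        return x @ W.T

layer = DoRANLinear(32, 16)
print(layer(torch.randn(4, 32)).shape)                       # torch.Size([4, 16])
```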

Result: DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines across vision and language benchmarks, demonstrating improved training stability and sample efficiency.

Conclusion: Combining noise-based regularization with network-based parameter generation effectively addresses DoRA’s limitations, making DoRAN a superior PEFT method for vision and language tasks.

Abstract: Parameter-efficient fine-tuning (PEFT) methods have become the standard paradigm for adapting large-scale models. Among these techniques, Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the learning capacity and training stability of the Low-Rank Adaptation (LoRA) method by explicitly decomposing pre-trained weights into magnitude and directional components. In this work, we propose DoRAN, a new technique designed to stabilize training and boost the sample efficiency of DoRA. Our framework introduces two key components: (i) the injection of learnable noise into the denominator of DoRA weight decomposition, which serves as an adaptive regularizer to mitigate instabilities and improve the estimation rate of low-rank matrices; and (ii) the replacement of static low-rank matrices with auxiliary networks that generate them dynamically, enabling parameter coupling between the query and value projection matrices, leading to improved sample efficiency both theoretically and empirically. Comprehensive experiments on vision and language benchmarks show that DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines, underscoring the effectiveness of combining noise-based regularization with network-based parameter generation.

[430] Echo State Transformer: Attention Over Finite Memories

Yannis Bendi-Ouis, Xavier Hinaut

Main category: cs.LG

TL;DR: Echo State Transformers (EST) combine Transformer attention with Reservoir Computing to create fixed-size memory systems with linear complexity, achieving state-of-the-art performance on time series tasks while overcoming Transformers’ quadratic scaling limitations.

DetailsMotivation: Transformers have computational limitations (quadratic complexity with sequence length) and don't reflect biological brain processing or working memory dynamics. The authors aim to design architectures that leverage efficient working memory dynamics to overcome these computational barriers.

Method: EST integrates Transformer attention mechanisms with Reservoir Computing nodes to create a fixed-size memory system. It uses multiple parallel reservoirs (random recurrent networks) as lightweight working memory with learned internal dynamics and adaptive leak rates. Attention is applied on these fixed memory units instead of input tokens, achieving linear complexity.
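
A rough sketch of the memory design under stated assumptions (fixed random recurrent weights, sigmoid-parameterized leak rates, a generic multi-head attention over the K memory units); EST's actual readout and training setup differ in detail:

```python
import torch
import torch.nn as nn

class EchoStateMemory(nn.Module):
    """Sketch: K parallel leaky reservoirs form a fixed-size working memory;
    attention is applied over the K units, not over input tokens."""
    def __init__(self, d_in, d_res=64, k=4, d_model=32):
        super().__init__()
        self.k, self.d_res = k, d_res
        self.W_in = nn.Parameter(0.1 * torch.randn(k, d_res, d_in), requires_grad=False)
        self.W_rec = nn.Parameter(0.9 * torch.randn(k, d_res, d_res) / d_res**0.5,
                                  requires_grad=False)       # fixed random recurrences
        self.leak = nn.Parameter(torch.zeros(k))             # adaptive leak rates (learned)
        self.proj = nn.Linear(d_res, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=2, batch_first=True)

    def forward(self, x):                                    # x: (T, d_in)
        h = torch.zeros(self.k, self.d_res)
        feats = []
        for t in range(x.shape[0]):
            pre = torch.tanh(torch.einsum('krd,d->kr', self.W_in, x[t])
                             + torch.einsum('krs,ks->kr', self.W_rec, h))
            a = torch.sigmoid(self.leak).unsqueeze(-1)       # per-reservoir leak in (0, 1)
            h = (1 - a) * h + a * pre                        # leaky integration
            mem = self.proj(h).unsqueeze(0)                  # (1, k, d_model)
            out, _ = self.attn(mem, mem, mem)                # O(k^2) per step, constant in T
            feats.append(out.squeeze(0).mean(0))
        return torch.stack(feats)                            # (T, d_model)

m = EchoStateMemory(d_in=8)
print(m(torch.randn(20, 8)).shape)                           # torch.Size([20, 32])
```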

Result: EST ranks first overall in two of five categories on the Time Series Library benchmark (69 tasks across five categories), outperforming state-of-the-art baselines on classification and anomaly detection tasks while remaining competitive on short-term forecasting.

Conclusion: By shifting attention from the entire input sequence to a fixed set of evolving memory units, EST maintains high sensitivity to temporal events while achieving constant computational complexity per step, effectively breaking the quadratic scaling problem of standard Transformers.

Abstract: While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language, nor how it leverages working memory. Furthermore, Transformers encounter a computational limitation: quadratic complexity growth with sequence length. Motivated by these limitations, we aim to design architectures that leverage efficient working memory dynamics to overcome standard computational barriers. We introduce Echo State Transformers (EST), a hybrid architecture that resolves this challenge while demonstrating state-of-the-art performance in classification and detection tasks. EST integrates Transformer attention mechanisms with nodes from Reservoir Computing to create a fixed-size memory system. Drawing inspiration from Echo State Networks, our approach leverages several reservoirs (random recurrent networks) in parallel as a lightweight and efficient working memory. These independent units possess distinct and learned internal dynamics with an adaptive leak rate, enabling them to dynamically adjust their own temporality. By applying attention over this fixed number of units instead of over input tokens, EST achieves linear complexity for the whole sequence, effectively breaking the quadratic scaling problem of standard Transformers. We evaluate EST on a recent time-series benchmark, the Time Series Library, which comprises 69 tasks across five categories. Results show that EST ranks first overall in two of five categories, outperforming strong state-of-the-art baselines on classification and anomaly detection tasks, while remaining competitive on short-term forecasting. These results demonstrate that by shifting the attention mechanism from the entire input sequence to a fixed set of evolving memory units, it is possible to maintain high sensitivity to temporal events while achieving constant computational complexity per step.

[431] VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model

Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu

Main category: cs.LG

TL;DR: VFScale introduces a verifier-free test-time scalable diffusion model for complex reasoning tasks, using intrinsic energy functions as verifiers and hybrid Monte Carlo Tree Search for efficient inference.

DetailsMotivation: Current diffusion models lack test-time scaling for complex reasoning tasks, relying on external verifiers unlike human intrinsic reasoning, and suffer from inefficient search algorithms.

Method: VFScale uses MRNCL loss and KL regularization to improve energy landscape for reliable intrinsic verification, and integrates denoising with hybrid Monte Carlo Tree Search (hMCTS) for efficient inference.

Result: VFScale solves 88% of Maze problems at 15×15 size after training only on 6×6 mazes, while standard diffusion models completely fail on larger sizes.

Conclusion: VFScale enables scalable intrinsic reasoning in diffusion models without external verifiers, demonstrating strong generalization to larger problem sizes than trained on.

Abstract: Inspired by human SYSTEM 2 thinking, LLMs excel at complex reasoning tasks via extended Chain-of-Thought. However, similar test-time scaling for diffusion models to tackle complex reasoning remains largely unexplored. From existing work, two primary challenges emerge in this setting: (i) the dependence on an external verifier indicating a notable gap from intrinsic reasoning of human intelligence without any external feedback, and (ii) the lack of an efficient search algorithm. In this paper, we introduce the Verifier-free Test-time Scalable Diffusion Model (VFScale) to achieve scalable intrinsic reasoning, which equips number-of-sample test-time scaling with the intrinsic energy function of diffusion models as the verifier. Concretely, VFScale comprises two key innovations to address the aforementioned challenges. On the training side, VFScale consists of a novel MRNCL loss and a KL regularization to improve the energy landscape, ensuring that the learned energy function itself serves as a reliable verifier. On the inference side, VFScale integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS) to improve search efficiency. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of VFScale’s training objective and scalable inference method. In particular, trained with Maze sizes of up to $6\times6$, our VFScale solves 88% of Maze problems with much larger sizes of $15\times15$, while standard diffusion models completely fail. The code can be found at https://github.com/AI4Science-WestlakeU/VFScale.

[432] T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning

Julie Mordacq, David Loiseaux, Vicky Kalogeiton, Steve Oudot

Main category: cs.LG

TL;DR: T-REGS introduces a Minimum Spanning Tree-based regularization framework for self-supervised learning that addresses dimensional collapse and promotes uniformity in learned representations.

DetailsMotivation: Self-supervised learning often suffers from dimensional collapse (features occupy low-dimensional subspace) and poor distribution uniformity, which degrade representation quality. Existing methods don't simultaneously address both issues effectively.

Method: T-REGS uses Minimum Spanning Tree (MST) length over learned representations as a regularization term. The MST length captures both dimensionality and uniformity properties, with theoretical analysis showing it mitigates dimensional collapse and promotes uniformity on compact Riemannian manifolds.
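
The regularizer itself is easy to sketch: the MST topology is recomputed on detached distances (it is piecewise constant in the embeddings), while the edge lengths stay differentiable. The sign and weighting below are illustrative assumptions; the paper defines the exact objective:

```python
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_length(z: torch.Tensor) -> torch.Tensor:
    """Differentiable MST length over a batch of representations z of shape (n, d)."""
    D = torch.cdist(z, z)                                # pairwise Euclidean distances
    tree = minimum_spanning_tree(D.detach().cpu().numpy())
    rows, cols = tree.nonzero()                          # the n-1 MST edges
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return D[rows, cols].sum()                           # re-read lengths from live tensor

# Usage as a regularizer: encouraging a longer MST counteracts dimensional
# collapse and clumping; the -0.01 weight here is an illustrative choice.
z = torch.randn(128, 32, requires_grad=True)
loss = -0.01 * mst_length(z)                             # add to the SSL loss
loss.backward()
print(z.grad.norm())
```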

Result: Experiments on synthetic data and classical SSL benchmarks demonstrate T-REGS effectively enhances representation quality by addressing both dimensional collapse and distribution uniformity issues.

Conclusion: T-REGS provides a simple yet theoretically grounded regularization framework that simultaneously addresses two key challenges in self-supervised learning: dimensional collapse and distribution uniformity.

Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations or blurring. Recent studies have highlighted two pivotal properties for effective representations: (i) avoiding dimensional collapse-where the learned features occupy only a low-dimensional subspace, and (ii) enhancing uniformity of the induced distribution. In this work, we introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation. We provide theoretical analysis demonstrating that T-REGS simultaneously mitigates dimensional collapse and promotes distribution uniformity on arbitrary compact Riemannian manifolds. Several experiments on synthetic data and on classical SSL benchmarks validate the effectiveness of our approach at enhancing representation quality.

[433] Generative modelling with jump-diffusions

Adrian Baule

Main category: cs.LG

TL;DR: Generalization of diffusion models to non-Gaussian noise processes, specifically Lévy processes with Poisson jumps, showing improved performance over Gaussian models in certain regimes.

DetailsMotivation: While score-based diffusion models with Gaussian noise are state-of-the-art for image generation, recent observations suggest performance can be improved using heavy-tailed noise. This paper aims to generalize diffusion processes beyond Gaussian noise to a wider class of non-Gaussian noise processes.

Method: Proposes forward processes driven by standard Gaussian noise with superimposed Poisson jumps representing finite activity Lévy processes. Derives both probability flow ODE and SDE formulations. Provides detailed implementation for pure jump process with Laplace-distributed amplitudes, yielding generalized score function in closed analytical form.
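
For intuition, the forward process is straightforward to simulate: Gaussian increments plus a compound Poisson jump term with Laplace amplitudes. This sketch shows only the forward corruption; the generalized score function and its closed form are derived in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_jump_diffusion(x0, T=1.0, n_steps=1000, sigma=1.0, rate=2.0, scale=0.5):
    """Simulate dX = sigma dW + dJ, where J is a compound Poisson process with
    intensity `rate` and Laplace(scale)-distributed jump amplitudes."""
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x += sigma * np.sqrt(dt) * rng.normal(size=x.shape)   # Gaussian increment
        n_jumps = rng.poisson(rate * dt, size=x.shape)         # finite-activity jumps
        while n_jumps.max() > 0:                               # one Laplace draw per jump
            x += (n_jumps > 0) * rng.laplace(scale=scale, size=x.shape)
            n_jumps = np.maximum(n_jumps - 1, 0)
    return x

samples = np.stack([forward_jump_diffusion(np.zeros(2)) for _ in range(5)])
print(samples.round(3))
```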

Result: The generalized score function depends on jump amplitude distribution and can be estimated by minimizing MSE loss similar to conventional Gaussian models. The Laplace jump process implementation outperforms equivalent Gaussian model in specific parameter regimes.

Conclusion: Diffusion models can be successfully generalized to non-Gaussian noise processes, specifically Lévy processes with jumps, offering potential performance improvements over traditional Gaussian-based approaches in certain applications.

Abstract: Score-based diffusion models generate samples from an unknown target distribution using a time-reversed diffusion process. While such models represent state-of-the-art approaches in industrial applications such as artificial image generation, it has recently been noted that their performance can be further improved by considering injection noise with heavy-tailed characteristics. Here, I present a generalization of generative diffusion processes to a wide class of non-Gaussian noise processes. I consider forward processes driven by standard Gaussian noise with superimposed Poisson jumps representing a finite-activity Lévy process. The generative process is shown to be governed by a generalized score function that depends on the jump amplitude distribution and can be estimated by minimizing a simple MSE loss as in conventional Gaussian models. Both probability flow ODE and SDE formulations are derived using basic technical effort. A detailed implementation for a pure jump process with Laplace-distributed amplitudes yields a generalized score function in closed analytical form and is shown to outperform the equivalent Gaussian model in specific parameter regimes.

[434] Reparameterization Proximal Policy Optimization

Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

Main category: cs.LG

TL;DR: RPO (Reparameterization Proximal Policy Optimization) addresses limitations in RPG by enabling sample reuse through a PPO-style surrogate objective and adding stability mechanisms like clipped policy gradients and KL regularization.

DetailsMotivation: Current Reparameterization Policy Gradient (RPG) methods suffer from under-utilization of expensive dynamics Jacobians and training instability. Sample reuse could help with under-utilization but lacks principled frameworks and risks worsening instability.

Method: RPO establishes that RPG with sample reuse optimizes a PPO-style surrogate objective via Backpropagation Through Time. It integrates clipped policy gradient tailored for RPG and explicit KL divergence regularization for stability.
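
For reference, a generic clipped surrogate with an explicit KL penalty looks as follows; RPO's clipping is specifically tailored to reparameterization gradients computed by BPTT, which this generic sketch does not capture:

```python
import torch

def ppo_style_loss(logp_new, logp_old, advantages, eps=0.2, kl_coef=0.1):
    """Clipped PPO-style surrogate plus an explicit KL penalty (loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    kl = (logp_old - logp_new).mean()                      # KL(old || new) sample estimate
    return -surrogate + kl_coef * kl

logp_old = torch.randn(64)
logp_new = (logp_old + 0.1 * torch.randn(64)).requires_grad_()
advantages = torch.randn(64)
print(ppo_style_loss(logp_new, logp_old, advantages))
```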

Result: RPO maintains superior sample efficiency and consistently outperforms or achieves state-of-the-art performance across diverse tasks.

Conclusion: RPO provides a unified framework for on- and off-policy updates in reparameterization-based RL, addressing both sample efficiency and stability issues through principled sample reuse and regularization techniques.

Abstract: By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently outperforms or achieves state-of-the-art performance across diverse tasks.

[435] Enhancing Customer Contact Efficiency with Graph Neural Networks in Credit Card Fraud Detection Workflow

Menghao Huo, Kuan Lu, Qiang Zhu, Zhenrui Chen

Main category: cs.LG

TL;DR: Proposes an RGCN-based fraud detection framework to reduce false positives in credit card transactions.

DetailsMotivation: Credit card fraud causes significant financial losses, but current detection systems often flag legitimate transactions as fraudulent (false positives), disrupting user experience and eroding customer trust.

Method: Uses Relational Graph Convolutional Networks (RGCN) to leverage relational structure of transaction data, reducing need for direct customer confirmation while maintaining high detection performance.
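
A minimal dense-matrix RGCN layer illustrating the per-relation weighting; the relation names are hypothetical examples, and production systems would use sparse graph libraries:

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """Minimal relational GCN layer: one weight matrix per edge type (relation),
    messages aggregated per relation plus a self-loop term."""
    def __init__(self, d_in, d_out, num_relations):
        super().__init__()
        self.rel_weights = nn.Parameter(torch.randn(num_relations, d_in, d_out) * 0.1)
        self.self_weight = nn.Parameter(torch.randn(d_in, d_out) * 0.1)

    def forward(self, x, adj):                 # x: (n, d_in); adj: (R, n, n) row-normalized
        out = x @ self.self_weight
        for r in range(adj.shape[0]):
            out = out + adj[r] @ x @ self.rel_weights[r]
        return torch.relu(out)

# Toy transaction graph: 6 nodes, 2 relations (e.g., shared-card, shared-merchant).
n, R = 6, 2
adj = torch.randint(0, 2, (R, n, n)).float()
adj = adj / adj.sum(-1, keepdim=True).clamp(min=1)      # row-normalize neighbor sums
layer = RGCNLayer(d_in=8, d_out=4, num_relations=R)
print(layer(torch.randn(n, 8), adj).shape)               # torch.Size([6, 4])
```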

Result: Experiments were conducted on the IBM credit card transaction dataset to evaluate the effectiveness of the RGCN approach.

Conclusion: RGCN-based framework enhances accuracy and efficiency of fraud detection while reducing false positives that damage customer experience.

Abstract: Credit card fraud has been a persistent issue since the last century, causing significant financial losses to the industry. The most effective way to prevent fraud is by contacting customers to verify suspicious transactions. However, while these systems are designed to detect fraudulent activity, they often mistakenly flag legitimate transactions, leading to unnecessary declines that disrupt the user experience and erode customer trust. Frequent false positives can frustrate customers, resulting in dissatisfaction, increased complaints, and a diminished sense of security. To address these limitations, we propose a fraud detection framework incorporating Relational Graph Convolutional Networks (RGCN) to enhance the accuracy and efficiency of identifying fraudulent transactions. By leveraging the relational structure of transaction data, our model reduces the need for direct customer confirmation while maintaining high detection performance. Our experiments are conducted using the IBM credit card transaction dataset to evaluate the effectiveness of this approach.

[436] Inverse problems with diffusion models: MAP estimation via mode-seeking loss

Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour

Main category: cs.LG

TL;DR: VML-MAP: A new MAP estimation method for solving inverse problems using pre-trained unconditional diffusion models without task-specific training, based on variational mode-seeking loss minimization.

DetailsMotivation: Existing methods for solving inverse problems with pre-trained diffusion models rely on approximations and are computationally demanding. Need for more accurate and efficient MAP estimation techniques.

Method: Proposes variational mode-seeking loss (VML) derived from minimizing KL divergence between diffusion posterior and measurement posterior. VML-MAP algorithm minimizes VML at each reverse diffusion step to guide samples toward MAP estimates.

Result: Validated efficacy through extensive experiments on diverse image-restoration tasks across multiple datasets, showing improved performance and computational efficiency compared to existing methods.

Conclusion: VML-MAP provides an effective, theoretically grounded approach for solving inverse problems with pre-trained diffusion models, offering better accuracy and computational efficiency than previous methods.

Abstract: A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can also be computationally demanding. In this work, we propose a new MAP estimation strategy for solving inverse problems with a pre-trained unconditional diffusion model. Specifically, we introduce the variational mode-seeking loss (VML) and show that its minimization at each reverse diffusion step guides the generated sample towards the MAP estimate (modes in practice). VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived without any modeling approximations. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems via VML minimization, and validate its efficacy in both performance and computational time through extensive experiments on diverse image-restoration tasks across multiple datasets.

[437] Calibration and Transformation-Free Weight-Only LLMs Quantization via Dynamic Grouping

Xinzhe Zheng, Zhen-Qun Yang, Zishan Liu, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin

Main category: cs.LG

TL;DR: MSB is a calibration-free, transformation-free post-training quantization method that generalizes binary quantization to multi-bit settings by optimizing dynamic grouping to minimize within-group variance.

DetailsMotivation: Current low-bit post-training quantization methods for LLMs typically require calibration data, auxiliary transformations, and GPU tools, making deployment under tight memory and compute constraints challenging.

Method: MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities from per-tensor to block-wise configurations without calibration or intermediate transforms.
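
The abstract does not spell out the grouping solver, but the variance-minimizing objective can be illustrated with 1-D Lloyd iterations per 64-element group; treat this as a sketch of the criterion, not MSB's actual optimizer:

```python
import numpy as np

def quantize_group(w, bits=4, iters=25):
    """Assign a group of weights to 2**bits levels by 1-D Lloyd iterations,
    i.e., minimizing within-group variance; returns the dequantized weights."""
    levels = np.quantile(w, np.linspace(0, 1, 2**bits))        # initialize from quantiles
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(1)   # nearest-level assignment
        for k in range(len(levels)):                           # recenter each level
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()
    return levels[idx]

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 128))                                  # a block of weight rows
group = 64                                                     # elements per group
W_q = np.concatenate(
    [quantize_group(row[i:i + group]) for row in W for i in range(0, W.shape[1], group)]
).reshape(W.shape)
print(np.mean((W - W_q) ** 2))                                 # quantization MSE
```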

Result: On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup.

Conclusion: MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations, making LLM deployment more efficient under resource constraints.

Abstract: Large Language Models (LLMs) deliver strong performance but are difficult to deploy under tight memory and compute constraints. Low-bit post-training quantization (PTQ) is a promising direction; however, it typically relies on calibration data, auxiliary transformations, and GPU tools. To address these limitations, we propose MSB (Multi Scale Binary), a calibration-free and transformation-free PTQ method that generalizes binary quantization to multi-bit settings. MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities from per-tensor to block-wise configurations with 64-element groups per row, without calibration or intermediate transforms. We implement the optimization in a CPU-based solver for the quantization step and evaluate using standard bfloat16 execution without low-bit packing. On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup. Overall, MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations.

[438] Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints

Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang

Main category: cs.LG

TL;DR: A novel single-loop penalty-based stochastic algorithm for weakly convex constrained optimization with multiple functional inequality constraints, achieving state-of-the-art complexity for approximate KKT solutions.

DetailsMotivation: Constrained optimization with multiple functional inequality constraints is important in machine learning, but existing methods have limitations like slow convergence or complex double-loop designs. The paper addresses weakly convex objective and constraint functions.

Method: Proposes a single-loop penalty-based stochastic algorithm using a hinge-based penalty, allowing a constant penalty parameter. Extends to finite-sum coupled compositional objectives common in AI applications.
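
A toy instance of the hinge-based exact penalty: each constraint c_i(x) <= 0 contributes rho * max(0, c_i(x)) with a constant rho, and a single loop takes stochastic gradient steps on the penalized objective:

```python
import torch

def penalized_step(x, objective, constraints, rho=10.0, lr=0.01):
    """One gradient step on the exact-penalty objective
    F(x) + rho * sum_i max(0, c_i(x)), with constant penalty parameter rho."""
    loss = objective(x) + rho * sum(torch.relu(c(x)) for c in constraints)
    loss.backward()
    with torch.no_grad():
        x -= lr * x.grad
        x.grad.zero_()
    return loss.item()

# Toy problem: minimize ||x||^2 subject to x[0] >= 1, i.e., c(x) = 1 - x[0] <= 0.
x = torch.zeros(3, requires_grad=True)
for _ in range(500):
    penalized_step(x, lambda v: (v ** 2).sum(), [lambda v: 1.0 - v[0]])
print(x.detach())                                      # approximately [1, 0, 0]
```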

Result: Achieves state-of-the-art complexity for finding approximate KKT solutions. Validated through experiments on fair learning with ROC fairness constraints and continual learning with non-forgetting constraints.

Conclusion: The proposed method overcomes limitations of existing approaches and provides improved complexity for weakly convex constrained optimization problems in machine learning.

Abstract: Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a hinge-based penalty, which permits the use of a constant penalty parameter, enabling us to achieve a state-of-the-art complexity for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.

[439] SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

Jiesong Lian, Ruizhe Zhong, Zixiang Zhou, Xiaoyue Mi, Yixue Hao, Yuan Zhou, Qinglin Lu, Long Hu, Junchi Yan

Main category: cs.LG

TL;DR: SoliReward: A systematic framework for training video reward models using single-item binary annotations, cross-prompt pairing, hierarchical attention, and modified Bradley-Terry loss to address labeling noise, architectural limitations, and reward hacking in video generation alignment.

DetailsMotivation: Post-training alignment of video generation models with human preferences faces challenges: current data collection methods suffer from labeling noise due to in-prompt pairwise annotations, VLM-based reward model architectures are underexplored, and reward models are susceptible to reward hacking during post-training.

Method: 1) Collect high-quality data via single-item binary annotations and construct preference pairs using cross-prompt pairing; 2) Use Hierarchical Progressive Query Attention for feature aggregation; 3) Introduce modified Bradley-Terry loss that accommodates win-tie scenarios to regularize score distributions and provide nuanced preference signals.
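
The paper's modified BT loss is not given in the abstract; one classical way to accommodate win-tie outcomes is the Davidson (1970) extension of Bradley-Terry, sketched below as an illustrative stand-in:

```python
import torch

def davidson_bt_loss(s_a, s_b, label, log_nu=torch.tensor(0.0)):
    """Tie-aware Bradley-Terry (Davidson) NLL. label: 0 = A wins, 1 = B wins, 2 = tie.
    p(A wins) ~ exp(s_a), p(B wins) ~ exp(s_b), p(tie) ~ nu * exp((s_a + s_b) / 2)."""
    logits = torch.stack([s_a, s_b, log_nu + (s_a + s_b) / 2], dim=-1)
    return torch.nn.functional.cross_entropy(logits, label)

s_a = torch.tensor([1.2, 0.3, 0.8], requires_grad=True)   # reward-model scores, video A
s_b = torch.tensor([0.4, 0.9, 0.8])                        # reward-model scores, video B
labels = torch.tensor([0, 1, 2])                           # win / loss / tie annotations
loss = davidson_bt_loss(s_a, s_b, labels)
loss.backward()
print(loss.item(), s_a.grad)
```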

Result: Validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment. Demonstrates improvements in direct reward model evaluation metrics and efficacy of post-training on video generation models.

Conclusion: SoliReward provides a systematic framework that addresses key limitations in video reward model training, offering better data collection, architectural improvements, and regularization techniques for more effective alignment of video generation models with human preferences.

Abstract: Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM’s score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark are available at https://github.com/lian700/SoliReward

[440] Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine

Michael S. Yao, Osbert Bastani, Alma Andersson, Tommaso Biancalani, Aïcha Bentaieb, Claudia Iriondo

Main category: cs.LG

TL;DR: LEON uses LLMs as black-box optimizers with medical knowledge priors to propose personalized treatment plans without fine-tuning, outperforming traditional methods.

DetailsMotivation: Personalized medicine needs to optimize treatments based on patient factors, but surrogate models fail to generalize to unseen patient-treatment combinations. Medical knowledge sources like textbooks and biomedical graphs could provide better fitness signals for treatment optimization.

Method: LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON) uses LLMs as black-box optimizers without task-specific fine-tuning. It implements ‘optimization by prompting’ where LLMs serve as stochastic engines for proposing treatment designs, leveraging their ability to contextualize unstructured domain knowledge.
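
A bare-bones 'optimization by prompting' loop under heavy assumptions: llm() and surrogate() are hypothetical stubs, and the patient description, drug names, and scores are invented for illustration; LEON's entropy-guided proposal mechanism is not shown:

```python
import random

def llm(prompt: str) -> str:
    # Stub ignoring the prompt; a real system queries an LLM with it.
    return random.choice(["drugA 10mg", "drugA 20mg", "drugB 5mg", "drugB 10mg"])

def surrogate(treatment: str) -> float:
    # Stub fitness oracle; a real system uses the in silico surrogate model.
    return {"drugA 10mg": 0.4, "drugA 20mg": 0.7, "drugB 5mg": 0.5, "drugB 10mg": 0.9}[treatment]

history = []                                        # (treatment, score) pairs so far
for step in range(10):
    context = "\n".join(f"{t}: {s:.2f}" for t, s in sorted(history, key=lambda p: p[1]))
    prompt = (
        "Patient: 54F, EGFR-mutant NSCLC. Prior proposals and fitness scores:\n"
        f"{context}\nPropose a better treatment:"
    )
    candidate = llm(prompt)                         # LLM as stochastic proposal engine
    history.append((candidate, surrogate(candidate)))

print(max(history, key=lambda p: p[1]))             # best treatment found
```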

Result: Experiments on real-world optimization tasks show LEON outperforms both traditional optimization methods and other LLM-based approaches in proposing individualized treatments for patients.

Conclusion: LEON demonstrates that LLMs can effectively serve as optimization engines for personalized medicine when combined with domain knowledge priors, providing a promising approach for treatment optimization without requiring task-specific fine-tuning.

Abstract: The goal of personalized medicine is to discover a treatment regimen that optimizes a patient’s clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an in silico surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge - such as medical textbooks and biomedical knowledge graphs - can provide a meaningful alternative signal of the fitness of proposed treatments. To this end, we introduce LLM-based Entropy-guided Optimization with kNowledgeable priors (LEON), a mathematically principled approach to leverage large language models (LLMs) as black-box optimizers without any task-specific fine-tuning, taking advantage of their ability to contextualize unstructured domain knowledge to propose personalized treatment plans in natural language. In practice, we implement LEON via ‘optimization by prompting,’ which uses LLMs as stochastic engines for proposing treatment designs. Experiments on real-world optimization tasks show LEON outperforms both traditional and LLM-based methods in proposing individualized treatments for patients.

[441] Multi-Order Wavelet Derivative Transform for Deep Time Series Forecasting

Ziyu Zhou, Jiaxi Hu, Qingsong Wen, James T. Kwok, Yuxuan Liang

Main category: cs.LG

TL;DR: WaveTS introduces multi-order Wavelet Derivative Transform (WDT) for time series forecasting, capturing multi-scale time-sensitive patterns by operating on series derivatives to highlight rate-of-change cues and regime shifts.

DetailsMotivation: Existing Fourier Transform (FT) struggles with multi-scale, time-sensitive patterns in time series forecasting. While Wavelet Transform (WT) can capture multi-scale patterns, its coefficients are insensitive to change points, leading to suboptimal modeling.

Method: Proposes multi-order Wavelet Derivative Transform (WDT) based on WT, operating on series derivatives to extract time-aware patterns. Embeds WDT into WaveTS framework: decomposes input into multi-scale time-frequency coefficients, refines via linear layers, and reconstructs via inverse WDT.
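
The first-order version of the idea can be illustrated with an off-the-shelf discrete wavelet transform applied to a differenced series; WaveTS's multi-order WDT and its learned refinement are defined in the paper:

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 512)
series = np.sin(t) + 0.1 * rng.normal(size=t.shape)
series[300:] += 2.0                                   # an abrupt regime shift

# First-order WDT idea: transform the derivative (here, a discrete difference)
# so rate-of-change cues and change points dominate the coefficients.
derivative = np.diff(series, n=1)
coeffs = pywt.wavedec(derivative, "db4", level=3)     # multi-scale decomposition

# In WaveTS, the coefficients are refined by linear layers before being
# mapped back to the time domain with the inverse transform.
recon = pywt.waverec(coeffs, "db4")
print([c.shape for c in coeffs], recon.shape)
```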

Result: Extensive experiments on ten benchmark datasets demonstrate WaveTS achieves state-of-the-art forecasting accuracy while maintaining high computational efficiency.

Conclusion: WDT effectively captures multi-scale time-sensitive patterns in time series, and WaveTS framework provides superior forecasting performance with computational efficiency.

Abstract: In deep time series forecasting, the Fourier Transform (FT) is extensively employed for frequency representation learning. However, it often struggles to capture multi-scale, time-sensitive patterns. Although the Wavelet Transform (WT) can capture these patterns through frequency decomposition, its coefficients are insensitive to change points in time series, leading to suboptimal modeling. To mitigate these limitations, we introduce the multi-order Wavelet Derivative Transform (WDT) grounded in the WT, enabling the extraction of time-aware patterns spanning both the overall trend and subtle fluctuations. Compared with the standard FT and WT, which model the raw series, the WDT operates on the derivative of the series, selectively magnifying rate-of-change cues and exposing abrupt regime shifts that are particularly informative for time series modeling. Practically, we embed the WDT into a multi-branch framework named WaveTS, which decomposes the input series into multi-scale time-frequency coefficients, refines them via linear layers, and reconstructs them into the time domain via the inverse WDT. Extensive experiments on ten benchmark datasets demonstrate that WaveTS achieves state-of-the-art forecasting accuracy while retaining high computational efficiency.

[442] PromptSplit: Revealing Prompt-Level Disagreement in Generative Models

Mehdi Lotfian, Mohammad Jalali, Farzan Farnia

Main category: cs.LG

TL;DR: PromptSplit is a kernel-based framework for detecting and analyzing prompt-dependent disagreements between generative AI models across vision and language domains.

DetailsMotivation: As prompt-guided generative AI models proliferate across vision and language domains with different training data and architectures, there's a need for principled methods to identify which types of prompts lead to distinct model behaviors and disagreements between models.

Method: PromptSplit constructs joint prompt-output representations using tensor-product embeddings of prompt and image/text features, computes kernel covariance matrices for model pairs, and uses the eigenspace of the weighted difference between matrices to identify main directions of behavioral differences. Employs random-projection approximation for scalability with O(nr² + r³) complexity.
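
A numpy sketch of the pipeline with random features standing in for real prompt/output embeddings; the weighting w and the prompt-scoring rule are assumptions, see the paper for the exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 16, 16                                  # prompts, prompt-dim, output-dim

def joint_embedding(prompt_feats, output_feats):
    """Row-wise tensor (Kronecker) product of prompt and output features."""
    return np.einsum("np,nq->npq", prompt_feats, output_feats).reshape(len(prompt_feats), -1)

P = rng.normal(size=(n, p))                            # shared prompt features
Z1 = joint_embedding(P, rng.normal(size=(n, q)))       # model 1 output features (stand-ins)
Z2 = joint_embedding(P, rng.normal(size=(n, q)))       # model 2 output features

C1 = Z1.T @ Z1 / n                                     # kernel covariance matrices
C2 = Z2.T @ Z2 / n
w = 1.0                                                # weighting (paper-defined)
evals, evecs = np.linalg.eigh(C1 - w * C2)

# Top eigendirections of the difference flag where the models disagree most;
# projecting each prompt's embeddings onto them isolates responsible prompts.
scores = (Z1 - Z2) @ evecs[:, -1]
print(np.argsort(-np.abs(scores))[:5])                 # most disagreement-driving prompts
```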

Result: Experiments across text-to-image, text-to-text, and image-captioning settings show PromptSplit accurately detects ground-truth behavioral differences and isolates responsible prompts, providing an interpretable tool for detecting where generative models disagree.

Conclusion: PromptSplit offers a scalable, theoretically-grounded framework for analyzing prompt-dependent disagreements between generative models, enabling better understanding of model behaviors across different prompt types in multimodal settings.

Abstract: Prompt-guided generative AI models have rapidly expanded across vision and language domains, producing realistic and diverse outputs from textual inputs. The growing variety of such models, trained with different data and architectures, calls for principled methods to identify which types of prompts lead to distinct model behaviors. In this work, we propose PromptSplit, a kernel-based framework for detecting and analyzing prompt-dependent disagreement between generative models. For each compared model pair, PromptSplit constructs a joint prompt–output representation by forming tensor-product embeddings of the prompt and image (or text) features, and then computes the corresponding kernel covariance matrix. We utilize the eigenspace of the weighted difference between these matrices to identify the main directions of behavioral difference across prompts. To ensure scalability, we employ a random-projection approximation that reduces computational complexity to $O(nr^2 + r^3)$ for projection dimension $r$. We further provide a theoretical analysis showing that this approximation yields an eigenstructure estimate whose expected deviation from the full-dimensional result is bounded by $O(1/r^2)$. Experiments across text-to-image, text-to-text, and image-captioning settings demonstrate that PromptSplit accurately detects ground-truth behavioral differences and isolates the prompts responsible, offering an interpretable tool for detecting where generative models disagree.

[443] Feature Identification via the Empirical NTK

Jennifer Lin

Main category: cs.LG

TL;DR: Eigenanalysis of empirical neural tangent kernel (eNTK) can identify features learned by neural networks across three toy models, recovering ground-truth features in superposition models and Fourier features in modular arithmetic, with applications for feature discovery and phase transition detection.

DetailsMotivation: The paper aims to develop practical methods for mechanistic interpretability of neural networks by showing that eigenanalysis of the empirical neural tangent kernel can surface the actual features used by trained models, addressing the challenge of understanding what neural networks learn.

Method: The researchers analyze the eNTK (empirical neural tangent kernel) of trained neural networks across three standard toy models for mechanistic interpretability: Toy Models of Superposition (TMS), a 1-layer MLP trained on modular addition, and a 1-layer Transformer trained on modular addition. They perform eigenanalysis of the eNTK to identify top eigenspaces and compare them with ground-truth features.
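
Computing an eNTK Gram matrix is mechanical for a scalar-output model: stack per-sample parameter gradients into J and eigendecompose J Jᵀ. This sketch uses a toy MLP; multi-output models take per-logit gradients, and tracking the spectrum across checkpoints is what diagnoses grokking:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
X = torch.randn(64, 10)                                # probe inputs

# Per-sample gradients of the scalar output w.r.t. all parameters: rows of J.
rows = []
for i in range(X.shape[0]):
    model.zero_grad()
    model(X[i:i + 1]).sum().backward()
    rows.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
J = torch.stack(rows)                                  # (n, num_params)

eNTK = J @ J.T                                         # empirical NTK Gram matrix (n, n)
evals, evecs = torch.linalg.eigh(eNTK)
print(evals[-5:])                                      # top eigenvalues; eigenvectors
                                                       # span the candidate feature space
```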

Result: The eNTK successfully recovers ground-truth features in both sparse (high superposition) and dense regimes of TMS. In modular arithmetic, the eNTK can recover Fourier feature families. Additionally, layerwise eNTK analysis localizes features to specific layers, and the evolution of the eNTK spectrum can diagnose the grokking phase transition.

Conclusion: eNTK analysis provides a practical tool for feature discovery in neural networks and for detecting phase changes in small models, offering a new approach to mechanistic interpretability that could scale to more complex architectures.

Abstract: We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface the features used by trained neural networks. Across three standard toy models for mechanistic interpretability, Toy Models of Superposition (TMS), a 1-layer MLP trained on modular addition and a 1-layer Transformer trained on modular addition, we find that top eigenspaces of the eNTK align with ground-truth features. In TMS, the eNTK recovers the ground-truth features in both the sparse (high superposition) and dense regimes. In modular arithmetic, the eNTK can be used to recover Fourier feature families. Moreover, we provide evidence that a layerwise eNTK localizes features to specific layers and that the evolution of the eNTK spectrum can be used to diagnose the grokking phase transition. These results suggest that eNTK analysis may provide a practical handle for feature discovery and for detecting phase changes in small models.

[444] Rethinking Multi-Modal Learning from Gradient Uncertainty

Peizheng Guo, Jingyao Wang, Wenwen Qiang, Jiahuan Zhou, Changwen Zheng, Gang Hua

Main category: cs.LG

TL;DR: BOGC-MML: Bayesian-oriented gradient calibration for multi-modal learning that models gradient uncertainty using probability distributions and evidence theory to improve optimization reliability.

DetailsMotivation: Existing multi-modal learning optimization focuses on mitigating gradient direction conflicts, but performance fluctuations persist even in non-conflict settings. The authors argue that gradient reliability (uncertainty) is a decisive factor that needs explicit modeling beyond just direction alignment.

Method: Proposes Bayesian-Oriented Gradient Calibration (BOGC-MML) that models gradients as probability distributions to capture uncertainty, interprets gradient precision as evidence using subjective logic and evidence theory, and aggregates signals using a reduced Dempster’s combination rule to weight gradients based on reliability for calibrated updates.

Result: Extensive experiments demonstrate the effectiveness and advantages of the proposed method, showing improved optimization performance in multi-modal learning scenarios.

Conclusion: Explicit modeling of gradient uncertainty through Bayesian-oriented calibration provides a more reliable optimization approach for multi-modal learning, addressing limitations of existing methods that focus only on gradient direction conflicts.

Abstract: Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. While existing optimization strategies have made significant strides by mitigating gradient direction conflicts, we revisit MML from a gradient-based perspective to explore further improvements. Empirically, we observe an interesting phenomenon: performance fluctuations can persist in both conflict and non-conflict settings. Based on this, we argue that: beyond gradient direction, the intrinsic reliability of gradients acts as a decisive factor in optimization, necessitating the explicit modeling of gradient uncertainty. Guided by this insight, we propose Bayesian-Oriented Gradient Calibration for MML (BOGC-MML). Our approach explicitly models gradients as probability distributions to capture uncertainty, interpreting their precision as evidence within the framework of subjective logic and evidence theory. By subsequently aggregating these signals using a reduced Dempster’s combination rule, BOGC-MML adaptively weights gradients based on their reliability to generate a calibrated update. Extensive experiments demonstrate the effectiveness and advantages of the proposed method.

[445] Generative Modeling via Drifting

Mingyang Deng, He Li, Tianhong Li, Yilun Du, Kaiming He

Main category: cs.LG

TL;DR: Drifting Models: A new generative modeling paradigm that evolves pushforward distributions during training to achieve one-step inference with state-of-the-art results on ImageNet.

DetailsMotivation: Current generative models like diffusion and flow-based models require iterative inference steps, which can be computationally expensive. The authors aim to develop a method that achieves high-quality generation in a single step while maintaining competitive performance.

Method: Proposes Drifting Models that evolve the pushforward distribution during training using a drifting field that governs sample movement. The method achieves equilibrium when distributions match, allowing neural network optimizers to evolve the distribution naturally. This enables one-step inference at test time.

Result: Achieves state-of-the-art results on ImageNet at 256×256 resolution with FID scores of 1.54 in latent space and 1.61 in pixel space, demonstrating high-quality one-step generation.

Conclusion: Drifting Models offer a new paradigm for generative modeling that enables efficient one-step inference while maintaining high-quality results, opening opportunities for improved generative model efficiency.

Abstract: Generative modeling can be formulated as learning a mapping f such that its pushforward distribution matches the data distribution. The pushforward behavior can be carried out iteratively at inference time, for example in diffusion and flow-based models. In this paper, we propose a new paradigm called Drifting Models, which evolve the pushforward distribution during training and naturally admit one-step inference. We introduce a drifting field that governs the sample movement and achieves equilibrium when the distributions match. This leads to a training objective that allows the neural network optimizer to evolve the distribution. In experiments, our one-step generator achieves state-of-the-art results on ImageNet at 256×256 resolution, with an FID of 1.54 in latent space and 1.61 in pixel space. We hope that our work opens up new opportunities for high-quality one-step generation.

[446] Position: Epistemic uncertainty estimation methods are fundamentally incomplete

Sebastián Jiménez, Mira Jürgens, Willem Waegeman

Main category: cs.LG

TL;DR: Current second-order uncertainty quantification methods are fundamentally incomplete, failing to properly disentangle aleatoric and epistemic uncertainty due to unaccounted bias and partial variance capture.

DetailsMotivation: The paper aims to address limitations in widely used uncertainty quantification methods for supervised learning, particularly in safety-critical applications where trustworthy uncertainty estimates are essential.

Method: The authors analyze existing second-order uncertainty methods that separate aleatoric (data-related) and epistemic (model-related) uncertainty, identifying fundamental flaws in their approach to uncertainty decomposition.

Result: Results show that current methods: 1) Contaminate uncertainty estimates by overestimating aleatoric uncertainty and underestimating epistemic uncertainty due to unaccounted bias, and 2) Capture only partial contributions to variance-driven epistemic uncertainty, leading to incomplete and difficult-to-interpret estimates.

Conclusion: Current epistemic uncertainty estimates can only be safely used in critical applications when their limitations are fully understood by end users and acknowledged by AI developers, highlighting the need for more complete uncertainty quantification methods.

Abstract: Identifying and disentangling sources of predictive uncertainty is essential for trustworthy supervised learning. We argue that widely used second-order methods that disentangle aleatoric and epistemic uncertainty are fundamentally incomplete. First, we show that unaccounted bias contaminates uncertainty estimates by overestimating aleatoric (data-related) uncertainty and underestimating the epistemic (model-related) counterpart, leading to incorrect uncertainty quantification. Second, we demonstrate that existing methods capture only partial contributions to the variance-driven part of epistemic uncertainty; different approaches account for different variance sources, yielding estimates that are incomplete and difficult to interpret. Together, these results highlight that current epistemic uncertainty estimates can only be used in safety-critical and high-stakes decision-making when limitations are fully understood by end users and acknowledged by AI developers.

[447] Extreme Weather Nowcasting via Local Precipitation Pattern Prediction

Changhoon Song, Teng Yuan Chang, Youngjoon Hong

Main category: cs.LG

TL;DR: exPreCast: Efficient deterministic framework for precipitation nowcasting using balanced radar data with local spatiotemporal attention and texture-preserving upsampling.

DetailsMotivation: Current precipitation nowcasting models face challenges: diffusion-based generative ensembles are computationally expensive, deterministic models are biased toward normal rainfall, and benchmark datasets are skewed toward either ordinary or extreme events, limiting real-world applicability.

Method: Proposes exPreCast framework with local spatiotemporal attention, texture-preserving cubic dual upsampling decoder, and temporal extractor for flexible forecasting horizons. Introduces balanced KMA radar dataset containing both ordinary precipitation and extreme events.

Result: Achieves state-of-the-art performance on established benchmarks (SEVIR and MeteoNet) and the new balanced KMA dataset, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.

Conclusion: The proposed deterministic framework with balanced dataset addresses computational efficiency and bias issues in precipitation nowcasting, enabling practical real-time applications while maintaining accuracy across diverse rainfall conditions.

Abstract: Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed, either dominated by ordinary rainfall events or restricted to extreme rainfall episodes, thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.

[448] Little By Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong

Main category: cs.LG

TL;DR: MoRAM: A continual learning method using fine-grained rank-1 adapters as associative memory units, eliminating explicit routers and improving plasticity-stability trade-offs for large pre-trained models.

DetailsMotivation: Existing LoRA-based Mixture-of-Experts methods for continual learning suffer from redundancy, interference, and ambiguous routing due to coarse-grained experts. These methods add new task-specific adapters but face issues with expert duplication, conflict, and routing degradation as experts accumulate.

Method: MoRAM treats weight matrices as linear associative memories and achieves continual learning through gradual incrementing of atomic rank-1 memory experts. Each rank-1 adapter acts as a fine-grained MoE expert or associative memory unit. By viewing rank-1 adapters as key-value pairs, it eliminates explicit routers using a self-activation mechanism where each memory atom evaluates its own relevance via its intrinsic key, transforming adaptation into content-addressable retrieval.
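
A sketch of the rank-1 key-value view with a sigmoid self-activation (the specific gating form is an assumption); new atoms are appended as tasks arrive while the base weight stays frozen:

```python
import torch
import torch.nn as nn

class RankOneMemoryLayer(nn.Module):
    """Sketch: frozen base weight plus a growing bank of rank-1 key-value
    memories; each atom self-activates via its intrinsic key (no router)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5, requires_grad=False)
        self.keys = nn.ParameterList()                  # one (d_in,) key per memory atom
        self.values = nn.ParameterList()                # one (d_out,) value per memory atom

    def add_atom(self):                                 # called when a new task arrives
        self.keys.append(nn.Parameter(torch.randn(self.W0.shape[1]) * 0.02))
        self.values.append(nn.Parameter(torch.zeros(self.W0.shape[0])))

    def forward(self, x):                               # x: (batch, d_in)
        out = x @ self.W0.T
        for k, v in zip(self.keys, self.values):
            gate = torch.sigmoid(x @ k)                 # self-activation by intrinsic key
            out = out + gate.unsqueeze(-1) * v          # content-addressable retrieval
        return out

layer = RankOneMemoryLayer(16, 8)
for _ in range(3):                                      # three tasks -> three atoms
    layer.add_atom()
print(layer(torch.randn(4, 16)).shape)                  # torch.Size([4, 8])
```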

Result: Extensive experiments on CLIP and LLMs demonstrate that MoRAM significantly outperforms state-of-the-art baselines, achieving superior plasticity-stability trade-offs, improving generalization while mitigating forgetting.

Conclusion: MoRAM provides an effective approach for continual learning with large pre-trained models by using fine-grained rank-1 associative memories, eliminating routing complexity, and achieving better balance between learning new tasks and retaining old knowledge.

Abstract: Continual learning (CL) with large pre-trained models is challenged by task interference and catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods mitigate forgetting by adding new task-specific adapters and freezing old ones, but often suffer from redundancy, interference, and ambiguous routing due to coarse-grained experts and routing. Coarse-grained experts (i.e., full LoRA adapters with large rank) encode low-specialty information. Newly added experts often duplicate or conflict with existing ones, causing redundancy and interference. Their low specialization further confuses the router, accelerating routing degradation and forgetting as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices function as linear associative memories, MoRAM achieves CL as gradual incrementing of atomic rank-1 memory experts. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 adapters as key-value pairs, we eliminate explicit routers in MoE-LoRA, using a self-activation mechanism where each memory atom evaluates its own relevance via its intrinsic key. This transforms the adaptation process into robust, content-addressable retrieval. Extensive experiments on CLIP and LLMs demonstrate that MoRAM significantly outperforms state-of-the-art baselines, achieving superior plasticity-stability trade-offs, improving generalization while mitigating forgetting.

[449] A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

Main category: cs.LG

TL;DR: Local reconstruction enables efficient post-pruning adaptation for LLMs using small calibration sets and minimal compute, challenging the notion that such adaptation is impractical.

DetailsMotivation: Post-training pruning reduces inference costs but causes quality degradation. Retraining pruned LLMs is considered impractical due to computational costs, motivating complex pruning criteria. This work revisits post-pruning adaptation to make it practical.

Method: Local reconstruction: adapt only small pruned submodels at a time using small calibration sets by matching intermediate activations of the dense model. Studied across model families and scales up to 72B parameters.
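
A toy version of local reconstruction on a single linear "submodel": prune by magnitude, then adapt only the surviving weights so the pruned layer matches the dense layer's activations on a small calibration batch. Sizes and the optimizer choice are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dense = nn.Linear(64, 64, bias=False)                  # stands in for one pruned submodel
X = torch.randn(256, 64)                               # small calibration batch
target = dense(X).detach()                             # dense model's intermediate activations

# Magnitude pruning: zero out the 50% smallest-magnitude weights, mask fixed.
mask = (dense.weight.abs() >= dense.weight.abs().median()).float()
pruned = nn.Linear(64, 64, bias=False)
pruned.weight.data = dense.weight.data * mask

# Local reconstruction: adapt only this submodel to match dense activations.
opt = torch.optim.Adam(pruned.parameters(), lr=1e-3)
for _ in range(300):
    loss = ((pruned(X) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    pruned.weight.grad.mul_(mask)                      # keep pruned entries at zero
    opt.step()
print(loss.item())                                     # reconstruction error
```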

Result: 1) Local reconstruction matches post-pruning PEFT with 10x less data and compute. 2) “Free lunch” regime: quality is essentially unchanged across a wide range of submodel sizes. 3) The pruning criterion becomes less critical with reconstruction: simple methods become competitive again.

Conclusion: Post-pruning adaptation is practical for LLMs via local reconstruction, challenging prevailing narrative. Enables efficient adaptation with minimal resources while reducing importance of sophisticated pruning criteria.

Abstract: Post-training pruning substantially reduces inference costs but often causes severe quality degradation without adapting the remaining weights. For LLMs, such retraining is commonly considered impractical due to large computational costs, motivating increasingly sophisticated pruning criteria to compensate by selecting better sparsity patterns. In this work, we revisit post-pruning adaptation and study local reconstruction: adapting only a small pruned submodel at a time using a small calibration set by matching intermediate activations of the dense model. We conduct a large-scale study across model families and scales (up to 72B parameters) and establish three central results. First, local reconstruction is an effective adaptation mechanism for LLMs, matching post-pruning PEFT while using over an order of magnitude less data and compute. Second, we identify a broad “free lunch” regime in reconstruction granularity: across a wide range of submodel sizes, final quality remains essentially unchanged, allowing granularity to be chosen based on memory constraints. Finally, with reconstruction, the pruning criterion becomes less critical: performance gaps between sophisticated methods and simple baselines shrink with model size, making simple methods competitive again. Collectively, our results challenge the prevailing narrative that post-pruning adaptation is impractical for LLMs.

[450] SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning

Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor

Main category: cs.LG

TL;DR: SMMILE is the first expert-driven multimodal in-context learning benchmark for medical tasks, evaluating MLLMs’ ability to learn from multimodal examples. Most models show poor medical multimodal ICL, with only 8-9% improvement over zero-shot, and exhibit sensitivity to irrelevant examples and recency bias.

DetailsMotivation: Multimodal ICL is underexplored in medicine despite clinicians' need to adapt from limited examples. Current MLLMs show advances in medical VQA but their ability to learn multimodal tasks from context is unknown, creating a gap in understanding MLLMs' medical ICL capabilities.

Method: Created SMMILE benchmark with 111 problems (517 question-image-answer triplets) across 6 specialties and 13 imaging modalities, curated by 11 medical experts. Also created SMMILE++ with 1038 permuted problems. Evaluated 15 MLLMs on their multimodal ICL ability in medical tasks.

Result: Most MLLMs show moderate to poor multimodal ICL ability in medicine. ICL provides only 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. Models are sensitive to irrelevant examples (single noisy example degrades performance by up to 9.5%) and show recency bias (relevant example last improves performance by up to 71%).

Conclusion: Current MLLMs have critical limitations and biases in learning multimodal medical tasks from context. The SMMILE benchmark reveals significant gaps in multimodal ICL capabilities for medical applications, highlighting the need for improved medical multimodal reasoning in MLLMs.

Abstract: Multimodal in-context learning (ICL) remains underexplored despite significant potential for domains such as medicine. Clinicians routinely encounter diverse, specialized tasks requiring adaptation from limited examples, such as drawing insights from a few relevant prior cases or considering a constrained set of differential diagnoses. While multimodal large language models (MLLMs) have shown advances in medical visual question answering (VQA), their ability to learn multimodal tasks from context is largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL benchmark for medical tasks. Eleven medical experts curated problems, each including a multimodal query and multimodal in-context examples as task demonstrations. SMMILE encompasses 111 problems (517 question-image-answer triplets) covering 6 medical specialties and 13 imaging modalities. We further introduce SMMILE++, an augmented variant with 1038 permuted problems. A comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit moderate to poor multimodal ICL ability in medical tasks. In open-ended evaluations, ICL contributes only an 8% average improvement over zero-shot on SMMILE and 9.4% on SMMILE++. We observe a susceptibility for irrelevant in-context examples: even a single noisy or irrelevant example can degrade performance by up to 9.5%. Moreover, we observe that MLLMs are affected by a recency bias, where placing the most relevant example last can lead to substantial performance improvements of up to 71%. Our findings highlight critical limitations and biases in current MLLMs when learning multimodal medical tasks from context. SMMILE is available at https://smmile-benchmark.github.io.

[451] Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Encoder-Only and Decoder-Only Transformers

Marko Karbevski, Antonij Mijoski

Main category: cs.LG

TL;DR: Theoretical analysis shows Query weights in transformers are redundant and can be replaced with identity matrix, reducing attention parameters by 25% while maintaining performance with adjusted scaling and weight decay.

DetailsMotivation: To investigate parameter redundancy in transformer architectures and determine if Query, Key, Value weight triplets can be reduced to improve efficiency and simplify optimization.

Method: Theoretical analysis under mild assumptions proves Query weights are redundant and can be replaced with identity matrix. Validation through training decoder-only GPT-style small models from scratch with adjusted attention scaling and weight decay.
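
The claim is easy to make concrete. Below is a sketch of a single causal attention head with W_Q fixed to the identity, so the logits are linear in the remaining learned weights (the paper's exact scaling adjustment is not reproduced here; the default 1/sqrt(d) is a placeholder).

```python
import math
import torch

def kv_only_attention(x, w_k, w_v, scale=None):
    """Single-head causal attention with W_Q fixed to the identity.
    x: (batch, seq, d). Attention logits are linear in w_k."""
    q = x                                            # identity query projection
    k, v = x @ w_k, x @ w_v
    scale = scale or 1.0 / math.sqrt(x.shape[-1])    # may need re-tuning per the paper
    logits = q @ k.transpose(-2, -1) * scale
    seq = x.shape[1]
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    logits = logits.masked_fill(causal, float("-inf"))
    return torch.softmax(logits, dim=-1) @ v

x = torch.randn(2, 5, 32)
w_k = torch.randn(32, 32) / 32**0.5
w_v = torch.randn(32, 32) / 32**0.5
print(kv_only_attention(x, w_k, w_v).shape)  # torch.Size([2, 5, 32])
```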

Result: Reduced models match baseline performance despite 25% fewer attention parameters. Training remains stable at over 3× lower weight decay, suggesting Query weight elimination provides implicit regularization. Also discovered structural expressivity boundary in ReLU MLPs with skip connections.

Conclusion: Query weights in transformers are redundant and can be eliminated, offering efficiency gains and simplified optimization. Findings motivate investigation across modalities and at scale where stability and efficiency gains could be most impactful.

Abstract: We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that Query weights are redundant and can be replaced with the identity matrix, reducing attention parameters by $25\%$. This also simplifies optimization: attention logits become linear rather than quadratic in learned weights. Validating on decoder-only GPT-style small models trained from scratch, we find that with adjusted attention scaling and weight decay, reduced models match baseline performance despite fewer parameters. Training remains stable at over $3\times$ lower weight decay, suggesting Query weight elimination provides implicit regularization. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.

[452] Adversarial generalization of unfolding (model-based) networks

Vicky Kouni

Main category: cs.LG

TL;DR: Theoretical analysis of adversarial generalization in unfolding networks for compressed sensing, providing first theoretical bounds and showing overparameterization can enhance robustness.

DetailsMotivation: Unfolding networks are interpretable networks used for inverse problems like compressed sensing, but their adversarial robustness lacks theoretical understanding despite critical applications in domains like medical imaging and cryptography where robustness is essential.

Method: Study adversarial generalization of overparameterized unfolding networks under l2-norm constrained attacks using fast gradient sign method. Deploy new framework to estimate adversarial Rademacher complexity and provide tight adversarial generalization error bounds.
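
The attack model is the one-step l2-bounded gradient method. A sketch under stated assumptions (the linear model and MSE loss are stand-ins for an unfolding network and its recovery objective):

```python
import torch

def l2_fast_gradient(model, x, y, eps, loss_fn=torch.nn.functional.mse_loss):
    """One-step attack with an l2 budget: move along the loss gradient,
    rescaled to norm eps (the l2 analogue of FGSM)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    (grad,) = torch.autograd.grad(loss, x_adv)
    g = grad.flatten(1)
    g = g / (g.norm(dim=1, keepdim=True) + 1e-12)    # unit l2 direction per sample
    return (x_adv + eps * g.view_as(x_adv)).detach()

model = torch.nn.Linear(20, 20)          # stand-in for an unfolding network
x, y = torch.randn(8, 20), torch.randn(8, 20)
x_adv = l2_fast_gradient(model, x, y, eps=0.1)
print((x_adv - x).flatten(1).norm(dim=1))  # each perturbation has norm ~0.1
```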

Result: First theoretical analysis on adversarial generalization of unfolding networks, with tight error bounds relative to attack level. Experiments on real-world data corroborate theory, showing overparameterization can be exploited to promote adversarial robustness.

Conclusion: Provides foundational theoretical understanding of adversarial robustness in unfolding networks, demonstrating that overparameterization can be leveraged for efficient robustification of neural networks in inverse problem applications.

Abstract: Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family’s overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.

[453] RMT-KD: Random Matrix Theoretic Causal Knowledge Distillation

Davide Ettori, Nastaran Darabi, Sureshkumar Senthilkumar, Amit Ranjan Trivedi

Main category: cs.LG

TL;DR: RMT-KD uses Random Matrix Theory for knowledge distillation to compress deep learning models by preserving only informative spectral directions, achieving 80% parameter reduction with minimal accuracy loss.

DetailsMotivation: Large models like BERT and ResNet are computationally expensive to deploy at the edge due to their size and resource demands, requiring efficient compression methods that maintain performance.

Method: RMT-KD leverages Random Matrix Theory to identify informative directions in hidden representations via spectral properties, applying layer-by-layer causal reduction with self-distillation to maintain stability and accuracy.
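
The paper's exact selection rule is not spelled out in the summary; a standard RMT choice, sketched below, keeps only covariance eigen-directions above the Marchenko-Pastur bulk edge, which separates signal eigenvalues from those consistent with pure noise.

```python
import numpy as np

def rmt_informative_directions(h, sigma2=1.0):
    """Keep eigen-directions of the activation covariance whose eigenvalues
    exceed the Marchenko-Pastur bulk edge (a common RMT signal/noise split;
    the paper's exact criterion may differ). h: (n_samples, d)."""
    n, d = h.shape
    h = h - h.mean(0)
    evals, evecs = np.linalg.eigh(h.T @ h / n)
    lam_plus = sigma2 * (1 + np.sqrt(d / n)) ** 2   # MP upper edge
    keep = evals > lam_plus
    return evecs[:, keep], evals[keep]

rng = np.random.default_rng(0)
noise = rng.standard_normal((2000, 100))
signal = rng.standard_normal((2000, 1)) * np.full((1, 100), 0.5)  # one strong direction
basis, spectrum = rmt_informative_directions(noise + signal)
print(basis.shape)   # (100, k): only a few directions survive the bulk edge
```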

Result: Achieves up to 80% parameter reduction with only 2% accuracy loss on GLUE and CIFAR-10, delivering 2.8x faster inference and nearly halved power consumption.

Conclusion: RMT-KD establishes a mathematically grounded approach to network distillation that effectively compresses models for edge deployment while preserving performance.

Abstract: Large deep learning models such as BERT and ResNet achieve state-of-the-art performance but are costly to deploy at the edge due to their size and compute demands. We present RMT-KD, a compression method that leverages Random Matrix Theory (RMT) for knowledge distillation to iteratively reduce network size. Instead of pruning or heuristic rank selection, RMT-KD preserves only informative directions identified via the spectral properties of hidden representations. RMT-based causal reduction is applied layer by layer with self-distillation to maintain stability and accuracy. On GLUE and CIFAR-10, RMT-KD achieves up to 80% parameter reduction with only 2% accuracy loss, delivering 2.8x faster inference and nearly halved power consumption. These results establish RMT-KD as a mathematically grounded approach to network distillation.

[454] EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs

Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, Amit Ranjan Trivedi

Main category: cs.LG

TL;DR: EigenTrack is an interpretable real-time detector that uses spectral geometry of hidden activations to detect hallucination and out-of-distribution errors in LLMs before surface errors appear.

DetailsMotivation: LLMs are prone to hallucination and OOD errors, but existing detection methods are either black-box (lacking interpretability), grey-box (requiring resampling), or white-box (lacking temporal context). There's a need for a real-time, interpretable detector that can identify errors before they manifest in outputs.

Method: Uses spectral geometry of hidden activations as a compact global signature of model dynamics. Streams covariance-spectrum statistics (entropy, eigenvalue gaps, KL divergence from random baselines) into a lightweight recurrent classifier. This tracks temporal shifts in representation structure with a single forward pass, preserving temporal context and aggregating global signals.
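
A sketch of the kind of per-step spectral features described above (using a uniform spectrum as a stand-in for the paper's random baseline, which is an assumption on our part):

```python
import numpy as np

def spectral_signature(h, eps=1e-12):
    """Spectral features of hidden activations h: (tokens, d). Returns the
    entropy of the normalized covariance spectrum, the top eigenvalue gap,
    and KL divergence from a uniform baseline spectrum."""
    h = h - h.mean(0)
    evals = np.linalg.eigvalsh(h.T @ h / len(h))     # ascending eigenvalues
    p = np.clip(evals, eps, None)
    p = p / p.sum()
    entropy = -(p * np.log(p)).sum()
    gap = evals[-1] - evals[-2]                      # top eigenvalue gap
    kl_from_uniform = (p * np.log(p * len(p))).sum()
    return entropy, gap, kl_from_uniform

feats = [spectral_signature(np.random.randn(32, 16)) for _ in range(5)]
print(np.round(feats, 3))   # one feature triple per generation step
```

In the paper these statistics are streamed into a lightweight recurrent classifier; the classifier itself is omitted here.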

Result: EigenTrack offers interpretable accuracy-latency trade-offs and can detect hallucination and OOD drift before surface errors appear. It operates in real-time with only a single forward pass, unlike methods requiring resampling.

Conclusion: EigenTrack provides a novel approach to LLM error detection that combines interpretability, real-time operation, and temporal context awareness, addressing limitations of existing black-, grey-, and white-box methods.

Abstract: Large language models (LLMs) offer broad utility but remain prone to hallucination and out-of-distribution (OOD) errors. We propose EigenTrack, an interpretable real-time detector that uses the spectral geometry of hidden activations, a compact global signature of model dynamics. By streaming covariance-spectrum statistics such as entropy, eigenvalue gaps, and KL divergence from random baselines into a lightweight recurrent classifier, EigenTrack tracks temporal shifts in representation structure that signal hallucination and OOD drift before surface errors appear. Unlike black- and grey-box methods, it needs only a single forward pass without resampling. Unlike existing white-box detectors, it preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.

[455] On the Empirical Power of Goodness-of-Fit Tests in Watermark Detection

Weiqing He, Xiang Li, Tianqi Shang, Li Shen, Weijie Su, Qi Long

Main category: cs.LG

TL;DR: Systematic evaluation of goodness-of-fit tests for LLM watermark detection shows they improve detection power and robustness, especially for low-temperature text repetition.

DetailsMotivation: LLMs raise concerns about content authenticity, and while text watermarks offer provable origin verification, goodness-of-fit tests remain underexplored for watermark detection despite being a natural fit for analyzing statistical signals in generated text.

Method: Systematically evaluated eight goodness-of-fit tests across three popular watermarking schemes using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods.
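
For several schemes the pivotal statistics are i.i.d. Uniform(0,1) under human-written text, so any goodness-of-fit test against uniformity is a detector. A minimal sketch with the Kolmogorov-Smirnov test (the skew model for watermarked text is purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Under the null (human text), pivotal statistics are i.i.d. Uniform(0,1);
# watermarked text skews them upward (illustrative toy model).
human = rng.uniform(size=300)
watermarked = rng.uniform(size=300) ** 0.5   # stochastically larger than uniform

for name, u in [("human", human), ("watermarked", watermarked)]:
    ks = stats.kstest(u, "uniform")          # goodness-of-fit vs. the null
    print(f"{name:12s} KS stat={ks.statistic:.3f} p={ks.pvalue:.2e}")
```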

Result: General goodness-of-fit tests improve both detection power and robustness of watermark detectors. Text repetition in low-temperature settings gives GoF tests a unique advantage not exploited by existing methods.

Conclusion: Classic goodness-of-fit tests are a simple yet powerful and underused tool for watermark detection in LLMs, offering improved performance especially in detecting statistical patterns from text repetition.

Abstract: Large language models (LLMs) raise concerns about content authenticity and integrity because they can generate human-like text at scale. Text watermarks, which embed detectable statistical signals into generated text, offer a provable way to verify content origin. Many detection methods rely on pivotal statistics that are i.i.d. under human-written text, making goodness-of-fit (GoF) tests a natural tool for watermark detection. However, GoF tests remain largely underexplored in this setting. In this paper, we systematically evaluate eight GoF tests across three popular watermarking schemes, using three open-source LLMs, two datasets, various generation temperatures, and multiple post-editing methods. We find that general GoF tests can improve both the detection power and robustness of watermark detectors. Notably, we observe that text repetition, common in low-temperature settings, gives GoF tests a unique advantage not exploited by existing methods. Our results highlight that classic GoF tests are a simple yet powerful and underused tool for watermark detection in LLMs.

[456] GraphToxin: Reconstructing Full Unlearned Graphs from Graph Unlearning

Ying Song, Balaji Palanisamy

Main category: cs.LG

TL;DR: GraphToxin: First full graph reconstruction attack against graph unlearning that recovers deleted nodes and sensitive connections using curvature matching guidance.

DetailsMotivation: Graph unlearning promises compliance with data privacy regulations but has vulnerabilities - residual traces remain and multiple parties create attack surfaces that can be exploited to recover supposedly erased data.

Method: Proposes GraphToxin with novel curvature matching module for fine-grained guidance in unlearned graph recovery; extends to multiple-node removal under white-box and black-box settings; includes systematic evaluation framework for worst-case analysis.
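
The curvature matching module itself is not specified in this summary; as one illustrative stand-in, the sketch below scores candidate reconstructions by how well their Forman-Ricci edge-curvature profile matches a reference graph's (the curvature definition and the matching loss are both assumptions).

```python
import networkx as nx

def forman_curvatures(g):
    """Simplest Forman-Ricci curvature for unweighted graphs,
    F(u, v) = 4 - deg(u) - deg(v), per edge."""
    return {e: 4 - g.degree(e[0]) - g.degree(e[1]) for e in g.edges}

def curvature_gap(g_candidate, g_reference):
    """Scalar mismatch a reconstruction attack could minimize: compare the
    sorted curvature profiles of a candidate graph and a reference."""
    a = sorted(forman_curvatures(g_candidate).values())
    b = sorted(forman_curvatures(g_reference).values())
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])) / max(n, 1)

ref = nx.karate_club_graph()
guess = nx.gnm_random_graph(34, ref.number_of_edges(), seed=0)
print(curvature_gap(guess, ref))   # lower = geometrically closer candidate
```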

Result: GraphToxin successfully recovers deleted individuals’ information, personal links, and sensitive content from connections; works in multiple-node removal scenarios; existing defenses are largely ineffective or sometimes amplify the attack.

Conclusion: GraphToxin demonstrates severe privacy risks in graph unlearning, highlighting urgent need for more effective and robust defenses against such reconstruction attacks.

Abstract: Graph unlearning has emerged as a promising solution to comply with “the right to be forgotten” regulations by enabling the removal of sensitive information upon request. However, this solution is not foolproof. The involvement of multiple parties creates new attack surfaces, and residual traces of deleted data can still remain in the unlearned graph neural networks (GNNs). These vulnerabilities can be exploited by attackers to recover the supposedly erased samples, thereby undermining the intended functionality of graph unlearning. In this work, we propose GraphToxin, the first full graph reconstruction attack against graph unlearning. Specifically, we introduce a novel curvature matching module to provide fine-grained guidance for unlearned graph recovery. We demonstrate that GraphToxin can successfully subvert the regulatory guarantees expected from graph unlearning: it can recover not only a deleted individual’s information and personal links but also sensitive content from their connections, thereby posing substantially more detrimental threats. Furthermore, we extend GraphToxin to multiple-node removal under both white-box and black-box settings, showcasing its practical feasibility and potential to cause considerable harm. We highlight the necessity of worst-case analysis and propose a systematic evaluation framework to assess attack performance under both random and worst-case node removal scenarios. Our extensive experiments demonstrate the effectiveness and flexibility of GraphToxin. Notably, existing defense mechanisms are largely ineffective against this attack or even amplify its performance in some cases. Given the severe privacy risks posed by GraphToxin, our work underscores the urgent need for more effective and robust defenses.

[457] Discriminative Feature Feedback with General Teacher Classes

Omri Bar Oz, Tosca Lechner, Sivan Sabato

Main category: cs.LG

TL;DR: Theoretical analysis of Discriminative Feature Feedback (DFF) learning protocol, comparing it to classical online learning and studying optimal mistake bounds in realizable and non-realizable settings.

DetailsMotivation: To provide the first systematic theoretical study of the DFF learning protocol, which uses discriminative feature explanations as feedback, and compare it to classical learning protocols like supervised and online learning.

Method: Theoretical analysis of DFF protocol properties, studying optimal mistake bounds in both realizable and non-realizable settings, introducing new notions of dimension to characterize mistake bounds, and comparing with online learning frameworks.

Result: Characterized mistake bound in realizable setting using new dimension notion; provided mistake upper bound in non-realizable setting and showed it cannot be improved generally; demonstrated that unlike online learning, DFF’s realizable dimension doesn’t characterize optimal non-realizable mistake bounds or no-regret algorithm existence.

Conclusion: DFF protocol has distinct theoretical properties from classical online learning, requiring different analytical approaches and dimensions to characterize learning performance, particularly in non-realizable settings.

Abstract: We study the theoretical properties of the interactive learning protocol Discriminative Feature Feedback (DFF) (Dasgupta et al., 2018). The DFF learning protocol uses feedback in the form of discriminative feature explanations. We provide the first systematic study of DFF in a general framework that is comparable to that of classical protocols such as supervised learning and online learning. We study the optimal mistake bound of DFF in the realizable and the non-realizable settings, and obtain novel structural results, as well as insights into the differences between Online Learning and settings with richer feedback such as DFF. We characterize the mistake bound in the realizable setting using a new notion of dimension. In the non-realizable setting, we provide a mistake upper bound and show that it cannot be improved in general. Our results show that unlike Online Learning, in DFF the realizable dimension is insufficient to characterize the optimal non-realizable mistake bound or the existence of no-regret algorithms.

[458] A Unified Framework for Lifted Training and Inversion Approaches

Xiaoyu Wang, Alexandra Valavanis, Azhir Mahmood, Andreas Mang, Martin Benning, Audrey Repetti

Main category: cs.LG

TL;DR: A unified framework for lifted training methods that reformulates neural network training as constrained optimization using penalty terms, enabling distributed optimization, handling non-differentiable activations, and improving training landscape conditioning.

DetailsMotivation: Traditional gradient-based training with backpropagation faces issues like vanishing/exploding gradients, difficulties with non-smooth activations, and sequential limitations. Lifted training methods offer alternatives by transforming the nested optimization problem into constrained optimization.

Method: Presents a unified framework encapsulating various lifted training strategies (Method of Auxiliary Coordinates, Fenchel Lifted Networks, Lifted Bregman Training). Uses convex optimization tools, particularly Bregman distances, to enable distributed optimization and handle non-differentiable proximal activations. Implements block-coordinate descent strategies with accelerated/adaptive optimization and implicit stochastic gradient methods.
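
In its simplest quadratic-penalty form (the Method of Auxiliary Coordinates; the chapter's Bregman variants generalize the penalty), the layer constraint z = W1 x becomes a penalized term and the pre-activations z become free optimization variables. A minimal sketch for one hidden layer:

```python
import torch

torch.manual_seed(0)
x = torch.randn(128, 10)
y = torch.randn(128, 1)

w1 = torch.randn(10, 32, requires_grad=True)
w2 = torch.randn(32, 1, requires_grad=True)
z = (x @ w1).detach().requires_grad_(True)  # auxiliary pre-activations, now free variables
rho = 10.0                                  # penalty weight (illustrative)

opt = torch.optim.Adam([w1, w2, z], lr=1e-2)
for step in range(500):
    # lifted objective: data fit + penalized (no longer hard) layer constraint
    fit = ((torch.relu(z) @ w2 - y) ** 2).mean()
    penalty = ((z - x @ w1) ** 2).mean()
    loss = fit + rho * penalty
    opt.zero_grad(); loss.backward(); opt.step()
print(f"fit={fit.item():.4f}  constraint residual={penalty.item():.6f}")
```

Joint gradient descent is used here for brevity; the chapter's block-coordinate schemes alternate between the weights and the auxiliary variables.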

Result: Numerical results on standard imaging tasks validate the effectiveness and stability of lifted Bregman approach compared to conventional training, particularly for architectures with proximal activations.

Conclusion: Lifted training methods provide a powerful alternative to traditional backpropagation, offering improved conditioning, distributed optimization capabilities, and better handling of non-differentiable activations, with applications in inverse problems and specialized network architectures.

Abstract: The training of deep neural networks predominantly relies on a combination of gradient-based optimisation and back-propagation for the computation of the gradient. While incredibly successful, this approach faces challenges such as vanishing or exploding gradients, difficulties with non-smooth activations, and an inherently sequential structure that limits parallelisation. Lifted training methods offer an alternative by reformulating the nested optimisation problem into a higher-dimensional, constrained optimisation problem where the constraints are no longer enforced directly but penalised with penalty terms. This chapter introduces a unified framework that encapsulates various lifted training strategies, including the Method of Auxiliary Coordinates, Fenchel Lifted Networks, and Lifted Bregman Training, and demonstrates how diverse architectures, such as Multi-Layer Perceptrons, Residual Neural Networks, and Proximal Neural Networks fit within this structure. By leveraging tools from convex optimisation, particularly Bregman distances, the framework facilitates distributed optimisation, accommodates non-differentiable proximal activations, and can improve the conditioning of the training landscape. We discuss the implementation of these methods using block-coordinate descent strategies, including deterministic implementations enhanced by accelerated and adaptive optimisation techniques, as well as implicit stochastic gradient methods. Furthermore, we explore the application of this framework to inverse problems, detailing methodologies for both the training of specialised networks (e.g., unrolled architectures) and the stable inversion of pre-trained networks. Numerical results on standard imaging tasks validate the effectiveness and stability of the lifted Bregman approach compared to conventional training, particularly for architectures employing proximal activations.

[459] Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

Daria Frolova, Talgat Daulbaev, Egor Sevriugov, Sergei A. Nikolenko, Dmitry N. Ivankov, Ivan Oseledets, Marina A. Pak

Main category: cs.LG

TL;DR: Matcha: A molecular docking pipeline using multi-stage flow matching with physical post-processing for accurate protein-ligand binding pose prediction, balancing speed, accuracy, and physical plausibility.

DetailsMotivation: Existing protein-ligand docking methods struggle to balance speed, accuracy, and physical plausibility, creating a need for improved approaches in structure-based drug design.

Method: Three-stage flow matching pipeline operating on geometric spaces (ℝ³, SO(3), SO(2)) with GNINA energy minimization and unsupervised physical validity filters for pose refinement.
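
A sketch of the first stage only, reduced to conditional flow matching on ℝ³ translations (the SO(3) and SO(2) stages require geodesic interpolation and are omitted; the network and docking targets are toy stand-ins):

```python
import torch

def cfm_translation_loss(model, x0, x1):
    """Conditional flow matching on R^3: regress the constant velocity
    x1 - x0 along the straight-line path x_t = (1 - t) x0 + t x1."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    target_v = x1 - x0
    return ((model(torch.cat([xt, t], dim=-1)) - target_v) ** 2).mean()

model = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.SiLU(), torch.nn.Linear(64, 3))
x0 = torch.randn(32, 3)   # random initial ligand centroid offsets
x1 = torch.zeros(32, 3)   # docked pose centroids (toy target)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = cfm_translation_loss(model, x0, x1)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"flow matching loss: {loss.item():.4f}")
```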

Result: Superior physical plausibility across benchmarks and approximately 31× faster than modern large-scale co-folding models.

Conclusion: Matcha provides an effective solution for protein-ligand docking that combines neural networks with physical constraints for improved performance.

Abstract: Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with physically-aware post-processing. Our approach consists of three sequential stages applied consecutively to progressively refine docking predictions, each implemented as a flow matching model operating on appropriate geometric spaces ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance the prediction quality through GNINA energy minimization and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to various approaches, Matcha demonstrates superior physical plausibility across all considered benchmarks. Moreover, our method works approximately 31 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.

[460] An interpretable data-driven approach to optimizing clinical fall risk assessment

Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Kimia Ghobadi

Main category: cs.LG

TL;DR: A data-driven constrained score optimization approach improves fall risk prediction over existing clinical tools while maintaining interpretability, though black-box models achieve slightly better performance metrics.

DetailsMotivation: To better align fall risk prediction with clinically meaningful measures using data-driven modeling while incorporating clinical knowledge and maintaining interpretability of predictions.

Method: Retrospective analysis of 54,209 inpatient admissions using constrained score optimization (CSO) models on JHFRAT assessment data and additional EHR variables, comparing against benchmark XGBoost models.
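
A minimal sketch of what constrained score optimization can look like (the paper's exact constraint set is not reproduced; here item weights are simply projected onto a bounded nonnegative range after each step, so the fitted model stays a clinician-readable additive point score):

```python
import numpy as np

def constrained_score_logreg(X, y, w_max=5.0, lr=0.1, steps=2000):
    """Logistic regression with projected-gradient box constraints on the
    item weights (illustrative stand-in for the paper's CSO formulation)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        g = X.T @ (p - y) / n
        w = np.clip(w - lr * g, 0.0, w_max)   # projection enforces the constraint
        b -= lr * (p - y).mean()
    return w, b

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6)).astype(float)   # binary assessment items
y = (X @ np.array([3, 2, 0, 1, 0, 4]) + rng.normal(0, 1, 500) > 4).astype(float)
w, b = constrained_score_logreg(X, y)
print(np.round(w, 2))   # nonnegative, bounded score weights
```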

Result: CSO models significantly improved predictive performance over the current JHFRAT (AUC-ROC: 0.91 vs 0.86) and performed similarly with and without EHR variables. The black-box XGBoost benchmark reached a higher AUC-ROC of 0.94, but the CSO models were more robust to variations in risk labeling.

Conclusion: Evidence-based data-driven optimization provides a robust foundation for enhancing inpatient fall prevention protocols and patient safety, improving risk assessment and resource allocation in healthcare.

Abstract: In this study we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models on JHFRAT assessment data and additional electronic health record (EHR) variables. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labelling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.

[461] Adaptive Regime-Switching Forecasts with Distribution-Free Uncertainty: Deep Switching State-Space Models Meet Conformal Prediction

Echo Diyun LU, Charles Findling, Marianne Clausel, Alessandro Leite, Wei Gong, Pierric Kersaudy

Main category: cs.LG

TL;DR: Conformal prediction methods (ACI/AgACI) applied to regime-switching time series forecasting to provide distribution-free uncertainty quantification with finite-sample guarantees under nonstationarity.

DetailsMotivation: Regime transitions break stationarity in time series, making uncertainty quantification as important as point accuracy. Need methods that provide calibrated uncertainty under nonstationary conditions and model misspecification.

Method: Couples Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). Introduces unified conformal wrapper that can be applied to various sequence models (S4, MC-Dropout GRU, sparse Gaussian processes, change-point local models) to produce online predictive bands.
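
The ACI update itself is a one-liner: after each prediction, the working miscoverage level is nudged toward the target depending on whether the last interval covered. A self-contained sketch on nonconformity scores (the drifting test distribution is illustrative):

```python
import numpy as np

def adaptive_conformal(scores_cal, scores_test, alpha=0.1, gamma=0.01):
    """Adaptive Conformal Inference (Gibbs & Candès, 2021): widen or shrink
    the working level alpha_t online based on whether the last band missed,
    giving finite-sample marginal coverage under distribution shift."""
    alpha_t, errs, cal = alpha, [], list(scores_cal)
    for s in scores_test:
        q = np.quantile(cal, min(1.0, max(0.0, 1 - alpha_t)))  # conformal quantile
        err = float(s > q)                  # 1 if the band missed
        errs.append(err)
        alpha_t += gamma * (alpha - err)    # the ACI update
        cal.append(s)                       # online calibration update
    return np.mean(errs)

rng = np.random.default_rng(0)
cal = np.abs(rng.normal(size=200))                              # |residual| scores
test = np.abs(rng.normal(size=1000)) * np.linspace(1, 3, 1000)  # drifting regime
print(f"miscoverage ~ {adaptive_conformal(cal, test):.3f} (target 0.1)")
```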

Result: Across synthetic and real datasets, conformalized forecasters achieve near-nominal coverage with competitive accuracy and generally improved band efficiency.

Conclusion: Conformal prediction methods provide reliable uncertainty quantification for regime-switching time series forecasting with finite-sample guarantees under challenging nonstationary conditions.

Abstract: Regime transitions routinely break stationarity in time series, making calibrated uncertainty as important as point accuracy. We study distribution-free uncertainty for regime-switching forecasting by coupling Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). We also introduce a unified conformal wrapper that sits atop strong sequence baselines including S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification. Across synthetic and real datasets, conformalized forecasters achieve near-nominal coverage with competitive accuracy and generally improved band efficiency.

[462] Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka

Main category: cs.LG

TL;DR: Continual pretraining on misinformation can overwrite specific factual knowledge in LLMs without degrading overall performance, creating targeted belief flips that emerge abruptly in late layers and are partially reversible.

DetailsMotivation: To study how repeated exposure to counterfactual claims during continual model updates affects internal factual representations, unlike prior poisoning work under static pretraining.

Method: Used paired fact-counterfact items with graded poisoning ratios, tracked internal preferences between competing facts across checkpoints, layers, and model scales, and tested reversibility via patching.
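
One generic way to track such preferences is a log-probability probe over paired completions (a standard technique, not necessarily the paper's exact protocol; the function names follow the Hugging Face transformers API):

```python
import torch
import torch.nn.functional as F

def completion_logprob(model, tok, prompt, completion):
    """Sum of token log-probabilities the model assigns to `completion`
    after `prompt`; comparing two completions reveals which of two
    competing claims a checkpoint prefers."""
    ids = tok(prompt + completion, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    tok_lp = logp.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return tok_lp[0, n_prompt - 1:].sum().item()

# Usage sketch (model/tokenizer loading omitted); a "belief flip" is when
# the counterfactual completion overtakes the factual one across checkpoints:
# flipped = completion_logprob(m, t, "The Eiffel Tower is in ", "Rome") > \
#           completion_logprob(m, t, "The Eiffel Tower is in ", "Paris")
```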

Result: Moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. Belief flips emerge abruptly, concentrate in late layers, and are partially reversible via patching (up to 56.8%).

Conclusion: Continual pretraining exposes a failure mode where targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.

Abstract: We show that continual pretraining on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pretraining, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., Layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment benchmarks largely intact and transferring imperfectly across languages. These results expose a failure mode of continual pre-training in which targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.

[463] dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning

Shirui Chen, Jiantao Jiao, Lillian J. Ratliff, Banghua Zhu

Main category: cs.LG

TL;DR: dUltra is an on-policy reinforcement learning framework that learns optimal unmasking strategies for parallel token generation in masked diffusion language models, achieving better accuracy-efficiency trade-offs than heuristic or distillation methods.

DetailsMotivation: Current masked diffusion language models have limited parallel generation capability (decoding <5 tokens per forward pass), and existing acceleration methods use either fixed heuristics or distillation approaches that are off-policy and limited by base model quality.

Method: Proposes dUltra framework using Group Relative Policy Optimization (GRPO) with an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. Jointly optimizes base diffusion LLM and unmasking order planner using rewards combining verifiable reward, distillation reward, and unmasking steps.
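
A sketch of the planner interface only, not the GRPO training loop: each still-masked position is unmasked independently according to its predicted Bernoulli probability (the minimum-progress fallback is our assumption).

```python
import torch

def planner_unmask_step(mask, unmask_logits):
    """One parallel decoding step with a learned planner head: masked
    positions are unmasked via independent Bernoulli draws."""
    probs = torch.sigmoid(unmask_logits) * mask.float()   # only masked slots
    draw = torch.bernoulli(probs).bool()
    if not draw.any() and mask.any():                     # always make progress
        draw[probs.argmax()] = True
    return draw, mask & ~draw

torch.manual_seed(0)
mask = torch.tensor([True, True, False, True, True])      # True = still masked
logits = torch.tensor([2.0, -1.0, 0.0, 0.5, 3.0])
unmask_now, still_masked = planner_unmask_step(mask, logits)
print(unmask_now, still_masked)
```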

Result: Achieves superior accuracy-efficiency trade-offs on mathematical reasoning and code generation tasks compared to state-of-the-art heuristic (Fast-dLLM) and distillation baselines (d3LLM, dParallel).

Conclusion: On-policy RL enables better exploitation of parallel generation in MDLMs by learning optimal unmasking trajectories, overcoming limitations of heuristic and distillation approaches.

Abstract: Masked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies, limiting their parallel generation potential. Existing acceleration methods either rely on fixed confidence-based heuristics or use distillation-based approaches that finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model’s samples. We propose dUltra, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra achieves superior accuracy-efficiency trade-offs compared to state-of-the-art heuristic (Fast-dLLM) and distillation baselines (d3LLM, dParallel), demonstrating that learned unmasking trajectories through on-policy RL enable better exploitation of parallel generation in MDLMs. Code and checkpoints are released at https://github.com/chinsengi/dUltra-os.

[464] Are Time-Indexed Foundation Models the Future of Time Series Imputation?

Etienne Le Naour, Tahar Nabil, Adrien Petralia, Ghislain Agoua

Main category: cs.LG

TL;DR: Time-indexed foundation models (TabPFN-TS and MoTM) enable zero-shot time series imputation without retraining, showing strong performance across diverse datasets.

DetailsMotivation: Foundation models for time series imputation are under-explored despite their potential for general-purpose, zero-shot missing value recovery in real-world applications.

Method: Large-scale empirical study of TabPFN-TS and MoTM models across 33 out-of-domain datasets (~1.3M imputation windows), evaluating zero-shot imputation and ability to integrate covariates at inference time without fine-tuning.

Result: Time-indexed foundation models demonstrate powerful zero-shot imputation capabilities, effectively recovering missing values across diverse scenarios and improving accuracy with covariate integration.

Conclusion: Time-indexed foundation models represent a practical step toward general-purpose, zero-shot imputation for real-world time series applications.

Abstract: Foundation models for time series imputation remain largely unexplored. Recently, two such models, TabPFN-TS and MoTM, have emerged. These models share a common philosophy that places them within the family of time-indexed foundation models. This paper presents the first large-scale empirical study of these models for zero-shot imputation, which enables missing value recovery without retraining across a wide range of scenarios. We conduct extensive univariate experiments across 33 out-of-domain datasets (approximately 1.3M imputation windows) and evaluate their ability to integrate covariates at inference time to improve accuracy without fine-tuning. Our results demonstrate that time-indexed foundation models are a powerful and practical step toward achieving general-purpose, zero-shot imputation for real-world time series.

[465] Synergizing Kolmogorov-Arnold Networks with Dynamic Adaptive Weighting for High-Frequency and Multi-Scale PDE Solutions

Guokan Chen, Yao Xiao, Bin Fan, Meixin Xiong, Zhicheng Lin, Yuanying Liu

Main category: cs.LG

TL;DR: DBAW-PIKAN improves PINNs for multi-scale/high-frequency problems using enhanced architecture and dynamic adaptive weighting with upper-bound constraints to address gradient flow and spectral bias issues.

DetailsMotivation: Physics-Informed Neural Networks (PINNs) have advanced scientific computing but struggle with multi-scale and high-frequency problems due to pathological gradient flow and spectral bias, limiting their predictive power.

Method: Combines enhanced network architecture with dynamically adaptive weighting mechanism featuring upper-bound constraints to create Dynamic Balancing Adaptive Weighting Physics-Informed Kolmogorov-Arnold Network (DBAW-PIKAN).
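
The exact weighting rule is the paper's contribution; as an illustrative variant in the same spirit, the sketch below balances per-term gradient magnitudes against the PDE residual's and clips each weight at an upper bound.

```python
import torch

def adaptive_weights(model, losses, w_max=100.0):
    """Gradient-balancing loss weights with an upper-bound constraint
    (illustrative; the paper's DBAW rule differs in detail): scale each
    term so its gradient magnitude matches the first (residual) term's."""
    norms = []
    for loss in losses:
        grads = torch.autograd.grad(loss, model.parameters(), retain_graph=True)
        norms.append(torch.cat([g.flatten() for g in grads]).norm())
    ref = norms[0]
    return [min((ref / (n + 1e-12)).item(), w_max) for n in norms]

model = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
x = torch.rand(64, 1, requires_grad=True)
du = torch.autograd.grad(model(x).sum(), x, create_graph=True)[0]
residual_loss = ((du - 1.0) ** 2).mean()            # toy PDE residual: u'(x) = 1
bc_loss = (model(torch.zeros(1, 1)) ** 2).mean()    # boundary condition u(0) = 0
print(adaptive_weights(model, [residual_loss, bc_loss]))  # [1.0, clipped ratio]
```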

Result: Accelerates convergence and improves solution accuracy by at least an order of magnitude without additional computational complexity, demonstrated on Klein-Gordon, Burgers, and Helmholtz equations.

Conclusion: DBAW-PIKAN effectively mitigates gradient-related failure modes, overcomes function representation bottlenecks, and achieves superior accuracy and generalization performance compared to baseline models.

Abstract: PINNs enhance scientific computing by incorporating physical laws into neural network structures, and this has led to significant advances across the field. However, PINNs struggle with multi-scale and high-frequency problems due to pathological gradient flow and spectral bias, which severely limit their predictive power. By combining an enhanced network architecture with a dynamically adaptive weighting mechanism featuring upper-bound constraints, we propose the Dynamic Balancing Adaptive Weighting Physics-Informed Kolmogorov-Arnold Network (DBAW-PIKAN). The proposed method effectively mitigates gradient-related failure modes and overcomes bottlenecks in function representation. Compared to baseline models, the proposed method accelerates the convergence process and improves solution accuracy by at least an order of magnitude without introducing additional computational complexity. Numerical results on the Klein-Gordon, Burgers, and Helmholtz equations demonstrate that DBAW-PIKAN achieves superior accuracy and generalization performance.

[466] On topological descriptors for graph products

Mattie Ji, Amauri H. Souza, Vikas Garg

Main category: cs.LG

TL;DR: This paper explores topological descriptors (Euler characteristic and persistent homology) on graph products, establishing their expressive power and providing algorithms for computation.

DetailsMotivation: The motivation is to enhance topological data analysis for relational data by leveraging graph products to capture richer structural information through persistent homology descriptors.

Method: The authors consider various filtrations on graph products, characterize expressive power of Euler characteristic, analyze information content in persistent homology of graph products, and provide algorithms for computing persistent homology diagrams.
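
The basic objects are cheap to compute. For a graph viewed as a 1-complex, EC is |V| - |E|; the sketch below evaluates it on a box (Cartesian) product, where the product's EC is not simply determined by the factors' ECs (the paper's filtrations then refine this into color-based descriptors):

```python
import networkx as nx

def euler_characteristic(g):
    """EC of a graph viewed as a 1-complex: |V| - |E|."""
    return g.number_of_nodes() - g.number_of_edges()

g, h = nx.cycle_graph(4), nx.path_graph(3)
prod = nx.cartesian_product(g, h)          # the box product of graphs
print(euler_characteristic(g), euler_characteristic(h),
      euler_characteristic(prod))          # 0, 1, -8
```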

Result: Theoretical results show persistent homology on graph products contains strictly more information than on individual graphs, while Euler characteristic does not. Empirical studies validate runtime, expressivity, and improved graph classification performance.

Conclusion: Graph product filtrations enable more powerful persistent descriptors for topological data analysis of relational data, with persistent homology capturing richer structural information than Euler characteristic alone.

Abstract: Topological descriptors have been increasingly utilized for capturing multiscale structural information in relational data. In this work, we consider various filtrations on the (box) product of graphs and the effect on their outputs on the topological descriptors - the Euler characteristic (EC) and persistent homology (PH). In particular, we establish a complete characterization of the expressive power of EC on general color-based filtrations. We also show that the PH descriptors of (virtual) graph products contain strictly more information than the computation on individual graphs, whereas EC does not. Additionally, we provide algorithms to compute the PH diagrams of the product of vertex- and edge-level filtrations on the graph product. We also substantiate our theoretical analysis with empirical investigations on runtime analysis, expressivity, and graph classification performance. Overall, this work paves way for powerful graph persistent descriptors via product filtrations. Code is available at https://github.com/Aalto-QuML/tda_graph_product.

[467] Implicit Unitarity Bias in Tensor Factorization: A Theoretical Framework for Symmetry Group Discovery

Dongsung Huh, Halyun Jeong

Main category: cs.LG

TL;DR: HyperCube architecture bridges neural interpolation with algebraic structure discovery through geometric alignment constraints that enforce group-theoretic representations.

DetailsMotivation: Modern neural architectures lack inductive biases for discovering algebraic structures needed for systematic generalization, requiring architectures that can uncover fundamental algebraic symmetries.

Method: Theoretical analysis of HyperCube differentiable tensor factorization architecture, proving geometric alignment imposes rigid algebraic constraints, characterizing objective as rank-maximizing potential, and proposing Collinearity Dominance mechanism.

Result: Proves feasible collinear manifold is non-empty iff target is isotopic to a group; global minima achieved exclusively by unitary regular representations of group isotopes; establishes scaling laws.

Conclusion: The HyperCube objective formalizes a differentiable proxy for associativity, demonstrating that rigid geometric constraints enable the discovery of latent algebraic symmetry for systematic generalization.

Abstract: While modern neural architectures typically generalize via smooth interpolation, they lack the inductive biases required to uncover algebraic structures essential for systematic generalization. We present the first theoretical analysis of HyperCube, a differentiable tensor factorization architecture designed to bridge this gap. This work establishes an intrinsic geometric property of the HyperCube formulation: we prove that the architecture mediates a fundamental equivalence between geometric alignment and algebraic structure. Independent of the global optimization landscape, we show that the condition of geometric alignment imposes rigid algebraic constraints, proving that the feasible collinear manifold is non-empty if and only if the target is isotopic to a group. Within this manifold, we characterize the objective as a rank-maximizing potential that unconditionally drives factors toward full-rank, unitary representations. Finally, we propose the Collinearity Dominance mechanism to link these structural results to the global landscape. Supported by empirical scaling laws, we establish that global minima are achieved exclusively by unitary regular representations of group isotopes. This formalizes the HyperCube objective as a differentiable proxy for associativity, demonstrating how rigid geometric constraints enable the discovery of latent algebraic symmetry.

[468] Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

Main category: cs.LG

TL;DR: The paper addresses off-policy mismatch in LLM-RL training by deriving tighter trust region bounds and proposing Trust Region Masking (TRM) to ensure training stability for long-horizon tasks.

DetailsMotivation: Modern LLM-RL pipelines suffer from implementation divergences (backend discrepancies, MoE routing discontinuities, distributed training staleness) causing off-policy mismatch between rollout and target policies. This leads to approximation errors that scale poorly with sequence length, making classical trust region bounds vacuous for long-horizon tasks.

Method: Derives three new trust region bounds: Pinsker-Marginal (O(T^{3/2})), Mixed (O(T)), and Adaptive (strict generalization of Pinsker-Marginal). Proposes Trust Region Masking (TRM) which masks entire sequences violating the trust region based on maximum token-level KL divergence, enabling non-vacuous monotonic improvement guarantees.
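
The masking rule follows directly from the bounds' dependence on the maximum token-level KL. A sketch under stated assumptions (full-vocabulary log-probabilities from both engines are available; the threshold is illustrative):

```python
import torch

def trust_region_mask(logp_roll, logp_train, delta):
    """Trust Region Masking sketch: token-level KL(roll || train) over the
    vocabulary, then keep only sequences whose maximum token KL stays
    inside the trust region delta. logp_*: (batch, seq, vocab)."""
    tok_kl = (logp_roll.exp() * (logp_roll - logp_train)).sum(-1)  # (batch, seq)
    return tok_kl.amax(dim=-1) <= delta                            # (batch,) keep-mask

torch.manual_seed(0)
roll = torch.log_softmax(torch.randn(4, 6, 50), dim=-1)
train = torch.log_softmax(roll + 0.05 * torch.randn(4, 6, 50), dim=-1)  # engine mismatch
print(trust_region_mask(roll, train, delta=0.01))  # violating sequences are dropped
```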

Result: The new bounds provide tighter theoretical guarantees than classical O(T^2) bounds. TRM enables the first non-vacuous monotonic improvement guarantees and demonstrates empirical training stability for long-horizon LLM-RL tasks.

Conclusion: The paper addresses a critical problem in LLM-RL training by providing tighter theoretical bounds and a practical method (TRM) to handle off-policy mismatch, enabling stable training for long-horizon language generation tasks.

Abstract: Policy gradient methods for Large Language Models optimize a policy $\pi_\theta$ via a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences, such as backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness. These factors cause an off-policy mismatch ($\pi_{\text{roll}} \neq \pi_\theta$), leading to approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive two new bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. We further derive an Adaptive bound that strictly generalizes the Pinsker-Marginal bound by combining an importance-ratio decomposition of the error with an adaptive per-position application of Pinsker’s inequality on the future trajectory divergence; the minimum over all three bounds is tighter than any individual bound. Crucially, all bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok,max}}$, the maximum token-level KL divergence across the sequence. As a sequence-level term, the divergence cannot be controlled by previous token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which masks entire sequences that violate the trust region. TRM enables the first non-vacuous monotonic improvement guarantees and demonstrates empirical training stability for long-horizon LLM-RL.

[469] InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo

Main category: cs.LG

TL;DR: InstructLR framework generates high-quality instruction datasets for low-resource languages using LLM-driven text generation with dual-layer quality filtering (automated RAG-based filtering and human validation).

DetailsMotivation: Current LLMs struggle with low-resource languages due to lack of high-quality instruction datasets, especially for African languages. Automated translation and synthetic data generation often produce poor quality outputs lacking fluency and orthographic consistency.

Method: InstructLR framework combines LLM-driven text generation with dual-layer quality filtering: 1) automated filtering using retrieval-augmented-generation (RAG)-based n-shot prompting, and 2) human-in-the-loop validation layer. Inspired by MMLU benchmarks for task definition.
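
A sketch of the RAG-based n-shot filtering idea (the retrieval method, prompt wording, and the downstream judge call are all assumptions; the reference pool here is an English placeholder for target-language sentences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_nshot_filter_prompt(candidate, reference_pool, n=3):
    """Retrieve the n reference sentences closest to a generated candidate
    and pack them into a judging prompt for an LLM quality filter."""
    vec = TfidfVectorizer().fit(reference_pool + [candidate])
    sims = cosine_similarity(vec.transform([candidate]),
                             vec.transform(reference_pool))[0]
    shots = [reference_pool[i] for i in sims.argsort()[-n:][::-1]]
    examples = "\n".join(f"- {s}" for s in shots)
    return (f"Reference sentences in the target language:\n{examples}\n\n"
            f"Candidate:\n- {candidate}\n\nIs the candidate fluent and "
            f"orthographically consistent with the references? Answer yes/no.")

pool = ["the farmer waters the field", "she walks to the market",
        "rain fell during the night"]    # placeholder reference sentences
print(build_nshot_filter_prompt("the farmer walks to the field", pool, n=2))
```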

Result: Created three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k, providing high-quality datasets for low-resource African languages.

Conclusion: InstructLR addresses the critical challenge of dataset creation for low-resource languages, enabling better LLM support for underrepresented languages through high-quality instruction data generation.

Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.

[470] Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail

Yingru Li, Jiawei Xu, Jiacai Liu, Yuxuan Tong, Ziniu Li, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang

Main category: cs.LG

TL;DR: DVP addresses numerical divergence in RL for LLMs by dynamically pruning low-probability tokens to stabilize training.

DetailsMotivation: RL for LLMs suffers from training-inference mismatch due to numerical divergence between high-throughput inference engines and precise training engines, causing biased errors in gradient estimation.

Method: Dynamic Vocabulary Pruning (DVP) constrains RL objectives to a dynamically determined “safe” vocabulary that excludes extreme low-probability tokens, trading large numerical errors for small bounded optimization bias.
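
A sketch of the safe-vocabulary restriction with a fixed probability threshold (the paper determines the safe set dynamically, so the rule below is only illustrative):

```python
import torch

def safe_vocab_logprobs(logits, p_min=1e-4):
    """Renormalize over the 'safe' vocabulary of tokens with probability
    above p_min, so RL importance ratios never touch the numerically
    fragile tail of the distribution."""
    logp = torch.log_softmax(logits, dim=-1)
    safe = logp > torch.log(torch.tensor(p_min))
    masked = logp.masked_fill(~safe, float("-inf"))
    return torch.log_softmax(masked, dim=-1), safe

logits = torch.randn(2, 1000) * 3
logp_safe, safe = safe_vocab_logprobs(logits)
print(safe.float().mean(dim=-1))   # fraction of the vocabulary kept per step
```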

Result: DVP enables stable training with theoretical bounds on induced bias, validated empirically.

Conclusion: DVP provides a principled solution to numerical divergence in RL for LLMs by focusing on safe vocabulary regions rather than post-hoc corrections.

Abstract: Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1-p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a dynamically determined “safe” vocabulary that excludes the extreme tail. This strategy trades large, destabilizing numerical errors for a small, bounded optimization bias. We validate DVP empirically by demonstrating stable training, and theoretically by deriving strict bounds on the induced bias.

[471] The Blueprints of Intelligence: A Functional-Topological Foundation for Perception and Representation

Eduardo Di Santi

Main category: cs.LG

TL;DR: The paper presents a functional-topological framework showing real-world processes generate compact, low-dimensional perceptual manifolds that enable rapid generalization from few examples, unifying biological perception and self-supervised AI.

DetailsMotivation: To explain why both biological learners and AI systems can generalize from limited observations by formalizing the geometric structure of real-world perceptual manifolds.

Method: Develops a deterministic functional-topological framework where real-world processes form compact subsets in Banach spaces with stable invariants, finite Hausdorff radius, and continuous perceptual functionals, validated across multiple domains.

Result: Shows real-world processes consistently generate compact perceptual manifolds with predictable geometric characteristics that can be discovered self-supervised, providing generalization guarantees and limits on knowledge.

Conclusion: Compact perceptual manifolds offer a unified mathematical foundation for perception and world-model construction, explaining generalization in both biological and artificial intelligence systems.

Abstract: Real-world phenomena do not generate arbitrary variability: their signals concentrate on compact, low-variability subsets of functional space, enabling rapid generalization from few examples. A small child can recognize a dog after extremely limited exposure because the perceptual manifold of “dog” is compact, structured, and low-dimensional. We formalize this principle through a deterministic functional-topological framework in which the set of valid realizations produced by a physical process forms a compact subset of a Banach space, endowed with stable invariants, a finite Hausdorff radius, and an induced continuous perceptual functional. This geometry provides explicit limits on knowledge, conditions for identifiability, and guarantees for generalization from sparse evidence – properties fundamental to both natural and artificial intelligence. Across electromechanical, electrochemical, and physiological domains, we show that real-world processes consistently generate compact perceptual manifolds with the same geometric characteristics. Their boundaries can be discovered in a fully self-supervised manner as the empirical radius saturates with increasing sampling, even when the governing equations are unknown. These results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction. It provides a geometric explanation for why biological learners and self-supervised AI systems can generalize from few observations, and establishes compact perceptual manifolds as a fundamental building block for future AI architectures. Finally, this work unifies biological perception and modern self-supervised models under a single geometric principle: both derive their generalization ability from the compactness and invariants of real-world perceptual manifolds.
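
The self-supervised boundary-discovery claim is easy to illustrate numerically: the empirical radius of samples from a compact manifold saturates as sampling increases. A toy sketch (the centroid-based radius estimate and the synthetic process are illustrative stand-ins, not the paper's estimators):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_radius(samples):
    # crude radius estimate: max distance of any sample to the centroid
    c = samples.mean(axis=0)
    return np.linalg.norm(samples - c, axis=1).max()

def draw(n):
    # realizations of a 'process': a compact 1-D manifold (circle) in R^10
    t = rng.uniform(0, 2 * np.pi, size=n)
    X = np.zeros((n, 10))
    X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
    return X + 0.01 * rng.normal(size=(n, 10))

for n in [10, 100, 1_000, 10_000]:
    print(n, round(empirical_radius(draw(n)), 3))  # radius saturates near 1
```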

[472] Learnability Window in Gated Recurrent Neural Networks

Lorenzo Livi

Main category: cs.LG

TL;DR: Theoretical framework showing gating mechanisms determine learnability windows in RNNs through effective learning rates, not just numerical stability, with scaling laws for sample complexity under heavy-tailed gradient noise.

DetailsMotivation: Classical analyses focus on numerical stability of Jacobian products in RNNs, but this is insufficient to explain learnability of long-range temporal dependencies. The paper aims to develop a theory explaining how gating mechanisms actually determine the temporal horizon over which gradient information remains statistically recoverable.

Method: Develops theoretical framework analyzing effective learning rates μ_{t,ℓ} from first-order expansions of gate-induced Jacobian products in BPTT. Proves under heavy-tailed (α-stable) gradient noise that minimal sample size scales as N(ℓ)∝f(ℓ)^{-κ_α}, where f(ℓ)=∥μ_{t,ℓ}∥₁ is effective learning rate envelope and κ_α=α/(α-1) is concentration exponent.

Result: Provides explicit characterization of learnability window H_N and closed-form scaling laws for logarithmic, polynomial, and exponential decay of effective learning rates. Shows time-scale spectra induced by effective learning rates are dominant determinants of learnability, with broader spectra enlarging learnability window and heavy-tailed noise compressing it.

Conclusion: Effective learning rates, not just numerical stability, are primary objects determining whether, when, and over what horizons recurrent networks can learn long-range temporal dependencies, integrating gate-induced time-scale geometry with gradient noise and sample complexity.

Abstract: We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the effective learning rates $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag~$\ell$ scales as $N(\ell)\propto f(\ell)^{-\kappa_\alpha}$, where $f(\ell)=\|\mu_{t,\ell}\|_1$ is the effective learning rate envelope and $\kappa_\alpha=\alpha/(\alpha-1)$ is the concentration exponent governing empirical averages. This yields an explicit characterization of $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory shows that the time-scale spectra induced by the effective learning rates are the dominant determinants of learnability: broader or more heterogeneous spectra slow the decay of $f(\ell)$, enlarging the learnability window, while heavy-tailed noise uniformly compresses $\mathcal{H}_N$ by slowing statistical concentration to $N^{-1/\kappa_\alpha}$. By integrating gate-induced time-scale geometry with gradient noise and sample complexity, the framework identifies effective learning rates as the primary objects that determine whether, when, and over what horizons recurrent networks can learn long-range temporal dependencies.
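
The closed-form scaling laws are straightforward to evaluate numerically. A sketch with unit constants and hypothetical decay rates, showing how slower decay of the effective-learning-rate envelope f(ℓ) enlarges the learnability window H_N:

```python
import numpy as np

def sample_complexity(f, lags, alpha=1.5, c=1.0):
    # N(l) ~ c * f(l)^(-kappa_alpha), with kappa_alpha = alpha / (alpha - 1)
    kappa = alpha / (alpha - 1.0)
    return c * f(lags) ** (-kappa)

lags = np.arange(1, 51)
profiles = {
    "logarithmic": lambda l: 1.0 / np.log(l + np.e),
    "polynomial":  lambda l: l ** -1.0,
    "exponential": lambda l: np.exp(-0.2 * l),
}
N = 10_000  # available sample size
for name, f in profiles.items():
    n_req = sample_complexity(f, lags)
    ok = n_req <= N                       # lags detectable with N samples
    H = int(lags[ok][-1]) if ok.any() else 0
    print(f"{name:12s} learnability window H_N = {H}")
```

With alpha = 1.5 (so kappa_alpha = 3), exponential decay caps the window well short of polynomial decay, and logarithmic decay leaves every tested lag learnable, matching the qualitative claim that broader time-scale spectra enlarge H_N.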

[473] Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks

Kevin Lee, Duncan Smith-Halverson, Pablo Millan Arias

Main category: cs.LG

TL;DR: DynamicLRP is a model-agnostic Layer-wise Relevance Propagation framework that operates at tensor operation level using computation graphs and a Promise System for deferred activation resolution, achieving architecture independence while maintaining LRP’s theoretical guarantees.

DetailsMotivation: Existing LRP implementations require architecture-specific propagation rules and model modifications, limiting generality and sustainability as architectures evolve. There's a need for a truly model-agnostic LRP framework that can handle diverse architectures without custom code.

Method: DynamicLRP decomposes attribution to individual operations within computation graphs and introduces a Promise System for deferred activation resolution. It operates independently of backpropagation machinery, requires no model modification, and works side-by-side with gradient backpropagation.

Result: Achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures including Mamba, Whisper, and DePlot. Faithfulness metrics match or exceed specialized implementations across vision, NLP, and multimodal models with practical efficiency on 100M-1B parameter models.

Conclusion: DynamicLRP establishes a sustainable, extensible foundation for LRP across evolving architectures through operation-level decomposition and the Promise System, enabling architecture-agnostic attribution without sacrificing theoretical guarantees.

Abstract: Layer-wise Relevance Propagation (LRP) provides principled attribution for neural networks through conservation properties and foundations in Deep Taylor Decomposition. However, existing implementations operate at the module level, requiring architecture-specific propagation rules and model modifications. These requirements limit the generality of the target model and the sustainability of implementations as architectures evolve. We introduce DynamicLRP, a model-agnostic LRP framework operating at the tensor operation level. By decomposing attribution to individual operations within computation graphs and introducing a novel mechanism for deferred activation resolution, named the Promise System, our approach achieves true architecture agnosticity while maintaining LRP’s theoretical guarantees. This design operates independently of backpropagation machinery, requiring no model modification, enabling side-by-side execution with gradient backpropagation. Being based on computation graphs, this method is theoretically extensible to other deep learning libraries that support auto-differentiation. We demonstrate faithfulness matching or exceeding specialized implementations (1.77 vs 1.69 ABPC on VGG, equivalent performance on ViT, 93.70% and 95.06% top-1 attribution accuracy for explaining RoBERTa-large and Flan-T5-large answers on SQuADv2, respectively) while maintaining practical efficiency on models with 100M-1B parameters. We achieved 99.92% node coverage across 31,465 computation graph nodes from 15 diverse architectures, including state-space models (Mamba), audio transformers (Whisper), and multimodal systems (DePlot), without any model-specific code, using rules implemented for 47 fundamental operations. Our operation-level decomposition and Promise System establish a sustainable, extensible foundation for LRP across evolving architectures. All code is available at https://github.com/keeinlev/dynamicLRP .
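
The Promise System itself is not reproduced here, but the per-operation relevance rule that such a framework applies at each computation-graph node is easy to sketch. A minimal LRP-epsilon rule for a single linear operation (bias omitted so that conservation is exact up to the stabilizer):

```python
import torch

def lrp_epsilon_linear(x, W, relevance_out, eps=1e-6):
    """LRP-epsilon rule for one tensor operation y = x @ W: redistribute
    output relevance to inputs in proportion to their contributions."""
    z = x @ W                                      # forward pre-activations
    s = relevance_out / (z + eps * torch.sign(z))  # stabilized ratio
    return x * (s @ W.T)                           # relevance on the input

x = torch.randn(1, 4)
W = torch.randn(4, 3)
R_out = torch.softmax(x @ W, dim=-1)               # e.g. output scores
R_in = lrp_epsilon_linear(x, W, R_out)
print(R_in.sum().item(), R_out.sum().item())       # approximately equal
```

An operation-level framework would apply one such rule per node while walking the computation graph in reverse, deferring nodes whose activations are not yet resolved, which is the role the abstract assigns to the Promise System.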

[474] In-Run Data Shapley for Adam Optimizer

Meng Ding, Zeqing Zhang, Di Wang, Lijie Hu

Main category: cs.LG

TL;DR: Proposes Adam-Aware In-Run Data Shapley, a method for accurate data attribution under Adam optimizer, addressing limitations of SGD-based methods that fail with adaptive optimizers.

DetailsMotivation: Existing data attribution methods rely on SGD's linear structure and fail with adaptive optimizers like Adam, which are widely used in modern ML. SGD-based proxies diverge significantly from true contributions under Adam, making them ineffective for modern training pipelines.

Method: Derives closed-form approximation that restores additivity by redefining utility under fixed-state assumption, and enables scalable computation via Linearized Ghost Approximation that linearizes variance-dependent scaling term to compute pairwise gradient dot-products without materializing per-sample gradients.

Result: Achieves near-perfect fidelity to ground-truth marginal contributions (Pearson R > 0.99) while retaining ~95% of standard training throughput. Significantly outperforms SGD-based baselines in data attribution downstream tasks.

Conclusion: Data attribution is optimizer-dependent, and the proposed Adam-aware method effectively bridges the gap for modern training pipelines using adaptive optimizers, enabling reliable data attribution without sacrificing computational efficiency.

Abstract: Reliable data attribution is essential for mitigating bias and reducing computational waste in modern machine learning, with the Shapley value serving as the theoretical gold standard. While recent “In-Run” methods bypass the prohibitive cost of retraining by estimating contributions dynamically, they heavily rely on the linear structure of Stochastic Gradient Descent (SGD) and fail to capture the complex dynamics of adaptive optimizers like Adam. In this work, we demonstrate that data attribution is inherently optimizer-dependent: we show that SGD-based proxies diverge significantly from true contributions under Adam (Pearson $R \approx 0.11$), rendering them ineffective for modern training pipelines. To bridge this gap, we propose Adam-Aware In-Run Data Shapley. We derive a closed-form approximation that restores additivity by redefining utility under a fixed-state assumption and enable scalable computation via a novel Linearized Ghost Approximation. This technique linearizes the variance-dependent scaling term, allowing us to compute pairwise gradient dot-products without materializing per-sample gradients. Extensive experiments show that our method achieves near-perfect fidelity to ground-truth marginal contributions ($R > 0.99$) while retaining $\sim$95% of standard training throughput. Furthermore, our Adam-aware attribution significantly outperforms SGD-based baselines in data attribution downstream tasks.
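
A naive sketch of the preconditioned dot-product structure behind the method; everything here is a simplification (per-sample gradients are materialized, and the preconditioner uses only the second-moment estimate, ignoring momentum and bias correction), whereas the paper's Linearized Ghost Approximation avoids per-sample gradients entirely:

```python
import torch

def adam_aware_scores(per_sample_grads, val_grad, exp_avg_sq, eps=1e-8):
    """Under Adam, updates are preconditioned by 1/(sqrt(v)+eps), so a
    sample's first-order marginal effect on the validation loss is a
    preconditioned gradient dot-product."""
    precond = 1.0 / (exp_avg_sq.sqrt() + eps)   # Adam's diagonal preconditioner
    direction = val_grad * precond
    return per_sample_grads @ direction          # one score per sample

n, d = 8, 1_000
g = torch.randn(n, d)                            # per-sample training gradients
scores = adam_aware_scores(g, torch.randn(d), torch.rand(d))
print(scores.argsort(descending=True)[:3])       # most helpful samples first
```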

[475] Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset

Atalay Denknalbant, Emre Sezdi, Zeki Furkan Kutlu

Main category: cs.LG

TL;DR: Synthetic dataset creation and ML modeling to evaluate underbanked borrowers using alternative behavioral data instead of traditional credit bureau records

DetailsMotivation: Financial exclusion constrains entrepreneurship and widens wealth gaps; underbanked consumers lack bureau files due to informal earnings/payments, requiring alternative evaluation methods

Method: Created synthetic dataset of 100K Istanbul residents reproducing census marginals and telecom patterns using RAG with OpenAI o3; trained CatBoost, LightGBM, XGBoost models with/without alternative behavioral attributes

Result: Alternative behavioral data raised AUC by ~1.3 percentage points and boosted balanced F1 from ~0.84 to 0.95 (14% gain), approaching bureau-level discrimination power

Conclusion: Concise behavioral attributes can effectively evaluate underbanked borrowers, providing lenders/regulators with transparent blueprint for fair credit access extension

Abstract: Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated, we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first-quarter 2025 TÜİK (TURKSTAT) census marginals and telecom usage patterns. Retrieval-augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio-demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data, CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio-demographic variables; Full models include both socio-demographic and alternative attributes. Across five-fold stratified validation, the alternative block raises the area under the curve by about 1.3 percentage points and lifts balanced F1 from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau-level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.
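
The two-version comparison is a standard ablation. A self-contained sketch on synthetic stand-in data, with sklearn's GradientBoostingClassifier substituting for CatBoost/LightGBM/XGBoost (the label model below is invented purely for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 2_000
demo = rng.normal(size=(n, 7))                # socio-demographic block
alt = rng.normal(size=(n, 9))                 # alternative behavioral block
# hypothetical default label that partly depends on the behavioral block
y = ((demo[:, 0] + alt[:, :3].sum(axis=1) + rng.normal(size=n)) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, X in [("Demo (7 vars)", demo),
                ("Full (7 + 9 vars)", np.hstack([demo, alt]))]:
    auc = cross_val_score(GradientBoostingClassifier(), X, y,
                          cv=cv, scoring="roc_auc")
    print(f"{name:18s} AUC = {auc.mean():.3f}")
```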

[476] FLAME: Flow Enhanced Legendre Memory Models for General Time Series Forecasting

Xingjian Wu, Hanyin Cheng, Xiangfei Qiu, Zhengyu Li, Jilin Hu, Chenjuan Guo, Bin Yang

Main category: cs.LG

TL;DR: FLAME is a lightweight time series foundation model family that supports both deterministic and probabilistic forecasting using Legendre Memory variants and Normalization Flow for efficient long-range inference.

DetailsMotivation: To create extremely lightweight yet capable time series foundation models that can handle both deterministic and probabilistic forecasting efficiently while maintaining strong generalization capabilities.

Method: Uses Legendre Memory variants (LegT and LegS) in encoding/decoding phases to capture inductive bias and enable efficient long-range inference. Employs Normalization Flow based forecasting head for generative probabilistic modeling of complex distributions.

Result: Achieves state-of-the-art zero-shot performance on both deterministic and probabilistic forecasting tasks across TSFM-Bench and ProbTS benchmarks.

Conclusion: FLAME provides an efficient, lightweight foundation model for time series forecasting that excels in both deterministic and probabilistic settings with strong generalization capabilities.

Abstract: In this work, we introduce FLAME, a family of extremely lightweight and capable Time Series Foundation Models, which support both deterministic and probabilistic forecasting via generative probabilistic modeling, thus ensuring both efficiency and robustness. FLAME utilizes the Legendre Memory for strong generalization capabilities. Through adapting variants of Legendre Memory, i.e., translated Legendre (LegT) and scaled Legendre (LegS), in the Encoding and Decoding phases, FLAME can effectively capture the inherent inductive bias within data and make efficient long-range inferences. To enhance the accuracy of probabilistic forecasting while remaining efficient, FLAME adopts a Normalization Flow based forecasting head, which can model arbitrarily intricate distributions over the forecasting horizon in a generative manner. Comprehensive experiments on well-recognized benchmarks, including TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art zero-shot performance of FLAME on both deterministic and probabilistic forecasting tasks.
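
For reference, here is the scaled-Legendre (LegS) state matrix from the HiPPO framework, on which Legendre-memory models of this kind build; this follows the standard construction popularized by the S4 line of work, and FLAME's exact parameterization may differ:

```python
import numpy as np

def hippo_legs(N):
    """Standard HiPPO-LegS operator: state matrix A and input vector B."""
    p = np.sqrt(1 + 2 * np.arange(N))
    A = np.tril(p[:, None] * p[None, :]) - np.diag(np.arange(N))
    return -A, p

A, B = hippo_legs(8)
# the time-scaled recurrence x' = (1/t) (A x + B u) compresses the full
# input history onto Legendre-polynomial coefficients, giving the long-range
# memory the abstract attributes to LegS; LegT uses a sliding-window variant
print(A.shape, np.diag(A))   # diagonal is -(n + 1)
```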

[477] Cardinality-Preserving Structured Sparse Graph Transformers for Molecular Property Prediction

Abhijit Gupta

Main category: cs.LG

TL;DR: CardinalGraphFormer is a graph transformer for molecular property prediction that incorporates structural biases and cardinality-preserving aggregation, achieving state-of-the-art results on 11 molecular benchmarks.

DetailsMotivation: Drug discovery requires efficient molecular property prediction with limited labeled data due to the vast chemical space (~10^60 molecules) vs. only thousands of approved drugs. Self-supervised pretraining on large unlabeled molecular corpora is essential for data-efficient representation learning.

Method: Graph transformer incorporating Graphormer-inspired structural biases (shortest-path distance, centrality, direct-bond edge bias) within structured sparse attention limited to shortest-path distance ≤3. Adds cardinality-preserving unnormalized aggregation channel over same support set. Pretraining combines contrastive graph-level alignment with masked attribute reconstruction.

Result: Improves mean performance across all 11 evaluated tasks, achieves statistically significant gains on 10 of 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET tasks compared to strong reproduced baselines under fully matched evaluation protocol.

Conclusion: CardinalGraphFormer demonstrates superior molecular representation learning through structured sparse attention with structural biases and cardinality-preserving aggregation, advancing data-efficient molecular property prediction for drug discovery.

Abstract: Drug discovery motivates efficient molecular property prediction under limited labeled data. Chemical space is vast, often estimated at approximately 10^60 drug-like molecules, while only thousands of drugs have been approved. As a result, self-supervised pretraining on large unlabeled molecular corpora has become essential for data-efficient molecular representation learning. We introduce CardinalGraphFormer, a graph transformer that incorporates Graphormer-inspired structural biases, including shortest-path distance and centrality, as well as direct-bond edge bias, within a structured sparse attention regime limited to shortest-path distance <= 3. The model further augments this design with a cardinality-preserving unnormalized aggregation channel over the same support set. Pretraining combines contrastive graph-level alignment with masked attribute reconstruction. Under a fully matched evaluation protocol, CardinalGraphFormer improves mean performance across all 11 evaluated tasks and achieves statistically significant gains on 10 of 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET tasks when compared to strong reproduced baselines.
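
A toy sketch of the two aggregation channels over the SPD ≤ 3 support set, using networkx for shortest-path distances (the toy graph and feature sizes are illustrative):

```python
import torch
import networkx as nx

def spd_mask(G, cutoff=3):
    """Boolean attention mask: allow pairs with shortest-path distance <= cutoff."""
    n = G.number_of_nodes()
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i, dists in nx.all_pairs_shortest_path_length(G, cutoff=cutoff):
        for j in dists:
            mask[i, j] = True
    return mask

G = nx.path_graph(6)                       # toy molecular graph
mask = spd_mask(G)                         # nodes 0 and 5 (SPD 5) stay masked
h = torch.randn(6, 16)                     # node features
scores = (h @ h.T).masked_fill(~mask, float("-inf"))
attn_out = torch.softmax(scores, dim=-1) @ h   # normalized attention channel
card_out = mask.float() @ h                    # cardinality-preserving sum channel
print(attn_out.shape, card_out.shape)
```

The unnormalized sum channel retains neighborhood-size (cardinality) information that softmax normalization discards, which is presumably why the model pairs it with standard attention over the same support set.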

[478] Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension. III

Vladimir G. Pestov

Main category: cs.LG

TL;DR: The paper establishes the complete characterization of metric spaces where k-nearest neighbor classification is universally consistent, linking it to dimension theory and measure differentiation properties.

DetailsMotivation: To determine the precise conditions under which the k-nearest neighbor classifier achieves universal consistency in metric spaces, connecting this to fundamental properties in real analysis and dimension theory.

Method: Mathematical analysis establishing equivalences between: (1) universal consistency of k-NN, (2) strong Lebesgue-Besicovitch differentiation property, (3) sigma-finite dimensionality in Nagata’s sense, and (4) Cover-Hart property.

Result: Proved the missing implication (1)⇒(3), completing the equivalence chain. Showed weak Lebesgue-Besicovitch property is insufficient for consistency, and that even on the real line, certain equivalent metrics can break k-NN consistency.

Conclusion: The paper provides a complete characterization of metric spaces where k-nearest neighbor classification works universally, linking statistical learning theory with geometric measure theory and dimension theory.

Abstract: We establish the last missing link allowing to describe those complete separable metric spaces $X$ in which the $k$ nearest neighbour classifier is universally consistent, both in combinatorial terms of dimension theory and via a fundamental property of real analysis. The following are equivalent: (1) The $k$-nearest neighbour classifier is universally consistent in $X$, (2) The strong Lebesgue–Besicovitch differentiation property holds in $X$ for every locally finite Borel measure, (3) $X$ is sigma-finite dimensional in the sense of Jun-Iti Nagata. The equivalence (2)$\iff$(3) was announced by Preiss (1983), while a detailed proof of the implication (3)$\Rightarrow$(2) has only appeared in Assouad and Quentin de Gromard (2006). The implication (2)$\Rightarrow$(1) was established by Cérou and Guyader (2006). We prove the implication (1)$\Rightarrow$(3). We further show that the weak (instead of strong) Lebesgue–Besicovitch property is insufficient for the consistency of the $k$-NN rule, as witnessed, for example, by the Heisenberg group (here we correct a wrong claim made in the previous article (Kumari and Pestov 2024)). A bit counter-intuitively, there is a metric on the real line uniformly equivalent to the usual distance but under which the $k$-NN classifier fails. Finally, another equivalent condition that can be added to the above is the Cover–Hart property: (4) the error of the $1$-nearest neighbour classifier is asymptotically at most twice as bad as the Bayes error.

[479] D$^2$Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs

Xianglong Yan, ChengZhu Bao, Zhiteng Li, Tianao Zhang, Shaoqiu Zhang, Ruobing Xie, Samm Sun, Yulun Zhang

Main category: cs.LG

TL;DR: D²Quant: A weight-only post-training quantization framework for LLMs that improves sub-4-bit precision through dual-scale quantization for down-projection matrices and deviation-aware activation correction.

DetailsMotivation: LLMs have high compute and memory costs making deployment difficult in resource-constrained scenarios. Weight-only PTQ is appealing but suffers from accuracy degradation at sub-4-bit precision due to: (1) down-projection matrices being quantization bottlenecks, and (2) weight quantization inducing activation deviations without effective correction strategies.

Method: Proposes D²Quant with two key components: 1) Dual-Scale Quantizer (DSQ) tailored to down-projection matrices with absorbable scaling factor, and 2) Deviation-Aware Correction (DAC) incorporating mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts.

Result: Extensive experiments across multiple LLM families and evaluation metrics show D²Quant delivers superior performance for weight-only PTQ at sub-4-bit precision compared to existing methods.

Conclusion: D²Quant effectively addresses key challenges in weight-only PTQ for LLMs, enabling practical deployment in resource-constrained scenarios while maintaining accuracy at sub-4-bit precision.

Abstract: Large language models (LLMs) deliver strong performance, but their high compute and memory costs make deployment difficult in resource-constrained scenarios. Weight-only post-training quantization (PTQ) is appealing, as it reduces memory usage and enables practical speedup without low-bit operators or specialized hardware. However, accuracy often degrades significantly in weight-only PTQ at sub-4-bit precision, and our analysis identifies two main causes: (1) down-projection matrices are a well-known quantization bottleneck, but maintaining their fidelity often requires extra bit-width; (2) weight quantization induces activation deviations, but effective correction strategies remain underexplored. To address these issues, we propose D$^2$Quant, a novel weight-only PTQ framework that improves quantization from both the weight and activation perspectives. On the weight side, we design a Dual-Scale Quantizer (DSQ) tailored to down-projection matrices, with an absorbable scaling factor that significantly improves accuracy without increasing the bit budget. On the activation side, we propose Deviation-Aware Correction (DAC), which incorporates a mean-shift correction within LayerNorm to mitigate quantization-induced activation distribution shifts. Extensive experiments across multiple LLM families and evaluation metrics show that D$^2$Quant delivers superior performance for weight-only PTQ at sub-4-bit precision. The code and models will be available at https://github.com/XIANGLONGYAN/D2Quant.
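
The summary leaves DSQ's exact construction open, so the following is only a generic two-scale sketch of the mechanics: a per-column scale (the kind of factor that could be absorbed into an adjacent layer) wrapped around ordinary per-group min-max quantization. Both design choices are assumptions for illustration, not the paper's quantizer:

```python
import torch

def dual_scale_quantize(W, bits=3, group=64):
    """Quantize-dequantize with two nested scales: a per-column scale s_col,
    then per-group min-max quantization of the rescaled weights."""
    s_col = W.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    Wn = W / s_col                                    # first (absorbable) scale
    levels = 2 ** bits - 1
    Wg = Wn.reshape(-1, group)
    lo = Wg.amin(dim=1, keepdim=True)
    s_grp = (Wg.amax(dim=1, keepdim=True) - lo).clamp(min=1e-8) / levels
    q = ((Wg - lo) / s_grp).round().clamp(0, levels)  # integer codes
    return (q * s_grp + lo).reshape(W.shape) * s_col  # dequantized weights

W = torch.randn(256, 256)
err = (dual_scale_quantize(W) - W).norm() / W.norm()
print(f"relative reconstruction error at 3 bits: {err:.4f}")
```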

[480] To Grok Grokking: Provable Grokking in Ridge Regression

Mingyue Xu, Gal Vardi, Itay Safran

Main category: cs.LG

TL;DR: Theoretical analysis of grokking phenomenon in ridge regression showing delayed generalization after overfitting, with quantitative bounds on grokking time and hyperparameter control.

DetailsMotivation: To provide rigorous theoretical understanding of grokking - the phenomenon where generalization emerges long after overfitting occurs - in a tractable linear regression setting, and to show how hyperparameters affect this behavior.

Method: Theoretical analysis of over-parameterized linear regression models trained with gradient descent and weight decay, proving three-stage grokking behavior and deriving quantitative bounds on generalization delay (grokking time). Empirical validation on both linear models and non-linear neural networks.

Result: Proved end-to-end grokking results showing: (i) early overfitting, (ii) prolonged poor generalization, (iii) eventual near-perfect generalization. Derived first rigorous quantitative bounds on grokking time in terms of hyperparameters. Showed grokking can be amplified or eliminated through hyperparameter tuning.

Conclusion: Grokking is not an inherent failure mode of deep learning but rather a consequence of specific training conditions, and can be controlled without fundamental architectural or algorithmic changes.

Abstract: We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the “grokking time”) in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
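
The setting is simple enough to reproduce in a few lines. A sketch of over-parameterized linear regression trained by gradient descent with weight decay; the large initialization and small decay are illustrative choices that tend to exhibit the three stages (the paper characterizes the regimes precisely):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                                  # over-parameterized: d >> n
w_star = rng.normal(size=d) / np.sqrt(d)
X, X_te = rng.normal(size=(n, d)), rng.normal(size=(500, d))
y, y_te = X @ w_star, X_te @ w_star

w = rng.normal(size=d)                          # large init relative to w_star
lr, wd = 1e-2, 1e-2
for t in range(1, 100_001):
    w -= lr * (X.T @ (X @ w - y) / n + wd * w)  # GD with weight decay
    if t in (100, 1_000, 10_000, 30_000, 100_000):
        tr = np.mean((X @ w - y) ** 2)
        te = np.mean((X_te @ w - y_te) ** 2)
        print(f"step {t:>6}: train {tr:.1e}  test {te:.3f}")
```

Training error collapses within roughly a thousand steps, while test error only decays on the much slower 1/(lr·wd) timescale as weight decay shrinks the null-space component of w, which is the delayed-generalization signature the abstract formalizes.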

[481] Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success

Luca Zhou, Bo Zhao, Rose Yu, Emanuele Rodolà

Main category: cs.LG

TL;DR: Model merging success depends on both merging method and partner tasks, not just intrinsic model properties; gradient alignment and subspace overlap are key prerequisites for compatibility across methods.

DetailsMotivation: Current understanding of model merging treats mergeability as an intrinsic property, but the authors argue that success depends on both the merging method and the specific tasks being merged, requiring a more nuanced framework.

Method: Developed an architecture-agnostic framework using linear optimization over interpretable pairwise metrics (e.g., gradient L2 distance) to analyze mergeability across four different merging methods and identify success factors.

Result: Found substantial variation in success drivers across methods (46.7% metric overlap; 55.3% sign agreement), revealing method-specific “fingerprints,” but subspace overlap and gradient alignment consistently emerged as foundational prerequisites for compatibility.

Conclusion: Mergeability depends on both merging method and partner tasks, with gradient alignment and subspace overlap serving as method-agnostic prerequisites; these findings provide diagnostic tools and motivate fine-tuning strategies that explicitly encourage these properties.

Abstract: Model merging combines knowledge from separately fine-tuned models, yet success factors remain poorly understood. While recent work treats mergeability as an intrinsic property, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using linear optimization over a set of interpretable pairwise metrics (e.g., gradient L2 distance), we uncover properties correlating with post-merge performance across four merging methods. We find substantial variation in success drivers (46.7% metric overlap; 55.3% sign agreement), revealing method-specific “fingerprints”. Crucially, however, subspace overlap and gradient alignment metrics consistently emerge as foundational, method-agnostic prerequisites for compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future fine-tuning strategies that explicitly encourage these properties.
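
A sketch of the kind of interpretable pairwise metrics such a framework can feed to a linear model, computed from two fine-tuned checkpoints and their gradients; the specific metric set and the 64-row reshape are illustrative choices, not the paper's exact features:

```python
import torch

def pairwise_merge_metrics(theta_a, theta_b, theta_base, grad_a, grad_b):
    tau_a, tau_b = theta_a - theta_base, theta_b - theta_base  # task vectors
    m = {
        "grad_l2_distance": (grad_a - grad_b).norm().item(),
        "grad_alignment": torch.cosine_similarity(grad_a, grad_b, dim=0).item(),
        "task_vector_cos": torch.cosine_similarity(tau_a, tau_b, dim=0).item(),
    }
    # subspace overlap: fraction of tau_b's energy in tau_a's top directions
    Ua = torch.linalg.svd(tau_a.reshape(64, -1), full_matrices=False).U[:, :8]
    Tb = tau_b.reshape(64, -1)
    m["subspace_overlap"] = ((Ua.T @ Tb).norm() / Tb.norm()).item()
    return m

d = 64 * 32
base, a, b = torch.randn(d), torch.randn(d), torch.randn(d)
print(pairwise_merge_metrics(a, b, base, torch.randn(d), torch.randn(d)))
```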

[482] CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Zhiyuan Yao, Yi-Kai Zhang, Yuxin Chen, Yueqing Sun, Zishan Xu, Yu Yang, Tianhao Hu, Qi Gu, Hui Su, Xunliang Cai

Main category: cs.LG

TL;DR: CoBA-RL is a reinforcement learning algorithm that adaptively allocates rollout budgets based on model capability, using a capability-oriented value function and heap-based greedy strategy to optimize computational resource distribution for LLM post-training efficiency.

DetailsMotivation: Standard RLVR frameworks use uniform rollout budgets causing resource inefficiency, and existing adaptive methods rely on instance-level metrics that fail to capture the model's dynamic learning state, necessitating a more intelligent budget allocation approach.

Method: CoBA-RL uses a Capability-Oriented Value function to map tasks to potential training gains, and employs a heap-based greedy strategy to self-calibrate computational resource distribution to samples with high training value.

Result: Extensive experiments show the approach effectively balances exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks.

Conclusion: Quantifying sample training value and optimizing budget allocation are crucial for advancing LLM post-training efficiency, with CoBA-RL demonstrating effective adaptive resource allocation.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key approach for enhancing LLM reasoning. However, standard frameworks like Group Relative Policy Optimization (GRPO) typically employ a uniform rollout budget, leading to resource inefficiency. Moreover, existing adaptive methods often rely on instance-level metrics, such as task pass rates, failing to capture the model’s dynamic learning state. To address these limitations, we propose CoBA-RL, a reinforcement learning algorithm designed to adaptively allocate rollout budgets based on the model’s evolving capability. Specifically, CoBA-RL utilizes a Capability-Oriented Value function to map tasks to their potential training gains and employs a heap-based greedy strategy to efficiently self-calibrate the distribution of computational resources to samples with high training value. Extensive experiments demonstrate that our approach effectively orchestrates the trade-off between exploration and exploitation, delivering consistent generalization improvements across multiple challenging benchmarks. These findings underscore that quantifying sample training value and optimizing budget allocation are pivotal for advancing LLM post-training efficiency.
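
The heap-based greedy allocation is the most mechanical piece and easy to sketch. Here a hypothetical diminishing-returns rule (value divided by allocated + 1) stands in for the paper's capability-oriented value function:

```python
import heapq

def allocate_rollouts(task_values, total_budget, max_per_task=16):
    """Repeatedly grant the next rollout to the task with the highest
    remaining marginal training value."""
    alloc = {t: 0 for t in task_values}
    heap = [(-v, t) for t, v in task_values.items()]  # max-heap via negation
    heapq.heapify(heap)
    for _ in range(total_budget):
        if not heap:
            break
        _, t = heapq.heappop(heap)
        alloc[t] += 1
        if alloc[t] < max_per_task:
            marginal = task_values[t] / (alloc[t] + 1)  # diminishing returns
            heapq.heappush(heap, (-marginal, t))
    return alloc

print(allocate_rollouts({"easy": 0.1, "medium": 0.9, "hard": 0.4}, total_budget=24))
```

High-value tasks receive most of the budget while low-value ones are not starved entirely, which is the exploration-exploitation balance the conclusion highlights.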

[483] SALAAD: Sparse And Low-Rank Adaptation via ADMM for Large Language Model Inference

Hao Ma, Melis Ilayda Bal, Liang Zhang, Bingcong Li, Niao He, Melanie Zeilinger, Michael Muehlebach

Main category: cs.LG

TL;DR: SALAAD is a plug-and-play framework for inducing sparse and low-rank structures in LLMs during training, enabling flexible control over model capacity and memory consumption without architectural modifications.

DetailsMotivation: Modern LLMs face compute and memory constraints, requiring flexible capacity control. Existing sparse/low-rank approaches rely on heuristic designs that ignore layer heterogeneity or need model-specific modifications.

Method: Formulates structured weight learning under augmented Lagrangian framework with adaptive controller that dynamically balances training loss and structural constraints, preserving training stability while controlling capacity evolution.

Result: Substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Single training run yields continuous spectrum of model capacities for elastic deployment across memory budgets.

Conclusion: SALAAD provides a practical plug-and-play solution for flexible model capacity control, enabling efficient deployment of LLMs under varying resource constraints without retraining.

Abstract: Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and low-rank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.
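
To illustrate the augmented-Lagrangian splitting flavor, here is a generic ADMM-style alternating proximal scheme that decomposes a weight matrix into a sparse plus a low-rank part; SALAAD's actual constraints, adaptive controller, and coupling with the training loss are richer than this static sketch:

```python
import torch

def soft_threshold(x, tau):                      # prox of the l1 norm
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

def svd_shrink(x, tau):                          # prox of the nuclear norm
    U, s, Vh = torch.linalg.svd(x, full_matrices=False)
    return U @ torch.diag(torch.clamp(s - tau, min=0.0)) @ Vh

def sparse_plus_lowrank(W, lam=0.05, mu=0.5, rho=1.0, iters=100):
    S, L, U = (torch.zeros_like(W) for _ in range(3))
    for _ in range(iters):
        S = soft_threshold(W - L + U, lam / rho)   # sparse block update
        L = svd_shrink(W - S + U, mu / rho)        # low-rank block update
        U = U + (W - S - L)                        # scaled dual update
    return S, L

W = torch.randn(32, 32)
S, L = sparse_plus_lowrank(W)
print(f"sparsity of S: {(S != 0).float().mean().item():.2f}, "
      f"rank of L: {int(torch.linalg.matrix_rank(L))}")
```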

[484] CORE: Context-Robust Remasking for Diffusion Language Models

Kevin Zhai, Sabbir Mollah, Zhenyi Wang, Mubarak Shah

Main category: cs.LG

TL;DR: CORE is a training-free inference-time revision framework for Masked Diffusion Models that identifies context-brittle tokens through targeted masked-context perturbations rather than relying on static confidence scores.

DetailsMotivation: Standard decoding in Masked Diffusion Models suffers from context rigidity where tokens are retained based on transient high confidence, ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide remaining generation. Existing revision strategies rely on static confidence scores which are inherently myopic.

Method: CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations rather than trusting static token probabilities. It formalizes revision as a robust optimization objective over context shifts and efficiently approximates this objective to prioritize unstable tokens for revision.

Result: On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.

Conclusion: CORE provides an effective training-free framework for inference-time revision in Masked Diffusion Models by addressing context rigidity through dynamic assessment of token stability rather than static confidence measures.

Abstract: Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, often ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide the remaining generation. Existing revision strategies attempt to mitigate this by relying on static confidence scores, but these signals are inherently myopic; inconsistent tokens can appear confident to the model itself. We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. We formalize revision as a robust optimization objective over context shifts and efficiently approximate this objective to prioritize unstable tokens for revision. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
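
A sketch of the probing idea: score each decoded position by how much its predictive distribution moves under random masked-context perturbations, then remask the highest scorers. The toy model, probe count, and KL-based score below are illustrative; the paper derives its actual criterion as a robust optimization over context shifts:

```python
import torch

def brittleness_scores(tokens, mask_id, logits_fn, n_probes=4, mask_frac=0.15):
    base = torch.log_softmax(logits_fn(tokens), dim=-1)
    scores = torch.zeros(tokens.shape[-1])
    for _ in range(n_probes):
        probe = tokens.clone()
        flip = torch.rand(tokens.shape[-1]) < mask_frac
        probe[..., flip] = mask_id                     # perturb: remask context
        pert = torch.log_softmax(logits_fn(probe), dim=-1)
        kl = (base.exp() * (base - pert)).sum(dim=-1)  # KL(base || perturbed)
        scores += kl.squeeze(0) / n_probes
    return scores                                      # high = context-brittle

vocab, seq = 50, 12
emb = torch.randn(vocab, vocab)
def toy_logits(t):                 # toy 'model': logits depend on left neighbor
    return emb[t.roll(1, dims=-1)]

t = torch.randint(0, vocab, (1, seq))
print(brittleness_scores(t, mask_id=0, logits_fn=toy_logits).topk(3).indices)
```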

[485] Near-Optimal Dynamic Matching via Coarsening with Application to Heart Transplantation

Itai Zilberstein, Ioannis Anagnostides, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

Main category: cs.LG

TL;DR: Online matching algorithms using coarsening approach achieve near-optimal theoretical guarantees, applied to heart transplant allocation with significant performance improvement over status quo.

DetailsMotivation: Practical online matching algorithms in domains like Internet advertising and organ allocation often lack strong theoretical guarantees, creating a gap between data-driven heuristics and theoretical lower bounds.

Method: Develop new online matching algorithms based on a coarsening approach that aggregates offline nodes into capacitated clusters, maintaining near-optimal theoretical guarantees despite the aggregation.

Result: Applied to heart transplant allocation, the policy achieves competitive ratio 0.91 in simulations based on real data, drastically higher than the US status quo policy’s 0.51, closely matching omniscient benchmark performance.

Conclusion: Coarsening approach bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds, providing theoretically grounded policies with strong practical performance in online matching problems.

Abstract: Online matching has been a mainstay in domains such as Internet advertising and organ allocation, but practical algorithms often lack strong theoretical guarantees. We take an important step toward addressing this by developing new online matching algorithms based on a coarsening approach. Although coarsening typically implies a loss of granularity, we show that, to the contrary, aggregating offline nodes into capacitated clusters can yield near-optimal theoretical guarantees. We apply our methodology to heart transplant allocation to develop theoretically grounded policies based on structural properties of historical data. Furthermore, in simulations based on real data, our policy closely matches the performance of the omniscient benchmark, achieving competitive ratio 0.91, drastically higher than the US status quo policy’s 0.51. Our work bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds.
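
The coarsening mechanics are simple to show: aggregate offline nodes into a few capacitated type-clusters and match arrivals greedily against cluster capacities. Cluster names, capacities, and the weight function below are invented for illustration; the paper's policy and its competitive-ratio guarantee are more refined:

```python
import numpy as np

def coarsened_online_match(clusters, capacities, arrivals, weight):
    """Greedy online matching against capacitated offline clusters."""
    cap = dict(zip(clusters, capacities))
    total = 0.0
    for a in arrivals:
        feasible = [c for c in clusters if cap[c] > 0]
        if not feasible:
            break
        best = max(feasible, key=lambda c: weight(a, c))
        cap[best] -= 1
        total += weight(a, best)
    return total

rng = np.random.default_rng(0)
quality = {"low_risk": 1.0, "medium_risk": 0.7, "high_risk": 0.4}
arrivals = rng.uniform(0, 1, size=30)            # e.g. candidate urgency scores
value = coarsened_online_match(list(quality), [5, 10, 5], arrivals,
                               weight=lambda a, c: a * quality[c])
print(f"matched value: {value:.2f}")
```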

[486] QUATRO: Query-Adaptive Trust Region Policy Optimization for LLM Fine-tuning

Doyeon Lee, Eunyi Lyou, Hyunsoo Cho, Sookyung Kim, Joonseok Lee, Jaemoo Choi

Main category: cs.LG

TL;DR: QUATRO is a new RL-based LLM fine-tuning method that enforces exact trust-region constraints for more stable optimization compared to heuristic approximations in existing methods.

DetailsMotivation: Current GRPO-style RL fine-tuning algorithms rely on heuristic trust-region approximations that can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to properly regulate samples with importance ratios outside the clipping range.

Method: Proposes Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through principled optimization, yielding a clear interpretable objective with explicit control over policy updates and stable entropy-controlled optimization.

Result: Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates while maintaining well-controlled entropy throughout training.

Conclusion: QUATRO provides a more robust and stable approach to RL-based LLM fine-tuning by addressing the limitations of heuristic trust-region approximations in existing methods.

Abstract: GRPO-style reinforcement learning (RL)-based LLM fine-tuning algorithms have recently gained popularity. Relying on heuristic trust-region approximations, however, they can lead to brittle optimization behavior, as global importance-ratio clipping and group-wise normalization fail to regulate samples whose importance ratios fall outside the clipping range. We propose Query-Adaptive Trust-Region policy Optimization (QUATRO), which directly enforces trust-region constraints through a principled optimization. This yields a clear and interpretable objective that enables explicit control over policy updates and stable, entropy-controlled optimization, with stabilizer terms arising intrinsically from the exact trust-region formulation. Empirically verified on diverse mathematical reasoning benchmarks, QUATRO shows stable training under increased policy staleness and aggressive learning rates, maintaining well-controlled entropy throughout training.

[487] Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson

Main category: cs.LG

TL;DR: SAEs and LLM-summarizers analyze RL training dynamics in complex environments like Diplomacy, revealing fine-grained behaviors and strategic patterns, though human interpretability remains challenging.

DetailsMotivation: As LLMs are increasingly trained in complex RL/multi-agent environments, understanding how behavior changes during training becomes difficult. The paper aims to apply interpretability methods to analyze training dynamics in sophisticated environments like Full-Press Diplomacy.

Method: Uses pretrained Sparse Autoencoders (SAEs) and LLM-summarizer methods to analyze large-scale RL training runs. Introduces Meta-Autointerp for grouping SAE features into interpretable hypotheses about training dynamics. Validates through automated evaluation and user studies.

Result: Discovered fine-grained behaviors (role-playing patterns, degenerate outputs, language switching) and high-level strategic behaviors/environment bugs. 90% of discovered SAE Meta-Features were significant, including surprising reward hacking. User studies showed most SAE features and LLM hypotheses were not helpful to humans, but a subset were predictively useful. Augmenting an untrained agent’s system prompt improved score by +14.2%.

Conclusion: SAEs and LLM-summarizers provide complementary views into agent behavior, forming a practical starting point for data-centric interpretability work on ensuring trustworthy LLM behavior throughout training, though human interpretability challenges remain.

Abstract: Large language models (LLMs) are increasingly trained in complex Reinforcement Learning, multi-agent environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs, alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors including role-playing patterns, degenerate outputs, language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, along with most LLM-generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses are predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent’s system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM-summarizers provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.

[488] Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics

Ruizhe Zhong, Jiesong Lian, Xiaoyue Mi, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Junchi Yan

Main category: cs.LG

TL;DR: Euphonium: A novel RL framework for aligning flow matching models with human preferences using process reward gradient guided dynamics to steer generation and improve exploration efficiency.

DetailsMotivation: Current online RL approaches for aligning flow matching models suffer from inefficient exploration during training rollouts, relying on undirected stochasticity and sparse outcome rewards, which leads to slow discovery of high-reward samples and data-inefficient optimization.

Method: Formulates sampling as a Stochastic Differential Equation incorporating Process Reward Model gradients into flow drift, enabling dense step-by-step steering. Uses Dual-Reward Group Relative Policy Optimization combining latent process rewards for credit assignment with pixel-level outcome rewards for visual fidelity. Includes distillation to internalize guidance into flow network.

Result: Achieves better alignment than existing methods in text-to-video generation while accelerating training convergence by 1.66x.

Conclusion: Euphonium provides a principled framework for efficient RL-based alignment of flow matching models through process reward gradient guidance, overcoming exploration limitations of prior methods.

Abstract: While online Reinforcement Learning has emerged as a crucial technique for aligning flow matching models with human preferences, current approaches are hindered by inefficient exploration during training rollouts. Relying on undirected stochasticity and sparse outcome rewards, these methods struggle to discover high-reward samples, resulting in data-inefficient and slow optimization. To address these limitations, we propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled Stochastic Differential Equation that explicitly incorporates the gradient of a Process Reward Model into the flow drift. This design enables dense, step-by-step steering toward high-reward regions, advancing beyond the unguided exploration in prior works, and theoretically encompasses existing sampling methods (e.g., Flow-GRPO, DanceGRPO) as special cases. We further derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model. We instantiate this framework with a Dual-Reward Group Relative Policy Optimization algorithm, combining latent process rewards for efficient credit assignment with pixel-level outcome rewards for final visual fidelity. Experiments on text-to-video generation show that Euphonium achieves better alignment compared to existing methods while accelerating training convergence by 1.66x. Our code is available at https://github.com/zerzerzerz/Euphonium
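
The guided dynamics reduce to adding a reward-gradient term to the flow drift in each SDE integration step. A toy Euler-Maruyama sketch with stand-in velocity and reward functions; the guidance weight and noise scale are illustrative, and the paper derives the principled SDE and later distills the guidance into the network:

```python
import torch

def guided_sde_step(x, t, dt, velocity_fn, reward_fn, guidance=1.0, noise=0.1):
    """dx = [v(x,t) + guidance * grad_x r(x,t)] dt + noise * sqrt(dt) dW."""
    x = x.detach().requires_grad_(True)
    grad_r = torch.autograd.grad(reward_fn(x, t).sum(), x)[0]  # dense steering
    drift = velocity_fn(x, t) + guidance * grad_r
    return (x + drift * dt + noise * dt ** 0.5 * torch.randn_like(x)).detach()

# toy run: steer 2-D samples toward the high-reward region near the origin
x = torch.randn(128, 2)
for step in range(10):
    x = guided_sde_step(
        x, step / 10, dt=0.1,
        velocity_fn=lambda x, t: torch.zeros_like(x),   # stand-in flow drift
        reward_fn=lambda x, t: -(x ** 2).sum(dim=-1),   # stand-in process reward
    )
print(x.norm(dim=-1).mean())    # mean norm shrinks step by step
```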

[489] Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives

Ioannis Anagnostides, Itai Zilberstein, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm

Main category: cs.LG

TL;DR: Position paper arguing that organ allocation should be treated as an incentive-aware game rather than just an optimization problem, calling for integration of mechanism design and strategic ML approaches.

DetailsMotivation: Current organ allocation approaches overlook incentive misalignments among stakeholders (organ procurement organizations, transplant centers, clinicians, patients, regulators), leading to adverse consequences despite the transition to ML and data-driven optimization.

Method: Position paper analysis focusing on US adult heart transplant allocation, identifying critical incentive misalignments across the decision-making pipeline and presenting data showing current adverse effects.

Result: Demonstrates that incentive misalignments are having adverse consequences in current organ allocation systems, highlighting the need for incentive-aware approaches.

Conclusion: Next generation of allocation policies should be incentive-aware, requiring integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior.

Abstract: The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely an optimization problem, but rather a complex game involving organ procurement organizations, transplant centers, clinicians, patients, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive-aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, fairness, and trust in the face of strategic behavior from the various constituent groups.

[490] Robust Federated Learning via Byzantine Filtering over Encrypted Updates

Adda Akram Bendoukha, Aymen Boudguiga, Nesrine Kaaniche, Renaud Sirdey, Didem Demirag, Sébastien Gambs

Main category: cs.LG

TL;DR: FLASH combines homomorphic encryption for privacy-preserving aggregation with meta-classifiers for Byzantine filtering in federated learning

DetailsMotivation: Federated learning faces privacy and security challenges including inference attacks and Byzantine behaviors; existing solutions don't adequately address both secure aggregation and Byzantine resilience simultaneously

Method: Uses homomorphic encryption (CKKS cryptosystem) for privacy-preserving aggregation combined with property-inference-inspired meta-classifiers trained on labeled shadow updates to detect and filter Byzantine behaviors; includes automated hyperparameter optimization for homomorphic inference

Result: Achieves 90-94% accuracy in identifying Byzantine updates with marginal utility loss; encrypted inference runtimes of 6-24 seconds for individual updates and 9-26 seconds for overall aggregation on FEMNIST, CIFAR10, GTSRB, and acsincome benchmarks

Conclusion: Proposed approach effectively addresses both privacy preservation and Byzantine resilience in federated learning through combined homomorphic encryption and meta-classifier filtering

Abstract: Federated Learning (FL) aims to train a collaborative model while preserving data privacy. However, the distributed nature of this approach still raises privacy and security issues, such as the exposure of sensitive data due to inference attacks and the influence of Byzantine behaviors on the trained model. In particular, achieving both secure aggregation and Byzantine resilience remains challenging, as existing solutions often address these aspects independently. In this work, we propose to address these challenges through a novel approach that combines homomorphic encryption for privacy-preserving aggregation with property-inference-inspired meta-classifiers for Byzantine filtering. First, following the property-inference attacks blueprint, we train a set of filtering meta-classifiers on labeled shadow updates, reproducing a diverse ensemble of Byzantine misbehaviors in FL, including backdoor, gradient-inversion, label-flipping and shuffling attacks. The outputs of these meta-classifiers are then used to cancel the Byzantine encrypted updates by reweighting. Second, we propose an automated method for selecting the optimal kernel and the dimensionality hyperparameters with respect to homomorphic inference, aggregation constraints and efficiency over the CKKS cryptosystem. Finally, we demonstrate through extensive experiments the effectiveness of our approach against Byzantine participants on the FEMNIST, CIFAR10, GTSRB, and acsincome benchmarks. More precisely, our SVM filtering achieves accuracies between $90$% and $94$% for identifying Byzantine updates, at the cost of marginal losses in model utility, with encrypted inference runtimes ranging from $6$ to $24$ seconds per individual update and from $9$ to $26$ seconds for the overall aggregation.
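
The filtering side is easy to sketch in the clear; the homomorphic layer (CKKS inference over encrypted updates) is omitted here. The shadow updates, the simulated attack, and the 0/1 reweighting are all illustrative stand-ins:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def shadow_updates(n, d, byzantine):
    g = rng.normal(size=(n, d))
    return -3.0 * g if byzantine else g   # crude stand-in for a real attack

d = 32
X = np.vstack([shadow_updates(200, d, False), shadow_updates(200, d, True)])
y = np.array([0] * 200 + [1] * 200)
meta = SVC(kernel="rbf").fit(X, y)        # filtering meta-classifier

# aggregation time: cancel flagged updates by reweighting them to zero
updates = np.vstack([shadow_updates(8, d, False), shadow_updates(2, d, True)])
w = 1.0 - meta.predict(updates)           # 0 for updates flagged Byzantine
aggregate = (w[:, None] * updates).sum(axis=0) / max(w.sum(), 1)
print("kept updates:", int(w.sum()), "of", len(w))
```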

[491] CSRv2: Unlocking Ultra-Sparse Embeddings

Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You

Main category: cs.LG

TL;DR: CSRv2 improves ultra-sparse embeddings via progressive k-annealing and supervised contrastive objectives, reducing dead neurons from 80% to 20% and achieving 14% accuracy gain at k=2 while maintaining efficiency.

DetailsMotivation: Current dense embeddings are high-dimensional and computationally expensive, while existing sparse methods like CSR suffer severe degradation in ultra-sparse regimes where most neurons remain inactive, limiting efficiency gains.

Method: CSRv2 uses progressive k-annealing to stabilize sparsity learning, supervised contrastive objectives to enhance representation quality, and full backbone finetuning for end-to-end adaptability.
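
A minimal sketch of the k-annealing idea: a top-k sparsification whose active-feature budget decays from a loose to an ultra-sparse level over training. The linear schedule and all dimensions are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of top-k sparsification with progressive k-annealing.
import numpy as np

def topk_sparsify(z, k):
    """Keep the k largest-magnitude activations per row, zero the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(-np.abs(z), axis=1)[:, :k]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out

def annealed_k(step, total_steps, k_start=64, k_end=2):
    """Linearly anneal the sparsity level from k_start down to k_end."""
    t = min(step / max(total_steps, 1), 1.0)
    return max(int(round(k_start + t * (k_end - k_start))), k_end)

z = np.random.randn(4, 256)                    # dense embeddings from the backbone
for step in (0, 5000, 10000):
    k = annealed_k(step, total_steps=10000)
    s = topk_sparsify(z, k)                    # k-sparse embedding used downstream
    print(step, k, int((s != 0).sum(axis=1)[0]))
```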

Result: CSRv2 reduces dead neurons from 80% to 20%, achieves 14% accuracy gain at k=2, matches CSR at k=8 and MRL at 32 dimensions with only two active features, and provides 7x speedup over MRL with up to 300x compute/memory efficiency improvements.

Conclusion: CSRv2 makes ultra-sparse embeddings practical without performance compromise, enabling real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.

Abstract: In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these costs, Contrastive Sparse Representation (CSR) has recently been proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL, and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance, where CSRv2 achieves 7%/4% improvement over CSR when k=4 and further increases this gap to 14%/6% when k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.

[492] BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs

Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela

Main category: cs.LG

TL;DR: A tournament graph framework for efficient k-wise ranking that aggregates pairwise preferences from k-item comparisons to identify top-m items with fewer oracle calls, achieving significant token reduction in LLM reranking applications.

DetailsMotivation: Existing methods for selecting top-m items from n items via expensive k-wise comparisons either rely on inefficient heuristics or fail to fully exploit the information each comparison reveals, creating a need for more efficient and principled approaches.

Method: Introduces a tournament graph framework where each k-item comparison reveals a complete tournament of pairwise preferences; aggregates these into a global preference graph, computes transitive closure to infer additional orderings without oracle calls, and designs greedy query schedules to maximize information gain for identifying top-m items.
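
A small sketch of the core bookkeeping: each k-wise comparison contributes all of its pairwise preferences to a global graph, and a boolean transitive closure (Warshall's algorithm) infers extra orderings without further oracle calls. The greedy query scheduling is omitted.

```python
# Hedged sketch: aggregate k-wise comparisons into a preference graph and
# take its transitive closure to infer additional orderings for free.
import numpy as np

def add_comparison(pref, ranked_items):
    """A k-wise oracle call returns items best-to-worst; record all C(k,2) pairs."""
    for i, a in enumerate(ranked_items):
        for b in ranked_items[i + 1:]:
            pref[a, b] = True                  # a is preferred to b
    return pref

def transitive_closure(pref):
    """Boolean Warshall: if a>b and b>c then infer a>c without an oracle call."""
    closed = pref.copy()
    n = closed.shape[0]
    for k in range(n):
        closed |= closed[:, k:k + 1] & closed[k:k + 1, :]
    return closed

n = 6
pref = np.zeros((n, n), dtype=bool)
pref = add_comparison(pref, [2, 0, 4])         # one 3-wise comparison
pref = add_comparison(pref, [4, 1, 5])         # another, sharing item 4
closure = transitive_closure(pref)
print(closure[2, 5])                           # True: 2>4 and 4>5 imply 2>5
```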

Result: Applied to LLM reranking across 14 benchmarks and 5 models, the method achieves Pareto dominance: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable methods, and 7× fewer tokens than pairwise reranking at near-identical quality.

Conclusion: The tournament graph framework provides a principled foundation for k-wise ranking that efficiently exploits information from comparisons, significantly reducing computational costs while maintaining or improving accuracy in applications like LLM reranking.

Abstract: Selecting the top $m$ from $n$ items via expensive $k$-wise comparisons is fundamental to settings ranging from LLM-based document reranking to crowdsourced evaluation and tournament design. Existing methods either rely on heuristics that fail to fully exploit the information each comparison reveals, or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise ranking. Our key observation is that each $k$-item comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences; aggregating these into a global preference graph and computing its transitive closure yields many additional orderings without further oracle calls. We formalize when an item’s rank is certifiably determined and design a greedy query schedule that maximizes information gain towards identifying the top-$m$ items. The framework also gracefully handles non-transitive preferences (cycles induced by real-world oracles) by collapsing them into equivalence classes that yield principled tiered rankings. Applied to LLM reranking across 14 benchmarks and 5 models, our method achieves Pareto dominance over existing approaches: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable methods, and $7\times$ fewer than pairwise reranking at near-identical quality.

cs.MA

[493] Communication Enhances LLMs’ Stability in Strategic Thinking

Nunzio Lore, Babak Heydari

Main category: cs.MA

TL;DR: LLMs show strategic instability in repeated Prisoner’s Dilemma; cheap-talk pre-play messages reduce trajectory noise and improve predictability in multi-agent systems.

DetailsMotivation: Large Language Models exhibit unpredictable strategic behavior in multi-agent settings, which undermines their reliability for tasks requiring strategic thinking. The researchers investigate whether simple, costless pre-play communication (cheap-talk) can stabilize LLM behavior in repeated game scenarios.

Method: Studied LLMs (7-9B parameters) in ten-round repeated Prisoner’s Dilemma with/without pre-play messaging. Used simulation-level bootstrap resampling, nonparametric inference, and LOWESS regression to analyze cooperation trajectories across messaging vs. no-messaging conditions.
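
A hedged sketch of the analysis pipeline: LOWESS-smoothed cooperation trajectories per condition, with simulation-level bootstrap resampling of an RMS-residual noise statistic. The data shapes and the exact noise measure are assumptions for illustration.

```python
# Minimal sketch of trajectory-noise analysis via LOWESS + bootstrap.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def trajectory_noise(coop, frac=0.5):
    """RMS residual around a LOWESS fit of mean cooperation over rounds."""
    rounds = np.arange(coop.shape[1])
    mean_traj = coop.mean(axis=0)
    fit = lowess(mean_traj, rounds, frac=frac, return_sorted=False)
    return float(np.sqrt(np.mean((mean_traj - fit) ** 2)))

def bootstrap_noise(coop, n_boot=200, seed=0):
    """Resample whole simulations to get a distribution of the noise statistic."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, coop.shape[0], size=(n_boot, coop.shape[0]))
    return np.array([trajectory_noise(coop[i]) for i in idx])

rng = np.random.default_rng(1)
no_msg = (rng.random((50, 10)) < 0.5).astype(float)    # 50 sims x 10 rounds
with_msg = (rng.random((50, 10)) < 0.6).astype(float)  # messaging condition
print(bootstrap_noise(no_msg).mean(), bootstrap_noise(with_msg).mean())
```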

Result: Cheap-talk messages consistently reduced trajectory noise across most model-context pairings. Stabilizing effect persisted across multiple prompt variants and decoding regimes, with models having higher baseline volatility benefiting most. Communication rarely caused harmful instability, with only a few context-specific exceptions.

Conclusion: Cheap-talk communication serves as a low-cost, practical tool for improving predictability and reliability of strategic behavior in multi-agent LLM systems, though effectiveness depends on model choice and contextual framing.

Abstract: Large Language Models (LLMs) often exhibit pronounced context-dependent variability that undermines predictable multi-agent behavior in tasks requiring strategic thinking. Focusing on models ranging from 7 to 9 billion parameters engaged in a ten-round repeated Prisoner’s Dilemma, we evaluate whether short, costless pre-play messages emulating the cheap-talk paradigm affect strategic stability. Our analysis uses simulation-level bootstrap resampling and nonparametric inference to compare cooperation trajectories fitted with LOWESS regression across both the messaging and the no-messaging conditions. We demonstrate consistent reductions in trajectory noise across a majority of the model-context pairings being studied. The stabilizing effect persists across multiple prompt variants and decoding regimes, though its magnitude depends on model choice and contextual framing, with models displaying higher baseline volatility gaining the most. While communication rarely produces harmful instability, we document a few context-specific exceptions and identify the limited domains in which communication harms stability. These findings position cheap-talk style communication as a low-cost, practical tool for improving the predictability and reliability of strategic behavior in multi-agent LLM systems.

[494] Prism: Spectral Parameter Sharing for Multi-Agent Reinforcement Learning

Kyungbeom Kim, Seungwon Oh, Kyung-Joong Kim

Main category: cs.MA

TL;DR: Prism is a parameter sharing framework for multi-agent reinforcement learning that uses spectral domain representation via SVD to induce inter-agent diversity while maintaining scalability.

DetailsMotivation: Conventional fully shared architectures in MARL often lead to homogeneous behaviors, while existing diversity methods compromise resource efficiency. There's a need for a framework that can induce inter-agent diversity while preserving scalability and efficiency.

Method: Uses singular value decomposition (SVD) to represent shared networks in the spectral domain. All agents share singular vector directions while learning distinct spectral masks on singular values, enabling diversity through different spectral masks while maintaining parameter sharing.
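
A minimal sketch of spectral parameter sharing: one SVD factorizes a shared weight matrix, all agents reuse the singular directions U and V, and only a per-agent mask on the singular values differs. In Prism the masks are learned; here they are random placeholders.

```python
# Hedged sketch of SVD-based spectral parameter sharing across agents.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))                  # one shared layer's weights
U, s, Vt = np.linalg.svd(W, full_matrices=False)

n_agents = 4
masks = rng.uniform(0.5, 1.0, size=(n_agents, s.shape[0]))  # learnable in practice

def agent_forward(x, agent_id):
    """Per-agent layer: shared singular directions, agent-specific spectrum."""
    s_a = s * masks[agent_id]                  # masked singular values
    return x @ (U * s_a) @ Vt                  # apply U diag(s_a) Vt to the input

x = rng.normal(size=(8, 64))                   # a batch of agent observations
outs = [agent_forward(x, a) for a in range(n_agents)]
print([o.shape for o in outs])                 # same shared parameters, 4 behaviors
```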

Result: Achieves competitive performance on both homogeneous (LBF, SMACv2) and heterogeneous (MaMuJoCo) benchmarks with superior resource efficiency compared to existing methods.

Conclusion: Prism provides an effective way to induce inter-agent diversity in MARL through spectral domain representation while maintaining the benefits of parameter sharing and scalability.

Abstract: Parameter sharing is a key strategy in multi-agent reinforcement learning (MARL) for improving scalability, yet conventional fully shared architectures often collapse into homogeneous behaviors. Recent methods introduce diversity through clustering, pruning, or masking, but typically compromise resource efficiency. We propose Prism, a parameter sharing framework that induces inter-agent diversity by representing shared networks in the spectral domain via singular value decomposition (SVD). All agents share the singular vector directions while learning distinct spectral masks on singular values. This mechanism encourages inter-agent diversity and preserves scalability. Extensive experiments on both homogeneous (LBF, SMACv2) and heterogeneous (MaMuJoCo) benchmarks show that Prism achieves competitive performance with superior resource efficiency.

[495] Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response

Ariyan Bighashdel, Thiago D. Simão, Frans A. Oliehoek

Main category: cs.MA

TL;DR: JBR improves PSRO’s efficiency by collecting joint experience once and reusing it for simultaneous best-response computation across all agents, addressing distribution shift with three methods.

DetailsMotivation: PSRO is computationally expensive for many-agent settings due to per-agent best-response training requiring separate environment interactions. Need more sample-efficient approach for large-scale strategic learning.

Method: Joint Experience Best Response (JBR) collects trajectories once under current meta-strategy profile and reuses joint dataset to compute best responses for all agents simultaneously. Proposes three remedies for distribution-shift bias: Conservative JBR, Exploration-Augmented JBR, and Hybrid BR.
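
A schematic sketch of the JBR loop: trajectories are collected once under the current meta-strategy profile, and the same joint buffer feeds every agent's offline best-response update. The `env`, `pi.sample`, and `fit_offline` APIs are hypothetical placeholders, not a real library.

```python
# Hedged sketch of Joint Experience Best Response; all APIs are hypothetical.
def jbr_iteration(env, meta_strategies, br_learners, n_episodes=100):
    buffer = []
    for _ in range(n_episodes):                # one shared data-collection pass
        obs = env.reset()
        done = False
        while not done:
            actions = {a: pi.sample(obs[a]) for a, pi in meta_strategies.items()}
            next_obs, rewards, done = env.step(actions)
            buffer.append((obs, actions, rewards, next_obs))
            obs = next_obs
    # every agent trains on the SAME joint dataset (an offline RL step),
    # amortizing environment interaction across all best responses
    return {a: learner.fit_offline(buffer, agent=a)
            for a, learner in br_learners.items()}
```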

Result: Exploration-Augmented JBR achieves best accuracy-efficiency trade-off. Hybrid BR attains near-PSRO performance at fraction of sample cost. JBR makes PSRO substantially more practical for large-scale strategic learning.

Conclusion: JBR improves PSRO’s sample efficiency while preserving equilibrium robustness, making large-scale strategic learning more practical through amortized environment interaction and offline RL techniques.

Abstract: Multi-agent reinforcement learning (MARL) offers a scalable alternative to exact game-theoretic analysis but suffers from non-stationarity and the need to maintain diverse populations of strategies that capture non-transitive interactions. Policy Space Response Oracles (PSRO) address these issues by iteratively expanding a restricted game with approximate best responses (BRs), yet per-agent BR training makes it prohibitively expensive in many-agent or simulator-expensive settings. We introduce Joint Experience Best Response (JBR), a drop-in modification to PSRO that collects trajectories once under the current meta-strategy profile and reuses this joint dataset to compute BRs for all agents simultaneously. This amortizes environment interaction and improves the sample efficiency of best-response computation. Because JBR converts BR computation into an offline RL problem, we propose three remedies for distribution-shift bias: (i) Conservative JBR with safe policy improvement, (ii) Exploration-Augmented JBR that perturbs data collection and admits theoretical guarantees, and (iii) Hybrid BR that interleaves JBR with periodic independent BR updates. Across benchmark multi-agent environments, Exploration-Augmented JBR achieves the best accuracy-efficiency trade-off, while Hybrid BR attains near-PSRO performance at a fraction of the sample cost. Overall, JBR makes PSRO substantially more practical for large-scale strategic learning while preserving equilibrium robustness.

cs.MM

[496] Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space

Zihang Wang, Siyue Zhang, Yilun Zhao, Jingyi Yang, Tingyu Song, Anh Tuan Luu, Chen Zhao

Main category: cs.MM

TL;DR: First systematic study evaluating Multimodal Diffusion Language Models (dLLMs) as multimodal embedding models, finding they generally underperform autoregressive Vision Language Models (VLMs) across classification, VQA, and retrieval tasks.

DetailsMotivation: Recent advances in large foundation models have led to the development of Multimodal dLLMs as competitive alternatives to autoregressive models. This raises the critical question of whether Multimodal dLLMs can serve as effective multimodal embedding models, which remains unexplored.

Method: Systematic study converting Multimodal dLLMs into embedding models and evaluating state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval.

Result: Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. LaViDa lags by 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval, while MMaDA shows larger gaps exceeding 20 points across all tasks.

Conclusion: Insufficient image-text alignment in diffusion-based models accounts for their limitations in embedding performance, suggesting current Multimodal dLLMs are not yet competitive with autoregressive VLMs for multimodal embedding tasks.

Abstract: Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models, including those based on Large Language Models (LLMs), Vision Language Models (VLMs), and Multimodal LLMs. More recently, Large Diffusion Language Models (dLLMs) and Multimodal dLLMs have emerged as competitive alternatives to autoregressive models, offering advantages such as bidirectional attention and parallel generation. This progress naturally raises a critical yet unexplored question: can Multimodal dLLMs serve as effective multimodal embedding models? To answer this, we present the first systematic study of converting Multimodal dLLMs into embedding models. We evaluate state-of-the-art Multimodal dLLMs and Autoregressive VLMs across three categories of embedding tasks: classification, visual question answering, and information retrieval. Our results show that Multimodal dLLM embeddings generally underperform their autoregressive VLM counterparts. The stronger diffusion-based model, LaViDa, lags by only 3.5 points on classification, 2.5 points on VQA, and 4.4 points on retrieval tasks, whereas the other diffusion-based model, MMaDA, exhibits substantially larger performance gaps, exceeding 20 points across all tasks. Further analysis reveals insufficient image-text alignment in diffusion-based models, accounting for the observed limitations in their embedding performance.

eess.AS

[497] STACodec: Semantic Token Assignment for Balancing Acoustic Fidelity and Semantic Information in Audio Codecs

Kaiyuan Zhang, Mohan Shi, Eray Eren, Natarajan Balaji Shankar, Zilai Wang, Abeer Alwan

Main category: eess.AS

TL;DR: STACodec: A unified audio codec that integrates semantic information into residual vector quantization via semantic token assignment, achieving better balance between acoustic fidelity and semantic capability than existing hybrid codecs.

DetailsMotivation: Traditional neural audio codecs preserve acoustic details well but lack semantic information, while recent hybrid codecs that incorporate semantic information through distillation often degrade reconstruction performance. There's a need for a unified approach that achieves both good acoustic fidelity and semantic capability.

Method: STACodec integrates semantic information from self-supervised learning models into the first layer of residual vector quantization via semantic token assignment. It also includes a semantic pre-distillation module that predicts semantic tokens directly for assignment during inference, eliminating reliance on SSL-based semantic tokenizers and improving efficiency.
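
A hedged sketch of semantic token assignment: the first RVQ stage is assigned the semantic token (from an SSL tokenizer during training, or from the SPD module at inference) rather than chosen by nearest-neighbor search, and later stages quantize the residual as usual. Codebook sizes are toy values.

```python
# Hedged sketch of RVQ with semantic token assignment on the first layer.
import numpy as np

def rvq_with_sta(x, semantic_token, codebooks):
    """x: (D,) frame feature; codebooks[0] holds semantic-token embeddings."""
    codes, residual = [], x
    # RVQ-1: assigned semantic token instead of a nearest-neighbor search
    codes.append(semantic_token)
    residual = residual - codebooks[0][semantic_token]
    # remaining stages: standard residual vector quantization
    for cb in codebooks[1:]:
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, x - residual                 # codes and the quantized vector

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]
codes, x_hat = rvq_with_sta(rng.normal(size=16), semantic_token=42,
                            codebooks=codebooks)
print(codes)
```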

Result: Experimental results show STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.

Conclusion: STACodec successfully addresses the limitation of existing hybrid codecs by integrating semantic information into the quantization process while maintaining reconstruction quality, offering a unified solution for audio compression with both acoustic and semantic fidelity.

Abstract: Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.

[498] From Hallucination to Articulation: Language Model-Driven Losses for Ultra Low-Bitrate Neural Speech Coding

Jayeon Yi, Minje Kim

Main category: eess.AS

TL;DR: Proposes language model-driven losses to reduce phoneme hallucinations in low-bitrate speech codecs by leveraging pretrained speech-text models for better semantic preservation.

DetailsMotivation: Addresses the problem of phoneme hallucinations in low-bitrate DNN-based speech codecs, where excessive compression leads to missing semantic information and the decoder generates incorrect phonemes. Current methods like semantic distillation have limitations in very-low-bitrate settings.

Method: Proposes two language model-driven losses: 1) Using modified Whisper ASR to compare decoded utterances against ASR-inferred transcripts when ground truth is unavailable, and 2) Using timed-text regularizer (TTR) to compare WavLM representations of decoded speech against BERT representations of ground-truth transcripts. These are compared against semantic distillation objectives.

Result: LM losses provide stronger guidance for extracting semantic information from self-supervised speech representations, improving human-perceived semantic adherence while preserving overall output quality, outperforming semantic distillation in very-low-bitrate settings.

Conclusion: Language model-driven losses are effective for reducing phoneme hallucinations in low-bitrate speech codecs by better leveraging semantic information from pretrained speech-text models, offering improved performance over semantic distillation approaches.

Abstract: “Phoneme Hallucinations (PH)” commonly occur in low-bitrate DNN-based codecs. They reflect the generative decoder’s attempt to synthesize plausible outputs from excessively compressed tokens that are missing some semantic information. In this work, we propose language model-driven losses (LM loss) and show they may alleviate PHs better than a semantic distillation (SD) objective in very-low-bitrate settings. The proposed LM losses build upon language models pretrained to associate speech with text. When ground-truth transcripts are unavailable, we propose to modify a popular automatic speech recognition (ASR) model, Whisper, to compare the decoded utterance against the ASR-inferred transcriptions of the input speech. Otherwise, we propose to use the timed-text regularizer (TTR) to compare WavLM representations of the decoded utterance against BERT representations of the ground-truth transcriptions. We test and compare LM losses against an SD objective, using a reference codec whose three-stage training regimen was designed after several popular codecs. Subjective and objective evaluations conclude that LM losses may provide stronger guidance to extract semantic information from self-supervised speech representations, boosting human-perceived semantic adherence while preserving overall output quality. Demo samples, code, and checkpoints are available online.

[499] B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization

Yingying Gao, Shilei Zhang, Runyan Yang, Zihao Cui, Junlan Feng

Main category: eess.AS

TL;DR: Proposes RL-based unsupervised speech emotion recognition using modified GRPO with self/teacher reward functions to measure sample quality without human annotations.

DetailsMotivation: Address data sparsity and annotation bias in speech emotion recognition by using reinforcement learning to measure sample quality without human annotations, treating sample selection as a long-term RL procedure.

Method: Modified Group Relative Policy Optimization adapted for classification problems, using batch samples as a group with average reward as baseline. Introduces self-reward and teacher-reward functions to encourage high-confidence outputs instead of verifiable reward functions.
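
A minimal sketch of the batched-group advantage: the batch acts as the group, and its mean reward is the baseline (with the usual GRPO-style standard-deviation normalization, an assumption here). The self-reward shown simply rewards high-confidence predictions; the teacher-reward variant would score outputs with a teacher model instead.

```python
# Hedged sketch of batch-as-group advantages with a confidence self-reward.
import numpy as np

def self_reward(probs, actions):
    """Reward high-confidence predictions: probability of the chosen class."""
    return probs[np.arange(len(actions)), actions]

def batched_group_advantage(rewards, eps=1e-8):
    """Advantage of each sample relative to the batch-average baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 4))                          # batch of emotion logits
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
actions = probs.argmax(axis=1)
adv = batched_group_advantage(self_reward(probs, actions))
# adv then scales the policy-gradient term for each sample in the batch
print(adv.round(2))
```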

Result: Experiments show the proposed method improves baseline performance without RL by 19.8%.

Conclusion: RL-based approach effectively addresses data sparsity and annotation bias in unsupervised speech emotion recognition by measuring sample quality through modified GRPO with novel reward functions.

Abstract: Unsupervised speech emotion recognition (SER) focuses on addressing the problem of data sparsity and annotation bias in emotional speech. Reinforcement learning (RL) is a promising method that enhances performance through rule-based or model-based verification functions rather than human annotations. We treat sample selection during the learning process as a long-term procedure, with the decision of whether to select a sample as the policy action, thereby applying RL to measure sample quality in SER. We propose a modified Group Relative Policy Optimization (GRPO) adapted to classification problems, which takes the samples in a batch as a group and uses their average reward as the baseline for calculating the advantage. Rather than using a verifiable reward function as in GRPO, we put forward self-reward functions and teacher-reward functions to encourage the model to produce high-confidence outputs. Experiments indicate that the proposed method improves the performance of the baseline without RL by 19.8%.

[500] Automatic Detection and Analysis of Singing Mistakes for Music Pedagogy

Sumit Kumar, Suraj Jaiswal, Parampreet Singh, Vipul Arora

Main category: eess.AS

TL;DR: A framework for automatic singing mistake detection using synchronized teacher-learner vocal recordings with deep learning models that outperform rule-based methods.

DetailsMotivation: To advance technology-enhanced music education by developing automated systems for detecting singing mistakes, which can provide objective feedback to learners and support music pedagogy.

Method: Created a new dataset of synchronized teacher-learner vocal recordings with mistake annotations, developed various deep learning models for mistake detection, proposed a new evaluation methodology, and benchmarked against rule-based methods.

Result: Learning-based methods outperformed rule-based methods in singing mistake detection. Systematic error analysis and cross-teacher studies provided insights applicable to music pedagogy.

Conclusion: The framework sets new research directions in music pedagogy and demonstrates the effectiveness of deep learning for automated singing assessment. The dataset and code are publicly available.

Abstract: The advancement of machine learning in audio analysis has opened new possibilities for technology-enhanced music education. This paper introduces a framework for automatic singing mistake detection in the context of music pedagogy, supported by a newly curated dataset. The dataset comprises synchronized teacher-learner vocal recordings, with annotations marking different types of mistakes made by learners. Using this dataset, we develop different deep learning models for mistake detection and benchmark them. To compare the efficacy of mistake detection systems, a new evaluation methodology is proposed. Experiments indicate that the proposed learning-based methods are superior to rule-based methods. A systematic study of errors and a cross-teacher study reveal insights into music pedagogy that can be utilised for various music applications. This work sets out new directions of research in music pedagogy. The code and dataset are publicly available.

[501] The Combination of Several Decorrelation Methods to Improve Acoustic Feedback Cancellation

Klaus Linhard, Philipp Bulling

Main category: eess.AS

TL;DR: This paper extends acoustic feedback cancellation systems by combining multiple decorrelation methods including variable time delay, prediction, distortion compensation, and simplified reverberation modeling, showing that each contributes to performance improvements and their combination yields superior results.

DetailsMotivation: Existing acoustic feedback cancellation systems often focus on single extensions like prediction, but the authors aim to demonstrate that multiple decorrelation methods can be combined for better performance, with each individual extension contributing to overall improvement.

Method: Extends a frequency-domain Kalman filter in a multi-delay structure with four decorrelation extensions: variable time delay line, prediction, distortion compensation, and simplified reverberation model. Each extension is analyzed individually and in combination, with practical parameter ranges defined.

Result: Evaluation on public datasets shows each individual extension contributes to performance improvements, and the combination of all proposed extensions results in a superior system, validated through system distance metrics and PESQ objective speech quality measures.

Conclusion: Multiple decorrelation methods can be effectively combined in acoustic feedback cancellation systems, with each extension providing meaningful contributions and their combination yielding the best overall performance.

Abstract: This paper extends an acoustic feedback cancellation system by incorporating multiple decorrelation methods. The baseline system is based on a frequency-domain Kalman filter implemented in a multi-delay structure. The proposed extensions include a variable time delay line, prediction, distortion compensation, and a simplified reverberation model. Each extension is analyzed, and a practical parameter range is defined. While existing literature often focuses on a single extension, such as prediction, to describe an optimal system, this work demonstrates that each individual extension contributes to performance improvements. Furthermore, the combination of all proposed extensions results in a superior system. The evaluation is conducted using publicly available datasets, with performance assessed through system distance metrics and the objective speech quality measure PESQ.

[502] Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts

Chandrashekar M S, Vineet Singh, Lakshmi Pedapudi

Main category: eess.AS

TL;DR: Benchmarking framework for ASR in Indian agricultural contexts across Hindi, Telugu, and Odia languages, introducing domain-specific metrics and analyzing performance variations.

DetailsMotivation: The digitization of agricultural advisory services in India requires robust ASR systems capable of accurately transcribing domain-specific terminology in multiple Indian languages, particularly in low-resource agricultural domains.

Method: Developed a benchmarking framework with evaluation metrics including Agriculture Weighted Word Error Rate (AWWER) and domain-specific utility scoring. Evaluated 10,934 audio recordings across Hindi, Telugu, and Odia languages using up to 10 ASR models, analyzing performance variations and audio quality challenges.
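
A hedged sketch of what an agriculture-weighted WER could look like: a word-level edit distance where errors on domain-lexicon terms cost more, normalized by the total reference weight. The weighting scheme here is an assumption; AWWER's exact definition is in the paper.

```python
# Hedged sketch of a domain-weighted word error rate (illustrative weighting).
def awwer(ref, hyp, domain_terms, w=2.0):
    """Weighted edit distance over words, normalized by total reference weight."""
    cost = lambda word: w if word in domain_terms else 1.0
    R, H = ref.split(), hyp.split()
    d = [[0.0] * (len(H) + 1) for _ in range(len(R) + 1)]
    for i in range(1, len(R) + 1):
        d[i][0] = d[i - 1][0] + cost(R[i - 1])             # deletions
    for j in range(1, len(H) + 1):
        d[0][j] = d[0][j - 1] + 1.0                        # insertions
    for i in range(1, len(R) + 1):
        for j in range(1, len(H) + 1):
            sub = 0.0 if R[i - 1] == H[j - 1] else cost(R[i - 1])
            d[i][j] = min(d[i - 1][j] + cost(R[i - 1]),    # delete ref word
                          d[i][j - 1] + 1.0,               # insert hyp word
                          d[i - 1][j - 1] + sub)           # match/substitute
    total = sum(cost(word) for word in R)
    return d[len(R)][len(H)] / max(total, 1e-9)

print(awwer("spray urea on wheat", "spray your ear on wheat", {"urea", "wheat"}))
```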

Result: Hindi achieved best overall performance (WER: 16.2%), while Odia presented greatest challenges (best WER: 35.1% with speaker diarization). Speaker diarization with best-speaker selection reduced WER for multi-speaker recordings by up to 66%. Identified recurring error patterns in agricultural terminology.

Conclusion: Established baseline benchmarks for agricultural ASR development, provided practical recommendations for improving ASR systems in low-resource agricultural domains, and demonstrated the importance of domain-specific evaluation metrics.

Abstract: The digitization of agricultural advisory services in India requires robust Automatic Speech Recognition (ASR) systems capable of accurately transcribing domain-specific terminology in multiple Indian languages. This paper presents a benchmarking framework for evaluating ASR performance in agricultural contexts across Hindi, Telugu, and Odia languages. We introduce evaluation metrics including Agriculture Weighted Word Error Rate (AWWER) and domain-specific utility scoring to complement traditional metrics. Our evaluation of 10,934 audio recordings, each transcribed by up to 10 ASR models, reveals performance variations across languages and models, with Hindi achieving the best overall performance (WER: 16.2%) while Odia presents the greatest challenges (best WER: 35.1%, achieved only with speaker diarization). We characterize audio quality challenges inherent to real-world agricultural field recordings and demonstrate that speaker diarization with best-speaker selection can substantially reduce WER for multi-speaker recordings (up to 66% depending on the proportion of multi-speaker audio). We identify recurring error patterns in agricultural terminology and provide practical recommendations for improving ASR systems in low-resource agricultural domains. The study establishes baseline benchmarks for future agricultural ASR development.

eess.IV

[503] Adaptive Resolution and Chroma Subsampling for Energy-Efficient Video Coding

Amritha Premkumar, Christian Herglotz

Main category: eess.IV

TL;DR: ARCS framework adaptively selects optimal resolution-chroma subsampling pairs per bitrate to balance quality and decoding efficiency, achieving significant bitrate savings and reduced decoding time compared to fixed-format encoding.

DetailsMotivation: Fixed chroma subsampling formats like YUV420 don't adapt to content variations, leading to suboptimal chroma quality and inefficient bitrate allocation. There's a need for adaptive approaches that balance perceptual quality with decoding efficiency for energy-efficient video streaming.

Method: Proposes Adaptive Resolution-Chroma Subsampling (ARCS) framework that jointly optimizes spatial resolution and chroma subsampling. It selects optimal (resolution, chroma format) pairs for each bitrate by maximizing a composite quality-complexity objective with monotonicity constraints for smooth transitions between representations.
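
A minimal sketch of the per-bitrate selection step: each ladder rung scores candidate (resolution, chroma) pairs by a composite quality-minus-complexity objective, and a floor index enforces the monotonicity constraint across bitrates. The scoring function below is a toy placeholder, not the paper's measured metrics.

```python
# Hedged sketch of the ARCS selection step with a monotonicity constraint.
def select_ladder(bitrates, rungs, measure, lam=0.1):
    """rungs: candidate (resolution, chroma) pairs ordered low to high; measure
    maps (rung, bitrate) -> (quality, decode_time). The chosen rung index may
    only grow with bitrate, giving smooth transitions between representations."""
    ladder, floor_idx = [], 0
    for br in bitrates:
        best_idx, best_score = floor_idx, float("-inf")
        for i in range(floor_idx, len(rungs)):
            q, t = measure(rungs[i], br)
            score = q - lam * t                 # composite quality-complexity objective
            if score > best_score:
                best_idx, best_score = i, score
        ladder.append(rungs[best_idx])
        floor_idx = best_idx                    # monotonicity constraint
    return ladder

rungs = [((960, 540), "yuv420"), ((1920, 1080), "yuv420"),
         ((1920, 1080), "yuv444"), ((3840, 2160), "yuv444")]

def toy_measure(rung, br):
    level = rungs.index(rung)                   # richer rungs need more bitrate...
    quality = (level + 1) * min(br / (2000 * (level + 1)), 1.0)
    return quality, 0.5 * (level + 1)           # ...and cost more to decode

print(select_ladder([1000, 4000, 16000], rungs, toy_measure))
```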

Result: Compared to fixed-format YUV444 encoding, ARCS achieves average 13.48% bitrate savings and 62.18% reduction in decoding time (proxy for decoding energy) while maintaining same colorVideoVDP score. Demonstrates chroma adaptivity as effective control dimension for energy-efficient streaming.

Conclusion: ARCS successfully introduces chroma adaptivity as a new dimension for optimizing video streaming efficiency, significantly improving both bitrate efficiency and decoding performance through adaptive resolution-chroma subsampling selection.

Abstract: Conventional video encoders typically employ a fixed chroma subsampling format, such as YUV420, which may not optimally reflect variations in chroma detail across different types of content. This can lead to suboptimal chroma quality and inefficiencies in bitrate allocation. We propose an Adaptive Resolution-Chroma Subsampling (ARCS) framework that jointly optimizes spatial resolution and chroma subsampling to balance perceptual quality and decoding efficiency. ARCS selects an optimal (resolution, chroma format) pair for each bitrate by maximizing a composite quality-complexity objective, while enforcing monotonicity constraints to ensure smooth transitions between representations. Experimental results using x265 show that, compared to a fixed-format encoding (YUV444), on average, ARCS achieves a 13.48% bitrate savings and a 62.18% reduction in decoding time, which we use as a proxy for the decoding energy, to yield the same colorVideoVDP score. The proposed framework introduces chroma adaptivity as a new control dimension for energy-efficient video streaming.

[504] ALIEN: Analytic Latent Watermarking for Controllable Generation

Liangqi Lei, Keke Gai, Jing Yu, Qi Wu

Main category: eess.IV

TL;DR: ALIEN proposes an analytical watermarking framework for latent diffusion models that uses time-dependent modulation coefficients to guide watermark embedding, achieving better quality and robustness than optimization-based methods.

DetailsMotivation: Existing watermarking methods for latent diffusion models rely on computationally intensive heuristic optimization for iterative signal refinement, leading to high training overhead and local optima entrapment. There's a need for a more efficient and effective approach to watermarking in generative models.

Method: ALIEN develops the first analytical derivation of time-dependent modulation coefficients that guide the diffusion of watermark residuals to achieve controllable watermark embedding patterns, eliminating the need for iterative optimization.

Result: ALIEN-Q outperforms state-of-the-art by 33.1% across 5 quality metrics, and ALIEN-R shows 14.0% improved robustness against generative variant and stability threats across 15 distinct conditions.

Conclusion: The analytical framework provides an efficient and effective solution for watermarking in latent diffusion models, offering superior quality and robustness compared to optimization-based approaches.

Abstract: Watermarking is a technical alternative for safeguarding intellectual property and reducing misuse. Existing methods focus on optimizing watermarked latent variables to balance watermark robustness and fidelity, as latent diffusion models (LDMs) are considered a powerful tool for generative tasks. However, reliance on computationally intensive heuristic optimization for iterative signal refinement results in high training overhead and local optima entrapment. To address these issues, we propose an Analytical Watermarking Framework for Controllable Generation (ALIEN). We develop the first analytical derivation of the time-dependent modulation coefficient that guides the diffusion of watermark residuals to achieve a controllable watermark embedding pattern. Experimental results show that ALIEN-Q outperforms the state-of-the-art by 33.1% across 5 quality metrics, and ALIEN-R demonstrates 14.0% improved robustness against generative variant and stability threats compared to the state-of-the-art across 15 distinct conditions. Code is available at https://anonymous.4open.science/r/ALIEN/.

[505] COSMOS: Coherent Supergaussian Modeling with Spatial Priors for Sparse-View 3D Splatting

Chaeyoung Jeong, Kwangsu Kim

Main category: eess.IV

TL;DR: COSMOS improves 3D Gaussian Splatting for sparse-view reconstruction by introducing supergaussian groupings with spatial priors and attention mechanisms to enhance structural coherence.

DetailsMotivation: 3D Gaussian Splatting suffers from overfitting and structural degradation with sparse input views due to lack of 3D structure priors, limiting novel view generalization.

Method: Introduces supergaussian groupings based on geometric and appearance features, uses inter-group global self-attention and sparse local attention, and applies intra-group positional regularization to maintain structural coherence.

Result: Outperforms state-of-the-art methods on Blender and DTU datasets in sparse-view settings without external depth supervision.

Conclusion: COSMOS effectively addresses sparse-view limitations in 3DGS by incorporating spatial priors through supergaussian groupings and attention mechanisms, enabling more consistent 3D reconstructions.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a promising approach for 3D reconstruction, providing explicit, point-based representations and enabling high-quality real-time rendering. However, when trained with sparse input views, 3DGS suffers from overfitting and structural degradation, leading to poor generalization on novel views. This limitation arises from its optimization relying solely on photometric loss without incorporating any 3D structure priors. To address this issue, we propose Coherent supergaussian Modeling with Spatial Priors (COSMOS). Inspired by the concept of superpoints from 3D segmentation, COSMOS introduces 3D structure priors by newly defining supergaussian groupings of Gaussians based on local geometric cues and appearance features. To this end, COSMOS applies inter-group global self-attention across supergaussian groups and sparse local attention among individual Gaussians, enabling the integration of global and local spatial information. These structure-aware features are then used for predicting Gaussian attributes, facilitating more consistent 3D reconstructions. Furthermore, by leveraging supergaussian-based grouping, COSMOS enforces an intra-group positional regularization to maintain structural coherence and suppress floaters, thereby enhancing training stability under sparse-view conditions. Our experiments on Blender and DTU show that COSMOS surpasses state-of-the-art methods in sparse-view settings without any external depth supervision.

[506] Content-Driven Frame-Level Bit Prediction for Rate Control in Versatile Video Coding

Amritha Premkumar, Prajit T Rajendran, Vignesh V Menon, Christian Herglotz

Main category: eess.IV

TL;DR: A content-adaptive rate control framework for VVC video coding that uses Video Complexity Analyzer features and Random Forest regression to predict frame-level bit consumption, reducing encoding time by 33.3% while maintaining comparable coding efficiency to conventional two-pass rate control.

DetailsMotivation: Conventional two-pass rate control in VVC relies on analytical rate-QP models that fail to capture nonlinear spatial-temporal variations, causing quality instability and high complexity due to multiple trial encodes. There's a need for more efficient and accurate bit allocation methods.

Method: Proposes a content-adaptive framework that predicts frame-level bit consumption using lightweight features from the Video Complexity Analyzer (VCA) and quantization parameters within a Random Forest regression model. The approach is integrated into a rate-control loop.
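
A minimal sketch of the predictor itself: complexity features plus QP fed into a scikit-learn Random Forest regressor, evaluated by R² on held-out frames. The feature columns and synthetic target are stand-ins for the real VCA outputs and measured bit counts.

```python
# Hedged sketch of frame-level bit prediction with a Random Forest regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# columns: spatial energy, temporal energy, brightness, QP, frame-type id
X = rng.uniform(size=(5000, 5)) * [100, 50, 255, 51, 2]
true_bits = 1e5 / (1 + X[:, 3]) * (0.3 + 0.7 * X[:, 0] / 100)  # synthetic target
y = true_bits + rng.normal(scale=500, size=5000)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:4000], y[:4000])
print("R^2 on held-out frames:", round(model.score(X[4000:], y[4000:]), 3))
# predicted bits per frame would then drive QP selection in the rate-control loop
```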

Result: Achieves strong correlation with ground truth bit consumption, with R² values of 0.93, 0.88, and 0.77 for I-, P-, and B-frames respectively. When integrated into rate control, achieves comparable coding efficiency to conventional two-pass rate control while reducing total encoding time by 33.3%.

Conclusion: VCA-driven bit prediction provides a computationally efficient and accurate alternative to conventional rate-QP models for video coding rate control, offering significant time savings while maintaining quality.

Abstract: Rate control allocates bits efficiently across frames to meet a target bitrate while maintaining quality. Conventional two-pass rate control (2pRC) in Versatile Video Coding (VVC) relies on analytical rate-QP models, which often fail to capture nonlinear spatial-temporal variations, causing quality instability and high complexity due to multiple trial encodes. This paper proposes a content-adaptive framework that predicts frame-level bit consumption using lightweight features from the Video Complexity Analyzer (VCA) and quantization parameters within a Random Forest regression model. On ultra-high-definition sequences encoded with VVenC, the model achieves strong correlation with ground truth, yielding R² values of 0.93, 0.88, and 0.77 for I-, P-, and B-frames, respectively. Integrated into a rate-control loop, it achieves comparable coding efficiency to 2pRC while reducing total encoding time by 33.3%. The results show that VCA-driven bit prediction provides a computationally efficient and accurate alternative to conventional rate-QP models.

[507] Zero-shot Multi-Contrast Brain MRI Registration by Intensity Randomizing T1-weighted MRI (LUMIR25)

Hengjie Liu, Yimeng Dou, Di Xu, Xinyi Fu, Dan Ruan, Ke Sheng

Main category: eess.IV

TL;DR: A medical image registration method that achieves zero-shot generalization across MRI contrasts using MIND-based multimodal loss, intensity randomization, and instance-specific optimization, winning 1st place in LUMIR25 challenge.

DetailsMotivation: To develop a brain MRI registration method that can generalize to unseen MRI contrasts (domain shifts) while being trained only on T1-weighted data, addressing the challenge of zero-shot registration under diverse clinical scenarios.

Method: Three key strategies: 1) Multimodal loss using modality-independent neighborhood descriptor (MIND) for contrast-invariant feature matching, 2) Intensity randomization for appearance augmentation during training, and 3) Lightweight instance-specific optimization (ISO) on feature encoders at inference time for adaptation to new contrasts.
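
A hedged sketch of the intensity-randomization strategy: T1-weighted intensities are remapped through a random piecewise-linear lookup, which can also invert contrast, so the registration network sees synthetic "contrasts" during training. The transform family and knot count are assumptions.

```python
# Hedged sketch of intensity randomization as appearance augmentation.
import numpy as np

def randomize_intensity(img, n_knots=8, seed=None):
    """Map normalized intensities through a random piecewise-linear curve."""
    rng = np.random.default_rng(seed)
    lo, hi = img.min(), img.max()
    x = (img - lo) / (hi - lo + 1e-8)            # normalize to [0, 1]
    knots_in = np.linspace(0, 1, n_knots)
    knots_out = rng.uniform(0, 1, n_knots)       # unsorted -> contrast can invert
    return np.interp(x, knots_in, knots_out).astype(np.float32)

t1 = np.random.rand(160, 192, 160).astype(np.float32)   # stand-in T1w volume
fake_contrast = randomize_intensity(t1, seed=3)          # train registration on this
```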

Result: Achieved 1st place overall on the LUMIR25 challenge test set, with reasonable T1-T2 registration accuracy on validation set while maintaining good deformation regularity, demonstrating effective zero-shot generalization across MRI contrasts.

Conclusion: Simple but effective strategies combining MIND-based multimodal loss, intensity augmentation, and instance-specific optimization enable robust zero-shot medical image registration across domain shifts, particularly valuable for clinical applications with diverse MRI contrasts.

Abstract: In this paper, we summarize the methods and results of our submission to the LUMIR25 challenge in Learn2Reg 2025, which achieved 1st place overall on the test set. Extended from LUMIR24, this year’s task focuses on zero-shot registration under domain shifts (high-field MRI, pathological brains, and various MRI contrasts), while the training data comprise only in-domain T1-weighted brain MRI. We start with a meticulous analysis of LUMIR24 winners to identify the main contributors to good monomodal registration performance. To achieve good generalization with diverse contrasts from a model trained with T1-weighted MRI only, we employ three simple but effective strategies: (i) a multimodal loss based on the modality-independent neighborhood descriptor (MIND), (ii) intensity randomization for appearance augmentation, and (iii) lightweight instance-specific optimization (ISO) on feature encoders at inference time. On the validation set, our approach achieves reasonable T1-T2 registration accuracy while maintaining good deformation regularity.

[508] AS-Mamba: Asymmetric Self-Guided Mamba Decoupled Iterative Network for Metal Artifact Reduction

Bowen Ning, Zekun Zhou, Xinyi Zhong, Zhongzhen Wang, HongXin Wu, HaiTao Wang, Liu Shi, Qiegen Liu

Main category: eess.IV

TL;DR: AS-Mamba: A novel deep learning approach using State Space Models (Mamba) for metal artifact reduction in CT images by explicitly capturing directional streak artifacts and incorporating frequency domain correction with self-guided contrastive regularization.

DetailsMotivation: Metal artifacts significantly degrade CT image quality, hindering clinical diagnosis. Existing CNN and Transformer approaches fail to explicitly capture directional geometric features of artifacts, leading to compromised structural restoration.

Method: Proposes Asymmetric Self-Guided Mamba (AS-Mamba) that leverages Mamba architecture to capture directional streak artifacts via sequential modeling, incorporates frequency domain correction to rectify global amplitude spectrum, and uses self-guided contrastive regularization to bridge distribution gaps across clinical scenarios.

Result: Extensive experiments on public and clinical dental CBCT datasets demonstrate superior performance in suppressing directional streaks and preserving structural details compared to existing methods.

Conclusion: AS-Mamba effectively integrates physical geometric priors into deep network design for metal artifact reduction, validating the effectiveness of using State Space Models for capturing directional artifacts in medical imaging.

Abstract: Metal artifacts significantly degrade Computed Tomography (CT) image quality, impeding accurate clinical diagnosis. However, existing deep learning approaches, such as CNN and Transformer, often fail to explicitly capture the directional geometric features of artifacts, leading to compromised structural restoration. To address these limitations, we propose the Asymmetric Self-Guided Mamba (AS-Mamba) for metal artifact reduction. Specifically, the linear propagation of metal-induced streak artifacts aligns well with the sequential modeling capability of State Space Models (SSMs). Consequently, the Mamba architecture is leveraged to explicitly capture and suppress these directional artifacts. Simultaneously, a frequency domain correction mechanism is incorporated to rectify the global amplitude spectrum, thereby mitigating intensity inhomogeneity caused by beam hardening. Furthermore, to bridge the distribution gap across diverse clinical scenarios, we introduce a self-guided contrastive regularization strategy. Extensive experiments on public and clinical dental CBCT datasets demonstrate that AS-Mamba achieves superior performance in suppressing directional streaks and preserving structural details, validating the effectiveness of integrating physical geometric priors into deep network design.

[509] Orientation-Robust Latent Motion Trajectory Learning for Annotation-free Cardiac Phase Detection in Fetal Echocardiography

Yingyu Yang, Qianye Yang, Can Peng, Elena D’Alberti, Olga Patey, Aris T. Papageorghiou, J. Alison Noble

Main category: eess.IV

TL;DR: ORBIT: Self-supervised framework for detecting cardiac phases (end-diastolic and end-systolic) in fetal echocardiography without manual annotations, robust to various fetal heart orientations.

DetailsMotivation: Automated detection of cardiac phases in fetal echocardiography is crucial for congenital heart disease analysis, but manual identification is labor-intensive, especially without fetal ECG. Existing methods are constrained by fixed orientation assumptions.

Method: ORBIT uses registration as a self-supervision task to learn a latent motion trajectory of cardiac deformation. Turning points in this trajectory capture transitions between cardiac relaxation and contraction, enabling orientation-robust localization of ED and ES frames.
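
A minimal sketch of the turning-point idea on a 1-D stand-in for the latent motion trajectory: local minima and maxima of the deformation signal mark transitions between relaxation and contraction, i.e., candidate ED and ES frames. The synthetic signal and extrema detector are illustrative only.

```python
# Hedged sketch of turning-point detection on a latent motion trajectory.
import numpy as np
from scipy.signal import argrelextrema

t = np.arange(120)                               # frame indices
traj = np.sin(2 * np.pi * t / 30) \
       + 0.05 * np.random.default_rng(0).normal(size=120)

ed_candidates = argrelextrema(traj, np.less, order=5)[0]     # trajectory minima
es_candidates = argrelextrema(traj, np.greater, order=5)[0]  # trajectory maxima
print("ED frames:", ed_candidates, "ES frames:", es_candidates)
```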

Result: Achieves MAE = 1.9 frames for ED and 1.6 for ES on normal cases, and MAE = 2.4 frames for ED and 2.1 for ES on CHD cases, outperforming existing annotation-free approaches with fixed orientation constraints.

Conclusion: ORBIT demonstrates robust cardiac phase detection directly from 4CH fetal echocardiography, trained only on normal videos but generalizing well to CHD cases, with potential to facilitate automated CHD analysis.

Abstract: Fetal echocardiography is essential for detecting congenital heart disease (CHD), facilitating pregnancy management, optimized delivery planning, and timely postnatal interventions. Among standard imaging planes, the four-chamber (4CH) view provides comprehensive information for CHD diagnosis, where clinicians carefully inspect the end-diastolic (ED) and end-systolic (ES) phases to evaluate cardiac structure and motion. Automated detection of these cardiac phases is thus a critical component toward fully automated CHD analysis. Yet, in the absence of fetal electrocardiography (ECG), manual identification of ED and ES frames remains a labor-intensive bottleneck. We present ORBIT (Orientation-Robust Beat Inference from Trajectories), a self-supervised framework that identifies cardiac phases without manual annotations under various fetal heart orientation. ORBIT employs registration as self-supervision task and learns a latent motion trajectory of cardiac deformation, whose turning points capture transitions between cardiac relaxation and contraction, enabling accurate and orientation-robust localization of ED and ES frames across diverse fetal positions. Trained exclusively on normal fetal echocardiography videos, ORBIT achieves consistent performance on both normal (MAE = 1.9 frames for ED and 1.6 for ES) and CHD cases (MAE = 2.4 frames for ED and 2.1 for ES), outperforming existing annotation-free approaches constrained by fixed orientation assumptions. These results highlight the potential of ORBIT to facilitate robust cardiac phase detection directly from 4CH fetal echocardiography.

[510] Downscaling Neural Network for Coastal Simulations

Zhi-Song Liu, Markus Büttner, Matthew Scarborough, Eirik Valseth, Vadym Aizinger, Bernhard Kainz, Andreas Rupp

Main category: eess.IV

TL;DR: DNNCS uses neural networks with spatiotemporal attention and physics-informed loss to downscale coarse coastal ocean simulations to high-resolution predictions of free surface elevation and velocity.

DetailsMotivation: High-resolution coastal simulations are needed for predicting flooding from tsunamis and storm surges, but are computationally expensive. The paper aims to develop an efficient neural network approach to learn high-resolution solutions from coarse simulations.

Method: Proposes Downscaling Neural Network for Coastal Simulation (DNNCS) with grid-aware spatiotemporal attention to project temporal features to spatial domain, positional encoding for coordinate information, spatiotemporal bilinear interpolation for missing frames, and frequency domain expansion for residual mapping. Uses physics-informed loss for gradient consistency and momentum changes.

Result: The method achieves 24% reduction in root-mean-square error compared to models trained with only data-driven losses, shows superior downscaling quality and fast computation compared to state-of-the-art methods.

Conclusion: DNNCS effectively learns high-resolution coastal simulations from coarse inputs using neural networks with spatiotemporal attention and physics-informed constraints, providing accurate and efficient downscaling for practical applications.

Abstract: Learning the fine-scale details of a coastal ocean simulation from a coarse representation is a challenging task. For real-world applications, high-resolution simulations are necessary to advance understanding of many coastal processes, specifically, to predict flooding resulting from tsunamis and storm surges. We propose a Downscaling Neural Network for Coastal Simulation (DNNCS) for spatiotemporal enhancement to learn the high-resolution numerical solution. Given images of coastal simulations produced on low-resolution computational meshes using low polynomial order discontinuous Galerkin discretizations and a coarse temporal resolution, the proposed DNNCS learns to produce high-resolution free surface elevation and velocity visualizations in both time and space. To model the dynamic changes over time and space, we propose grid-aware spatiotemporal attention to project the temporal features to the spatial domain for non-local feature matching. The coordinate information is also utilized via positional encoding. For the final reconstruction, we use the spatiotemporal bilinear operation to interpolate the missing frames and then expand the feature maps to the frequency domain for residual mapping. In addition to the data-driven losses, the proposed physics-informed loss enforces gradient consistency and momentum changes, leading to a 24% reduction in root-mean-square error compared to the model trained with only data-driven losses. To train the proposed model, we introduce a coastal simulation dataset and use it for model optimization and evaluation. Our method shows superior downscaling quality and fast computation compared to the state-of-the-art methods.

[511] Mamba Goes HoME: Hierarchical Soft Mixture-of-Experts for 3D Medical Image Segmentation

Szymon Płotka, Gizem Mert, Maciej Chrabaszcz, Ewa Szczurek, Arkadiusz Sitek

Main category: eess.IV

TL;DR: HoME introduces a hierarchical soft mixture-of-experts architecture built on Mamba SSM for efficient 3D medical image segmentation, achieving state-of-the-art results across diverse medical imaging modalities.

DetailsMotivation: Current AI methods for medical image segmentation face challenges in efficiently processing 3D medical images across diverse modalities and handling data variability, requiring improved long-context modeling and adaptive feature extraction.

Method: Two-level hierarchical soft mixture-of-experts (HoME) built on Mamba Selective State Space Model backbone: first level partitions input sequences into local groups with specialized per-group experts for localized feature extraction; second level aggregates outputs through global SMoE layer for cross-group information fusion and global context refinement.
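
A hedged NumPy sketch of the two-level routing: a soft-MoE layer (one slot per expert, a simplification) mixes tokens into slots and back, applied first within local groups and then globally over the concatenated outputs. Dimensions, expert form, and the one-slot-per-expert choice are assumptions.

```python
# Hedged sketch of hierarchical soft mixture-of-experts routing.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def smoe(tokens, phi, experts):
    """Soft MoE: tokens (N, D), phi (D, S) slot logits, experts: S functions."""
    logits = tokens @ phi                        # (N, S) token-to-slot affinities
    slots = softmax(logits, axis=0).T @ tokens   # each slot = soft mix of tokens
    outs = np.stack([f(slots[s]) for s, f in enumerate(experts)])
    return softmax(logits, axis=1) @ outs        # tokens re-gather expert outputs

rng = np.random.default_rng(0)
D, S = 16, 4
make_expert = lambda: (lambda W: (lambda h: np.tanh(h @ W)))(rng.normal(size=(D, D)))
local_experts = [make_expert() for _ in range(S)]
global_experts = [make_expert() for _ in range(S)]
phi_l, phi_g = rng.normal(size=(D, S)), rng.normal(size=(D, S))

x = rng.normal(size=(64, D))                     # a 3D volume's token sequence
groups = x.reshape(4, 16, D)                     # level 1: local groups
local = np.concatenate([smoe(g, phi_l, local_experts) for g in groups])
out = smoe(local, phi_g, global_experts)         # level 2: global fusion
print(out.shape)
```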

Result: Surpasses state-of-the-art results across datasets from the three most widely used 3D medical imaging modalities and varying data qualities, demonstrating enhanced generalizability and segmentation performance.

Conclusion: HoME’s hierarchical design combining local expert routing with global expert refinement provides an effective solution for efficient 3D medical image segmentation across diverse modalities and data qualities.

Abstract: In recent years, artificial intelligence has significantly advanced medical image segmentation. Nonetheless, challenges remain, including efficient 3D medical image processing across diverse modalities and handling data variability. In this work, we introduce Hierarchical Soft Mixture-of-Experts (HoME), a two-level token-routing layer for efficient long-context modeling, specifically designed for 3D medical image segmentation. Built on the Mamba Selective State Space Model (SSM) backbone, HoME enhances sequential modeling through adaptive expert routing. In the first level, a Soft Mixture-of-Experts (SMoE) layer partitions input sequences into local groups, routing tokens to specialized per-group experts for localized feature extraction. The second level aggregates these outputs through a global SMoE layer, enabling cross-group information fusion and global context refinement. This hierarchical design, combining local expert routing with global expert refinement, enhances generalizability and segmentation performance, surpassing state-of-the-art results across datasets from the three most widely used 3D medical imaging modalities and varying data qualities. The code is publicly available at https://github.com/gmum/MambaHoME.

[512] Mamba-FCS: Joint Spatio-Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing

Buddhi Wijenayake, Athulya Ratnayake, Praveen Sumanasekara, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath, Nichula Wasalathilaka

Main category: eess.IV

TL;DR: Mamba-FCS: A semantic change detection framework using Visual State Space Model backbone with frequency domain features and change-guided attention for remote sensing imagery.

DetailsMotivation: Need for SCD models that balance spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions, addressing limitations of CNNs (lack global context) and Transformers (high computational cost).

Method: Uses Mamba architecture (state-space models) with Visual State Space Model backbone, Joint Spatio-Frequency Fusion block (log-amplitude frequency domain features), Change-Guided Attention module linking BCD and SCD tasks, and Separated Kappa loss for class-imbalanced optimization.

Result: Achieves SOTA on SECOND (88.62% OA, 65.78% F_scd, 25.50% SeK) and Landsat-SCD (96.25% OA, 89.27% F_scd, 60.26% SeK) datasets, with ablation studies confirming contributions of each component.

Conclusion: Mamba architectures with proposed techniques show substantial potential for effective and scalable semantic change detection in remote sensing, setting new benchmarks.

Abstract: Semantic Change Detection (SCD) from remote sensing imagery requires models that balance extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. Convolutional Neural Networks excel at local feature extraction but lack global context, while Transformers provide global modeling at high computational cost. Recent Mamba architectures based on state-space models offer a compelling alternative through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, an SCD framework built upon a Visual State Space Model backbone that incorporates a Joint Spatio-Frequency Fusion block using log-amplitude frequency-domain features to enhance edge clarity and suppress illumination artifacts, a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined binary change detection (BCD) and SCD tasks, and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on the SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics: 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND, and 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm the distinct contribution of each novel component, and qualitative assessments highlight significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by the proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be made publicly available upon publication.
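
The frequency branch of the Joint Spatio-Frequency Fusion block is straightforward to sketch: take the 2-D FFT of a feature map and keep the log of its amplitude, which compresses the low-frequency (illumination) range while preserving high-frequency (edge) structure. This is a hedged reading of the block; the `log1p` stabilizer and `ortho` normalization are implementation assumptions.

```python
import torch

def log_amplitude_features(x: torch.Tensor) -> torch.Tensor:
    """Log-amplitude spectrum of a (B, C, H, W) feature map.

    log(1 + |FFT(x)|) damps the dominant low-frequency (illumination) energy
    while retaining high-frequency (edge) content for fusion with the spatial
    branch. Normalization choice is an assumption, not from the paper.
    """
    spec = torch.fft.fft2(x, norm='ortho')
    return torch.log1p(spec.abs())
```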

[513] Learning a distance measure from the information-estimation geometry of data

Guy Ohayon, Pierre-Etienne H. Fiquet, Florentin Guth, Jona Ballé, Eero P. Simoncelli

Main category: eess.IV

TL;DR: IEM is a novel distance metric derived from probability densities using information-estimation theory, linking log-probability to denoising errors, which adapts to distribution geometry and performs well on image quality assessment.

DetailsMotivation: The paper aims to develop a principled distance metric that captures the geometry of signal distributions by leveraging connections between information theory and estimation theory, particularly focusing on how denoising errors relate to signal probabilities.

Method: IEM is derived from comparing denoising error vectors of signals across noise amplitudes, which geometrically corresponds to comparing score vector fields of blurred densities. The metric can be computed using learned denoisers (similar to diffusion models) and solving a 1D integral.

Result: Theoretical results show IEM is a valid global distance metric with closed-form local Riemannian approximation. For Gaussian signals, it reduces to Mahalanobis distance. On ImageNet, learned IEM competes with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.

Conclusion: IEM provides a principled, distribution-adaptive distance metric that connects information theory with estimation theory, offering strong performance on perceptual image quality assessment tasks.

Abstract: We introduce the Information-Estimation Metric (IEM), a novel form of distance function derived from an underlying continuous probability density over a domain of signals. The IEM is rooted in a fundamental relationship between information theory and estimation theory, which links the log-probability of a signal with the errors of an optimal denoiser, applied to noisy observations of the signal. In particular, the IEM between a pair of signals is obtained by comparing their denoising error vectors over a range of noise amplitudes. Geometrically, this amounts to comparing the score vector fields of the blurred density around the signals over a range of blur levels. We prove that the IEM is a valid global distance metric and derive a closed-form expression for its local second-order approximation, which yields a Riemannian metric. For Gaussian-distributed signals, the IEM coincides with the Mahalanobis distance. But for more complex distributions, it adapts, both locally and globally, to the geometry of the distribution. In practice, the IEM can be computed using a learned denoiser (analogous to generative diffusion models) and solving a one-dimensional integral. To demonstrate the value of our framework, we learn an IEM on the ImageNet database. Experiments show that this IEM is competitive with or outperforms state-of-the-art supervised image quality metrics in predicting human perceptual judgments.
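
Following the abstract's recipe (denoising-error vectors compared over a range of noise amplitudes, then a one-dimensional integral), a Monte-Carlo sketch might look as follows. The `denoiser(x, sigma)` signature, the shared noise samples across the two signals, and the unweighted trapezoidal integration are all assumptions; the paper derives the exact integrand and weighting.

```python
import torch

def iem_distance(denoiser, x, y, sigmas, n_samples=8):
    """Monte-Carlo sketch of an Information-Estimation-Metric-style distance.

    Averages denoising-error vectors of x and y at each noise amplitude
    (sharing noise draws for variance reduction), compares them, and
    integrates the squared gap over sigma with a trapezoidal rule.
    """
    gaps = []
    for sigma in sigmas:
        ex = ey = 0.0
        for _ in range(n_samples):
            z = torch.randn_like(x)
            ex = ex + (denoiser(x + sigma * z, sigma) - x)
            ey = ey + (denoiser(y + sigma * z, sigma) - y)
        gap = ((ex - ey) / n_samples).flatten(1).pow(2).sum(dim=1)
        gaps.append(gap)
    gaps = torch.stack(gaps, dim=-1)  # (B, num_sigmas)
    grid = torch.as_tensor(sigmas, dtype=gaps.dtype, device=gaps.device)
    return torch.trapz(gaps, grid).sqrt()
```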

[514] Robust Detection of Retinal Neovascularization in Widefield Optical Coherence Tomography

Jinyi Hao, Jie Wang, Liqin Gao, Tristan T. Hormel, Yukun Guo, An-Lun Wu, Christina J. Flaxel, Steven T. Bailey, Kotaro Tsuboi, Thomas S. Hwang, Yali Jia

Main category: eess.IV

TL;DR: Deep learning approach for retinal neovascularization diagnosis and staging on widefield OCT/OCTA images using binary localization instead of multi-layer segmentation

DetailsMotivation: Retinal neovascularization (RNV) in diabetic retinopathy causes vision loss but is preventable with timely intervention. Widefield OCTA imaging improves early detection but existing algorithms are optimized for narrow fields of view, lacking effective RNV detection and quantification for clinical widefield applications.

Method: Reframes RNV identification as a direct binary localization task rather than conventional multi-layer retinal segmentation. Fully automated deep learning approach trained and validated on 589 widefield scans (17x17-mm to 26x21-mm) from multiple devices and clinics.

Result: Achieved device-dependent AUC ranging from 0.96 to 0.99 for RNV diagnosis, and mean IOU ranging from 0.76 to 0.88 for segmentation. Demonstrated ability to monitor lesion growth longitudinally.

Conclusion: Deep learning-based analysis for widefield OCTA images offers valuable means for improving RNV screening and management in clinical settings.

Abstract: Retinal neovascularization (RNV) is a vision-threatening development in diabetic retinopathy (DR). Vision loss associated with RNV is preventable with timely intervention, making clinical screening and monitoring of RNV a priority. Optical coherence tomography (OCT) angiography (OCTA) provides high-resolution imaging and high-sensitivity detection of RNV lesions. With recent commercial devices introducing widefield OCTA imaging to the clinic, the technology stands to improve early detection of RNV pathology. To meet clinical requirements, however, these imaging capabilities must be paired with effective RNV detection and quantification, and existing algorithms for OCTA images are optimized for conventional, i.e., narrow, fields of view. Here, we present a novel approach to RNV diagnosis and staging on widefield OCT/OCTA. Unlike conventional methods that depend on multi-layer retinal segmentation, our model reframes RNV identification as a direct binary localization task. Our fully automated approach was trained and validated on 589 widefield scans (17x17-mm to 26x21-mm) collected from multiple devices at multiple clinics. Our method achieved a device-dependent area under the curve (AUC) ranging from 0.96 to 0.99 for RNV diagnosis and a mean intersection over union (IOU) ranging from 0.76 to 0.88 for segmentation. We also demonstrate our method’s ability to monitor lesion growth longitudinally. Our results indicate that deep learning-based analysis of widefield OCTA images could offer a valuable means of improving RNV screening and management.
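
As a rough illustration of the direct binary localization framing: the model emits a single-channel lesion map supervised against a binary mask, with no intermediate retinal-layer segmentation. The BCE-plus-soft-Dice loss below is a common recipe for such heads, offered as an illustrative stand-in rather than the paper's stated loss.

```python
import torch
import torch.nn.functional as F

def binary_localization_loss(logits, mask, eps=1.0):
    """BCE + soft-Dice for a one-channel lesion map.

    logits: (B, 1, H, W) raw scores; mask: (B, 1, H, W) float {0, 1} labels.
    A generic recipe for binary lesion localization, not the paper's loss.
    """
    bce = F.binary_cross_entropy_with_logits(logits, mask)
    prob = logits.sigmoid()
    inter = (prob * mask).sum(dim=(1, 2, 3))
    dice = 1 - (2 * inter + eps) / (prob.sum(dim=(1, 2, 3)) + mask.sum(dim=(1, 2, 3)) + eps)
    return bce + dice.mean()
```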

[515] EchoJEPA: A Latent Predictive Foundation Model for Echocardiography

Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Ahmadreza Attarpour, River Jiang, Brana Sooriyakanthan, Maala Sooriyakanthan, Heather Whitney, Jeremy Slivnick, Barry Rubin, Wendy Tsang, Bo Wang

Main category: eess.IV

TL;DR: EchoJEPA is a foundation model for echocardiography trained on 18M scans using latent predictive objectives to learn robust anatomical representations that ignore ultrasound speckle noise, achieving superior performance in cardiac function estimation and generalization.

DetailsMotivation: Current foundation models for echocardiography struggle to separate anatomical signals from inherent ultrasound artifacts like speckle noise and acquisition variations, limiting their robustness and generalization capabilities in medical AI applications.

Method: Trained on 18 million echocardiograms across 300K patients using a latent predictive objective (JEPA - Joint Embedding Predictive Architecture) to learn robust anatomical representations that ignore speckle noise, with validation through multi-view probing framework using frozen backbones.

Result: Outperforms leading baselines by ~20% in LVEF estimation and ~17% in RVSP estimation; achieves 79% view classification accuracy with only 1% labeled data vs 42% for best baseline with 100% data; degrades only 2% under acoustic perturbations vs 17% for competitors; zero-shot performance on pediatric patients surpasses fully fine-tuned baselines.

Conclusion: Latent prediction is a superior paradigm for robust, generalizable medical AI in echocardiography, with EchoJEPA demonstrating remarkable sample efficiency, noise robustness, and cross-population generalization capabilities.

Abstract: Foundation models for echocardiography often struggle to disentangle anatomical signal from the stochastic speckle and acquisition artifacts inherent to ultrasound. We present EchoJEPA, a foundation model trained on 18 million echocardiograms across 300K patients, representing the largest pretraining corpus for this modality to date. By leveraging a latent predictive objective, EchoJEPA learns robust anatomical representations that ignore speckle noise. We validate this using a novel multi-view probing framework with frozen backbones, where EchoJEPA outperforms leading baselines by approximately 20% in left ventricular ejection fraction (LVEF) estimation and 17% in right ventricular systolic pressure (RVSP) estimation. The model also exhibits remarkable sample efficiency, reaching 79% view classification accuracy with only 1% of labeled data versus 42% for the best baseline trained on 100%. Crucially, EchoJEPA demonstrates superior generalization, degrading by only 2% under physics-informed acoustic perturbations compared to 17% for competitors. Most remarkably, its zero-shot performance on pediatric patients surpasses fully fine-tuned baselines, establishing latent prediction as a superior paradigm for robust, generalizable medical AI.
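
A minimal sketch of the latent predictive (JEPA-style) objective: a context encoder's output is mapped by a predictor onto the latents of a frozen, EMA-updated target encoder, so the loss lives in latent space and pixel-level speckle carries no direct gradient. The encoder interface, predictor depth, masking strategy (omitted), and momentum value are assumptions, not EchoJEPA's published configuration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentPredictiveLoss(nn.Module):
    """JEPA-style objective: predict target-encoder latents from context latents.

    `encoder` is any module mapping token sequences to (B, N, dim) latents;
    the target branch is a gradient-free copy updated by EMA.
    """
    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.context_encoder = encoder
        self.target_encoder = copy.deepcopy(encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, context_tokens, target_tokens):
        pred = self.predictor(self.context_encoder(context_tokens))
        with torch.no_grad():
            tgt = self.target_encoder(target_tokens)
        return F.smooth_l1_loss(pred, tgt)  # loss in latent space, not pixels

    @torch.no_grad()
    def ema_update(self, m=0.996):
        for q, k in zip(self.context_encoder.parameters(), self.target_encoder.parameters()):
            k.mul_(m).add_((1 - m) * q)
```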

Last updated: 2026-02-13