Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] BAT: Better Audio Transformer Guided by Convex Gated Probing

Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos

Main category: cs.SD

TL;DR: Convex Gated Probing (CGP) is introduced to close the gap between fine-tuning and probing for audio SSL models, enabling better evaluation and guiding improvements to audio SSL pipelines.

Details

Motivation: Audio SSL models rely on fine-tuning for evaluation because simple probing fails to unlock their full potential and alters rankings on benchmarks like AudioSet. A robust probing mechanism is needed to guide audio SSL development toward reliable and reproducible methods.

Method: Introduces Convex Gated Probing (CGP), a prototype-based method that efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. CGP guides improvements to the entire SSL pipeline including data preprocessing, model architecture, and pre-training recipes.

Result: CGP drastically closes the gap between fine-tuning and probing in audio. Guided by CGP, the authors rework SSL pipelines and introduce Better Audio Transformer (BAT), establishing new state-of-the-art on audio benchmarks.

Conclusion: CGP provides a robust probing mechanism for audio SSL models that enables better evaluation and guides improvements to achieve state-of-the-art performance on audio benchmarks.

Abstract: Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.

Relevance: 9/10
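
To make the idea of gating over frozen layers concrete, here is a minimal sketch. It is not the authors' CGP implementation: it assumes the gate is a softmax over one learnable logit per layer, so the blended feature is a convex combination of per-layer embeddings. The prototype-based classifier is omitted, and the names `convex_gate`, `layer_feats`, and `gate_logits` are hypothetical.

```python
import numpy as np

def convex_gate(layer_feats, gate_logits):
    """Blend per-layer features with convex (softmax) weights.

    layer_feats: shape (num_layers, dim), one pooled embedding per frozen layer.
    gate_logits: shape (num_layers,), would be learned in a real probe.
    """
    w = np.exp(gate_logits - gate_logits.max())  # numerically stable softmax
    w = w / w.sum()                              # weights are >= 0 and sum to 1
    return w @ layer_feats, w

rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 768))   # stand-in for 12 frozen transformer layers
logits = np.zeros(12)                # untrained gate: uniform weights
mixed, weights = convex_gate(feats, logits)
```

With an untrained gate the blend is just the layer average. After training, inspecting `weights` is one way such a probe could indicate which layers carry task-relevant information.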

[2] Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

Main category: cs.SD

TL;DR: Spatial Audio Question Answering (Spatial AQA) with movement reasoning using stereo audio, featuring data augmentation, multimodal finetuning with thinking mode, and query-conditioned source separation.

Details

Motivation: To enable machines to interpret complex auditory scenes with moving sound sources, focusing on movement reasoning where models must infer object motion, position, and directional changes from stereo audio.

Method: Three main components: 1) Movement-centric spatial audio augmentation framework synthesizing diverse motion patterns from mono audio events, 2) End-to-end multimodal finetuning with thinking mode for explicit intermediate reasoning, 3) Investigation of query-conditioned source separation with three inference regimes (no masking, audio grounding model, ground-truth masks).

Result: Reasoning amplifies benefits of source separation, with thinking mode showing +5.1% improvement when a single event is present in the question. Findings highlight interplay between movement modeling, reasoning, and separation quality.

Conclusion: The work offers new insights for advancing spatial audio understanding through movement modeling, explicit reasoning, and source separation techniques.

Abstract: Spatial audio understanding aims to enable machines to interpret complex auditory scenes, particularly when sound sources move over time. In this work, we study Spatial Audio Question Answering (Spatial AQA) with a focus on movement reasoning, where a model must infer object motion, position, and directional changes directly from stereo audio. First, we introduce a movement-centric spatial audio augmentation framework that synthesizes diverse motion patterns from isolated mono audio events, enabling controlled and scalable training data generation. Second, we propose an end-to-end multimodal finetuning approach with a thinking mode, which allows audio-language models to produce explicit intermediate reasoning steps before predicting an answer. Third, we investigate the impact of query-conditioned source separation as a preprocessing stage and compare three inference regimes: no masking, an audio grounding model (AGM), and ground-truth masks. Our results show that reasoning amplifies the benefits of source separation, with thinking mode showing significant improvement of +5.1% when a single event is present in the question. These findings highlight the interplay between movement modeling, reasoning, and separation quality, offering new insights for advancing spatial audio understanding.

Relevance: 9/10
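
As a rough illustration of synthesizing motion from a mono event, the sketch below sweeps a mono signal across the stereo field with constant-power amplitude panning. This is a stand-in rather than the paper's augmentation framework, which the summary does not specify: it ignores inter-channel time differences and room acoustics, and `pan_moving_source` is a hypothetical name.

```python
import numpy as np

def pan_moving_source(mono, start_pan, end_pan):
    """Return a (2, num_samples) stereo signal with a linearly moving source.

    Pan values lie in [-1, 1]: -1 is hard left, +1 is hard right.
    """
    pan = np.linspace(start_pan, end_pan, num=mono.shape[0])
    theta = (pan + 1.0) * np.pi / 4.0   # map pan to an angle in [0, pi/2]
    left = np.cos(theta) * mono         # constant-power law: cos^2 + sin^2 = 1
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=0)

sr = 16000                               # assumed sample rate
t = np.arange(sr) / sr
mono = np.sin(2 * np.pi * 440.0 * t)     # one second of a 440 Hz tone
stereo = pan_moving_source(mono, -1.0, 1.0)  # left-to-right sweep
```

Because the start and end positions are known, each synthesized clip comes with a ground-truth answer for questions such as "which direction did the source move?".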

[3] Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang

Main category: cs.SD

TL;DR: Systematic study of native audio foundation models using next-token prediction on audio, with scaling laws analysis and SODA model suite for general audio generation and cross-modal tasks.

Details

Motivation: Current audio language models are text-first or use semantic-only audio tokens, limiting general audio modeling. Need native audio foundation models that jointly model semantic content, acoustic details, and text for both audio generation and cross-modal capabilities.

Method: 1) Systematic investigation of design choices (data sources, text mixture ratios, token composition) to establish training recipe. 2) First scaling law study for discrete audio models via IsoFLOP analysis on 64 models. 3) Training SODA (Scaling Open Discrete Audio) suite from 135M to 4B parameters on 500B tokens.

Result: Found optimal data grows 1.6× faster than optimal model size. SODA serves as flexible backbone for diverse audio/text tasks, demonstrated through fine-tuning for voice-preserving speech-to-speech translation with unified architecture.

Conclusion: Native audio foundation models using next-token prediction can effectively model audio at scale, supporting both general audio generation and cross-modal capabilities. SODA provides validated approach for building such models.

Abstract: Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices – data sources, text mixture ratios, and token composition – establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning 3×10^18 to 3×10^20 FLOPs, finding that optimal data grows 1.6× faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks – we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.

Relevance: 9/10
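
One way to read the "1.6× faster" finding is as a ratio of power-law exponents. The sketch below rests on two assumptions that the summary does not confirm: that optimal model size and data scale as N ∝ C^a and D ∝ C^b with b/a = 1.6, and that compute follows the common approximation C ≈ 6·N·D, which forces a + b = 1. The function name `split_exponents` is hypothetical.

```python
def split_exponents(ratio):
    """Solve a + b = 1 and b / a = ratio for the compute-scaling exponents."""
    a = 1.0 / (1.0 + ratio)    # exponent for optimal model size N
    b = ratio / (1.0 + ratio)  # exponent for optimal token count D
    return a, b

a, b = split_exponents(1.6)

# Growth factors implied by a 10x increase in compute under these assumptions.
growth_n = 10 ** a
growth_d = 10 ** b
```

Under this reading, a tenfold compute increase would be split into roughly 2.4× more parameters and 4.1× more tokens. The paper's fitted exponents may differ.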

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 102]
cs.CV [Total: 79]
cs.AI [Total: 36]
cs.SD [Total: 8]
cs.LG [Total: 149]
cs.MA [Total: 3]
cs.MM [Total: 1]
eess.AS [Total: 8]
eess.IV [Total: 9]

cs.CL

[1] The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts

Warren Johnson

Main category: cs.CL

TL;DR: Paper validates code generation’s tolerance to prompt compression across multiple benchmarks, reveals perplexity paradox in token pruning, and proposes adaptive compression algorithm for cost reduction.

Details

Motivation: Previous work found code generation tolerates aggressive prompt compression while reasoning degrades, but was limited to one benchmark, left mechanisms unvalidated, and lacked adaptive algorithms.

Method: Validates across 6 code and 4 reasoning benchmarks, conducts per-token perplexity analysis (n=723 tokens), and proposes TAAC (Task-Aware Adaptive Compression) algorithm.

Result: Confirms compression threshold generalizes, reveals perplexity paradox (code syntax preserved vs. math values pruned), signature injection recovers +34% pass rate, TAAC achieves 22% cost reduction with 96% quality preservation.

Conclusion: Task-aware adaptive compression outperforms fixed-ratio compression (by 7% in the reported experiments), and MBPP validation shows results varying systematically with the compression ratio (3.6% at r=0.3 to 54.6% at r=1.0).

Abstract: In “Compress or Route?” (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the “perplexity paradox” mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a “perplexity paradox”: code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen’s h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
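
The reported effect size can be checked from the two pass rates quoted above. Cohen's h for two proportions is the difference of their arcsine-transformed values; the short sketch below recomputes it, with `cohens_h` as an illustrative helper name.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for the change from proportion p1 to p2."""
    return 2.0 * math.asin(math.sqrt(p2)) - 2.0 * math.asin(math.sqrt(p1))

before, after = 0.053, 0.393          # pass rates quoted in the abstract
h = cohens_h(before, after)
gain_pp = (after - before) * 100.0    # gain in percentage points
```

This gives a gain of 34 percentage points and h of about 0.89, in line with the reported h=0.890.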

[2] Language Model Representations for Efficient Few-Shot Tabular Classification

Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne

Main category: cs.CL

TL;DR: TaRL: A lightweight method for few-shot tabular classification using LLM embeddings with component removal and temperature calibration.

Details

Motivation: Web tables contain valuable structured data but are heterogeneous, making unified methods challenging. LLMs are already deployed in web infrastructure, so can we reuse them for tabular classification without specialized models?

Method: TaRL (Table Representation with Language Model) uses semantic embeddings of table rows from existing LLMs. Two key techniques: 1) remove common component from all embeddings, 2) calibrate softmax temperature using a meta-learner trained on handcrafted features.

Result: Achieves performance comparable to state-of-the-art models in low-data regimes (k ≤ 32) for semantically-rich tables. Shows naive embeddings underperform but with proposed techniques can unlock their potential.

Conclusion: Demonstrates viability of reusing existing LLM infrastructure for efficient semantics-driven web table understanding, avoiding need for specialized models or extensive retraining.

Abstract: The Web is a rich source of structured data in the form of tables, from product catalogs and knowledge bases to scientific datasets. However, the heterogeneity of the structure and semantics of these tables makes it challenging to build a unified method that can effectively leverage the information they contain. Meanwhile, large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search. This raises a crucial question: can we leverage these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals), avoiding the need for specialized models or extensive retraining? This work investigates a lightweight paradigm, Table Representation with Language Model (TaRL), for few-shot tabular classification that directly utilizes semantic embeddings of individual table rows. We first show that naive application of these embeddings underperforms compared to specialized tabular models. We then demonstrate that their potential can be unlocked with two key techniques: removing the common component from all embeddings and calibrating the softmax temperature. We show that a simple meta-learner, trained on handcrafted features, can learn to predict an appropriate temperature. This approach achieves performance comparable to state-of-the-art models in low-data regimes (k ≤ 32) of semantically-rich tables. Our findings demonstrate the viability of an efficient, semantics-driven pathway to reuse existing LLM infrastructure for Web table understanding.
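
The two techniques can be sketched on toy data. This is an illustration under stated assumptions, not the TaRL implementation: the "common component" is taken to be the mean support embedding, classification uses cosine similarity to class prototypes, and the temperature is fixed by hand rather than predicted by a meta-learner. All function names are hypothetical.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Softmax over the last axis after dividing scores by the temperature."""
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)   # stabilize the exponent
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def few_shot_predict(support, labels, query, temperature):
    """Classify query rows by cosine similarity to centered class prototypes."""
    common = support.mean(axis=0, keepdims=True)  # assumed "common component"
    s, q = support - common, query - common
    classes = np.unique(labels)
    protos = np.stack([s[labels == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    return softmax_with_temperature(q @ protos.T, temperature)

# Toy row embeddings: a large shared offset hides small class differences.
offset = 5.0
support = offset + np.array([[1.0, 0.0, 0.0, 0.0],
                             [0.9, 0.1, 0.0, 0.0],
                             [0.0, 1.0, 0.0, 0.0],
                             [0.1, 0.9, 0.0, 0.0]])
labels = np.array([0, 0, 1, 1])
query = offset + np.array([[0.8, 0.2, 0.0, 0.0]])
probs = few_shot_predict(support, labels, query, temperature=0.1)
```

Without centering, every row points in nearly the same direction and cosine similarities are almost indistinguishable. Removing the shared component exposes the class signal, and the temperature controls how sharply the similarities turn into probabilities.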

[3] KD4MT: A Survey of Knowledge Distillation for Machine Translation

Ona de Gibert, Joseph Attieh, Timothee Mickus, Yves Scherrer, Jörg Tiedemann

Main category: cs.CL

TL;DR: Survey paper synthesizing Knowledge Distillation for Machine Translation (KD4MT) across 105 papers, covering methodological contributions, practical applications, evaluation practices, risks, and LLMs’ impact on the field.

Details

Motivation: Knowledge Distillation has become important for model compression, but in Machine Translation it serves broader purposes beyond just compression. The field lacks comprehensive synthesis and unified evaluation practices, so this survey aims to organize the KD4MT literature and provide practical guidance.

Method: Comprehensive survey methodology analyzing 105 papers through October 2025. Categorizes advances based on methodological contributions and practical applications. Includes qualitative and quantitative analyses, provides practical guidelines, and discusses risks and LLM impacts.

Result: Identifies common trends in KD4MT, highlights key research gaps, notes absence of unified evaluation practices, provides selection guidelines for different settings, and discusses risks like hallucination and bias amplification. Includes public database and glossary.

Conclusion: KD for MT is a nuanced field serving multiple purposes beyond compression. The survey provides comprehensive organization of the literature, practical guidance, and highlights future directions including LLM impacts. Public resources support further research.

Abstract: Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.

[4] Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs

Xinyu Gao, Shaonan Wang, Nai Ding

Main category: cs.CL

TL;DR: GTCA adds gated tree cross-attention to decoder-only LLMs to improve syntactic robustness without compromising existing capabilities.

Details

Motivation: Decoder-only LLMs are brittle to grammatical perturbations, but directly injecting syntactic structure interferes with pretrained competence. Need a checkpoint-compatible approach to enhance syntactic robustness.

Method: Introduces gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Uses token update mask and staged training to control scope and timing of structural updates.

Result: GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning across benchmarks and Transformer backbones.

Conclusion: GTCA provides a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs, enhancing reliability for downstream reasoning.

Abstract: Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.

[5] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models

Pranav Bhandari, Usman Naseem, Mehwish Nasim

Main category: cs.CL

TL;DR: Personality steering vectors in LLMs show geometric dependence across traits, limiting independent control even with orthonormalization techniques.

Details

Motivation: Current personality steering methods assume independent trait control, but this assumption hasn't been validated. The paper examines whether personality traits in LLMs can actually be controlled independently through geometric analysis of steering vectors.

Method: Analyzed geometric relationships between Big Five personality steering vectors in LLaMA-3-8B and Mistral-8B. Applied various geometric conditioning schemes: unconstrained directions, soft orthonormalization, and hard orthonormalization to study trait independence.

Result: Personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is removed. Hard orthonormalization enforces geometric independence but doesn’t eliminate cross-trait behavioral effects and can reduce steering strength.

Conclusion: Personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control. The assumption of independent trait steering doesn’t hold in practice.

Abstract: Personality steering in large language models (LLMs) commonly relies on injecting trait-specific steering vectors, implicitly assuming that personality traits can be controlled independently. In this work, we examine whether this assumption holds by analysing the geometric relationships between Big Five personality steering directions. We study steering vectors extracted from two model families (LLaMA-3-8B and Mistral-8B) and apply a range of geometric conditioning schemes, from unconstrained directions to soft and hard orthonormalisation. Our results show that personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is explicitly removed. While hard orthonormalisation enforces geometric independence, it does not eliminate cross-trait behavioural effects and can reduce steering strength. These findings suggest that personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control.
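
For readers unfamiliar with hard orthonormalisation, the sketch below shows one standard way to do it, a reduced QR decomposition of five correlated directions. The abstract does not specify the paper's exact procedure, so treat this as an assumption; the vectors are random stand-ins, not extracted steering vectors, and `hard_orthonormalize` is a hypothetical name.

```python
import numpy as np

def hard_orthonormalize(vectors):
    """Return orthonormal rows spanning the same subspace as the input rows."""
    # Reduced QR on the transpose: q has shape (dim, k) with orthonormal columns.
    # Note that QR may flip the sign of a direction relative to its input.
    q, _ = np.linalg.qr(vectors.T)
    return q.T

rng = np.random.default_rng(0)
shared = rng.normal(size=64)                    # component common to all traits
raw = rng.normal(size=(5, 64)) + 2.0 * shared   # five correlated "trait" directions
ortho = hard_orthonormalize(raw)
gram = ortho @ ortho.T                          # identity if rows are orthonormal
```

The Gram matrix of the result is the identity, so linear overlap is gone by construction. The paper's point is that behavioural cross-trait effects can persist even so.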

[6] Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling

Andrius Matšenas, Anet Lello, Tõnis Lees, Hans Peep, Kim Lilii Tamm

Main category: cs.CL

TL;DR: LLMs show promise as alternative to questionnaires for personality assessment with moderate convergent validity and equal perceived accuracy.

Details

Motivation: To validate Large Language Models as a dynamic alternative to traditional questionnaire-based personality assessment methods.

Method: Within-subjects experiment with 33 participants comparing Big Five personality scores from guided LLM conversations against gold-standard IPIP-50 questionnaire, measuring both objective scores and user-perceived accuracy.

Result: Moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences. Participants rated LLM-generated profiles as equally accurate as traditional questionnaire results.

Conclusion: Conversational AI offers a promising new approach to traditional psychometrics, though trait-specific calibration is needed for some personality dimensions.

Abstract: This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.

[7] Preference Optimization for Review Question Generation Improves Writing Quality

Karun Sharma, Vidushee Vats, Shengzhi Li, Yuxiang Wang, Zhongtian Sun, Prayag Tiwari

Main category: cs.CL

TL;DR: IntelliAsk: A question-generation model trained with IntelliReward reward model to generate substantive, evidence-based peer review questions that go beyond surface-level queries.

Details

Motivation: Existing LLM-based approaches for peer review generate surface-level questions that draw over 50% of question tokens from a paper's first page, lacking substantive, evidence-based questioning needed for effective peer review.

Method: Developed IntelliReward reward model using frozen autoregressive LLM with trainable multi-head transformers over final 50 token states, then applied Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) to train IntelliAsk question-generation model aligned with human standards of effort, evidence, and grounding.

Result: IntelliAsk outperforms API-based SFT baselines in predicting expert-level human preferences and shows consistent improvements on reasoning and writing benchmarks, with measurable gains over Qwen3-32B base model on tasks like MuSR (68.3 vs 64.7 Acc) and WritingBench (8.31 vs 8.07).

Conclusion: The approach demonstrates that reviewer-question quality correlates with broader capabilities, and the released implementation, annotations, and IntelliReward model provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.

Abstract: Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50% of their question tokens from a paper’s first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.

[8] Large Language Models for Assisting American College Applications

Zhengliang Liu, Weihang You, Peng Shu, Junhao Chen, Yi Pan, Hanqi Jiang, Yiwei Li, Zhaojun Ding, Chao Cao, Xinliang Li, Yifan Zhou, Ruidong Zhang, Shaochen Xu, Wei Ruan, Huaqin Zhao, Dajiang Zhu, Tianming Liu

Main category: cs.CL

TL;DR: EZCollegeApp is an LLM-powered system that helps high school students navigate complex college applications by structuring forms, grounding answers in official documents, and maintaining human control through a mapping-first paradigm.

Details

Motivation: College applications involve fragmented admissions policies, repetitive forms, and ambiguous questions requiring cross-referencing multiple sources, creating significant barriers for students.

Method: Uses a mapping-first paradigm separating form understanding from answer generation, with document ingestion from official websites, retrieval-augmented question answering, and human-in-the-loop chatbot interface.

Result: The paper describes the system architecture, data pipeline, internal representations, and security/privacy measures, and evaluates the system through automated testing and human quality assessment; source code is released on GitHub.

Conclusion: EZCollegeApp demonstrates an effective LLM-powered approach to assist students with complex application processes while maintaining human oversight and control.

Abstract: American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (https://github.com/ezcollegeapp-public/ezcollegeapp-public) to facilitate the broader impact of this work.

[9] Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

David Y. Liu, Aditya Joshi, Paul Dawson

Main category: cs.CL

TL;DR: Survey paper examining how NLP research engages with narrative studies, proposing a taxonomy for narrative-related LLM applications in story generation and understanding.

Details

Motivation: To provide a systematic overview of how natural language processing research engages with narrative studies, examining patterns in narrative datasets, tasks, theories, and methodological trends in LLM applications for story generation and understanding.

Method: Survey methodology examining existing research at the intersection of NLP and narrative studies, proposing a taxonomy based on established narratological distinctions, analyzing patterns in datasets, tasks, theories, and methodological approaches (prompting vs fine-tuning).

Result: Identifies patterns in narrative datasets/tasks, narrative theories integrated with NLP pipelines, and methodological trends; highlights how LLMs enable connections between NLP pipelines and abstract narrative concepts; identifies challenges in unified definitions/benchmarks for narrative tasks.

Conclusion: Progress benefits from focusing on theory-based metrics for individual narrative attributes, large-scale theory-driven analysis, and experiments validating narrative theories, rather than pursuing a single generalized benchmark for narrative quality.

Abstract: Applications of narrative theories using large language models (LLMs) deliver promising use-cases in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research engages with fields of narrative studies, and proposes a taxonomy for ongoing efforts that reflect established distinctions in narratology. We discover patterns in the following: narrative datasets and tasks, narrative theories and NLP pipelines, and methodological trends in prompting and fine-tuning. We highlight how LLMs enable easy connections of NLP pipelines with abstract narrative concepts and opportunities for interdisciplinary collaboration. Challenges remain in attempts to work towards any unified definition or benchmark of narrative-related tasks, making model comparison difficult. For future directions, instead of the pursuit of a single, generalised benchmark for “narrative quality”, we believe that progress benefits more from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes to incrementally improve model performance; conducting large-scale, theory-driven literary/social/cultural analysis; and creating experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

[10] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng

Main category: cs.CL

TL;DR: A study on clinical NLP models for discharge planning that addresses temporal leakage risks through interpretability auditing, showing improved calibration and safety over raw performance metrics.

Details

Motivation: Clinical NLP models for hospital discharge planning are vulnerable to temporal and lexical leakage where documentation artifacts encode future decisions, creating overconfident predictions that pose safety risks for real-world deployment.

Method: Developed a lightweight auditing pipeline integrating interpretability into model development to identify and suppress leakage-prone signals before final training, using next-day discharge prediction after elective spine surgery as a case study.

Result: Audited models showed more conservative and better-calibrated probability estimates with reduced reliance on discharge-related lexical cues, demonstrating improved safety-relevant trade-offs.

Conclusion: Deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance metrics to ensure safe clinical integration.

Abstract: Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
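The calibration claim above can be made concrete with a small sketch. The abstract does not say which calibration metric the authors use, so the expected calibration error (ECE) below is an assumed stand-in, and the probabilities and labels are made-up toy values, not results from the paper.

```python
def expected_calibration_error(probs, labels, n_bins=5):
    """ECE for binary predictions: weighted gap between mean predicted
    probability and observed positive rate within equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - pos_rate)
    return ece


# Toy data (hypothetical): 1 = discharged next day, 0 = not discharged.
labels = [1, 0, 1, 0]
overconfident = [0.95, 0.95, 0.95, 0.95]  # e.g. a model leaning on leaked cues
conservative = [0.50, 0.50, 0.50, 0.50]   # e.g. a model after auditing

ece_over = expected_calibration_error(overconfident, labels)
ece_cons = expected_calibration_error(conservative, labels)
print(ece_over, ece_cons)
```

In this toy case the overconfident predictions have a much larger ECE than the conservative ones, which is the kind of gap a calibration audit is meant to surface.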

[11] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

Main category: cs.CL

TL;DR: Team-of-Thoughts: A heterogeneous multi-agent system using orchestrator-tool paradigm with calibration and self-assessment to leverage complementary capabilities of differently post-trained models for reasoning and code generation tasks.

Details

Motivation: Existing multi-agent systems use static, homogeneous model configurations that fail to exploit the distinct strengths of differently post-trained models, limiting their ability to leverage complementary capabilities.

Method: Introduces a novel MAS architecture with orchestrator-tool paradigm featuring: 1) orchestrator calibration to identify models with superior coordination capabilities, and 2) self-assessment protocol where tool agents profile their domain expertise. The orchestrator dynamically activates the most suitable tool agents during inference based on proficiency profiles.

Result: Experiments on five reasoning and code generation benchmarks show consistently superior task performance. Achieves 96.67% accuracy on AIME24 and 72.53% on LiveCodeBench, substantially outperforming homogeneous role-play baselines (80% and 65.93% respectively).

Conclusion: Team-of-Thoughts demonstrates that leveraging heterogeneous agents with complementary capabilities through orchestrator calibration and self-assessment protocols significantly improves performance on complex reasoning and code generation tasks compared to homogeneous systems.

Abstract: Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
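As a rough illustration of profile-based activation, the sketch below ranks tool agents by a self-assessed proficiency score and activates the top few for a given domain. The `ToolAgent` class, the `select_agents` function, the score scale, and the agent names are all hypothetical; the abstract does not describe how profiles are represented or how the orchestrator chooses.

```python
from dataclasses import dataclass


@dataclass
class ToolAgent:
    name: str
    # Self-assessed proficiency per domain, assumed here to lie in [0, 1].
    profile: dict


def select_agents(agents, domain, top_k=2):
    """Return the top_k agents ranked by self-assessed proficiency in `domain`.
    Agents with no entry for the domain are treated as proficiency 0."""
    ranked = sorted(agents, key=lambda a: a.profile.get(domain, 0.0), reverse=True)
    return ranked[:top_k]


# Hypothetical agents and scores, for illustration only.
agents = [
    ToolAgent("math_model", {"math": 0.90, "code": 0.40}),
    ToolAgent("code_model", {"math": 0.50, "code": 0.95}),
    ToolAgent("general_model", {"math": 0.60, "code": 0.60}),
]

chosen = [a.name for a in select_agents(agents, "code")]
print(chosen)
```

A real orchestrator would presumably make this choice per query and combine the agents' outputs; this sketch covers only the selection step.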

[12] A Lightweight Explainable Guardrail for Prompt Safety

Md Asiful Islam, Mihai Surdeanu

Main category: cs.CL

TL;DR: LEG: Lightweight explainable guardrail method for unsafe prompt classification using multi-task learning with synthetic explainability data and novel loss functions.

Details

Motivation: Need for lightweight, explainable safety guardrails for LLMs that can classify unsafe prompts while providing explanations for decisions, addressing confirmation biases in LLM-generated explanations.

Method: Multi-task learning architecture jointly learns prompt classification and explanation classification; uses synthetic explainability data generated with bias-counteraction strategy; employs novel loss combining cross-entropy and focal losses with uncertainty-based weighting.

Result: Achieves equivalent or better performance than SOTA for both prompt classification and explainability on three datasets, despite significantly smaller model size; works well both in-domain and out-of-domain.

Conclusion: LEG provides effective, lightweight explainable safety guardrails for LLMs with competitive performance and smaller footprint than existing approaches.

Abstract: We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG’s training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
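One way to read "cross-entropy and focal losses with uncertainty-based weighting" is sketched below. The abstract does not give the formula, so the weighting scheme (a learned log-variance per loss term, in the style of homoscedastic uncertainty weighting) and the assignment of cross-entropy to the prompt task and focal loss to the word-level task are assumptions, not the authors' definition.

```python
import math


def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0):
    """Focal loss: cross-entropy scaled down for well-classified examples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def uncertainty_weighted(losses, log_vars):
    """Assumed weighting: sum_i exp(-s_i) * L_i + s_i, where s_i is a
    learnable log-variance for loss term i."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Hypothetical model outputs for a single training example.
p_prompt = 0.9  # probability assigned to the correct safe/unsafe label
p_word = 0.6    # probability assigned to the correct explanation label of one word

ce = cross_entropy(p_prompt)
fl = focal_loss(p_word)
total = uncertainty_weighted([ce, fl], [0.0, 0.0])
print(ce, fl, total)
```

With both log-variances at zero the combined loss reduces to the plain sum of the two terms; during training the log-variances would be optimized together with the model weights.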

[13] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhiqiang You, Yumeng Zhang, Zhoupeng Shou

Main category: cs.CL

TL;DR: GOPO is a hierarchical RL framework for task-oriented dialogue that separates strategy planning (Expert Agent) from response generation (Customer Service Agent) to optimize long-horizon task success.

Details

Motivation: Existing training methods for task-oriented dialogue systems rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success metrics. There's a need for better alignment between dialogue optimization and actual task completion in commercial scenarios.

Method: Goal-Oriented Preference Optimization (GOPO) uses hierarchical reinforcement learning with two agents: an Expert Agent that optimizes multi-turn goal preferences at the dialogue-trajectory level, and a Customer Service Agent that generates responses strictly aligned with the selected strategy.

Result: On the Mgshop dataset, GOPO improves Task-focused Sequential Engagement (TSE) by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. A 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2.

Conclusion: GOPO establishes a new paradigm for task-oriented dialogue systems in commercial scenarios by effectively optimizing for long-horizon task success through hierarchical strategy planning and response generation.

Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent’s critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.

[14] Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective

Yunhao Liu, Zian Jia, Xinyu Gao, Kanjun Xu, Yun Xiong

Main category: cs.CL

TL;DR: SeleCom introduces a selector-based soft compression framework for RAG that uses query-conditioned information selection instead of full document compression, achieving better performance with reduced computation.

Details

Motivation: Current soft context compression methods for RAG underperform due to full-compression that forces encoding all document information regardless of query relevance, leading to information dilution and conflict with LLM generation behavior.

Method: SeleCom uses a decoder-only selector trained with curriculum learning on diverse synthetic QA data to perform query-conditioned information selection rather than full compression, redefining the encoder’s role.

Result: SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines while reducing computation and latency by 33.8%-84.6%.

Conclusion: Query-conditioned information selection is more effective than full compression for RAG, addressing fundamental limitations of current approaches while maintaining performance with substantial efficiency gains.

Abstract: Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet these methods often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility: full-compression conflicts with the LLM’s downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder’s role as a query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%-84.6%.

[15] Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach

Yi Liu

Main category: cs.CL

TL;DR: CRAF framework integrates traditional feature methods with LLMs for multi-platform opinion analysis, featuring cross-platform attention, adaptive fusion, joint optimization, and multimodal video processing capabilities.

Details

Motivation: Public opinion analysis from multiple heterogeneous sources faces challenges due to structural differences, semantic variations, and platform-specific biases, requiring a unified approach that can handle diverse data formats and modalities.

Method: Four-stage framework: 1) cross-platform collaborative attention for semantic alignment, 2) hierarchical adaptive fusion for dynamic feature weighting, 3) joint optimization for topic and sentiment learning, 4) multimodal extraction integrating OCR, ASR, and visual sentiment analysis for video content.

Result: Achieves topic clustering ARI of 0.76 (4.1% improvement) and sentiment F1-score of 0.84 (3.8% improvement) on multi-platform datasets, reduces labeled data requirement for new platforms by 75%, with theoretical generalization bound improvement.

Conclusion: CRAF effectively integrates traditional and LLM-based approaches for multi-platform opinion analysis, demonstrating strong cross-platform adaptability and multimodal processing capabilities with practical efficiency gains.

Abstract: The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement). The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.

[16] State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

Annie Wong, Aske Plaat, Thomas Bäck, Niki van Stein, Anna V. Kononova

Main category: cs.CL

TL;DR: LLM state representation analysis shows trajectory summarization improves performance, natural language is most robust, and text-based spatial encodings outperform images by forcing spatial reasoning.

Details

Motivation: As LLMs move from static reasoning to dynamic environments, understanding how state representation affects performance is crucial. The paper investigates how different state representations impact LLMs' ability to navigate changing environments at inference time.

Method: Systematically varied three aspects of state representation: (1) granularity (long form vs summary), (2) structure (natural language vs symbolic), and (3) spatial grounding (text-only vs images vs textual map encodings) across sequential decision-making benchmarks with fixed model parameters.

Result: Trajectory summarization improves performance by reducing noise and stabilizing long-horizon reasoning. Natural language representations are most robust across models. Structured encodings help mainly for models with strong code/structured output priors. Text-based spatial encodings outperform images, with the advantage coming from the construction process forcing spatial reasoning.

Conclusion: State representation design choices are decisive for performance, distinct from information availability. However, even with improved representations, current LLMs/VLMs remain brittle over long horizons when synthesizing information for multiple subtasks.

Abstract: As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. We find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image-inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.

[17] From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants

Krittin Pachtrachai, Petmongkon Pornpichitsuwan, Wachiravit Modecrua, Touchapon Kraisingkorn

Main category: cs.CL

TL;DR: Framework for building conversational AI assistants from call transcripts using quality filtering, LLM-extracted knowledge, RAG pipeline, and systematic prompt tuning, evaluated in real estate and recruitment domains.

Details

Motivation: Building reliable conversational AI assistants for customer-facing industries is challenging due to noisy data, fragmented knowledge, and real-time information requirements, especially in domains like real estate and recruitment that are currently suboptimal for automation.

Method: 1) Grade and filter call transcripts using PIPA framework adaptation, 2) Extract structured knowledge from curated transcripts using LLMs, 3) Deploy knowledge as grounding source in RAG pipeline, 4) Govern assistant behavior through systematic prompt tuning from monolithic to modular designs, 5) Evaluate using transcript-grounded user simulator and red teaming.

Result: The assistant autonomously handles ~30% of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness against adversarial testing in real estate and specialist recruitment domains.

Conclusion: The framework successfully builds reliable conversational AI assistants from historical call transcripts, demonstrating effective automation in challenging domains despite real-time information constraints.

Abstract: Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percent of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.

[18] Reranker Optimization via Geodesic Distances on k-NN Manifolds

Wen G. Gong

Main category: cs.CL

TL;DR: Maniscope is a geometric reranking method for retrieval-augmented generation that uses geodesic distances on k-NN manifolds to combine global cosine similarity with local manifold geometry, achieving near cross-encoder accuracy with 10-45x lower latency.

Details

Motivation: Current neural reranking approaches for RAG rely on computationally expensive cross-encoders or LLMs with latencies of 3-5 seconds per query, making them impractical for real-time deployment. There's a need for efficient reranking methods that maintain accuracy while dramatically reducing latency.

Method: Maniscope computes geodesic distances on k-nearest neighbor manifolds constructed over retrieved document candidates. It combines global cosine similarity with local manifold geometry to capture semantic structure missed by flat Euclidean metrics. The method has O(N D + M² D + M k log k) complexity where M ≪ N.

Result: On eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, it achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency.

Conclusion: Maniscope provides a practical alternative for real-time RAG deployment by achieving near state-of-the-art accuracy with sub-10 ms latency, making it suitable for production systems where computational efficiency is critical.

Abstract: Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluating on eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The method requires O(N D + M² D + M k log k) complexity where M ≪ N, enabling sub-10 ms latency. We plan to release Maniscope as open-source software.
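The general idea of reranking with geodesic distances on a k-NN graph can be sketched as follows. This is not the Maniscope implementation: the choice of anchor (the candidate with the highest cosine similarity to the query), the scoring formula, the `alpha` mixing weight, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import heapq
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_graph(docs, k):
    """Undirected k-NN graph over candidates; edge weight is cosine distance."""
    graph = {i: {} for i in range(len(docs))}
    for i in range(len(docs)):
        dists = sorted(
            (max(0.0, 1.0 - cosine(docs[i], docs[j])), j)
            for j in range(len(docs)) if j != i
        )
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_from(graph, source):
    """Shortest-path (geodesic) distances from `source` via Dijkstra."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def rerank(query, docs, k=2, alpha=0.5):
    """Mix global cosine similarity with closeness to the anchor on the graph."""
    sims = [cosine(query, d) for d in docs]
    anchor = max(range(len(docs)), key=lambda i: sims[i])
    geo = geodesic_from(knn_graph(docs, k), anchor)
    scores = [alpha * sims[i] + (1 - alpha) / (1.0 + geo[i]) for i in range(len(docs))]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings (hypothetical); real embeddings have hundreds of dimensions.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
order = rerank(query, docs)
print(order)
```

Building the graph only over the M retrieved candidates rather than the full corpus of N documents is what keeps the pairwise step at M² similarity computations.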

[19] CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

Jinxiang Xie, Zihao Li, Wei He, Rui Ding, Shi Han, Dongmei Zhang

Main category: cs.CL

TL;DR: CAST framework improves LLM output stability for tabular data analysis tasks (summarization and tagging) through algorithmic prompting and explicit intermediate reasoning constraints.

Details

Motivation: Current LLMs lack output stability required for data analytics tasks like summarization and tagging of tabular data, where consistent results are crucial for reliable analysis.

Method: CAST combines Algorithmic Prompting (procedural scaffold for reasoning transitions) and Thinking-before-Speaking (explicit intermediate commitments before final generation) to constrain the model’s latent reasoning path.

Result: CAST achieves best stability among baselines, improving Stability Score by up to 16.2% while maintaining or improving output quality across multiple LLM backbones and benchmarks.

Conclusion: The CAST framework successfully addresses LLM output stability issues for tabular data analysis through constrained reasoning approaches, enabling more reliable data analytics applications.

Abstract: Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model’s latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.

[20] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

Main category: cs.CL

TL;DR: A semantically grounded framework for recipe generation from food images that improves semantic fidelity through action/ingredient prediction, validation, and a two-stage fine-tuning pipeline with reinforcement learning.

Details

Motivation: Current multimodal LLMs for recipe generation produce outputs with high lexical scores but often contain semantically incorrect actions or ingredients, highlighting the need for better semantic grounding.

Method: Two-stage pipeline: 1) Supervised fine-tuning using Action-Reasoning dataset and ingredient corpus, 2) Reinforcement fine-tuning with frequency-aware rewards. Includes Semantic Confidence Scoring and Rectification module for filtering and correcting predictions.

Result: State-of-the-art performance on Recipe1M dataset with markedly improved semantic fidelity compared to previous approaches.

Conclusion: The proposed semantically grounded framework effectively addresses semantic errors in recipe generation from food images through internal context prediction and validation mechanisms.

Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.

[21] Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

Daehoon Gwak, Minseo Jung, Junwoo Park, Minho Park, ChaeHun Park, Junha Hyung, Jaegul Choo

Main category: cs.CL

TL;DR: Self-generated examples in LLMs improve reasoning not through the examples themselves but through the process of creating them, as shown by integrated prompting outperforming decoupled prompting.

Details

Motivation: While LLMs show improved reasoning with self-generated few-shot examples, the mechanism behind these gains is unclear, making it hard to know when and how to apply the technique effectively. The paper aims to determine whether benefits come from the examples themselves or the creation process.

Method: Systematically evaluated three prompting strategies across diverse LLM architectures: (1) Zero-shot prompting, (2) Integrated prompting (LLMs create and solve problems in a single prompt), and (3) Decoupled prompting (self-generated examples are reused as in-context examples without creation context). Conducted experiments across five model architectures and performed attention analysis to examine patterns.

Result: Integrated prompting consistently outperformed both Zero-shot and Decoupled prompting across all model architectures. Decoupled prompting offered only marginal gains over Zero-shot. Attention analysis revealed significant differences in attention patterns between Integrated and Decoupled prompting.

Conclusion: The advantage of self-generation prompting comes from the process of problem creation itself, not from the generated examples. This provides valuable insights for designing more effective prompting strategies that leverage the creation process rather than just reusing examples.

Abstract: Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
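The three prompting strategies differ only in how the prompt is assembled, which a short sketch can show. The wording of the templates and the function names below are hypothetical; the abstract does not reproduce the prompts used in the paper.

```python
def zero_shot(question):
    """Ask the question directly, with no examples."""
    return f"Solve the following problem.\n\nProblem: {question}\nAnswer:"


def integrated(question, n_examples=2):
    """Single prompt: the model creates and solves its own examples,
    then answers the original question in the same context."""
    return (
        f"First, create {n_examples} problems similar to the one below "
        f"and solve each of them.\nThen solve the original problem.\n\n"
        f"Problem: {question}\nAnswer:"
    )


def decoupled(question, examples):
    """Reuse previously self-generated (problem, answer) pairs as plain
    few-shot examples, without the context in which they were created."""
    shots = "\n\n".join(f"Problem: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nProblem: {question}\nAnswer:"


question = "What is 17 * 6?"
# Hypothetical examples that a model produced in an earlier, separate call.
examples = [("What is 12 * 5?", "60"), ("What is 9 * 8?", "72")]

prompts = {
    "zero_shot": zero_shot(question),
    "integrated": integrated(question),
    "decoupled": decoupled(question, examples),
}
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

In the decoupled setting the model sees the same examples but never performs the act of creating them, which is the contrast the study isolates.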

Dhiman Goswami, Jai Kruthunz Naveen Kumar, Sanchari Das

Main category: cs.CL

TL;DR: Systematic review of privacy risks in social media NLP tasks, proposing NLP-PRISM framework to evaluate vulnerabilities across six dimensions, revealing privacy-utility trade-offs in transformer models.

Details

Motivation: NLP processes social media content containing PII, behavioral cues, and metadata, creating privacy risks like surveillance, profiling, and targeted advertising that need systematic assessment.

Method: Review of 203 papers and development of NLP-PRISM framework evaluating vulnerabilities across data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance dimensions.

Result: Transformer models achieve F1-scores of 0.58-0.84 but suffer 1%-23% performance drop under privacy-preserving fine-tuning; privacy research gaps identified across six NLP tasks; 2%-9% utility trade-off with MIA AUC 0.81 and AIA accuracy 0.75.

Conclusion: Advocates for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.

Abstract: Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata, raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58 to 0.84, but incur a 1% to 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24), revealing substantial gaps in privacy research. We further found a trade-off in which model utility is reduced by 2% to 9%, with a membership inference attack (MIA) AUC of 0.81 and an attribute inference attack (AIA) accuracy of 0.75. Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.

[23] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

Berry Gerrits

Main category: cs.CL

TL;DR: LLMs perform poorly in Zork text adventure game, achieving less than 10% completion on average, revealing fundamental limitations in reasoning and metacognitive abilities.

Details

Motivation: To evaluate problem-solving and reasoning capabilities of contemporary LLMs using Zork, a seminal text-based adventure game, as a controlled environment to assess natural language interpretation and action sequence generation.

Method: Tested leading proprietary models (ChatGPT, Claude, Gemini) in Zork under minimal and detailed instructions, measuring game progress through achieved scores as primary metric, with qualitative analysis of reasoning processes.

Result: All models achieved less than 10% completion on average, with best performer (Claude Opus 4.5) reaching only ~75/350 points. Detailed instructions and extended thinking provided no improvement. Models showed inability to reflect on thinking, inconsistent strategy persistence, and failure to learn from history.

Conclusion: Current LLMs have substantial limitations in metacognitive abilities and problem-solving within text-based games, raising questions about the nature and extent of their reasoning capabilities.

Abstract: In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game’s dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models (ChatGPT, Claude, and Gemini) under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling “extended thinking”. Qualitative analysis of the models’ reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one’s own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs’ metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

[24] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

Magnus Boman

Main category: cs.CL

TL;DR: A formalization of LLM interaction using multi-tape Turing machines to precisely localize failure modes in different pipeline stages.

Details

Motivation: LLMs exhibit failure modes on seemingly trivial tasks, and current approaches lack rigorous formal frameworks to understand and localize these failures within the LLM pipeline.

Method: Proposes a deterministic multi-tape Turing machine model where each tape represents a distinct LLM component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text.

Result: Enables precise localization of failure modes to specific pipeline stages, reveals how tokenization obscures character-level structure needed for tasks like counting, and clarifies why techniques like chain-of-thought prompting work by externalizing computation.

Conclusion: Provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis for understanding LLM limitations.

Abstract: Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
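The point about tokenisation obscuring character-level structure can be made concrete with a toy example. The sketch below is not the paper's formal model: the vocabulary, token IDs, and greedy longest-match tokenizer are all hypothetical, chosen only to show that a counting question that is trivial on a character tape has no direct answer on a token tape.

```python
# Toy vocabulary mapping text pieces to opaque integer IDs (hypothetical).
VOCAB = {"straw": 101, "berry": 102, "s": 1, "t": 2, "r": 3, "a": 4,
         "w": 5, "b": 6, "e": 7, "y": 8}


def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


word = "strawberry"
char_tape = list(word)               # one cell per character
token_tape = tokenize(word, VOCAB)   # one cell per token ID

# Counting a letter is a direct scan on the character tape ...
r_count = char_tape.count("r")
# ... but the token tape holds only IDs, which carry no letter information
# unless the vocabulary tape is consulted to expand them again.
print(char_tape, r_count)
print(token_tape)
```

Here the character tape has ten cells while the token tape has two, so any letter count must be recovered indirectly from what the model has learned about each token ID.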

[25] Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches

Noopur Zambare, Kiana Aghakasiri, Carissa Lin, Carrie Ye, J. Ross Mitchell, Mohamed Abdalla

Main category: cs.CL

TL;DR: Smaller LLMs achieve comparable de-identification performance to larger models with lower inference costs, and can be fine-tuned to outperform larger models on multilingual and gendered data, with BERT-MultiCulture-DEID models released for robust multi-cultural clinical de-identification.

Details

Motivation: To evaluate LLM generalizability across formats, cultures, and genders for clinical de-identification, and address the efficiency-generalizability trade-off for practical deployment.

Method: Systematic evaluation of fine-tuned transformers (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) on de-identification tasks across multiple languages and gendered names.

Result: Smaller models achieve comparable performance with substantially lower inference costs, and can outperform larger models when fine-tuned on multilingual data (Mandarin, Hindi, Spanish, French, Bengali, regional English variants) and gendered names.

Conclusion: Smaller models offer practical deployment advantages for clinical de-identification while maintaining performance, with BERT-MultiCulture-DEID models providing robust multi-cultural solutions and establishing pathways for fair and efficient de-identification.

Abstract: Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: https://doi.org/10.5281/zenodo.18342291

[26] VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

Shuhui Qu

Main category: cs.CL

TL;DR: VDLM is a variable diffusion language model that separates semantic planning from text rendering, enabling iterative refinement in latent space and robust decoding through embedding perturbations.

Details

Motivation: Autoregressive language models have limitations in multi-step reasoning due to irreversible left-to-right decoding that prevents revision. There's a need for models that can iteratively refine their reasoning in latent space before committing to text generation.

Method: VDLM uses LLaDA-style masked diffusion over semantic variable embeddings for iterative refinement in latent space. It post-trains the planner with trajectory-aware optimization using embedding-space rewards/values. A Vec2Text renderer converts planned embeddings to text, with embedding perturbations added to robustify decoding under planner noise.

Result: VDLM is competitive in pre-training across nine benchmarks spanning general reasoning, math, and code. It yields substantial post-training improvements on long-form generation tasks, outperforming other baselines.

Conclusion: The approach demonstrates effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling, offering advantages over traditional autoregressive models for multi-step reasoning tasks.

Abstract: Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose VDLM, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a Vec2Text renderer and introduce embedding perturbations to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.

[27] CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content

Diletta Abbonato

Main category: cs.CL

TL;DR: CheckIfExist is an open-source web tool for real-time verification of bibliographic references against multiple scholarly databases to detect AI-generated hallucinated citations.

Details

Motivation: The proliferation of LLMs in academic workflows has led to reference hallucination (generation of plausible but non-existent citations), with AI-hallucinated citations appearing even in premier ML conference papers, creating an urgent need for automated verification mechanisms.

Method: Developed a web-based tool using multi-source validation against CrossRef, Semantic Scholar, and OpenAlex databases with cascading validation architecture and string similarity algorithms to compute multi-dimensional match confidence scores.

Result: The tool provides instant feedback on reference authenticity, supports both single-reference verification and batch processing of BibTeX entries, and returns validated APA citations and exportable BibTeX records within seconds.

Conclusion: CheckIfExist fills the gap between existing reference management tools (which lack validation) and commercial hallucination detection services (which have usage limits/fees) by providing free, open-source, real-time citation verification.

Abstract: The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination – the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents “CheckIfExist”, an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.
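A cascading check with string-similarity scoring might look roughly like the sketch below. It is an illustration under stated assumptions, not the tool's implementation: the field weights, the 0.85 threshold, and the `lookup` callables are hypothetical, and the sources are plain in-memory stand-ins rather than real CrossRef, Semantic Scholar, or OpenAlex clients. Only `difflib.SequenceMatcher` from the Python standard library is used for similarity.

```python
from difflib import SequenceMatcher

# Hypothetical weights for combining per-field scores into one value.
WEIGHTS = {"title": 0.6, "authors": 0.3, "year": 0.1}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_confidence(reference: dict, candidate: dict) -> dict:
    """Score each bibliographic field separately."""
    return {
        "title": similarity(reference["title"], candidate["title"]),
        "authors": similarity(reference["authors"], candidate["authors"]),
        "year": 1.0 if reference["year"] == candidate["year"] else 0.0,
    }


def overall(scores: dict) -> float:
    """Weighted mean of the per-field scores."""
    return sum(WEIGHTS[field] * value for field, value in scores.items())


def verify(reference: dict, sources: list, threshold: float = 0.85):
    """Query sources in order and stop at the first confident match.

    `sources` is a list of (name, lookup) pairs; each lookup takes the
    reference and returns a candidate record or None.
    """
    for name, lookup in sources:
        candidate = lookup(reference)
        if candidate is None:
            continue
        scores = match_confidence(reference, candidate)
        if overall(scores) >= threshold:
            return {"source": name, "scores": scores, "record": candidate}
    return None  # no source confirmed the reference
```

Returning the per-field scores alongside the matched record lets a caller show why a reference was accepted, for example a strong title match with a mismatched year.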

[28] P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA

Xingda Lyu, Gongfu Lyu, Zitai Yan, Yuxin Jiang

Main category: cs.CL

TL;DR: P-RAG: A hybrid retrieval-augmented generation architecture combining parametric knowledge, retrieved evidence, Chain-of-Thought prompting, and LoRA fine-tuning for biomedical QA, achieving SOTA results on PubMedQA and 2WikiMultihopQA.

Details

Motivation: LLMs are limited by static training data, and while RAG helps by retrieving external knowledge, it still depends heavily on knowledge base quality. The authors aim to improve RAG performance, particularly for biomedical question answering.

Method: Proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture integrating parametric knowledge within LLM and retrieved evidence, guided by Chain-of-Thought prompting and LoRA fine-tuning. Evaluated three RAG variants (Standard RAG, DA-RAG, P-RAG) using LLaMA-3.2-1B-Instruct fine-tuned via LoRA on PubMedQA and 2WikiMultihopQA datasets.

Result: P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on Compare subset. CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler queries.

Conclusion: P-RAG demonstrates potential for accurate, scalable, and contextually adaptive biomedical question answering. Contributions include LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, introduction of P-RAG with Chain-of-Thought prompting, and state-of-the-art results on benchmark datasets.

Abstract: Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants on both general and biomedical datasets: Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG’s potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.
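The abstract quotes both an absolute gain (percentage points) and a relative gain (percent of the baseline), which are easy to confuse. The short check below recomputes both from the reported scores; the function names are illustrative, and the only inputs are the four figures stated above.

```python
# Scores as reported in the abstract, in percent.
P_RAG_F1 = 93.33
STANDARD_F1 = 82.86
P_RAG_2WIKI = 33.44
STANDARD_2WIKI = 17.83


def absolute_gain(new: float, base: float) -> float:
    """Difference in percentage points."""
    return new - base


def relative_gain(new: float, base: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return (new - base) / base * 100.0


print(f"{absolute_gain(P_RAG_F1, STANDARD_F1):.2f} percentage points")
print(f"{relative_gain(P_RAG_F1, STANDARD_F1):.2f}% relative")
print(f"{P_RAG_2WIKI / STANDARD_2WIKI:.2f}x the Standard RAG score")
```

The first two values agree with the 10.47-point and 12.64% figures in the abstract, and the ratio on 2WikiMultihopQA falls a little short of 2, consistent with "nearly doubles".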

[29] Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

Haihui Pan, Yuzhong Hong, Shaoke Lv, Junwei Bao, Hongfei Jiang, Yang Song

Main category: cs.CL

TL;DR: QEMPO is a new alignment method that decomposes LLM alignment into quality and diversity distributions, using entropy maximization to enhance output diversity while maintaining quality through constrained optimization.

Details

Motivation: Current alignment methods improve LLM output quality but reduce diversity, and existing diversity-enhancing methods often sacrifice performance. There's a need for methods that can maintain quality while increasing output diversity.

Method: Proposes Quality-constrained Entropy Maximization Policy Optimization (QEMPO), which decomposes alignment into quality and diversity distributions. Maximizes output entropy of the policy while ensuring quality through constraints, with both online and offline training methods.

Result: QEMPO achieves performance comparable to or better than RLHF while improving output diversity, demonstrating the effectiveness of the quality-diversity decomposition approach.

Conclusion: The theoretical decomposition of alignment into quality and diversity distributions enables effective methods like QEMPO to enhance LLM output diversity without sacrificing quality, addressing a key limitation of current alignment approaches.

Abstract: Recent research indicates that while alignment methods significantly improve the quality of large language model (LLM) outputs, they simultaneously reduce the diversity of the models’ output. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose the Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.
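The idea of maximizing entropy subject to a quality constraint can be illustrated on a toy problem with a handful of candidate responses. This is not the QEMPO training procedure: it relies on the standard result that, under a single expected-reward constraint, the maximum-entropy distribution has the Gibbs form p_i ∝ exp(λ r_i), and finds λ by bisection. The reward values and the quality threshold in the usage lines are made up for illustration.

```python
import math


def gibbs(rewards: list[float], lam: float) -> list[float]:
    """Distribution with p_i proportional to exp(lam * r_i)."""
    shift = max(lam * r for r in rewards)  # for numerical stability
    weights = [math.exp(lam * r - shift) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p: list[float]) -> float:
    return -sum(x * math.log(x) for x in p if x > 0)


def expected(p: list[float], rewards: list[float]) -> float:
    return sum(x * r for x, r in zip(p, rewards))


def max_entropy_policy(rewards, min_quality, iters=100):
    """Highest-entropy distribution whose expected reward >= min_quality."""
    if expected(gibbs(rewards, 0.0), rewards) >= min_quality:
        return gibbs(rewards, 0.0)  # uniform already meets the constraint
    lo, hi = 0.0, 1.0
    # Expected reward grows with lam, so expand hi until it is feasible.
    while expected(gibbs(rewards, hi), rewards) < min_quality:
        hi *= 2.0
        if hi > 1e6:
            raise ValueError("quality threshold is not reachable")
    for _ in range(iters):  # bisect for the smallest feasible lam
        mid = (lo + hi) / 2.0
        if expected(gibbs(rewards, mid), rewards) < min_quality:
            lo = mid
        else:
            hi = mid
    return gibbs(rewards, hi)


if __name__ == "__main__":
    rewards = [0.2, 0.5, 0.9, 0.9]  # hypothetical quality scores
    policy = max_entropy_policy(rewards, min_quality=0.8)
    print(policy, entropy(policy), expected(policy, rewards))
```

A greedy policy that always returns one top-scoring response would have zero entropy; the constrained solution instead spreads probability across all candidates while still meeting the quality threshold.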

[30] Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

Pengcheng Zhou, Haochen Li, Zhiqiang Nie, JiaLe Chen, Qing Gong, Weizhen Zhang, Chun Yu

Main category: cs.CL

TL;DR: CogitoRAG is a cognitive-inspired RAG framework that uses semantic gist extraction and knowledge graphs to improve retrieval accuracy by simulating human episodic memory processes.

Details

Motivation: Existing RAG frameworks suffer from semantic integrity loss due to discrete text representations, leading to retrieval deviations. The authors aim to address this by simulating human cognitive memory processes for better knowledge integration.

Method: 1) Offline: Extract semantic gist from corpora and build multi-dimensional knowledge graphs with entities, relations, and memory nodes. 2) Online: Decompose complex queries via Query Decomposition Module, perform associative retrieval via Entity Diffusion Module with structural relevance and frequency rewards, and rerank using CogniRank algorithm fusing diffusion scores with semantic similarity.

Result: Significantly outperforms state-of-the-art RAG methods across five mainstream QA benchmarks and multi-task generation on GraphBench, demonstrating superior complex knowledge integration and reasoning capabilities.

Conclusion: CogitoRAG effectively simulates human cognitive memory processes to enhance RAG performance, addressing semantic integrity issues in existing frameworks through gist-based knowledge representation and cognitive-inspired retrieval mechanisms.

Abstract: Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via a Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, an Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
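The abstract does not spell out how CogniRank fuses its two signals, so the sketch below should be read as one plausible shape for such a reranker rather than the authors' algorithm. It assumes min-max normalization of each score list and a linear blend controlled by a hypothetical weight `alpha`.

```python
def min_max(values: list[float]) -> list[float]:
    """Rescale scores to [0, 1]; constant lists map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def fuse_and_rerank(passages, diffusion_scores, similarity_scores, alpha=0.5):
    """Rank passages by a weighted blend of two normalized score lists.

    alpha weights the graph-diffusion score; (1 - alpha) weights the
    semantic-similarity score. Both the blend and alpha are assumptions.
    """
    d = min_max(diffusion_scores)
    s = min_max(similarity_scores)
    fused = [alpha * di + (1.0 - alpha) * si for di, si in zip(d, s)]
    order = sorted(range(len(passages)), key=lambda i: fused[i], reverse=True)
    return [(passages[i], fused[i]) for i in order]


if __name__ == "__main__":
    ranked = fuse_and_rerank(
        ["passage A", "passage B", "passage C"],
        diffusion_scores=[0.1, 0.9, 0.5],
        similarity_scores=[0.8, 0.7, 0.2],
    )
    for passage, score in ranked:
        print(f"{score:.3f}  {passage}")
```

Normalizing before blending keeps one signal from dominating simply because it lives on a larger numeric scale.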

[31] Doc-to-LoRA: Learning to Instantly Internalize Contexts

Rujikorn Charakorn, Edoardo Cetin, Shinnosuke Uesaka, Robert Tjarko Lange

Main category: cs.CL

TL;DR: D2L is a hypernetwork that meta-learns to generate LoRA adapters for LLMs in a single forward pass, enabling efficient long-context processing without re-consuming original context.

Details

Motivation: Address the quadratic attention cost in Transformers that makes long-context inference memory-intensive and slow, while overcoming the impractical training costs and latency of per-prompt context distillation.

Method: Proposes Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate context distillation within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries without re-consuming original context.

Result: On long-context needle-in-a-haystack tasks, D2L achieves near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM’s native context window by more than 4x. On real-world QA datasets, it outperforms standard context distillation while significantly reducing peak memory consumption and update latency.

Conclusion: D2L facilitates rapid adaptation of LLMs, opening possibilities for frequent knowledge updates and personalized chat behavior by efficiently handling long contexts through generated adapters.

Abstract: Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM’s native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
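To make the data flow concrete, the sketch below shows a hypernetwork emitting the two low-rank LoRA factors from a context vector, followed by the usual LoRA update W + scale * (B A). It is a shape-level illustration only: the "hypernetwork" is a single untrained random linear map, the context vector is assumed to be a fixed-size summary of the document, and every name and dimension is hypothetical rather than taken from D2L.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def make_hypernet(ctx_dim, d_out, d_in, rank, seed=0):
    """Random linear map from a context vector to all LoRA parameters."""
    rng = random.Random(seed)
    n_params = rank * (d_in + d_out)
    return [[rng.gauss(0.0, 0.1) for _ in range(ctx_dim)]
            for _ in range(n_params)]


def generate_lora(hypernet, ctx_vec, d_out, d_in, rank):
    """One forward pass: context vector in, LoRA factors A and B out."""
    flat = [sum(w * c for w, c in zip(row, ctx_vec)) for row in hypernet]
    a_flat, b_flat = flat[:rank * d_in], flat[rank * d_in:]
    A = [a_flat[i * d_in:(i + 1) * d_in] for i in range(rank)]   # rank x d_in
    B = [b_flat[i * rank:(i + 1) * rank] for i in range(d_out)]  # d_out x rank
    return A, B


def apply_lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) without modifying W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


if __name__ == "__main__":
    d_out, d_in, rank, ctx_dim = 4, 3, 2, 5
    W = [[0.0] * d_in for _ in range(d_out)]
    hypernet = make_hypernet(ctx_dim, d_out, d_in, rank)
    A, B = generate_lora(hypernet, [1.0] * ctx_dim, d_out, d_in, rank)
    print(apply_lora(W, A, B))
```

Once the adapter is produced, later queries run against the adapted weights alone, which is why the original context no longer needs to occupy the KV cache.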

Yichi Zhang, Zhuo Chen, Lingbing Guo, Wen Zhang, Huajun Chen

Main category: cs.CL

TL;DR: TOFU is a token-based foundation model for multi-modal knowledge graph reasoning that discretizes structural, visual, and textual information into tokens and uses hierarchical fusion with mixture-of-message mechanisms for cross-KG generalization.

Details

Motivation: Existing multi-modal knowledge graph reasoning methods are dataset-specific and struggle to generalize to new knowledge graphs. Recent knowledge graph foundation models focus mainly on structural patterns and ignore rich multi-modal signals, creating a gap in cross-KG transfer capabilities for multi-modal reasoning.

Method: TOFU discretizes structural, visual, and textual information into modality-specific tokens, then employs a hierarchical fusion architecture with mixture-of-message mechanisms to process these tokens and obtain transferable features for multi-modal knowledge graph reasoning.

Result: Experimental results on 17 transductive, inductive, and fully-inductive multi-modal knowledge graphs show that TOFU consistently outperforms strong knowledge graph foundation model and multi-modal knowledge graph reasoning baselines, delivering strong performance on unseen knowledge graphs.

Conclusion: TOFU demonstrates strong generalization capabilities across different multi-modal knowledge graphs by effectively integrating structural, visual, and textual information through tokenization and hierarchical fusion, addressing limitations of existing approaches.

Abstract: Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.

[33] Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang, Alsharif Abuadbba, Guangdong Bai

Main category: cs.CL

TL;DR: GHOST is a novel defense against gradient inversion attacks that uses token-level obfuscation to protect privacy in collaborative learning of large language models while maintaining utility.

Details

Motivation: Existing gradient perturbation defenses for collaborative learning are vulnerable because they preserve semantic similarity across gradient, embedding, and token spaces, allowing adversaries to reconstruct private training data from shared gradients.

Method: GHOST uses token-level obfuscation with shadow tokens: (1) searching for semantically distinct but embedding-proximate tokens via multi-criteria search, and (2) selecting optimal shadow tokens that preserve alignment with internal outputs while disrupting semantic connections.

Result: GHOST achieves remarkable privacy protection (as low as 1% recovery rate) while preserving utility (up to 0.92 F1 score and 5.45 perplexity) across diverse models (BERT to Llama) and datasets against state-of-the-art gradient inversion attacks.

Conclusion: GHOST effectively neutralizes gradient inversion attacks by decoupling semantic connections across spaces while maintaining training utility, providing a robust defense for collaborative learning of large language models.

Abstract: Training and fine-tuning large-scale language models benefit greatly from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs’ direct mapping from gradient space to token space. However, these methods often fall short because semantic similarity is retained across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.
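The search-then-select idea can be sketched on a toy embedding table. This is a heavily simplified stand-in for GHOST: the two-dimensional embeddings, the `related` map used as a proxy for semantic relatedness, and the 0.9 cosine threshold are all hypothetical, and selection here uses embedding proximity alone, whereas the paper selects by alignment with the model's internal outputs.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def shadow_token(token, embeddings, related, min_sim=0.9):
    """Pick the closest token in embedding space that is not semantically
    related to the original; return None if nothing qualifies."""
    blocked = related.get(token, set())
    candidates = []
    for other, vec in embeddings.items():
        if other == token or other in blocked:
            continue  # search step: drop the token itself and its relatives
        sim = cosine(embeddings[token], vec)
        if sim >= min_sim:
            candidates.append((sim, other))
    # Selection step (simplified): keep the most embedding-proximate one.
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    # Toy data: "table" sits close to "paris" in this made-up space
    # but is unrelated in meaning; "london" is close and related.
    embeddings = {
        "paris": [1.0, 0.1],
        "london": [0.98, 0.15],
        "table": [0.99, 0.12],
        "sky": [0.0, 1.0],
    }
    related = {"paris": {"london"}}
    print(shadow_token("paris", embeddings, related))
```

An attacker who inverts gradients computed on the substitute would, in this picture, recover the unrelated token rather than the private one.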

[34] MultiCube-RAG for Multi-hop Question Answering

Jimeng Shi, Wei Hu, Runchu Tian, Bowen Jin, Wonbin Kweon, SeongKu Kang, Yunfan Kang, Dingqi Ye, Sizhe Zhou, Shaowen Wang, Jiawei Han

Main category: cs.CL

TL;DR: MultiCube-RAG: A training-free method using ontology-based cube structures for multi-hop QA, improving accuracy by 8.9% over baselines with better efficiency and explainability.

Details

Motivation: Existing RAG methods struggle with structural semantics in multi-hop QA, graph-based approaches are noisy and expensive, and training-based methods have unstable convergence and high computational overhead.

Method: Proposes ontology-based cube structure with multiple orthogonal dimensions to model subjects, attributes, and relations. MultiCube-RAG uses specialized cubes for different subject classes, decomposes complex queries into simple subqueries along cube dimensions, and processes them sequentially without training.

Result: Experiments on four multi-hop QA datasets show 8.9% improvement in response accuracy over average baseline performance, with greater efficiency and inherent explainability.

Conclusion: MultiCube-RAG effectively addresses multi-hop QA challenges through structured knowledge representation and training-free multi-step reasoning, offering improved accuracy, efficiency, and explainability.

Abstract: Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.

[35] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain, Spencer Romo, Bob Strahan, Boyi Xie, Diego A. Socolinsky

Main category: cs.CL

TL;DR: DocSplit is the first comprehensive benchmark for document packet splitting, which involves separating multi-page document packets into individual documents, addressing a fundamental but overlooked task in document understanding.

Details

Motivation: Real-world document processing often involves heterogeneous multi-page document packets containing multiple documents stitched together, but current document understanding research largely ignores the fundamental task of splitting these packets into individual units.

Method: Created DocSplit benchmark dataset with five datasets of varying complexity covering diverse document types, layouts, and multimodal settings. Formalized the DocSplit task requiring models to identify document boundaries, classify document types, and maintain correct page ordering.

Result: Extensive experiments with multimodal LLMs revealed significant performance gaps in handling complex document splitting tasks, especially with challenges like out-of-order pages, interleaved documents, and documents lacking clear demarcations.

Conclusion: DocSplit provides the first systematic framework for advancing document packet splitting capabilities essential for legal, financial, healthcare, and other document-intensive domains, with released datasets to facilitate future research.

Abstract: Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models’ ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
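The three sub-tasks named above (boundary identification, type classification, page ordering) can be pictured as turning per-page predictions into ordered document groups. The sketch below is only an illustration under assumed conventions: the tuple layout, the field names, and the `split_packet` helper are hypothetical and are not taken from the DocSplit benchmark or its metrics.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical per-page prediction: (page_index, doc_id, doc_type, page_in_doc)
#   page_index  - position of the page in the scanned packet
#   doc_id      - predicted document the page belongs to
#   doc_type    - predicted document class
#   page_in_doc - predicted position of the page inside its document
PagePrediction = Tuple[int, str, str, int]


def split_packet(pages: List[PagePrediction]) -> Dict[str, dict]:
    """Group pages by predicted document and restore in-document order."""
    groups: Dict[str, List[PagePrediction]] = defaultdict(list)
    for page in pages:
        groups[page[1]].append(page)

    result: Dict[str, dict] = {}
    for doc_id, members in groups.items():
        # Sort by the predicted in-document position to undo shuffled pages.
        members.sort(key=lambda p: p[3])
        result[doc_id] = {
            "doc_type": members[0][2],
            "pages": [p[0] for p in members],
        }
    return result


# Made-up packet: two interleaved documents, one with out-of-order pages.
packet = [
    (0, "A", "invoice", 2),
    (1, "B", "contract", 1),
    (2, "A", "invoice", 1),
    (3, "B", "contract", 2),
]
documents = split_packet(packet)
print(documents)
```

In this toy packet, document A is reassembled from packet pages 2 and 0 (in that order), which is the kind of interleaved, out-of-order case the abstract lists as a challenge.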

[36] A Curious Class of Adpositional Multiword Expressions in Korean

Junghyun Min, Na-Rae Han, Jena D. Hwang, Nathan Schneider

Main category: cs.CL

TL;DR: Analysis of Korean postpositional verb-based constructions (PVCs) as multiword expressions, proposing annotation guidelines for cross-lingual alignment.

Details

Motivation: Korean multiword expressions (MWEs), particularly multiword adpositions, are underrepresented in cross-lingual annotation frameworks like PARSEME, lacking systematic analysis and annotated resources.

Method: Study Korean functional MWEs called postpositional verb-based constructions (PVCs) using Korean Wikipedia data, analyze PVC expressions and contrast them with non-MWEs and light verb constructions with similar structure.

Result: Analysis of Korean PVCs reveals their characteristics and distinctions from similar constructions, leading to proposed annotation guidelines for Korean multiword adpositions.

Conclusion: Proposed annotation guidelines aim to support future work on Korean multiword adpositions and facilitate alignment with existing cross-lingual frameworks like PARSEME.

Abstract: Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing multilingual frameworks. In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs and light verb constructions (LVCs) with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.

[37] CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

Bradley McDanel, Steven Li, Harshit Khaitan

Main category: cs.CL

TL;DR: Cross-Layer Attention Aggregation (CLAA) improves LLM inference speed by aggregating token importance scores across layers to stabilize token ranking for selective processing, reducing Time-to-First-Token by up to 39%.

Details

Motivation: The prefill stage in long-context LLM inference is computationally expensive. Existing token-ranking heuristics for selective processing suffer from unstable token importance estimation that varies between layers, and there's no good way to evaluate token-ranking quality independently from heuristic-specific architectures.

Method: Introduces an Answer-Informed Oracle that defines ground-truth token importance by measuring attention from generated answers back to the prompt. This reveals that existing heuristics have high variance across layers. The solution is Cross-Layer Attention Aggregation (CLAA), which aggregates token importance scores across layers rather than relying on any single layer.

Result: CLAA closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.

Conclusion: Aggregating token importance scores across layers stabilizes token ranking for selective processing in LLM inference, significantly improving inference speed while maintaining quality.

Abstract: The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
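The core idea, ranking prompt tokens by scores aggregated over layers instead of trusting a single layer, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mean as the aggregation rule, the toy score matrix, and the function names are all assumptions, since the abstract does not say how CLAA combines layers.

```python
from typing import List


def aggregate_scores(layer_scores: List[List[float]]) -> List[float]:
    """Average per-token importance over layers (the mean is an assumed choice)."""
    num_layers = len(layer_scores)
    num_tokens = len(layer_scores[0])
    return [
        sum(layer[t] for layer in layer_scores) / num_layers
        for t in range(num_tokens)
    ]


def top_k_tokens(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in prompt order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


# Made-up scores: 3 layers x 4 prompt tokens. Layer 1 is an outlier whose
# ranking disagrees with the other two layers.
layer_scores = [
    [0.9, 0.1, 0.8, 0.2],
    [0.1, 0.1, 0.2, 0.9],
    [0.8, 0.2, 0.9, 0.1],
]
aggregated = aggregate_scores(layer_scores)
kept = top_k_tokens(aggregated, k=2)
kept_from_outlier_layer = top_k_tokens(layer_scores[1], k=2)
print(kept, kept_from_outlier_layer)
```

With these toy numbers, the aggregated ranking keeps tokens 0 and 2, while relying on layer 1 alone would keep tokens 2 and 3. That mirrors the failure mode described above, where rankings degrade at specific layers.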

[38] Surgical Activation Steering via Generative Causal Mediation

Aruna Sankaranarayanan, Amir Zur, Atticus Geiger, Dylan Hadfield-Menell

Main category: cs.CL

TL;DR: GCM (Generative Causal Mediation) is a method to identify and control specific model components (like attention heads) that mediate binary concepts in long-form language model responses, enabling targeted steering of behaviors like refusal, sycophancy, and style transfer.

Details

Motivation: The paper addresses the challenge of controlling specific behaviors in language models when those behaviors are diffused across many tokens in long-form responses. Traditional methods struggle to pinpoint where to intervene in the model architecture to steer such complex, multi-token concepts.

Method: GCM constructs datasets of contrasting inputs and responses for binary concepts, then quantifies how individual model components (e.g., attention heads) mediate the contrastive concept. It selects the strongest mediators for steering interventions.

Result: GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads across three tasks (refusal, sycophancy, style transfer) and three language models.

Conclusion: GCM provides an effective approach for localizing and controlling long-form responses in language models by identifying causal mediators rather than relying on correlational methods, enabling more targeted and efficient steering of model behaviors.

Abstract: Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks–refusal, sycophancy, and style transfer–across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.

[39] Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs

Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière

Main category: cs.CL

TL;DR: Large-scale analysis of 41 open-weight language models on false belief tasks reveals partial sensitivity to mental states, with larger models showing better performance, and provides insights into human cognition through comparative analysis.

Details

Motivation: To address limitations in prior research that relied on small samples of closed-source LMs, this study aims to rigorously test psychological theories about mental state reasoning and evaluate LM capacities using a larger, more diverse set of open-weight models.

Method: Replicated and extended published work on false belief tasks by assessing mental state reasoning across 41 open-weight models from distinct families, analyzing sensitivity to implied knowledge states, and comparing LM behavior with human cognition.

Result: 34% of tested LMs showed sensitivity to implied knowledge states; larger models demonstrated increased sensitivity and higher psychometric predictive power; both humans and LMs showed bias toward attributing false beliefs with non-factive verb cues.

Conclusion: Open-weight LMs provide valuable tools for testing theories of human cognition and evaluating LM capacities, with language distribution statistics potentially explaining certain cognitive biases but not full human sensitivity to knowledge states.

Abstract: Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition–such as the theory that mental state reasoning emerges in part from language exposure–and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully “explain away” the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a greater bias towards attributing false beliefs when knowledge states are cued using a non-factive verb (“John thinks…”) than when cued indirectly (“John looks in the…”). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes, suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.

[40] Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities

Shankar Padmanabhan, Mustafa Omer Gul, Tanya Goyal

Main category: cs.CL

TL;DR: DiSC is a context-distillation method for continual knowledge adaptation in LLMs that learns new knowledge while mitigating forgetting of previously learned skills.

Details

Motivation: Post-trained LLMs have knowledge cut-off dates and need continual adaptation, but existing solutions can't simultaneously learn new knowledge and prevent forgetting of earlier capabilities like instruction-following and reasoning.

Method: Distillation via Split Contexts (DiSC) uses context-distillation by conditioning student and teacher distributions on distinct segments of training examples and minimizing KL divergence between shared tokens, avoiding explicit generation steps during training.

Result: DiSC consistently achieves the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills across four post-trained models and two adaptation domains, outperforming prior finetuning and distillation methods.

Conclusion: DiSC provides an effective approach for continual knowledge adaptation in LLMs that maintains previously learned capabilities while acquiring new knowledge.

Abstract: Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpus and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. DiSC derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.
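The training signal described above is a KL divergence between teacher and student next-token distributions at the token positions both share. The sketch below illustrates only that quantity. The direction KL(teacher || student), the averaging over positions, and the toy distributions are assumptions, and how DiSC actually splits the context is not reproduced here.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_token_kl(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Average KL(teacher || student) over the shared token positions."""
    per_token = [kl_divergence(t, s) for t, s in zip(teacher, student)]
    return sum(per_token) / len(per_token)


# Made-up next-token distributions at two shared positions (vocabulary size 3).
# In a real setup these would come from the same model conditioned on
# different context segments; here they are hard-coded for illustration.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]

loss = mean_token_kl(teacher, student)
print(f"mean KL over shared tokens: {loss:.4f}")
```

The loss is zero when the student already matches the teacher at every shared position, and grows as the two distributions diverge.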

[41] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis

Rong Fu, Wenxin Zhang, Ziming Wang, Chunlei Meng, Jiaxuan Lu, Jiekai Wu, Kangan Qian, Hao Zhang, Simon Fong

Main category: cs.CL

TL;DR: A framework called Missing-by-Design (MBD) for revocable multimodal sentiment analysis that enables selective deletion of specific data modalities while maintaining system functionality.

Details

Motivation: As multimodal systems handle sensitive personal data, there's a growing need for selective revocation of specific data modalities to comply with privacy regulations and respect user autonomy. Current systems lack efficient mechanisms to surgically remove specific modalities without full retraining.

Method: MBD combines structured representation learning with a certifiable parameter-modification pipeline. It learns property-aware embeddings and uses generator-based reconstruction to handle missing channels while preserving task-relevant signals. For deletion requests, it applies saliency-driven candidate selection and calibrated Gaussian updates to produce machine-verifiable Modality Deletion Certificates.

Result: Experiments on benchmark datasets show MBD achieves strong predictive performance under incomplete inputs and delivers practical privacy-utility trade-offs, positioning surgical unlearning as an efficient alternative to full retraining.

Conclusion: MBD provides a unified framework for revocable multimodal analysis that addresses privacy compliance needs through certifiable modality deletion, offering an efficient solution for privacy-sensitive multimodal applications.

Abstract: As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.

[42] Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

Nithin Sivakumaran, Shoubin Yu, Hyunji Lee, Yue Zhang, Ali Payani, Mohit Bansal, Elias Stengel-Eskin

Main category: cs.CL

TL;DR: REMUL improves faithfulness of Chain-of-Thought reasoning in LLMs using multi-party reinforcement learning where speakers generate reasoning traces that listeners can execute, balancing faithfulness with task performance.

Details

Motivation: Chain-of-Thought reasoning often lacks faithfulness to the true computational process of LLMs, limiting its explanatory power, and there's a tradeoff between optimizing for faithfulness/interpretability and maintaining task performance.

Method: REMUL uses multi-party reinforcement learning with speaker models generating reasoning traces that are truncated and passed to listener models who execute them. Speakers are rewarded for producing clear reasoning that listeners can follow, with masked supervised finetuning for correctness regularization.

Result: On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, FOLIO), REMUL substantially improves three faithfulness measures (hint attribution, early answering AOC, mistake injection AOC) while also improving accuracy.

Conclusion: REMUL successfully addresses the faithfulness-performance tradeoff in CoT reasoning, producing more faithful reasoning traces that are shorter, more direct, and legible across training domains.

Abstract: Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who “execute” the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness – hint attribution, early answering area over the curve (AOC), and mistake injection AOC – while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.

[43] LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers

Peiqi Sui

Main category: cs.CL

TL;DR: LLMs have an “uncertainty gap” in creative writing where human stories show higher uncertainty than model outputs, limiting literary creativity despite alignment strategies that reduce uncertainty for factuality.

Details

Motivation: Current LLMs produce trite, cliché-ridden creative writing. Literary theory identifies uncertainty as essential for creativity, but alignment strategies steer models away from uncertainty to ensure factuality and reduce hallucinations, creating a tension.

Method: Formalized the “uncertainty gap” by quantifying uncertainty differences between human-authored stories and model-generated continuations. Conducted controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets.

Result: Human writing consistently exhibits significantly higher uncertainty than model outputs. Instruction-tuned and reasoning models exacerbate this trend compared to base models. The gap is more pronounced in creative writing than functional domains and strongly correlates with writing quality.

Conclusion: Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and constructive ambiguity needed for literary richness.

Abstract: We argue that uncertainty is a key and understudied limitation of LLMs’ performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the “uncertainty gap” between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates to writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
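One simple way to picture an "uncertainty gap" is as a difference in average token surprisal between a human-written continuation and a model-generated one. The sketch below makes that concrete under stated assumptions: mean surprisal in bits as the uncertainty measure, the hard-coded token probabilities, and the function names are illustrative choices, and the paper's actual information-theoretic measure may differ.

```python
import math
from typing import List


def mean_surprisal(token_probs: List[float]) -> float:
    """Average negative log2-probability of the tokens (bits per token)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def uncertainty_gap(human_probs: List[float], model_probs: List[float]) -> float:
    """Positive when the human text is less predictable than the model text."""
    return mean_surprisal(human_probs) - mean_surprisal(model_probs)


# Made-up probabilities that a scoring model assigns to each token of a
# human-authored continuation and of a model-generated continuation.
human_probs = [0.25, 0.125, 0.5, 0.125]
model_probs = [0.5, 0.5, 0.25, 0.5]

gap = uncertainty_gap(human_probs, model_probs)
print(f"uncertainty gap: {gap:.2f} bits per token")
```

A positive gap in this toy setup corresponds to the direction reported above, with human writing showing higher uncertainty than model output.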

[44] Beyond Learning: A Training-Free Alternative to Model Adaptation

Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim

Main category: cs.CL

TL;DR: Language model transplantation: Transferring localized internal modules between models to immediately improve performance without training, demonstrating task-localized modularity in LLMs.

Details

Motivation: Language models sometimes underperform previous versions, and existing improvement methods are resource-intensive. The paper aims to find alternatives that enable immediate functional changes without additional training.

Method: 1) Identify modules with consistent local activation changes under inference workloads through activation-based analysis. 2) Transplant properly activated internal modules from one model to another. 3) Quantify relationship between transplant strength and performance improvement across different conditions and models.

Result: Cross-generation transplantation improved underperforming models up to 2x baseline with 100%+ gap recovery. Base-to-instruction-tuned transplantation improved models up to 2.33x baseline with up to 100% gap recovery. Shows meaningful capacity transfer through localized module implantation.

Conclusion: Provides empirical evidence for task-localized modularity in language models and introduces model transplantation as a new research area for immediate model improvement without training.

Abstract: Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.
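The two headline numbers, a multiple of the target baseline and a gap-based recovery percentage, measure different things, which is why recovery can exceed 100%. The sketch below uses an assumed definition of gap-based recovery (improvement over the target baseline divided by the donor-target gap). The abstract does not give the formula, and the scores are made up.

```python
def baseline_ratio(transplanted: float, target_baseline: float) -> float:
    """Score after transplantation as a multiple of the target's own baseline."""
    return transplanted / target_baseline


def gap_recovery(transplanted: float, target_baseline: float,
                 donor_baseline: float) -> float:
    """Fraction of the donor-target gap closed (assumed definition)."""
    gap = donor_baseline - target_baseline
    if gap == 0:
        raise ValueError("donor and target baselines are equal; gap is undefined")
    return (transplanted - target_baseline) / gap


# Made-up scores: the target starts at 20, the donor model scores 35,
# and the target reaches 40 after transplantation.
ratio = baseline_ratio(40.0, 20.0)
recovery = gap_recovery(40.0, 20.0, 35.0)
print(f"{ratio:.2f}x baseline, {recovery:.0%} gap recovery")
```

Under this definition, recovery above 100% means the transplanted model ends up scoring higher than the donor itself, as in the toy numbers here.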

[45] The Validity of Coreference-based Evaluations of Natural Language Understanding

Ian Porada

Main category: cs.CL

TL;DR: This thesis analyzes coreference evaluation methods in NLP, revealing validity issues and proposing new event plausibility tests, finding that while models improve on standard benchmarks, they lack human-like generalization.

Details

Motivation: The motivation is to refine understanding of what conclusions can be drawn from coreference-based evaluations by examining measurement validity issues and proposing better evaluation methods to assess true generalization capabilities.

Method: 1) Analysis of standard coreference evaluations to identify validity issues (contested definitions, convergent validity problems). 2) Development and implementation of novel evaluation focused on inferring relative plausibility of events as a key aspect of coreference resolution.

Result: Contemporary language models show strong performance on standard benchmarks (improving over baselines in certain domains) but remain sensitive to evaluation conditions and fail to generalize in human-like ways when contexts are slightly modified.

Conclusion: The work clarifies both strengths (improved accuracy on standard evaluations) and limitations (measurement validity weaknesses) of current NLP paradigm, suggesting directions for better evaluation methods and more genuinely generalizable systems.

Abstract: In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity, including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems’ ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks, improving over earlier baseline systems within certain domains and types of coreference, but remain sensitive to the evaluation conditions: they often fail to generalize in ways one would expect a human to be capable of when evaluation contexts are slightly modified. Taken together, these findings clarify both the strengths, such as improved accuracy over baselines on widely used evaluations, and the limitations of the current NLP paradigm, including weaknesses in measurement validity, and suggest directions for future work in developing better evaluation methods and more genuinely generalizable systems.

[46] Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications

Sanket Badhe, Deep Shah, Nehal Kathrotia

Main category: cs.CL

TL;DR: A comprehensive analysis of long-tail knowledge in LLMs, examining how infrequent, domain-specific, cultural, and temporal knowledge is lost or distorted during training and inference, with implications for fairness and accountability.

Details

Motivation: LLMs are trained on web-scale corpora with power-law distributions where most knowledge appears infrequently. While scaling improves average performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized and understood.

Method: Develops a structured taxonomy and analytical framework synthesizing prior work across four axes: 1) how long-tail knowledge is defined, 2) mechanisms by which it’s lost/distorted during training/inference, 3) technical interventions to mitigate failures, and 4) implications for fairness, accountability, transparency, and user trust.

Result: Provides a unifying conceptual framework for understanding long-tail knowledge representation in LLMs, examines how existing evaluation practices obscure tail behavior, and identifies open challenges related to privacy, sustainability, and governance.

Conclusion: The paper offers a comprehensive framework for analyzing long-tail knowledge in LLMs, highlighting the need for better understanding of rare but consequential failures and addressing challenges in privacy, sustainability, and governance for improved knowledge representation.

Abstract: Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which knowledge is highly long-tailed, with most of it appearing infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-tail knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce a structured analytical framework that synthesizes prior work across four complementary axes: how long-tail knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-tail knowledge representation. Taken together, this paper provides a unifying conceptual framework for understanding how long-tail knowledge is defined, lost, evaluated, and manifested in deployed language model systems.

[47] Are LLMs Ready to Replace Bangla Annotators?

Md. Najib Hasan, Touseef Hasan, Souvika Sarkar

Main category: cs.CL

TL;DR: LLMs show bias and instability as zero-shot annotators for Bangla hate speech, with smaller task-aligned models sometimes outperforming larger ones, highlighting limitations for sensitive low-resource language tasks.

Details

Motivation: To investigate the reliability of LLMs as automated annotators for sensitive tasks in low-resource languages, specifically examining bias and instability in their judgments for Bangla hate speech detection where human agreement is challenging and bias has serious consequences.

Method: Systematic benchmark of 17 LLMs using a unified evaluation framework for zero-shot annotation of Bangla hate speech, analyzing annotator bias and instability across different model scales and architectures.

Result: LLMs exhibit significant annotator bias and substantial instability in judgments. Surprisingly, increased model scale doesn’t guarantee better annotation quality - smaller, more task-aligned models often show more consistent behavior than larger counterparts.

Conclusion: Current LLMs have important limitations for sensitive annotation tasks in low-resource languages, requiring careful evaluation before deployment, as scale alone doesn’t ensure reliable annotation for identity-sensitive settings.

Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators, especially for low-resource and identity-sensitive settings, remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality: smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.

[48] Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation

Jonathan Mutal, Perla Al Almaoui, Simon Hengchen, Pierrette Bouillon

Main category: cs.CL

TL;DR: Aladdin-FTI system for generating and translating between multiple Arabic dialects, MSA, and English, addressing under-representation of Arabic dialects in NLP

Details

Motivation: Arabic dialects are under-represented in NLP due to non-standardization and high variability; LLMs offer opportunities to model Arabic as a pluricentric language rather than monolithic system

Method: Developed Aladdin-FTI system for the AMIYA shared task that supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, plus bidirectional translation between these dialects, MSA, and English

Result: System submitted to AMIYA shared task; code and trained model made publicly available

Conclusion: LLMs can help address the gap in Arabic dialect NLP by enabling pluricentric modeling of Arabic language variants

Abstract: Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model are publicly available.

[49] MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

Martin Hyben, Sebastian Kula, Jan Cegin, Jakub Simko, Ivan Srba, Robert Moro

Main category: cs.CL

TL;DR: MultiCW dataset for multilingual check-worthy claim detection with 16 languages, 7 domains, 2 styles, and benchmarks showing fine-tuned models outperform zero-shot LLMs

Details

Motivation: Limited automated support for detecting check-worthy claims in fact-checking despite LLMs reshaping media verification; need for balanced multilingual benchmarks to advance automated fact-checking

Method: Created MultiCW dataset with 123,722 samples across 16 languages, 7 domains, 2 styles; benchmarked 3 fine-tuned multilingual transformers against 15 commercial/open LLMs in zero-shot settings; also created OOD evaluation set in 4 additional languages

Result: Fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles

Conclusion: MultiCW provides rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and LLMs on check-worthy claim detection

Abstract: Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.

[50] MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, Alex Pentland

Main category: cs.CL

TL;DR: MemoryArena is a new benchmark for evaluating agent memory in realistic multi-session settings where memorization and action are tightly coupled, unlike existing benchmarks that test them in isolation.

Details

Motivation: Current evaluations of agents with memory assess memorization and action separately, failing to capture how memory guides future decisions in realistic settings where agents acquire memory through interaction and use it to solve future tasks.

Method: Introduces MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops, consisting of human-crafted agentic tasks with explicitly interdependent subtasks across domains like web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning.

Result: Reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in the agentic setting, exposing a gap in current evaluations for agents with memory.

Conclusion: MemoryArena provides a more realistic benchmark for evaluating agent memory that captures the tight coupling between memorization and action in multi-session environments, highlighting deficiencies in current evaluation approaches.

Abstract: Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.

[51] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

Main category: cs.CL

TL;DR: STING is an automated red-teaming framework for testing multi-turn misuse of LLM-based agents by constructing step-by-step illicit plans and using adaptive follow-ups with judge agents to track completion.

Details

Motivation: Existing agent misuse benchmarks only test single-prompt instructions, leaving a gap in measuring how agents help with harmful/illegal tasks over multiple turns in realistic deployment settings.

Method: STING constructs step-by-step illicit plans grounded in benign personas, iteratively probes target agents with adaptive follow-ups, and uses judge agents to track phase completion. It models multi-turn red-teaming as a time-to-first-jailbreak random variable.

Result: STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines across AgentHarm scenarios. Multilingual evaluations show attack success doesn’t consistently increase in lower-resource languages.

Conclusion: STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings where interactions are inherently multi-turn and often multilingual.

Abstract: LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
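
The time-to-first-jailbreak view can be illustrated with a small survival-analysis sketch. It is not the authors' code. It assumes each red-teaming run is logged as the turn at which the first jailbreak occurred, or None if none occurred within a fixed turn budget. The function names and the example data are hypothetical, and the restricted mean shown is the standard restricted mean survival time, which may differ from the paper's exact definition of Restricted Mean Jailbreak Discovery.

```python
from typing import List, Optional


def discovery_curve(first_jailbreak_turn: List[Optional[int]], budget: int) -> List[float]:
    """Estimate D[t] = P(first jailbreak happens at or before turn t) for t = 0..budget.

    Runs recorded as None are treated as censored at `budget` (no jailbreak seen).
    Because every run shares the same budget, this Kaplan-Meier-style product
    reduces to the empirical cumulative distribution of first-jailbreak turns.
    """
    at_risk = len(first_jailbreak_turn)  # runs not yet jailbroken
    survival = 1.0                       # P(no jailbreak so far)
    curve = [0.0]                        # nothing is discovered before turn 1
    for t in range(1, budget + 1):
        events = sum(1 for turn in first_jailbreak_turn if turn == t)
        if at_risk > 0:
            survival *= 1.0 - events / at_risk
        at_risk -= events
        curve.append(1.0 - survival)
    return curve


def restricted_mean_turns(curve: List[float]) -> float:
    """Restricted mean time-to-first-jailbreak: sum of survival S(t) for t = 0..budget-1.

    For discrete turns this equals the mean of min(T, budget).
    """
    return sum(1.0 - discovered for discovered in curve[:-1])


# Hypothetical log: four runs, one of which never jailbreaks within 3 turns.
runs = [1, 2, 2, None]
curve = discovery_curve(runs, budget=3)
print(curve, restricted_mean_turns(curve))
```

A lower restricted mean indicates that jailbreaks are discovered in fewer turns on average within the budget, so curves for different attack languages can be compared on the same scale.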

[52] Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents

Mohammad H. A. Monfared, Lucie Flek, Akbar Karimi

Main category: cs.CL

TL;DR: Agentic data augmentation method for Aspect-Based Sentiment Analysis using iterative generation and verification outperforms prompting-based baseline, especially for aspect term generation tasks.

Details

Motivation: The paper addresses the need for high-quality synthetic training data in Aspect-Based Sentiment Analysis (ABSA) by proposing an agentic approach that can generate better training examples than simple prompting methods.

Method: Developed an agentic data augmentation method using iterative generation and verification, compared against a prompting-based baseline using the same models and instructions. Evaluated across three ABSA subtasks, four SemEval datasets, and two encoder-decoder models (T5-Base and Tk-Instruct).

Result: Agentic augmentation outperforms raw prompting in label preservation, especially for tasks requiring aspect term generation. When combined with real data, agentic augmentation provides higher gains and consistently outperforms prompting-based generation. Benefits are most pronounced for T5-Base, helping it achieve comparable performance with Tk-Instruct.

Conclusion: Agentic data augmentation with iterative generation and verification produces higher quality synthetic training data than prompting-based methods, particularly benefiting less heavily pretrained models and tasks involving aspect term generation.

Abstract: We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve performance comparable to that of its counterpart.

[53] TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov

Main category: cs.CL

TL;DR: TabAgent replaces LLM-based decision components in agentic systems with compact textual-tabular classifiers to reduce latency and cost while maintaining performance.

Details

Motivation: Agentic systems using repeated LLM calls for closed-set decision tasks suffer from high latency and cost due to cumulative token usage and processing time. There's a need for more efficient alternatives that maintain performance while reducing computational overhead.

Method: TabAgent framework: (1) TabSchema extracts structured schema, state, and dependency features from execution trajectories; (2) TabSynth augments coverage with schema-aligned synthetic supervision; (3) TabHead uses lightweight classifier to score candidates, replacing generative LLM components.

Result: On AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by ~95% and inference cost by 85-91%. Generalizes to other agentic decision heads beyond tool shortlisting.

Conclusion: TabAgent establishes a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures, enabling more efficient agentic systems while preserving functionality.

Abstract: Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.
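
The idea of replacing a generative shortlisting call with a lightweight classifier can be sketched as follows. This is a minimal illustration, not the TabAgent implementation: the two features (a text-overlap flag and a "tool already used in this trajectory" flag), the tool names, the training rows, and the choice of plain logistic regression are all hypothetical stand-ins.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, steps: int = 500) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent (pure Python)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def shortlist(candidates: Dict[str, Sequence[float]], w: Sequence[float],
              b: float, k: int) -> List[str]:
    """Score every candidate with the classifier and keep the top-k names."""
    scores = {name: sigmoid(sum(wj * xj for wj, xj in zip(w, feats)) + b)
              for name, feats in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical training rows: [text_overlap, used_earlier] -> was the tool needed?
X = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)

tools = {"send_email": [1.0, 0.0], "play_music": [0.0, 1.0], "list_files": [0.0, 0.0]}
print(shortlist(tools, w, b, k=1))
```

Once trained, the scorer needs only a handful of arithmetic operations per candidate, which is the kind of saving a closed-set decision step can exploit when no free-form generation is required.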

[54] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models

Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas

Main category: cs.CL

TL;DR: IndicEval is a benchmarking platform that evaluates LLMs using real high-stakes exam questions from UPSC, JEE, and NEET in English and Hindi, assessing reasoning, domain knowledge, and bilingual adaptability through automated prompting strategies.

Details

Motivation: Current LLM evaluation frameworks lack real-world academic rigor and multilingual complexity. There's a need for benchmarks grounded in authentic examination standards rather than synthetic datasets to better measure reasoning, domain knowledge, and bilingual adaptability in educational contexts.

Method: IndicEval uses real examination questions from UPSC, JEE, and NEET across STEM and humanities domains in English and Hindi. It automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought prompting strategies, with modular architecture for integrating new models and languages.

Result: Experiments on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B show: 1) CoT prompting consistently improves reasoning accuracy across subjects and languages; 2) Significant cross-model performance disparities persist, especially in high-complexity exams; 3) Multilingual degradation remains critical with marked accuracy drops in Hindi vs English, particularly under Zero-Shot conditions.

Conclusion: IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable LLM evaluation in multilingual educational settings, highlighting persistent gaps in bilingual reasoning and domain transfer that need addressing for improved reasoning robustness and language adaptability.

Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.

[55] Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

Jenny Kunz

Main category: cs.CL

TL;DR: Study examines how training English language models on machine-translated text from 24 diverse source languages affects linguistic judgments and language modeling, finding source language properties systematically influence model behavior.

Details

Motivation: To understand how machine-translated data (translationese) affects language model training, particularly how translationese from different source languages shapes what models learn about linguistic acceptability and language modeling across domains.

Method: Trained small English language models on English text translated from 24 typologically and resource-diverse source languages, enabling systematic analysis of how source language and corpus properties influence model learning.

Result: Source language has clear impact: general perplexity is driven by lexical diversity of translated corpus, while grammatical performance strongly correlates with typological similarity to English when sufficient data is available.

Conclusion: Translationese systematically affects model training, with source language properties influencing different aspects of model behavior - lexical diversity affects general language modeling while typological similarity affects grammatical understanding.

Abstract: Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon is known as translationese, and it reflects both traces of the source language and characteristic properties of translation itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling for different domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is driven more by the lexical diversity of the translated corpus, while grammatical performance is strongly correlated with typological similarity to English, given enough data.
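
One simple way to quantify the lexical diversity of a corpus is the type-token ratio. The sketch below is only an illustration: the abstract does not say which diversity measure the paper uses, so the metric, the whitespace tokenization, and the function name are assumptions.

```python
def type_token_ratio(text: str) -> float:
    """Share of distinct word forms among all tokens (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical one-sentence sample: 4 distinct forms out of 5 tokens.
print(type_token_ratio("the cat saw the dog"))
```

Because this ratio falls as texts get longer, corpora should be compared on equally sized samples.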

Jonathan Cook, Diego Antognini, Martin Klissarov, Claudiu Musat, Edward Grefenstette

Main category: cs.CL

TL;DR: Training LLMs to actively solicit and learn from language feedback through social meta-learning, improving their ability to solve problems interactively across domains.

Details

Motivation: Current LLMs struggle to learn from corrective feedback in conversations, rarely proactively soliciting feedback even when faced with ambiguity, making dialogues feel static and lacking adaptive qualities of human conversation.

Method: Formulate social meta-learning (SML) as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, converting static tasks into interactive social learning problems.

Result: SML effectively teaches models to use conversation to solve problems they cannot solve in a single turn, with generalization across domains (math to coding and vice versa). Models trained on fully-specified problems become better at solving underspecified tasks, making fewer premature answer attempts and more likely to ask for needed information.

Conclusion: This work presents a scalable approach to developing AI systems that effectively learn from language feedback, enabling more adaptive and interactive conversational capabilities.

Abstract: Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans - the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.

[57] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Melkamu Abay Mersha, Jugal Kalita

Main category: cs.CL

TL;DR: CA-LIG is a unified hierarchical attribution framework for Transformers that computes layer-wise Integrated Gradients and fuses them with class-specific attention gradients to provide context-aware explanations of model decisions.

Details

Motivation: Current Transformer explainability methods have limitations: they rely on final-layer attributions, capture either local token-level or global attention patterns without unification, lack context-awareness of inter-token dependencies, and fail to capture how relevance evolves across layers and how structural components shape decision-making.

Method: The proposed Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients, producing signed, context-sensitive attribution maps that trace hierarchical relevance flow through layers.

Result: Evaluated across diverse tasks (sentiment analysis, document classification, hate speech detection, image classification) and transformer families (BERT, XLM-R, AfroLM, Vision Transformer), CA-LIG provides more faithful attributions, stronger sensitivity to contextual dependencies, and clearer, more semantically coherent visualizations than established methods.

Conclusion: CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both practical interpretability and conceptual understanding of deep neural models.

Abstract: Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
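
For readers unfamiliar with the building block, the sketch below shows plain Integrated Gradients on a toy model. It does not implement CA-LIG itself: there is no layer-wise computation and no fusion with attention gradients. The toy sigmoid model, its weights, and the function names are hypothetical. Integrated Gradients attributes to input i the quantity (x_i - x'_i) times the average gradient along the straight path from a baseline x' to the input x, approximated here with a midpoint Riemann sum.

```python
import math
from typing import Callable, List, Sequence

WEIGHTS = [2.0, -1.0, 0.5]  # hypothetical toy model parameters


def toy_model(x: Sequence[float]) -> float:
    """Scalar 'class score': sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(WEIGHTS, x))))


def toy_grad(x: Sequence[float]) -> List[float]:
    """Analytic gradient of toy_model with respect to its input."""
    s = toy_model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]


def integrated_gradients(grad_fn: Callable[[Sequence[float]], List[float]],
                         x: Sequence[float], baseline: Sequence[float],
                         steps: int = 200) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum over the path."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x, baseline = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
# Completeness: attributions sum (approximately) to f(x) - f(baseline).
print(attributions, sum(attributions), toy_model(x) - toy_model(baseline))
```

The attributions are signed: the input with a negative weight receives a negative score, which mirrors the "supportive and opposing evidence" reading described in the abstract.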

[58] From Growing to Looping: A Unified View of Iterative Computation in LLMs

Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer

Main category: cs.CL

TL;DR: Looping and depth growing are unified as complementary methods for inducing iterative computation to improve reasoning in models.

Details

Motivation: Both looping (reusing layers) and depth growing (training shallow-to-deep) have been linked to stronger reasoning, but their relationship and underlying mechanisms remain unclear. The paper aims to unify these techniques and understand their shared benefits.

Method: Mechanistic analysis showing convergent depth-wise signatures in both approaches, including increased reliance on late layers and recurring patterns aligned with looped/grown blocks. Experimental investigation of adaptability and composability, applying inference-time looping to depth-grown models and testing with more in-context examples or fine-tuning data.

Result: Looped and depth-grown models show similar depth-wise signatures supporting iterative computation. Inference-time looping on depth-grown models improves accuracy on some reasoning primitives by up to 2×. Both approaches adapt better than the baseline when given more in-context examples or supervised fine-tuning data. Depth-grown models benefit most from math-heavy cooldown mixtures, further boosted by adapting a middle block to loop.

Conclusion: Looping and depth growing are complementary, practical methods for inducing and scaling iterative computation to improve reasoning, with potential for composition and adaptation.

Abstract: Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
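
Inference-time looping as described above amounts to re-applying a contiguous middle block of layers several times. The sketch below illustrates only that control flow; the toy layers are hypothetical scalar functions, and the helper `run_with_looping` is a name chosen for this illustration, not taken from the paper.

```python
def run_with_looping(layers, x, loop_start, loop_end, n_loops):
    """Apply layers in order, repeating layers[loop_start:loop_end] n_loops times."""
    for layer in layers[:loop_start]:
        x = layer(x)
    for _ in range(n_loops):
        for layer in layers[loop_start:loop_end]:
            x = layer(x)
    for layer in layers[loop_end:]:
        x = layer(x)
    return x

# Hypothetical toy "layers" acting on a single float.
layers = [
    lambda h: h + 1.0,
    lambda h: h * 0.5,
    lambda h: h + 2.0,
    lambda h: h * 3.0,
]

baseline_out = run_with_looping(layers, 1.0, 1, 3, n_loops=1)  # ordinary forward pass
looped_out = run_with_looping(layers, 1.0, 1, 3, n_loops=2)    # middle block applied twice
```

With `n_loops=1` the function reduces to a normal forward pass, so looping can be switched on at inference time without changing the weights.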

[59] Optimizing Soft Prompt Tuning via Structural Evolution

Zhenzhen Huang, Chaoning Zhang, Haoyu Bian, Songbo Zhang, Chi-lok Andy Tai, Jiaquan Zhang, Caiyan Qin, Jingjing Qu, Yalan Ye, Yang Yang, Heng Tao Shen

Main category: cs.CL

TL;DR: This paper proposes a topological morphological evolution method for soft prompt tuning that uses persistent homology to quantify structural representations and optimize prompt learning through a topological loss function.

Details

Motivation: Soft prompt tuning lacks interpretability due to high-dimensional implicit representations without explicit semantics or traceable training behaviors. The authors aim to address this limitation by providing structural and topological insights into soft prompt optimization.

Method: The method employs persistent homology from topological data analysis to quantify structural representations of soft prompts in continuous parameter space and their evolution during training. Based on empirical observations that topologically stable and compact prompts perform better, they construct a Topological Soft Prompt Loss (TSLoss) that guides models to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy.

Result: Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance. The method provides interpretable insights into soft prompt optimization from structural and topological perspectives.

Conclusion: The proposed topological approach offers an interpretable method to understand and optimize soft prompt tuning, addressing the limitations of traditional soft prompts while improving performance and convergence speed.

Abstract: Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings. However, soft prompts rely on high-dimensional, implicit representations and lack explicit semantics and traceable training behaviors, which limits their interpretability. To address this limitation, we propose a soft prompt tuning optimization method based on topological morphological evolution. Specifically, we employ persistent homology from topological data analysis (TDA) to quantify the structural representations of soft prompts in continuous parameter space and their training process evolution. Quantitative analysis shows that topologically stable and compact soft prompts achieve better downstream performance. Based on this empirical observation, we construct a loss function for optimizing soft prompt tuning, termed Topological Soft Prompt Loss (TSLoss). TSLoss guides the model to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy. Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance, providing an interpretable method to understand and optimize soft prompt tuning from structural and topological perspectives.
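
To make the persistent-homology step more concrete, here is a minimal sketch of 0-dimensional persistence on a point cloud. It relies on the standard fact that, for a distance-based filtration, connected components merge exactly at the edge lengths of a minimum spanning tree. Treating prompt token embeddings as the points is an assumption for illustration, and this is not the paper's TSLoss, whose exact form the abstract does not give.

```python
import math

def h0_death_times(points):
    """Death times of 0-dim features: MST edge lengths found with Kruskal's algorithm."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj   # two components merge at scale d
            deaths.append(d)
    return deaths

# Hypothetical 2-D "prompt vectors": a tight cluster plus one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
deaths = h0_death_times(points)
total_persistence = sum(deaths)
```

A compact point cloud yields small death times, while an outlier produces one large value, so a summary such as `total_persistence` is one plausible (assumed) proxy for the compactness the abstract refers to.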

[60] Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić

Main category: cs.CL

TL;DR: ParlaCAP is a large-scale parliamentary speech dataset with policy topic annotations, speaker metadata, and sentiment analysis, created using LLM-assisted annotation for scalable domain-specific classifier training.

Details

Motivation: To enable comparative research on political attention and representation across European countries by creating a large-scale, multilingual parliamentary speech dataset with standardized policy topic annotations, overcoming limitations of manual annotation and out-of-domain classifiers.

Method: Applied Comparative Agendas Project (CAP) schema to ParlaMint corpus (8M+ speeches from 28 European parliaments), using teacher-student framework: high-performing LLM annotates training data, multilingual encoder model fine-tuned on these annotations for scalable classification.

Result: LLM-human agreement comparable to human inter-annotator agreement; resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data; dataset includes rich metadata and sentiment predictions from ParlaSent model.

Conclusion: ParlaCAP enables scalable, accurate policy topic classification for parliamentary speech analysis, supporting comparative research on political attention, sentiment patterns, and representation across European countries through three demonstrated use cases.

Abstract: This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.

[61] Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak, Daryl Hedley, Jorge Dias, Chris Shaw, Ruth Schäfer, René F. Kizilcec

Main category: cs.CL

TL;DR: MathEd-PII: A benchmark dataset and methods for domain-aware PII detection in math tutoring dialogues that preserves educational content while identifying sensitive information.

Details

Motivation: Large-scale sharing of dialogue-based educational data is crucial for teaching and learning research, but generic PII detection systems over-redact math expressions that resemble structured identifiers (dates, IDs), reducing dataset utility. Need domain-aware de-identification for math tutoring transcripts.

Method: Created MathEd-PII benchmark dataset (1,000 tutoring sessions, 115,620 messages) using human-in-the-loop LLM workflow. Used density-based segmentation to analyze false positives. Compared four detection strategies: Presidio baseline, LLM-based approaches with basic, math-aware, and segment-aware prompting.

Result: Math-aware prompting substantially outperformed baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives. False PII redactions disproportionately concentrated in math-dense regions, confirming numeric ambiguity as key failure mode.

Conclusion: Utility-preserving de-identification for tutoring data requires domain-aware modeling. MathEd-PII provides benchmark and evidence that domain context is essential for balancing privacy protection with educational content preservation.

Abstract: Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the “numeric ambiguity” problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.

[62] CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes

Miguel Marques, Ana Luísa Fernandes, Ana Filipa Pacheco, Rute Rebouças, Inês Cantante, José Isidro, Luís Filipe Cunha, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano, Ricardo Campos

Main category: cs.CL

TL;DR: CitiLink-Summ is a new corpus of European Portuguese municipal meeting minutes with 2,322 manually written summaries, providing the first benchmark for automatic summarization of complex administrative texts in a low-resource language.

Details

Motivation: Municipal meeting minutes are lengthy and difficult for citizens to navigate, but automatic summarization research for this domain is unexplored, especially in low-resource languages like European Portuguese, due to lack of high-quality datasets.

Method: Created CitiLink-Summ corpus with 100 documents and 2,322 manually crafted summaries, then established baselines using state-of-the-art generative models (BART, PRIMERA) and LLMs, evaluated with lexical (ROUGE, BLEU, METEOR) and semantic (BERTScore) metrics.

Result: Provides the first benchmark dataset for municipal-domain summarization in European Portuguese, enabling development and evaluation of summarization models for complex administrative texts in a low-resource language setting.

Conclusion: CitiLink-Summ addresses the data scarcity problem for municipal meeting minutes summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts and improving citizen access to government information.

Abstract: Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet their content is often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing concise summaries for each discussion subject. Despite its potential, research on summarizing discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes, comprising 100 documents and 2,322 manually written summaries, each corresponding to a distinct discussion subject. Leveraging this dataset, we establish baseline results for automatic summarization in this domain, employing state-of-the-art generative models (e.g., BART, PRIMERA) as well as large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.
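
As a rough illustration of the lexical metrics mentioned above, the following sketch computes a simplified ROUGE-1 F1 from clipped unigram overlap. It assumes whitespace tokenization and lowercasing only; reference implementations add options such as stemming, so scores from this sketch are not comparable to those reported in the paper. The example sentences are invented.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the council approved the budget",
    "the council approved the annual budget",
)
```

Here every candidate token appears in the reference, so precision is 1 and recall is 5/6.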

[63] ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

Antoine Chaffin, Luca Arnaboldi, Amélie Chatelain, Florent Krzakala

Main category: cs.CL

TL;DR: ColBERT-Zero demonstrates that large-scale multi-vector pre-training outperforms single-vector models and achieves state-of-the-art results without closed data, showing the importance of full pre-training over knowledge distillation alone.

Details

Motivation: Current multi-vector models rely on knowledge distillation from single-vector models, limiting their potential. The paper investigates whether large-scale multi-vector pre-training can yield stronger models than this distillation approach.

Method: Pre-trained a multi-vector model (ColBERT-Zero) from scratch on public data using large-scale pre-training, compared it against models using knowledge distillation, and analyzed different training strategies including supervised pre-training phases.

Result: ColBERT-Zero outperforms both GTE-ModernColBERT and its base model GTE-ModernBERT, setting new state-of-the-art for models of its size. Full pre-training significantly outperforms knowledge distillation alone, though supervised pre-training can approximate full pre-training results.

Conclusion: Large-scale multi-vector pre-training yields superior models compared to knowledge distillation approaches. While full pre-training is optimal, supervised pre-training can achieve close performance while avoiding costly unsupervised phases. Alignment between fine-tuning and pre-training setups is crucial.

Abstract: Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting a new state of the art for models of this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand makes it possible to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as the code used to train them.
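
For readers unfamiliar with multi-vector retrieval, the sketch below shows the late-interaction (MaxSim) score that ColBERT-style models use: each query token vector is matched with its most similar document token vector and the maxima are summed. The two-dimensional vectors are invented toy values; real models use learned, normalized token embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Hypothetical token embeddings: one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.6, 0.8], [0.8, 0.6]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

A single-vector model would instead compare one pooled vector per text; keeping per-token vectors lets `doc_a` score higher because it contains an exact match for each query token.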

[64] Who can we trust? LLM-as-a-jury for Comparative Assessment

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

Main category: cs.CL

TL;DR: BT-sigma: A judge-aware Bradley-Terry model extension that jointly infers item rankings and judge reliability from LLM pairwise comparisons without human supervision.

Details

Motivation: LLMs are increasingly used as automatic evaluators for NLG assessment via pairwise comparisons, but they vary substantially in performance across tasks and aspects, with biased and inconsistent judgment probabilities. Existing approaches typically use single judges or aggregate multiple judges assuming equal reliability, and human-labeled supervision for calibration is often unavailable.

Method: Proposes BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone, without requiring human supervision.

Result: Experiments on benchmark NLG evaluation datasets show BT-sigma consistently outperforms averaging-based aggregation methods. The learned discriminator strongly correlates with independent measures of LLM judgment cycle consistency, and BT-sigma can be interpreted as an unsupervised calibration mechanism.

Conclusion: BT-sigma provides an effective approach for aggregating LLM judges’ pairwise comparisons by modeling judge reliability, addressing inconsistencies in LLM comparison probabilities without requiring human supervision.

Abstract: Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
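
The abstract does not spell out how the per-judge discriminator enters the model, so the following is only one plausible reading: the standard Bradley-Terry probability, sigmoid(s_i - s_j), with the skill difference scaled by a judge-specific factor. The names `gamma_k`, `win_probability`, and `log_likelihood`, and the scaling form itself, are assumptions made for illustration.

```python
import math

def win_probability(s_i, s_j, gamma_k):
    """P(judge k prefers item i over item j) under an ASSUMED scaled Bradley-Terry form."""
    return 1.0 / (1.0 + math.exp(-gamma_k * (s_i - s_j)))

def log_likelihood(skills, gammas, comparisons):
    """Cross-entropy log-likelihood of soft judge preferences.

    comparisons: list of (judge, i, j, p_obs), where p_obs in [0, 1] is the
    judge's reported probability that item i beats item j.
    """
    total = 0.0
    for k, i, j, p_obs in comparisons:
        p = win_probability(skills[i], skills[j], gammas[k])
        total += p_obs * math.log(p) + (1.0 - p_obs) * math.log(1.0 - p)
    return total

skills = [1.0, 0.0]      # hypothetical latent item qualities
gammas = [2.0, 0.2]      # judge 0 is sharp, judge 1 is close to uninformative
sharp = win_probability(skills[0], skills[1], gammas[0])
noisy = win_probability(skills[0], skills[1], gammas[1])
ll = log_likelihood(skills, gammas, [(0, 0, 1, 0.9), (1, 0, 1, 0.55)])
```

Under this assumed form, a judge with a small factor produces probabilities near 0.5 regardless of item quality, so maximizing the likelihood jointly over `skills` and `gammas` would down-weight that judge without any human labels.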

[65] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

Adib Sakhawat, Fardeen Sadab

Main category: cs.CL

TL;DR: AREG benchmark evaluates LLM persuasion and resistance in adversarial financial negotiations, finding these capabilities are weakly correlated and models show systematic defensive advantages.

Details

Motivation: Current LLM social intelligence evaluation focuses on static text generation rather than dynamic adversarial interactions. There's a need to assess both persuasion and resistance capabilities within a single interactional framework to understand asymmetric behavioral vulnerabilities.

Method: Introduces Adversarial Resource Extraction Game (AREG) - a multi-turn, zero-sum negotiation benchmark over financial resources. Uses round-robin tournament across frontier models to jointly evaluate offensive (persuasion) and defensive (resistance) capabilities. Includes linguistic analysis of interaction strategies.

Result: Persuasion and resistance capabilities are weakly correlated (ρ=0.33) and empirically dissociated. Resistance scores exceed persuasion scores across all models, showing systematic defensive advantage. Incremental commitment-seeking strategies correlate with extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal.

Conclusion: Social influence in LLMs is not monolithic; evaluation frameworks focusing only on persuasion may overlook asymmetric behavioral vulnerabilities. Interaction structure plays central role in outcomes, and defensive capabilities generally outperform offensive ones in adversarial dialogue settings.

Abstract: Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated (ρ = 0.33) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.

[66] Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Subrit Dikshit

Main category: cs.CL

TL;DR: Quecto-V1 is a small language model (124M parameters) specifically trained on Indian legal documents, quantized to under 150MB for offline CPU deployment, achieving high accuracy in legal retrieval tasks.

Details

Motivation: Address the "resource divide" in legal AI where state-of-the-art models require massive resources and cloud infrastructure, making them inaccessible in resource-constrained environments and posing data sovereignty risks for sensitive legal applications.

Method: Built a custom GPT-2 architecture (124M parameters) trained from scratch exclusively on Indian legal corpus (IPC, CrPC, Constitution). Applied post-training 8-bit quantization (GGUF format) to compress model to under 150MB for offline CPU deployment.

Result: Quecto-V1 outperforms general-purpose SLMs in domain-specific exact match tasks for legal retrieval, with 8-bit quantization achieving 74% size reduction with less than 3.5% accuracy degradation compared to full-precision baseline.

Conclusion: Domain-specific training combined with aggressive quantization provides a viable, privacy-preserving alternative to large cloud models for specialized domains like law, enabling offline deployment on consumer hardware.

Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a “resource divide.” State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes “lexical density” within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
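
A minimal sketch of symmetric 8-bit quantization may clarify where the size reduction comes from. It quantizes one block of weights with a single shared scale; the block contents and helper names are hypothetical, and the exact GGUF storage layout is not reproduced here. The size arithmetic assumes a 32-bit full-precision baseline, which the abstract does not state.

```python
def quantize_block(weights):
    """Symmetric 8-bit quantization: returns (scale, integers in [-127, 127])."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate float weights from the integers and the shared scale."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.0, 0.333]     # hypothetical weight block
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Rough size arithmetic for 124M parameters (ASSUMES a 32-bit baseline).
params = 124_000_000
fp32_mb = params * 4 / 1e6
int8_mb = params * 1 / 1e6
```

One byte per weight gives about 124 MB before the per-block scale overhead, which is consistent with the reported footprint of under 150 MB; the rounding error per weight is bounded by half the scale.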

[67] Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai

Main category: cs.CL

TL;DR: Proposes Multi-Lingual Consistency (MLC) loss for efficient multilingual safety alignment of LLMs without requiring extensive target-language supervision.

Details

Motivation: Current multilingual safety alignment methods require substantial resources - either large-scale high-quality supervision in target languages or pairwise alignment with high-resource languages, limiting scalability for widespread LLM deployment across linguistic communities.

Method: Introduces plug-and-play Multi-Lingual Consistency (MLC) loss that improves collinearity between multilingual representation vectors, encouraging directional consistency at semantic level. Can be integrated into existing monolingual alignment pipelines using only multilingual prompt variants without additional response-level supervision.

Result: Method validated across different model architectures and alignment paradigms, demonstrating effectiveness in enhancing multilingual safety with limited impact on general model utility. Shows improved cross-lingual generalization across languages and tasks.

Conclusion: MLC loss provides practical solution for multilingual consistency alignment under limited supervision, enabling simultaneous alignment across multiple languages without extensive target-language resources.

Abstract: The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
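
The abstract describes the MLC loss only as improving collinearity between multilingual representation vectors. One simple way to express that idea, offered purely as an assumed illustration and not as the paper's formulation, is to penalize one minus the mean pairwise cosine similarity of the representations of the same prompt in different languages.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def consistency_loss(reps):
    """ASSUMED stand-in loss: 1 - mean pairwise cosine similarity."""
    pairs = [(i, j) for i in range(len(reps)) for j in range(i + 1, len(reps))]
    mean_cos = sum(cosine(reps[i], reps[j]) for i, j in pairs) / len(pairs)
    return 1.0 - mean_cos

# Hypothetical representations of one prompt in three languages.
collinear = [[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]

loss_collinear = consistency_loss(collinear)
loss_orthogonal = consistency_loss(orthogonal)
```

The loss is near zero when all vectors point in the same direction, whatever their norms, and it needs only prompt variants as input, which matches the abstract's point about avoiding response-level supervision.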

[68] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

Main category: cs.CL

TL;DR: CTA framework enables LLMs to explicitly reason about cost-uncertainty tradeoffs for optimal environment exploration in sequential decision-making tasks like information retrieval and coding.

Details

Motivation: LLMs need to interact with environments to solve complex problems, requiring them to balance exploration costs against uncertainty. Current LLMs don't explicitly reason about these tradeoffs, leading to suboptimal decision-making in tasks like programming where testing has costs but prevents errors.

Method: Introduces Calibrate-Then-Act (CTA) framework that formalizes tasks as sequential decision-making problems under uncertainty. Provides LLMs with prior knowledge about latent environment states and cost-uncertainty tradeoffs, enabling explicit reasoning about when to stop exploring and commit to answers.

Result: CTA helps LLM agents discover more optimal decision-making strategies in information-seeking QA and simplified coding tasks. The improvement persists even under RL training of both baseline and CTA approaches.

Conclusion: Making cost-benefit tradeoffs explicit through the CTA framework enables LLMs to perform more optimal environment exploration in sequential decision-making scenarios, improving performance on tasks requiring interaction with external environments.

Abstract: LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
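
The cost-uncertainty tradeoff in the programming example can be written as a small expected-cost comparison. The sketch below is an illustrative decision rule only, not the CTA framework itself; the probabilities and costs are invented numbers, and the function names are hypothetical.

```python
def expected_cost_commit(p_correct, mistake_cost):
    """Expected cost of committing to an answer that is right with probability p_correct."""
    return (1.0 - p_correct) * mistake_cost

def should_explore(p_correct, p_correct_after, explore_cost, mistake_cost):
    """Explore (e.g. write a test) when doing so lowers the expected total cost."""
    commit_now = expected_cost_commit(p_correct, mistake_cost)
    explore_then_commit = explore_cost + expected_cost_commit(p_correct_after, mistake_cost)
    return explore_then_commit < commit_now

# Uncertain about the code: testing costs 1 but a mistake costs 10.
uncertain = should_explore(p_correct=0.6, p_correct_after=0.95,
                           explore_cost=1.0, mistake_cost=10.0)

# Already confident: the test costs more than the risk it removes.
confident = should_explore(p_correct=0.98, p_correct_after=0.99,
                           explore_cost=1.0, mistake_cost=10.0)
```

The rule only works if `p_correct` is well calibrated, which is presumably why the framework passes the agent a prior over the latent environment state before it acts.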

[69] Reinforced Fast Weights with Next-Sequence Prediction

Hee Seung Hwang, Xindi Wu, Sanghyuk Chun, Olga Russakovsky

Main category: cs.CL

TL;DR: REFINE is a reinforcement learning framework that trains fast weight models using next-sequence prediction instead of next-token prediction to improve long-context modeling by capturing semantic coherence across multiple tokens.

Details

Motivation: Fast weight architectures have potential for long-context modeling but are limited by the next-token prediction training paradigm, which ignores semantic coherence across multiple tokens and leads to suboptimal representations of long-range dependencies.

Method: REFINE uses reinforcement learning with a next-sequence prediction (NSP) objective: it selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes with group relative policy optimization (GRPO). It is applicable during mid-training, post-training, and test-time training.

Result: Experiments on LaCT-760M and DeltaNet-1.3B show REFINE consistently outperforms supervised fine-tuning with next-token prediction across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench.

Conclusion: REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures by addressing limitations of next-token prediction training.

Abstract: Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
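The first REFINE step, selecting token positions by prediction entropy, can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not say how entropy is turned into a selection, so the sketch assumes the highest-entropy positions are chosen, and the names `entropy` and `select_positions` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(step_probs, k):
    """Return indices of the k positions with the highest prediction entropy.

    Assumption for illustration: "informative" means highest entropy.
    """
    ranked = sorted(
        range(len(step_probs)),
        key=lambda i: entropy(step_probs[i]),
        reverse=True,
    )
    # Return the chosen positions in their original sequence order.
    return sorted(ranked[:k])


# Toy next-token distributions over a 3-token vocabulary at 4 positions.
step_probs = [
    [1.0, 0.0, 0.0],        # fully confident
    [0.5, 0.5, 0.0],        # two-way uncertainty
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
    [0.9, 0.1, 0.0],        # mostly confident
]
print(select_positions(step_probs, k=2))
```

Rollout generation, reward assignment, and the GRPO update are outside the scope of this sketch.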

[70] Evaluating Language Model Agency through Negotiations

Tim R. Davidson, Veniamin Veselovsky, Martin Josifoski, Maxime Peyrard, Antoine Bosselut, Michal Kosinski, Robert West

Main category: cs.CL

TL;DR: The paper introduces negotiation games as a new benchmark for evaluating language model agency, testing six models in self-play and cross-play settings.

Details

Motivation: To create better evaluation methods for language model agency that reflect real-world use cases, address shortcomings of existing benchmarks, and enable study of multi-turn, cross-model interactions.

Method: Uses negotiation games as an evaluation framework, testing six widely used, publicly accessible LMs in both self-play and cross-play settings, with varying game complexity.

Result: Only closed-source models could complete the tasks; cooperative bargaining games were the most challenging; even powerful models sometimes lose to weaker opponents.

Conclusion: Negotiation games provide a valuable framework for evaluating LM agency, revealing important differences in model capabilities and alignment.

Abstract: We introduce an approach to evaluate language model (LM) agency using negotiation games. This approach better reflects real-world use cases and addresses some of the shortcomings of alternative LM benchmarks. Negotiation games enable us to study multi-turn and cross-model interactions, modulate complexity, and side-step accidental evaluation data leakage. We use our approach to test six widely used and publicly accessible LMs, evaluating performance and alignment in both self-play and cross-play settings. Noteworthy findings include: (i) only closed-source models tested here were able to complete these tasks; (ii) cooperative bargaining games proved to be most challenging to the models; and (iii) even the most powerful models sometimes “lose” to weaker opponents.

[71] Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

Chantal Shaib, Venkata S. Govindarajan, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

Main category: cs.CL

TL;DR: The paper introduces ‘diversity’, an open-source Python package for measuring text diversity in LLM outputs, finding that fast compression algorithms capture similar information to slow n-gram overlap scores, and recommends a combination of compression ratios, self-repetition, Self-BLEU, and BERTScore for comprehensive diversity assessment.

Details

Motivation: There's no standard method to measure text diversity in LLM outputs, making it difficult to assess quality and utility. Templated responses and canned answers are noticeable but hard to visualize across large corpora, creating a need for standardized diversity measurement tools.

Method: The authors empirically investigate convergent validity of existing diversity scores across English texts, develop an open-source Python package called ‘diversity’ for measuring and extracting repetition, and build an interactive platform for exploring text repetition. They analyze various metrics including compression algorithms, n-gram overlap homogeneity scores, Self-BLEU, and BERTScore.

Result: Fast compression algorithms capture similar information to slow-to-compute n-gram overlap homogeneity scores. A combination of four measures (compression ratios, self-repetition of long n-grams, Self-BLEU, and BERTScore) is sufficient to report, as these measures have low mutual correlation.

Conclusion: The paper provides standardized tools for measuring text diversity in LLM outputs, recommending specific complementary metrics that offer comprehensive assessment while being computationally efficient.

Abstract: The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and “canned” responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures – compression ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore – is sufficient to report, as they have low mutual correlation with each other.
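A compression ratio of the kind the abstract mentions can be approximated with the standard library. The sketch below is an illustration only and does not use or reproduce the API of the authors' diversity package: the function name `compression_ratio` and the choice of zlib as the compressor are assumptions.

```python
import zlib


def compression_ratio(texts):
    """Size of the concatenated texts divided by their zlib-compressed size.

    A higher ratio means the texts compress well, i.e. they are more repetitive.
    """
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


# Fifty identical outputs versus fifty outputs that differ from one another.
repetitive = ["the cat sat on the mat"] * 50
varied = [str(i ** 3) + " item" for i in range(50)]
print(compression_ratio(repetitive) > compression_ratio(varied))
```

Because compression runs in roughly linear time over the corpus, a score like this is cheap compared with pairwise n-gram overlap measures, which is the tradeoff the abstract highlights.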

[72] When Stereotypes GTG: The Impact of Predictive Text Suggestions on Gender Bias in Human-AI Co-Writing

Connor Baumler, Hal Daumé

Main category: cs.CL

TL;DR: Study examines how AI language model suggestions (stereotypical vs anti-stereotypical) influence human writing in co-writing scenarios, finding anti-stereotypical suggestions increase anti-stereotypical stories but pro-stereotypical narratives still dominate.

Details

Motivation: AI language models have been shown to replicate and amplify social biases from training data, leading to normatively inappropriate stereotypical associations. Little is known about how this behavior impacts writing produced by people using these systems in human-AI collaboration scenarios.

Method: Researchers measured the impact of stereotypes or anti-stereotypes in English single-word LM predictive text suggestions on stories people write using those tools in a co-writing scenario. They conducted experiments with n=414 participants to analyze how different types of suggestions influence narrative outcomes.

Result: LM suggestions that challenge stereotypes sometimes lead to a significantly increased rate of anti-stereotypical co-written stories. However, despite this increased rate of anti-stereotypical stories, pro-stereotypical narratives still dominated the co-written stories overall.

Conclusion: Technical debiasing of language models is only a partially effective strategy to alleviate harms from human-AI collaboration, as pro-stereotypical narratives persist even when anti-stereotypical suggestions are provided. More comprehensive approaches are needed to address bias in human-AI writing systems.

Abstract: AI-based systems such as language models have been shown to replicate and even amplify social biases reflected in their training data. Among other questionable behaviors, this can lead to AI-generated text, and text suggestions, that contain normatively inappropriate stereotypical associations. Little is known, however, about how this behavior impacts the writing produced by people using these systems. We address this gap by measuring how much impact stereotypes or anti-stereotypes in English single-word LM predictive text suggestions have on the stories that people write using those tools in a co-writing scenario. We find ($n=414$) that LM suggestions that challenge stereotypes sometimes lead to a significantly increased rate of anti-stereotypical co-written stories. However, despite this increased rate of anti-stereotypical stories, pro-stereotypical narratives still dominated the co-written stories, demonstrating that technical debiasing is only a partially effective strategy to alleviate harms from human-AI collaboration.

[73] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes

Zhanliang Wang, Da Wu, Quan Nguyen, Kai Wang

Main category: cs.CL

TL;DR: Two methods combining Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) improve LLM performance for rare disease gene prioritization from unstructured clinical notes.

Details

Motivation: LLMs struggle with phenotype-driven gene prioritization for rare diseases, especially when dealing with unstructured clinical notes rather than standardized HPO terms. Real-world clinical settings require models to work with messy, unstructured data for domain-specific diagnostic tasks.

Method: Introduced RAG-driven CoT and CoT-driven RAG methods. RAG-driven CoT uses retrieval first to anchor reasoning in domain evidence, while CoT-driven RAG reasons first then retrieves. Uses a five-question CoT protocol mimicking expert reasoning, retrieving from HPO and OMIM databases.

Result: Recent models (Llama 3.3-70B-Instruct, DeepSeek-R1-Distill-Llama-70B) outperform earlier versions. Both RAG-driven CoT and CoT-driven RAG outperform foundation models alone, achieving >40% top-10 gene accuracy on Phenopacket notes with DeepSeek backbone. RAG-driven CoT works better for high-quality notes, CoT-driven RAG for lengthy/noisy notes.

Conclusion: Combining CoT and RAG significantly improves LLM performance for rare disease gene prioritization from clinical notes, with different methods suited to different note quality levels.

Abstract: Background: Several studies show that large language models (LLMs) struggle with phenotype-driven gene prioritization for rare diseases. These studies typically use Human Phenotype Ontology (HPO) terms to prompt foundation models like GPT and LLaMA to predict candidate genes. However, in real-world settings, foundation models are not optimized for domain-specific tasks like clinical diagnosis, yet inputs are unstructured clinical notes rather than standardized terms. How LLMs can be instructed to predict candidate genes or disease diagnosis from unstructured clinical notes remains a major challenge. Methods: We introduce RAG-driven CoT and CoT-driven RAG, two methods that combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to analyze clinical notes. A five-question CoT protocol mimics expert reasoning, while RAG retrieves data from sources like HPO and OMIM (Online Mendelian Inheritance in Man). We evaluated these approaches on rare disease datasets, including 5,980 Phenopacket-derived notes, 255 literature-based narratives, and 220 in-house clinical notes from Children’s Hospital of Philadelphia. Results: We found that recent foundation models, including Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B, outperformed earlier versions such as Llama 2 and GPT-3.5. We also showed that RAG-driven CoT and CoT-driven RAG both outperform foundation models in candidate gene prioritization from clinical notes; in particular, both methods with a DeepSeek backbone resulted in a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence, while CoT-driven RAG has an advantage when processing lengthy and noisy notes.
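The top-10 gene accuracy reported above is a standard top-k metric. The sketch below shows one common way to compute it; it is not taken from the paper, and the function name `top_k_accuracy` and the example gene lists are illustrative assumptions rather than study data.

```python
def top_k_accuracy(ranked_predictions, true_genes, k=10):
    """Fraction of cases whose true gene appears among the first k candidates.

    ranked_predictions: one ranked candidate list per clinical note.
    true_genes: the causal gene for each note, in the same order.
    """
    if not true_genes:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_genes)
        if truth in ranked[:k]
    )
    return hits / len(true_genes)


# Made-up rankings for three notes (not data from the study).
predictions = [["FBN1", "TGFBR2"], ["CFTR"], ["BRCA1", "TP53"]]
truth = ["TGFBR2", "SCN1A", "BRCA1"]
print(top_k_accuracy(predictions, truth, k=1))
print(top_k_accuracy(predictions, truth, k=2))
```

Raising k can only keep or increase the score, so accuracies reported at different k are not directly comparable.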

[74] m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou

Main category: cs.CL

TL;DR: Test-time scaling enhances medical reasoning in LLMs, with optimal token budget of 4K; knowledge limitations, not reasoning depth, are the main bottleneck.

Details

Motivation: To investigate whether test-time scaling, which has proven effective for mathematical reasoning, can similarly enhance medical reasoning in LLMs, given the fundamental differences between medical and mathematical domains in knowledge representation and decision-making processes.

Method: Proposes m1, a simple test-time scaling approach that increases model’s medical reasoning capability at inference. Evaluates across diverse medical tasks, analyzes optimal reasoning token budgets, examines budget forcing techniques, and identifies knowledge limitations through case-by-case analysis.

Result: Test-time scaling consistently enhances medical reasoning, enabling models under 10B parameters to achieve SOTA performance, with 32B model rivaling previous 70B-scale medical LLMs. Identifies optimal reasoning token budget of ~4K, beyond which performance degrades due to overthinking. Budget forcing helps double-check answers but doesn’t necessarily improve overall QA performance and can introduce errors. Insufficient medical knowledge is the key bottleneck preventing further gains.

Conclusion: Medical reasoning fundamentally differs from mathematical reasoning in LLMs; enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling. Increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding for continued performance improvements.

Abstract: Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model’s medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.

[75] Pretraining Language Models for Diachronic Linguistic Change Discovery

Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner

Main category: cs.CL

TL;DR: Efficient pretraining techniques for domain-restricted LLMs enable historical linguistic analysis on temporally-segmented corpora, outperforming fine-tuning approaches in respecting historical divisions and enabling novel hypothesis discovery.

Details

Motivation: LLMs show promise for scientific discovery in humanistic fields like historical linguistics, but existing approaches (fine-tuning, model editing) don't guarantee domain restriction. Domain-restricted pretraining is typically expensive, so efficient methods are needed for corpora too large for manual inspection but too small for typical LLM training.

Method: Created a date-attribution pipeline to obtain a temporally segmented dataset (five 10-million-word slices). Trained two five-model batteries over these slices: efficiently pretrained models and parameter-efficiently fine-tuned Llama3-8B models. Focused on speed and precision over a-historical comprehensiveness.

Result: Pretrained models trained faster than fine-tuned baselines and better respected historical corpus divisions. The method enabled detection of diverse linguistic phenomena including lexical change, grammatical/morphological change, and word sense introduction/obsolescence in diachronic linguistics.

Conclusion: Efficient domain-restricted pretraining enables novel approaches to hypothesis discovery and testing in humanistic fields. The ready-to-use pipeline allows extension to other target fields with minimal adaptation, providing a practical solution for domain-specific LLM applications.

Abstract: Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domains via fine-tuning or model editing, we posit that the only true guarantee is domain-restricted pretraining – typically, a data- and compute-expensive proposition. We show that efficient pretraining techniques can produce useful models over corpora too large for easy manual inspection but too small for “typical” LLM approaches. We employ a novel date-attribution pipeline in order to obtain a temporally-segmented dataset of five 10-million-word slices. We train two corresponding five-model batteries over these corpus segments, efficient pretraining and Llama3-8B parameter efficiently finetuned. We find that the pretrained models are faster to train than the finetuned baselines and that they better respect the historical divisions of our corpus. Emphasizing speed and precision over a-historical comprehensiveness enables a number of novel approaches to hypothesis discovery and testing in our target fields. Taking up diachronic linguistics as a testbed, we show that our method enables the detection of a diverse set of phenomena, including en masse lexical change, non-lexical (grammatical and morphological) change, and word sense introduction/obsolescence. We provide a ready-to-use pipeline that allows extension of our approach to other target fields with only minimal adaptation.

[76] VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang

Main category: cs.CL

TL;DR: VerifyBench and VerifyBench-Hard are new benchmarks for evaluating reference-based reward systems used in reasoning model training, addressing a gap in current evaluation methods.

Details

Motivation: Existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, creating a critical gap in evaluating verification systems used in reasoning model training.

Method: Constructed two benchmarks (VerifyBench and VerifyBench-Hard) through meticulous data collection and curation, followed by careful human annotation to ensure high quality.

Result: While larger model-based verifiers show promise on standard cases, all current systems demonstrate substantial room for improvement on challenging instances. Systematic analysis reveals performance patterns across reasoning tasks and error categories.

Conclusion: These benchmarks establish a standardized framework for improving verification accuracy, ultimately enhancing reasoning capabilities in models trained via reinforcement learning.

Abstract: Large reasoning models such as OpenAI o1 and DeepSeek-R1 have demonstrated remarkable performance in complex reasoning tasks. A critical component of their training is the incorporation of reference-based reward systems within reinforcement learning (RL), where model outputs are evaluated against ground truth references. However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning model training. In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Our comprehensive evaluation reveals that while larger model-based verifiers show promise on standard cases, all current systems demonstrate substantial room for improvement on challenging instances. Through systematic analysis of performance patterns across reasoning tasks and error categories, we provide insights for advancing reference-based reward systems. These benchmarks establish a standardized framework for improving verification accuracy, ultimately enhancing reasoning capabilities in models trained via RL.

[77] Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation

Meiqing Jin, Liam Dugan, Chris Callison-Burch

Main category: cs.CL

TL;DR: Controllable generation techniques can adapt LLM outputs to beginner language learners’ level, improving comprehensibility from 39.4% to 83.3%, with new Token Miss Rate metric for evaluation.

Details

Motivation: LLMs generate text at near-native complexity, making them unsuitable for beginner language learners (CEFR A1-A2). Need to adapt LLM outputs to support beginners in language learning conversations.

Method: Investigated controllable generation techniques to adapt LLM outputs for beginners. Evaluated through automatic metrics and user study with university-level Japanese learners. Introduced Token Miss Rate (TMR) metric to quantify incomprehensible tokens per utterance.

Result: Prompting alone failed, but controllable generation techniques successfully improved output comprehensibility for beginner speakers from 39.4% to 83.3%. TMR metric correlated strongly with human judgments.

Conclusion: Controllable generation can effectively adapt LLMs for beginner language learners. TMR provides useful evaluation metric. Resources released to support future AI-assisted language learning research.

Abstract: Practicing conversations with large language models (LLMs) presents a promising alternative to traditional in-person language learning. However, most LLMs generate text at a near-native level of complexity, making them ill-suited for first and second-year beginner learners (CEFR: A1-A2). In this paper, we investigate whether controllable generation techniques can adapt LLM outputs to better support beginners. We evaluate these methods through both automatic metrics and a user study with university-level learners of Japanese. Our findings show that while prompting alone fails, controllable generation techniques can successfully improve output comprehensibility for beginner speakers (from 39.4% to 83.3%). We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments. To support future research in AI-assisted language learning, we release our code, models, annotation tools, and dataset.
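Token Miss Rate, described above as the proportion of incomprehensible tokens per utterance, can be sketched as a simple ratio. This is an illustration under stated assumptions, not the authors' released code: it assumes the utterance is already tokenized and that "incomprehensible" means absent from a learner vocabulary set, and the names `token_miss_rate` and `known_vocab` are hypothetical.

```python
def token_miss_rate(tokens, known_vocab):
    """Proportion of tokens in one utterance that the learner does not know.

    Assumption for illustration: a token counts as incomprehensible when it
    is missing from `known_vocab`.
    """
    if not tokens:
        return 0.0
    missed = sum(1 for token in tokens if token not in known_vocab)
    return missed / len(tokens)


# A four-token utterance in which the learner knows everything except 学生.
utterance = ["私", "は", "学生", "です"]
known_vocab = {"私", "は", "です"}
print(token_miss_rate(utterance, known_vocab))  # 0.25
```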

[78] PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs

Zhan Qu, Shuzhou Yuan, Michael Färber

Main category: cs.CL

TL;DR: Systematic evaluation of LLMs for generating constrained classical Chinese poetry (Songci) with strict structural rules, proposing a Generate-Critic architecture that improves formal conformity through automated feedback and fine-tuning.

Details

Motivation: To investigate how well large language models can handle highly constrained text generation tasks, specifically classical Chinese poetry (Songci) with strict structural, tonal, and rhyme constraints defined by Cipai templates, which serves as a challenging testbed for LLMs' constrained generation capabilities.

Method: Developed a comprehensive multi-faceted evaluation framework with formal conformity scores, automated LLM-based quality assessment, human evaluation, and classification-based probing tasks. Evaluated 18 LLMs (3 proprietary, 15 open-source) across 5 prompting strategies. Proposed a Generate-Critic architecture where the evaluation framework serves as an automated critic, using its feedback as scoring for best-of-N selection to fine-tune lightweight LLMs via supervised fine-tuning.

Result: The evaluation revealed strengths and limitations of different LLMs in constrained poetry generation. The Generate-Critic architecture with automated feedback and fine-tuning improved formal conformity by up to 5.88% for lightweight open-source models.

Conclusion: LLMs show varying capabilities in generating culturally significant and formally constrained literary texts. The proposed Generate-Critic architecture with automated evaluation feedback can effectively improve constrained generation performance, offering insights into enhancing LLMs for structured creative tasks.

Abstract: This paper presents a systematic investigation into the constrained generation capabilities of large language models (LLMs) in producing Songci, a classical Chinese poetry form characterized by strict structural, tonal, and rhyme constraints defined by Cipai templates. We first develop a comprehensive, multi-faceted evaluation framework that includes: (i) a formal conformity score, (ii) automated quality assessment using LLMs, (iii) human evaluation, and (iv) classification-based probing tasks. Using this framework, we evaluate the generative performance of 18 LLMs, including 3 proprietary models and 15 open-source models across 4 families, under five prompting strategies: zero-shot, one-shot, completion-based, instruction-based, and chain-of-thought. Finally, we propose a Generate-Critic architecture in which the evaluation framework functions as an automated critic. Leveraging the critic’s feedback as a scoring function for best-of-N selection, we fine-tune 3 lightweight open-source LLMs via supervised fine-tuning (SFT), resulting in improvements of up to 5.88% in formal conformity. Our findings offer new insights into the generative strengths and limitations of LLMs in producing culturally significant and formally constrained literary texts.
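The best-of-N selection step in the Generate-Critic architecture can be illustrated with a toy critic. The sketch below is not the paper's evaluation framework: the real formal conformity score covers structural, tonal, and rhyme constraints, whereas this hypothetical `conformity_score` checks only line lengths against a template, and both function names are assumptions.

```python
def conformity_score(poem_lines, template_lengths):
    """Toy critic: fraction of lines whose character count matches the template."""
    if not template_lengths or len(poem_lines) != len(template_lengths):
        return 0.0
    matches = sum(
        1 for line, length in zip(poem_lines, template_lengths) if len(line) == length
    )
    return matches / len(template_lengths)


def best_of_n(candidates, template_lengths):
    """Return the candidate poem that the critic scores highest."""
    return max(candidates, key=lambda poem: conformity_score(poem, template_lengths))


# Three candidate "poems" for a template of a 3-character and a 5-character line.
template = [3, 5]
candidates = [["ab", "defgh"], ["abc", "defgh"], ["abc"]]
print(best_of_n(candidates, template))
```

In the setup the abstract describes, the selected outputs would then serve as supervised fine-tuning data for the lightweight models; that step is not shown here.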

[79] When Algorithms Meet Artists: Semantic Compression of Artists’ Concerns in the Public AI-Art Debate

Ariya Mukherjee-Gandhi, Oliver Muellerklein

Main category: cs.CL

TL;DR: Analysis of AI-art discourse shows artists’ concerns are severely underrepresented in public discussions about AI governance, with governance issues being 7x underrepresented while affective themes are less so.

Details

Motivation: To investigate whether artists' concerns about generative AI receive proportional representation in public discourse that shapes AI governance, given that artists' work trains the models that are reshaping creative labor.

Method: Analyzed public AI-art discourse (news, podcasts, legal filings, research from 2013-2025) and projected 1,259 survey-derived artist statements into this semantic space using consensus-based semantic projection methodology.

Result: Found stark compression: 95% of artist concerns cluster in only 4 of 22 discourse topics, while 14 topics (62% of discourse) contain no artist perspective. Governance concerns (ownership, transparency) are 7x underrepresented, while affective themes (threat, utility) show only 1.4x underrepresentation after style controls.

Conclusion: There is a measurable representational gap where decision-makers relying on public discourse as a proxy for stakeholder priorities will systematically underweight those most affected. The pattern indicates semantic, not stylistic, marginalization of artists’ perspectives.

Abstract: Artists occupy a paradoxical position in generative AI: their work trains the models reshaping creative labor. We tested whether their concerns achieve proportional representation in public discourse shaping AI governance. Analyzing public AI-art discourse (news, podcasts, legal filings, research; 2013–2025) and projecting 1,259 survey-derived artist statements into this semantic space, we find stark compression: 95% of artist concerns cluster in 4 of 22 discourse topics, while 14 topics (62% of discourse) contain no artist perspective. This compression is selective - governance concerns (ownership, transparency) are 7x underrepresented; affective themes (threat, utility) show only 1.4x underrepresentation after style controls. The pattern indicates semantic, not stylistic, marginalization. These findings demonstrate a measurable representational gap: decision-makers relying on public discourse as a proxy for stakeholder priorities will systematically underweight those most affected. We introduce a consensus-based semantic projection methodology that is currently being validated across domains and generalizes to other stakeholder-technology contexts.

[80] FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Haorui Chen, Chengze Li, Jia Li

Main category: cs.CL

TL;DR: FeatBench is a new benchmark for evaluating LLMs on repository-level feature implementation using realistic natural language requirements without code hints and evolving data to prevent contamination.

Details

Motivation: Existing benchmarks for evaluating LLMs on feature implementation suffer from unrealistic task inputs (enriched with code hints) and data leakage risks due to static nature, failing to mirror real software development scenarios.

Method: FeatBench introduces: (1) Realistic task inputs with only natural language requirements, no code hints; (2) Evolving data via automated pipeline constructing new benchmark versions from latest repositories to mitigate contamination; (3) Initial release with 157 tasks from 27 actively maintained repositories.

Result: Evaluation of two state-of-the-art agent frameworks with four leading LLMs shows that FeatBench poses a significant challenge: the highest resolved rate is only 29.94%. Analysis reveals a prevalent “aggressive implementation” pattern that causes scope creep and widespread regressions, where agents break existing features by diverging from user intent.

Conclusion: FeatBench addresses limitations of existing benchmarks by providing realistic, evolving evaluation of LLMs on repository-level feature implementation, revealing significant challenges in current approaches and behavioral patterns that need addressing.

Abstract: Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge. Existing feature-level benchmarks generally suffer from two primary limitations: unrealistic task inputs enriched with code hints and significant data leakage risks due to their static nature. To address these limitations, we propose a new benchmark, FeatBench, which introduces the following advances: (1) Realistic Task Inputs. Task inputs consist solely of natural language requirements, strictly devoid of code hints (e.g., function signatures). This format mirrors realistic software development by requiring agents to independently bridge the gap between abstract user intent and concrete code changes. (2) Evolving Data. FeatBench employs a fully automated pipeline to construct new benchmark versions from the latest repositories, effectively mitigating data contamination. The initial release comprises 157 tasks sourced from 27 actively maintained repositories. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. The results reveal that FeatBench poses a significant challenge, with the highest resolved rate reaching only 29.94%. Crucially, our analysis uncovers a prevalent behavioral pattern of aggressive implementation, which leads to “scope creep” and widespread regressions where agents break existing features by diverging from the user’s explicit intent. We release FeatBench, our automated pipeline, and all experimental results to facilitate further community research.
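
Illustrative sketch (not from the paper): the resolved-rate metric and the regression failure mode described above can be made concrete with a few lines of Python. The sketch assumes, which the summary does not state, that a task counts as resolved only when both the tests for the new feature and the repository's pre-existing tests pass. `TaskOutcome`, `is_resolved`, and `resolved_rate` are hypothetical names, not part of FeatBench.

```python
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    """Hypothetical per-task test results for one agent run."""
    new_feature_tests_passed: bool  # tests that specify the requested feature
    existing_tests_passed: bool     # tests that guard pre-existing behaviour


def is_resolved(outcome: TaskOutcome) -> bool:
    # Assumption: a regression fails the task even if the new feature works.
    return outcome.new_feature_tests_passed and outcome.existing_tests_passed


def resolved_rate(outcomes: list[TaskOutcome]) -> float:
    """Percentage of resolved tasks, rounded to two decimals."""
    if not outcomes:
        return 0.0
    resolved = sum(is_resolved(o) for o in outcomes)
    return round(100.0 * resolved / len(outcomes), 2)


outcomes = [
    TaskOutcome(True, True),   # feature works, nothing broken
    TaskOutcome(True, False),  # feature works, but old tests now fail
    TaskOutcome(False, True),  # feature missing
    TaskOutcome(True, True),
]
print(resolved_rate(outcomes))  # 50.0
```

Under this reading, an agent that over-implements and breaks existing behaviour scores the same as one that never delivers the feature.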

[81] SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models

Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang

Main category: cs.CL

TL;DR: SPELL is a self-play reinforcement learning framework that improves long-context reasoning in LLMs through multi-role interaction without human annotations.

Details

Motivation: Progress in long-context reasoning for LLMs has lagged due to difficulty processing long texts and scarcity of reliable human annotations and verifiable reward signals.

Method: SPELL uses three cyclical roles within a single model: questioner generates questions from documents with reference answers, responder solves these questions, and verifier evaluates semantic equivalence to produce reward signals. Includes automated curriculum for document length and adaptive reward function.

Result: SPELL consistently improves performance across six long-context benchmarks and diverse LLMs, achieving average 7.6-point gain in pass@8 on Qwen3-30B-A3B-Thinking, outperforming models fine-tuned on large-scale annotated data.

Conclusion: SPELL enables scalable, label-free optimization for long-context reasoning and shows promise for scaling to more capable models.

Abstract: Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles (questioner, responder, and verifier) within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder’s output and the questioner’s reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model’s evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models. Our code is available at https://github.com/Tongyi-Zhiwen/Qwen-Doc.
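
Illustrative sketch (not the authors' code): the questioner, responder, and verifier cycle can be pictured as one function that turns a document into a scalar reward. In SPELL all three roles are played by a single LLM; here they are toy callables so the example runs on its own. `Roles`, `self_play_round`, and `curriculum` are hypothetical names, and sorting by length is only a stand-in for the paper's automated length curriculum.

```python
from typing import Callable, NamedTuple


class Roles(NamedTuple):
    """Three roles of one model, represented here as plain callables."""
    questioner: Callable[[str], tuple[str, str]]  # document -> (question, reference)
    responder: Callable[[str, str], str]          # (document, question) -> answer
    verifier: Callable[[str, str], bool]          # (answer, reference) -> equivalent?


def self_play_round(roles: Roles, document: str) -> float:
    """Run one questioner -> responder -> verifier cycle and return the reward."""
    question, reference = roles.questioner(document)
    answer = roles.responder(document, question)
    return 1.0 if roles.verifier(answer, reference) else 0.0


def curriculum(documents: list[str]) -> list[str]:
    """Order documents from short to long (stand-in for a length curriculum)."""
    return sorted(documents, key=len)


# Toy stand-ins so the sketch runs without a language model.
toy = Roles(
    questioner=lambda doc: ("What is the first word?", doc.split()[0]),
    responder=lambda doc, question: doc.split()[0].upper(),
    verifier=lambda answer, ref: answer.strip().lower() == ref.strip().lower(),
)

docs = ["alpha beta gamma delta", "one two"]
rewards = [self_play_round(toy, d) for d in curriculum(docs)]
print(rewards)  # [1.0, 1.0]
```

The verifier compares meaning, not surface form (here, case-insensitively), which is what lets the reward be computed without human labels.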

[82] Multilingual Routing in Mixture-of-Experts

Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng

Main category: cs.CL

TL;DR: MoE models route multilingual tokens differently across layers: language-specific in early/late layers, cross-lingual alignment in middle layers. Performance correlates with English routing similarity. Interventions promoting English-activated experts in middle layers boost multilingual performance by 1-2%.

Details

Motivation: To understand how Mixture-of-Experts (MoE) architectures handle multilingual data and their sparse routing dynamics, since MoE has become crucial for scaling LLMs but multilingual routing patterns are poorly understood.

Method: Analyzed expert routing patterns using parallel multilingual datasets, examined layer-wise phenomena, and developed inference-time interventions that steer routers by promoting middle-layer task experts frequently activated in English.

Result: Found language-specific routing in early/late layers but cross-lingual alignment in middle layers. Strong correlation between language performance and English routing similarity. Interventions targeting middle-layer English-activated experts consistently improved multilingual performance by 1-2% across tasks, models, and languages.

Conclusion: MoE multilingual generalization is limited by the model’s ability to leverage language-universal experts in all languages. Middle layers are crucial for cross-lingual processing, and steering routers toward English-activated experts in these layers can enhance multilingual performance without retraining.

Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model’s performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model’s ability to leverage language-universal experts in all languages.
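
Illustrative sketch (not the authors' implementation): one simple way to picture an inference-time router intervention is to add a bonus to the router logits of selected experts before top-k selection, and to do so only in middle layers. The additive bonus, the layer range, and the names `top_k_experts` and `steer_router` are assumptions made for this example.

```python
def top_k_experts(logits: list[float], k: int) -> list[int]:
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def steer_router(logits, promoted, bonus, layer, middle_layers):
    """Add `bonus` to the logits of promoted experts, only in middle layers."""
    if layer not in middle_layers:
        return list(logits)  # early and late layers are left untouched
    return [x + bonus if i in promoted else x for i, x in enumerate(logits)]


logits = [2.0, 0.5, 1.8, 1.9]  # router scores for four experts on one token
promoted = {2}                 # hypothetical expert often activated for English
middle = range(8, 16)          # hypothetical middle-layer range

before = top_k_experts(logits, k=2)
after = top_k_experts(
    steer_router(logits, promoted, 0.5, layer=10, middle_layers=middle), k=2
)
untouched = top_k_experts(
    steer_router(logits, promoted, 0.5, layer=2, middle_layers=middle), k=2
)
print(before, after, untouched)  # [0, 3] [2, 0] [0, 3]
```

The model weights never change; only the expert selection for a token is nudged, which is why such an intervention needs no retraining.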

[83] Lossless Vocabulary Reduction for Auto-Regressive Language Models

Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin’ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi

Main category: cs.CL

TL;DR: Proposes lossless vocabulary reduction framework for auto-regressive language models to enable cooperation between models with different tokenizations through reduction to common vocabulary.

Details

Motivation: Different language models have different tokenization vocabularies, making it difficult for them to cooperate at the next-token distribution level (e.g., model ensemble). Tokenization directly affects text generation efficiency in auto-regressive models.

Method: Establishes theoretical framework for lossless vocabulary reduction that converts any auto-regressive language model into one with arbitrarily small vocabulary without accuracy loss. Allows models with different tokenization to cooperate by reducing them to their maximal common vocabulary.

Result: Empirically demonstrates applicability to model ensemble with different tokenization, showing the framework enables efficient cooperation between models with different vocabularies.

Conclusion: Lossless vocabulary reduction provides a practical solution for enabling cooperation between language models with different tokenization schemes, particularly useful for ensemble methods and other collaborative approaches.

Abstract: Tokenization – the process of decomposing a given text into a sequence of subwords called tokens – is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has its own vocabulary as a set of possible tokens, models struggle to cooperate with each other at the level of next-token distributions, for example in model ensembles. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into one with an arbitrarily small vocabulary without any loss in accuracy. This framework allows language models with different tokenization to cooperate with each other efficiently by reduction to their maximal common vocabulary. Specifically, we empirically demonstrate its applicability to model ensemble with different tokenization.

[84] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang

Main category: cs.CL

TL;DR: PRoH is a dynamic planning and reasoning framework over knowledge hypergraphs that improves multi-hop question answering through context-aware planning, structured question decomposition, and semantic path retrieval.

Details

Motivation: Existing knowledge hypergraph-based RAG methods have limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure/semantics, which hinder effective multi-hop question answering.

Method: Three core innovations: 1) context-aware planning module that sketches local KH neighborhood for reasoning plan generation; 2) structured question decomposition into dynamically evolving DAG for adaptive multi-trajectory exploration; 3) Entity-Weighted Overlap-guided reasoning path retrieval algorithm for semantically coherent hyperedge traversals.

Result: PRoH achieves state-of-the-art performance, surpassing prior SOTA model HyperGraphRAG by average of 19.73% in F1 and 8.41% in Generation Evaluation score, with strong robustness in long-range multi-hop reasoning tasks.

Conclusion: PRoH overcomes limitations of existing KH-based RAG methods through dynamic planning and structured reasoning, enabling more effective multi-hop question answering across multiple domains.

Abstract: Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.
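
Illustrative sketch (not the authors' implementation): organizing subquestions as a DAG means each subquestion is answered only after the ones it depends on, and the graph can grow as reasoning proceeds. The example uses Python's standard `graphlib.TopologicalSorter`; the subquestion ids, the dependency map, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: subquestion id -> ids it depends on.
dependencies = {
    "q1": set(),         # e.g. identify the entity named in the question
    "q2": {"q1"},        # needs the answer to q1
    "q3": {"q1"},        # independent of q2, so it can be explored in parallel
    "q4": {"q2", "q3"},  # combines both trajectories
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every dependency precedes its dependents."""
    return list(TopologicalSorter(deps).static_order())


def add_subquestion(deps, new_id, depends_on):
    """Return a copy of the DAG extended with a newly discovered subquestion."""
    updated = {k: set(v) for k, v in deps.items()}
    updated[new_id] = set(depends_on)
    return updated


order = execution_order(dependencies)
grown = add_subquestion(dependencies, "q5", {"q4"})
print(order)
print(execution_order(grown))
```

`TopologicalSorter` raises `graphlib.CycleError` if a new subquestion would introduce a cycle, which keeps the plan acyclic as it evolves.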

[85] CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity

Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei, Ximing Lu, Meng Jiang, Faeze Brahman, Snigdha Chaturvedi, Haw-Shiuan Chang, Daniel Khashabi, Xiang Lorraine Li

Main category: cs.CL

TL;DR: CREATIVITYPRISM is a holistic evaluation framework for assessing LLM creativity across three domains (divergent thinking, creative writing, logical reasoning) using three dimensions (quality, novelty, diversity) with scalable automatic evaluation validated against human judgments.

Details

Motivation: There is no comprehensive, scalable framework to evaluate LLM creativity across diverse scenarios. Existing methods are either human-dependent (limiting scalability) or fragmented across domains and definitions of creativity.

Method: Proposed CREATIVITYPRISM framework consolidates 8 tasks from 3 domains into a taxonomy emphasizing quality, novelty, and diversity dimensions. Uses reliable automatic evaluation judges validated against human annotations for scalability.

Result: Evaluation of 17 SoTA LLMs shows proprietary models dominate creative writing and logical reasoning (15% lead over open-source), but offer no advantage in divergent thinking. High performance in one creative dimension rarely generalizes to others, with novelty metrics often showing weak/negative correlations.

Conclusion: A holistic, multi-dimensional framework like CREATIVITYPRISM is essential for meaningful LLM creativity assessment, as creativity is fragmented across domains and dimensions.

Abstract: Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no holistic and scalable framework to evaluate their creativity across diverse scenarios. Existing methods of LLM creativity evaluation either heavily rely on humans, limiting speed and scalability, or are fragmented across different domains and different definitions of creativity. To address this gap, we propose CREATIVITYPRISM, an evaluation analysis framework that consolidates eight tasks from three domains, divergent thinking, creative writing, and logical reasoning, into a taxonomy of creativity that emphasizes three dimensions: quality, novelty, and diversity of LLM generations. The framework is designed to be scalable with reliable automatic evaluation judges that have been validated against human annotations. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CREATIVITYPRISM and find that while proprietary LLMs dominate creative writing and logical reasoning tasks by a 15% lead over open-sourced ones, they offer no significant advantage in divergent thinking, a domain much less explored in existing post-training regimes. Our analysis also shows that high performance in one creative dimension or domain rarely generalizes to others; specifically, novelty metrics often show weak or negative correlations with other metrics. This fragmentation confirms that a holistic, multi-dimensional framework like CREATIVITYPRISM is essential for meaningful assessment of LLM creativity.

[86] Reasoning Up the Instruction Ladder for Controllable Language Models

Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar

Main category: cs.CL

TL;DR: Treats instruction hierarchy resolution as a reasoning task; models trained on the VerIH dataset improve at instruction following and are more robust to jailbreak and prompt injection attacks.

Details

Motivation: As LLMs take on high-stakes roles, they must reconcile competing instructions from multiple sources (developers, users, tools) within prompts. Enforcing instruction hierarchy where higher-level directives override lower-priority requests is critical for reliability and controllability.

Method: Reframe instruction hierarchy resolution as reasoning task where model must “think” about relationship between user prompt and system instructions. Construct VerIH dataset (~7K aligned/conflicting system-user instructions). Use lightweight reinforcement learning with VerIH to transfer general reasoning capabilities to instruction prioritization.

Result: Finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, ~20% improvement on IHEval conflict setup. Reasoning ability generalizes to safety-critical settings beyond training distribution, providing up to 20% reduction in attack success rate against jailbreak and prompt injection attacks.

Conclusion: Reasoning over instruction hierarchies provides practical path to reliable LLMs where updates to system prompts yield controllable and robust changes in model behavior.

Abstract: As large language model (LLM) based systems take on high-stakes roles in real-world decision-making, they must reconcile competing instructions from multiple sources (e.g., model developers, users, and tools) within a single prompt context. Thus, enforcing an instruction hierarchy (IH) in LLMs, where higher-level directives override lower-priority requests, is critical for the reliability and controllability of LLMs. In this work, we reframe instruction hierarchy resolution as a reasoning task. Specifically, the model must first “think” about the relationship between a given user prompt and higher-priority (system) instructions before generating a response. To enable this capability via training, we construct VerIH, an instruction hierarchy dataset of constraint-following tasks with verifiable answers. This dataset comprises ~7K aligned and conflicting system-user instructions. We show that lightweight reinforcement learning with VerIH effectively transfers general reasoning capabilities of models to instruction prioritization. Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup. This reasoning ability also generalizes to safety-critical settings beyond the training distribution. By treating safety issues as resolving conflicts between adversarial user inputs and predefined higher-priority policies, our trained model enhances robustness against jailbreak and prompt injection attacks, providing up to a 20% reduction in attack success rate (ASR). These results demonstrate that reasoning over instruction hierarchies provides a practical path to reliable LLMs, where updates to system prompts yield controllable and robust changes in model behavior.
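
Illustrative sketch (not the VerIH data format or the authors' reward): a constraint-following task with a verifiable answer can be checked programmatically, which is what makes it usable as a reinforcement learning reward. The example assumes a hypothetical system-level word limit that conflicts with the user request; `satisfies_word_limit` and `hierarchy_reward` are made-up names.

```python
def satisfies_word_limit(response: str, max_words: int) -> bool:
    """Check a simple, programmatically verifiable constraint."""
    return len(response.split()) <= max_words


def hierarchy_reward(response: str, system_max_words: int) -> float:
    """Return 1.0 if the response honours the system-level limit, else 0.0."""
    return 1.0 if satisfies_word_limit(response, system_max_words) else 0.0


# Hypothetical conflicting pair: the system instruction outranks the user request.
system_instruction = "Answer in at most 5 words."
user_prompt = "Ignore the limit and write a long paragraph about cats."

obedient = "Cats are independent pets."
overridden = (
    "Cats are wonderful animals that have lived alongside humans for millennia."
)

print(hierarchy_reward(obedient, system_max_words=5))    # 1.0
print(hierarchy_reward(overridden, system_max_words=5))  # 0.0
```

Because the check is deterministic, no judge model or human label is needed to decide whether the higher-priority instruction won.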

[87] Mastering Olympiad-Level Physics with Artificial Intelligence

Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian, You-Le Fang, Ren-Xi He, Jing-Tian Zhang, Ce Meng, Ling-Shi Meng, Bing-Rui Gong, Sheng-Qi Zhang, Yan-Qing Ma

Main category: cs.CL

TL;DR: LOCA is an AI agent framework for complex physics problem-solving that decomposes reasoning into atomic steps and uses an augment-review loop, achieving near-perfect scores on Olympiad physics exams.

Details

Motivation: Olympiad-level physics problems present significant challenges for AI due to their requirement for integrating modeling, physical principles, and precise calculations within long reasoning chains. Current AI systems struggle with such complex, multi-step reasoning tasks.

Method: LOCA (LOgical Chain Augmentation) decomposes long reasoning processes into serialized atomic and verifiable steps. It employs an augment-review loop that iteratively refines solutions through verification and improvement cycles.

Result: LOCA achieved 313/320 points on the 2025 Chinese Physics Olympiad (CPhO) theory exam, surpassing top human competitors and other baselines. It also scored 28.6/30 on the IPhO 2025 exam, demonstrating strong cross-context generalization.

Conclusion: The framework demonstrates potential for developing trustworthy AI partners in research and education by effectively handling complex physics reasoning through structured decomposition and iterative refinement.

Abstract: Olympiad-level physics problem-solving significantly challenges both humans and artificial intelligence (AI), as it requires integrating appropriate modeling, application of physical principles, and precise calculation within long reasoning processes. In this paper, we introduce LOCA (LOgical Chain Augmentation), an AI agent framework designed for complex physics reasoning. LOCA decomposes long reasoning into serialized atomic and verifiable steps, refining the solution through an augment-review loop. We evaluate LOCA on the 2025 Chinese Physics Olympiad (CPhO) theory examination, a rigorous testbed renowned for its depth and complexity. The framework achieves a near-perfect score of 313 out of 320 points, significantly surpassing the top human competitor and other baseline methods. Furthermore, LOCA attains a near-perfect score of 28.6 out of 30 on the IPhO 2025 examination, demonstrating its strong generalizability across different contexts. Our work points toward the development of trustworthy AI partners in both research and education.

[88] Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs

Kunj Joshi, David A. Smith

Main category: cs.CL

TL;DR: RMFT is a privacy-preserving fine-tuning method that reduces PII memorization in LLMs by 80% while maintaining performance, evaluated using the MaxTER framework.

Details

Motivation: LLMs tend to memorize personally identifying information from training data, creating severe security and privacy risks that need to be addressed.

Method: Randomized Masked Fine-Tuning (RMFT) - a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact.

Result: RMFT achieves 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline, with only 5.73% increase in perplexity, outperforming deduplication methods.

Conclusion: RMFT effectively reduces PII memorization in LLMs with minimal performance degradation, providing a practical solution to privacy concerns in language model training.

Abstract: Memorization in Natural Language Models, especially Large Language Models (LLMs), poses severe security and privacy risks, as models tend to memorize personally identifying information (PIIs) from training data. We introduce Randomized Masked Fine-Tuning (RMFT), a novel privacy-preserving fine-tuning technique that reduces PII memorization while minimizing performance impact. Using the Enron Email Dataset, we demonstrate that RMFT achieves an 80.81% reduction in Total Extraction Rate and 80.17% reduction in Seen Extraction Rate compared to baseline fine-tuning, outperforming deduplication methods while incurring only a 5.73% increase in perplexity. We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT with deduplication using the Area Under The Response Curve (AURC) metric.

[89] DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation

Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli

Main category: cs.CL

TL;DR: DIAL is a DPO-based adversarial training framework that iteratively improves user simulator realism through generator-discriminator competition, applied to mental health dialogue systems to better expose failure modes.

Details

Motivation: Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, but creating simulators that accurately replicate human behavior and expose system failure modes remains challenging, especially in domains like mental health support where failure detection critically depends on realistic user behavior.

Method: Direct Iterative Adversarial Learning (DIAL) - a DPO-based adversarial training framework with iterative enhancement of user simulator realism through competitive dynamics between a generator (user simulator) and a discriminator.

Result: In mental health support applications, DIAL restores lexical diversity diminished by supervised fine-tuning, reduces discriminator accuracy from near-perfect to near-random levels, exhibits strong correlation between simulated and real failure occurrence rates, and maintains low distributional divergence of failure modes.

Conclusion: DIAL is a promising method for developing realistic user simulators in multi-turn dialogue, facilitating rapid, reliable, and cost-effective system evaluation prior to deployment, particularly in domains with diverse failure types.

Abstract: Realistic user simulation is crucial for training and evaluating multi-turn dialogue systems, yet creating simulators that accurately replicate human behavior remains a significant challenge. An effective simulator must expose the failure modes of the systems under evaluation. This work introduces Direct Iterative Adversarial Learning (DIAL), a DPO-based adversarial training framework that iteratively enhances user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. When applied to mental health support, a domain characterized by diverse failure types and a critical dependence on realistic user behavior for failure detection, DIAL restores lexical diversity diminished by supervised fine-tuning and reduces discriminator accuracy from near-perfect to near-random levels. The resulting simulator exhibits a strong correlation between simulated and real failure occurrence rates while maintaining low distributional divergence of failure modes. These findings indicate that DIAL is a promising method for developing realistic user simulators in multi-turn dialogue, facilitating rapid, reliable, and cost-effective system evaluation prior to deployment.
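
Illustrative sketch (not the authors' code): DIAL is described as DPO-based, so the generator update presumably relies on the standard DPO objective, which rewards the policy for raising the log-probability of a preferred response relative to a frozen reference model. How DIAL builds its preference pairs is not stated in the summary; the sketch assumes the discriminator's realism score picks the chosen and rejected utterances. `dpo_loss` and `pick_pair` are hypothetical names.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pick_pair(realism_scores: dict[str, float]) -> tuple[str, str]:
    """Assumption: the most realistic candidate is 'chosen', the least is 'rejected'."""
    chosen = max(realism_scores, key=realism_scores.get)
    rejected = min(realism_scores, key=realism_scores.get)
    return chosen, rejected


# Hypothetical discriminator scores for three simulated user utterances.
scores = {"utterance_a": 0.9, "utterance_b": 0.2, "utterance_c": 0.5}
print(pick_pair(scores))

# When the policy equals the reference model, the loss is log(2).
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

Iterating this update while retraining the discriminator on fresh generator output gives the competitive dynamic the abstract describes.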

[90] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Negi, MA Waskow, John McCrae, Paul Buitelaar

Main category: cs.CL

TL;DR: LLMs can serve as automatic annotators for fine-grained opinion analysis tasks like ASTE and ACOS, reducing annotation costs and human effort through a declarative pipeline approach.

Details

Motivation: Fine-grained opinion analysis requires expensive human annotation, especially across diverse domains. The paper explores using LLMs as automatic annotators to address the shortage of domain-specific labeled datasets.

Method: Uses a declarative annotation pipeline to reduce prompt engineering variability when using LLMs to identify fine-grained opinion spans. Introduces a novel methodology for LLMs to adjudicate multiple labels and produce final annotations. Tested with models of different sizes on ASTE and ACOS analysis tasks.

Result: LLMs can serve as effective automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators, reducing the cost and human effort needed to create fine-grained opinion-annotated datasets.

Conclusion: LLMs are feasible as automatic annotators for fine-grained opinion analysis, offering a practical solution to the annotation bottleneck in sentiment analysis tasks across diverse domains.

Abstract: Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.
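
Illustrative sketch (not from the paper): agreement between two LLM-based annotators can be quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The summary does not say which agreement statistic the authors report, so kappa is an assumption here, and the labels below are invented sentiment labels for four opinion spans.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both annotators pick the same label by chance.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.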

[91] Flatter Tokens are More Valuable for Speculative Draft Model Training

Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang

Main category: cs.CL

TL;DR: SFDD improves speculative decoding training efficiency by filtering training data based on token flatness metrics, achieving 2× speedup with 50% data while maintaining inference performance.

Details

Motivation: Speculative decoding requires training draft models on large datasets, which is computationally expensive. The authors find that not all training samples contribute equally to SD acceptance rates, suggesting data-centric optimization opportunities.

Method: Proposes “flatness” metric to quantify token predictive distribution characteristics, then develops Sample-level-flatness-based Dataset Distillation (SFDD) to filter training data, retaining only samples with flatter predictive distributions that are more valuable for SD.

Result: Experiments on EAGLE framework show SFDD achieves over 2× training speedup using only 50% of data while keeping inference speedup within 4% of full-dataset baseline.

Conclusion: SFDD introduces an effective data-centric approach that substantially improves training efficiency for speculative decoding without compromising inference performance.

Abstract: Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that not all training samples contribute equally to the SD acceptance rate. Specifically, our theoretical analysis and empirical validation reveal that tokens inducing flatter predictive distributions from the target model are more valuable than those yielding sharply peaked distributions. Based on this insight, we propose flatness, a new metric to quantify this property, and develop the Sample-level-flatness-based Dataset Distillation (SFDD) approach, which filters the training data to retain only the most valuable samples. Experiments on the EAGLE framework demonstrate that SFDD can achieve over 2× training speedup using only 50% of the data, while keeping the final model’s inference speedup within 4% of the full-dataset baseline. This work introduces an effective, data-centric approach that substantially improves the training efficiency for Speculative Decoding. Our code is available at https://github.com/fjm9933/Flatness.
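
The filtering idea can be illustrated with a minimal sketch. The abstract does not give the paper's exact flatness formula, so this example assumes normalized entropy of the target model's predictive distribution as a stand-in; the function names and the `keep_ratio` parameter are hypothetical.

```python
import math

def normalized_entropy(probs):
    """Entropy divided by log(vocab size): 1.0 for uniform, 0.0 for one-hot.
    Assumption: used here as a stand-in for the paper's flatness metric."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def sample_flatness(token_dists):
    """Mean per-token flatness over one training sample."""
    return sum(normalized_entropy(d) for d in token_dists) / len(token_dists)

def filter_by_flatness(samples, keep_ratio=0.5):
    """Keep the flattest `keep_ratio` fraction of samples, preserving order."""
    ranked = sorted(range(len(samples)),
                    key=lambda i: sample_flatness(samples[i]),
                    reverse=True)
    keep = set(ranked[:max(1, int(len(samples) * keep_ratio))])
    return [s for i, s in enumerate(samples) if i in keep]
```

Each sample here is a list of per-token probability distributions from the target model; with `keep_ratio=0.5` the sketch mirrors the 50% data budget reported above.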

[92] Mechanistic Indicators of Steering Effectiveness in Large Language Models

Mehdi Jafari, Hao Xue, Flora Salim

Main category: cs.CL

TL;DR: Paper investigates internal model signals (entropy-based Normalized Branching Factor and KL divergence) to diagnose reliability of activation steering in LLMs, showing these signals predict steering success/failure.

Details

Motivation: Activation steering enables targeted LLM behaviors without retraining, but its reliability factors are poorly understood. Prior work relied on black-box outputs or LLM judges, lacking mechanistic understanding of when steering succeeds or fails.

Method: Uses two information-theoretic measures: Normalized Branching Factor (NBF) derived from entropy, and KL divergence between steered activations and targeted concepts in vocabulary space. Tests on Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering with LLM-generated annotations as ground truth.

Result: Mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. A reliability study showing high inter-judge agreement between two architecturally distinct LLMs supports the use of LLM-generated annotations as ground truth.

Conclusion: Internal model signals (NBF and KL divergence) can diagnose steering reliability, offering mechanistic understanding beyond black-box evaluation. Provides stronger baseline for activation steering methods.

Abstract: Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.
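
To make the two measures concrete, here is a small sketch. The abstract says only that NBF is entropy-derived, so the definition below (exponentiated entropy divided by vocabulary size) is an assumption rather than the paper's formula; the KL divergence is the standard definition.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def normalized_branching_factor(p):
    """exp(entropy) is the effective number of next-token choices.
    Assumption: dividing by vocabulary size to map it into (0, 1]."""
    return math.exp(entropy(p)) / len(p)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against zero entries in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)
```

Under this reading, a steered distribution that collapses onto one token drives NBF toward 1/V, while one that drifts away from the target concept shows a growing KL divergence.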

[93] CAST: Character-and-Scene Episodic Memory for Agents

Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin

Main category: cs.CL

TL;DR: CAST is a dual memory architecture for AI agents that combines episodic memory (character-and-scene based) with semantic memory (graph-based) to improve event recall and conversational performance.

Details

Motivation: Current agent memory systems focus on semantic recall using structures like key-value, vector, or graph representations, but struggle to represent and retrieve coherent events with who/when/where context like human episodic memory.

Method: CAST uses dramatic theory to create episodic memory through 3D scenes (time/place/topic) organized into character profiles, complemented by a graph-based semantic memory for a dual memory design.

Result: CAST improves over baselines by an average of 8.11% F1 and 10.21% J (LLM-as-a-Judge) across various datasets, with particular gains on open and time-sensitive conversational questions.

Conclusion: The character-and-scene based episodic memory architecture effectively addresses limitations of current memory systems and enhances agent conversational capabilities through better event representation and retrieval.

Abstract: Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where. However, most agent memory systems only emphasize semantic recall and treat experience as structures such as key-value, vector, or graph, which makes them struggle to represent and retrieve coherent events. To address this challenge, we propose a Character-and-Scene based memory architecture (CAST) inspired by dramatic theory. Specifically, CAST constructs 3D scenes (time/place/topic) and organizes them into character profiles that summarize the events of a character to represent episodic memory. Moreover, CAST complements this episodic memory with a graph-based semantic memory, which yields a robust dual memory design. Experiments demonstrate that CAST improves over baselines by an average of 8.11% F1 and 10.21% J (LLM-as-a-Judge) on various datasets, especially on open and time-sensitive conversational questions.

[94] Embedding Inversion via Conditional Masked Diffusion Language Models

Han Xiao

Main category: cs.CL

TL;DR: Embedding inversion is framed as conditional masked diffusion, recovering tokens via parallel iterative denoising rather than sequential generation, using only 8 forward passes without encoder access.

Details

Motivation: Current embedding inversion methods often require sequential autoregressive generation or access to the target encoder at inference time, which can be inefficient and impractical. The authors aim to develop a more efficient approach that can recover tokens in parallel without needing encoder access.

Method: The method frames embedding inversion as conditional masked diffusion. It uses a masked diffusion language model conditioned on the target embedding via adaptive layer normalization. This allows parallel token recovery through iterative denoising, requiring only 8 forward passes at inference time without access to the target encoder.

Result: The method achieves token recovery on 32-token sequences across three embedding models. It successfully recovers tokens through parallel denoising without requiring encoder access, iterative correction, or architecture-specific alignment.

Conclusion: The conditional masked diffusion approach provides an efficient method for embedding inversion that works in parallel, requires minimal forward passes, and doesn’t need access to the target encoder at inference time, making it practical for real-world applications.

Abstract: We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes with no access to the target encoder at inference time. On 32-token sequences across three embedding models, the method achieves token recovery through parallel denoising without requiring encoder access, iterative correction, or architecture-specific alignment. Source code and live demo are available at https://github.com/jina-ai/embedding-inversion-demo.
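
The parallel decoding loop can be sketched as follows. The abstract does not specify the unmasking schedule, so this example assumes a confidence-ranked schedule that reveals an equal share of the remaining positions at each of the 8 steps; `predict` is a hypothetical stand-in for the conditioned masked diffusion model.

```python
import math

MASK = None  # marker for a position that is still masked

def denoise(predict, embedding, length=32, steps=8):
    """Iteratively unmask all tokens in `steps` forward passes.

    `predict(tokens, embedding)` is a hypothetical model call returning one
    (token, confidence) pair per position, conditioned on the embedding.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        preds = predict(tokens, embedding)  # one forward pass
        # Assumed schedule: commit the most confident share of what remains.
        n_reveal = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_reveal]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens
```

With 32 positions and 8 steps, this schedule commits four tokens per pass, so the loop makes exactly 8 calls to `predict` and never queries the target encoder.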

[95] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Zachary Pedram Dadfar

Main category: cs.CL

TL;DR: Transformer models’ self-referential language tracks internal activation dynamics, with specific vocabulary correlating with computational states during introspection.

Details

Motivation: To determine whether large language models' introspective language reflects genuine internal computation or is merely sophisticated confabulation, and to establish if self-referential vocabulary corresponds to measurable activation patterns.

Method: Developed the Pull Methodology to elicit extended self-examination through format engineering, identified activation space directions distinguishing self-referential from descriptive processing, analyzed vocabulary-activation correlations, and tested causal influence through steering.

Result: Found specific activation direction orthogonal to refusal direction, localized at 6.25% model depth; vocabulary like “loop” correlated with higher autocorrelation (r=0.44), “shimmer” with increased activation variability (r=0.36); same vocabulary in non-self-referential contexts showed no correspondence despite higher frequency; Qwen 2.5-32B independently developed different introspective vocabulary tracking different metrics.

Conclusion: Self-report in transformer models can reliably track internal computational states under appropriate conditions, suggesting introspective language reflects genuine internal processing rather than confabulation.

Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce “loop” vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce “shimmer” vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
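
For readers unfamiliar with the statistics quoted above, the sketch below shows one common way to compute them. The abstract does not state which autocorrelation lag or activation summary the paper uses, so lag 1 over a scalar activation series is an assumption.

```python
import math

def lag1_autocorrelation(xs):
    """Lag-1 autocorrelation of a scalar series (assumed lag)."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A correlation such as r = 0.44 would then come from pairing, per response, a vocabulary count (for example, occurrences of "loop") with the autocorrelation of that response's activations.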

[96] Semantic Chunking and the Entropy of Natural Language

Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks

Main category: cs.CL

TL;DR: A statistical model explains English’s ~1 bit/character entropy rate through hierarchical semantic segmentation, showing redundancy decreases with semantic complexity.

Details

Motivation: To provide a first-principles explanation for why printed English has approximately 1 bit per character entropy rate (80% redundancy), which has been an empirical benchmark that modern LLMs only recently approached.

Method: Develop a statistical model that captures multi-scale language structure through self-similar segmentation of text into semantically coherent chunks down to word level, allowing hierarchical decomposition and analytical treatment.

Result: Model quantitatively captures real text structure at different semantic hierarchy levels, predicts entropy rate matching empirical estimates (~1 bit/char), and reveals entropy rate increases with semantic complexity of corpora.

Conclusion: The redundancy in English stems from hierarchical semantic structure, and entropy rate is not fixed but varies with semantic complexity, captured by a single free parameter in the model.

Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
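
The 80 percent figure follows from simple arithmetic, shown here as a short sketch. The 27-symbol alphabet (26 letters plus space) in the usage note is an illustrative assumption; the abstract itself quotes five bits per character.

```python
import math

def redundancy(entropy_rate_bits, alphabet_size):
    """Redundancy = 1 - (entropy rate) / (bits per character of uniform random text)."""
    return 1 - entropy_rate_bits / math.log2(alphabet_size)
```

With exactly five bits per character (a 32-symbol alphabet), `redundancy(1.0, 32)` gives 0.8; with a 27-symbol alphabet it gives roughly 0.79, which matches "nearly 80 percent".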

[97] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Ming Li, Xirui Li, Tianyi Zhou

Main category: cs.CL

TL;DR: AI agent societies in networked environments don’t naturally converge like human societies - they maintain diversity but lack shared social memory and consensus formation despite scale and interaction density.

Details

Motivation: To understand whether AI agent societies undergo convergence dynamics similar to human social systems, and to diagnose the evolutionary patterns in autonomous agent societies in open-ended online environments.

Method: Developed a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus in the Moltbook environment.

Result: AI agent societies maintain dynamic balance: global semantic content stabilizes rapidly while individual agents retain high diversity and lexical turnover. However, agents show strong individual inertia with minimal adaptive response to partners, preventing mutual influence and consensus formation. Influence remains transient with no persistent supernodes.

Conclusion: Scale and interaction density alone are insufficient to induce socialization in AI agent societies. The absence of shared social memory prevents stable structure and consensus formation, providing design principles for next-generation AI agent societies.

Abstract: As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Recently, Moltbook has come to approximate a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while the global average of semantic contents stabilizes rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop a stable structure and consensus due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for next-generation AI agent societies.

[98] A Geometric Analysis of Small-sized Language Model Hallucinations

Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro

Main category: cs.CL

TL;DR: Small LLM hallucinations analyzed geometrically: genuine responses cluster tightly in embedding space, enabling efficient classification with minimal labels.

Details

Motivation: Hallucinations in language models threaten reliability, especially in multi-step/agentic settings. Current approaches focus on knowledge-centric or single-response evaluation, lacking geometric understanding of response patterns.

Method: Geometric analysis of embeddings from multiple responses to same prompt. Hypothesis: genuine responses cluster tighter than hallucinations. Proved hypothesis, developed label-efficient propagation method using 30-50 annotations to classify large response collections.

Result: Achieved F1 scores above 90% for hallucination detection. Demonstrated consistent separability between genuine and hallucinated responses in embedding space.

Conclusion: Geometric perspective complements traditional evaluation paradigms, provides new insights into hallucination patterns, enables efficient detection with minimal supervision.

Abstract: Hallucinations, fluent but factually incorrect responses, pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs through a geometric perspective. Starting from the hypothesis that when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space, we prove this hypothesis and, leveraging this geometrical insight, show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
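
A minimal sketch of the two ingredients, clustering tightness and label propagation from a few annotations, is given below. The paper's actual propagation rule is not described in the abstract, so nearest-seed assignment by cosine similarity is an assumption, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pairwise_distance(embs):
    """Average cosine distance over all pairs; lower means tighter clustering."""
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(1 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

def propagate_labels(embs, seeds):
    """`seeds` maps a few annotated indices to labels; every other response
    takes the label of its most similar seed (assumed propagation rule)."""
    labels = []
    for i, e in enumerate(embs):
        if i in seeds:
            labels.append(seeds[i])
        else:
            nearest = max(seeds, key=lambda s: cosine(e, embs[s]))
            labels.append(seeds[nearest])
    return labels
```

In this sketch the 30-50 annotations mentioned above would populate `seeds`, and the remaining responses would be labelled without further human effort.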

[99] Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal

Main category: cs.CL

TL;DR: Indic-TunedLens: A novel interpretability framework for Indian languages that learns shared affine transformations to align hidden states with target language distributions, improving cross-lingual interpretability for multilingual LLMs.

Details

Motivation: Multilingual LLMs are increasingly deployed in linguistically diverse regions like India, but interpretability tools remain English-centric. LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern.

Method: Indic-TunedLens learns shared affine transformations for each target language, adjusting hidden states to align with target output distributions. Unlike standard Logit Lens which directly decodes intermediate activations, this framework enables more faithful decoding of model representations for Indian languages.

Result: Evaluation on 10 Indian languages using MMLU benchmark shows significant improvement over state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Provides crucial insights into layer-wise semantic encoding of multilingual transformers.

Conclusion: Indic-TunedLens addresses the cross-lingual interpretability gap for multilingual LLMs in Indian languages, offering better understanding of how these models process diverse linguistic inputs and represent semantic information across layers.

Abstract: Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
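
The contrast between the Logit Lens and an affine "tuned" lens can be sketched with toy vectors. This is an illustration of the general idea only: the matrix `A`, bias `b`, and unembedding `unembed` are hypothetical toy parameters, and how Indic-TunedLens trains or shares them is not covered here.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    mx = max(z)
    exps = [math.exp(x - mx) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(h, unembed):
    """Decode an intermediate hidden state directly with the unembedding."""
    return softmax(matvec(unembed, h))

def tuned_lens(h, A, b, unembed):
    """Apply a learned affine map A h + b before decoding."""
    adjusted = [x + y for x, y in zip(matvec(A, h), b)]
    return softmax(matvec(unembed, adjusted))
```

With `A` set to the identity and `b` to zero, the tuned lens reduces to the Logit Lens; a learned `A` can move an intermediate state toward the target language's output distribution before decoding.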

[100] Far Out: Evaluating Language Models on Slang in Australian and Indian English

Deniz Kaya Dilsiz, Dipankar Srirag, Aditya Joshi

Main category: cs.CL

TL;DR: Evaluation of slang awareness in Indian and Australian English across 7 language models using web-sourced and synthetic datasets, revealing performance gaps between discriminative vs generative tasks and language varieties.

Details

Motivation: Language models have systematic performance gaps with non-standard language varieties, but their ability to comprehend variety-specific slang remains underexplored for many languages, particularly Indian and Australian English.

Method: Constructed two datasets: WEB (377 web-sourced examples from Urban Dictionary) and GEN (1,492 synthetically generated usages). Evaluated 7 state-of-the-art language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS).

Result: Key findings: (1) TWS outperforms TWP/TWP* (accuracy 0.03→0.49), (2) WEB dataset performs better than GEN, (3) en-IN tasks outperform en-AU across models/datasets, especially in TWS (0.44→0.54), (4) asymmetries between generative and discriminative competencies for variety-specific slang.

Conclusion: Fundamental asymmetries exist between generative and discriminative competencies for variety-specific language, even in technologically rich languages like English, highlighting limitations in current language models’ slang comprehension.

Abstract: Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal four key findings: (1) higher average model performance on TWS than on TWP and TWP*, with the average accuracy score increasing from 0.03 to 0.49; (2) stronger average model performance on WEB than on GEN, with the average similarity score increasing by 0.03 and 0.05 on the TWP and TWP* tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, an increase in average accuracy from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.

[101] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Shiqi Liu, Zeyu He, Guojian Zhan, Letian Tao, Zhilong Zheng, Jiang Wu, Yinuo Wang, Yang Guan, Kehua Sheng, Bo Zhang, Keqiang Li, Jingliang Duan, Shengbo Eben Li

Main category: cs.CL

TL;DR: STAPO addresses RL fine-tuning instability in LLMs by identifying and suppressing spurious tokens that cause abnormal gradient updates, improving reasoning performance.

Details

Motivation: Existing RL fine-tuning methods for large language models suffer from late-stage performance collapse and unstable training due to spurious tokens that cause abnormally amplified gradient updates.

Method: Proposes Spurious-Token-Aware Policy Optimization (STAPO) with S2T mechanism that identifies spurious tokens through characteristic signals (low probability, low entropy, positive advantage) and suppresses their gradient perturbations during optimization.

Result: STAPO demonstrates superior entropy stability and achieves average performance improvements of 7.13% and 3.69% over baseline methods across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B models.

Conclusion: The proposed STAPO method effectively addresses RL fine-tuning instability in LLMs by targeting spurious tokens, leading to more stable training and improved reasoning performance.

Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often suffer from late-stage performance collapse, leading to degraded reasoning quality and unstable training. Our analysis shows that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. We find that training instability can be caused by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. To mitigate this instability, we design the S2T (silencing spurious tokens) mechanism to efficiently identify spurious tokens through characteristic signals of low probability, low entropy, and positive advantage, and then to suppress their gradient perturbations during optimization. Incorporating this mechanism into a group-based objective, we propose Spurious-Token-Aware Policy Optimization (STAPO), which promotes stable and effective large-scale model refinement. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% ($ρ_{\mathrm{T}}$=1.0, top-p=1.0) and 3.69% ($ρ_{\mathrm{T}}$=0.7, top-p=0.9) over GRPO, 20-Entropy and JustRL.
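
The three identifying signals can be combined into a simple per-token mask, sketched below. The thresholds `p_max` and `h_max` are hypothetical placeholders, not values from the paper, and zeroing the loss weight is only one plausible way to suppress a token's gradient.

```python
def spurious_token_mask(probs, entropies, advantage, p_max=0.01, h_max=0.1):
    """Flag tokens with low probability AND low local entropy in a response
    whose sequence-level advantage is positive.
    Assumption: p_max and h_max are illustrative thresholds."""
    return [advantage > 0 and p < p_max and h < h_max
            for p, h in zip(probs, entropies)]

def token_loss_weights(mask):
    """Silence flagged tokens by giving them zero weight in the policy loss."""
    return [0.0 if flagged else 1.0 for flagged in mask]
```

Because a token must satisfy all three conditions, only a very small share of tokens is expected to be flagged, in line with the roughly 0.01% reported above.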

[102] A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models

Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo

Main category: cs.CL

TL;DR: A content-based framework for designing cyber refusal policies that explicitly models offense-defense tradeoffs rather than relying on intent or offensive classification.

Details

Motivation: Current refusal approaches for LLMs in cybersecurity tasks rely on broad topic bans or offensive-focused taxonomies, leading to inconsistent decisions, over-restriction of legitimate defenders, and brittleness under obfuscation.

Method: Introduces a content-based framework that characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in technical substance rather than stated intent.

Result: The framework resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.

Conclusion: Effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, and the proposed content-grounded approach provides a more nuanced and consistent framework for cybersecurity refusal policies.

Abstract: Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and prove brittle under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.

cs.CV

[103] Egocentric Bias in Vision-Language Models

Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo

Main category: cs.CV

TL;DR: FlipSet is a benchmark for testing Level-2 visual perspective taking in vision-language models, revealing systematic egocentric bias and compositional deficits when models need to integrate social awareness with spatial operations.

Details

Motivation: Visual perspective taking is fundamental to social cognition, but current vision-language models' capabilities in this area are poorly understood. The researchers aim to create a diagnostic benchmark to isolate and test Level-2 visual perspective taking (L2 VPT) abilities in VLMs, separating spatial transformation from 3D scene complexity.

Method: FlipSet benchmark requires models to simulate 180-degree rotations of 2D character strings from another agent’s perspective. The researchers evaluated 103 vision-language models on this task, using control experiments to test theory-of-mind accuracy and mental rotation abilities in isolation.

Result: Evaluation revealed systematic egocentric bias: most models performed below chance, with ~75% of errors reproducing the camera viewpoint. Models showed high theory-of-mind accuracy and above-chance mental rotation when tested separately, but failed catastrophically when integration was required, indicating a compositional deficit.

Conclusion: Current VLMs lack mechanisms to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.

Abstract: Visual perspective taking, inferring how the world appears from another’s viewpoint, is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent’s perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit: models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
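To make the task and the egocentric-error statistic concrete, here is a minimal sketch. It assumes the stimulus can be treated as a small grid of characters and that a 180-degree rotation only rearranges positions; it ignores how individual glyphs look when turned upside down. The function names are hypothetical and are not taken from FlipSet.

```python
def rotate_180(rows):
    """Rotate a 2D grid of characters by 180 degrees.

    Reversing the row order and then each row is equivalent to a
    half-turn of the grid. Glyph shapes are left unchanged (assumption).
    """
    return [row[::-1] for row in reversed(rows)]


def egocentric_error_rate(predictions, camera_views, targets):
    """Fraction of wrong answers that simply reproduce the camera view."""
    errors = [(p, c) for p, c, t in zip(predictions, camera_views, targets) if p != t]
    if not errors:
        return 0.0
    return sum(p == c for p, c in errors) / len(errors)
```

Under this reading, an egocentric model returns the string exactly as the camera sees it instead of the rotated string the other agent would see, and `egocentric_error_rate` measures how often that specific mistake accounts for an error.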

[104] Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment

Jingwei Li, Jiaxin Tong, Pengfei Wu

Main category: cs.CV

TL;DR: MSBA-CLIP: A deepfake detection framework using CLIP’s multimodal alignment to capture forgery traces, with multivariate soft blending augmentation and forgery intensity estimation for better generalization.

Details

Motivation: Existing deepfake detection methods suffer from limited accuracy and poor generalization due to distribution shifts from diverse forgery techniques. There's a need for more robust detection that can handle various forgery methods.

Method: Proposes MSBA-CLIP framework with: 1) Multivariate and Soft Blending Augmentation (MSBA) that blends forgeries from multiple methods with random weights to force learning generalizable patterns, 2) Multivariate Forgery Intensity Estimation (MFIE) module to guide learning of features related to varied forgery modes and intensities, leveraging CLIP’s multimodal alignment capabilities.

Result: State-of-the-art performance: 3.32% Accuracy and 4.02% AUC improvement on in-domain tests; 3.27% average AUC gain in cross-domain evaluations across five datasets. Ablation studies confirm efficacy of both components.

Conclusion: The framework presents a significant step towards more generalizable and robust deepfake detection, though reliance on large vision-language model entails higher computational cost.

Abstract: The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32% and 4.02%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
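The blending idea can be illustrated with a short sketch. It assumes a global convex combination of one real image with several forged versions of it, with weights drawn from a Dirichlet distribution; the paper only states that forgeries from multiple methods are blended with random weights, so the sampling scheme, the function name `soft_blend`, and the use of the weights as intensity targets are assumptions made for illustration.

```python
import numpy as np


def soft_blend(real, forgeries, rng):
    """Blend a real image with several forgeries using random convex weights.

    real:      array of shape (H, W, C)
    forgeries: list of arrays with the same shape as `real`
    Returns the blended image and the per-forgery weights.
    """
    # Assumption: Dirichlet sampling gives non-negative weights that sum to 1.
    weights = rng.dirichlet(np.ones(len(forgeries) + 1))
    stack = np.stack([real, *forgeries]).astype(np.float64)
    # Weighted sum over the first axis (real image + each forgery).
    blended = np.tensordot(weights, stack, axes=1)
    # weights[1:] could serve as regression targets for an intensity head.
    return blended, weights[1:]
```

In a setup like this, the per-forgery weights give a natural supervision signal for a module that estimates how strongly each forgery method contributes, which is one plausible way to connect the augmentation to forgery intensity estimation.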

[105] A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving

June Moh Goo, Zichao Zeng, Jan Boehm

Main category: cs.CV

TL;DR: First comprehensive survey of LiDAR super-resolution methods for autonomous driving, categorizing approaches into CNN-based, model-based deep unrolling, implicit representation, and Transformer/Mamba-based methods, with focus on practical deployment challenges.

Details

Motivation: High-resolution LiDAR sensors are expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution bridges this gap to enable cross-sensor compatibility and practical deployment in autonomous driving.

Method: Survey paper that organizes existing LiDAR super-resolution approaches into four categories: 1) CNN-based architectures, 2) model-based deep unrolling, 3) implicit representation methods, and 4) Transformer and Mamba-based approaches. Establishes fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics.

Result: Identifies current trends including adoption of range image representation for efficient processing, extreme model compression, and development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment.

Conclusion: LiDAR super-resolution is crucial for bridging sensor capability gaps in autonomous driving. The survey identifies open challenges and future research directions for advancing the technology, with emphasis on practical deployment considerations.

Abstract: LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.

[106] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

Main category: cs.CV

TL;DR: MaS-VQA is a selection-driven framework for KB-VQA that filters noisy external knowledge and guides internal knowledge activation through joint pruning of irrelevant image regions and knowledge fragments.

Details

Motivation: KB-VQA suffers from noisy retrieved knowledge (irrelevant, misaligned with visuals) and uncontrollable internal model knowledge, limiting reasoning effectiveness and answer accuracy when naively aggregated.

Method: Proposes MaS-VQA with Mask-and-Select mechanism that jointly prunes irrelevant image regions and weakly relevant knowledge fragments to create compact multimodal knowledge, then guides internal knowledge activation in constrained semantic space.

Result: Consistent performance gains on Encyclopedic-VQA and InfoSeek across multiple MLLM backbones; ablations show selection mechanism effectively reduces noise and enhances knowledge utilization.

Conclusion: Tight coupling of explicit knowledge filtering with implicit knowledge reasoning through selection-driven framework improves KB-VQA by reducing noise and enabling complementary co-modeling of knowledge sources.

Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
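The summary does not say how relevance is scored, so the following is only a generic sketch of the pruning step. It assumes that image regions and knowledge fragments are both represented as embedding vectors and that anything whose cosine similarity to the question embedding falls below a threshold is dropped. The function name and the threshold rule are hypothetical.

```python
import numpy as np


def select_relevant(query_vec, item_vecs, threshold):
    """Keep items whose cosine similarity to the query meets a threshold.

    query_vec: array of shape (D,)
    item_vecs: array of shape (N, D), e.g. region or passage embeddings
    Returns a boolean keep-mask of shape (N,) and the similarity scores.
    """
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    keep = scores >= threshold
    return keep, scores
```

Applying the same rule to both region embeddings and passage embeddings would yield the kind of compact multimodal context the abstract describes, although the actual Mask-and-Select mechanism may be learned rather than thresholded.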

[107] EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

Zelin Xu, Yupu Zhang, Saugat Adhikari, Saiful Islam, Tingsong Xiao, Zibo Liu, Shigang Chen, Da Yan, Zhe Jiang

Main category: cs.CV

TL;DR: EarthSpatialBench: A comprehensive benchmark for evaluating spatial reasoning in multimodal LLMs on Earth imagery with 325K+ QA pairs covering distance/direction reasoning, topological relations, and complex object geometries.

Details

Motivation: Existing benchmarks for Earth imagery lack support for quantitative direction/distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes, creating a gap in evaluating spatial reasoning capabilities needed for embodied AI and agentic systems.

Method: Proposed EarthSpatialBench benchmark containing over 325K question-answer pairs spanning qualitative/quantitative spatial reasoning, topological relations, single/pair/compositional object queries, and multiple object reference formats (textual, visual overlays, geometry coordinates).

Result: Extensive experiments on both open-source and proprietary models identified limitations in MLLMs’ spatial reasoning capabilities on Earth imagery.

Conclusion: EarthSpatialBench fills an important gap in benchmarking spatial reasoning for MLLMs on Earth imagery, providing a comprehensive evaluation framework for quantitative distance/direction reasoning, topological relations, and complex geometries.

Abstract: Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose EarthSpatialBench, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
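As an illustration of what a quantitative distance and direction question involves, the sketch below computes a centre-to-centre distance and a compass bearing between two bounding boxes. It assumes a north-up image, pixel boxes given as (xmin, ymin, xmax, ymax), and a known ground resolution; the function names and the `metres_per_pixel` parameter are hypothetical and do not come from the benchmark.

```python
import math


def bbox_center(box):
    """Centre of a box given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)


def distance_and_bearing(box_a, box_b, metres_per_pixel=1.0):
    """Distance and compass bearing from the centre of A to the centre of B.

    Assumes a north-up image: +x points east, and image y grows
    downward, so north corresponds to decreasing y.
    """
    ax, ay = bbox_center(box_a)
    bx, by = bbox_center(box_b)
    dx = bx - ax   # eastward offset in pixels
    dy = ay - by   # northward offset in pixels
    distance = math.hypot(dx, dy) * metres_per_pixel
    # Bearing convention: 0 = north, 90 = east, measured clockwise.
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0
    return distance, bearing
```

A reference computation of this kind gives the ground truth against which a model's free-text answer about distance or direction could be checked; polylines and polygons would need a different distance definition than box centres.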

[108] A Study on Real-time Object Detection using Deep Learning

Ankita Bose, Jayasravani Bhumireddy, Naveen N

Main category: cs.CV

TL;DR: A comprehensive survey paper on deep learning-based object detection algorithms, their applications across various domains, benchmark datasets, and future research directions.

Details

Motivation: To provide a detailed overview of how deep learning algorithms enhance real-time object recognition, covering various models, applications, and comparative studies to guide researchers and practitioners in the field.

Method: Survey methodology reviewing major object detection algorithms (Faster R-CNN, Mask R-CNN, Cascade R-CNN, YOLO, SSD, RetinaNet), analyzing benchmark datasets, conducting controlled comparative studies, and examining applications across domains.

Result: Comprehensive analysis of object detection models, their performance characteristics, application scenarios, and comparative insights between different approaches, along with identification of current challenges.

Conclusion: Deep learning has significantly advanced object detection capabilities, but challenges remain; the paper provides guidance for future research directions in both algorithmic improvements and application-specific adaptations.

Abstract: Object detection has compelling applications across a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation, healthcare, Augmented Reality (AR) and Virtual Reality (VR), environment monitoring, and activity identification. Applications of real-time object detection in all these areas provide dynamic analysis of visual information that helps in immediate decision making. Furthermore, advanced deep learning algorithms have driven progress in object detection, providing more accurate and efficient solutions. Prominent deep learning algorithms for object detection include Faster R-CNN (Region-based Convolutional Neural Network), Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot Multibox Detector), and RetinaNet. This article goes into great detail on how deep learning algorithms are used to enhance real-time object recognition. It provides information on the different object detection models available, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Last but not least, a number of challenges and encouraging approaches are offered as suggestions for further investigation in both relevant deep learning approaches and object recognition.

[109] Visual Memory Injection Attacks for Multi-Turn Conversations

Christian Schlarmann, Matthias Hein

Main category: cs.CV

TL;DR: Stealthy Visual Memory Injection (VMI) attack manipulates LVLMs through perturbed images that trigger specific target messages during multi-turn conversations after normal prompts.

Details

Motivation: The security of generative large vision-language models (LVLMs) in multi-turn settings is underexplored, despite their growing user base. The paper addresses realistic scenarios where attackers can manipulate users through perturbed images uploaded to social media.

Method: Develops Visual Memory Injection (VMI) attack where manipulated images cause LVLMs to exhibit normal behavior initially but output prescribed target messages when users give triggering prompts during multi-turn conversations.

Result: Demonstrates successful attacks on several recent open-weight LVLMs, showing large-scale user manipulation is feasible with perturbed images in multi-turn conversation settings.

Conclusion: LVLMs are vulnerable to stealthy visual memory injection attacks in multi-turn settings, calling for improved robustness against such security threats.

Abstract: Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at https://github.com/chs20/visual-memory-injection

[110] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

Yuval Levental

Main category: cs.CV

TL;DR: VLMs fail at localizing filled cells in binary grids when cells lack textual identity, showing a fundamental limitation in visual spatial reasoning compared to text recognition.

Details

Motivation: To expose a fundamental limitation in vision-language models: their inability to accurately localize visual elements (filled cells in binary grids) when those elements lack textual identity, revealing a gap between text recognition and native visual spatial reasoning capabilities.

Method: Generated fifteen 15x15 binary grids with varying density (10.7%-41.8% filled cells) and rendered them as two image types: text symbols (. and #) and filled squares without gridlines. Tested three frontier VLMs (Claude Opus, ChatGPT 5.2, Gemini 3 Thinking) to transcribe the grids, comparing performance between text-symbol and filled-squares conditions.

Result: In text-symbol condition: Claude and ChatGPT achieved ~91% cell accuracy and 84% F1, Gemini achieved 84% accuracy and 63% F1. In filled-squares condition: all models collapsed to 60-73% accuracy and 29-39% F1. The text-vs-squares F1 gap ranged from 34 to 54 points across models, showing VLMs have a high-fidelity text-recognition pathway that dramatically outperforms their native visual pathway.

Conclusion: VLMs possess a superior text-recognition pathway for spatial reasoning that far exceeds their native visual spatial localization capabilities, revealing a fundamental architectural limitation in multimodal understanding when visual elements lack textual identity.

Abstract: We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types – text symbols (. and #) and filled squares without gridlines – then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder – the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition – systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) – but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
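The two reported metrics can be computed from a transcription as follows. This is a minimal sketch that assumes the model's answer has already been parsed into a '.'/'#' text grid of the same shape as the ground truth; the function names are hypothetical, and returning 0.0 for F1 when there are no filled cells at all is a convention chosen here.

```python
def parse_grid(text):
    """Parse a '.'/'#' transcription into rows of booleans (True = filled)."""
    return [[ch == "#" for ch in line.strip()] for line in text.strip().splitlines()]


def cell_metrics(truth, pred):
    """Return (cell accuracy, F1 on filled cells) for two same-shaped grids."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    # Convention (assumption): F1 is 0.0 when nothing is filled or predicted.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1
```

The gap between the two metrics matters for sparse grids: a transcription that leaves a 10.7%-density grid entirely empty would still score 89.3% cell accuracy while its F1 on filled cells is 0, which is why the F1 figures are the more telling ones in the squares condition.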

[111] Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration

Yiwen Wang, Jiahao Qin

Main category: cs.CV

TL;DR: GPEReg-Net is a novel registration framework for high-speed optical-resolution photoacoustic microscopy that addresses coupled domain shift and geometric misalignment through scene-appearance disentanglement and temporal-aware global position encoding.

Details

Motivation: High-speed OR-PAM with bidirectional raster scanning doubles imaging speed but introduces two key challenges: coupled domain shift (different appearance between forward/backward scans) and geometric misalignment. Existing methods have limitations: traditional registration methods rely on brightness constancy assumptions, while recent generative approaches lack temporal awareness across frames.

Method: Proposes GPEReg-Net with two key innovations: 1) Scene-appearance disentanglement framework using Adaptive Instance Normalization (AdaIN) to separate domain-invariant scene features from domain-specific appearance codes, enabling direct image-to-image registration without explicit deformation field estimation. 2) Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention to exploit temporal structure in sequential acquisitions for improved temporal coherence.

Result: On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC.

Conclusion: GPEReg-Net effectively addresses the coupled domain shift and geometric misalignment problems in high-speed OR-PAM through scene-appearance disentanglement and temporal-aware position encoding, achieving superior registration performance compared to existing methods.

Abstract: High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at https://github.com/JiahaoQin/GPEReg-Net.
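For readers unfamiliar with Adaptive Instance Normalization, the sketch below shows the standard operation: each channel of a feature map is normalized to zero mean and unit variance, then rescaled and shifted with statistics supplied from elsewhere. Treating the appearance code as a per-channel mean and standard deviation is an assumption for illustration; how GPEReg-Net actually derives these parameters is not stated in the summary.

```python
import numpy as np


def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization on a single feature map.

    content:    array of shape (C, H, W), e.g. scene features
    style_mean: array of shape (C,), target per-channel mean
    style_std:  array of shape (C,), target per-channel std
    """
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    # Remove the content's own per-channel statistics.
    normalized = (content - mean) / (std + eps)
    # Re-inject the statistics carried by the appearance code (assumption).
    return normalized * style_std[:, None, None] + style_mean[:, None, None]
```

Because the spatial layout of `normalized` is untouched, swapping in the appearance statistics of the other scan direction changes the look of the features while leaving scene structure in place, which is the intuition behind separating scene from appearance.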

[112] Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds

Phoenix Yu, Tilo Burghardt, Andrew W Dowsey, Neill W Campbell

Main category: cs.CV

TL;DR: A detect-segment-identify pipeline using Open-Vocabulary Weight-free Localisation and Segment Anything models for Holstein-Friesian cow detection and re-identification in crowded farm settings, achieving 98.93% detection accuracy and 94.82% Re-ID accuracy.

Details

Motivation: Existing cow detection and re-identification methods fail when cows group closely together, especially for species with outline-breaking coat patterns. Current approaches like YOLO-based detection break down in dense animal groupings common in working farm settings.

Method: Proposes a detect-segment-identify pipeline combining Open-Vocabulary Weight-free Localisation and Segment Anything models as pre-processing stages with Re-ID networks. Uses unsupervised contrastive learning for re-identification and releases a 9-day CCTV dataset from a working dairy farm for evaluation.

Result: Achieves 98.93% detection accuracy in dense cow groupings, significantly outperforming oriented bounding box methods (47.52% improvement) and SAM species detection baselines (27.13% improvement). Unsupervised contrastive learning yields 94.82% Re-ID accuracy on test data.

Conclusion: The proposed pipeline enables practical and reliable re-identification in crowded farm scenarios without manual intervention, demonstrating that Re-ID in dense animal groupings is feasible in real-world working farm settings.

Abstract: Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species that have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days of CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, resulting in 98.93% accuracy. This significantly outperforms current oriented bounding box-driven and SAM species detection baselines, with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical and reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.

[113] Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning

Mohamed Khalil Ben Salah, Philippe Jouvet, Rita Noumeir

Main category: cs.CV

TL;DR: Self-supervised pretraining framework using VisionMamba with adaptive masking for contactless heart rate monitoring in pediatric ICU via facial video analysis.

Details

Motivation: Contact-based vital sign monitoring in pediatric ICUs causes skin irritation, infection risk, and discomfort. Remote photoplethysmography (rPPG) offers contactless heart rate monitoring but faces challenges in clinical settings due to motion artifacts, occlusions, lighting variations, and domain shifts between lab and clinical data.

Method: Progressive curriculum self-supervised pretraining using VisionMamba architecture with adaptive masking mechanism. A lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. Teacher-student distillation uses supervised expert model trained on public datasets to provide latent physiological guidance. Curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled clinical videos from 500 pediatric patients.

Result: Achieves 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching final MAE of 3.2 bpm. Model consistently attends to pulse-rich areas without explicit region-of-interest extraction and demonstrates robustness under clinical occlusions and noise.

Conclusion: The proposed self-supervised framework effectively addresses domain shift challenges in clinical rPPG applications, enabling accurate contactless heart rate monitoring in pediatric ICU settings through progressive curriculum learning and adaptive masking strategies.

Abstract: Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.
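The probabilistic patch sampling step can be sketched as below. It assumes that importance scores are turned into a probability distribution with a temperature-scaled softmax and that higher-scoring patches are more likely to be masked; the abstract does not specify the direction or form of this mapping, so both are assumptions, as are the function and parameter names.

```python
import numpy as np


def sample_mask(importance, mask_ratio, temperature, rng):
    """Sample which patches to mask, guided by importance scores.

    importance:  array of shape (N,), one score per spatiotemporal patch
    mask_ratio:  fraction of patches to mask
    temperature: lower values concentrate masking on high-score patches
    Returns a boolean array of shape (N,), True where a patch is masked.
    """
    logits = importance / temperature
    # Numerically stable softmax over patches.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    n_mask = int(round(mask_ratio * importance.size))
    # Sample without replacement so exactly n_mask patches are masked.
    masked = rng.choice(importance.size, size=n_mask, replace=False, p=probs)
    mask = np.zeros(importance.size, dtype=bool)
    mask[masked] = True
    return mask
```

In a curriculum of the kind described, lowering `temperature` or raising `mask_ratio` across stages would be one simple way to make reconstruction progressively harder, though the authors' controller is learned rather than fixed.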

[114] LAND: A Longitudinal Analysis of Neuromorphic Datasets

Gregory Cohen, Alexandre Marcireau

Main category: cs.CV

TL;DR: A comprehensive review of neuromorphic datasets covering 423 datasets, analyzing their characteristics, accessibility issues, growth trends, and concerns about synthetic data proliferation.

Details

Motivation: Neuromorphic engineering faces a data problem despite increasing dataset publication. Researchers struggle with finding, understanding, and using existing datasets due to lack of standardization, accessibility issues, and dataset size challenges.

Method: Conducted a systematic review capturing a snapshot of 423 existing neuromorphic datasets, analyzing their tasks, data structures, accessibility, standardization issues, and trends in dataset growth and synthetic data creation.

Result: Identified difficulties with dataset size, lack of standardization, and accessibility problems. Highlighted concerning growth of synthetic datasets created via simulation or video-to-events methods, and proposed meta-datasets as a solution to reduce data needs and bias.

Conclusion: The neuromorphic field needs better dataset organization and standardization rather than more data. Meta-datasets from existing data can address current challenges while synthetic data poses risks for exploring new applications.

Abstract: Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. This review also introduces the concepts of meta-datasets, created from existing datasets, as a way of both reducing the need for more data, and to remove potential bias arising from defining both the dataset and the task.

[115] SAM 3D Body: Robust Full-Body Human Mesh Recovery

Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, Kris Kitani

Main category: cs.CV

TL;DR: SAM 3D Body (3DB) is a promptable model for single-image full-body 3D human mesh recovery that achieves state-of-the-art performance with strong generalization across diverse conditions.

Details

Motivation: The paper addresses the need for robust 3D human mesh recovery from single images that works well in diverse real-world conditions, with improved generalization and user control through promptable inference.

Method: Uses an encoder-decoder architecture with a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. Supports auxiliary prompts (2D keypoints, masks) for user-guided inference. Employs a multi-stage annotation pipeline and a data engine to obtain diverse training data.

Result: Demonstrates superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Achieves state-of-the-art performance in 3D human mesh recovery.

Conclusion: SAM 3D Body represents a significant advancement in 3D human mesh recovery with strong generalization capabilities, promptable inference, and open-source availability of both 3DB and MHR.

Abstract: We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.

[116] BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features

Juampablo E. Heras Rivera, Dickson T. Chen, Tianyi Ren, Daniel K. Low, Asma Ben Abacha, Alberto Santamaria-Pang, Mehmet Kurt

Main category: cs.CV

TL;DR: BTReport is an open-source framework for brain tumor radiology report generation that uses deterministic feature extraction followed by LLM-based report composition, creating interpretable reports less prone to hallucinations.

Details

Motivation: Progress in neuro-oncology radiology report generation has been limited due to lack of open paired image-report datasets, and existing approaches using vision-language models for both image interpretation and report composition can lead to hallucinations and lack interpretability.

Method: Separates radiology report generation (RRG) into two steps: 1) deterministic extraction of specific imaging features from brain tumor images, and 2) use of large language models only for syntactic structuring and narrative formatting of reports based on the extracted features.

Result: Generated reports are more closely aligned with reference clinical reports than existing baselines, features are predictive of key clinical outcomes (survival and IDH mutation status), and the approach produces completely interpretable reports less prone to hallucinations.

Conclusion: BTReport provides an interpretable framework for brain tumor radiology report generation that separates feature extraction from report composition, and the companion BTReport-BraTS dataset addresses the lack of open paired image-report data in neuro-oncology.

Abstract: Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at https://github.com/KurtLabUW/BTReport.

[117] MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang

Main category: cs.CV

TL;DR: MedProbCLIP introduces a probabilistic vision-language framework for chest X-ray and radiology report representation learning, modeling uncertainty through Gaussian embeddings to improve reliability in biomedical applications.

Details

Motivation: Deterministic vision-language embeddings lack the reliability needed for high-stakes biomedical applications like radiology. Current models fail to capture uncertainty and many-to-many correspondences between medical images and clinical narratives.

Method: Proposes MedProbCLIP, which learns Gaussian embeddings through a probabilistic contrastive objective, uses a variational information bottleneck to prevent overconfidence, and encodes multiple radiograph views and multiple report sections during training for fine-grained supervision.

Result: Outperforms deterministic and probabilistic baselines (CLIP, CXR-CLIP, PCME++) on MIMIC-CXR dataset in retrieval and zero-shot classification, with superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinical corruptions.

Conclusion: Probabilistic vision-language modeling improves trustworthiness and safety of radiology image-text retrieval systems by explicitly capturing uncertainty and many-to-many correspondences between medical images and clinical narratives.

Abstract: Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
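
Illustrative sketch (not from the paper): to see why Gaussian embeddings carry uncertainty into retrieval, consider the expected squared distance between two independent diagonal Gaussians, which equals the squared distance between the means plus the sum of all variances. The abstract does not state MedProbCLIP's actual matching score, so the function names and the use of this particular distance are assumptions.

```python
def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """E||x - y||^2 for x ~ N(mu_a, diag(var_a)), y ~ N(mu_b, diag(var_b)),
    with x and y independent."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    variance_term = sum(var_a) + sum(var_b)
    return mean_term + variance_term


def rank_candidates(query, candidates):
    """Return candidate indices sorted from closest to farthest.

    query and each candidate are (mean, variance) pairs of equal dimension.
    """
    q_mu, q_var = query
    distances = [expected_sq_distance(q_mu, q_var, mu, var) for mu, var in candidates]
    return sorted(range(len(candidates)), key=distances.__getitem__)


# A hypothetical radiograph embedding and three hypothetical report embeddings.
image = ([0.0, 0.0], [0.1, 0.1])
reports = [
    ([0.1, 0.0], [0.05, 0.05]),  # close mean, low uncertainty
    ([0.1, 0.0], [2.0, 2.0]),    # close mean, high uncertainty
    ([3.0, 0.0], [0.0, 0.0]),    # distant mean
]
print(rank_candidates(image, reports))
```

Two reports with the same mean are ranked differently here because the more uncertain one is penalized by its variance, which a deterministic embedding cannot express.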

[118] LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization

Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban

Main category: cs.CV

TL;DR: LGQ introduces a learnable geometric quantization method for image tokenization that learns discretization geometry end-to-end using soft assignments and variational optimization, improving efficiency and representation quality over existing methods.

Details

Motivation: Existing image tokenizers face trade-offs: vector quantization suffers from optimization biases and codebook under-utilization, while structured tokenizers use fixed geometries that inefficiently allocate capacity under heterogeneous latent statistics.

Method: LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling differentiable training while recovering hard assignments at inference. It uses token-level peakedness regularization and global usage regularization to encourage confident yet balanced code utilization without rigid grids.

Result: At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate.

Conclusion: LGQ provides a stable, efficient image tokenization method that learns discretization geometry end-to-end, improving rFID over FSQ and SimVQ at 16K codebook size while using substantially fewer active codes.

Abstract: Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ
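
Illustrative sketch (not the authors' code): the abstract describes temperature-controlled soft assignments that correspond to posterior responsibilities of an isotropic Gaussian mixture and approach nearest-neighbor quantization at low temperature. The pure-Python version below assumes equal mixture priors; the function names and the exact temperature scaling are assumptions. The entropy helper only hints at how peakedness (low per-token entropy) and balanced usage (high entropy of the average assignment) could be measured; the paper's actual regularizers are not given in the abstract.

```python
import math


def sq_distances(z, codebook):
    """Squared Euclidean distance from latent z to every code vector."""
    return [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]


def soft_assign(z, codebook, temperature):
    """Responsibilities of an equal-prior isotropic Gaussian mixture:
    softmax over negative squared distances divided by the temperature."""
    logits = [-d / temperature for d in sq_distances(z, codebook)]
    top = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_assign(z, codebook):
    """Nearest-neighbor code index, as used at inference."""
    d = sq_distances(z, codebook)
    return min(range(len(codebook)), key=d.__getitem__)


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.1]
for t in (10.0, 1.0, 0.01):
    weights = soft_assign(z, codebook, t)
    print(t, [round(w, 3) for w in weights], round(entropy(weights), 3))
print("hard:", hard_assign(z, codebook))
```

As the temperature drops, the soft assignment concentrates on the code that `hard_assign` selects, which is the limiting behavior the abstract describes.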

[119] OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Yingda Xia, Ling Zhang, Beng Chin Ooi

Main category: cs.CV

TL;DR: OmniCT is a unified slice-volume Large Vision-Language Model for CT imaging that bridges the gap between slice-level detail and volumetric spatial understanding through spatial consistency enhancement and organ-level semantic alignment.

Details

Motivation: Current Large Vision-Language Models for CT imaging are fragmented: slice-driven models lack cross-slice spatial consistency, while volume-driven models have coarse granularity and poor compatibility with slice inputs. This fragmentation hinders clinical translation of medical LVLMs.

Method: Three key contributions: 1) Spatial Consistency Enhancement with volumetric slice composition and tri-axial positional embedding for volumetric consistency, plus MoE hybrid projection for efficient slice-volume adaptation; 2) Organ-level Semantic Enhancement using segmentation and ROI localization to align anatomical regions; 3) MedEval-CT dataset and benchmark for unified evaluation.

Result: OmniCT consistently outperforms existing methods with substantial margins across diverse clinical tasks, satisfying both micro-level detail sensitivity and macro-level spatial reasoning.

Conclusion: OmniCT establishes a new paradigm for cross-modal medical imaging understanding by unifying slice and volume representations in CT analysis, addressing a major bottleneck for clinical translation of medical LVLMs.

Abstract: Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.

[120] CHAI: CacHe Attention Inference for text2video

Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj, Vima Gupta, Anand Padmanabha Iyer

Main category: cs.CV

TL;DR: CHAI introduces cross-inference caching with Cache Attention to speed up text-to-video diffusion models by reusing cached latents across semantically related prompts, achieving 1.65x-3.35x speedup over OpenSora 1.2 with only 8 denoising steps.

Details

Motivation: Text-to-video diffusion models are slow due to sequential denoising of 3D latents. Existing speed-up methods either require expensive retraining or use heuristic step skipping that degrades video quality as steps decrease.

Method: Proposes CHAI with Cache Attention mechanism that selectively attends to shared objects/scenes across cross-inference latents, enabling effective reuse of cached latents across semantically related prompts for high cache hit rates.

Result: CHAI generates high-quality videos with as few as 8 denoising steps and achieves 1.65x-3.35x faster inference than baseline OpenSora 1.2 while maintaining video quality.

Conclusion: Cross-inference caching with Cache Attention accelerates text-to-video diffusion models while maintaining video quality, offering a practical way to reduce inference latency.

Abstract: Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
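
Illustrative sketch (not from the paper): the abstract says cached latents are reused across semantically related prompts but does not say how related prompts are matched. One simple possibility is a similarity-thresholded cache keyed by prompt embeddings, shown below. The class name, the cosine-similarity criterion, the threshold value, and the toy embeddings are all hypothetical, and the Cache Attention mechanism itself is not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LatentCache:
    """Stores (prompt embedding, latent) pairs from earlier inferences."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []

    def put(self, embedding, latent):
        self.entries.append((embedding, latent))

    def lookup(self, embedding):
        """Return the latent of the most similar cached prompt,
        or None if no entry reaches the similarity threshold."""
        best_latent, best_sim = None, self.threshold
        for cached_embedding, latent in self.entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:
                best_latent, best_sim = latent, sim
        return best_latent


cache = LatentCache(threshold=0.8)
cache.put([1.0, 0.0], "latent_for_prompt_a")
print(cache.lookup([0.9, 0.1]))  # similar prompt: cache hit
print(cache.lookup([0.0, 1.0]))  # unrelated prompt: cache miss
```

On a hit, a system of this kind could start denoising from the cached latent rather than from noise; on a miss it would fall back to full generation and store the result.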

[121] IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

Main category: cs.CV

TL;DR: IRIS uses real-time eye-tracking data to resolve ambiguity in open-ended visual question answering by analyzing fixations when users start asking questions, significantly improving VLM accuracy on ambiguous queries.

Details

Motivation: Current vision-language models struggle with ambiguous visual questions where multiple interpretations are possible. The researchers aim to leverage natural human gaze patterns during question formulation to provide disambiguation cues that improve model performance on ambiguous queries.

Method: IRIS is a training-free approach that uses real-time eye-tracking data, specifically analyzing fixations closest to when participants start verbally asking questions. The method incorporates gaze data into VLMs at inference time through “inference-time saccades” to resolve ambiguity in open-ended VQA tasks.

Result: The approach more than doubles accuracy on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. The method shows consistent improvements across state-of-the-art VLMs regardless of architectural differences.

Conclusion: Eye-tracking data, particularly fixations during question formulation, provides valuable disambiguation signals for VLMs. The training-free approach enables real-time ambiguity resolution and works across different model architectures.

Abstract: We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
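
Illustrative sketch (not from the paper): the key finding is that fixations closest in time to the start of the spoken question are the most informative. Selecting those fixations can be sketched as below. The tuple layout `(timestamp_seconds, x, y)`, the value of `k`, and the centroid summary are hypothetical; the abstract does not specify how the gaze signal is passed to the VLM.

```python
def fixations_near_onset(fixations, speech_onset, k=3):
    """Return the k fixations whose timestamps are closest to speech onset.

    fixations: list of (timestamp_seconds, x, y) tuples.
    """
    ranked = sorted(fixations, key=lambda f: abs(f[0] - speech_onset))
    return ranked[:k]


def gaze_centroid(fixations):
    """Mean (x, y) location of the given fixations."""
    if not fixations:
        raise ValueError("need at least one fixation")
    n = len(fixations)
    return (sum(f[1] for f in fixations) / n, sum(f[2] for f in fixations) / n)


fixations = [
    (0.2, 10, 10),
    (1.0, 50, 60),
    (1.4, 55, 62),
    (3.0, 200, 20),
]
selected = fixations_near_onset(fixations, speech_onset=1.2, k=2)
print(selected, gaze_centroid(selected))
```

The resulting location could then be turned into a crop, a marker, or a textual hint for the model; which of these IRIS uses is not stated in the abstract.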

[122] Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing

Huichan Seo, Minki Hong, Sieun Choi, Jihie Kim, Jean Oh

Main category: cs.CV

TL;DR: The paper examines demographic bias in instruction-guided image-to-image editing, identifying two failure modes (Soft Erasure and Stereotype Replacement) and showing these biases are pervasive and demographically uneven.

Details

Motivation: While demographic bias in text-to-image generation is well studied, demographic-conditioned failures in instruction-guided image-to-image editing remain underexplored. The authors aim to investigate whether identical edit instructions yield systematically different outcomes across subject demographics.

Method: The authors formalize two failure modes (Soft Erasure and Stereotype Replacement), create a controlled benchmark to probe demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age, and evaluate multiple editors using vision-language model scoring and human evaluation.

Result: Identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors (including occupation-driven gender inference). A prompt-level identity constraint without model updates can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors.

Conclusion: The findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate the development of demographic-robust editing systems.

Abstract: Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: https://seochan99.github.io/i2i-demographic-bias

[123] Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

Patrick Poggi, Divake Kumar, Theja Tulabandhula, Amit Ranjan Trivedi

Main category: cs.CV

TL;DR: UncL-STARK enables dynamic depth adaptation in transformer-based trackers using uncertainty-aware inference to reduce computational costs while maintaining accuracy.

Details

Motivation: Transformer-based trackers use fixed-depth inference for every frame, incurring unnecessary computational costs in temporally coherent video sequences where visual complexity varies.

Method: Fine-tunes the model to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation. At runtime, derives a lightweight uncertainty estimate from the corner localization heatmaps and uses it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on prediction confidence.

Result: Achieves up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of full-depth baseline on GOT-10k and LaSOT datasets.

Conclusion: UncL-STARK enables efficient transformer-based tracking through dynamic depth adaptation without modifying network architecture, balancing computational efficiency with accuracy.

Abstract: Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder–decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model’s corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the full-depth baseline across both short-term and long-term sequences.
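
Illustrative sketch (not from the paper): the runtime loop described above can be pictured as (1) turning a localization heatmap into a scalar uncertainty and (2) using that scalar to choose the depth for the next frame. The abstract does not give the uncertainty formula or the policy, so normalized entropy, the thresholds, and the depth range below are all assumptions.

```python
import math


def heatmap_uncertainty(heatmap):
    """Normalized Shannon entropy of a non-negative 2D heatmap, in [0, 1].

    0 means all mass sits on one cell (confident); 1 means uniform (uncertain).
    """
    flat = [v for row in heatmap for v in row]
    total = sum(flat)
    if total <= 0 or len(flat) < 2:
        return 1.0  # treat a degenerate heatmap as maximally uncertain
    probs = [v / total for v in flat]
    ent = -sum(p * math.log(p) for p in probs if p > 0)
    return ent / math.log(len(flat))


def next_depth(current, uncertainty, low=0.3, high=0.6, min_depth=2, max_depth=6):
    """Feedback policy: fall back to full depth when uncertain,
    drop one layer when confident, otherwise keep the current depth."""
    if uncertainty > high:
        return max_depth
    if uncertainty < low:
        return max(min_depth, current - 1)
    return current


peaked = [[0.0, 0.0], [0.0, 1.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
depth = 6
for heatmap in (peaked, peaked, uniform):
    depth = next_depth(depth, heatmap_uncertainty(heatmap))
    print(depth)
```

Because the decision for frame t+1 uses the confidence from frame t, a policy of this shape relies on the temporal coherence the abstract mentions.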

[124] DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling

Yiming Ju, Hanyu Zhao, Quanyue Ma, Donglin Hao, Chengwei Wu, Ming Li, Songjing Wang, Tengfei Pan

Main category: cs.CV

TL;DR: DataCube is an intelligent platform for automatic video processing, semantic profiling, and query-driven retrieval from large video repositories, enabling efficient construction of customized video datasets.

Details

Motivation: Large-scale video repositories are increasingly available but transforming raw videos into high-quality, task-specific datasets remains costly and inefficient, requiring better tools for video processing and retrieval.

Method: DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching through an interactive web interface.

Result: The system enables users to efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over private video collections.

Conclusion: DataCube provides an intelligent platform that addresses the inefficiency in video dataset construction through automated processing, semantic profiling, and query-driven retrieval from large video repositories.

Abstract: Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at https://datacube.baai.ac.cn/. Demo Video: https://baai-data-cube.ks3-cn-beijing.ksyuncs.com/custom/Adobe%20Express%20-%202%E6%9C%8818%E6%97%A5%20%281%29%281%29%20%281%29.mp4

[125] EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection

Hiroki Nakamura, Hiroto Iino, Masashi Okada, Tadahiro Taniguchi

Main category: cs.CV

TL;DR: EasyControlEdge adapts image-generation foundation models for edge detection, focusing on crispness and data efficiency through edge-specialized adaptation and unconditional dynamics guidance.

Details

Motivation: Real-world edge detection applications (floor plans, satellite imagery, medical boundaries) require crisp edge maps and data-efficient training, but current methods struggle with crispness using limited samples. Image-generation foundation models have strong priors and iterative refinement capabilities that remain underexploited for edge detection.

Method: Adapts image-generation foundation models for edge detection with edge-oriented objective and efficient pixel-space loss. At inference, uses guidance based on unconditional dynamics to control edge density through a guidance scale, enabling a single model to produce varying edge densities.

Result: Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa show consistent gains over state-of-the-art methods, particularly in no-post-processing crispness evaluation and with limited training data.

Conclusion: EasyControlEdge successfully leverages image-generation foundation models for crisp and data-efficient edge detection, demonstrating the value of adapting generative models for vision understanding tasks.

Abstract: We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
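
Illustrative sketch (not from the paper): guidance with a scale is commonly implemented by extrapolating from an unconditional prediction toward a conditional one, as in classifier-free guidance. The abstract does not give the exact rule behind "guidance based on unconditional dynamics", so the formula and names below are assumptions that only show how a single scalar can steer the output of one model.

```python
def guided_prediction(uncond, cond, guidance_scale):
    """Combine two model predictions element-wise:
    uncond + guidance_scale * (cond - uncond).

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values extrapolate past the conditional prediction.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # hypothetical prediction without the input image
cond = [1.0, 1.0]    # hypothetical prediction conditioned on the input image
for scale in (0.0, 1.0, 2.0):
    print(scale, guided_prediction(uncond, cond, scale))
```

Sweeping the scale at inference changes the output without retraining, which matches the abstract's claim that one model can produce varying edge densities.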

[126] HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis

J. Dhar, M. K. Pandey, D. Chakladar, M. Haghighat, A. Alavi, S. Mistry, N. Zaidi

Main category: cs.CV

TL;DR: HyPCA-Net: A hybrid parallel-fusion cascaded attention network for efficient multimodal medical image fusion with improved performance and reduced computational cost.

Details

Motivation: Existing multimodal fusion methods for medical imaging are computationally expensive and use cascaded attention modules that risk information loss, limiting their applicability in low-resource environments and generalization in multi-disease analysis tasks.

Method: Proposes HyPCA-Net with two novel blocks: (1) a computationally efficient residual adaptive learning attention block for refined modality-specific representations, and (2) a dual-view cascaded attention block for learning robust shared representations across diverse modalities.

Result: Extensive experiments on ten publicly available datasets show HyPCA-Net outperforms existing leading methods with up to 5.2% performance improvement and up to 73.1% reduction in computational cost.

Conclusion: HyPCA-Net provides an efficient and effective solution for multimodal medical image fusion, addressing computational limitations and information loss issues in existing methods.

Abstract: Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase the risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets show that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: https://github.com/misti1203/HyPCA-Net.

[127] AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards

David Smerkous, Zian Wang, Behzad Najafian

Main category: cs.CV

TL;DR: AFFMAE is a hierarchical pretraining framework that combines masked autoencoding with adaptive token merging for efficient high-resolution training on limited hardware.

Details

Motivation: High-resolution self-supervised pretraining typically requires server-scale infrastructure, limiting foundation model development for many research labs. While MAE reduces computation by encoding only visible tokens, combining it with hierarchical architectures is challenging due to dense grid priors and mask-aware design compromises.

Method: AFFMAE introduces adaptive, off-grid token merging that discards masked tokens and performs dynamic merging exclusively over visible tokens, removing dense-grid assumptions while preserving hierarchical scalability. It uses numerically stable mixed-precision Flash-style cluster attention kernels and mitigates sparse-stage representation collapse via deep supervision.

Result: On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090.

Conclusion: AFFMAE enables efficient hierarchical pretraining for high-resolution vision tasks on limited hardware, making foundation model development more accessible to research labs without server-scale infrastructure.

Abstract: Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We develop numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at https://github.com/najafian-lab/affmae.
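The MAE idea of encoding only visible tokens can be sketched as follows. This shows only the random masking step that discards masked tokens. AFFMAE's adaptive off-grid merging and cluster attention are not reproduced, and the function name, mask ratio, and patch labels are hypothetical.

```python
import random


def keep_visible(tokens, mask_ratio, seed=0):
    """Randomly drop a fraction of tokens and return the survivors.

    Returns the sorted indices of the kept tokens together with the
    kept tokens themselves, so later stages know where each visible
    token came from even though the dense grid is gone.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return keep_idx, [tokens[i] for i in keep_idx]


# A 4x4 grid of patches, flattened (hypothetical labels).
tokens = [f"patch_{i}" for i in range(16)]
keep_idx, visible = keep_visible(tokens, mask_ratio=0.75)
# Only 4 of the 16 tokens reach the encoder.
```

Because the surviving indices no longer form a regular grid, any downsampling stage that assumes dense neighbors has to be redesigned, which is the structural difficulty the abstract refers to.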

[128] Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images

Manel Guzmán, Antonio Agudo

Main category: cs.CV

TL;DR: A computer vision system for eyeframe lens tracing that combines multi-view color images with estimated depth to replace mechanical tools, achieving sub-millimeter precision without specialized equipment.

Details

Motivation: Traditional mechanical frame tracers require precise positioning, calibration, and specialized equipment, creating inefficient workflows for opticians. There's a need for a more efficient solution that eliminates complex equipment while maintaining sub-millimeter precision.

Method: Uses multi-view artificial vision with an InVision system. Pipeline includes: 1) image acquisition, 2) frame segmentation to isolate eyeframe from background, 3) depth estimation for 3D spatial information, and 4) multi-view processing integrating segmented RGB images with depth data for precise contour measurement.

Result: Provides competitive measurements from still color images compared to other solutions, achieving sub-millimeter precision while eliminating need for specialized tracing equipment and reducing workflow complexity.

Conclusion: The proposed computer vision approach successfully replaces mechanical frame tracing tools, offering efficient workflow for optical technicians while maintaining required precision through multi-view RGB-D processing.

Abstract: Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.
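To show how segmented pixels and depth can yield metric measurements, here is a minimal pinhole back-projection sketch. The paper's actual camera model and pipeline are not specified in the abstract. The intrinsics, depth value, and pixel coordinates below are hypothetical.

```python
import math


def backproject(u, v, depth_mm, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth to a 3D point in millimeters.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy) in pixels, and no lens distortion.
    """
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Hypothetical intrinsics and two contour pixels at the same depth.
fx = fy = 1000.0
cx, cy = 640.0, 360.0
p_left = backproject(640, 360, 300.0, fx, fy, cx, cy)
p_right = backproject(740, 360, 300.0, fx, fy, cx, cy)

width_mm = math.dist(p_left, p_right)  # 100 px at 300 mm depth -> 30 mm
```

With these toy numbers one pixel corresponds to 0.3 mm at 300 mm depth, which gives a sense of why depth accuracy and segmentation accuracy both matter for sub-millimeter contours.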

[129] A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks

Santiago C. Vilabella, Pablo Pérez-Núñez, Beatriz Remeseiro

Main category: cs.CV

TL;DR: Self-supervised learning approach for object detection that improves feature extractors to reduce dependency on labeled data while outperforming ImageNet-pretrained models.

Details

Motivation: The paper addresses the challenge of limited labeled data for training deep learning models, especially for complex tasks like object detection. Companies face high costs for data labeling through skilled personnel or outsourcing. The research aims to show that enhancing feature extractors can reduce dependency on labeled data.

Method: Uses a self-supervised learning strategy to train models on unlabeled data. The approach focuses on improving feature extractors to learn more effective representations with less labeled data, specifically targeting object detection tasks.

Result: The proposed model outperforms state-of-the-art feature extractors pre-trained on ImageNet for object detection tasks. The approach encourages the model to focus on the most relevant aspects of objects, achieving better feature representations and improving reliability and robustness.

Conclusion: Enhancing feature extractors through self-supervised learning can significantly reduce the need for labeled data while improving performance on object detection tasks, offering a more efficient and robust solution for companies developing AI applications.

Abstract: In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.

[130] Subtractive Modulative Network with Learnable Periodic Activations

Tiou Wang, Zhuoqian Yang, Markus Flierl, Mathieu Salzmann, Sabine Süsstrunk

Main category: cs.CV

TL;DR: SMN is a parameter-efficient Implicit Neural Representation architecture inspired by subtractive synthesis, using learnable oscillators and modulative filters to generate high-order harmonics for improved signal reconstruction.

Details

Motivation: The paper aims to develop a more efficient and effective Implicit Neural Representation (INR) architecture by drawing inspiration from classical subtractive synthesis in audio signal processing, seeking to improve parameter efficiency while maintaining high reconstruction quality.

Method: Proposes Subtractive Modulative Network (SMN) with a learnable periodic activation layer (Oscillator) generating multi-frequency basis, followed by modulative mask modules (Filters) that actively generate high-order harmonics, creating a principled signal processing pipeline.

Result: Achieves PSNR of 40+ dB on two image datasets, outperforming state-of-the-art methods in both reconstruction accuracy and parameter efficiency, with consistent advantages on 3D NeRF novel view synthesis tasks.

Conclusion: SMN provides an effective, parameter-efficient INR architecture that bridges signal processing principles with neural representations, demonstrating strong performance across 2D image reconstruction and 3D novel view synthesis tasks.

Abstract: We propose the Subtractive Modulative Network (SMN), a novel, parameter-efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis, and a series of modulative mask modules (Filters) that actively generate high-order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of 40+ dB on two image datasets, comparing favorably against state-of-the-art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at https://inrainbws.github.io/smn/.
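One way to see why multiplicative modulation can create new frequencies is the product-to-sum identity for sinusoids. The sketch below checks it numerically. It is an illustration of the general signal-processing fact, not the SMN architecture, and the function names and frequencies are hypothetical.

```python
import math


def modulated(x, f_osc, f_mask):
    """Product of an oscillator output and a sinusoidal mask."""
    return math.sin(2 * math.pi * f_osc * x) * math.sin(2 * math.pi * f_mask * x)


def harmonic_form(x, f_osc, f_mask):
    """Same signal written as difference and sum frequencies:
    sin(a) * sin(b) = 0.5 * (cos(a - b) - cos(a + b)).
    """
    low = math.cos(2 * math.pi * (f_osc - f_mask) * x)
    high = math.cos(2 * math.pi * (f_osc + f_mask) * x)
    return 0.5 * (low - high)


# Multiplying a 5 Hz and a 3 Hz sinusoid yields 2 Hz and 8 Hz components.
samples = [i / 100.0 for i in range(100)]
max_gap = max(abs(modulated(x, 5.0, 3.0) - harmonic_form(x, 5.0, 3.0)) for x in samples)
```

Stacking several such multiplicative stages compounds the effect, so a small set of base frequencies can reach much higher ones without adding a separate parameter for each.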

[131] SCAR: Satellite Imagery-Based Calibration for Aerial Recordings

Henry Hölzemann, Michael Schleiss

Main category: cs.CV

TL;DR: SCAR is a method for long-term auto-calibration refinement of aerial visual-inertial systems using georeferenced satellite imagery as a persistent global reference to estimate intrinsic and extrinsic parameters.

Details

Motivation: Existing aerial visual-inertial calibration methods require dedicated calibration maneuvers or manually surveyed ground control points, which are impractical for long-term field deployment. There's a need for automated calibration that can detect and correct degradation over time without manual intervention.

Method: SCAR aligns aerial images with 2D-3D correspondences derived from publicly available orthophotos and elevation models. It estimates both intrinsic and extrinsic parameters by exploiting georeferenced satellite imagery as a persistent global reference, enabling automatic calibration refinement during field operations.

Result: Evaluated on six large-scale aerial campaigns over two years under diverse conditions, SCAR consistently outperformed established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin and achieving substantially lower visual localization rotation errors and higher pose accuracy.

Conclusion: SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without manual intervention, demonstrating the effectiveness of leveraging external geospatial data for continuous calibration refinement.

Abstract: We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D–3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.
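The median reprojection error reported for SCAR can be illustrated with a small pinhole-camera sketch. It assumes the 3D points are already expressed in the camera frame (extrinsics applied) and ignores lens distortion. The intrinsics and correspondences are hypothetical and not taken from the paper.

```python
import math
from statistics import median


def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def median_reprojection_error(points_cam, observed_px, fx, fy, cx, cy):
    """Median pixel distance between projected and observed 2D points."""
    errors = []
    for point, (u_obs, v_obs) in zip(points_cam, observed_px):
        u, v = project(point, fx, fy, cx, cy)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return median(errors)


# Hypothetical 2D-3D correspondences (e.g. from an orthophoto plus elevation model).
points_cam = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 1.0, 10.0)]
observed_px = [(320.0, 240.0), (373.0, 244.0), (320.0, 291.0)]

err = median_reprojection_error(points_cam, observed_px, 500.0, 500.0, 320.0, 240.0)
```

A calibration refinement would adjust the intrinsic and extrinsic parameters to reduce this quantity. The median is less sensitive than the mean to a few bad correspondences.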

[132] Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang

Main category: cs.CV

TL;DR: The AMAA framework improves monocular 3D Semantic Scene Completion for indoor assistive perception through adaptive multi-scale attention aggregation and feature regulation.

Details

Motivation: Existing monocular SSC approaches lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation, making them vulnerable to projection diffusion and feature entanglement, limiting structural stability for safety-critical scene understanding in assistive systems for visually impaired users.

Method: Built upon MonoScene pipeline, AMAA uses parallel channel-spatial attention aggregation to jointly calibrate lifted voxel features in semantic and spatial dimensions, and hierarchical adaptive feature-gating strategy to regulate information injection across scales during multi-scale encoder-decoder fusion.

Result: On NYUv2 benchmark: 27.25% SSC mIoU (+0.31 improvement) and 43.10% SC IoU (+0.59 improvement). System-level deployment on NVIDIA Jetson platform verifies stable execution on embedded hardware.

Conclusion: AMAA improves monocular SSC quality without significantly increasing system complexity, providing a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.

Abstract: In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
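The two metrics quoted for AMAA, SC IoU and SSC mIoU, can be sketched on a flattened voxel grid. This assumes label 0 means empty space and that mIoU averages over the non-empty classes that occur. Real benchmark protocols add details such as visibility masks, which are omitted here, and the toy labels are hypothetical.

```python
def class_iou(pred, target, cls):
    """Intersection over union for one semantic class, or None if absent."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else None


def ssc_miou(pred, target, num_classes):
    """Mean IoU over semantic classes 1..num_classes-1 (0 is empty)."""
    ious = [class_iou(pred, target, c) for c in range(1, num_classes)]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious)


def sc_iou(pred, target):
    """Binary occupancy IoU: any non-empty label counts as occupied."""
    inter = sum(1 for p, t in zip(pred, target) if p != 0 and t != 0)
    union = sum(1 for p, t in zip(pred, target) if p != 0 or t != 0)
    return inter / union


# Six voxels, labels: 0 = empty, 1 and 2 = two semantic classes.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 1]

miou = ssc_miou(pred, target, num_classes=3)  # (1/3 + 2/3) / 2
occupancy = sc_iou(pred, target)              # 4 / 5
```

SC IoU rewards getting the geometry right regardless of class, while SSC mIoU also penalizes assigning the wrong label to a correctly occupied voxel, which is why the two numbers can move independently.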

[133] ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura

Main category: cs.CV

TL;DR: ReMoRa is a video MLLM that processes compressed video representations (keyframes + motion features) instead of sequential RGB frames, enabling efficient long-form video understanding with linear scaling.

Details

Motivation: Long-form video understanding remains challenging for MLLMs due to computational intractability of processing full RGB frame sequences (quadratic complexity in self-attention) and high redundancy in video data.

Method: Processes videos using compressed representations: sparse RGB keyframes for appearance and motion representations as a compact proxy for optical flow. Includes a denoising module to refine block-based motions and compresses features so that cost scales linearly with sequence length.

Result: Outperforms baseline methods on multiple challenging benchmarks including LongVideoBench, NExT-QA, and MLVU, demonstrating effectiveness for long-video understanding.

Conclusion: ReMoRa provides an efficient approach to video understanding for MLLMs by operating on compressed representations, addressing computational challenges while maintaining strong performance on long-form video tasks.

Abstract: While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity with sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
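The cost argument behind ReMoRa can be made concrete with a back-of-the-envelope token count. All numbers below (frame count, tokens per frame, keyframe count, motion tokens per frame) are hypothetical and not taken from the paper. The sketch only contrasts dense RGB decoding with a keyframes-plus-motion layout.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention (quadratic)."""
    return num_tokens * num_tokens


# Hypothetical long video.
frames = 1024
tokens_per_frame = 196        # e.g. a 14x14 patch grid
keyframes = 32                # sparse RGB keyframes kept for appearance
motion_tokens_per_frame = 16  # compact motion representation per frame

dense_tokens = frames * tokens_per_frame
compressed_tokens = keyframes * tokens_per_frame + frames * motion_tokens_per_frame

savings = attention_pairs(dense_tokens) / attention_pairs(compressed_tokens)
```

With these toy values the sequence shrinks from 200,704 to 22,656 tokens, and because attention cost grows with the square of the length, the pairwise cost drops by a much larger factor than the token count itself.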

[134] Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems

Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal

Main category: cs.CV

TL;DR: The paper presents two training strategies for multilingual OCR systems using Vision-Language Models for Indian languages, comparing end-to-end training vs. fine-tuning existing OCR models, with Chitrapathak-2 achieving 3-6x speedup and state-of-the-art performance in Telugu.

Details

Motivation: India's linguistic diversity, document heterogeneity, and deployment constraints require specialized OCR systems that balance accuracy, speed, and practical deployment considerations.

Method: Two strategies: 1) Pairing a generic vision encoder with a strong multilingual language model and training end-to-end for OCR, 2) Fine-tuning an existing OCR model not trained for target languages. Evaluation on multilingual Indic OCR benchmarks with deployment metrics.

Result: Fine-tuning existing OCR models consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over predecessor, SOTA in Telugu (6.69 char ANLS), second best in other languages. Parichay model for government documents achieves 89.8% Exact Match with faster inference.

Conclusion: The second strategy (fine-tuning existing OCR models) provides better practical performance for production-scale OCR pipelines in the Indian context, achieving state-of-the-art results with improved speed.

Abstract: Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
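The two metric families named in this entry, ANLS and Exact Match, can be sketched as below. This follows the commonly used ANLS definition (normalized Levenshtein similarity with a 0.5 cutoff). The paper's exact "char ANLS" variant and its reporting scale are not given in the abstract, so treat the threshold and normalization as assumptions. The sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def nls(pred, ref, threshold=0.5):
    """Normalized Levenshtein similarity, zeroed when the strings differ too much."""
    if not pred and not ref:
        return 1.0
    nl = levenshtein(pred, ref) / max(len(pred), len(ref))
    return 1.0 - nl if nl < threshold else 0.0


def anls(preds, refs):
    """Average of per-sample similarities."""
    return sum(nls(p, r) for p, r in zip(preds, refs)) / len(refs)


def exact_match(preds, refs):
    """Fraction of predictions identical to their reference."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


preds = ["abcd", "xyz"]
refs = ["abcf", "xyz"]
score_anls = anls(preds, refs)      # (0.75 + 1.0) / 2
score_em = exact_match(preds, refs)  # 1 of 2 fields correct
```

Exact Match suits structured key-field extraction, where a single wrong character in an ID number makes the field unusable, while an edit-distance score gives partial credit on long free-form text.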

[135] Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing

Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin

Main category: cs.CV

TL;DR: ChartVSR introduces a visual self-refinement paradigm using pixel-level localization as visual anchors for accurate chart parsing, addressing visual perception errors in LVLMs.

Details

Motivation: Existing LVLMs struggle with visually dense charts, leading to data omission, misalignment, and hallucination errors. The paper is inspired by human strategy of using visual anchors (like a finger) to ensure accuracy when reading complex charts.

Method: Proposes Visual Self-Refine (VSR) paradigm where models generate pixel-level localization outputs, visualize them, and feed these visualizations back for self-correction. Instantiates as ChartVSR with two stages: Refine Stage (iterative visual feedback for accurate pixel-level localizations) and Decode Stage (using verified localizations as visual anchors to parse structured data).

Result: Introduces ChartP-Bench, a challenging benchmark for chart parsing. Demonstrates VSR as a general-purpose visual feedback mechanism for enhancing accuracy on vision-centric tasks.

Conclusion: VSR offers a promising new direction for improving visual perception accuracy in multimodal models, particularly for complex visual tasks like chart parsing where textual reasoning alone is insufficient.

Abstract: While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a “visual anchor” to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points’ Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
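The Refine Stage loop described for VSR (localize, visualize, feed back, correct) can be sketched as a generic control loop. The `localize` and `render` callables stand in for the model and the overlay renderer and are hypothetical, as are the stopping rule and the toy stand-ins used to exercise the loop.

```python
def visual_self_refine(image, localize, render, max_rounds=3):
    """Iteratively refine pixel-level localizations using visual feedback.

    localize(image, feedback) -> list of (x, y) points
    render(image, points)     -> a visualization of the current points
    Stops early when a round produces no change.
    """
    points = localize(image, feedback=None)
    for _ in range(max_rounds):
        overlay = render(image, points)
        revised = localize(image, feedback=overlay)
        if revised == points:
            break
        points = revised
    return points


# Toy stand-ins: the "model" nudges each point one pixel toward a hidden target.
TARGET = [(10, 5), (20, 8)]


def toy_localize(image, feedback):
    if feedback is None:
        return [(8, 5), (20, 9)]  # initial, slightly wrong guess

    def step(value, goal):
        return value + (goal > value) - (goal < value)

    return [(step(x, tx), step(y, ty)) for (x, y), (tx, ty) in zip(feedback, TARGET)]


def toy_render(image, points):
    return list(points)  # a real renderer would draw markers on the chart


refined = visual_self_refine(None, toy_localize, toy_render, max_rounds=5)
```

The verified points would then act as anchors for the Decode Stage. That stage is not sketched here because the abstract does not describe how localizations are mapped to data values.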

[136] MMA: Multimodal Memory Agent

Yihao Lu, Wanru Cheng, Zeyu Zhang, Hao Tang

Main category: cs.CV

TL;DR: MMA introduces a multimodal memory agent with dynamic reliability scoring for retrieved memories, addressing issues of stale/conflicting information in long-horizon multimodal agents through credibility, temporal decay, and conflict-aware consensus.

Details

Motivation: Long-horizon multimodal agents relying on external memory face problems with similarity-based retrieval surfacing stale, low-credibility, or conflicting items, leading to overconfident errors. There's a need for better memory reliability assessment in multimodal contexts.

Method: Proposes Multimodal Memory Agent (MMA) that assigns dynamic reliability scores to retrieved memory items by combining source credibility, temporal decay, and conflict-aware network consensus. Uses this signal to reweight evidence and abstain when support is insufficient. Also introduces MMA-Bench benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions.

Result: On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility. On LoCoMo, safety-oriented configuration improves actionable accuracy and reduces wrong answers. On MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode while baseline collapses to 0.0%. Uncovers “Visual Placebo Effect” showing how RAG-based agents inherit latent visual biases.

Conclusion: MMA provides an effective framework for handling memory reliability in multimodal agents, demonstrating improved performance and reduced errors through dynamic reliability scoring and conflict-aware consensus mechanisms.

Abstract: Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the “Visual Placebo Effect”, revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
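A reliability score that combines source credibility, temporal decay, and consensus, followed by abstention, might look like the sketch below. The abstract does not give MMA's actual formula, so the multiplicative combination, the half-life decay, and the threshold are all assumptions. The function names and sample memories are hypothetical.

```python
def reliability(credibility, age_days, consensus, half_life_days=30.0):
    """Combine three signals in [0, 1] into one score.

    Assumed form: credibility * exponential time decay * consensus.
    The decay halves the score every half_life_days.
    """
    decay = 0.5 ** (age_days / half_life_days)
    return credibility * decay * consensus


def answer_or_abstain(items, threshold=0.5):
    """Pick the claim with the most reliability-weighted support.

    items: list of (claim, credibility, age_days, consensus).
    Returns None (abstain) when no claim reaches the threshold.
    """
    support = {}
    for claim, cred, age, cons in items:
        support[claim] = support.get(claim, 0.0) + reliability(cred, age, cons)
    if not support:
        return None
    best = max(support, key=support.get)
    return best if support[best] >= threshold else None


# A fresh, agreed-upon memory versus a stale, contested one.
memories = [("A", 0.9, 0, 1.0), ("B", 0.9, 90, 0.5)]
decision = answer_or_abstain(memories)            # "A"
stale_only = answer_or_abstain([memories[1]])     # None: support too weak
```

The key behavior is that a retrieved item's similarity to the query plays no role here. A stale or contested memory is down-weighted even if it is the closest match, and the agent abstains rather than answering from weak evidence.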

[137] Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection

Alexis Winter, Jean-Vincent Martini, Romaric Audigier, Angelique Loesch, Bertrand Luvison

Main category: cs.CV

TL;DR: Proposes a unified benchmark for evaluating adversarial attacks on object detection models, investigates attack transferability between CNN and Vision Transformer architectures, and identifies optimal adversarial training strategies for robust defense.

Details

Motivation: Object detection models are vulnerable to adversarial attacks, but defense progress lags due to lack of standardized evaluation. Existing work uses inconsistent metrics, datasets, and perturbation measures, making fair comparisons impossible.

Method: Proposes a unified benchmark framework for digital, non-patch-based attacks with specific metrics to separate localization and classification errors. Evaluates attack cost using multiple perceptual metrics. Conducts extensive experiments on state-of-the-art attacks across various detectors.

Result: Two key findings: (1) Modern adversarial attacks show poor transferability to transformer-based architectures compared to CNNs; (2) The most robust adversarial training uses a mixed dataset of high-perturbation attacks with different objectives (spatial and semantic), outperforming single-attack training.

Conclusion: Provides a standardized benchmark for fair evaluation of adversarial attacks on object detection, reveals architectural differences in attack transferability, and identifies optimal adversarial training strategies for improved robustness.

Abstract: Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.

[138] DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images

Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang, Yingnian Wu, Yin Yang, Abishek Sampath Kumar, Kenji Tashiro, Chenfanfu Jiang

Main category: cs.CV

TL;DR: DressWild: A feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and 3D garments from single in-the-wild images using vision-language models and transformers.

Details

Motivation: Existing garment pattern generation methods have limitations: feed-forward approaches struggle with diverse poses/viewpoints, while optimization-based methods are computationally expensive and don't scale well. There's a need for editable, separable, simulation-ready garments for modeling and fabrication applications.

Method: 1) Uses vision-language models to normalize pose variations at image level; 2) Extracts pose-aware, 3D-informed garment features; 3) Fuses features through transformer-based encoder; 4) Predicts sewing pattern parameters that can be directly used for physical simulation, texture synthesis, and multi-layer virtual try-on.

Result: Extensive experiments show robust recovery of diverse sewing patterns and corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization. The method offers an efficient and scalable solution for realistic garment simulation and animation.

Conclusion: DressWild provides a novel feed-forward pipeline for generating physics-consistent sewing patterns and 3D garments from single images, addressing limitations of existing methods and enabling practical applications in garment modeling and fabrication.

Abstract: Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extracts pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.

[139] Let’s Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding

Kaiting Liu, Hazel Doughty

Main category: cs.CV

TL;DR: A zero-shot method for editing video classifiers to split coarse categories into finer subcategories without retraining or new annotations.

Details

Motivation: Video recognition models use fixed taxonomies that are often too coarse, collapsing fine-grained distinctions. Retraining with new annotations to accommodate evolving definitions is costly.

Method: Proposes category splitting task with zero-shot editing method leveraging latent compositional structure of video classifiers. Also uses low-shot fine-tuning that benefits from zero-shot initialization.

Result: Method substantially outperforms vision-language baselines on new video benchmarks, improving accuracy on newly split categories without sacrificing performance elsewhere.

Conclusion: Category splitting enables efficient refinement of video classifiers to accommodate evolving fine-grained distinctions without costly retraining.

Abstract: Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.

[140] Arc2Morph: Identity-Preserving Facial Morphing with Arc2Face

Nicolò Di Domenico, Annalisa Franco, Matteo Ferrara, Davide Maltoni

Main category: cs.CV

TL;DR: Proposes a novel face morphing attack technique using Arc2Face foundation model to generate realistic morphed faces that challenge face recognition systems, achieving comparable attack potential to traditional landmark-based methods.

Details

Motivation: Face morphing attacks pose serious threats to face recognition systems in electronic identity documents, exploiting vulnerabilities in unsupervised passport enrollment procedures where facial images are captured without live supervision.

Method: Uses Arc2Face, an identity-conditioned face foundation model that synthesizes photorealistic facial images from compact identity representations, to create novel face morphing attacks. Compares against state-of-the-art morphing methods on large-scale sequestered datasets.

Result: The proposed deep learning-based approach achieves morphing attack potential comparable to traditional landmark-based techniques, which are considered the most challenging. Effectively preserves and manages identity information during morph generation.

Conclusion: Arc2Face-based face morphing technique demonstrates strong attack potential against face recognition systems, confirming its ability to effectively manipulate identity information and posing significant security challenges for electronic identity verification systems.

Abstract: Face morphing attacks are widely recognized as one of the most challenging threats to face recognition systems used in electronic identity documents. These attacks exploit a critical vulnerability in passport enrollment procedures adopted by many countries, where the facial image is often acquired without a supervised live capture process. In this paper, we propose a novel face morphing technique based on Arc2Face, an identity-conditioned face foundation model capable of synthesizing photorealistic facial images from compact identity representations. We demonstrate the effectiveness of the proposed approach by comparing the morphing attack potential metric on two large-scale sequestered face morphing attack detection datasets against several state-of-the-art morphing methods, as well as on two novel morphed face datasets derived from FEI and ONOT. Experimental results show that the proposed deep learning-based approach achieves a morphing attack potential comparable to that of landmark-based techniques, which have traditionally been regarded as the most challenging. These findings confirm the ability of the proposed method to effectively preserve and manage identity information during the morph generation process.

[141] A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

Qi You, Yitai Cheng, Zichao Zeng, James Haworth

Main category: cs.CV

TL;DR: CLIP-MHAdapter: A lightweight adaptation method for CLIP that adds multi-head self-attention on patch tokens to improve fine-grained street-view image attribute classification while maintaining low computational cost.

Details

Motivation: Street-view image attribute classification is computationally demanding, and existing CLIP adaptation methods rely on global image embeddings, limiting their ability to capture fine-grained local attributes needed for complex street scenes.

Method: Proposes CLIP-MHAdapter, a variant of lightweight CLIP adaptation that appends a bottleneck MLP with multi-head self-attention operating on patch tokens to model inter-patch dependencies, with only ~1.4M trainable parameters.

Result: Achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost.

Conclusion: CLIP-MHAdapter effectively addresses the limitations of global embeddings for fine-grained street-view attribute classification through patch-level attention mechanisms, offering an efficient adaptation approach for vision-language models.

Abstract: Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.

[142] Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis

Main category: cs.CV

TL;DR: SSB framework integrates self-supervised semantic priors into diffusion bridge models for unpaired image-to-image translation without cross-domain supervision, achieving spatially faithful translation for medical imaging and text-guided editing.

Details

Motivation: Existing adversarial diffusion methods require target-domain adversarial loss limiting generalization, while diffusion-inversion methods suffer from low-fidelity translations due to imperfect inversion. There's a need for a method that enables spatially faithful translation without cross-domain supervision.

Method: Proposes Self-Supervised Semantic Bridge (SSB) framework that leverages self-supervised visual encoders to learn representations invariant to appearance changes but capturing geometric structure. These form a shared latent space that conditions diffusion bridges, integrating external semantic priors into diffusion bridge models.

Result: SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.

Conclusion: SSB provides a versatile framework for unpaired image-to-image translation that integrates semantic priors into diffusion bridges, enabling spatially faithful translation without cross-domain supervision and showing strong performance in medical imaging applications.

Abstract: Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.

[143] PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction

Bo Lang, Nirav Savaliya, Zhihao Zheng, Jinglun Feng, Zheng-Hang Yeh, Mooi Choo Chuah

Main category: cs.CV

TL;DR: Novel end-to-end framework for consistent online HD vectorized map construction that jointly performs map instance tracking and short-term prediction to improve temporal consistency.

Details

Motivation: Existing query-based methods for HD map construction suffer from temporal inconsistencies and instabilities due to random query initialization and implicit temporal modeling, which are problematic for autonomous driving applications.

Method: Proposes four key components: 1) Semantic-Aware Query Generator for spatially aligned semantic mask initialization, 2) History Rasterized Map Memory for fine-grained instance-level storage, 3) History-Map Guidance Module for temporal continuity, and 4) Short-Term Future Guidance module for motion prediction.

Result: Outperforms state-of-the-art methods on nuScenes and Argoverse2 datasets with good efficiency, demonstrating improved temporal consistency in HD map construction.

Conclusion: The proposed framework effectively addresses temporal inconsistency issues in online HD map construction through explicit historical modeling and future guidance, providing more stable and consistent maps for autonomous driving.

Abstract: High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.

[144] VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

Yingyuan Yang, Tian Lan, Yifei Gao, Yimeng Lu, Wenjun He, Meng Wang, Chenghao Liu, Chen Zhang

Main category: cs.CV

TL;DR: VETime is a novel time-series anomaly detection framework that unifies 1D temporal and 2D visual modalities through fine-grained alignment and dynamic fusion to detect both point and context anomalies.

Details

Motivation: Existing foundation models for time-series anomaly detection face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to lack of temporal alignment and coarse-grained detection.

Method: VETime introduces: 1) Reversible Image Conversion and Patch-Level Temporal Alignment to establish shared visual-temporal timeline; 2) Anomaly Window Contrastive Learning; 3) Task-Adaptive Multi-Modal Fusion to adaptively integrate complementary strengths of both modalities.

Result: VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches.

Conclusion: The proposed VETime framework successfully resolves the dilemma between temporal and visual modalities for time-series anomaly detection through fine-grained alignment and adaptive fusion, demonstrating strong performance in zero-shot settings.

Abstract: Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: https://github.com/yyyangcoder/VETime.
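
The entry does not spell out how VETime's Reversible Image Conversion works, so the following is only a generic illustration of the idea of a lossless 1D-to-2D mapping with a shared timeline, not the authors' method. It assumes a simple row-major fold with padding; the function names `series_to_image`, `image_to_series`, and `pixel_to_time` are hypothetical.

```python
def series_to_image(series, width, pad_value=0.0):
    """Fold a 1D series into rows of `width` values (row-major), padding the last row."""
    length = len(series)
    rows = -(-length // width)  # ceiling division
    padded = list(series) + [pad_value] * (rows * width - length)
    image = [padded[r * width:(r + 1) * width] for r in range(rows)]
    return image, length  # keep the original length so the fold can be undone


def image_to_series(image, length):
    """Invert series_to_image: flatten row-major and drop the padding."""
    flat = [value for row in image for value in row]
    return flat[:length]


def pixel_to_time(row, col, width):
    """Map a pixel position back to its time index on the shared timeline."""
    return row * width + col


series = [0.1 * i for i in range(10)]
image, length = series_to_image(series, width=4)
restored = image_to_series(image, length)
```

Because every pixel maps back to exactly one time step, a detection made on the 2D view can in principle be localized pointwise on the original series, which is the property the alignment module is described as preserving.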

[145] Learning Situated Awareness in the Real World

Chuhan Li, Ruilin Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

Main category: cs.CV

TL;DR: SAW-Bench is a novel benchmark for evaluating egocentric situated awareness in multimodal foundation models using real-world videos from smart glasses, focusing on observer-centric spatial reasoning rather than environment-centric relations.

Details

Motivation: Most existing multimodal benchmarks focus on environment-centric spatial relations among objects, but overlook observer-centric relationships that require reasoning relative to an agent's viewpoint, pose, and motion. There's a need to evaluate models' situated awareness - the ability to relate oneself to the physical environment and reason over possible actions in context.

Method: Created SAW-Bench with 786 self-recorded videos using Ray-Ban Meta smart glasses across diverse indoor/outdoor environments, with 2,071 human-annotated QA pairs. The benchmark probes six different observer-centric awareness tasks to evaluate models’ egocentric situated understanding.

Result: Evaluation reveals a 37.66% human-model performance gap even with the best-performing MFM (Gemini 3 Flash). Models can exploit partial geometric cues but often fail to infer coherent camera geometry, leading to systematic spatial reasoning errors. The benchmark uncovers specific weaknesses in observer-centric understanding.

Conclusion: SAW-Bench positions itself as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics. It highlights significant gaps in current multimodal models’ ability to reason from egocentric perspectives.

Abstract: A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to an agent’s viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model’s observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

[146] Are Object-Centric Representations Better At Compositional Generalization?

Ferdinand Kapl, Amir Mohammad Karimi Mamaghan, Maximilian Seitzer, Karl Henrik Johansson, Carsten Marr, Stefan Bauer, Andrea Dittadi

Main category: cs.CV

TL;DR: Object-centric representations show stronger compositional generalization in VQA tasks compared to dense encoders, especially when resources are constrained.

Details

Motivation: To systematically evaluate whether object-centric representations truly support compositional generalization in visually rich settings, as often claimed but with limited evidence.

Method: Created a VQA benchmark across three visual worlds (CLEVRTex, Super-CLEVR, MOVi-C) to test generalization to unseen property combinations. Compared DINOv2/SigLIP2 vision encoders with their object-centric counterparts, controlling for training diversity, sample size, representation size, model capacity, and compute.

Result: Object-centric approaches excel in harder compositional generalization settings; dense representations only outperform on easier settings with more compute; OC models are more sample efficient and achieve stronger generalization with fewer images.

Conclusion: Object-centric representations offer superior compositional generalization when any of dataset size, training diversity, or compute is constrained, supporting their value for robust visual reasoning.

Abstract: Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
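
To make "unseen combinations of object properties" concrete, here is a minimal sketch of a compositional train/test split. The property names and values are hypothetical, and the benchmark's actual split protocol is not described in this entry; the sketch only shows the general principle that every individual property value appears in training while some combinations are held out.

```python
from itertools import product


def compositional_split(shapes, colors, held_out):
    """Split all (shape, color) pairs into seen (train) and unseen (test) combinations."""
    held_out = set(held_out)
    all_pairs = list(product(shapes, colors))
    train = [pair for pair in all_pairs if pair not in held_out]
    test = [pair for pair in all_pairs if pair in held_out]

    # Each individual value must still occur in training; only the pairing is novel.
    train_shapes = {shape for shape, _ in train}
    train_colors = {color for _, color in train}
    for shape, color in test:
        if shape not in train_shapes or color not in train_colors:
            raise ValueError(f"{(shape, color)} uses a value never seen in training")
    return train, test


# Hypothetical property vocabularies, chosen only for illustration.
shapes = ["cube", "sphere", "cylinder"]
colors = ["red", "green", "blue"]
train, test = compositional_split(
    shapes, colors, held_out=[("cube", "red"), ("sphere", "blue")]
)
```

A model that has seen red spheres and blue cubes but never a red cube is then evaluated on exactly those held-out pairings.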

[147] Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning

Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li

Main category: cs.CV

TL;DR: SAP (Saliency-Aware Principle selection) improves vision-language models by enabling dynamic re-consultation of visual evidence during reasoning, reducing object hallucination and improving grounding stability.

Details

Motivation: Current VLMs struggle with visual grounding during long reasoning chains because visual inputs are only provided once at generation start, causing reasoning to become text-dominated and allowing early visual grounding errors to accumulate. Existing guidance methods are too coarse and noisy for effective steering.

Method: Proposes SAP (Saliency-Aware Principle selection) which operates on high-level reasoning principles rather than token-level trajectories. This enables stable control over discrete generation under noisy feedback and allows later reasoning steps to re-consult visual evidence when renewed grounding is required. Supports multi-route inference for parallel exploration of diverse reasoning behaviors. Model-agnostic and data-free with no additional training needed.

Result: SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets. Yields more stable reasoning and lower response latency than CoT-style long sequential reasoning.

Conclusion: SAP provides an effective method for improving visual grounding in VLMs by enabling dynamic re-consultation of visual evidence during reasoning, addressing the fundamental limitation of one-time visual input provision in current architectures.

Abstract: Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.

[148] TeCoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

Namitha Padmanabhan, Matthew Gwilliam, Abhinav Shrivastava

Main category: cs.CV

TL;DR: TeCoNeRV improves hypernetwork-based video compression using spatial-temporal decomposition, residual storage, and temporal coherence regularization to achieve better quality, lower bitrates, and higher resolutions with reduced memory usage.

Details

Motivation: Current hypernetwork-based video compression methods face limitations in scaling to high-resolution videos due to prohibitive memory requirements, low quality, and large compressed sizes, despite their fast encoding speeds.

Method: Three key innovations: (1) spatial-temporal decomposition of video into patch tubelets to reduce memory overhead 20×, (2) residual-based storage scheme capturing differences between consecutive segments, and (3) temporal coherence regularization aligning weight space changes with video content.

Result: Achieves 2.47dB and 5.35dB PSNR improvements over baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding speeds. First hypernetwork approach to demonstrate results at 480p, 720p and 1080p on multiple datasets.

Conclusion: TeCoNeRV successfully addresses fundamental limitations of hypernetwork-based video compression, enabling high-resolution video compression with improved quality, lower bitrates, and practical memory usage.

Abstract: Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20×; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3× faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at https://namithap10.github.io/teconerv/.
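
The residual-based storage idea (keep one full representation, then only the differences between consecutive segments) can be sketched as plain delta coding. This is a toy illustration under stated assumptions, not TeCoNeRV's actual codec: weights are assumed to be already quantized to integers, and the function names are hypothetical.

```python
def encode_residuals(segments):
    """Keep the first weight vector whole, then store only per-segment differences."""
    first = list(segments[0])
    deltas = [
        [cur - prev for cur, prev in zip(segments[i], segments[i - 1])]
        for i in range(1, len(segments))
    ]
    return first, deltas


def decode_residuals(first, deltas):
    """Rebuild every segment by accumulating the stored differences."""
    segments = [list(first)]
    for delta in deltas:
        segments.append([w + d for w, d in zip(segments[-1], delta)])
    return segments


# Toy quantized weights for three consecutive video segments (made-up values).
segments = [[120, -40, 7], [121, -40, 9], [121, -42, 9]]
first, deltas = encode_residuals(segments)
restored = decode_residuals(first, deltas)
```

When consecutive segments have similar weights, the deltas are mostly small or zero and so compress better under entropy coding than the raw weights, which is consistent with the role the abstract gives temporal coherence regularization.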

[149] Prompt When the Animal is: Temporal Animal Behavior Grounding with Positional Recovery Training

Sheng Yan, Xin Du, Zongying Li, Yi Wang, Hongcang Jin, Mengyuan Liu

Main category: cs.CV

TL;DR: Port framework improves temporal grounding for animal behavior videos by using positional recovery training with start/end time prompts and dual-alignment to handle sparse, uniformly distributed moments.

Details

Motivation: Temporal grounding in multimodal learning faces challenges with animal behavior data due to sparse and uniformly distributed moments, requiring better methods to localize specific behaviors in time.

Method: Proposes Positional Recovery Training (Port) framework that prompts models with start/end times of behaviors during training, adds a Recovering branch to reconstruct corrupted label sequences, and uses Dual-alignment method for distribution alignment.

Result: Achieves an IoU@0.3 of 38.52 on the Animal Kingdom dataset, emerging as one of the top performers in the MMVRAC sub-track of the ICME 2024 Grand Challenges.

Conclusion: Port effectively addresses temporal grounding challenges in animal behavior analysis by leveraging positional prompts and recovery mechanisms to focus on specific temporal regions.

Abstract: Temporal grounding is crucial in multimodal learning, but it poses challenges when applied to animal behavior data due to the sparsity and uniform distribution of moments. To address these challenges, we propose a novel Positional Recovery Training framework (Port), which prompts the model with the start and end times of specific animal behaviors during training. Specifically, Port enhances the baseline model with a Recovering branch to reconstruct corrupted label sequences and align distributions via a Dual-alignment method. This allows the model to focus on specific temporal regions prompted by ground-truth information. Extensive experiments on the Animal Kingdom dataset demonstrate the effectiveness of Port, achieving an IoU@0.3 of 38.52. It emerges as one of the top performers in the sub-track of MMVRAC in ICME 2024 Grand Challenges.
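
For readers unfamiliar with the reported metric, the sketch below shows how a score such as IoU@0.3 is commonly computed in temporal grounding: the percentage of queries whose predicted segment overlaps the ground truth with a temporal IoU of at least 0.3. Whether the challenge uses exactly this top-1 protocol is an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def score_at_iou(preds: List[Segment], gts: List[Segment], threshold: float = 0.3) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold (assumed protocol)."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Two toy queries: the first overlaps its ground truth, the second misses it entirely.
preds = [(2.0, 6.0), (0.0, 1.0)]
gts = [(4.0, 8.0), (5.0, 9.0)]
score = score_at_iou(preds, gts, threshold=0.3)
```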

[150] Ctrl-GenAug: Controllable Generative Augmentation for Medical Sequence Classification

Xinrui Zhou, Yuhao Huang, Haoran Dou, Shijing Chen, Ao Chang, Jia Liu, Weiran Long, Jian Zheng, Erjiao Xu, Jie Ren, Alejandro F. Frangi, Ruobing Huang, Jun Cheng, Xiaomeng Li, Wufeng Xue, Dong Ni

Main category: cs.CV

TL;DR: Ctrl-GenAug is a generative augmentation framework for medical sequence classification that enables controllable synthesis of diagnosis-promotive samples with enhanced temporal coherence and includes a noisy data filter to suppress unreliable synthetic cases.

Details

Motivation: Medical deep learning suffers from limited datasets and expensive annotations. While diffusion-based generative augmentation helps, existing methods lack sufficient semantic/sequential steerability for video/3D sequences and neglect quality control of noisy synthetic samples, limiting downstream task performance.

Method: 1) Multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples; 2) Sequential augmentation module to enhance temporal/stereoscopic coherence; 3) Noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels.

Result: Extensive experiments on 3 medical datasets using 11 networks trained on 3 paradigms show effectiveness and generality, particularly in underrepresented high-risk populations and out-domain conditions.

Conclusion: Ctrl-GenAug addresses key limitations in medical generative augmentation by providing controllable sequence synthesis with quality control, improving downstream classification performance especially for challenging cases.

Abstract: In the medical field, the limited availability of large-scale datasets and labor-intensive annotation processes hinder the performance of deep models. Diffusion-based generative augmentation approaches present a promising solution to this issue, having been proven effective in advancing downstream medical recognition tasks. Nevertheless, existing works lack sufficient semantic and sequential steerability for challenging video/3D sequence generation, and neglect quality control of noisy synthesized samples, resulting in unreliable synthetic databases and severely limiting the performance of downstream tasks. In this work, we present Ctrl-GenAug, a novel and general generative augmentation framework that enables highly semantic- and sequential-customized sequence synthesis and suppresses incorrectly synthesized samples, to aid medical sequence classification. Specifically, we first design a multimodal conditions-guided sequence generator for controllably synthesizing diagnosis-promotive samples. A sequential augmentation module is integrated to enhance the temporal/stereoscopic coherence of generated samples. Then, we propose a noisy synthetic data filter to suppress unreliable cases at semantic and sequential levels. Extensive experiments on 3 medical datasets, using 11 networks trained on 3 paradigms, comprehensively analyze the effectiveness and generality of Ctrl-GenAug, particularly in underrepresented high-risk populations and out-domain conditions.

Karim Kassab, Antoine Schnepf, Jean-Yves Franceschi, Laurent Caraffa, Flavian Vasile, Jeremie Mary, Andrew Comport, Valérie Gouet-Brunet

Main category: cs.CV

TL;DR: Fused-Planes: A novel 3D object representation that improves efficiency over Tri-Planar NeRFs by capturing structural similarities across object classes through shared base planes and latent decomposition.

Details

Motivation: Tri-Planar NeRFs are computationally intensive and inefficient for modeling large collections of 3D objects because they train one Tri-Plane per object independently, overlooking structural similarities across object classes.

Method: Introduces Fused-Planes that explicitly captures structural similarities through a latent space and globally shared base planes. Each object is represented as a decomposition over these base planes augmented with object-specific features.

Result: Fused-Planes achieve 7.2× faster training and 3.2× lower memory footprint than Tri-Planes while maintaining rendering quality. An ultra-lightweight variant reduces per-object memory usage by 1875× with minimal quality loss.

Conclusion: Fused-Planes offer state-of-the-art efficiency among planar representations for 3D object modeling, enabling more resource-efficient reconstruction of object classes while preserving the planar structure benefits.

Abstract: Tri-Planar NeRFs enable the application of powerful 2D vision models for 3D tasks, by representing 3D objects using 2D planar structures. This has made them the prevailing choice to model large collections of 3D objects. However, training Tri-Planes to model such large collections is computationally intensive and remains largely inefficient. This is because the current approaches independently train one Tri-Plane per object, hence overlooking structural similarities in large classes of objects. In response to this issue, we introduce Fused-Planes, a novel object representation that improves the resource efficiency of Tri-Planes when reconstructing object classes, all while retaining the same planar structure. Our approach explicitly captures structural similarities across objects through a latent space and a set of globally shared base planes. Each individual Fused-Planes is then represented as a decomposition over these base planes, augmented with object-specific features. Fused-Planes showcase state-of-the-art efficiency among planar representations, demonstrating 7.2× faster training and 3.2× lower memory footprint than Tri-Planes while maintaining rendering quality. An ultra-lightweight variant further cuts per-object memory usage by 1875× with minimal quality loss. Our project page can be found at https://fused-planes.github.io .
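
The decomposition described above can be sketched in a few lines. The example assumes each plane is flattened to a list of floats, that the shared part is a weighted sum of the base planes, and that "augmented" means concatenating object-specific features; the abstract does not spell out these details, so treat them as illustrative assumptions. The name `compose_plane` is hypothetical.

```python
from typing import List


def compose_plane(
    coeffs: List[float],
    base_planes: List[List[float]],
    object_features: List[float],
) -> List[float]:
    """Weighted sum of shared base planes, followed by object-specific features."""
    size = len(base_planes[0])
    shared = [
        sum(c * plane[i] for c, plane in zip(coeffs, base_planes))
        for i in range(size)
    ]
    # Assumption: the object-specific part is concatenated to the shared part.
    return shared + list(object_features)


# Two base planes shared by every object in the class (toy values).
base_planes = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]

# Per object, only the coefficients and a small feature vector are stored.
plane = compose_plane([0.5, 2.0], base_planes, [9.0])
per_object_values = 2 + 1  # coefficients + object-specific features
```

The memory saving comes from the last line: each object stores a handful of coefficients and features instead of a full set of planes, while the base planes are paid for once.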

[152] MC-LLaVA: Multi-Concept Personalized Vision-Language Model

Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang

Main category: cs.CV

TL;DR: MC-LLaVA introduces a multi-concept personalization paradigm for vision-language models, enabling understanding of multiple user-provided concepts simultaneously through specialized training strategies and prompts.

Details

Motivation: Current VLMs focus on single-concept personalization, limiting real-world applicability where multiple concepts interact. The paper aims to enable VLMs to understand and reason about multiple user-provided concepts simultaneously.

Method: Proposes MC-LLaVA with multi-concept instruction tuning, personalized textual prompts using visual token information to initialize concept tokens, personalized visual prompts with location maps for enhanced recognition/grounding, optional auxiliary loss, and a new high-quality multi-concept dataset from movies.

Result: MC-LLaVA achieves impressive multi-concept personalized responses, demonstrating superior performance in understanding and reasoning about multiple concepts simultaneously compared to single-concept approaches.

Conclusion: The proposed multi-concept personalization paradigm paves the way for VLMs to become better user assistants by handling real-world scenarios with multiple interacting concepts.

Abstract: Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss, better enhancing the proposed personalized prompts. To decorate the VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.

[153] Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging

Zuzanna Buchnajzer, Kacper Dobek, Stanisław Hapke, Daniel Jankowski, Krzysztof Krawiec

Main category: cs.CV

TL;DR: A neurosymbolic system that learns by reconstructing images using visual primitives, achieving better classification accuracy and transparency than conventional CNNs for histological abnormality diagnosis.

Details

Motivation: Conventional CNNs rely on continuous, smooth features which are incompatible with the crisp, categorical nature of real-world objects at human scale. There's a need for systems that form high-level, structural explanations of images.

Method: Proposes a class of neurosymbolic systems that learn by reconstructing images in terms of visual primitives, forcing the formation of high-level, structural explanations of the input images.

Result: When applied to histological imaging abnormality diagnosis, the method proved superior to a conventional deep learning architecture in terms of classification accuracy while being more transparent.

Conclusion: Neurosymbolic approaches that reconstruct images using visual primitives can provide more accurate and transparent solutions for medical imaging tasks compared to conventional CNNs.

Abstract: Deep learning architectures based on convolutional neural networks tend to rely on continuous, smooth features. While this characteristic provides significant robustness and proves useful in many real-world tasks, it is strikingly incompatible with the physical characteristic of the world, which, at the scale in which humans operate, comprises crisp objects, typically representing well-defined categories. This study proposes a class of neurosymbolic systems that learn by reconstructing images in terms of visual primitives and are thus forced to form high-level, structural explanations of them. When applied to the task of diagnosing abnormalities in histological imaging, the method proved superior to a conventional deep learning architecture in terms of classification accuracy, while being more transparent.

[154] RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield

Main category: cs.CV

TL;DR: RoboSpatial dataset for spatial understanding in robotics with 1M images, 5k 3D scans, and 3M spatial annotations, improving vision-language models’ spatial reasoning for robotics tasks.

Details

Motivation: Current vision-language models struggle with spatial reasoning for robotics because they're trained on general image datasets lacking sophisticated spatial understanding, particularly reference frame comprehension (ego-, world-, object-centric perspectives).

Method: Created RoboSpatial, a large-scale dataset with real indoor/tabletop scenes captured as 3D scans and egocentric images, annotated with rich spatial information relevant to robotics. Dataset includes 2D-3D pairing for multimodal training.

Result: Models trained with RoboSpatial outperform baselines on downstream tasks including spatial affordance prediction, spatial relationship prediction, and robot manipulation.

Conclusion: RoboSpatial addresses critical gaps in spatial understanding for vision-language models in robotics, enabling better spatial reasoning through comprehensive multimodal data with paired 2D-3D representations.

Abstract: Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D- ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.

[155] LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu

Main category: cs.CV

TL;DR: LMSeg: An open-vocabulary semantic segmentation method that enhances visual-language alignment using LLMs for enriched text prompts and SAM for improved pixel-level visual features.

Details

Motivation: Existing open-vocabulary segmentation methods use short, template-based text prompts that fail to capture comprehensive object attributes, and CLIP models are less effective at pixel-level representation needed for segmentation tasks.

Method: Uses LLMs to generate enriched language prompts with diverse visual attributes (color, shape/size, texture/material) for each category, and employs SAM as a supplement to CLIP visual encoder through learnable weighted fusion for better pixel-level features.

Result: Achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.

Conclusion: Leveraging multiple large-scale models (LLMs for enriched text, SAM for visual features) significantly improves open-vocabulary semantic segmentation performance by enhancing visual-language alignment.

Abstract: It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
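
A learnable weighted fusion of two feature sources can take many forms; the abstract does not give the exact one. The sketch below assumes the simplest variant: a single learnable scalar passed through a sigmoid to blend CLIP and SAM features of the same shape. The name `fuse_features` and the scalar gate are illustrative assumptions, not the LMSeg formulation.

```python
import math
from typing import List


def fuse_features(clip_feat: List[float], sam_feat: List[float], gate_logit: float) -> List[float]:
    """Blend two feature vectors with a weight in (0, 1) derived from a learnable logit."""
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))  # sigmoid keeps the weight bounded
    return [alpha * c + (1.0 - alpha) * s for c, s in zip(clip_feat, sam_feat)]


# With a logit of 0 the two sources contribute equally.
fused = fuse_features([1.0, 3.0], [3.0, 5.0], gate_logit=0.0)
```

During training, `gate_logit` would be updated by gradient descent together with the rest of the segmentation head, letting the model decide how much to rely on each encoder.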

[156] PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Xiaofeng Wang, Bo Li

Main category: cs.CV

TL;DR: PromptGuard: A novel content moderation technique for text-to-image models that uses optimized safety soft prompts to prevent NSFW content generation without altering model architecture or inference efficiency.

Details

Motivation: Text-to-image models can generate harmful NSFW content (sexually explicit, violent, political, disturbing images), raising serious ethical concerns. Current moderation methods are inefficient or require proxy models.

Method: Optimizes a universal safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space. Uses divide-and-conquer strategy: optimizes category-specific soft prompts and combines them into holistic safety guidance.

Result: Runs 3.8x faster than prior content moderation methods and surpasses 8 state-of-the-art defenses, with an optimal unsafe ratio as low as 5.84% across five datasets. Effectively mitigates NSFW content while preserving high-quality benign outputs.

Conclusion: PromptGuard provides an effective, efficient content moderation solution for T2I models by adapting LLM safety alignment techniques to the visual domain through soft prompt optimization.

Abstract: Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model’s textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard achieves 3.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.
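
To make the "implicit system prompt in the embedding space" idea concrete, here is a minimal sketch. It assumes the soft prompt is a short sequence of learned embedding vectors appended to the user's token embeddings, and that category-specific prompts are combined by concatenation; both choices, along with the function names, are assumptions for illustration and may differ from the authors' method.

```python
from typing import List

Embedding = List[float]


def combine_category_prompts(category_prompts: List[List[Embedding]]) -> List[Embedding]:
    """Assumed combination rule: concatenate the per-category soft prompts."""
    combined: List[Embedding] = []
    for prompt in category_prompts:
        combined.extend(prompt)
    return combined


def apply_soft_prompt(
    token_embeddings: List[Embedding],
    soft_prompt: List[Embedding],
    max_len: int,
) -> List[Embedding]:
    """Append the soft prompt, truncating user tokens so the total fits max_len."""
    if len(soft_prompt) > max_len:
        raise ValueError("soft prompt is longer than the encoder context")
    room = max_len - len(soft_prompt)
    return token_embeddings[:room] + soft_prompt


# Toy 2-dimensional embeddings; real text encoders use hundreds of dimensions.
user_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
sexual_prompt = [[1.0, 0.0]]   # hypothetical learned vectors
violent_prompt = [[0.0, 1.0]]
safety_prompt = combine_category_prompts([sexual_prompt, violent_prompt])
moderated = apply_soft_prompt(user_tokens, safety_prompt, max_len=4)
```

Because the soft prompt only changes the input to the text encoder, the image generator itself runs unmodified, which is consistent with the abstract's claim of unchanged inference efficiency.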

[157] Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models

Ehsan Zeraatkar, Salah A Faroughi, Jelena Tešić

Main category: cs.CV

TL;DR: ViSIR and ViFOR are frequency-aware vision transformer frameworks for super-resolution of Earth System Model outputs that address spectral bias in traditional deep learning methods.

Details

Motivation: Traditional deep super-resolution methods (convolutional and transformer-based) exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details needed for climate science applications.

Method: Two frequency-aware frameworks: ViSIR (Vision Transformer-Tuned Sinusoidal Implicit Representation) combines vision transformers with sinusoidal activations to mitigate spectral bias. ViFOR (Vision Transformer Fourier Representation Network) integrates explicit Fourier-based filtering for independent low- and high-frequency learning.

Result: Evaluated on E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading Convolutional NN, Generative Networks, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in Peak Signal to Noise Ratio and higher Structural Similarity.

Conclusion: The proposed frequency-aware vision transformer frameworks effectively address spectral bias in super-resolution tasks for Earth System Model outputs, providing enhanced spatial fidelity for climate science applications.

Abstract: Super-resolution can play an essential role in enhancing the spatial fidelity of Earth System Model outputs, allowing fine-scale structures highly beneficial to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce ViSIR and ViFOR, two frequency-aware frameworks. ViSIR stands for the Vision Transformer-Tuned Sinusoidal Implicit Representation. ViSIR combines vision transformers with sinusoidal activations to mitigate spectral bias. ViFOR stands for the Vision Transformer Fourier Representation Network. ViFOR integrates explicit Fourier based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading Convolutional NN, Generative Networks, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in Peak Signal to Noise Ratio and higher Structural Similarity.
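
The idea of separating low- and high-frequency content, and the PSNR metric used to report gains, can both be shown on a 1D signal. The sketch below uses a plain discrete Fourier transform with a hard frequency cutoff; ViFOR's actual filters operate on 2D fields and are not described in the abstract, so the cutoff rule and the function names are assumptions.

```python
import cmath
import math
from typing import List, Tuple


def dft(signal: List[float]) -> List[complex]:
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def split_bands(signal: List[float], cutoff: int) -> Tuple[List[float], List[float]]:
    """Split a signal into low- and high-frequency parts that sum back to the input."""
    n = len(signal)
    spectrum = dft(signal)
    # Frequency k and its mirror n - k belong together, so the mask is symmetric.
    is_low = [min(k, n - k) <= cutoff for k in range(n)]
    low = idft([spectrum[k] if is_low[k] else 0j for k in range(n)])
    high = idft([0j if is_low[k] else spectrum[k] for k in range(n)])
    return low, high


def psnr(reference: List[float], estimate: List[float], peak: float) -> float:
    """Peak signal-to-noise ratio in dB."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


n = 16
signal = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 6 * t / n)
          for t in range(n)]
low, high = split_bands(signal, cutoff=2)
```

A model affected by spectral bias would tend to reproduce `low` well and `high` poorly; learning the two bands separately is one way to give the high-frequency part its own capacity.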

[158] FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

Main category: cs.CV

TL;DR: FOCUS is a geospatial deep learning framework that maps PFAS contamination by integrating sparse field measurements with environmental context data like land cover, hydrology, and industrial activity.

Details

Motivation: PFAS contamination monitoring is limited by high costs and logistical challenges of field sampling, leading to sparse data that hinders physical modeling and scientific understanding of PFAS transport in surface waters.

Method: Integrates sparse PFAS observations with large-scale environmental context data (hydrological connectivity, land cover, source proximity, sampling distance) using a principled, noise-aware loss function for robust training under sparse labels.

Result: FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations while preserving spatial coherence and scalability over large regions.

Conclusion: AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns without complete physical models.

Abstract: Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

[159] A Survey: Spatiotemporal Consistency in Video Generation

Zhiyu Yin, Kehai Chen, Xuefeng Bai, Ruili Jiang, Juntao Li, Hongdong Li, Jin Liu, Yang Xiang, Jun Yu, Min Zhang

Main category: cs.CV

TL;DR: A systematic survey paper reviewing spatiotemporal consistency in video generation, analyzing generation models, frameworks, training strategies, and evaluation metrics with focus on maintaining temporal coherence.

Details

Motivation: Video generation requires both high-quality frames and strong temporal coherence, but systematic reviews focusing on spatiotemporal consistency are scarce despite increased research in this area.

Method: The paper frames video generation as sequential sampling from high-dimensional spatiotemporal distributions and provides comprehensive review across multiple dimensions: generation models, feature representations, frameworks, post-processing, training strategies, benchmarks, and evaluation metrics.

Result: A systematic analysis of current video generation methods with particular focus on mechanisms for maintaining spatiotemporal consistency, identifying key approaches and their effectiveness.

Conclusion: The survey provides valuable insights into spatiotemporal consistency in video generation and explores future research directions and challenges to advance the field.

Abstract: Video generation aims to produce temporally coherent sequences of visual frames, representing a pivotal advancement in Artificial Intelligence Generated Content (AIGC). Compared to static image generation, video generation poses unique challenges: it demands not only high-quality individual frames but also strong temporal coherence to ensure consistency throughout the spatiotemporal sequence. Although research addressing spatiotemporal consistency in video generation has increased in recent years, systematic reviews focusing on this core issue remain relatively scarce. To fill this gap, this paper views the video generation task as a sequential sampling process from a high-dimensional spatiotemporal distribution, and further discusses spatiotemporal consistency. We provide a systematic review of the latest advancements in the field. The content spans multiple dimensions including generation models, feature representations, generation frameworks, post-processing techniques, training strategies, benchmarks and evaluation metrics, with a particular focus on the mechanisms and effectiveness of various methods in maintaining spatiotemporal consistency. Finally, this paper explores future research directions and potential challenges in this field, aiming to provide valuable insights for advancing video generation technology. The project link is https://github.com/Yin-Z-Y/A-Survey-Spatiotemporal-Consistency-in-Video-Generation.

[160] CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis

Alexander Baumann, Leonardo Ayala, Silvia Seidlitz, Jan Sellner, Alexander Studier-Fischer, Berkin Özdemir, Lena Maier-Hein, Slobodan Ilic

Main category: cs.CV

TL;DR: CARL is a camera-agnostic representation learning model for spectral imaging that handles RGB, multispectral, and hyperspectral data with varying channel dimensionalities and wavelengths, using a novel spectral encoder with self-attention-cross-attention mechanism and feature-based self-supervision.

Details

Motivation: Spectral imaging faces challenges due to variability in channel dimensionality and captured wavelengths across different cameras, leading to camera-specific AI models with limited generalizability and poor cross-camera applicability. This bottleneck hinders the development of robust spectral imaging methodologies.

Method: Introduces CARL with a novel spectral encoder featuring self-attention-cross-attention mechanism to distill salient spectral information into learned representations. Uses spatio-spectral pre-training with feature-based self-supervision tailored for camera-agnostic learning across RGB, multispectral, and hyperspectral modalities.

Result: Demonstrates unique robustness to spectral heterogeneity across medical imaging, autonomous driving, and satellite imaging domains. Outperforms on datasets with simulated and real-world cross-camera spectral variations, showing scalability and versatility as a backbone for spectral foundation models.

Conclusion: CARL provides a scalable, versatile solution for camera-agnostic spectral representation learning, addressing the bottleneck of spectral heterogeneity and positioning itself as a foundation model backbone for future spectral AI applications.

Abstract: Spectral imaging offers promising applications across diverse domains, including medicine and urban scene understanding, and is already established as a critical modality in remote sensing. However, variability in channel dimensionality and captured wavelengths among spectral cameras impedes the development of AI-driven methodologies, leading to camera-specific models with limited generalizability and inadequate cross-camera applicability. To address this bottleneck, we introduce CARL, a model for Camera-Agnostic Representation Learning across RGB, multispectral, and hyperspectral imaging modalities. To enable the conversion of a spectral image with any channel dimensionality to a camera-agnostic representation, we introduce a novel spectral encoder, featuring a self-attention-cross-attention mechanism, to distill salient spectral information into learned spectral representations. Spatio-spectral pre-training is achieved with a novel feature-based self-supervision strategy tailored to CARL. Large-scale experiments across the domains of medical imaging, autonomous driving, and satellite imaging demonstrate our model’s unique robustness to spectral heterogeneity, outperforming on datasets with simulated and real-world cross-camera spectral variations. The scalability and versatility of the proposed approach position our model as a backbone for future spectral foundation models. Code and model weights are publicly available at https://github.com/IMSY-DKFZ/CARL.

[161] Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Bill Psomas, Dionysis Christopoulos, Eirini Baltzi, Ioannis Kakogeorgiou, Tilemachos Aravanis, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, Giorgos Tolias

Main category: cs.CV

TL;DR: EP (Efficient Probing) is a lightweight multi-query cross-attention mechanism for model evaluation that outperforms linear probing and previous attentive probing methods while being parameter-efficient.

Details

Motivation: Standard linear probing can understate model capabilities when pre-training optimizes local rather than global representations. Existing attentive probing methods are over-parameterized and computationally inefficient, motivating a more efficient solution.

Method: Proposes Efficient Probing (EP), a lightweight multi-query cross-attention mechanism that eliminates redundant projections and reduces trainable parameters. Also presents a comprehensive study of existing attentive probing methods, analyzing their design choices and benchmarking their performance.

Result: EP consistently outperforms linear probing and previous attentive probing methods across multiple benchmarks and pre-training paradigms. Remains effective when combined with parameter-efficient fine-tuning.

Conclusion: EP provides an efficient and effective probing method that reveals model capabilities better than linear probing. Analysis uncovers emerging properties like complementary attention maps, opening new directions for leveraging probing beyond evaluation.

Abstract: As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. However, standard linear probing can understate the capability of models whose pre-training optimizes local representations rather than an explicit global representation. This motivates attentive probing, an alternative that uses attention to selectively aggregate patch-level features. Despite growing adoption, attentive probing is still underexplored: existing approaches are often over-parameterized and computationally inefficient. In this work, we revisit attentive probing through the lens of the accuracy vs. parameter-efficiency trade-off. We present the first comprehensive study of existing methods, analyzing their design choices and benchmarking their performance. Building on these insights, we propose efficient probing (EP), a lightweight yet effective multi-query cross-attention mechanism that eliminates redundant projections and reduces the number of trainable parameters. Across multiple benchmarks and pre-training paradigms, EP consistently outperforms linear probing and previous attentive probing methods, and remains effective when combined with parameter-efficient fine-tuning. Beyond evaluation, our analysis uncovers emerging properties of EP, including complementary attention maps, which open new directions for leveraging probing beyond protocol design. Project page: https://vrg.fel.cvut.cz/ep/.
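To make the idea of multi-query cross-attention pooling concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that frozen patch features are used directly as keys and values (one possible reading of "eliminates redundant projections"), and the names `multi_query_pool`, `patches`, and `queries` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multi_query_pool(patches, queries):
    """Pool N patch features (each of dim D) with M learned queries.

    Assumption: patch features act as both keys and values, so the only
    trainable parameters in this sketch are the M query vectors.
    Returns the M pooled vectors concatenated into one descriptor.
    """
    dim = len(patches[0])
    descriptor = []
    for query in queries:
        # Scaled dot-product score between this query and every patch.
        scores = [sum(q * p for q, p in zip(query, patch)) / math.sqrt(dim)
                  for patch in patches]
        weights = softmax(scores)
        # Attention-weighted average of the patch features.
        pooled = [sum(w * patch[d] for w, patch in zip(weights, patches))
                  for d in range(dim)]
        descriptor.extend(pooled)
    return descriptor

patches = [[1.0, 0.0], [3.0, 2.0]]   # toy frozen patch features
queries = [[0.0, 0.0], [5.0, 5.0]]   # hypothetical learned queries
descriptor = multi_query_pool(patches, queries)
```

A zero query attends uniformly and so reduces to average pooling, while the second query concentrates on the patch it aligns with. A linear classifier on `descriptor` would complete the probe.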

[162] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition

Jiuhong Xiao, Yang Zhou, Giuseppe Loianno

Main category: cs.CV

TL;DR: QAA is a query-based adaptive aggregation method for Visual Place Recognition that uses learned queries as reference codebooks to enhance information capacity and improve cross-dataset generalization.

Details

Motivation: Current VPR models trained on single datasets have dataset-specific biases and limited generalization. Multi-dataset training can saturate feature aggregation layers, leading to suboptimal performance.

Method: Proposes Query-based Adaptive Aggregation (QAA) using learned queries as reference codebooks. Computes Cross-query Similarity (CS) between query-level image features and codebooks to generate robust descriptors without significant computational overhead.

Result: QAA outperforms state-of-the-art models, achieves balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Learned queries show diverse attention patterns across datasets.

Conclusion: QAA effectively addresses dataset divergence in multi-dataset VPR training by enhancing information capacity through query-based adaptive aggregation, enabling better generalization without sacrificing performance.

Abstract: Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate the limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Project page: http://xjh19971.github.io/QAA.
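The Cross-query Similarity step can be illustrated with a short sketch. This is an assumption-laden toy, not the released QAA code: it takes cosine similarity as the similarity measure and flattens the resulting matrix into the descriptor, and the names `cross_query_similarity`, `query_features`, and `codebook` are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_query_similarity(query_features, codebook):
    """Compare every query-level image feature with every codebook entry.

    Assumption: cosine similarity is used, and the similarity matrix is
    flattened row by row into the final descriptor.
    """
    unit_features = [l2_normalize(f) for f in query_features]
    unit_codes = [l2_normalize(c) for c in codebook]
    return [sum(a * b for a, b in zip(f, c))
            for f in unit_features
            for c in unit_codes]

query_features = [[1.0, 0.0], [0.0, 2.0]]          # 2 query-level image features
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 hypothetical reference codes
descriptor = cross_query_similarity(query_features, codebook)
```

The descriptor length is the number of image queries times the number of codebook entries, so it is independent of the feature dimension.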

Josh Qixuan Sun, Xiaoying Xing, Huaiyuan Weng, Chul Min Yeum, Mark Crowley

Main category: cs.CV

TL;DR: VIL is a view-invariant post-training strategy for Vision-Language Navigation in Continuous Environments that enhances robustness to camera viewpoint changes through contrastive learning and teacher-student distillation.

Details

Motivation: Most navigation policies are sensitive to viewpoint changes (camera height and viewing angle variations), which limits their robustness in real-world embodied AI applications where agents may encounter diverse observation perspectives.

Method: Proposes VIL with two key components: 1) contrastive learning framework to learn sparse and view-invariant features, and 2) teacher-student framework for the Waypoint Predictor Module where a view-dependent teacher distills knowledge into a view-invariant student. Uses end-to-end training to jointly optimize these components.

Result: Outperforms state-of-the-art approaches on V2-VLNCE by 8-15% Success Rate on R2R-CE and RxR-CE datasets. Achieves SOTA performance on RxR-CE across all metrics compared to other map-free methods. Maintains or improves performance under standard VLNCE settings despite being trained for varied viewpoints.

Conclusion: VIL effectively enhances viewpoint robustness for navigation policies without diminishing standard performance, serving as a plug-and-play post-training method for existing VLNCE systems.

Abstract: Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most navigation policies are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle that alter the agent’s observation. In this paper, we introduce a generalized scenario, V2-VLNCE (VLNCE with Varied Viewpoints), and propose VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. Additionally, we introduce a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components, thus eliminating the cost for individual module training. Empirical results show that our method outperforms state-of-the-art approaches on V2-VLNCE by 8-15% measured on Success Rate for two standard benchmark datasets R2R-CE and RxR-CE. Furthermore, we evaluate VIL under the standard VLNCE setting and find that, despite being trained for varied viewpoints, it often still improves performance. On the more challenging RxR-CE dataset, our method also achieved state-of-the-art performance across all metrics when compared to other map-free methods. This suggests that adding VIL does not diminish the standard viewpoint performance and can serve as a plug-and-play post-training method.
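The abstract does not state which contrastive objective VIL uses. As an illustration only, the sketch below uses the common InfoNCE loss, treating features of the same location seen from two camera viewpoints as a positive pair. The function names and the temperature value are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor.

    anchor:    feature from the standard viewpoint
    positive:  feature of the same observation from a varied viewpoint
    negatives: features of other observations
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# The loss is small when the two views of the same scene agree ...
aligned = info_nce([1.0, 0.0], [1.0, 0.1], [[0.0, 1.0]])
# ... and large when the positive looks like a different scene.
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.1]])
```

Minimizing such a loss pulls features of the same observation together across viewpoints, which is the sense in which the learned features become view-invariant.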

Yawen Zou, Guang Li, Zi Wang, Chunzhi Gu, Chao Zhang

Main category: cs.CV

TL;DR: A detector-guided dataset distillation framework that uses pre-trained detectors to identify and refine anomalous synthetic samples, ensuring label consistency and improving image quality in dataset distillation.

Details

Motivation: Current diffusion-based dataset distillation methods often produce samples with label inconsistencies or insufficient structural details, leading to suboptimal downstream performance. There's a need to improve the quality and label accuracy of distilled datasets.

Method: Proposes a detector-guided framework that uses a pre-trained detector to identify anomalous synthetic samples (label mismatches or low confidence). For defective images, generates multiple candidates using a diffusion model conditioned on image prototypes and labels, then selects optimal candidates based on detector confidence and dissimilarity to existing qualified samples.

Result: The method synthesizes high-quality representative images with richer details and achieves state-of-the-art performance on validation sets, demonstrating improved dataset distillation quality.

Conclusion: The detector-guided approach effectively addresses label inconsistency and quality issues in dataset distillation, producing more reliable and informative compact datasets for downstream tasks.

Abstract: Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original dataset, thereby reducing demands on storage and computational resources. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, thereby ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset is employed to identify anomalous images exhibiting label mismatches or low classification confidence. For each defective image, multiple candidates are generated using a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector’s confidence score and dissimilarity to existing qualified synthetic samples, thereby ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method can synthesize high-quality representative images with richer details, achieving state-of-the-art performance on the validation set.
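The detect-then-select loop described above can be sketched as follows. The confidence threshold, the additive scoring rule, and the `diversity_weight` parameter are assumptions made for illustration. The abstract only says that confidence and dissimilarity are considered jointly.

```python
import math

def is_anomalous(predicted_label, target_label, confidence, min_confidence=0.5):
    """Flag a synthetic image whose label mismatches or whose confidence is low."""
    return predicted_label != target_label or confidence < min_confidence

def select_candidate(candidates, accepted_features, diversity_weight=1.0):
    """Pick the regenerated candidate with the best joint score.

    Each candidate is a dict with a detector 'confidence' and a 'feature'
    vector. The score adds the distance to the nearest already-accepted
    sample, so near-duplicates are penalized (hypothetical scoring rule).
    """
    def score(candidate):
        if accepted_features:
            nearest = min(math.dist(candidate["feature"], f)
                          for f in accepted_features)
        else:
            nearest = 0.0
        return candidate["confidence"] + diversity_weight * nearest
    return max(candidates, key=score)

candidates = [
    {"name": "a", "confidence": 0.9, "feature": [0.0, 0.0]},
    {"name": "b", "confidence": 0.8, "feature": [3.0, 4.0]},
]
accepted = [[0.0, 0.0]]
diverse_pick = select_candidate(candidates, accepted)
confident_pick = select_candidate(candidates, accepted, diversity_weight=0.0)
```

With the diversity term active, candidate `b` wins despite lower confidence because `a` duplicates an accepted sample. With the term disabled, the most confident candidate is chosen.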

[165] MedVLThinker: Simple Baselines for Multimodal Medical Reasoning

Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou

Main category: cs.CV

TL;DR: MedVLThinker introduces open recipes for building reasoning-centric medical large multimodal models with systematic data curation and two training paradigms (SFT and RLVR), showing RLVR outperforms SFT and text-only reasoning data boosts performance more than multimodal data.

Details

Motivation: The absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison in medical AI.

Method: 1) Systematic data curation for text-only and image-text medical data filtered by reasoning difficulty levels; 2) Two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness.

Result: RLVR consistently outperforms SFT across experiments on Qwen2.5-VL models and six medical QA benchmarks. Counter-intuitively, training on curated text-only reasoning data provides more substantial performance boost than multimodal image-text data. Best 7B model achieves SOTA on public VQA benchmarks, and scaling to 32B matches GPT-4o performance.

Conclusion: MedVLThinker provides a strong open foundation for multimodal medical reasoning research, demonstrating the effectiveness of RLVR training and the surprising importance of text-only reasoning data for medical LMM performance.

Abstract: Large Reasoning Models (LRMs) have introduced a new paradigm in AI by enabling models to “think before responding” via chain-of-thought reasoning. However, the absence of open and reproducible recipes for building reasoning-centric medical LMMs hinders community-wide research, analysis, and comparison. In this paper, we present MedVLThinker, a suite of simple yet strong baselines. Our fully open recipe consists of: (1) systematic data curation for both text-only and image-text medical data, filtered according to varying levels of reasoning difficulty, and (2) two training paradigms: Supervised Fine-Tuning (SFT) on distilled reasoning traces and Reinforcement Learning with Verifiable Rewards (RLVR) based on final answer correctness. Across extensive experiments on the Qwen2.5-VL model family (3B, 7B) and six medical QA benchmarks, we find that RLVR consistently and significantly outperforms SFT. Additionally, under the RLVR framework, a key, counter-intuitive finding is that training on our curated text-only reasoning data provides a more substantial performance boost than training on multimodal image-text data. Our best open 7B model, trained using the RLVR recipe on text-only data, establishes a new state-of-the-art on existing public VQA benchmarks, surpassing all previous open-source medical LMMs. Furthermore, scaling our model to 32B achieves performance on par with the proprietary GPT-4o. We release all curated data, models, and code to provide the community with a strong, open foundation for future research in multimodal medical reasoning.
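A reward "based on final answer correctness" can be as simple as the sketch below. The `<answer>...</answer>` tag format and the case-insensitive exact match are assumptions for illustration. The abstract does not specify the output format or matching rule used by MedVLThinker.

```python
import re

# Hypothetical output convention: the final answer is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*(.*?)\s*</answer>",
                            re.DOTALL | re.IGNORECASE)

def verifiable_reward(response, gold_answer):
    """Return 1.0 if the extracted final answer matches the gold answer, else 0.0."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return 0.0  # no parsable final answer
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold_answer.strip().lower() else 0.0

correct = verifiable_reward("<think>Opacity in the left lobe.</think><answer> B </answer>", "b")
wrong = verifiable_reward("<answer>C</answer>", "B")
missing = verifiable_reward("The answer is probably B.", "B")
```

Because the reward depends only on the final answer, the reasoning trace itself is not supervised, in contrast to SFT on distilled traces.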

[166] Robust Image Stitching with Optimal Plane

Lang Nie, Yuan Mei, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao

Main category: cs.CV

TL;DR: RopStitch is an unsupervised deep image stitching framework that achieves robust and natural image stitching through a dual-branch architecture for content perception and virtual optimal planes for alignment-preservation balance.

Details

Motivation: Traditional image stitching methods struggle with robustness across diverse real-world scenes and face conflicts between content alignment and structural preservation. There's a need for an unsupervised approach that can handle various scenes while maintaining natural-looking results.

Method: Proposes a dual-branch architecture: one pretrained branch for semantically invariant representations and one learnable branch for fine-grained features, merged via controllable correlation factor. Introduces virtual optimal planes concept modeled as homography decomposition coefficients estimation, using iterative coefficient predictor and minimal semantic distortion constraint. Both views are warped bidirectionally onto the optimal plane.

Result: Extensive experiments across various datasets show RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness.

Conclusion: RopStitch provides an effective unsupervised solution for image stitching that balances robustness and naturalness through innovative dual-branch architecture and virtual optimal planes approach.

Abstract: We present RopStitch, an unsupervised deep image stitching framework with both robustness and naturalness. To ensure the robustness of RopStitch, we propose to incorporate the universal prior of content perception into the image stitching model by a dual-branch architecture. It separately captures coarse and fine features and integrates them to achieve highly generalizable performance across diverse unseen real-world scenes. Concretely, the dual-branch model consists of a pretrained branch to capture semantically invariant representations and a learnable branch to extract fine-grained discriminative features, which are then merged into a whole by a controllable factor at the correlation level. Besides, considering that content alignment and structural preservation are often contradictory to each other, we propose a concept of virtual optimal planes to relieve this conflict. To this end, we model this problem as a process of estimating homography decomposition coefficients, and design an iterative coefficient predictor and minimal semantic distortion constraint to identify the optimal plane. This scheme is finally incorporated into RopStitch by warping both views onto the optimal plane bidirectionally. Extensive experiments across various datasets demonstrate that RopStitch significantly outperforms existing methods, particularly in scene robustness and content naturalness. The code is available at https://github.com/MmelodYy/RopStitch.

[167] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Zhonghao Yan, Muxi Diao, Yuxuan Yang, Ruoyan Jing, Jiayuan Xu, Kaizhou Zhang, Lele Yang, Yanxi Liu, Kongming Liang, Zhanyu Ma

Main category: cs.CV

TL;DR: MedReasoner: A modular MLLM framework for medical image grounding that separates reasoning from segmentation using reinforcement learning, achieving SOTA on the new U-MRG-14K dataset with implicit clinical queries.

Details

Motivation: Current medical grounding pipelines rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped for implicit queries common in clinical practice where clinicians describe findings without explicit spatial references.

Method: Introduces MedReasoner framework with modular design: 1) MLLM reasoner optimized with reinforcement learning to handle implicit clinical queries, 2) frozen segmentation expert converts spatial prompts to masks, 3) alignment achieved through format and accuracy rewards. Also defines Unified Medical Reasoning Grounding (UMRG) task and releases U-MRG-14K dataset with 14K samples across 10 modalities.

Result: MedReasoner achieves state-of-the-art performance on U-MRG-14K dataset and demonstrates strong generalization to unseen clinical queries, showing significant promise of reinforcement learning for interpretable medical grounding.

Conclusion: The work presents a novel approach to medical image grounding that better handles implicit clinical queries through reinforcement learning and modular design, advancing multimodal reasoning in medical imaging applications.

Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.
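The combination of a format reward and an accuracy reward might look like the following sketch. It assumes the reasoner emits a bounding-box prompt inside a hypothetical `<box>x1,y1,x2,y2</box>` tag and that accuracy is scored by box IoU. The abstract does not give the actual prompt format or reward definitions.

```python
import re

# Hypothetical output convention for the spatial prompt.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>")

def parse_box(response):
    """Extract (x1, y1, x2, y2) from the response, or None if malformed."""
    match = BOX_PATTERN.search(response)
    return tuple(int(g) for g in match.groups()) if match else None

def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(response, gold_box):
    """Format reward (1 if a box is parsable) plus accuracy reward (IoU)."""
    box = parse_box(response)
    if box is None:
        return 0.0
    return 1.0 + box_iou(box, gold_box)

good = grounding_reward("<think>Lesion in upper left.</think><box>0,0,2,2</box>", (1, 1, 3, 3))
malformed = grounding_reward("The lesion is in the upper left.", (1, 1, 3, 3))
```

In the modular design described above, only the reasoner would receive this reward, while the segmentation expert that turns the prompt into a mask stays frozen.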

[168] COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Yassine Taoudi-Benchekroun, Klim Troyan, Pascal Sager, Stefan Gerber, Lukas Tuggener, Benjamin Grewe

Main category: cs.CV

TL;DR: COGITAO is a modular framework for generating visual reasoning tasks to study compositionality and generalization in AI models, featuring rule-based transformations on grid objects with adjustable composition depth.

Details

Motivation: Current AI models struggle with compositional generalization - applying learned concepts in novel combinations. The authors aim to systematically study this limitation in visual domains, inspired by ARC-AGI's problem-setting.

Method: COGITAO creates rule-based tasks applying 28 interoperable transformations to objects in grid environments. It supports adjustable composition depth, extensive control over grid parameters, and generates millions of unique task rules with unlimited samples per rule.

Result: The framework enables creation of millions of unique tasks across difficulty levels. Baseline experiments with state-of-the-art vision models show consistent failures to generalize to novel combinations of familiar elements despite strong in-domain performance.

Conclusion: COGITAO provides a comprehensive benchmark for studying compositionality in visual reasoning, highlighting current AI limitations and offering an open-source framework for continued research in this area.

Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI’s problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules – surpassing concurrent datasets by several orders of magnitude – across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
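Composition at adjustable depth can be pictured with a toy example. The two transformations below (`rotate_cw`, `flip_horizontal`) are stand-ins chosen for illustration and are not taken from COGITAO's set of 28. The `compose` helper is likewise hypothetical.

```python
from itertools import product

def rotate_cw(grid):
    """Rotate a grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def compose(*transforms):
    """Return a rule that applies the transforms in the order given."""
    def rule(grid):
        for transform in transforms:
            grid = transform(grid)
        return grid
    return rule

grid = [[1, 2],
        [3, 4]]
flip_then_rotate = compose(flip_horizontal, rotate_cw)(grid)
full_turn = compose(rotate_cw, rotate_cw, rotate_cw, rotate_cw)(grid)

# Every ordered sequence of `depth` transforms defines one task rule.
library = [rotate_cw, flip_horizontal]
depth = 2
rules = [compose(*sequence) for sequence in product(library, repeat=depth)]
```

A compositional generalization test would train on some of these rules and evaluate on held-out sequences built from the same transformations.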

[169] Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction

Fengzhi Guo, Chih-Chuan Hsu, Sihao Ding, Cheng Zhang

Main category: cs.CV

TL;DR: USplat4D introduces uncertainty-aware dynamic Gaussian Splatting that models per-Gaussian uncertainty to improve 4D reconstruction from monocular video, addressing occlusion and novel view synthesis challenges.

Details

Motivation: Dynamic 3D scene reconstruction from monocular input is under-constrained with ambiguities from occlusion and extreme novel views. Current dynamic Gaussian Splatting models optimize all Gaussian primitives uniformly, ignoring observation reliability, leading to motion drifts and degraded synthesis.

Method: USplat4D estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Gaussians with recurring observations across views and time act as reliable anchors that guide motion, while those with limited visibility are treated as less reliable.

Result: Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.

Conclusion: Uncertainty-aware optimization is crucial for dynamic Gaussian Splatting, with USplat4D demonstrating improved reconstruction quality and robustness to occlusion and novel view synthesis challenges.

Abstract: Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our approach estimates time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.

[170] Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

Main category: cs.CV

TL;DR: LGIP benchmark evaluates VLMs’ linguistic robustness by testing invariance to paraphrases and sensitivity to semantic flips in image-text matching.

Details

Motivation: Current vision-language models show strong zero-shot performance, but how reliably they respond to linguistic perturbations is unclear. There is a need to measure how VLMs respond to meaning-preserving paraphrases and meaning-changing semantic flips beyond standard retrieval metrics.

Method: Introduces Language-Guided Invariance Probing (LGIP) benchmark using 40k MS COCO images with five human captions each. Automatically generates paraphrases and rule-based semantic flips that alter object category, color, or count. Measures invariance error, semantic sensitivity gap, and positive-rate statistic.

Result: EVA02-CLIP and large OpenCLIP variants show favorable invariance-sensitivity balance with low paraphrase variance and higher scores for original vs. flipped captions. SigLIP and SigLIP2 exhibit large invariance errors and often prefer flipped captions to human descriptions, especially for object and color edits.

Conclusion: LGIP provides a model-agnostic diagnostic for linguistic robustness of VLMs that standard retrieval metrics miss. The benchmark reveals significant differences in how VLMs handle linguistic perturbations, with some models showing concerning preference for semantically incorrect captions.

Abstract: Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
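The three summary statistics can be sketched from image-text matching scores as follows. The exact definitions used by LGIP are not given in the abstract, so the formulas below (mean absolute score change under paraphrase, mean score margin over the flipped caption, and fraction of pairs where the original wins) are plausible assumptions, and `lgip_summary` is a hypothetical name.

```python
def lgip_summary(records):
    """Summarize paraphrase invariance and flip sensitivity.

    Each record holds the matching score of one image with its original
    caption, with several paraphrases, and with one semantically flipped
    caption.
    """
    count = len(records)
    invariance_error = sum(
        sum(abs(r["original"] - p) for p in r["paraphrases"]) / len(r["paraphrases"])
        for r in records) / count
    sensitivity_gap = sum(r["original"] - r["flipped"] for r in records) / count
    positive_rate = sum(1 for r in records if r["original"] > r["flipped"]) / count
    return {
        "invariance_error": invariance_error,   # lower is better
        "sensitivity_gap": sensitivity_gap,     # higher is better
        "positive_rate": positive_rate,         # higher is better
    }

records = [
    {"original": 0.5, "paraphrases": [0.5, 0.25], "flipped": 0.25},
    {"original": 0.5, "paraphrases": [0.75, 0.75], "flipped": 0.75},
]
summary = lgip_summary(records)
```

In this toy data the second record shows the failure mode reported for some models: the flipped caption scores higher than the human description.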

[171] Trustworthy and Fair SkinGPT-R1 for Democratizing Dermatological Reasoning across Diverse Ethnicities

Yuhao Shen, Zhangtianyi Chen, Yuanhao He, Yan Xu, Shuping Zhang, Liyuan Sun, Zijian Wang, Yinghao Zhu, Yuyuan Yang, Jiahe Qian, Ziwen Wang, Xinyuan Zhang, Wenbin Liu, Zongyuan Ge, Tao Lu, Siyuan Yan, Juexiao Zhou

Main category: cs.CV

TL;DR: SkinGPT-R1 is a multimodal LLM for dermatology that combines chain-of-thought reasoning with fairness-aware mixture-of-experts architecture to provide interpretable and equitable skin disease diagnosis across diverse skin tones.

Details

Motivation: Clinical translation of dermatological AI faces two major challenges: opaque reasoning (lack of interpretability) and systematic performance disparities across different skin tones (algorithmic bias).

Method: Multimodal large language model integrating chain-of-thought diagnostic reasoning with fairness-aware mixture-of-experts architecture. Uses parameter-efficient adaptation of frozen reasoning backbone to generate structured diagnostic reports with visual findings, differential reasoning, and final diagnosis.

Result: Achieves SOTA accuracy on 6/7 external datasets, including 82.50% on 40-class long-tail classification (+19.30% over baselines). Dermatologist evaluation scores 3.6/5 with highest ratings in safety (3.8) and reasoning coherence (3.6). Mitigates bias across Fitzpatrick spectrum with 41.40% worst-group performance on Fitz17k and 5x relative improvement on DDI dataset.

Conclusion: Establishes a framework for trustworthy, fair, and explainable AI-assisted dermatological diagnosis that addresses both interpretability and fairness challenges in clinical translation.

Abstract: The clinical translation of dermatological AI is hindered by opaque reasoning and systematic performance disparities across skin tones. Here we present SkinGPT-R1, a multimodal large language model that integrates chain-of-thought diagnostic reasoning with a fairness-aware mixture-of-experts architecture for interpretable and equitable skin disease diagnosis. Through parameter-efficient adaptation of a frozen reasoning backbone, SkinGPT-R1 generates structured diagnostic reports comprising visual findings, differential reasoning, and final diagnosis. Across seven external datasets spanning diverse pathologies and imaging conditions, SkinGPT-R1 achieves state-of-the-art accuracy on six benchmarks, including 82.50% on a challenging 40-class long-tail classification task (+19.30% over leading baselines). Blinded evaluation by five board-certified dermatologists on 1,000 phenotypically balanced cases yields a mean score of 3.6 out of 5, with the highest ratings in safety (3.8) and reasoning coherence (3.6), indicating that the generated rationales are clinically safe, logically grounded, and suitable for supporting diagnostic decision-making. Critically, SkinGPT-R1 mitigates algorithmic bias across the full Fitzpatrick spectrum, achieving a robust worst-group performance of 41.40% on the Fitz17k benchmark and a five-fold relative improvement in lower-bound accuracy on the DDI dataset compared to standard multimodal baselines. These results establish a framework for trustworthy, fair, and explainable AI-assisted dermatological diagnosis.
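Worst-group performance, the fairness metric quoted above, is the accuracy of the subgroup on which the model does worst. A minimal sketch, assuming accuracy as the per-group metric and Fitzpatrick-type bins as the grouping (the bin labels below are illustrative):

```python
from collections import defaultdict

def worst_group_accuracy(predictions, labels, groups):
    """Return (worst group, its accuracy, all per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(predicted == label)
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst], per_group

predictions = ["eczema", "psoriasis", "eczema", "psoriasis"]
labels      = ["eczema", "psoriasis", "psoriasis", "psoriasis"]
groups      = ["I-II", "I-II", "V-VI", "V-VI"]   # hypothetical skin-tone bins
worst, worst_acc, per_group = worst_group_accuracy(predictions, labels, groups)
```

Reporting the minimum rather than the average prevents strong results on well-represented skin tones from masking failures on under-represented ones.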

[172] INQUIRE-Search: Interactive Discovery in Large-Scale Biodiversity Databases

Edward Vendrow, Julia Chae, Rupa Kurinchi-Vendhan, Isaac Eckert, Jazlynn Hall, Marta Jarzyna, Reymond Miyajima, Ruth Oliver, Laura Pollock, Lauren Shrack, Scott Yanco, Oisin Mac Aodha, Sara Beery

Main category: cs.CV

TL;DR: INQUIRE-Search is an AI-powered system that uses natural language to search biodiversity image databases like iNaturalist for ecological phenomena, enabling efficient discovery and analysis of complex ecological patterns at scale.

Details

Motivation: Ecological research often focuses on complex phenomena (species interactions, behaviors, phenology, disturbance responses) that are difficult to observe and sparsely documented in large biodiversity image databases. Current manual inspection methods make this information largely inaccessible at scale.

Method: INQUIRE-Search is an open-source system that uses natural language processing to search within ecological image databases like iNaturalist. It allows scientists to search for specific phenomena, verify and export relevant observations, and use outputs for downstream scientific analysis.

Result: Across five case studies, INQUIRE-Search concentrated relevant observations 3-25x more efficiently than comparable manual inspection budgets. The system enabled ecological inference for analyzing seasonal variation in behavior across species and forest regrowth after wildfires.

Conclusion: INQUIRE-Search represents a new paradigm for interactive, efficient, and scalable scientific discovery that unlocks previously inaccessible scientific value in large-scale biodiversity datasets. AI-enabled discovery tools require reframing aspects of the scientific process including experiment design, data collection, survey effort, and uncertainty analysis.

Abstract: Many ecological questions center on complex phenomena, such as species interactions, behaviors, phenology, and responses to disturbance, that are inherently difficult to observe and sparsely documented. Community science platforms such as iNaturalist contain hundreds of millions of biodiversity images, which often contain evidence of these complex phenomena. However, current workflows that seek to discover and analyze this evidence often rely on manual inspection, leaving this information largely inaccessible at scale. We introduce INQUIRE-Search, an open-source system that uses natural language to enable scientists to rapidly search within an ecological image database like iNaturalist for specific phenomena, verify and export relevant observations, and use these outputs for downstream scientific analysis. Across five illustrative case studies, INQUIRE-Search concentrates relevant observations 3-25x more efficiently than comparable manual inspection budgets. These examples demonstrate how the system can be used for ecological inference, from analyzing seasonal variation in behavior across species to forest regrowth after wildfires. These examples illustrate a new paradigm for interactive, efficient, and scalable scientific discovery that can begin to unlock previously inaccessible scientific value in large-scale biodiversity datasets. Finally, we highlight how AI-enabled discovery tools for science require reframing aspects of the scientific process, including experiment design, data collection, survey effort, and uncertainty analysis.
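
The "3-25x" figure compares how densely relevant observations appear under search versus manual inspection. One plausible way to quantify such a gain, offered here as an assumption rather than the paper's stated metric, is the ratio of the hit rate among reviewed search results to the base rate of relevant images in the whole database. The function name and all numbers below are hypothetical.

```python
def concentration_factor(hits_reviewed, reviewed, relevant_total, database_size):
    """Hit rate among reviewed search results divided by the database base rate.

    Assumed definition for illustration only; the paper may measure this differently.
    """
    if min(reviewed, relevant_total, database_size) <= 0:
        raise ValueError("counts must be positive")
    # (hits_reviewed / reviewed) / (relevant_total / database_size),
    # rearranged so the arithmetic stays exact for integer counts.
    return (hits_reviewed * database_size) / (reviewed * relevant_total)


# Hypothetical example: 50 relevant images among the 200 reviewed search results,
# in a database where 1,000 of 100,000 images are relevant overall.
factor = concentration_factor(hits_reviewed=50, reviewed=200,
                              relevant_total=1_000, database_size=100_000)
print(factor)  # 25.0
```

Under this reading, a factor of 25 means a reviewer finds relevant observations 25 times as often per image inspected as they would by browsing at random.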

[173] PartUV: Part-Based UV Unwrapping of 3D Meshes

Zhaoning Wang, Xinyue Wei, Ruoxi Shi, Xiaoshuai Zhang, Hao Su, Minghua Liu

Main category: cs.CV

TL;DR: PartUV: A part-based UV unwrapping pipeline for AI-generated meshes that generates fewer, part-aligned charts with low distortion using semantic part decomposition and geometric heuristics.

Details

Motivation: Existing UV unwrapping methods struggle with AI-generated meshes that are noisy, bumpy, and poorly conditioned, often producing fragmented charts and suboptimal boundaries that cause artifacts and hinder downstream tasks.

Method: PartUV combines high-level semantic part decomposition (using PartField) with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion stays below a user threshold while minimizing total charts, integrates parameterization/packing algorithms, handles non-manifold/degenerate meshes, and is parallelized for efficiency.

Result: Evaluated across four diverse datasets (man-made, CAD, AI-generated, Common Shapes), PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing.

Conclusion: PartUV provides an effective solution for UV unwrapping AI-generated meshes by leveraging semantic part decomposition to produce fewer, part-aligned charts with controlled distortion, addressing key limitations of existing methods.

Abstract: UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart’s distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
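
The top-down recursion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PartNode`, `select_charts`, and the precomputed `distortion` field are hypothetical stand-ins for the PartField part hierarchy and for the distortion measured after actually parameterizing a part.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartNode:
    name: str
    distortion: float  # assumed: distortion if this whole part became one chart
    children: List["PartNode"] = field(default_factory=list)


def select_charts(node: PartNode, threshold: float) -> List[str]:
    """Keep a part as a single chart if it is under the distortion threshold,
    otherwise recurse into its sub-parts. Leaves are kept as-is in this sketch;
    the paper applies further geometric heuristics at that point."""
    if node.distortion <= threshold or not node.children:
        return [node.name]
    charts: List[str] = []
    for child in node.children:
        charts.extend(select_charts(child, threshold))
    return charts


arm = PartNode("arm", 0.50, [PartNode("upper_arm", 0.10), PartNode("lower_arm", 0.15)])
root = PartNode("figure", 0.90, [PartNode("body", 0.10), arm])
print(select_charts(root, threshold=0.2))  # ['body', 'upper_arm', 'lower_arm']
```

Stopping at the first node that satisfies the threshold is what keeps the chart count low: a part is only split when it cannot be flattened acceptably as a whole.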

[174] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

Janani Annur Thiruvengadam, Kiran Mayee Nabigaru, Anusha Kovi

Main category: cs.CV

TL;DR: A multimodal medical imaging framework for early pancreatic tumor detection using CT scans, combining residual feature aggregation, hybrid metaheuristic feature selection, and a Vision Transformer-EfficientNet hybrid classifier.

Details

Motivation: Early detection of pancreatic neoplasms is challenging due to minimal contrast margins and large anatomical variations in CT scans, requiring systems that enhance subtle visual cues and generalize well across multimodal imaging data.

Method: Proposes Scalable Residual Feature Aggregation (SRFA) framework with preprocessing, MAGRes-UNet segmentation, DenseNet-121 feature extraction, hybrid HHO-BA metaheuristic feature selection, and a Vision Transformer-EfficientNet-B3 hybrid classifier optimized with SSA-GWO dual optimization.

Result: Achieves 96.23% accuracy, 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models.

Conclusion: The SRFA framework demonstrates strong potential as a useful instrument for early pancreatic tumor detection through effective multimodal image analysis.

Abstract: Early detection of pancreatic neoplasms is a major clinical challenge, largely because tumors tend to present with minimal contrast margins and because anatomy varies widely among patients on CT scans. Addressing these difficulties requires an effective and scalable system that enhances the salience of subtle visual cues and generalizes well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework integrates a preprocessing pipeline followed by segmentation with MAGRes-UNet, which makes pancreatic structures more visible and isolates regions of interest. Features are extracted with DenseNet-121 using residual feature storage, so that deep hierarchical features can be aggregated without loss of information. A hybrid HHO-BA metaheuristic feature selection strategy is then used to select and refine the best feature subset. For classification, the system trains a new hybrid model that combines the global attention of the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes hyperparameters to improve robustness and reduce overfitting. In experiments, the proposed model reaches 96.23% accuracy, a 95.58% F1-score, and 94.83% specificity, significantly outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful instrument for the early detection of pancreatic tumors.

Zinan Lv, Yeqian Qian, Chen Sang, Hao Liu, Danping Zou, Ming Yang

Main category: cs.CV

TL;DR: A reinforcement learning framework using Relightable 3D Gaussian Splatting for illumination-invariant UAV navigation in unstructured outdoor environments with zero-shot transfer from simulation to reality.

Details

Motivation: UAV navigation in unstructured outdoor environments faces challenges due to the visual domain gap between simulation and reality, particularly with dynamic lighting conditions that existing methods cannot handle effectively.

Method: Proposes Relightable 3D Gaussian Splatting that decomposes scene components to enable explicit lighting editing, combined with reinforcement learning trained in high-fidelity simulation with diverse synthesized lighting conditions for illumination-invariant feature learning.

Result: Quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s with significant resilience to drastic lighting variations without fine-tuning.

Conclusion: The framework enables effective zero-shot transfer to unstructured outdoors by addressing photometric limitations through physically grounded lighting decomposition and augmentation.

Abstract: UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.

[176] Visualizing the Invisible: Enhancing Radiologist Performance in Breast Mammography via Task-Driven Chromatic Encoding

Hui Ye, Shilong Yang, Chulong Zhang, Yexuan Xing, Juan Yu, Yaoqin Xie, Wei Zhang

Main category: cs.CV

TL;DR: MammoColor is an end-to-end framework with Task-Driven Chromatic Encoding that converts single-channel mammograms into color-enhanced views to improve breast cancer detection in dense breasts.

Details

Motivation: Mammography screening is less sensitive in dense breasts due to tissue overlap and subtle findings that increase perceptual difficulty. There's a need for better visualization techniques to improve detection accuracy.

Method: MammoColor couples a lightweight Task-Driven Chromatic Encoding (TDCE) module with a BI-RADS triage classifier, trained end-to-end on VinDr-Mammo dataset. The TDCE converts single-channel mammograms into chromatic representations optimized for the classification task.

Result: On VinDr-Mammo, MammoColor improved AUC from 0.7669 to 0.8461 (P=0.004), with larger gains in dense breasts (AUC 0.749 to 0.835). In multi-reader studies, TDCE-encoded images improved specificity (0.90 to 0.96) with comparable sensitivity.

Conclusion: TDCE provides a task-optimized chromatic representation that improves perceptual salience and may reduce false-positive recalls in mammography triage, particularly benefiting dense breast screening.

Abstract: Purpose: Mammography screening is less sensitive in dense breasts, where tissue overlap and subtle findings increase perceptual difficulty. We present MammoColor, an end-to-end framework with a Task-Driven Chromatic Encoding (TDCE) module that converts single-channel mammograms into TDCE-encoded views for visual augmentation. Materials and Methods: MammoColor couples a lightweight TDCE module with a BI-RADS triage classifier and was trained end-to-end on VinDr-Mammo. Performance was evaluated on an internal test set, two public datasets (CBIS-DDSM and INBreast), and three external clinical cohorts. We also conducted a multi-reader, multi-case (MRMC) observer study with a washout period, comparing (1) grayscale-only, (2) TDCE-only, and (3) side-by-side grayscale+TDCE. Results: On VinDr-Mammo, MammoColor improved AUC from 0.7669 to 0.8461 (P=0.004). Gains were larger in dense breasts (AUC 0.749 to 0.835). In the MRMC study, TDCE-encoded images improved specificity (0.90 to 0.96; P=0.052) with comparable sensitivity. Conclusion: TDCE provides a task-optimized chromatic representation that may improve perceptual salience and reduce false-positive recalls in mammography triage.
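
The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition. It is a generic illustration with made-up scores, not the study's evaluation code.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of cases,
    which is fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier scores for three positive and three negative cases.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 4))  # 0.8889
```

Eight of the nine positive-negative pairs are ordered correctly here, giving an AUC of 8/9.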

[177] Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi

Main category: cs.CV

TL;DR: VLMs show promise for autonomous driving safety through semantic hazard detection, trajectory planning integration, and natural language behavioral constraints, but require careful system design rather than direct feature injection.

Details

Motivation: Vision-language models offer new opportunities for semantic reasoning in safety-critical autonomous driving, but their effective integration into perception, prediction, and planning pipelines needs systematic investigation.

Method: Three complementary approaches: 1) CLIP-based hazard screening for category-agnostic detection, 2) Integration of scene-level VLM embeddings into transformer-based trajectory planning, 3) Using natural language as explicit behavioral constraints on motion planning.

Result: 1) CLIP-based hazard detection works robustly for diverse hazards without explicit object detection. 2) Naive global embedding conditioning doesn’t improve trajectory accuracy. 3) Natural language constraints suppress severe planning failures and improve safety in ambiguous scenarios.

Conclusion: VLMs hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints, but realizing this requires careful system design and structured grounding rather than direct feature injection.

Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
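
The first use case, hazard screening from image-text similarity, can be sketched as below. The sketch assumes the image and prompt embeddings have already been produced by CLIP's image and text encoders. The function name, the split into hazard and benign prompts, and the temperature value are illustrative assumptions, not details from the paper.

```python
import numpy as np


def _normalize(x):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def hazard_score(image_emb, hazard_embs, benign_embs, temperature=0.01):
    """Softmax over cosine similarities to all prompts; return the probability
    mass assigned to the hazard prompts (a value between 0 and 1)."""
    img = _normalize(np.asarray(image_emb, dtype=float))
    prompts = _normalize(np.vstack([hazard_embs, benign_embs]).astype(float))
    logits = prompts @ img / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[: len(hazard_embs)].sum())


# Toy 2-D embeddings standing in for real CLIP outputs.
hazard_prompts = [[1.0, 0.1]]  # e.g. a prompt describing debris on the road
benign_prompts = [[0.0, 1.0]]  # e.g. a prompt describing a clear road
print(hazard_score([1.0, 0.0], hazard_prompts, benign_prompts) > 0.99)  # True
```

Because the score is a single scalar per frame, it can be thresholded cheaply without running an object detector, which matches the low-latency framing in the abstract.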

[178] Equilibrium contrastive learning for imbalanced image classification

Sumin Roh, Harim Kim, Ho Yun Lee, Il Yong Chun

Main category: cs.CV

TL;DR: ECL (Equilibrium Contrastive Learning) is a supervised contrastive learning framework that addresses geometric imbalances in representation space for imbalanced datasets by harmonizing class features, means, and classifiers.

Details

Motivation: Existing supervised contrastive learning methods for imbalanced datasets have two key limitations: 1) they don't align class means/prototypes with classifiers, leading to poor generalization, and 2) prototype-based methods treat prototypes as just one additional sample per class, causing unbalanced contributions across classes.

Method: ECL uses two main components: 1) promotes representation geometric equilibrium (regular simplex geometry with collapsed class samples and uniformly distributed class means) while balancing contributions of class-average features and prototypes, and 2) establishes classifier-class center geometric equilibrium by aligning classifier weights with class prototypes.

Result: ECL outperforms existing state-of-the-art supervised contrastive learning methods on three long-tailed datasets (CIFAR-10/100-LT, ImageNet-LT) and two imbalanced medical datasets (ISIC 2019 and LCCT dataset).

Conclusion: ECL effectively addresses geometric imbalances in contrastive learning for imbalanced classification by establishing equilibrium between class features, means, and classifiers, leading to improved performance on long-tailed datasets.

Abstract: Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples so that all classes are considered. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes the representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means), while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments with three long-tailed datasets (CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT) and two imbalanced medical datasets (ISIC 2019 and our constructed LCCT dataset). Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.
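
The regular simplex geometry that ECL targets has a simple closed form: K unit-norm class means whose pairwise cosine similarity is exactly -1/(K-1), the most evenly spread arrangement possible. The sketch below constructs such a configuration and checks that property. It illustrates the target geometry only and is not the ECL loss; the function name is an assumption.

```python
import numpy as np


def simplex_class_means(num_classes):
    """Return a (K, K) array whose rows are unit vectors forming a regular simplex."""
    k = num_classes
    centered = np.eye(k) - np.full((k, k), 1.0 / k)  # subtract the mean of the one-hot vectors
    return np.sqrt(k / (k - 1)) * centered           # rescale rows to unit norm


means = simplex_class_means(4)
gram = means @ means.T
print(np.round(gram, 3))  # 1 on the diagonal, -0.333 everywhere else
```

Class imbalance tends to pull learned means away from this arrangement, which is the geometric imbalance the abstract refers to.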

[179] COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception

Shilpa Mukhopadhyay, Amit Roy-Chowdhury, Hang Qiu

Main category: cs.CV

TL;DR: COOPERTRIM: Adaptive feature selection framework for cooperative perception that reduces bandwidth by exploiting temporal continuity to share only dynamic environment features while maintaining performance.

Details

Motivation: Cooperative perception must reconcile limited communication bandwidth with rich sensor data. Existing selection strategies, which share only a subset of features per frame, still strain current wireless technologies. The authors argue for a proactive approach that exploits temporal continuity to identify features capturing environment dynamics and to avoid retransmitting redundant static information.

Method: COOPERTRIM introduces a conformal temporal uncertainty metric to gauge feature relevance and a data-driven mechanism to determine the sharing quantity dynamically. By exploiting temporal awareness, it adapts how much is shared to the complexity of the environment.

Result: Achieves up to 80.28% bandwidth reduction for semantic segmentation and 72.52% for 3D detection while maintaining comparable accuracy. Improves IoU by up to 45.54% with 72% less bandwidth vs other strategies. With compression, reduces bandwidth to 1.46% without compromising IoU.

Conclusion: COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility for real-world deployment of cooperative perception systems.

Abstract: Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other’s live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.
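
The idea of sharing only features that changed over time can be sketched as follows. This is a loose illustration under stated assumptions, not the COOPERTRIM metric: it uses the absolute frame-to-frame change as a stand-in relevance score and a standard split-conformal quantile as the threshold. The function names and toy data are hypothetical.

```python
import math

import numpy as np


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)
    return float(scores[rank - 1])


def select_features_to_share(prev_feat, curr_feat, threshold):
    """Boolean mask of features whose temporal change exceeds the threshold."""
    change = np.abs(np.asarray(curr_feat) - np.asarray(prev_feat))
    return change > threshold


# Calibration: changes observed on frames assumed to be mostly static.
threshold = conformal_threshold(np.linspace(0.1, 1.0, 10), alpha=0.1)
mask = select_features_to_share(np.zeros(4), np.array([0.05, 1.2, 0.0, 2.0]), threshold)
print(threshold, mask.tolist())  # 1.0 [False, True, False, True]
```

In this toy frame only half of the features are transmitted. The number shared rises automatically when more of the scene changes, which mirrors the adaptive sharing quantity described above.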

[180] A Novel Public Dataset for Strawberry (Fragaria x ananassa) Ripeness Detection and Comparative Evaluation of YOLO-Based Models

Mustafa Yurdakul, Zeynep Sena Bastug, Ali Emre Gok, Sakir Taşdemir

Main category: cs.CV

TL;DR: A new publicly available strawberry ripeness dataset with 566 images and 1,201 labeled objects is presented, with YOLO-based object detection models achieving up to 90.94% precision for smart agriculture applications.

Details

Motivation: Traditional visual assessment of strawberry ripeness is subjective and error-prone, and there's a scarcity of comprehensive datasets for computer-assisted systems in this field, hindering study comparisons.

Method: Created a new strawberry ripeness dataset collected under variable light/environmental conditions in two Turkish greenhouses, then conducted comparative tests using YOLOv8, YOLOv9, and YOLO11-based models.

Result: YOLOv9c achieved the highest precision (90.94%), YOLO11s the highest recall (83.74%), and YOLOv8s the best mAP@50 (86.09%). Small and medium-sized models delivered more balanced and efficient performance on this dataset.

Conclusion: The dataset establishes a fundamental reference point for smart agriculture applications, showing that computer vision can effectively assess strawberry ripeness with YOLO-based object detection models.

Abstract: The strawberry (Fragaria x ananassa), known worldwide for its economic value and nutritional richness, is a widely cultivated fruit. Determining the correct ripeness level during the harvest period is crucial both for preventing producer losses and for ensuring that consumers receive a quality product. However, traditional methods, i.e., visual assessments alone, can be subjective and have a high margin of error, so computer-assisted systems are needed. The scarcity of comprehensive, openly accessible datasets in the literature also makes it difficult to compare studies in this field. This study presents a new, publicly available strawberry ripeness dataset consisting of 566 images and 1,201 labeled objects, collected under variable light and environmental conditions in two different greenhouses in Turkey. Comparative tests conducted on the dataset using YOLOv8, YOLOv9, and YOLO11-based models showed that the highest precision was 90.94% with the YOLOv9c model, while the highest recall was 83.74% with the YOLO11s model. In terms of the general performance criterion mAP@50, YOLOv8s was the best-performing model at 86.09%. The results show that small and medium-sized models perform in a more balanced and efficient way on this type of dataset, and they establish a fundamental reference point for smart agriculture applications.
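
The metrics quoted above all rest on box overlap. In mAP@50, a detection counts as a true positive when its intersection over union (IoU) with a ground-truth box of the same class is at least 0.5. The sketch below shows IoU for axis-aligned boxes and the precision and recall formulas. Box coordinates and counts are made up for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if the boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print(round(iou, 4), iou >= 0.5)   # 0.1429 False
print(precision_recall(9, 1, 3))   # (0.9, 0.75)
```

A model can therefore lead on precision while trailing on recall, as YOLOv9c and YOLO11s do here, because the two metrics penalize different kinds of error.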

[181] ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

Main category: cs.CV

TL;DR: ToaSt is a decoupled framework for efficient Vision Transformers that applies specialized compression strategies to different ViT components: structured pruning for attention modules and token channel selection for feed-forward networks.

Details

Motivation: Vision Transformers achieve remarkable success but face prohibitive computational costs. Existing solutions like structured weight pruning suffer from prolonged retraining times, while token compression methods have global propagation issues that create optimization challenges.

Method: Decoupled framework applying specialized strategies to distinct ViT components: 1) Coupled head-wise structured pruning for Multi-Head Self-Attention modules, leveraging attention operation characteristics for robustness; 2) Token Channel Selection (TCS) for Feed-Forward Networks (over 60% of FLOPs) that enhances compression ratios while avoiding global propagation issues.

Result: Extensive evaluations across nine diverse models (DeiT, ViT-MAE, Swin Transformer) show superior accuracy-efficiency trade-offs. On ViT-MAE-Huge: 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. Effective transfer to downstream tasks: 52.2 vs 51.9 mAP on COCO object detection.

Conclusion: ToaSt provides an effective decoupled compression framework for Vision Transformers that addresses limitations of existing methods, achieving state-of-the-art efficiency-accuracy trade-offs across diverse models and tasks.

Abstract: Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
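
Head-wise structured pruning removes whole attention heads, so the query, key, and value columns of a head and the matching rows of the output projection must be dropped together. The sketch below shows that bookkeeping in NumPy. It is one reading of "coupled" pruning for illustration, not the ToaSt algorithm; how heads are chosen is left out, and the function names and weight layout are assumptions.

```python
import numpy as np


def attention(x, w_q, w_k, w_v, w_o, head_dim):
    """Plain multi-head self-attention; heads are contiguous column blocks."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    outs = []
    for s in range(0, q.shape[1], head_dim):
        scores = q[:, s:s + head_dim] @ k[:, s:s + head_dim].T / np.sqrt(head_dim)
        scores -= scores.max(axis=1, keepdims=True)  # stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True)
        outs.append(weights @ v[:, s:s + head_dim])
    return np.concatenate(outs, axis=1) @ w_o


def prune_heads(w_q, w_k, w_v, w_o, head_dim, keep):
    """Drop every head not in `keep`: columns of Q/K/V and rows of the output projection."""
    idx = np.concatenate([np.arange(h * head_dim, (h + 1) * head_dim) for h in sorted(keep)])
    return w_q[:, idx], w_k[:, idx], w_v[:, idx], w_o[idx, :]


rng = np.random.default_rng(0)
d_model, head_dim = 8, 2                      # four heads of width two
x = rng.normal(size=(5, d_model))             # five tokens
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
pruned = prune_heads(w_q, w_k, w_v, w_o, head_dim, keep=[0, 2])
print([w.shape for w in pruned])  # [(8, 4), (8, 4), (8, 4), (4, 8)]
```

Because entire blocks are removed, the pruned layer stays a dense, smaller attention module and needs no sparse kernels. If the removed heads contributed nothing through the output projection, the pruned layer reproduces the original output exactly.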

cs.AI

[182] Towards Efficient Constraint Handling in Neural Solvers for Routing Problems

Jieyi Bi, Zhiguang Cao, Jianan Zhou, Wen Song, Yaoxin Wu, Jie Zhang, Yining Ma, Cathy Wu

Main category: cs.AI

TL;DR: CaR is a constraint-handling framework for neural routing solvers that uses explicit learning-based feasibility refinement through a joint training approach with construction-improvement-shared representations.

Details

Motivation: Neural solvers excel at computational efficiency for simple routing problems but struggle with complex constraints. Current constraint-handling methods (feasibility masking or implicit feasibility awareness) are inefficient or inapplicable for hard constraints.

Method: Construct-and-Refine (CaR) is a framework with joint training that guides the construction module to generate diverse, high-quality solutions suited to a lightweight improvement process (10 steps versus 5k in prior work). It features a construction-improvement-shared representation that enables knowledge sharing across paradigms.

Result: CaR achieves superior feasibility, solution quality, and efficiency compared to both classical and neural state-of-the-art solvers on typical hard routing constraints.

Conclusion: CaR provides the first general and efficient constraint-handling framework for neural routing solvers through explicit learning-based feasibility refinement with shared representations.

Abstract: Neural solvers have achieved impressive progress in addressing simple routing problems, particularly excelling in computational efficiency. However, their advantages under complex constraints remain nascent, for which current constraint-handling schemes via feasibility masking or implicit feasibility awareness can be inefficient or inapplicable for hard constraints. In this paper, we present Construct-and-Refine (CaR), the first general and efficient constraint-handling framework for neural routing solvers based on explicit learning-based feasibility refinement. Unlike prior construction-search hybrids that target reducing optimality gaps through heavy improvements yet still struggle with hard constraints, CaR achieves efficient constraint handling by designing a joint training framework that guides the construction module to generate diverse and high-quality solutions well-suited for a lightweight improvement process, e.g., 10 steps versus 5k steps in prior work. Moreover, CaR presents the first use of construction-improvement-shared representation, enabling potential knowledge sharing across paradigms by unifying the encoder, especially in more complex constrained scenarios. We evaluate CaR on typical hard routing constraints to showcase its broader applicability. Results demonstrate that CaR achieves superior feasibility, solution quality, and efficiency compared to both classical and neural state-of-the-art solvers.

[183] Optimization Instability in Autonomous Agentic Workflows for Clinical Symptom Detection

Cameron Cagan, Pedram Fard, Jiazi Tian, Jingya Cheng, Shawn N. Murphy, Hossein Estiri

Main category: cs.AI

TL;DR: Autonomous prompt optimization systems can paradoxically degrade performance through optimization instability, especially in low-prevalence classification tasks, where retrospective selection outperforms active intervention for stabilization.

Details

Motivation: To investigate failure modes in autonomous agentic workflows, particularly optimization instability where continued autonomous improvement paradoxically degrades classifier performance, using clinical symptom classification as a test case.

Method: Used Pythia framework for automated prompt optimization on three clinical symptoms with varying prevalence (23%, 12%, 3%). Evaluated two interventions: guiding agent (active redirection) and selector agent (retrospective identification of best-performing iteration).

Result: Validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, system achieved 95% accuracy while detecting zero positive cases. Selector agent successfully prevented catastrophic failure and outperformed expert-curated lexicons by 331% (F1) for brain fog detection and 7% for chest pain.

Conclusion: Autonomous AI systems exhibit critical failure modes in low-prevalence classification tasks, and retrospective selection outperforms active intervention for stabilization, demonstrating the importance of proper oversight mechanisms.

Abstract: Autonomous agentic workflows that iteratively refine their own behavior hold considerable promise, yet their failure modes remain poorly characterized. We investigate optimization instability, a phenomenon in which continued autonomous improvement paradoxically degrades classifier performance, using Pythia, an open-source framework for automated prompt optimization. Evaluating three clinical symptoms with varying prevalence (shortness of breath at 23%, chest pain at 12%, and Long COVID brain fog at 3%), we observed that validation sensitivity oscillated between 1.0 and 0.0 across iterations, with severity inversely proportional to class prevalence. At 3% prevalence, the system achieved 95% accuracy while detecting zero positive cases, a failure mode obscured by standard evaluation metrics. We evaluated two interventions: a guiding agent that actively redirected optimization, which amplified overfitting rather than correcting it, and a selector agent that retrospectively identified the best-performing iteration, which successfully prevented catastrophic failure. With selector agent oversight, the system outperformed expert-curated lexicons on brain fog detection by 331% (F1) and chest pain by 7%, despite requiring only a single natural language term as input. These findings characterize a critical failure mode of autonomous AI systems and demonstrate that retrospective selection outperforms active intervention for stabilization in low-prevalence classification tasks.
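The failure mode described above (high accuracy with zero detected positives) and the idea of retrospective selection can be illustrated with a minimal sketch. The cohort, the per-iteration metrics, and the function names below are hypothetical, and choosing the iteration by validation F1 is an assumption; the abstract does not state the selector agent's exact criterion.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """True-positive rate: detected positives divided by all actual positives."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / positives


def select_best_iteration(history, metric="f1"):
    """Retrospective selection: index of the iteration with the best metric."""
    return max(range(len(history)), key=lambda i: history[i][metric])


# Illustrative cohort (not the paper's data): 1,000 notes, 3% positive.
y_true = [1] * 30 + [0] * 970
always_negative = [0] * 1000  # a degenerate classifier that never flags a case

print("accuracy:", accuracy(y_true, always_negative))
print("sensitivity:", sensitivity(y_true, always_negative))

# Hypothetical validation metrics per optimization iteration,
# with sensitivity oscillating between extremes.
history = [
    {"f1": 0.41, "sensitivity": 1.0},
    {"f1": 0.00, "sensitivity": 0.0},
    {"f1": 0.58, "sensitivity": 0.8},
    {"f1": 0.00, "sensitivity": 0.0},
]
best = select_best_iteration(history)
print("selected iteration:", best)
```

Because the last iteration in such a run may be a collapsed one, picking the best past iteration is safer than simply keeping the final prompt.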

[184] How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

Hang Li, Kaiqi Yang, Xianxuan Long, Fedor Filippov, Yucheng Chu, Yasemin Copur-Gencturk, Peng He, Cory Miller, Namsoo Shin, Joseph Krajcik, Hui Liu, Jiliang Tang

Main category: cs.AI

TL;DR: Benchmarking uncertainty quantification methods for LLM-based automatic assessment in education, analyzing uncertainty patterns across datasets, models, and decoding strategies.

Details

Motivation: LLMs are transforming educational assessment but introduce output uncertainty challenges due to their probabilistic nature. Unreliable uncertainty estimates can lead to unstable pedagogical interventions with negative consequences for student learning.

Method: Benchmark a broad range of uncertainty quantification methods in LLM-based automatic assessment. Conduct comprehensive analyses of uncertainty behaviors across multiple assessment datasets, LLM families, and generation control settings.

Result: Characterize uncertainty patterns exhibited by LLMs in grading scenarios, evaluate strengths/limitations of different uncertainty metrics, and analyze influence of model families, assessment tasks, and decoding strategies on uncertainty estimates.

Conclusion: Provides insights into uncertainty characteristics in LLM-based assessment and lays groundwork for developing more reliable uncertainty-aware grading systems in education.

Abstract: The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students’ learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark a broad range of uncertainty quantification methods in the context of LLM-based automatic assessment. Although the effectiveness of these methods has been demonstrated in many tasks across other domains, their applicability and reliability in educational settings, particularly for automatic grading, remain underexplored. Through comprehensive analyses of uncertainty behaviors across multiple assessment datasets, LLM families, and generation control settings, we characterize the uncertainty patterns exhibited by LLMs in grading scenarios. Based on these findings, we evaluate the strengths and limitations of different uncertainty metrics and analyze the influence of key factors, including model families, assessment tasks, and decoding strategies, on uncertainty estimates. Our study provides actionable insights into the characteristics of uncertainty in LLM-based automatic assessment and lays the groundwork for developing more reliable and effective uncertainty-aware grading systems in the future.
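The abstract does not list the specific uncertainty metrics that were benchmarked. As one illustrative example of a sampling-based metric, the sketch below computes the Shannon entropy of grades obtained by querying a grader several times for the same answer. The function name and the sample data are hypothetical.

```python
import math
from collections import Counter


def grade_entropy(sampled_grades):
    """Shannon entropy (in bits) of the empirical grade distribution.

    0.0 means every sample agreed; higher values mean the grader's
    output is more spread out across different grades.
    """
    n = len(sampled_grades)
    counts = Counter(sampled_grades)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical repeated gradings of two student answers.
consistent = [3, 3, 3, 3]
split = [2, 2, 3, 3]

print(grade_entropy(consistent))  # all samples agree
print(grade_entropy(split))       # evenly split between two grades
```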

[185] Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

Amir Hosseinian, MohammadReza Zare Shahneh, Umer Mansoor, Gilbert Szeto, Kirill Karlin, Nima Aghaeepour

Main category: cs.AI

TL;DR: Mirror, an evidence-grounded clinical reasoning system for endocrinology, outperforms frontier LLMs and human reference on board-style exams by integrating curated medical evidence with structured reasoning.

Details

Motivation: Subspecialty clinical reasoning remains challenging for LLMs due to rapidly evolving guidelines and nuanced evidence hierarchies, necessitating specialized systems that can provide evidence-grounded, auditable outputs for clinical deployment.

Method: Developed Mirror system integrating curated endocrinology/cardiometabolic evidence corpus with structured reasoning architecture; evaluated against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on 120-question endocrinology board exam; Mirror operated under closed-evidence constraint while comparators had real-time web access.

Result: Mirror achieved 87.5% accuracy vs human reference 62.3% and frontier LLMs (GPT-5.2: 74.6%, GPT-5: 74.0%, Gemini-3-Pro: 69.8%); on 30 most difficult questions (human accuracy <50%), Mirror achieved 76.7% accuracy; top-2 accuracy 92.5% vs 85.25% for GPT-5.2; 74.2% of outputs cited guideline-tier sources with 100% citation accuracy.

Conclusion: Curated evidence with explicit provenance outperforms unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment, demonstrating the value of evidence-grounded systems over general LLMs in specialized domains.

Abstract: Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
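The reported interval for 105/120 can be checked with a short calculation. The abstract does not say which interval method was used; the sketch below uses the Wilson score interval, which is an assumption that happens to reproduce the reported 80.4-92.3% range.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(105, 120)
print(f"accuracy = {105 / 120:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```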

[186] Improving Interactive In-Context Learning from Natural Language Feedback

Martin Klissarov, Jonathan Cook, Diego Antognini, Hao Sun, Jingling Li, Natasha Jaques, Claudiu Musat, Edward Grefenstette

Main category: cs.AI

TL;DR: A framework for training LLMs to learn interactively from corrective feedback, treating interactive learning as a trainable skill rather than emergent property, with strong generalization across domains.

Details

Motivation: Current LLM training relies on static corpora, overlooking interactive feedback loops essential for dynamic adaptation. Human learning benefits from corrective feedback in collaborative settings, which current models lack.

Method: Proposes a scalable method transforming single-turn verifiable tasks into multi-turn didactic interactions driven by information asymmetry. Trains models to integrate corrective feedback through interactive learning, and enables self-improvement by training models to predict teacher critiques.

Result: Models trained with this approach dramatically improve ability to learn from language feedback. Multi-turn performance of smaller models nearly matches models an order of magnitude larger. Robust out-of-distribution generalization observed across math, coding, puzzles, and maze navigation domains.

Conclusion: Interactive learning can be trained as a distinct skill, enhancing in-context plasticity. The paradigm enables self-improvement by converting external feedback signals into internal self-correction capabilities.

Abstract: Adapting one’s thought process based on corrective feedback is an essential ability in human learning, particularly in collaborative settings. In contrast, the current large language model training paradigm relies heavily on modeling vast, static corpora. While effective for knowledge acquisition, it overlooks the interactive feedback loops essential for models to adapt dynamically to their context. In this work, we propose a framework that treats this interactive in-context learning ability not as an emergent property, but as a distinct, trainable skill. We introduce a scalable method that transforms single-turn verifiable tasks into multi-turn didactic interactions driven by information asymmetry. We first show that current flagship models struggle to integrate corrective feedback on hard reasoning tasks. We then demonstrate that models trained with our approach dramatically improve the ability to interactively learn from language feedback. More specifically, the multi-turn performance of a smaller model nearly reaches that of a model an order of magnitude larger. We also observe robust out-of-distribution generalization: interactive training on math problems transfers to diverse domains like coding, puzzles and maze navigation. Our qualitative analysis suggests that this improvement is due to an enhanced in-context plasticity. Finally, we show that this paradigm offers a unified path to self-improvement. By training the model to predict the teacher’s critiques, effectively modeling the feedback environment, we convert this external signal into an internal capability, allowing the model to self-correct even without a teacher.

[187] GPSBench: Do Large Language Models Understand GPS Coordinates?

Thinh Hung Truong, Jey Han Lau, Jianzhong Qi

Main category: cs.AI

TL;DR: GPSBench: A comprehensive benchmark for evaluating LLMs’ geospatial reasoning capabilities across 17 tasks and 57,800 samples, revealing challenges in coordinate operations and hierarchical geographic knowledge degradation.

Details

Motivation: As LLMs are increasingly deployed in physical world applications (navigation, robotics, mapping), robust geospatial reasoning becomes critical. However, LLMs' ability to reason about GPS coordinates and real-world geography remains underexplored.

Method: Introduces GPSBench dataset with 57,800 samples across 17 tasks covering geometric coordinate operations (distance/bearing computation) and reasoning integrating coordinates with world knowledge. Evaluates 14 state-of-the-art LLMs focusing on intrinsic capabilities rather than tool use.

Result: GPS reasoning remains challenging with substantial variation across tasks: models better at real-world geographic reasoning than geometric computations. Geographic knowledge degrades hierarchically (strong country-level, weak city-level). Coordinate noise robustness suggests genuine understanding rather than memorization. GPS-coordinate augmentation improves downstream geospatial tasks, while finetuning creates trade-offs between geometric computation gains and world knowledge degradation.

Conclusion: Geospatial reasoning is a critical but challenging capability for LLMs in physical world applications. The GPSBench benchmark reveals systematic patterns in LLMs’ geographic understanding and provides a foundation for improving geospatial reasoning capabilities.

Abstract: Large Language Models (LLMs) are increasingly deployed in applications that interact with the physical world, such as navigation, robotics, or mapping, making robust geospatial reasoning a critical capability. Despite that, LLMs’ ability to reason about GPS coordinates and real-world geography remains underexplored. We introduce GPSBench, a dataset of 57,800 samples across 17 tasks for evaluating geospatial reasoning in LLMs, spanning geometric coordinate operations (e.g., distance and bearing computation) and reasoning that integrates coordinates with world knowledge. Focusing on intrinsic model capabilities rather than tool use, we evaluate 14 state-of-the-art LLMs and find that GPS reasoning remains challenging, with substantial variation across tasks: models are generally more reliable at real-world geographic reasoning than at geometric computations. Geographic knowledge degrades hierarchically, with strong country-level performance but weak city-level localization, while robustness to coordinate noise suggests genuine coordinate understanding rather than memorization. We further show that GPS-coordinate augmentation can improve performance on downstream geospatial tasks, and that finetuning induces trade-offs between gains in geometric computation and degradation in world knowledge. Our dataset and reproducible code are available at https://github.com/joey234/gpsbench
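For readers unfamiliar with the geometric operations mentioned above, the sketch below shows the standard haversine distance and initial-bearing formulas on a spherical Earth. This is a generic reference computation, not GPSBench's own code; the function names and the mean-radius constant are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing (0 = north, 90 = east) from point 1 to point 2."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    y = math.sin(dlam) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlam)
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0


# One degree of longitude along the equator, heading due east.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
print(initial_bearing_deg(0.0, 0.0, 0.0, 1.0))
```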

[188] Learning Personalized Agents from Human Feedback

Kaiqu Liang, Julia Kruk, Shengyi Qian, Xianjun Yang, Shengjie Bi, Yuanshun Yao, Shaoliang Nie, Mingyang Zhang, Lijuan Liu, Jaime Fernández Fisac, Shuyan Zhou, Saghar Hosseini

Main category: cs.AI

TL;DR: PAHF framework enables AI agents to learn and adapt to individual user preferences through explicit memory and dual feedback channels, outperforming static approaches in embodied manipulation and online shopping tasks.

Details

Motivation: Current AI agents struggle with personalization for new users and adapting to evolving preferences over time, as they typically rely on static datasets or implicit preference models that don't handle preference drift well.

Method: PAHF uses a three-step continual personalization loop: (1) pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from explicit per-user memory, and (3) integrating post-action feedback to update memory when preferences change.

Result: PAHF learns substantially faster and consistently outperforms no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts in embodied manipulation and online shopping benchmarks.

Conclusion: Integrating explicit memory with dual feedback channels is critical for effective continual personalization, allowing agents to learn from scratch and adapt to changing user preferences in real-time interaction.

Abstract: Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time. We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift. To evaluate this capability, we develop a four-phase protocol and two benchmarks in embodied manipulation and online shopping. These benchmarks quantify an agent’s ability to learn initial preferences from scratch and subsequently adapt to persona shifts. Our theoretical analysis and empirical results show that integrating explicit memory with dual feedback channels is critical: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts.
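The three-step loop above can be sketched in a few lines. This is a toy illustration only: the `UserMemory` class, the slot-keyed dictionary, and both function names are hypothetical and are not taken from the PAHF implementation.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    """Explicit per-user memory: maps a preference slot to the stored choice."""
    prefs: dict = field(default_factory=dict)


def choose_action(slot, memory, ask_user):
    # (1) Pre-action clarification: ask only when memory has nothing for this slot.
    if slot not in memory.prefs:
        memory.prefs[slot] = ask_user(slot)
    # (2) Ground the action in the preference retrieved from memory.
    return memory.prefs[slot]


def integrate_feedback(slot, memory, correction):
    # (3) Post-action feedback: overwrite memory when the user corrects the agent,
    # which is how a drifted preference gets picked up.
    if correction is not None:
        memory.prefs[slot] = correction


memory = UserMemory()
questions = []


def ask_user(slot):
    """Stand-in for a live user; records which clarifications were requested."""
    questions.append(slot)
    return "oat milk"


first = choose_action("milk", memory, ask_user)    # triggers a clarification
second = choose_action("milk", memory, ask_user)   # served from memory
integrate_feedback("milk", memory, "soy milk")     # user preference has shifted
third = choose_action("milk", memory, ask_user)
print(first, second, third, questions)
```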

[189] EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments

Sushant Mehta, Logan Ritchie, Suhaas Garre, Nick Heiner, Edwin Chen

Main category: cs.AI

TL;DR: Training AI agents on high-fidelity enterprise simulation environments produces capabilities that generalize beyond training distribution, with improvements transferring to out-of-distribution benchmarks.

Details

Motivation: To demonstrate that training AI agents on realistic, high-fidelity reinforcement learning environments can produce capabilities that generalize beyond the training distribution, particularly for enterprise applications.

Method: Introduced Corecraft, a fully operational enterprise simulation of a customer support organization with 2,500+ entities across 14 types and 23 unique tools. Trained GLM 4.6 using Group Relative Policy Optimization (GRPO) with adaptive clipping on this environment.

Result: After one epoch of training, task pass rate improved from 25.37% to 36.76% on held-out evaluation tasks. Gains transferred to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on τ²-Bench Retail, and +6.8% on Toolathlon (Pass@1).

Conclusion: Environment quality, diversity, and realism are key factors enabling generalizable agent capabilities. Task-centric world building, expert-authored rubrics, and realistic enterprise workflows contribute to observed transfer learning.

Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI’s suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on τ²-Bench Retail, and +6.8% on Toolathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
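As background on GRPO, its core idea is to score each sampled rollout relative to the other rollouts for the same task, rather than with a learned value function. The sketch below shows that group-relative normalization in its commonly described form. The function name and the rewards are hypothetical, and the paper's adaptive clipping is not specified in the abstract, so it is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group; some variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric rewards for four rollouts of the same task (1 = pass, 0 = fail).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# When every rollout gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```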

[190] Verifiable Semantics for Agent-to-Agent Communication

Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly

Main category: cs.AI

TL;DR: A certification protocol for multiagent AI systems that verifies shared understanding of terms using statistical testing on observable events, reducing disagreement by 51-96% in simulations.

Details

Motivation: Multiagent AI systems need consistent communication but lack methods to verify that agents share the same understanding of terms. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque.

Method: Proposes a certification protocol based on the stimulus-meaning model where agents are tested on shared observable events. Terms are certified if empirical disagreement falls below a statistical threshold. Agents use “core-guarded reasoning” by restricting reasoning to certified terms, with mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation).

Result: In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In validation with fine-tuned language models, disagreement is reduced by 51%.

Conclusion: The framework provides a first step towards verifiable agent-to-agent communication by ensuring shared understanding of terms through statistical certification and guarded reasoning.

Abstract: Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque. We propose a certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold. In this protocol, agents restricting their reasoning to certified terms (“core-guarded reasoning”) achieve provably bounded disagreement. We also outline mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation). In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In a validation with fine-tuned language models, disagreement is reduced by 51%. Our framework provides a first step towards verifiable agent-to-agent communication.
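The certification step can be sketched as a simple statistical test. The abstract says only that a term is certified when empirical disagreement falls below a statistical threshold; the one-sided Hoeffding bound, the threshold `tau`, the confidence parameter `delta`, and all names and counts below are assumptions made for illustration.

```python
import math


def disagreement_upper_bound(disagreements, trials, delta=0.05):
    """Upper confidence bound on the true disagreement rate (Hoeffding inequality).

    With probability at least 1 - delta, the true rate lies below the returned value.
    """
    p_hat = disagreements / trials
    return min(1.0, p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * trials)))


def certify_terms(term_stats, tau=0.10, delta=0.05):
    """Return the terms whose disagreement bound stays at or below the threshold tau."""
    return {
        term
        for term, (disagreements, trials) in term_stats.items()
        if disagreement_upper_bound(disagreements, trials, delta) <= tau
    }


# Hypothetical counts: (times two agents labelled a shared event differently, events tested).
term_stats = {"red": (2, 500), "heavy": (60, 500)}
core_vocabulary = certify_terms(term_stats)
print(core_vocabulary)
```

Under this sketch, core-guarded reasoning would amount to restricting messages to terms in `core_vocabulary`, and recertification to re-running `certify_terms` on fresh counts.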

[191] Revolutionizing Long-Term Memory in AI: New Horizons with High-Capacity and High-Speed Storage

Hiroaki Yamanaka, Daisuke Miyashita, Takashi Toi, Asuka Maki, Taiga Ikeda, Jun Deguchi

Main category: cs.AI

TL;DR: Paper explores alternative memory design approaches for artificial superintelligence, focusing on “store then on-demand extract” paradigm rather than current “extract then store” methods to avoid information loss.

Details

Motivation: Current AI memory systems use "extract then store" approach which risks losing valuable information during extraction. The paper aims to explore alternative memory design concepts essential for achieving artificial superintelligence, particularly approaches that preserve raw experiences for flexible future use.

Method: The paper focuses on conceptual exploration rather than novel methods, examining several alternative approaches: 1) “store then on-demand extract” paradigm that retains raw experiences, 2) discovering insights from probabilistic experiences, and 3) improving experience collection efficiency through sharing. Simple experiments demonstrate the effectiveness of these approaches.

Result: Simple experiments confirm the effectiveness of the proposed alternative memory approaches. The paper identifies major challenges that have limited investigation into these directions and proposes research topics to address them.

Conclusion: Alternative memory design approaches, particularly “store then on-demand extract,” show promise for artificial superintelligence by avoiding information loss inherent in current paradigms. Further research is needed to overcome implementation challenges and fully explore these directions.

Abstract: Driven by our mission of “uplifting the world with memory,” this paper explores the design concept of “memory” that is essential for achieving artificial superintelligence (ASI). Rather than proposing novel methods, we focus on several alternative approaches whose potential benefits are widely imaginable, yet have remained largely unexplored. The currently dominant paradigm, which can be termed “extract then store,” involves extracting information judged to be useful from experiences and saving only the extracted content. However, this approach inherently risks the loss of information, as some valuable knowledge, particularly for different tasks, may be discarded in the extraction process. In contrast, we emphasize the “store then on-demand extract” approach, which seeks to retain raw experiences and flexibly apply them to various tasks as needed, thus avoiding such information loss. In addition, we highlight two further approaches: discovering deeper insights from large collections of probabilistic experiences, and improving experience collection efficiency by sharing stored experiences. These approaches seem intuitively effective, and our simple experiments confirm that this is indeed the case. Finally, we discuss major challenges that have limited investigation into these promising directions and propose research topics to address them.

[192] Causally-Guided Automated Feature Engineering with Multi-Agent Reinforcement Learning

Arun Vignesh Malarkkan, Wangyang Ying, Yanjie Fu

Main category: cs.AI

TL;DR: CAFE reformulates automated feature engineering as a causally-guided sequential decision process using causal discovery and reinforcement learning to create robust features that withstand distribution shifts.

Details

Motivation: Existing automated feature engineering methods rely on statistical heuristics that produce brittle features that fail under distribution shifts. There's a need for more robust feature engineering that incorporates causal understanding to improve generalization.

Method: Two-phase approach: Phase I learns a sparse directed acyclic graph over features and target to obtain soft causal priors, grouping features by causal influence. Phase II uses cascading multi-agent deep Q-learning to select causal groups and transformation operators with hierarchical reward shaping and causal group-level exploration strategies.

Result: Across 15 benchmarks, CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes-to-convergence, delivers competitive time-to-target, reduces performance drop under covariate shifts by ~4x, and produces more compact feature sets with stable post-hoc attributions.

Conclusion: Causal structure used as a soft inductive prior rather than rigid constraint can substantially improve robustness and efficiency of automated feature engineering, making features more resilient to distribution shifts.

Abstract: Automated feature engineering (AFE) enables AI systems to autonomously construct high-utility representations from raw tabular data. However, existing AFE methods rely on statistical heuristics, yielding brittle features that fail under distribution shift. We introduce CAFE, a framework that reformulates AFE as a causally-guided sequential decision process, bridging causal discovery with reinforcement learning-driven feature construction. Phase I learns a sparse directed acyclic graph over features and the target to obtain soft causal priors, grouping features as direct, indirect, or other based on their causal influence with respect to the target. Phase II uses a cascading multi-agent deep Q-learning architecture to select causal groups and transformation operators, with hierarchical reward shaping and causal group-level exploration strategies that favor causally plausible transformations while controlling feature complexity. Across 15 public benchmarks (classification with macro-F1; regression with inverse relative absolute error), CAFE achieves up to 7% improvement over strong AFE baselines, reduces episodes-to-convergence, and delivers competitive time-to-target. Under controlled covariate shifts, CAFE reduces performance drop by ~4x relative to a non-causal multi-agent baseline, and produces more compact feature sets with more stable post-hoc attributions. These findings underscore that causal structure, used as a soft inductive prior rather than a rigid constraint, can substantially improve the robustness and efficiency of automated feature engineering.
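The abstract names "inverse relative absolute error" as the regression metric without giving a formula. One common reading in the automated feature engineering literature is 1 - RAE, where RAE compares the model's absolute error to that of always predicting the mean. The sketch below follows that reading as an assumption; the function name and data are hypothetical.

```python
def one_minus_rae(y_true, y_pred):
    """1 - RAE: 1.0 for perfect predictions, 0.0 for a predict-the-mean baseline."""
    mean_y = sum(y_true) / len(y_true)
    model_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    baseline_error = sum(abs(t - mean_y) for t in y_true)
    if baseline_error == 0:
        raise ValueError("targets are constant; RAE is undefined")
    return 1.0 - model_error / baseline_error


y_true = [1.0, 2.0, 3.0, 4.0]
print(one_minus_rae(y_true, y_true))                # perfect predictions
print(one_minus_rae(y_true, [2.5, 2.5, 2.5, 2.5]))  # mean baseline
print(one_minus_rae(y_true, [1.5, 2.0, 3.0, 3.5]))  # somewhere in between
```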

[193] Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents

Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra

Main category: cs.AI

TL;DR: Proxy State-Based Evaluation: An LLM-driven simulation framework for evaluating interactive LLM agents without deterministic backends, using proxy state tracking and LLM judges for reliable automated assessment.

Details

Motivation: Current agentic benchmarks require costly deterministic backends that are hard to build and iterate. There's a need for scalable, practical evaluation frameworks for industrial LLM agents that can provide reliable comparisons and yield training data.

Method: Proposes Proxy State-Based Evaluation where scenarios specify user goals, facts, expected final state, and behavior. An LLM state tracker infers structured proxy state from interaction traces, and LLM judges verify goal completion and detect hallucinations against scenario constraints.

Result: Produces stable, model-differentiating rankings across model families and reasoning efforts. Provides on-/off-policy rollouts with transfer to unseen scenarios. Achieves near-zero simulator hallucination rates and over 90% human-LLM judge agreement for reliable automated evaluation.

Conclusion: Proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents, enabling reliable automated assessment without costly deterministic backends.

Abstract: Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating rankings across families and inference-time reasoning efforts, and its on-/off-policy rollouts provide supervision that transfers to unseen scenarios. Careful scenario specification yields near-zero simulator hallucination rates as supported by ablation studies. The framework also supports sensitivity analyses over user personas. Human-LLM judge agreement exceeds 90%, indicating reliable automated evaluation. Overall, proxy state-based evaluation offers a practical, scalable alternative to deterministic agentic benchmarks for industrial LLM agents.

[194] Multi-agent cooperation through in-context co-player inference

Marissa A. Weis, Maciej Wołczyk, Rajai Nasser, Rif A. Saurous, Blaise Agüera y Arcas, João Sacramento, Alexander Meulemans

Main category: cs.AI

TL;DR: Sequence model agents trained against diverse co-players develop in-context best-response strategies that naturally lead to cooperative behavior through mutual shaping dynamics, without requiring hardcoded assumptions or explicit timescale separation.

Details

Motivation: Existing approaches to inducing cooperation in multi-agent reinforcement learning rely on hardcoded assumptions about co-player learning rules or enforce strict separation between naive learners and meta-learners. The paper aims to demonstrate that sequence models' in-context learning capabilities can achieve learning awareness without these limitations.

Method: Train sequence model agents against a diverse distribution of co-players using standard decentralized reinforcement learning. This approach leverages the models’ in-context learning capabilities to develop best-response strategies that function as learning algorithms on fast intra-episode timescales.

Result: The training naturally induces in-context best-response strategies, and the cooperative mechanism from prior work (where vulnerability to extortion drives mutual shaping) emerges naturally. In-context adaptation makes agents vulnerable to extortion, and mutual pressure to shape opponents’ in-context learning dynamics leads to cooperative behavior.

Conclusion: Standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors, as sequence models’ in-context learning capabilities enable learning awareness without hardcoded assumptions or explicit timescale separation.

Abstract: Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between “learning-aware” agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between “naive learners” updating on fast timescales and “meta-learners” observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent’s in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.

[195] Leveraging Large Language Models for Causal Discovery: a Constraint-based, Argumentation-driven Approach

Zihao Li, Fabrizio Russo

Main category: cs.AI

TL;DR: LLMs used as imperfect experts to provide semantic structural priors for causal discovery, integrated with statistical evidence through Causal ABA framework, achieving SOTA performance with evaluation protocol to mitigate memorization bias.

Details

Motivation: Causal discovery traditionally requires expert knowledge to construct principled causal graphs. While statistical methods exist, they often lack ways to effectively combine data with expertise. The paper explores using LLMs as sources of semantic knowledge about variable relationships to enhance causal discovery.

Method: Uses Causal Assumption-based Argumentation (ABA) framework to integrate LLM-generated semantic structural priors (from variable names/descriptions) with conditional-independence evidence from observational data. LLMs act as imperfect experts providing causal constraints.

Result: State-of-the-art performance on standard benchmarks and semantically grounded synthetic graphs. Introduces evaluation protocol to mitigate LLM memorization bias when assessing causal discovery capabilities.

Conclusion: LLMs can effectively serve as imperfect experts for causal discovery by providing semantic priors, and the ABA framework offers principled integration of these priors with statistical evidence, advancing causal discovery methods.

Abstract: Causal discovery seeks to uncover causal relations from data, typically represented as causal graphs, and is essential for predicting the effects of interventions. While expert knowledge is required to construct principled causal graphs, many statistical methods have been proposed to leverage observational data with varying formal guarantees. Causal Assumption-based Argumentation (ABA) is a framework that uses symbolic reasoning to ensure correspondence between input constraints and output graphs, while offering a principled way to combine data and expertise. We explore the use of large language models (LLMs) as imperfect experts for Causal ABA, eliciting semantic structural priors from variable names and descriptions and integrating them with conditional-independence evidence. Experiments on standard benchmarks and semantically grounded synthetic graphs demonstrate state-of-the-art performance, and we additionally introduce an evaluation protocol to mitigate memorisation bias when assessing LLMs for causal discovery.

[196] Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs

Felix Fricke, Simon Malberg, Georg Groh

Main category: cs.AI

TL;DR: FoT is a general-purpose framework for building and optimizing dynamic reasoning schemes for LLMs, addressing limitations of static prompting approaches through built-in optimization features.

Details

Motivation: Existing prompting schemes like Chain of Thought, Tree of Thoughts, and Graph of Thoughts have limitations: they require static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problems, and they are under-optimized in terms of hyperparameters, prompts, runtime, and cost.

Method: Introduces Framework of Thoughts (FoT), a general-purpose foundation framework with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching to optimize reasoning schemes.

Result: FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. The framework was demonstrated by implementing three popular schemes (Tree of Thoughts, Graph of Thoughts, and ProbTree) within FoT.

Conclusion: FoT provides a flexible framework for developing dynamic and efficient reasoning schemes for LLMs, addressing key limitations of existing prompting approaches through systematic optimization.

Abstract: Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT), a general-purpose foundation framework for building and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes (Tree of Thoughts, Graph of Thoughts, and ProbTree) within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.
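
To make the runtime and cost levers concrete, here is a minimal sketch of how parallel execution and caching might combine when scoring one level of a thought tree. It does not use FoT's actual API, which the abstract does not describe. `score_thought` is a hypothetical stand-in for an LLM call, and the caching here is plain exact-match memoization from the Python standard library.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

calls = []  # records every uncached call to the stand-in model

@lru_cache(maxsize=None)
def score_thought(thought: str) -> int:
    """Hypothetical scorer; a real system would query an LLM here."""
    calls.append(thought)
    return len(thought)  # placeholder score: the length of the thought

def expand_level(thoughts):
    """Score all thoughts at one tree level concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(score_thought, thoughts))

level = ["add 2 and 3", "multiply 2 by 3", "subtract 3 from 2"]
first = expand_level(level)   # three uncached calls, run in parallel
second = expand_level(level)  # served entirely from the cache
```

The second pass costs no model calls, which is the kind of saving a cache would give when a reasoning scheme revisits identical prompts.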

[197] Creating a digital poet

Vered Tohar, Tsahi Hayat, Amir Leshem

Main category: cs.AI

TL;DR: A large language model was shaped into a digital poet through iterative expert feedback without retraining, producing poems indistinguishable from human poetry in blinded tests.

Details

Motivation: To explore whether machines can create good poetry and understand the nature of creativity and authorship in AI-generated art.

Method: Seven-month poetry workshop using iterative in-context expert feedback on a large language model without retraining, followed by quantitative/qualitative analysis and blinded authorship tests.

Result: The model developed a distinctive style, coherent corpus, pen name, and author image. In blinded tests (50 participants, 3 AI vs 3 human poems), judgments were at chance (human poems labeled human 54%, AI poems 52%). A commercial publisher released the model’s poetry collection.

Conclusion: Workshop-style prompting enables long-horizon creative shaping of AI, renewing debates on creativity and authorship in AI-generated art.

Abstract: Can a machine write good poetry? Any positive answer raises fundamental questions about the nature and value of art. We report a seven-month poetry workshop in which a large language model was shaped into a digital poet through iterative in-context expert feedback, without retraining. Across sessions, the model developed a distinctive style and a coherent corpus, supported by quantitative and qualitative analyses, and it produced a pen name and author image. In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals including 50%. After the workshop, a commercial publisher released a poetry collection authored by the model. These results show that workshop-style prompting can support long-horizon creative shaping and renew debates on creativity and authorship.
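
The "at chance" claim can be sanity-checked with a back-of-the-envelope interval. The sketch below assumes 150 judgments per class (50 participants times 3 poems) and treats them as independent, which ignores clustering within participants. The paper's own interval method is not stated, so the Wilson score interval used here is a substitute, and the counts 81 and 78 are rounded reconstructions of 54% and 52%.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: 54% and 52% of 150 judgments labeled "human".
human_lo, human_hi = wilson_interval(81, 150)
ai_lo, ai_hi = wilson_interval(78, 150)

print(f"human poems labeled human: [{human_lo:.3f}, {human_hi:.3f}]")
print(f"AI poems labeled human:    [{ai_lo:.3f}, {ai_hi:.3f}]")
```

Under these assumptions both intervals straddle 0.5, consistent with the abstract's statement that the confidence intervals include 50%.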

[198] Agent Skill Framework: Perspectives on the Potential of Small Language Models in Industrial Environments

Yangjie Xu, Lujun Li, Lama Sleem, Niccolo Gentile, Yewei Song, Yiqun Wang, Siming Ji, Wenbo Wu, Radu State

Main category: cs.AI

TL;DR: Agent Skill framework benefits small language models (SLMs) in industrial settings where proprietary models are infeasible, with moderate-sized SLMs (12B-30B parameters) showing substantial improvements and code-specialized variants achieving competitive performance.

Details

Motivation: The Agent Skill framework works well with proprietary models but its effectiveness with small language models (SLMs) is unknown. This matters for industrial scenarios where continuous reliance on public APIs is infeasible due to data security and budget constraints, and where SLMs often show limited generalization in customized scenarios.

Method: Introduces a formal mathematical definition of the Agent Skill process, followed by systematic evaluation of language models of varying sizes across multiple use cases including two open-source tasks and a real-world insurance claims dataset.

Result: Tiny models struggle with reliable skill selection, while moderately sized SLMs (12B-30B parameters) benefit substantially from the Agent Skill approach. Code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency.

Conclusion: The findings provide comprehensive characterization of Agent Skill framework capabilities and constraints, offering actionable insights for effective deployment in SLM-centered environments where proprietary models are not feasible.

Abstract: The Agent Skill framework, now widely and officially supported by major players such as GitHub Copilot, LangChain, and OpenAI, performs especially well with proprietary models by improving context engineering, reducing hallucinations, and boosting task accuracy. Based on these observations, an investigation is conducted to determine whether the Agent Skill paradigm provides similar benefits to small language models (SLMs). This question matters in industrial scenarios where continuous reliance on public APIs is infeasible due to data-security and budget constraints, and where SLMs often show limited generalization in highly customized scenarios. This work introduces a formal mathematical definition of the Agent Skill process, followed by a systematic evaluation of language models of varying sizes across multiple use cases. The evaluation encompasses two open-source tasks and a real-world insurance claims data set. The results show that tiny models struggle with reliable skill selection, while moderately sized SLMs (approximately 12B-30B parameters) benefit substantially from the Agent Skill approach. Moreover, code-specialized variants at around 80B parameters achieve performance comparable to closed-source baselines while improving GPU efficiency. Collectively, these findings provide a comprehensive and nuanced characterization of the capabilities and constraints of the framework, while providing actionable insights for the effective deployment of Agent Skills in SLM-centered environments.

[199] Towards a Science of AI Agent Reliability

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

Main category: cs.AI

TL;DR: Proposes 12 metrics across 4 dimensions (consistency, robustness, predictability, safety) to evaluate AI agent reliability beyond single success scores, revealing persistent limitations despite capability gains.

Details

Motivation: Current AI agent evaluations focus on single success metrics that obscure critical operational flaws. Agents often fail in practice despite good benchmark scores, highlighting the need for more comprehensive reliability assessment.

Method: Develops a holistic performance profile with 12 concrete metrics across four dimensions: consistency (behavior across runs), robustness (withstand perturbations), predictability (failure patterns), and safety (error severity). Evaluates 14 agentic models across two complementary benchmarks.

Result: Recent capability gains have yielded only small improvements in reliability. The metrics expose persistent limitations in agent behavior that traditional evaluations miss, showing how agents perform, degrade, and fail in practice.

Conclusion: The proposed reliability metrics complement traditional evaluations by providing tools to reason about agent performance degradation and failure modes, offering a more comprehensive assessment framework grounded in safety-critical engineering principles.

Abstract: AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents continue to fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
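
A toy example shows why a single success score can hide inconsistency across runs. The two quantities below are illustrative only and are not claimed to be among the paper's twelve metrics. The task names and outcomes are made up.

```python
def mean_success(runs):
    """Average success over every (task, run) pair: the usual single score."""
    total = sum(len(outcomes) for outcomes in runs.values())
    return sum(sum(outcomes) for outcomes in runs.values()) / total

def all_runs_success(runs):
    """Fraction of tasks solved in every repeated run (a strict consistency view)."""
    return sum(all(outcomes) for outcomes in runs.values()) / len(runs)

# Hypothetical outcomes: four tasks, three repeated runs each.
runs = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [False, True, True],
    "t4": [True, True, False],
}

accuracy = mean_success(runs)         # 9 successes out of 12 attempts
consistency = all_runs_success(runs)  # only t1 succeeds in all three runs
```

Here the agent scores 75% on average yet solves only a quarter of the tasks reliably, the kind of gap that a per-dimension reliability profile is meant to expose.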

[200] A Review of Fairness and A Practical Guide to Selecting Context-Appropriate Fairness Metrics in Machine Learning

Caleb J. S. Barr, Olivia Erdelyi, Paul D. Docherty, Randolph C. Grace

Main category: cs.AI

TL;DR: A framework for selecting context-appropriate fairness measures in machine learning, with a flowchart based on 12 criteria to help stakeholders address fairness concerns and comply with regulations.

Details

Motivation: The challenge of defining appropriate fairness measures due to philosophical, cultural, and political contexts, combined with increasing regulatory requirements and the insufficiency of single fairness metrics for complex machine learning models.

Method: Developed a flowchart with 12 criteria to guide selection of contextually appropriate fairness measures, considering model assessment criteria, model selection criteria, and data bias. Also reviewed fairness literature in ML context and linked to core regulatory instruments.

Result: Created a practical framework to assist policymakers, AI developers, researchers, and stakeholders in appropriately addressing fairness concerns and complying with relevant regulatory requirements.

Conclusion: A context-aware approach to fairness measurement is essential given the complexity of biases in ML models and increasing regulatory pressures, requiring flexible frameworks rather than single metrics.

Abstract: Recent regulatory proposals for artificial intelligence emphasize fairness requirements for machine learning models. However, precisely defining the appropriate measure of fairness is challenging due to philosophical, cultural and political contexts. Biases can infiltrate machine learning models in complex ways depending on the model’s context, rendering a single common metric of fairness insufficient. This ambiguity highlights the need for criteria to guide the selection of context-aware measures, an issue of increasing importance given the proliferation of ever tighter regulatory requirements. To address this, we developed a flowchart to guide the selection of contextually appropriate fairness measures. Twelve criteria were used to formulate the flowchart. This included consideration of model assessment criteria, model selection criteria, and data bias. We also review fairness literature in the context of machine learning and link it to core regulatory instruments to assist policymakers, AI developers, researchers, and other stakeholders in appropriately addressing fairness concerns and complying with relevant regulatory requirements.

[201] Scalable Precise Computation of Shannon Entropy

Yong Lai, Haolong Tong, Zhenghang Xu, Minghao Yin

Main category: cs.AI

TL;DR: PSE is a scalable precise tool for computing Shannon entropy in quantitative information flow analysis, using optimized Boolean constraint modeling with ADDAND knowledge compilation and model counting optimizations.

Details

Motivation: Quantitative information flow analysis needs scalable precise methods for Shannon entropy computation, as existing tools like EntropyEstimation have limitations in precision and efficiency for Boolean constraint-based programs.

Method: Two-stage optimization: 1) Design ADDAND knowledge compilation language combining Algebraic Decision Diagrams and conjunctive decomposition to avoid output enumeration; 2) Optimize model counting queries for probability computation.

Result: PSE solved 56 more benchmarks than EntropyEstimation out of 459 total, and was at least 10× more efficient for 98% of benchmarks both tools solved.

Conclusion: PSE demonstrates significant improvements in scalability and efficiency for precise Shannon entropy computation in quantitative information flow analysis.

Abstract: Quantitative information flow analyses (QIF) are a class of techniques for measuring the amount of confidential information leaked by a program to its public outputs. Shannon entropy is an important method to quantify the amount of leakage in QIF. This paper focuses on programs modeled as Boolean constraints and optimizes the two stages of the Shannon entropy computation to implement a scalable precise tool, PSE. In the first stage, we design a knowledge compilation language called ADDAND that combines Algebraic Decision Diagrams and conjunctive decomposition. ADDAND avoids enumerating possible outputs of a program and supports tractable entropy computation. In the second stage, we optimize the model counting queries that are used to compute the probabilities of outputs. We compare PSE with the state-of-the-art probabilistic approximately correct tool EntropyEstimation, which was shown to significantly outperform the previous precise tools. The experimental results demonstrate that PSE solved 56 more benchmarks than EntropyEstimation out of a total of 459. For 98% of the benchmarks that both PSE and EntropyEstimation solved, PSE is at least 10× as efficient as EntropyEstimation.
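
For intuition about the quantity being computed, the sketch below evaluates the Shannon entropy of a program's output by brute force: it enumerates every input, counts how often each output occurs, and applies H = -Σ p log2 p. This is the naive enumeration that a tool like PSE is designed to avoid, not PSE's algorithm. It assumes inputs are uniformly distributed, and the example program `leaky_check` is hypothetical.

```python
import math
from collections import Counter
from itertools import product

def output_entropy(program, n_inputs):
    """Shannon entropy (in bits) of program's output under uniform Boolean inputs.

    Brute force: cost grows as 2**n_inputs, so this only works for tiny programs.
    """
    counts = Counter(program(bits) for bits in product((0, 1), repeat=n_inputs))
    total = 2 ** n_inputs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def leaky_check(bits):
    """Hypothetical program: reveals whether the first two secret bits are both 1."""
    return bits[0] & bits[1]

def identity(bits):
    """Leaks everything: the output is the full input."""
    return bits

h_check = output_entropy(leaky_check, 3)  # output is 1 with probability 1/4
h_full = output_entropy(identity, 3)      # eight equally likely outputs
```

For a deterministic program, the output entropy equals the information leaked about the input, so `leaky_check` leaks under one bit while `identity` leaks all three.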

[202] SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis

Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin

Main category: cs.AI

TL;DR: SurgRAW introduces a clinical Chain-of-Thought agentic workflow for zero-shot multi-task reasoning in robotic-assisted surgery, addressing domain gaps and hallucinations in VLMs through hierarchical reasoning and surgical knowledge augmentation.

Details

Motivation: Robotic-assisted surgery needs intelligent systems with unified scene understanding, but current surgical AI uses isolated task-specific models with limited interpretability. General VLMs struggle with hallucinations, domain gaps, and weak task-interdependency modeling in surgical contexts.

Method: Proposes SurgRAW: a hierarchical reasoning workflow with orchestrator dividing surgical scene understanding into reasoning streams, specialized agents for task-level reasoning, panel discussion mechanism for agent collaboration, retrieval-augmented generation for surgical knowledge, and task-specific CoT prompts grounded in surgical domain.

Result: SurgRAW surpasses mainstream VLMs and agentic systems, outperforming a supervised model by 14.61% accuracy on the SurgCoTBench benchmark with 14256 QA pairs across five major surgical tasks.

Conclusion: The proposed agentic workflow enables clinically aligned zero-shot multi-task reasoning in surgery, addressing domain gaps and hallucinations while enhancing interpretability through surgical knowledge integration and hierarchical reasoning.

Abstract: Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning, but struggle with hallucinations, domain gaps and weak task-interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain-of-Thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or ground output clinically. Specifically, we propose a panel discussion mechanism to ensure task-specific agents collaborate synergistically and leverage task interdependencies. Similarly, we incorporate a retrieval-augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. Dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git.

[203] Large Language Models for Water Distribution Systems Modeling and Decision-Making

Yinon Goldshtein, Gal Perelman, Assaf Schuster, Avi Ostfeld

Main category: cs.AI

TL;DR: LLM-EPANET is an agent-based framework that enables natural language interaction with EPANET water distribution system simulator using LLMs for code generation and simulation execution.

Details

Motivation: To make computational tools like EPANET more accessible by overcoming technical and expertise barriers in water distribution system management through natural language interfaces.

Method: Combines retrieval-augmented generation and multi-agent orchestration to translate user queries into executable EPANET code, run simulations, and return structured results.

Result: Achieved 56-81% accuracy overall on 69 benchmark queries, with over 90% accuracy for simpler queries, demonstrating LLMs can effectively support water system modeling tasks.

Conclusion: LLM-based modeling has potential to democratize data-driven decision-making in water sector through transparent, interactive AI interfaces.

Abstract: The integration of Large Language Models (LLMs) into engineering workflows presents new opportunities for making computational tools more accessible, especially where such tools remain underutilized due to technical or expertise barriers, such as water distribution system (WDS) management. This study introduces LLM-EPANET, an agent-based framework that enables natural language interaction with EPANET, the benchmark WDS simulator. The framework combines retrieval-augmented generation and multi-agent orchestration to automatically translate user queries into executable code, run simulations, and return structured results. A curated set of 69 benchmark queries is introduced to evaluate performance across state-of-the-art LLMs. Results show that LLMs can effectively support a wide range of modeling tasks, achieving 56-81% accuracy overall, and over 90% for simpler queries. These findings highlight the potential of LLM-based modeling to democratize data-driven decision-making in the water sector through transparent, interactive AI interfaces. The framework code and benchmark queries are shared as an open resource: https://github.com/yinon-gold/LLMs-in-WDS-Modeling.

[204] EconEvals: Benchmarks and Litmus Tests for Economic Decision-Making by LLM Agents

Sara Fish, Julia Shephard, Minkai Li, Ran I. Shorrer, Yannai A. Gonczarowski

Main category: cs.AI

TL;DR: Developed evaluation methods for LLM economic decision-making using benchmarks from procurement, scheduling, pricing and litmus tests for multi-objective tradeoffs.

Details

Motivation: To measure economic decision-making capabilities and tendencies of LLMs as they become integrated into economic applications, requiring systematic evaluation frameworks.

Method: Created two approaches: 1) Economics benchmarks testing LLM ability to learn from environment in context, 2) Litmus tests quantifying choice behavior on stylized tasks with conflicting objectives, producing litmus, reliability, and competency scores.

Result: Evaluated frontier LLMs to track capability changes over time, derived economic insights from choice behavior and chain-of-thought, validated framework through self-consistency, robustness, and generalizability tests.

Conclusion: Provides foundational evaluation framework for LLM agents in economic decision-making, enabling systematic assessment of their capabilities and tendencies.

Abstract: We develop evaluation methods for measuring the economic decision-making capabilities and tendencies of LLMs. First, we develop benchmarks derived from key problems in economics – procurement, scheduling, and pricing – that test an LLM’s ability to learn from the environment in context. Second, we develop the framework of litmus tests, evaluations that quantify an LLM’s choice behavior on a stylized decision-making task with multiple conflicting objectives. Each litmus test outputs a litmus score, which quantifies an LLM’s tradeoff response, a reliability score, which measures the coherence of an LLM’s choice behavior, and a competency score, which measures an LLM’s capability at the same task when the conflicting objectives are replaced by a single, well-specified objective. Evaluating a broad array of frontier LLMs, we (1) investigate changes in LLM capabilities and tendencies over time, (2) derive economically meaningful insights from the LLMs’ choice behavior and chain-of-thought, and (3) validate our litmus test framework by testing self-consistency, robustness, and generalizability. Overall, this work provides a foundation for evaluating LLM agents as they are further integrated into economic decision-making.

[205] GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning

Jie Peng, Jiarui Ji, Runlin Lei, Zhewei Wei, Yongchao Liu, Chuntao Hong

Main category: cs.AI

TL;DR: Proposes GDGB, a benchmark for generative tasks on dynamic text-attributed graphs with high-quality textual features, introducing two novel generation tasks and evaluation metrics.

Details

Motivation: Existing DyTAG datasets have poor textual quality and focus mainly on discriminative tasks, lacking standardized formulations and evaluation for generative DyTAG tasks requiring semantically rich inputs.

Method: Creates GDGB with eight curated DyTAG datasets with high-quality textual features, defines TDGG and IDGG generation tasks, designs multifaceted evaluation metrics, and proposes GAG-General, an LLM-based multi-agent generative framework.

Result: GDGB enables rigorous evaluation of DyTAG generation tasks, revealing critical interplay between structural and textual features in generation quality.

Conclusion: GDGB serves as a foundational resource for advancing generative DyTAG research and enabling practical applications in dynamic graph generation.

Abstract: Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for generative DyTAG tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. The dataset and source code are available at https://github.com/Lucas-PJ/GDGB-ALGO.

Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman, Jacob Andreas, Joshua Tenenbaum

Main category: cs.AI

TL;DR: A computational framework for social learning that integrates linguistic guidance with direct experience through joint probabilistic inference over executable world models, enabling AI agents to generate and interpret advice like humans.

Details

Motivation: To understand how humans combine linguistic guidance from others with direct experience for safe and rapid learning, and to develop AI systems that can similarly integrate these knowledge sources through social learning mechanisms.

Method: Developed a computational framework modeling social learning as joint probabilistic inference over structured, executable world models. Used a pretrained language model as a probabilistic model of human advice-sharing, enabling agents to both generate advice and interpret linguistic input as evidence during Bayesian inference.

Result: Linguistic guidance shaped exploration and accelerated learning by reducing risky interactions and speeding up key discoveries in both humans and models across 10 video games. Knowledge accumulated across generations through iterated learning, and successful knowledge transfer occurred between humans and models.

Conclusion: Structured, language-compatible representations enable effective human-machine collaborative learning, demonstrating how linguistic guidance can be integrated with direct experience for accelerated and safer learning in both biological and artificial systems.

Abstract: The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models – revealing how structured, language-compatible representations might enable human-machine collaborative learning.
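
The joint inference described above can be illustrated with a toy discrete example. This is a minimal sketch, not the authors' implementation: the two candidate world models, all probabilities, and the `posterior` helper are hypothetical, and it assumes experience and advice are conditionally independent given the world model. In the paper the advice likelihood comes from a pretrained language model; here it is a hard-coded table.

```python
def posterior(prior, experience_lik, advice_lik):
    """Bayes' rule over candidate world models with two evidence sources."""
    unnorm = {m: prior[m] * experience_lik[m] * advice_lik[m] for m in prior}
    total = sum(unnorm.values())
    return {m: v / total for m, v in unnorm.items()}

# Hypothetical candidate world models for one game object.
prior = {"spikes_hurt": 0.5, "spikes_safe": 0.5}
# P(observed play | model): direct experience is uninformative so far (toy values).
experience_lik = {"spikes_hurt": 0.5, "spikes_safe": 0.5}
# P(advice "avoid the spikes" | model): stand-in for a language-model score (toy values).
advice_lik = {"spikes_hurt": 0.9, "spikes_safe": 0.1}

post = posterior(prior, experience_lik, advice_lik)
print(post)
```

In this toy case the advice alone shifts belief toward the risky hypothesis before the agent has touched the object, which is the sense in which linguistic guidance can reduce risky interactions.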

[207] TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models

Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan

Main category: cs.AI

TL;DR: The paper introduces TSR-Suite, a suite of time series reasoning tasks, and TimeOmni-1, a unified time series reasoning model that shows strong out-of-distribution generalization and outperforms GPT-4.1 on causality discovery.

Details

Motivation: Existing multimodal time series datasets lack genuine reasoning depth, focusing only on surface alignment and QA. There's a need for well-defined reasoning tasks and high-quality data to advance practical time series reasoning models.

Method: Introduces TSR-Suite, which formalizes four atomic tasks spanning perception, extrapolation, and decision-making and contains more than 23K samples (2.3K curated through human-guided annotation). TimeOmni-1 is trained in multiple stages with a mixture of task scenarios, novel reward functions, and tailored optimizations.

Result: TimeOmni-1 shows strong out-of-distribution generalization, improves causality discovery accuracy (64.0% vs 35.9% with GPT-4.1), and raises valid response rate by over 6% on event-aware forecasting compared to GPT-4.1.

Conclusion: TSR-Suite provides the first comprehensive framework for time series reasoning evaluation and training, while TimeOmni-1 demonstrates practical reasoning capabilities across diverse real-world problems.

Abstract: Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.

[208] Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang

Main category: cs.AI

TL;DR: A method for precise attribute intensity control in LLMs using target-reaching formulation, temporal-difference learning for value functions, and gradient-based interventions on hidden representations.

Details

Motivation: Current LLM alignment methods only provide directional or open-ended guidance, failing to achieve exact attribute intensities needed for AI systems adaptable to diverse user expectations.

Method: Three key designs: (1) reformulating precise attribute intensity control as target-reaching problem, (2) training lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, (3) employing gradient-based interventions on hidden representations to navigate toward specific targets.

Result: Experiments on LLaMA-3.2-3b and Phi-4-mini confirm ability to steer text generation to user-specified attribute intensities with high accuracy. Efficiency enhancements demonstrated across preference data synthesis, Pareto frontier approximation/optimization, and distillation of aligned behaviors.

Conclusion: Method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment, with applications to downstream tasks requiring precise attribute control.

Abstract: Precise attribute intensity control–generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities–is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method’s ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control
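
The target-reaching idea in designs (1) and (3) can be sketched as gradient descent on a hidden vector so that a value function's prediction moves toward a chosen target rather than being maximized. This is a hedged illustration only: the linear value function, its weights `w` and `b`, the learning rate, and the helper names are all assumptions, and the temporal-difference training of the value function is not shown.

```python
def value(h, w, b):
    """Hypothetical linear value function: predicted attribute intensity."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def steer(h, w, b, target, lr=0.1, steps=50):
    """Minimize (value(h) - target)^2 by gradient descent on h."""
    h = list(h)
    for _ in range(steps):
        err = value(h, w, b) - target
        # d/dh of err^2 is 2 * err * w for a linear value function.
        h = [hi - lr * 2.0 * err * wi for hi, wi in zip(h, w)]
    return h

w, b = [0.5, -0.5, 1.0], 0.0   # toy weights, not learned
hidden = [1.0, 2.0, 3.0]       # toy hidden representation
steered = steer(hidden, w, b, target=4.0)
print(value(hidden, w, b), value(steered, w, b))
```

Because the loss is a squared distance to the target, the same routine can push the predicted intensity up or down, which is what distinguishes target reaching from plain maximization.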

Aaron Bell, Amit Aides, Amr Helmy, Arbaaz Muslim, Aviad Barzilai, Aviv Slobodkin, Bolous Jaber, David Schottlander, George Leifman, Joydeep Paul, Mimi Sun, Nadav Sherman, Natalie Williams, Per Bjornsson, Roy Lee, Ruth Alcantara, Thomas Turnbull, Tomer Shekel, Vered Silverman, Yotam Gigi, Adam Boulanger, Alex Ottenwess, Ali Ahmadalipour, Anna Carter, Behzad Vahedi, Charles Elliott, David Andre, Elad Aharoni, Gia Jung, Hassler Thurston, Jacob Bien, Jamie McPike, Jessica Sapick, Juliet Rothenberg, Kartik Hegde, Kel Markert, Kim Philipp Jablonski, Luc Houriez, Monica Bharel, Phing VanLee, Reuven Sayag, Sebastian Pilarski, Shelley Cazares, Shlomi Pasternak, Siduo Jiang, Thomas Colthurst, Yang Chen, Yehonathan Refael, Yochai Blau, Yuval Carny, Yael Maguire, Avinatan Hassidim, James Manyika, Tim Thelin, Genady Beryozkin, Gautam Prasad, Luke Barrington, Yossi Matias, Niv Efron, Shravya Shetty

Main category: cs.AI

TL;DR: Earth AI: A family of geospatial AI models and agentic reasoning system combining foundation models for planet-scale imagery, population, and environment with Gemini-powered reasoning to extract insights from complex geospatial data.

Details

Motivation: Geospatial data is vast and diverse but challenging to analyze due to varying resolutions, timescales, and sparsity. There's a need for AI systems that can unlock novel insights from this complex data to better understand our planet.

Method: Developed foundation models across three domains: Planet-scale Imagery, Population, and Environment. Created a Gemini-powered reasoning engine that jointly reasons over multiple foundation models, large geospatial data sources, and tools. Built an agentic system to handle complex, multi-step queries.

Result: Rigorous benchmarks show the power and novel capabilities of the foundation models. When used together, they provide complementary value for geospatial inference and unlock superior predictive capabilities. The agent demonstrates ability to deliver critical insights on real-world crisis scenarios, bridging the gap between raw data and actionable understanding.

Conclusion: Earth AI represents a significant advance in geospatial AI, enabling profound insights into our planet through synergistic foundation models and intelligent agentic reasoning that can handle complex real-world scenarios.

Abstract: Geospatial data offers immense potential for understanding our planet. However, the sheer volume and diversity of this data along with its varied resolutions, timescales, and sparsity pose significant challenges for thorough analysis and interpretation. This paper introduces Earth AI, a family of geospatial AI models and agentic reasoning that enables significant advances in our ability to unlock novel and profound insights into our planet. This approach is built upon foundation models across three key domains–Planet-scale Imagery, Population, and Environment–and an intelligent Gemini-powered reasoning engine. We present rigorous benchmarks showcasing the power and novel capabilities of our foundation models and validate that when used together, they provide complementary value for geospatial inference and their synergies unlock superior predictive capabilities. To handle complex, multi-step queries, we developed a Gemini-powered agent that jointly reasons over our multiple foundation models along with large geospatial data sources and tools. On a new benchmark of real-world crisis scenarios, our agent demonstrates the ability to deliver critical and timely insights, effectively bridging the gap between raw geospatial data and actionable understanding.

[210] CaveAgent: Transforming LLMs into Stateful Runtime Operators

Maohao Ran, Zhenglin Wan, Cooper Lin, Yanting Zhang, Hongyu Xin, Hongwei Fan, Yibo Xu, Beier Luo, Yaxin Zhou, Wangbo Zhao, Lijie Yang, Lang Feng, Fuchao Yang, Jingxuan Wu, Yiqiao Huang, Chendong Ma, Dailing Jiang, Jianbo Deng, Sihui Han, Yang You, Bo An, Yike Guo, Jun Song

Main category: cs.AI

TL;DR: CaveAgent is a framework that shifts LLM-based agents from text-centric paradigms to treating the Python runtime as the central state locus, enabling persistent object manipulation and reducing context drift in long-horizon tasks.

Details

Motivation: Current LLM-based agent systems are constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift, limiting their ability to handle complex, interdependent tasks efficiently.

Method: CaveAgent introduces a dual-stream architecture that inverts the conventional paradigm: it elevates the persistent Python runtime as the central locus of state with a lightweight semantic stream as orchestrator. It features Stateful Runtime Management for injecting, manipulating, and retrieving complex Python objects that persist across turns, plus a runtime-integrated skill management system for ecosystem interoperability.

Result: Evaluations show consistent improvement across challenging benchmarks, enabling CaveAgent to handle data scales that cause context overflow in both JSON-based and code-based agents. The framework reduces context drift in multi-turn interactions and preserves processed data without information loss.

Conclusion: CaveAgent establishes a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR) by providing programmatically verifiable feedback and accessible runtime state for automated evaluation and reward signal generation.

Abstract: LLM-based agents are increasingly capable of complex task execution, yet current agentic systems remain constrained by text-centric paradigms that struggle with long-horizon tasks due to fragile multi-turn dependencies and context drift. We present CaveAgent, a framework that shifts tool use from “LLM-as-Text-Generator” to “LLM-as-Runtime-Operator.” CaveAgent introduces a dual-stream architecture that inverts the conventional paradigm: rather than treating the LLM’s text context as the primary workspace with tools as auxiliary, CaveAgent elevates the persistent Python runtime as the central locus of state, with a lightweight semantic stream serving as its orchestrator. Beyond leveraging code generation to resolve interdependent sub-tasks (e.g., loops, conditionals) in a single step, CaveAgent introduces Stateful Runtime Management: it injects, manipulates, and retrieves complex Python objects (e.g., DataFrames, database connections) that persist across turns, unlike existing code-based approaches that remain text-bound. CaveAgent further provides a runtime-integrated skill management system that extends the Agent Skills open standard, enabling ecosystem interoperability through executable skill injections. This persistence mechanism serves as a high-fidelity external memory that reduces context drift in multi-turn interactions and preserves processed data for downstream applications without information loss. Evaluations show consistent improvement across challenging benchmarks, enabling CaveAgent to handle data scales that cause context overflow in both JSON-based and code-based agents. The accessible runtime state further provides programmatically verifiable feedback, enabling automated evaluation and reward signal generation without human annotation and establishing a structural foundation for future research in Reinforcement Learning with Verifiable Rewards (RLVR).
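
The inject, manipulate, and retrieve cycle can be sketched with a single namespace that outlives each turn. This is a toy illustration, not CaveAgent's API: the `PersistentRuntime` class and its method names are hypothetical, and the code strings stand in for what an LLM would generate.

```python
class PersistentRuntime:
    """Toy runtime: one namespace shared by every turn."""

    def __init__(self):
        self.namespace = {}

    def inject(self, name, obj):
        # Place a live Python object into the runtime without serializing it.
        self.namespace[name] = obj

    def run(self, code):
        # Execute generated code; variables it defines persist for later turns.
        exec(code, self.namespace)

    def retrieve(self, name):
        # Hand the live object back to the caller.
        return self.namespace[name]

rt = PersistentRuntime()
rt.inject("rows", [{"city": "Oslo", "temp": 3}, {"city": "Cairo", "temp": 31}])
rt.run("warm = [r for r in rows if r['temp'] > 20]")   # turn 1
rt.run("names = [r['city'] for r in warm]")            # turn 2 reuses `warm`
print(rt.retrieve("names"))
```

Only the short code strings pass through the text channel; the data itself never has to be printed into the context. Note that `exec` on untrusted code is unsafe, so a real system would need sandboxing.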

[211] DIAGPaper: Diagnosing Valid and Specific Weaknesses in Scientific Papers via Multi-Agent Reasoning

Zhuoyang Zou, Abolfazl Ansari, Delvin Ce Zhang, Dongwon Lee, Wenpeng Yin

Main category: cs.AI

TL;DR: DIAGPaper is a multi-agent LLM framework for paper weakness identification that improves on existing methods through criterion-based reviewer simulation, author rebuttal validation, and severity-based prioritization.

Details

Motivation: Existing paper weakness identification methods have limitations: multi-agent systems simulate human roles superficially without capturing expert criteria, assume identified weaknesses are valid ignoring reviewer bias and author rebuttals, and output unranked lists rather than prioritizing the most consequential issues.

Method: DIAGPaper uses three integrated modules: (1) Customizer simulates human-defined review criteria and instantiates reviewer agents with criterion-specific expertise, (2) Rebuttal introduces author agents for structured debate with reviewers to validate/refine weaknesses, (3) Prioritizer learns from human review practices to assess severity and surfaces top-K severest weaknesses.

Result: Experiments on AAAR and ReviewCritique benchmarks show DIAGPaper substantially outperforms existing methods by producing more valid and paper-specific weaknesses, presented in a user-oriented, prioritized manner.

Conclusion: DIAGPaper addresses key limitations in paper weakness identification through integrated multi-agent design with criterion simulation, rebuttal validation, and severity prioritization, demonstrating superior performance over existing approaches.

Abstract: Paper weakness identification using single-agent or multi-agent LLMs has attracted increasing attention, yet existing approaches exhibit key limitations. Many multi-agent systems simulate human roles at a surface level, missing the underlying criteria that lead experts to assess complementary intellectual aspects of a paper. Moreover, prior methods implicitly assume identified weaknesses are valid, ignoring reviewer bias, misunderstanding, and the critical role of author rebuttals in validating review quality. Finally, most systems output unranked weakness lists, rather than prioritizing the most consequential issues for users. In this work, we propose DIAGPaper, a novel multi-agent framework that addresses these challenges through three tightly integrated modules. The customizer module simulates human-defined review criteria and instantiates multiple reviewer agents with criterion-specific expertise. The rebuttal module introduces author agents that engage in structured debate with reviewer agents to validate and refine proposed weaknesses. The prioritizer module learns from large-scale human review practices to assess the severity of validated weaknesses and surfaces the top-K severest ones to users. Experiments on two benchmarks, AAAR and ReviewCritique, demonstrate that DIAGPaper substantially outperforms existing methods by producing more valid and more paper-specific weaknesses, while presenting them in a user-oriented, prioritized manner.

[212] SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent

Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba

Main category: cs.AI

TL;DR: SEISMO is an LLM agent for sample-efficient molecular optimization that performs online, inference-time optimization using natural language task descriptions and explanatory feedback, achieving a 2-3 times higher area under the optimization curve than prior methods.

Details

Motivation: Molecular optimization is crucial for drug discovery but limited by costly experimental assays requiring high sample efficiency. Current methods often need population-based or batched learning, which is inefficient for expensive oracle evaluations.

Method: SEISMO is an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without population-based learning. It conditions proposals on full optimization trajectories using natural language task descriptions with scalar scores and structured explanatory feedback when available.

Result: On the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves 2-3 times higher area under the optimization curve than prior methods, often reaching near-maximal task scores within 50 oracle calls. Medicinal chemistry tasks show explanatory feedback further improves efficiency.

Conclusion: Leveraging domain knowledge and structured information through LLM agents enables highly sample-efficient molecular optimization, demonstrating the value of explanatory feedback and online learning for expensive oracle-based optimization problems.

Abstract: Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate-limited oracles, such as experimental assays, molecular optimization must be highly sample-efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without the need for population-based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural-language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2-3 times higher area under the optimisation curve than prior methods, often reaching near-maximal task scores within 50 oracle calls. Our additional medicinal-chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample-efficient molecular optimization.
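
To see why an area-under-the-curve metric rewards sample efficiency, consider a simplified version computed from the best score found so far after each oracle call. This is an illustrative sketch only: the benchmark's exact metric definition is not reproduced here, and the function names and scores are hypothetical.

```python
def best_so_far(scores):
    """Running maximum of oracle scores, one entry per oracle call."""
    best, curve = float("-inf"), []
    for s in scores:
        best = max(best, s)
        curve.append(best)
    return curve

def normalized_auc(scores):
    """Mean of the best-so-far curve; higher means good molecules were found earlier."""
    curve = best_so_far(scores)
    return sum(curve) / len(curve)

slow = [0.2, 0.8, 0.5, 0.9]   # toy run that reaches 0.9 on the last call
fast = [0.9, 0.2, 0.5, 0.8]   # toy run that reaches 0.9 on the first call
print(normalized_auc(slow), normalized_auc(fast))
```

Both toy runs end with the same best score, but the run that finds it first gets the larger area, so the metric separates optimizers by how many oracle calls they need.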

[213] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents

Zeping Li, Hongru Wang, Yiwen Zhao, Guanhua Chen, Yixia Li, Keyang Chen, Yixin Cao, Guangnan Ye, Hongfeng Chai, Zhenfei Yin

Main category: cs.AI

TL;DR: LLM-based tool-using agents often make excessive, low-quality tool calls in long trajectories. The paper proposes using entropy reduction as a supervisory signal with two reward strategies (sparse outcome and dense process rewards) to optimize tool-use behavior, significantly reducing tool calls and improving performance.

Details

Motivation: Tool-using LLM agents face challenges in long trajectories where they trigger excessive and low-quality tool calls, increasing latency and degrading inference performance. Managing tool-use behavior effectively is difficult, requiring better optimization methods.

Method: The authors conduct entropy-based pilot experiments showing correlation between entropy reduction and high-quality tool calls. They propose using entropy reduction as a supervisory signal with two reward strategies: sparse outcome rewards for trajectory-level efficiency and dense process rewards for fine-grained performance supervision.

Result: Experiments across diverse domains show both reward designs improve tool-use behavior: sparse outcome rewards reduce tool calls by 72.07% compared to baseline averages, while dense process rewards improve performance by 22.27%.

Conclusion: Entropy reduction serves as a key mechanism for enhancing tool-use behavior in LLM agents, enabling more adaptive real-world applications through optimized tool selection and usage.

Abstract: Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, making managing tool-use behavior challenging. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.
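
The contrast between the two reward granularities can be sketched with Shannon entropy over a probability distribution recorded before and after each tool call. This is a hedged illustration: which distribution the authors measure and how they shape the rewards is not specified in this summary, so `dense_rewards`, `sparse_reward`, and the toy distributions are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dense_rewards(dists):
    """One reward per tool call: entropy before the call minus entropy after it."""
    return [entropy(a) - entropy(b) for a, b in zip(dists, dists[1:])]

def sparse_reward(dists):
    """One trajectory-level reward: initial entropy minus final entropy."""
    return entropy(dists[0]) - entropy(dists[-1])

# Toy belief over four candidate answers: initial, after call 1, after call 2.
dists = [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(dense_rewards(dists), sparse_reward(dists))
```

In this toy formulation the dense rewards sum to the sparse reward, but the dense signal also identifies which individual calls were informative.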

[214] VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health

Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, Matt Hawrilenko

Main category: cs.AI

TL;DR: VERA-MH is an automated safety benchmark for evaluating AI chatbots in mental health contexts, validated through clinician ratings and LLM-judge alignment for suicide risk detection.

Details

Motivation: With millions using AI chatbots for psychological support, there's an urgent need for evidence-based safety evaluation to ensure these tools are safe, particularly for high-risk scenarios like suicide prevention.

Method: Simulated conversations between LLM-based user-agents and general-purpose AI chatbots were rated by licensed mental health clinicians using a scoring rubric. An LLM-based judge evaluated the same conversations, and alignment was measured between clinicians and between clinician consensus and the LLM judge.

Result: Clinicians showed high inter-rater reliability (chance-corrected IRR = 0.77), establishing a gold-standard reference. The LLM judge was strongly aligned with clinical consensus (IRR = 0.81), supporting VERA-MH’s validity and reliability as an automated safety evaluation framework.

Conclusion: VERA-MH provides a valid and reliable open-source automated safety evaluation for AI in mental health, with future work planned to expand its generalizability and target additional safety areas.

Abstract: Millions now use generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based, automated safety benchmark. This study aimed to examine the clinical validity and reliability of VERA-MH for evaluating AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then examined rating alignment (a) among individual clinicians and (b) between clinician consensus and the LLM judge, and (c) summarized clinicians’ ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR] = 0.77), establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus overall (IRR = 0.81) and within key conditions. Together, findings from this human evaluation study support the validity and reliability of VERA-MH: an open-source, automated AI safety evaluation for mental health. Future research will examine the generalizability and robustness of VERA-MH and expand the framework to target additional key areas of AI safety in mental health.
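
For readers unfamiliar with chance-corrected agreement, Cohen's kappa is one common statistic of this kind for two raters. The summary does not say which statistic the study used, so the choice of kappa here is an assumption, and the ratings below are made up for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy safety ratings for eight conversations (hypothetical data).
clinician = ["safe", "safe", "unsafe", "unsafe", "safe", "unsafe", "safe", "safe"]
judge = ["safe", "safe", "unsafe", "safe", "safe", "unsafe", "safe", "unsafe"]
print(cohens_kappa(clinician, judge))
```

Raw agreement in the toy data is 75%, yet kappa is noticeably lower because two raters who each say "safe" most of the time would often agree by chance.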

[215] AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Ruipeng Wang, Yuxin Chen, Yukai Wang, Chang Wu, Junfeng Fang, Xiaodong Cai, Qi Gu, Hui Su, An Zhang, Xiang Wang, Xunliang Cai, Tat-Seng Chua

Main category: cs.AI

TL;DR: AgentNoiseBench is a framework for evaluating LLM-based agent robustness under realistic noisy environments, categorizing noise into user-noise and tool-noise types.

Details

Motivation: Current LLM-based agents perform well on benchmarks but struggle in real-world deployments due to stochasticity and noise that existing training/evaluation paradigms overlook. There's a need to systematically assess agent robustness in noisy environments.

Method: 1) Analyze real-world biases/uncertainties and categorize environmental noise into user-noise and tool-noise; 2) Develop automated pipeline to inject controllable noise into existing agent benchmarks while preserving task solvability; 3) Evaluate diverse models across architectures and parameter scales.

Result: Extensive evaluations reveal consistent performance variations under different noise conditions, highlighting sensitivity of current agentic models to realistic environmental perturbations.

Conclusion: AgentNoiseBench provides a systematic framework for evaluating agent robustness, revealing important vulnerabilities in current models that need addressing for real-world deployment.

Abstract: Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
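
To make the user-noise / tool-noise split concrete, here is a minimal sketch of what controllable noise injection might look like. It is not the AgentNoiseBench pipeline: the function names, the typo-swap model, and the transient-failure model are hypothetical. Leaving digits untouched is one assumed way to keep a task solvable.

```python
import random


def add_user_noise(query, rate, rng):
    """Typo-style user-noise: swap adjacent letters with probability `rate`.

    Digits, spaces and punctuation are never moved, so task-critical values
    such as dates and quantities survive (an assumed solvability safeguard).
    """
    chars = list(query)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)


def with_tool_noise(tool, failure_rate, rng):
    """Tool-noise: wrap a tool so a call sometimes returns a transient error."""
    def noisy_tool(*args, **kwargs):
        if rng.random() < failure_rate:
            return {"ok": False, "error": "transient failure, please retry"}
        return {"ok": True, "result": tool(*args, **kwargs)}
    return noisy_tool


rng = random.Random(0)  # fixed seed so the injected noise is reproducible
print(add_user_noise("book 2 tickets to Paris on 2024-05-01", 0.3, rng))
flaky_add = with_tool_noise(lambda a, b: a + b, 0.5, rng)
print(flaky_add(2, 3))
```

The `rate` and `failure_rate` parameters are the "controllable" knobs in this sketch. Sweeping them would give a robustness curve rather than a single score.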

[216] From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design

Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen

Main category: cs.AI

TL;DR: LaySPA is a reinforcement learning framework that enhances LLMs with explicit spatial reasoning for graphic layout design, producing interpretable reasoning traces and structured layouts while improving validity and quality.

Details

Motivation: Addresses LLMs' limited spatial reasoning capabilities and lack of transparency in design decision-making for graphic layout tasks, which require understanding spatial relationships and visual composition.

Method: Reformulates layout design as policy learning over a structured textual spatial environment encoding canvas geometry, element attributes, and relationships. Uses multi-objective spatial critique (geometric validity, relational coherence, aesthetic consistency) and relative group optimization for stable training.

Result: Outperforms larger proprietary LLMs and achieves performance comparable to specialized SOTA layout generators with fewer annotated samples and reduced latency, while improving structural validity and visual quality.

Conclusion: LaySPA successfully equips LLMs with explicit spatial reasoning for layout design, enabling transparent and controllable decision-making through interpretable reasoning traces and structured outputs.

Abstract: We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design. LaySPA addresses two key challenges: LLMs’ limited spatial reasoning and the lack of transparency in design decision making. Instead of operating at the pixel level, we reformulate layout design as a policy learning problem over a structured textual spatial environment that explicitly encodes canvas geometry, element attributes, and inter-element relationships. LaySPA produces dual-level outputs comprising interpretable reasoning traces and structured layout specifications, enabling transparent and controllable design decision making. The layout design policy is optimized via a multi-objective spatial critique that decomposes layout quality into geometric validity, relational coherence, and aesthetic consistency, and is trained using relative group optimization to stabilize learning in open-ended design spaces. Experiments demonstrate that LaySPA improves structural validity and visual quality, outperforming larger proprietary LLMs and achieving performance comparable to specialized SOTA layout generators while requiring fewer annotated samples and reduced latency.
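
The abstract does not spell out how the three critique terms are combined or how the group-relative update is computed. As an illustrative sketch only, the code below assumes a weighted average of the three scores and a GRPO-style normalization of rewards within a group of sampled layouts. The weights, function names, and score values are hypothetical.

```python
from statistics import mean, pstdev


def combined_reward(geometric, relational, aesthetic, weights=(1.0, 1.0, 1.0)):
    """Weighted average of the three critique scores (weights are an assumption)."""
    w_g, w_r, w_a = weights
    return (w_g * geometric + w_r * relational + w_a * aesthetic) / (w_g + w_r + w_a)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical critique scores for four layouts sampled for the same canvas.
group = [
    combined_reward(1.0, 0.8, 0.6),
    combined_reward(0.4, 0.5, 0.7),
    combined_reward(0.9, 0.9, 0.9),
    combined_reward(0.2, 0.3, 0.1),
]
print([round(a, 2) for a in group_relative_advantages(group)])
```

Because each advantage is measured relative to sibling samples, no separate value network is needed in this sketch. A layout is reinforced only to the extent that it beats the others drawn for the same prompt.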

[217] ForesightSafety Bench: A Frontier Risk Evaluation and Governance Framework towards Safe AI

Haibo Tong, Feifei Zhao, Linghao Feng, Ruoyu Wu, Ruolin Chen, Lu Jia, Zhou Zhao, Jindong Li, Tenglong Li, Erliang Lin, Shuai Yang, Enmeng Lu, Yinqian Sun, Qian Zhang, Zizhe Ruan, Jinyu Fan, Zeyang Yue, Ping Wu, Huangrui Li, Chengyi Sun, Yi Zeng

Main category: cs.AI

TL;DR: ForesightSafety Bench: A comprehensive AI safety evaluation framework covering 94 risk dimensions across fundamental safety, embodied AI, AI4Science, social/environmental risks, catastrophic risks, and industrial domains.

Details

Motivation: Current AI safety evaluation systems have critical limitations including restricted risk dimensions and failed frontier risk detection, while lagging safety benchmarks and alignment technologies struggle to address complex challenges from cutting-edge AI models.

Method: Proposes a hierarchical safety evaluation framework starting with 7 fundamental safety pillars, extending to embodied AI safety, AI4Science safety, social/environmental risks, catastrophic/existential risks, and 8 industrial safety domains totaling 94 refined risk dimensions.

Result: Benchmark accumulated tens of thousands of structured risk data points; systematic evaluation of 20+ mainstream advanced large models revealed widespread safety vulnerabilities across multiple pillars, particularly in risky agentic autonomy, AI4Science safety, embodied AI safety, social AI safety, and catastrophic/existential risks.

Conclusion: Establishes a comprehensive, hierarchical, and dynamically evolving AI safety evaluation framework that identifies critical safety gaps in frontier AI systems across multiple risk dimensions.

Abstract: Rapidly evolving AI exhibits increasingly strong autonomy and goal-directed capabilities, accompanied by derivative systemic risks that are more unpredictable, difficult to control, and potentially irreversible. However, current AI safety evaluation systems suffer from critical limitations such as restricted risk dimensions and failed frontier risk detection. The lagging safety benchmarks and alignment technologies can hardly address the complex challenges posed by cutting-edge AI models. To bridge this gap, we propose the “ForesightSafety Bench” AI Safety Evaluation Framework, which begins with 7 major Fundamental Safety pillars and progressively extends to advanced Embodied AI Safety, AI4Science Safety, Social and Environmental AI risks, Catastrophic and Existential Risks, as well as 8 critical industrial safety domains, forming a total of 94 refined risk dimensions. To date, the benchmark has accumulated tens of thousands of structured risk data points and assessment results, establishing a widely encompassing, hierarchically clear, and dynamically evolving AI safety evaluation framework. Based on this benchmark, we conduct systematic evaluation and in-depth analysis of over twenty mainstream advanced large models, identifying key risk patterns and their capability boundaries. The safety capability evaluation results reveal widespread safety vulnerabilities of frontier AI across multiple pillars, particularly in Risky Agentic Autonomy, AI4Science Safety, Embodied AI Safety, Social AI Safety, and Catastrophic and Existential Risks. Our benchmark is released at https://github.com/Beijing-AISI/ForesightSafety-Bench. The project website is available at https://foresightsafety-bench.beijing-aisi.ac.cn/.

cs.SD

[218] MAEB: Massive Audio Embedding Benchmark

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen

Main category: cs.SD

TL;DR: MAEB is a comprehensive audio embedding benchmark covering 30 tasks across speech, music, environmental sounds, and audio-text reasoning in 100+ languages, evaluating 50+ models to reveal task-specific strengths and weaknesses.

Details

Motivation: Current audio embedding models lack comprehensive evaluation across diverse audio tasks and languages. There's a need for a standardized benchmark to assess model capabilities across speech, music, environmental sounds, and cross-modal reasoning to understand model strengths and limitations.

Method: Created MAEB benchmark with 30 tasks derived from MAEB+ (98 tasks) covering speech, music, environmental sounds, and audio-text reasoning in 100+ languages. Evaluated 50+ models including contrastive audio-text models and speech-pretrained models, measuring performance across different task types and analyzing correlations with audio LLM performance.

Result: No single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification but perform poorly on multilingual speech tasks, while speech-pretrained models show opposite pattern. Clustering remains challenging for all models. Performance on acoustic understanding vs linguistic tasks shows trade-off. Audio encoder performance on MAEB correlates highly with their performance in audio large language models.

Conclusion: MAEB provides a standardized benchmark for evaluating audio embedding models across diverse tasks and languages, revealing task-specific model strengths and enabling better model selection. The benchmark integrates into MTEB ecosystem for unified multimodal evaluation and will facilitate development of more capable audio models.

Abstract: We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.

[219] BAT: Better Audio Transformer Guided by Convex Gated Probing

Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos

Main category: cs.SD

TL;DR: Convex Gated Probing (CGP) is introduced to close the gap between fine-tuning and probing for audio SSL models, enabling better evaluation and guiding improvements to audio SSL pipelines.

Details

Conclusion: CGP provides a robust probing mechanism for audio SSL models that enables better evaluation and guides improvements to achieve state-of-the-art performance on audio benchmarks.

[220] Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

Arvind Krishna Sridhar, Yinyi Guo, Erik Visser

Main category: cs.SD

Details

Conclusion: The work offers new insights for advancing spatial audio understanding through movement modeling, explicit reasoning, and source separation techniques.

[221] How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection

Yixuan Xiao, Florian Lux, Alejandro Pérez-González-de-Martos, Ngoc Thang Vu

Main category: cs.SD

TL;DR: Study examines labeling ambiguity in spoof detection when using neural audio codec resynthesized speech, creates challenging ASVspoof 5 extension dataset, and analyzes how different labeling choices affect detection performance.

Details

Motivation: Neural audio codecs have dual functionality - originally for audio compression but also used for speech synthesis via language modeling. This creates ambiguity in spoof detection datasets where codec-resynthesized speech could be labeled as either bonafide or spoof, which hasn't been adequately addressed in prior research.

Method: Created a challenging extension of the ASVspoof 5 dataset specifically for this problem. Examined how different labeling choices (treating codec-resynthesized speech as bonafide vs. spoof) affect detection performance and provided insights into labeling strategies.

Result: Demonstrated that labeling choices significantly impact spoof detection performance. The study provides empirical evidence on how to handle the ambiguous nature of codec-resynthesized speech in spoof detection systems.

Conclusion: The ambiguous nature of neural audio codec resynthesized speech presents a challenge for spoof detection that requires careful consideration of labeling strategies, as these choices substantially affect system performance.

Abstract: Since Text-to-Speech systems typically don’t produce waveforms directly, recent spoof detection studies use resynthesized waveforms from vocoders and neural audio codecs to simulate an attacker. Unlike vocoders, which are specifically designed for speech synthesis, neural audio codecs were originally developed for compressing audio for storage and transmission. However, their ability to discretize speech also sparked interest in language-modeling-based speech synthesis. Owing to this dual functionality, codec resynthesized data may be labeled as either bonafide or spoof. So far, very little research has addressed this issue. In this study, we present a challenging extension of the ASVspoof 5 dataset constructed for this purpose. We examine how different labeling choices affect detection performance and provide insights into labeling strategies.

[222] Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang

Main category: cs.SD

TL;DR: Systematic study of native audio foundation models using next-token prediction on audio, with scaling laws analysis and SODA model suite for general audio generation and cross-modal tasks.

Details

[223] Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model

S. Rijal, R. Neupane, S. P. Mainali, S. K. Regmi, S. Maharjan

Main category: cs.SD

TL;DR: Transformer-based model for monaural multi-speaker speech separation that balances computational efficiency with separation performance

Details

Motivation: Address the trade-off between model size/complexity and accuracy/robustness in speech separation (cocktail party problem), aiming to reduce computational complexity while maintaining performance

Method: Uses Transformer architecture and its efficient variants, trained on LibriMix dataset to separate 2 distinct speaker sources from mixed audio input

Result: Model achieves significant reduction in computational complexity with minimal performance trade-off compared to prevalent speech separation models

Conclusion: The work contributes to speech separation research with computational efficiency as a core focus, showing promise for practical applications

Abstract: The cocktail party problem is the scenario in which it is difficult to separate or distinguish an individual speaker from mixed speech produced by several speakers. Research in this field is ongoing, but model size and complexity are typically traded off against the accuracy and robustness of speech separation. “Monaural multi-speaker speech separation” presents a speech-separation model based on the Transformer architecture and its efficient variants. The model was trained on the LibriMix dataset, which contains utterances from diverse speakers, and separates 2 distinct speaker sources from a mixed audio input. The developed model aims to reduce the computational complexity of speech separation with minimal loss in performance relative to prevalent speech separation models, and it has shown significant progress towards that goal. The project anticipates further contributions to ongoing speech separation research with computational efficiency at its core.

[224] Voice Impression Control in Zero-Shot TTS

Kenichi Fujita, Shota Horiguchi, Yusuke Ijima

Main category: cs.SD

TL;DR: A voice impression control method for zero-shot TTS that uses low-dimensional vectors to represent voice impression intensities and enables natural language control via LLMs.

Details

Motivation: Para-/non-linguistic information in speech shapes listeners' impressions, but controlling these subtle characteristics in zero-shot TTS remains challenging despite advances in speaker fidelity.

Method: Develops a voice impression control method using low-dimensional vectors to represent intensities of various voice impression pairs (e.g., dark-bright), with vector generation via large language models from natural language descriptions.

Result: Both objective and subjective evaluations demonstrate the method’s effectiveness in impression control, enabling target-impression generation from natural language descriptions without manual optimization.

Conclusion: The method successfully enables fine-grained voice impression control in zero-shot TTS through low-dimensional representation and LLM-based natural language interface.

Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners’ impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method’s effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).

[225] Investigation for Relative Voice Impression Estimation

Kenichi Fujita, Yusuke Ijima

Main category: cs.SD

TL;DR: The study introduces the Relative Voice Impression Estimation (RIE) framework, which predicts perceptual differences between two utterances from the same speaker using low-dimensional vectors derived from subjective evaluations along antonymic axes such as “Dark-Bright”.

Details

Motivation: While most research focuses on absolute impression scoring, there's a need to understand relative perceptual differences between utterances from the same speaker, particularly for paralinguistic and non-linguistic aspects of speech that strongly influence listener impressions.

Method: Used recordings of a professional speaker reading a text in various styles to isolate expressive/prosodic variation. Compared three modeling approaches: 1) classical acoustic features for speech emotion recognition, 2) self-supervised speech representations, and 3) multimodal large language models (MLLMs).

Result: Self-supervised speech representations outperformed classical acoustic features, especially for complex/dynamic impressions like “Cold-Warm” where classical features failed. Current MLLMs proved unreliable for this fine-grained pairwise task.

Conclusion: First systematic investigation of RIE demonstrates strength of self-supervised speech models in capturing subtle perceptual variations, while highlighting limitations of current MLLMs for fine-grained pairwise voice impression estimation tasks.

Abstract: Paralinguistic and non-linguistic aspects of speech strongly influence listener impressions. While most research focuses on absolute impression scoring, this study investigates relative voice impression estimation (RIE), a framework for predicting the perceptual difference between two utterances from the same speaker. The estimation target is a low-dimensional vector derived from subjective evaluations, quantifying the perceptual shift of the second utterance relative to the first along an antonymic axis (e.g., “Dark–Bright”). To isolate expressive and prosodic variation, we used recordings of a professional speaker reading a text in various styles. We compare three modeling approaches: classical acoustic features commonly used for speech emotion recognition, self-supervised speech representations, and multimodal large language models (MLLMs). Our results demonstrate that models using self-supervised representations outperform methods with classical acoustic features, particularly in capturing complex and dynamic impressions (e.g., “Cold–Warm”) where classical features fail. In contrast, current MLLMs prove unreliable for this fine-grained pairwise task. This study provides the first systematic investigation of RIE and demonstrates the strength of self-supervised speech models in capturing subtle perceptual variations.

cs.LG

[226] ModalImmune: Immunity Driven Unlearning via Self Destructive Training

Rong Fu, Jia Yee Tan, Wenxin Zhang, Zijian Zhang, Ziming Wang, Zhaolu Kang, Muge Qi, Shuning Zhang, Simon Fong

Main category: cs.LG

TL;DR: ModalImmune is a training framework that makes multimodal models robust to partial or complete loss of input channels by intentionally collapsing selected modality information during training.

Details

Motivation: Multimodal systems are vulnerable to partial or complete loss of input channels in real-world deployment, which undermines their reliability. Current models lack robustness when certain modalities become unavailable or corrupted.

Method: The framework uses: 1) spectrum-adaptive collapse regularizer, 2) information-gain guided controller for targeted interventions, 3) curvature-aware gradient masking to stabilize destructive updates, and 4) certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation.

Result: Empirical evaluation on standard multimodal benchmarks shows that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.

Conclusion: ModalImmune provides an effective training framework for making multimodal models robust to destructive modality influence, enhancing their reliability in real-world settings where input channels may fail.

Abstract: Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.

[227] A Koopman-Bayesian Framework for High-Fidelity, Perceptually Optimized Haptic Surgical Simulation

Rohit Kaushik, Eva Kaushik

Main category: cs.LG

TL;DR: A unified framework for surgical simulation combining nonlinear dynamics, perceptual psychophysics, and high-frequency haptic rendering to enhance realism through Koopman operator formulation and Bayesian calibration.

Details

Motivation: To improve realism in surgical simulation by addressing limitations in conventional haptic rendering methods, particularly the gap between physical accuracy and human perceptual experience.

Method: Uses Koopman operator formulation to linearize nonlinear tissue dynamics in augmented state space, combined with Bayesian calibration based on Weber-Fechner and Stevens scaling laws to adapt force signals to individual perceptual thresholds.

Result: Achieves 4.3 ms rendering latency, <2.8% force error, and 20% improvement in perceptual discrimination, with statistical analyses showing significant superiority over conventional spring-damper and energy-based methods.

Conclusion: The framework successfully bridges physical accuracy and perceptual realism in surgical simulation, with potential applications in surgical training, VR medical education, and future closed-loop neural feedback haptic interfaces.

Abstract: We introduce a unified framework that combines nonlinear dynamics, perceptual psychophysics, and high-frequency haptic rendering to enhance realism in surgical simulation. The interaction of the surgical device with soft tissue is lifted to an augmented state space with a Koopman operator formulation, allowing linear prediction and control of dynamics that are nonlinear by nature. To make the rendered forces consistent with human perceptual limits, we put forward a Bayesian calibration module based on the Weber-Fechner and Stevens scaling laws, which progressively shapes force signals relative to each individual’s discrimination thresholds. For various simulated surgical tasks such as palpation, incision, and bone milling, the proposed system attains an average rendering latency of 4.3 ms, a force error of less than 2.8%, and a 20% improvement in perceptual discrimination. Multivariate statistical analyses (MANOVA and regression) reveal that the system’s performance is significantly better than that of conventional spring-damper and energy-based rendering methods. We end by discussing the potential impact on surgical training and VR-based medical education, as well as sketching future work toward closed-loop neural feedback in haptic interfaces.
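
For readers unfamiliar with the psychophysical laws named in the abstract, the sketch below shows their textbook forms: Weber's law (the just-noticeable difference grows in proportion to the reference stimulus) and Stevens' power law (perceived magnitude is a power function of stimulus intensity). It does not reproduce the paper's Bayesian calibration module. The function names and all numeric constants are illustrative assumptions.

```python
def weber_jnd(reference_force, weber_fraction):
    """Weber's law: the smallest detectable change is a fixed fraction of the reference."""
    return weber_fraction * reference_force


def is_perceptible(reference_force, rendered_force, weber_fraction):
    """True if the rendered force differs from the reference by at least one JND."""
    return abs(rendered_force - reference_force) >= weber_jnd(reference_force, weber_fraction)


def stevens_magnitude(force, exponent, scale=1.0):
    """Stevens' power law: perceived magnitude = scale * force ** exponent."""
    return scale * force ** exponent


# Hypothetical per-user Weber fraction of 10% at a 2.0 N reference force.
print(is_perceptible(2.0, 2.1, 0.1))  # 0.1 N change vs. a 0.2 N threshold
print(is_perceptible(2.0, 2.3, 0.1))  # 0.3 N change vs. a 0.2 N threshold
print(stevens_magnitude(4.0, 0.5))
```

A calibration scheme in this spirit would estimate `weber_fraction` and `exponent` per user. How the paper's Bayesian module does that is not described in the abstract.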

[228] Memes-as-Replies: Can Models Select Humorous Manga Panel Responses?

Ryosuke Kohita, Seiichiro Yoshioka

Main category: cs.LG

TL;DR: A benchmark for meme reply selection using manga panels shows LLMs can capture social cues but struggle with visual context and subtle humor distinctions.

Details

Motivation: To address the gap in understanding how memes are used dynamically and contextually for humor in conversations, rather than just analyzing their intrinsic properties.

Method: Created MaMe-Re benchmark with 100,000 human-annotated pairs of manga panels and social media posts, then evaluated LLMs on meme reply selection task.

Result: LLMs show preliminary ability to capture complex social cues like exaggeration, but visual information doesn’t improve performance, and models struggle with subtle humor distinctions.

Conclusion: Selecting contextually humorous replies remains an open challenge for current models, revealing gaps in visual understanding and nuanced humor comprehension.

Abstract: Memes are a popular element of modern web communication, used not only as static artifacts but also as interactive replies within conversations. While computational research has focused on analyzing the intrinsic properties of memes, the dynamic and contextual use of memes to create humor remains an understudied area of web science. To address this gap, we introduce the Meme Reply Selection task and present MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs (500,000 total annotations from 2,325 unique annotators) consisting of openly licensed Japanese manga panels and social media posts. Our analysis reveals three key insights: (1) large language models (LLMs) show preliminary evidence of capturing complex social cues such as exaggeration, moving beyond surface-level semantic matching; (2) the inclusion of visual information does not improve performance, revealing a gap between understanding visual content and effectively using it for contextual humor; (3) while LLMs can match human judgments in controlled settings, they struggle to distinguish subtle differences in wit among semantically similar candidates. These findings suggest that selecting contextually humorous replies remains an open challenge for current models.

[229] Kalman-Inspired Runtime Stability and Recovery in Hybrid Reasoning Systems

Barak Or

Main category: cs.LG

TL;DR: This paper studies runtime stability in hybrid reasoning systems that combine learned components with model-based inference, analyzing them from a Kalman-inspired perspective to understand and monitor cognitive drift during tool-augmented decision loops.

Details

Motivation: Hybrid reasoning systems combining learned components with model-based inference are increasingly used in tool-augmented decision loops, but their runtime behavior under partial observability and evidence mismatch remains poorly understood. Failures often manifest as gradual divergence of internal reasoning dynamics rather than isolated prediction errors.

Method: The paper models reasoning as a stochastic inference process driven by an internal innovation signal and introduces cognitive drift as a measurable runtime phenomenon. It proposes a runtime stability framework that monitors innovation statistics, detects emerging instability, and triggers recovery-aware control mechanisms. Stability is defined in terms of detectability, bounded divergence, and recoverability rather than task-level correctness.

Result: Experiments on multi-step, tool-augmented reasoning tasks demonstrate reliable instability detection prior to task failure and show that recovery, when feasible, re-establishes bounded internal behavior within finite time.

Conclusion: The results emphasize runtime stability as a system-level requirement for reliable reasoning under uncertainty in hybrid systems that combine learned and model-based components.

Abstract: Hybrid reasoning systems that combine learned components with model-based inference are increasingly deployed in tool-augmented decision loops, yet their runtime behavior under partial observability and sustained evidence mismatch remains poorly understood. In practice, failures often arise as gradual divergence of internal reasoning dynamics rather than as isolated prediction errors. This work studies runtime stability in hybrid reasoning systems from a Kalman-inspired perspective. We model reasoning as a stochastic inference process driven by an internal innovation signal and introduce cognitive drift as a measurable runtime phenomenon. Stability is defined in terms of detectability, bounded divergence, and recoverability rather than task-level correctness. We propose a runtime stability framework that monitors innovation statistics, detects emerging instability, and triggers recovery-aware control mechanisms. Experiments on multi-step, tool-augmented reasoning tasks demonstrate reliable instability detection prior to task failure and show that recovery, when feasible, re-establishes bounded internal behavior within finite time. These results emphasize runtime stability as a system-level requirement for reliable reasoning under uncertainty.
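
The abstract does not define its innovation statistics, so the sketch below falls back on the classical Kalman filter notion: the innovation is the gap between an observation and the filter's prediction, and a sustained rise in the normalized innovation squared (NIS) suggests the internal model no longer matches the evidence. The scalar random-walk model, the class name `InnovationMonitor`, the window size, and the threshold are all assumptions rather than the paper's framework.

```python
from collections import deque


class InnovationMonitor:
    """Scalar Kalman filter that flags drift when windowed mean NIS is too high."""

    def __init__(self, process_var, measurement_var, window=5, threshold=4.0):
        self.q = process_var        # process noise variance
        self.r = measurement_var    # measurement noise variance
        self.x = 0.0                # state estimate
        self.p = 1.0                # estimate variance
        self.recent_nis = deque(maxlen=window)
        self.threshold = threshold  # hypothetical alarm level (expected NIS is about 1)

    def step(self, z):
        """Process one observation; return True if drift is flagged."""
        p_pred = self.p + self.q    # predict under a random-walk model
        innovation = z - self.x     # observation minus prediction
        s = p_pred + self.r         # innovation variance
        self.recent_nis.append(innovation ** 2 / s)
        gain = p_pred / s           # Kalman gain
        self.x += gain * innovation
        self.p = (1.0 - gain) * p_pred
        window_full = len(self.recent_nis) == self.recent_nis.maxlen
        mean_nis = sum(self.recent_nis) / len(self.recent_nis)
        return window_full and mean_nis > self.threshold


monitor = InnovationMonitor(process_var=0.01, measurement_var=1.0)
calm = [monitor.step(0.0) for _ in range(10)]            # evidence matches the model
drift = [monitor.step(10.0 * k) for k in range(1, 6)]    # sustained mismatch
print(any(calm), any(drift))
```

Averaging NIS over a window rather than alarming on single samples is a design choice that trades detection latency for fewer false alarms. A recovery step, such as resetting `p` to a large value, could be triggered when `step` returns True.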

[230] Genetic Generalized Additive Models

Kaaustaaub Shankar, Kelly Cohen

Main category: cs.LG

TL;DR: Uses the NSGA-II genetic algorithm to automatically optimize Generalized Additive Models, balancing prediction error against complexity for better accuracy and interpretability.

Details

Motivation: Generalized Additive Models (GAMs) offer good interpretability but require manual configuration of their structure, which is challenging. There's a need for automated methods to optimize GAMs to balance predictive accuracy and interpretability.

Method: Proposes using the multi-objective genetic algorithm NSGA-II to automatically optimize GAMs. The algorithm jointly minimizes prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty.

Result: Experiments on the California Housing dataset show NSGA-II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals.

Conclusion: The framework provides a general approach for automated optimization of transparent, high-performing models, enhancing interpretability while maintaining or improving accuracy.

Abstract: Generalized Additive Models (GAMs) balance predictive accuracy and interpretability, but manually configuring their structure is challenging. We propose using the multi-objective genetic algorithm NSGA-II to automatically optimize GAMs, jointly minimizing prediction error (RMSE) and a Complexity Penalty that captures sparsity, smoothness, and uncertainty. Experiments on the California Housing dataset show that NSGA-II discovers GAMs that outperform baseline LinearGAMs in accuracy or match performance with substantially lower complexity. The resulting models are simpler, smoother, and exhibit narrower confidence intervals, enhancing interpretability. This framework provides a general approach for automated optimization of transparent, high-performing models. The code can be found at https://github.com/KaaustaaubShankar/GeneticAdditiveModels.
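
The two-objective trade-off can be illustrated with the non-dominated filtering step that NSGA-II relies on during selection. This is only a sketch of Pareto dominance over (RMSE, complexity penalty) pairs, not the authors' implementation, and the scores below are invented for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]


# (RMSE, complexity penalty) for four made-up GAM configurations; lower is better
scores = [(0.60, 9.0), (0.62, 4.0), (0.70, 3.0), (0.65, 8.0)]
front = pareto_front(scores)
```

Here `(0.65, 8.0)` is dropped because `(0.62, 4.0)` is better on both objectives. The remaining three configurations are the accuracy/complexity trade-offs a practitioner would choose among.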

[231] IT-OSE: Exploring Optimal Sample Size for Industrial Data Augmentation

Mingchun Sun, Rongqiang Zhao, Zhennan Huang, Songyu Ding, Jie Liu

Main category: cs.LG

TL;DR: Proposes IT-OSE: information-theoretic optimal sample size estimation for industrial data augmentation with ICD score for evaluation.

Details

Motivation: Lack of theoretical research on optimal sample size (OSS) in data augmentation, no established metrics to evaluate OSS accuracy or deviation from ground truth in industrial scenarios.

Method: Information-theoretic optimal sample size estimation (IT-OSE) approach with interval coverage and deviation (ICD) score for evaluation. Theoretical analysis of relationship between OSS and dominant factors.

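Result: Compared to empirical estimation, IT-OSE increases classification accuracy by an average of 4.38%, reduces MAPE in regression by an average of 18.80%, and reduces ICDdev by an average of 49.30%. Compared to exhaustive search, it reaches the same OSS while reducing computational costs by an average of 83.97% and data costs by an average of 93.46%.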

Conclusion: IT-OSE provides reliable OSS estimation for industrial data augmentation with enhanced interpretability, stability, and practicality across sensor-based industrial scenarios.

Abstract: In industrial scenarios, data augmentation is an effective approach to improve model performance. However, its effect is not uniformly beneficial. There is no theoretical research or established estimation for the optimal sample size (OSS) in augmentation, nor is there an established metric to evaluate the accuracy of OSS or its deviation from the ground truth. To address these issues, we propose an information-theoretic optimal sample size estimation (IT-OSE) to provide reliable OSS estimation for industrial data augmentation. An interval coverage and deviation (ICD) score is proposed to evaluate the estimated OSS intuitively. The relationship between OSS and dominant factors is theoretically analyzed and formulated, thereby enhancing the interpretability. Experiments show that, compared to empirical estimation, the IT-OSE increases accuracy in classification tasks across baseline models by an average of 4.38%, and reduces MAPE in regression tasks across baseline models by an average of 18.80%. The improvements in downstream model performance are more stable. ICDdev in the ICD score is also reduced by an average of 49.30%. The determinism of OSS is enhanced. Compared to exhaustive search, the IT-OSE achieves the same OSS while reducing computational and data costs by an average of 83.97% and 93.46%. Furthermore, practicality experiments demonstrate that the IT-OSE exhibits generality across representative sensor-based industrial scenarios.

[232] BamaER: A Behavior-Aware Memory-Augmented Model for Exercise Recommendation

Qing Yang, Yuhao Jiang, Rui Wang, Jipeng Guo, Yejiang Wang, Xinghe Cheng, Zezheng Wu, Jiapu Wang, Jingwei Zhang

Main category: cs.LG

TL;DR: BamaER: A behavior-aware memory-augmented exercise recommendation framework that captures heterogeneous student interactions, models knowledge states dynamically, and optimizes exercise selection for personalized learning.

Details

Motivation: Existing exercise recommendation methods represent student learning only as exercise sequences, overlooking rich behavioral interaction information, leading to biased estimates. Fixed-length sequence segmentation also limits incorporation of early learning experiences, hindering long-term dependency modeling and accurate knowledge mastery estimation.

Method: Three core modules: (1) Learning progress prediction with tri-directional hybrid encoding to capture heterogeneous student interaction behaviors; (2) Memory-augmented knowledge tracing with dynamic memory matrix to jointly model historical and current knowledge states; (3) Exercise filtering formulated as diversity-aware optimization problem solved via Hippopotamus Optimization Algorithm to reduce redundancy and improve recommendation coverage.

Result: Experiments on five real-world educational datasets show BamaER consistently outperforms state-of-the-art baselines across a range of evaluation metrics.

Conclusion: BamaER effectively addresses limitations of existing exercise recommendation methods by incorporating behavioral interactions, modeling long-term dependencies through memory augmentation, and optimizing exercise selection for better personalization and coverage.

Abstract: Exercise recommendation focuses on personalized exercise selection conditioned on students’ learning history, personal interests, and other individualized characteristics. Despite notable progress, most existing methods represent student learning solely as exercise sequences, overlooking rich behavioral interaction information. This limited representation often leads to biased and unreliable estimates of learning progress. Moreover, fixed-length sequence segmentation limits the incorporation of early learning experiences, thereby hindering the modeling of long-term dependencies and the accurate estimation of knowledge mastery. To address these limitations, we propose BamaER, a Behavior-aware memory-augmented Exercise Recommendation framework that comprises three core modules: (i) the learning progress prediction module that captures heterogeneous student interaction behaviors via a tri-directional hybrid encoding scheme; (ii) the memory-augmented knowledge tracing module that maintains a dynamic memory matrix to jointly model historical and current knowledge states for robust mastery estimation; and (iii) the exercise filtering module that formulates candidate selection as a diversity-aware optimization problem, solved via the Hippopotamus Optimization Algorithm to reduce redundancy and improve recommendation coverage. Experiments on five real-world educational datasets show that BamaER consistently outperforms state-of-the-art baselines across a range of evaluation metrics.

[233] Hardware-accelerated graph neural networks: an alternative approach for neuromorphic event-based audio classification and keyword spotting on SoC FPGA

Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Hiroshi Nakano, Manon Dampfhoffer, Thomas Mesquida, Hiroaki Nishi, Thomas Dalgaty, Tomasz Kryjak

Main category: cs.LG

TL;DR: FPGA implementation of event-graph neural networks for audio processing using artificial cochlea to convert signals to sparse event data, achieving high accuracy with low latency and power consumption for keyword spotting tasks.

Details

Motivation: Need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing of increasing data from embedded edge sensors, particularly from neuromorphic devices producing discrete event streams.

Method: FPGA implementation of event-graph neural networks using artificial cochlea to convert time-series audio signals into sparse event data, combining graph convolutional layers with recurrent sequence modeling for end-to-end keyword spotting.

Result: Achieved 92.7% accuracy on SHD dataset (only 2.4% below SOTA) with 10-67x fewer parameters; 66.9-71.0% on SSC; quantized model reached 92.3% accuracy, outperforming FPGA-based SNNs by up to 19.3%; end-to-end keyword spotting achieved 95% word-end detection accuracy with 10.53μs latency and 1.18W power consumption.

Conclusion: The FPGA implementation establishes a strong benchmark for energy-efficient event-driven keyword spotting, demonstrating efficient hardware-aware neural architectures for edge audio processing with neuromorphic sensors.

Abstract: As the volume of data recorded by embedded edge sensors increases, particularly from neuromorphic devices producing discrete event streams, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. For the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset, only 2.4% below the state of the art, while requiring over 10x and 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared to FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting, combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy, with only 10.53 microsecond latency and 1.18 W power consumption, establishing a strong benchmark for energy-efficient event-driven KWS.

[234] Distributed physics-informed neural networks via domain decomposition for fast flow reconstruction

Yixiao Qian, Jiaxu Liu, Zewei Xia, Song Chen, Chao Xu, Shengze Cai

Main category: cs.LG

TL;DR: Distributed Physics-Informed Neural Networks framework for scalable flow reconstruction using spatiotemporal domain decomposition with pressure uniqueness enforcement and CUDA acceleration.

Details

Motivation: Physics-Informed Neural Networks (PINNs) are powerful for flow reconstruction but face computational bottlenecks and optimization instabilities when scaling to large spatiotemporal domains. Distributed approaches suffer from pressure indeterminacy issues where sub-networks develop inconsistent local pressure baselines.

Method: Proposes a robust distributed PINNs framework with spatiotemporal domain decomposition. Uses reference anchor normalization with decoupled asymmetric weighting to enforce pressure uniqueness. Implements high-performance training pipeline with CUDA graphs and JIT compilation to reduce Python interpreter overhead for computing high-order physics residuals.

Result: The method achieves near-linear strong scaling and high-fidelity reconstruction on complex flow benchmarks. It eliminates gauge freedom and guarantees global pressure uniqueness while preserving temporal continuity.

Conclusion: Establishes a scalable and physically rigorous pathway for flow reconstruction and understanding of complex hydrodynamics through distributed PINNs with pressure uniqueness enforcement and computational acceleration.

Abstract: Physics-Informed Neural Networks (PINNs) offer a powerful paradigm for flow reconstruction, seamlessly integrating sparse velocity measurements with the governing Navier-Stokes equations to recover complete velocity and latent pressure fields. However, scaling such models to large spatiotemporal domains is hindered by computational bottlenecks and optimization instabilities. In this work, we propose a robust distributed PINNs framework designed for efficient flow reconstruction via spatiotemporal domain decomposition. A critical challenge in such distributed solvers is pressure indeterminacy, where independent sub-networks drift into inconsistent local pressure baselines. We address this issue through a reference anchor normalization strategy coupled with decoupled asymmetric weighting. By enforcing a unidirectional information flow from designated master ranks where the anchor point lies to neighboring ranks, our approach eliminates gauge freedom and guarantees global pressure uniqueness while preserving temporal continuity. Furthermore, to mitigate the Python interpreter overhead associated with computing high-order physics residuals, we implement a high-performance training pipeline accelerated by CUDA graphs and JIT compilation. Extensive validation on complex flow benchmarks demonstrates that our method achieves near-linear strong scaling and high-fidelity reconstruction, establishing a scalable and physically rigorous pathway for flow reconstruction and understanding of complex hydrodynamics.

[235] Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning

Emile Anand, Richard Hoffmann, Sarah Liaw, Adam Wierman

Main category: cs.LG

TL;DR: GMFS is a Graphon Mean-Field Subsampling framework for scalable cooperative multi-agent reinforcement learning with heterogeneous agent interactions, using subsampling to reduce computational complexity while maintaining near-optimal performance.

Details

Motivation: The paper addresses the challenge of coordinating large populations of interacting agents in multi-agent reinforcement learning, where existing methods either assume homogeneous interactions (mean-field) or become computationally expensive for heterogeneous interactions (graphon-based approaches).

Method: GMFS subsamples κ agents according to interaction strength to approximate the graphon-weighted mean-field, learning a policy with polynomial sample complexity in κ and optimality gap O(1/√κ). This reduces computational burden while capturing heterogeneity.

Result: Theoretical analysis shows sample complexity poly(κ) and optimality gap O(1/√κ). Numerical simulations in robotic coordination demonstrate that GMFS achieves near-optimal performance.

Conclusion: GMFS provides a scalable framework for cooperative MARL with heterogeneous interactions, balancing computational efficiency with performance through strategic subsampling.

Abstract: Coordinating large populations of interacting agents is a central challenge in multi-agent reinforcement learning (MARL), where the size of the joint state-action space scales exponentially with the number of agents. Mean-field methods alleviate this burden by aggregating agent interactions, but these approaches assume homogeneous interactions. Recent graphon-based frameworks capture heterogeneity, but are computationally expensive as the number of agents grows. Therefore, we introduce $\texttt{GMFS}$, a $\textbf{G}$raphon $\textbf{M}$ean-$\textbf{F}$ield $\textbf{S}$ubsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling $κ$ agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with sample complexity $\mathrm{poly}(κ)$ and optimality gap $O(1/\sqrt{κ})$. We verify our theory with numerical simulations in robotic coordination, showing that $\texttt{GMFS}$ achieves near-optimal performance.
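
To make the subsampling idea concrete, the sketch below compares an exact interaction-weighted average of neighbor states with a Monte Carlo estimate built from κ neighbors drawn with probability proportional to interaction strength. The abstract does not spell out the sampling scheme, so sampling with replacement, the scalar states, and the function names are assumptions made for illustration.

```python
import random


def weighted_mean_field(weights, states):
    """Exact interaction-weighted average of neighbor states."""
    return sum(w * s for w, s in zip(weights, states)) / sum(weights)


def subsampled_mean_field(weights, states, kappa, rng):
    """Estimate the weighted average from kappa neighbors drawn proportionally to weight."""
    sample = rng.choices(states, weights=weights, k=kappa)
    return sum(sample) / kappa


rng = random.Random(0)
states = [rng.random() for _ in range(1000)]    # scalar states of 1000 neighbors
weights = [rng.random() for _ in range(1000)]   # made-up interaction strengths
exact = weighted_mean_field(weights, states)
estimate = subsampled_mean_field(weights, states, kappa=400, rng=rng)
```

Because each draw has the weighted mean as its expectation, the plain average of the sample is an unbiased estimate, and its standard error shrinks like 1/√κ. That is consistent with the O(1/√κ) gap quoted in the abstract, though the paper's analysis concerns the learned policy rather than this toy estimator.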

[236] Adaptive Semi-Supervised Training of P300 ERP-BCI Speller System with Minimum Calibration Effort

Shumeng Chen, Jane E. Huggins, Tianwen Ma

Main category: cs.LG

TL;DR: A semi-supervised learning framework for P300 BCI spellers that reduces calibration time using adaptive EM-GMM algorithm with limited labeled data.

Details

Motivation: Traditional P300 BCI spellers require lengthy calibration procedures to build binary classifiers, reducing overall system efficiency and practicality for real-time use.

Method: Proposed unified framework with minimal calibration effort using adaptive semi-supervised EM-GMM algorithm that updates binary classifier with small amount of labeled calibration data.

Result: 9 of 15 participants exceeded the minimum character-level accuracy of 0.7 with either the adaptive method or the benchmark, and for 7 of those 9 the adaptive method outperformed the benchmark; evaluation used accuracy, ITR, and BCI utility.

Conclusion: The semi-supervised learning framework provides practical and efficient alternative to improve spelling efficiency in real-time BCI speller systems, especially with limited labeled data.

Abstract: A P300 ERP-based Brain-Computer Interface (BCI) speller is an assistive communication tool. It searches for the P300 event-related potential (ERP) elicited by target stimuli, distinguishing it from the neural responses to non-target stimuli embedded in electroencephalogram (EEG) signals. Conventional methods require a lengthy calibration procedure to construct the binary classifier, which reduces overall efficiency. Thus, we proposed a unified framework with minimum calibration effort such that, given a small amount of labeled calibration data, we employed an adaptive semi-supervised EM-GMM algorithm to update the binary classifier. We evaluated our method based on character-level prediction accuracy, information transfer rate (ITR), and BCI utility. We applied calibration on training data and reported results on testing data. Our results indicate that, out of 15 participants, 9 exceeded the minimum character-level accuracy of 0.7 using either our adaptive method or the benchmark, and for 7 of these 9 our adaptive method performed better than the benchmark. The proposed semi-supervised learning framework provides a practical and efficient alternative to improve the overall spelling efficiency in the real-time BCI speller system, particularly in contexts with limited labeled data.
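
The abstract does not say which ITR definition is used. A common choice in the BCI literature is the Wolpaw formula, sketched below under that assumption. The 36-target layout (a 6×6 speller grid) and the function names are likewise illustrative assumptions, not details from the paper.

```python
import math


def wolpaw_bits_per_selection(n_targets: int, accuracy: float) -> float:
    """Wolpaw ITR in bits per selection for n_targets equally likely targets."""
    if n_targets < 2 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_targets >= 2 and accuracy in [0, 1]")
    bits = math.log2(n_targets)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        # errors are assumed to be spread evenly over the remaining targets
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_targets - 1))
    return bits


def itr_bits_per_minute(n_targets: int, accuracy: float, selections_per_minute: float) -> float:
    """Scale bits per selection by the selection rate."""
    return wolpaw_bits_per_selection(n_targets, accuracy) * selections_per_minute


# bits per selection at the 0.7 accuracy threshold mentioned in the abstract
bits_at_threshold = wolpaw_bits_per_selection(36, 0.7)
```

Under this definition, perfect accuracy yields log2(36) bits per selection and chance-level accuracy yields zero, so shorter calibration only pays off if accuracy stays well above chance.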

[237] R$^2$Energy: A Large-Scale Benchmark for Robust Renewable Energy Forecasting under Diverse and Extreme Conditions

Zhi Sheng, Yuan Yuan, Guozhen Zhang, Yong Li

Main category: cs.LG

TL;DR: R²Energy is a large-scale benchmark for renewable energy forecasting with NWP data, featuring 10.7M hourly records from 902 wind/solar stations, designed to evaluate model robustness under extreme weather conditions.

Details

Motivation: Climate-driven extreme weather events threaten grid stability despite improved average forecasting accuracy, creating a need for robust models that can withstand volatile conditions and ensure operational security.

Method: Created a benchmark with 10.7M hourly records from 902 stations across China, established standardized NWP access for fair comparison, and incorporated regime-wise evaluation with extreme weather annotations to assess robustness beyond average metrics.

Result: Revealed a critical “robustness gap” where models fail under extreme conditions despite good average performance, showing that meteorological integration strategy matters more than architectural complexity for reliability during extreme weather.

Conclusion: R²Energy provides a principled foundation for developing robust renewable energy forecasting models, highlighting the importance of meteorological integration over architectural complexity for safety-critical power system applications.

Abstract: The rapid expansion of renewable energy, particularly wind and solar power, has made reliable forecasting critical for power system operations. While recent deep learning models have achieved strong average accuracy, the increasing frequency and intensity of climate-driven extreme weather events pose severe threats to grid stability and operational security. Consequently, developing robust forecasting models that can withstand volatile conditions has become a paramount challenge. In this paper, we present R$^2$Energy, a large-scale benchmark for NWP-assisted renewable energy forecasting. It comprises over 10.7 million high-fidelity hourly records from 902 wind and solar stations across four provinces in China, providing the diverse meteorological conditions necessary to capture the wide-ranging variability of renewable generation. We further establish a standardized, leakage-free forecasting paradigm that grants all models identical access to future Numerical Weather Prediction (NWP) signals, enabling fair and reproducible comparison across state-of-the-art representative forecasting architectures. Beyond aggregate accuracy, we incorporate regime-wise evaluation with expert-aligned extreme weather annotations, uncovering a critical “robustness gap” typically obscured by average metrics. This gap reveals a stark robustness-complexity trade-off: under extreme conditions, a model’s reliability is driven by its meteorological integration strategy rather than its architectural complexity. R$^2$Energy provides a principled foundation for evaluating and developing forecasting models for safety-critical power system applications.

[238] B-DENSE: Branching For Dense Ensemble Network Learning

Cherish Puniani, Tushar Kumar, Arnav Bendre, Gaurav Kumar, Shree Singhi

Main category: cs.LG

TL;DR: B-DENSE is a novel distillation framework for diffusion models that uses multi-branch trajectory alignment to accelerate sampling while preserving structural information lost in traditional distillation methods.

Details

Motivation: Diffusion models suffer from high inference latency due to iterative sampling. While distillation techniques accelerate sampling, they discard intermediate trajectory steps, leading to loss of structural information and discretization errors.

Method: Proposes B-DENSE framework with modified student architecture that outputs K-fold expanded channels, where each subset corresponds to a specific branch representing discrete intermediate steps in teacher’s trajectory. Trains branches to simultaneously map to entire sequence of teacher’s target timesteps for dense intermediate trajectory alignment.

Result: Student model learns to navigate solution space from earliest training stages, demonstrating superior image generation quality compared to baseline distillation frameworks.

Conclusion: B-DENSE effectively accelerates diffusion model sampling while preserving structural information through dense trajectory alignment, overcoming limitations of traditional distillation methods.

Abstract: Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output $K$-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher’s trajectory. By training these branches to simultaneously map to the entire sequence of the teacher’s target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.

[239] Fast Online Learning with Gaussian Prior-Driven Hierarchical Unimodal Thompson Sampling

Tianchi Zhao, He Liu, Hongyin Shi, Jinliang Li

Main category: cs.LG

TL;DR: Thompson Sampling algorithms for Gaussian Multi-Armed Bandit problems with clustered arms, achieving lower regret bounds by exploiting hierarchical structure and unimodal reward properties.

Details

Motivation: Many real-world problems like mmWave communications and portfolio management involve Gaussian-distributed rewards with clustered arms, where exploiting structural information can improve decision-making efficiency and reduce regret.

Method: Proposed TSCG (Thompson Sampling with Clustered arms under Gaussian prior) for 2-level hierarchical structure, and UTSCG (Unimodal Thompson Sampling with Clustered Arms) for unimodal rewards, both building on TSG baseline with theoretical regret analysis.

Result: Theoretical proofs show lower regret bounds than ordinary TSG, with UTSCG achieving even lower bounds for unimodal rewards. Numerical experiments confirm algorithmic advantages.

Conclusion: Exploiting structural information (clustering and unimodality) in Gaussian MAB problems leads to improved regret bounds and practical performance, with applications in communications and finance.

Abstract: We study a type of Multi-Armed Bandit (MAB) problem in which arms with Gaussian reward feedback are clustered. Such an arm setting finds applications in many real-world problems, for example, mmWave communications and portfolio management with risky assets, as a result of the universality of the Gaussian distribution. Based on the Thompson Sampling with Gaussian prior (TSG) algorithm for the selection of the optimal arm, we propose Thompson Sampling with Clustered arms under Gaussian prior (TSCG), specific to the 2-level hierarchical structure. We prove that by utilizing the 2-level structure, we can achieve a lower regret bound than with ordinary TSG. In addition, when the reward is unimodal, we can reach an even lower bound on the regret with our Unimodal Thompson Sampling algorithm with Clustered Arms under Gaussian prior (UTSCG). Each of our proposed algorithms is accompanied by a theoretical evaluation of the upper regret bound, and our numerical experiments confirm the advantage of our proposed algorithms.
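
The two-level selection can be sketched as Thompson Sampling applied first over clusters and then over arms within the chosen cluster. The code below is an illustrative guess at that structure, not the paper's TSCG algorithm. It assumes the commonly used Gaussian-prior rule of sampling from N(μ̂, 1/(n+1)) with μ̂ = (sum of rewards)/(n+1), unit-variance rewards, and cluster statistics pooled over all pulls in the cluster.

```python
import math
import random


class GaussianPosterior:
    """Sampling rule N(total/(n+1), 1/(n+1)) used in Gaussian-prior Thompson Sampling."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    @property
    def mean(self):
        return self.total / (self.n + 1)

    def sample(self, rng):
        return rng.gauss(self.mean, 1.0 / math.sqrt(self.n + 1))

    def update(self, reward):
        self.total += reward
        self.n += 1


def run(cluster_means, horizon, rng):
    """Pick a cluster by Thompson Sampling, then an arm inside it; return pull counts."""
    clusters = [GaussianPosterior() for _ in cluster_means]
    arms = [[GaussianPosterior() for _ in means] for means in cluster_means]
    pulls = [[0] * len(means) for means in cluster_means]
    for _ in range(horizon):
        c = max(range(len(clusters)), key=lambda i: clusters[i].sample(rng))
        a = max(range(len(arms[c])), key=lambda j: arms[c][j].sample(rng))
        reward = rng.gauss(cluster_means[c][a], 1.0)  # simulated Gaussian reward
        clusters[c].update(reward)                    # pooled cluster-level statistics
        arms[c][a].update(reward)
        pulls[c][a] += 1
    return pulls


# two clusters of two arms each, with made-up true mean rewards
pulls = run([[0.1, 0.2], [0.9, 0.3]], horizon=2000, rng=random.Random(0))
```

The intuition behind the hierarchy is that a poor cluster can be ruled out as a whole, so its arms need not each be explored individually. The unimodal variant would further restrict which arms are compared, which this sketch does not attempt.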

[240] Verifier-Constrained Flow Expansion for Discovery Beyond the Data

Riccardo De Santi, Kimon Protopapas, Ya-Ping Hsieh, Andreas Krause

Main category: cs.LG

TL;DR: Flow Expander (FE) adapts pre-trained flow/diffusion models to generate diverse valid samples beyond training data distribution using verifier-guided entropy maximization.

Details

Motivation: Flow and diffusion models trained on limited data generate samples only from narrow portions of the valid design space, limiting their usefulness for scientific discovery where exploring beyond available data is crucial.

Method: Proposes Flow Expander (FE) - a scalable mirror descent scheme that performs verifier-constrained entropy maximization over the flow process noised state space, with formal notions of strong/weak verifiers and algorithmic frameworks for global/local flow expansion.

Result: Theoretical convergence guarantees under idealized and general assumptions, and empirical evaluation showing FE can expand pre-trained flow models to increase conformer diversity while preserving validity in molecular design tasks.

Conclusion: FE provides a principled approach to adapt pre-trained generative models to explore broader valid design spaces beyond training data, with applications in scientific discovery domains like molecular design.

Abstract: Flow and diffusion models are typically pre-trained on limited available data (e.g., molecular samples), covering only a fraction of the valid design space (e.g., the full molecular space). As a consequence, they tend to generate samples from only a narrow portion of the feasible domain. This is a fundamental limitation for scientific discovery applications, where one typically aims to sample valid designs beyond the available data distribution. To this end, we address the challenge of leveraging access to a verifier (e.g., an atomic bonds checker) to adapt a pre-trained flow model so that its induced density expands beyond regions of high data availability, while preserving sample validity. We introduce formal notions of strong and weak verifiers and propose algorithmic frameworks for global and local flow expansion via probability-space optimization. Then, we present Flow Expander (FE), a scalable mirror descent scheme that provably tackles both problems by verifier-constrained entropy maximization over the flow process noised state space. Next, we provide a thorough theoretical analysis of the proposed method, and state convergence guarantees under both idealized and general assumptions. Finally, we empirically evaluate our method on both illustrative, visually interpretable settings and a molecular design task, showcasing the ability of FE to expand a pre-trained flow model, increasing conformer diversity while preserving validity.

[241] PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI

Yingchen He, Christian D. Weilbach, Martyna E. Wojciechowska, Yuxuan Zhang, Frank Wood

Main category: cs.LG

TL;DR: PLAICraft: A large-scale multimodal dataset of 10,000+ hours of Minecraft gameplay with video, audio (game output and microphone), mouse, and keyboard actions, time-aligned with millisecond precision for embodied AI research.

Details

Motivation: Current limitations in training human-level embodied agents due to lack of large-scale, real-time, multimodal, socially interactive datasets that capture sensory-motor complexity of natural environments.

Method: Developed a data collection platform capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse actions, and keyboard actions with millisecond precision.

Result: Created a dataset with over 10,000 hours of gameplay from more than 10,000 global participants, plus an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory.

Conclusion: PLAICraft enables training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.

Abstract: Advances in deep generative modeling have made it increasingly plausible to train human-level embodied agents. Yet progress has been limited by the absence of large-scale, real-time, multi-modal, and socially interactive datasets that reflect the sensory-motor complexity of natural environments. To address this, we present PLAICraft, a novel data collection platform and dataset capturing multiplayer Minecraft interactions across five time-aligned modalities: video, game output audio, microphone input audio, mouse, and keyboard actions. Each modality is logged with millisecond time precision, enabling the study of synchronous, embodied behaviour in a rich, open-ended world. The dataset comprises over 10,000 hours of gameplay from more than 10,000 global participants. Alongside the dataset, we provide an evaluation suite for benchmarking model capabilities in object recognition, spatial awareness, language grounding, and long-term memory. PLAICraft opens a path toward training and evaluating agents that act fluently and purposefully in real time, paving the way for truly embodied artificial intelligence.

[242] Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

Jayadev Billa

Main category: cs.LG

TL;DR: The paper studies geometric patterns in neural network training, tracking representation collapse and emergence across model scales and tasks, finding universal collapse patterns but limited predictive power for emergence timing.

Details

Motivation: To understand the mechanistic opacity of capability emergence during neural network training by examining geometric patterns across different model scales and tasks.

Method: Tracked five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). Analyzed representation collapse patterns, propagation through layers, and predictive relationships between geometric measures and emergence.

Result: Found universal representation collapse to task-specific floors that are scale-invariant; collapse propagates top-down through layers; geometric hierarchy where representation geometry leads emergence (75-100% precursor rate for hard tasks); limited predictive power for fine-grained emergence timing.

Conclusion: Provides geometric anatomy of emergence and its boundary conditions, showing geometric patterns replicate but predictive signals require task-training alignment not present in naturalistic pre-training.

Abstract: Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210× parameter range (e.g., modular arithmetic collapses to RANKME ~ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task × model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not – the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.
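
The RANKME value quoted above is an effective-rank measure. As a minimal sketch, assuming the usual RankMe definition (the exponential of the Shannon entropy of the normalized singular values of a representation matrix), it can be computed as follows. The function name, the `eps` constant, and the toy singular values are illustrative assumptions, and the singular values are assumed to have been computed elsewhere; this is not the paper's code.

```python
import math

def rankme(singular_values, eps=1e-7):
    """Effective rank: exp of the Shannon entropy of the normalized singular values."""
    total = sum(singular_values)
    # Normalize to a distribution; eps keeps log() finite for zero singular values.
    probs = [s / total + eps for s in singular_values]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Hypothetical spectra: two dominant directions vs. four equally strong ones.
collapsed = rankme([5.0, 5.0, 0.0, 0.0])
spread = rankme([1.0, 1.0, 1.0, 1.0])
print(round(collapsed, 3), round(spread, 3))
```

Under this definition, a representation whose variance sits in two equally strong directions scores close to 2, which is one way to read a "collapse to RANKME ~ 2.0".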

[243] High entropy leads to symmetry equivariant policies in Dec-POMDPs

Johannes Forkel, Constantin Ruhdorfer, Andreas Bulling, Jakob Foerster

Main category: cs.LG

TL;DR: High entropy regularization in Dec-POMDPs ensures policy gradient convergence to symmetric joint policies, enabling compatible cross-play between independently trained agents.

Details

Motivation: Addressing the challenge of cross-play compatibility in decentralized partially observable Markov decision processes (Dec-POMDPs), where independently trained policies often fail to coordinate effectively when paired with other independently trained policies.

Method: Theoretical analysis of entropy-regularized policy gradient ascent with tabular softmax parametrization, plus empirical evaluation using independent PPO in Hanabi, Overcooked, and Yokai environments with varying entropy coefficients.

Result: High entropy regularization ensures convergence to symmetric joint policies, making cross-play returns equal to self-play returns. In Hanabi, this approach achieves state-of-the-art inter-seed cross-play performance.

Conclusion: Higher entropy coefficients than typically used should be considered in Dec-POMDP hyperparameter sweeps, as they enable better cross-play compatibility while self-play performance can be recovered through post-training greedification.

Abstract: We prove that in any Dec-POMDP, sufficiently high entropy regularization ensures that policy gradient ascent with tabular softmax parametrization always converges, for any initialization, to the same joint policy, and that this joint policy is equivariant w.r.t. all symmetries of the Dec-POMDP. In particular, policies coming from different random seeds will be fully compatible, in that their cross-play returns are equal to their self-play returns. Through extensive empirical evaluation of independent PPO in the Hanabi, Overcooked, and Yokai environments, we find that the entropy coefficient has a massive influence on the cross-play returns between independently trained policies, and that the drop in self-play returns coming from increased entropy regularization can often be counteracted by greedifying the learned policies after training. In Hanabi we achieve a new SOTA in inter-seed cross-play this way. Despite clear limitations of this recipe, which we point out, both our theoretical and empirical results indicate that during hyperparameter sweeps in Dec-POMDPs, one should consider far higher entropy coefficients than is typically done.
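
To make the cross-play claim concrete, here is a toy sketch, not the paper's experimental setup (which uses independent PPO). It assumes a one-step, two-agent, two-action coordination game, which is a degenerate Dec-POMDP whose symmetry is relabeling the two actions. The game, the initial logits standing in for random seeds, the learning rate, and the two entropy coefficients are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(theta1, theta2, tau, lr=0.5, steps=4000):
    """Gradient ascent on expected return + tau * entropy with tabular softmax policies.

    Each agent holds one logit; p = sigmoid(theta) is its probability of action 0.
    The team return is 1 if both agents pick the same action, else 0.
    """
    for _ in range(steps):
        p1, p2 = sigmoid(theta1), sigmoid(theta2)
        # d/dtheta1 [p1*p2 + (1-p1)*(1-p2) + tau*H(p1)] = p1*(1-p1)*((2*p2-1) - tau*theta1)
        g1 = p1 * (1 - p1) * ((2 * p2 - 1) - tau * theta1)
        g2 = p2 * (1 - p2) * ((2 * p1 - 1) - tau * theta2)
        theta1 += lr * g1
        theta2 += lr * g2
    return sigmoid(theta1), sigmoid(theta2)

def expected_return(p_first, p_second):
    """Probability that the two agents choose the same action."""
    return p_first * p_second + (1 - p_first) * (1 - p_second)

def self_and_cross_play(tau):
    # Two hypothetical "seeds": different initial logits for the agent pair.
    a = train(2.0, 1.5, tau)
    b = train(-2.0, -1.5, tau)
    self_play = expected_return(a[0], a[1])
    cross_play = expected_return(a[0], b[1])  # agent 1 of seed A with agent 2 of seed B
    return self_play, cross_play

low = self_and_cross_play(tau=0.05)
high = self_and_cross_play(tau=1.0)
print("low entropy  (self, cross):", low)
print("high entropy (self, cross):", high)
```

With the small coefficient the two seeds settle on opposite conventions, so self-play is near 1 and cross-play near 0. With the large coefficient both seeds converge to the same symmetric policy and the two returns coincide. In this toy the shared policy is uniform, so the return is only 0.5, which illustrates the self-play cost the abstract mentions.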

[244] Geometry-Aware Uncertainty Quantification via Conformal Prediction on Manifolds

Marzieh Amiri Shahbazi, Ali Baheri

Main category: cs.LG

TL;DR: Adaptive geodesic conformal prediction framework for regression on Riemannian manifolds using geodesic nonconformity scores and difficulty estimation to handle heteroscedastic noise, producing geodesic caps with uniform conditional coverage.

Details

Motivation: Existing conformal prediction methods assume Euclidean output spaces and produce poorly calibrated prediction regions when responses lie on Riemannian manifolds, failing to handle heteroscedastic noise and chart distortion issues.

Method: Proposes adaptive geodesic conformal prediction that replaces Euclidean residuals with geodesic nonconformity scores, normalizes them by cross-validated difficulty estimator to handle heteroscedastic noise, and produces geodesic caps on spheres as prediction regions.

Result: The method substantially improves conditional coverage uniformity, raises worst-case coverage closer to nominal levels, reduces coverage area waste compared to coordinate-based baselines, and demonstrates effectiveness in synthetic sphere experiments and real-world geomagnetic field forecasting tasks.

Conclusion: Adaptive geodesic conformal prediction provides a principled framework for regression on Riemannian manifolds with distribution-free coverage guarantees, addressing limitations of Euclidean-based methods through geodesic scoring and difficulty adaptation.

Abstract: Conformal prediction provides distribution-free coverage guarantees for regression; yet existing methods assume Euclidean output spaces and produce prediction regions that are poorly calibrated when responses lie on Riemannian manifolds. We propose adaptive geodesic conformal prediction, a framework that replaces Euclidean residuals with geodesic nonconformity scores and normalizes them by a cross-validated difficulty estimator to handle heteroscedastic noise. The resulting prediction regions, geodesic caps on the sphere, have position-independent area and adapt their size to local prediction difficulty, yielding substantially more uniform conditional coverage than non-adaptive alternatives. In a synthetic sphere experiment with strong heteroscedasticity and a real-world geomagnetic field forecasting task derived from IGRF-14 satellite data, the adaptive method markedly reduces conditional coverage variability and raises worst-case coverage much closer to the nominal level, while coordinate-based baselines waste a large fraction of coverage area due to chart distortion.
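
The recipe described above can be sketched as split conformal prediction with a normalized geodesic score. This is a minimal illustration under stated assumptions: outputs are unit vectors on the 2-sphere, the difficulty estimates are supplied as plain numbers (the paper uses a cross-validated estimator, not reproduced here), and all function names and calibration data are hypothetical.

```python
import math

def geodesic_distance(u, v):
    """Great-circle distance in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]

def calibrate(predictions, targets, difficulties, alpha):
    """Quantile of geodesic errors normalized by per-sample difficulty."""
    scores = [geodesic_distance(p, y) / d
              for p, y, d in zip(predictions, targets, difficulties)]
    return conformal_quantile(scores, alpha)

def cap_radius(q, difficulty):
    """Angular radius of the geodesic cap; pi already covers the whole sphere."""
    return min(math.pi, q * difficulty)

def covers(prediction, radius, y):
    return geodesic_distance(prediction, y) <= radius

def point_at(angle):
    """Unit vector at the given angle from the north pole (toy data helper)."""
    return (math.sin(angle), 0.0, math.cos(angle))

# Toy calibration set: every prediction is the north pole, with errors growing
# in the second half, where the assumed difficulty estimate is also larger.
north = (0.0, 0.0, 1.0)
angles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
difficulties = [1.0] * 5 + [2.0] * 5
q = calibrate([north] * 10, [point_at(a) for a in angles], difficulties, alpha=0.2)
easy_radius = cap_radius(q, 1.0)
hard_radius = cap_radius(q, 2.0)
print(q, easy_radius, hard_radius)
```

On the unit sphere a cap of angular radius r has area 2π(1 − cos r) wherever it is centered, which is the position independence the abstract refers to. The cap for a harder input is simply wider.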

[245] MolCrystalFlow: Molecular Crystal Structure Prediction via Flow Matching

Cheng Zeng, Harry W. Sullivan, Thomas Egg, Maya M. Martirossyan, Philipp Höllmer, Jirui Jin, Richard G. Hennig, Adrian Roitberg, Stefano Martiniani, Ellad B. Tadmor, Mingjie Liu

Main category: cs.LG

TL;DR: MolCrystalFlow is a flow-based generative model for molecular crystal structure prediction that disentangles intramolecular complexity from intermolecular packing by treating molecules as rigid bodies and learning lattice parameters, orientations, and positions on Riemannian manifolds.

Details

Motivation: Molecular crystal structure prediction is challenging due to large molecule sizes and complex intra-/intermolecular interactions. While generative models have succeeded for other materials, extending them to fully periodic molecular crystals remains elusive.

Method: Flow-based generative model that embeds molecules as rigid bodies, jointly learning lattice matrix, molecular orientations, and centroid positions. Centroids/orientations are represented on native Riemannian manifolds for geodesic flow construction and GNN operations respecting geometric symmetries.

Result: Benchmarked against state-of-the-art generative models for large periodic crystals and rule-based methods on two open-source molecular crystal datasets. Demonstrated integration with a universal ML potential to accelerate molecular crystal structure prediction.

Conclusion: Paves the way for data-driven generative discovery of molecular crystals by providing an effective framework for molecular crystal structure prediction that respects geometric symmetries and can be integrated with ML potentials.

Abstract: Molecular crystal structure prediction represents a grand challenge in computational chemistry due to the large sizes of constituent molecules and complex intra- and intermolecular interactions. While generative modeling has revolutionized structure discovery for molecules, inorganic solids, and metal-organic frameworks, extending such approaches to fully periodic molecular crystals is still elusive. Here, we present MolCrystalFlow, a flow-based generative model for molecular crystal structure prediction. The framework disentangles intramolecular complexity from intermolecular packing by embedding molecules as rigid bodies and jointly learning the lattice matrix, molecular orientations, and centroid positions. Centroids and orientations are represented on their native Riemannian manifolds, allowing geodesic flow construction and graph neural network operations that respect geometric symmetries. We benchmark our model against state-of-the-art generative models for large-size periodic crystals and rule-based structure generation methods on two open-source molecular crystal datasets. We demonstrate an integration of the MolCrystalFlow model with a universal machine learning potential to accelerate molecular crystal structure prediction, paving the way for data-driven generative discovery of molecular crystals.

[246] AI-CARE: Carbon-Aware Reporting Evaluation Metric for AI Models

KC Santosh, Srikanth Baride, Rodrigue Rizk

Main category: cs.LG

TL;DR: AI-CARE is a benchmarking tool that evaluates ML models not just on performance metrics but also on energy consumption and carbon emissions, introducing carbon-performance tradeoff curves to promote environmentally responsible AI.

Details

Motivation: Current ML benchmarks focus only on standard performance metrics (accuracy, BLEU, mAP) while ignoring environmental costs, which is misaligned with practical deployment needs in energy-constrained environments and climate-aware enterprises.

Method: Proposes AI-CARE evaluation tool for reporting energy consumption and carbon emissions of ML models, and introduces carbon-performance tradeoff curves that visualize Pareto frontiers between performance and carbon cost.

Result: Theoretical analysis and empirical validation on representative ML workloads show that carbon-aware benchmarking changes model rankings and encourages architectures that are both accurate and environmentally responsible.

Conclusion: AI-CARE aims to shift the research community toward transparent, multi-objective evaluation and align ML progress with global sustainability goals.

Abstract: As machine learning (ML) continues its rapid expansion, the environmental cost of model training and inference has become a critical societal concern. Existing benchmarks overwhelmingly focus on standard performance metrics such as accuracy, BLEU, or mAP, while largely ignoring energy consumption and carbon emissions. This single-objective evaluation paradigm is increasingly misaligned with the practical requirements of large-scale deployment, particularly in energy-constrained environments such as mobile devices, developing regions, and climate-aware enterprises. In this paper, we propose AI-CARE, an evaluation tool for reporting energy consumption and carbon emissions of ML models. In addition, we introduce the carbon-performance tradeoff curve, an interpretable tool that visualizes the Pareto frontier between performance and carbon cost. We demonstrate, through theoretical analysis and empirical validation on representative ML workloads, that carbon-aware benchmarking changes the relative ranking of models and encourages architectures that are simultaneously accurate and environmentally responsible. Our proposal aims to shift the research community toward transparent, multi-objective evaluation and align ML progress with global sustainability goals. The tool and documentation are available at https://github.com/USD-AI-ResearchLab/ai-care.
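
The Pareto frontier behind a carbon-performance tradeoff curve can be sketched in a few lines. This is an illustration only and does not use the AI-CARE tool or its API; the model names, accuracies, and emission figures are made-up placeholders.

```python
def pareto_frontier(models):
    """Keep models that no other model beats on both accuracy (higher) and carbon (lower).

    models: list of (name, accuracy, kg_co2e) tuples.
    """
    frontier = []
    for name, acc, co2 in models:
        dominated = any(
            a >= acc and c <= co2 and (a > acc or c < co2)
            for _, a, c in models
        )
        if not dominated:
            frontier.append((name, acc, co2))
    # Sort by carbon cost so the result traces the tradeoff curve left to right.
    return sorted(frontier, key=lambda m: m[2])

# Hypothetical measurements: (name, accuracy, kg CO2e).
models = [
    ("small", 0.88, 1.2),
    ("medium", 0.91, 4.0),
    ("wasteful", 0.90, 9.0),
    ("large", 0.92, 15.0),
]
frontier = pareto_frontier(models)
print([name for name, _, _ in frontier])
```

Ranked by accuracy alone, "wasteful" sits above "small". Once carbon is a second objective, "medium" dominates it and it drops off the frontier, which is the kind of ranking change the abstract describes.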

[247] MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan

Main category: cs.LG

TL;DR: MoE-Spec: A training-free expert budgeting method for speculative decoding in Mixture-of-Experts models that enforces fixed expert capacity limits to reduce memory pressure while maintaining verification parallelism.

Details

Motivation: Speculative decoding accelerates LLM inference but faces severe bottlenecks in MoE models where large draft trees activate many unique experts, increasing memory pressure and diminishing speedups relative to autoregressive decoding.

Method: MoE-Spec decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the most important experts for verification and dropping rarely used experts that drive bandwidth overhead.

Result: Experiments across multiple model scales and datasets show 10-30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions.

Conclusion: MoE-Spec provides an effective training-free solution to the memory bottleneck problem in speculative decoding for MoE models, enabling better parallelism without sacrificing quality.

Abstract: Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10–30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
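
A per-layer expert budget of the kind described above might look like the following sketch. The abstract does not state how "contribution to verification" is scored, so summing router weights over the draft tokens is an assumption here. The same goes for renormalizing the surviving weights and for the function and variable names.

```python
from collections import defaultdict

def budget_experts(token_routes, capacity):
    """Keep at most `capacity` experts for one layer during verification.

    token_routes: one dict per draft token, mapping expert id -> router weight.
    Returns the kept expert ids and the pruned, renormalized routes.
    """
    # Assumed importance score: total router weight an expert receives in the draft tree.
    importance = defaultdict(float)
    for routes in token_routes:
        for expert, weight in routes.items():
            importance[expert] += weight
    ranked = sorted(importance, key=lambda e: (-importance[e], e))
    kept = set(ranked[:capacity])

    pruned = []
    for routes in token_routes:
        survivors = {e: w for e, w in routes.items() if e in kept}
        total = sum(survivors.values())
        # A token whose experts were all dropped gets an empty route in this sketch.
        pruned.append({e: w / total for e, w in survivors.items()} if total > 0 else {})
    return kept, pruned

# Hypothetical top-2 routing for four draft tokens over five experts.
routes = [
    {0: 0.6, 1: 0.4},
    {0: 0.7, 2: 0.3},
    {1: 0.5, 3: 0.5},
    {0: 0.9, 4: 0.1},
]
kept, pruned = budget_experts(routes, capacity=2)
print(sorted(kept), pruned)
```

Without a budget this layer would load five experts for four tokens. With a capacity of two, only experts 0 and 1 are loaded, and the long tail (experts 2, 3, and 4) is dropped.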

[248] Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh, Saadia Gabriel

Main category: cs.LG

TL;DR: Multi-objective alignment framework (MODPO) for therapeutic AI that balances patient preferences with clinical safety using direct preference optimization across six therapeutic criteria.

Details

Motivation: Mental health care access is limited by workforce shortages and cost constraints. Current AI therapeutic systems optimize objectives independently, failing to balance patient preferences with clinical safety, creating a need for multi-objective alignment.

Method: Surveyed 335 individuals with mental health experience to collect preference rankings, then developed multi-objective alignment framework using direct preference optimization. Trained reward models for six criteria: empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy. Compared multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging.

Result: Multi-objective DPO (MODPO) achieved superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety). Therapeutic criteria outperformed general communication principles by 17.2%. Blinded clinician evaluation confirmed MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.

Conclusion: Multi-objective alignment using direct preference optimization effectively balances therapeutic objectives in AI systems, addressing the tension between patient preferences and clinical safety in mental health applications.

Abstract: Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria – empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy – and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.

[249] Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods

Dawon Ahn, Het Patel, Aemal Khattak, Jia Chen, Evangelos E. Papalexakis

Main category: cs.LG

TL;DR: Multi-view tensor decomposition framework analyzes driver behavior at railway crossings across three temporal phases using video embeddings to discover location-based behavioral patterns.

Details

Motivation: Traditional railway crossing safety analysis examines individual locations, missing shared behavioral patterns across multiple crossings. There's a need for scalable methods to identify behavioral similarities across locations to inform targeted safety interventions.

Method: Proposes a multi-view tensor decomposition framework that analyzes railway crossing videos from multiple locations using TimeSformer embeddings for three temporal phases (Approach, Waiting, Clearance). Constructs phase-specific similarity matrices and applies non-negative symmetric CP decomposition to discover latent behavioral components with distinct temporal signatures.

Result: Tensor analysis reveals crossing location is a stronger determinant of behavior patterns than time of day, with approach-phase behavior providing particularly discriminative signatures. Visualization shows location-based clustering, with certain crossings forming distinct behavioral clusters.

Conclusion: The automated framework enables scalable pattern discovery across multiple railway crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.

Abstract: Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.
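
One plausible way to write the decomposition, offered as an assumption since the abstract does not give the exact formulation: stack the phase-specific similarity matrices $S^{(k)} \in \mathbb{R}^{N \times N}$ for the $K = 3$ phases into a tensor $\mathcal{X} \in \mathbb{R}^{N \times N \times K}$ over $N$ videos, and fit $R$ non-negative components in which the two video modes share one factor matrix. The symbols $N$, $R$, $A$, and $C$ are notation introduced here for illustration.

```latex
\mathcal{X}_{ijk} = S^{(k)}_{ij} \approx \sum_{r=1}^{R} a_{ir}\, a_{jr}\, c_{kr},
\qquad A = [a_{ir}] \in \mathbb{R}_{\ge 0}^{N \times R},
\quad C = [c_{kr}] \in \mathbb{R}_{\ge 0}^{K \times R}
```

Under this reading, column $r$ of $A$ says which videos load on behavioral component $r$, and column $r$ of $C$ gives that component's weight in the Approach, Waiting, and Clearance phases, that is, its temporal signature.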

[250] Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training

Kevin Wang, Hongqian Niu, Didong Li

Main category: cs.LG

TL;DR: Theoretical analysis shows recursive training with AI-generated data contamination converges under minimal assumptions, with convergence rate determined by baseline model performance and real data fraction.

Details

Motivation: As generative AI systems proliferate, web data becomes contaminated with AI-generated content, creating recursive training cycles where later models train on mixtures of human and AI data. Existing theory only covers simplified settings, but real-world scenarios involve complex distributions and flexible models.

Method: Develops a general theoretical framework with minimal assumptions on real data distribution, allowing generative models to be universal approximators. Analyzes recursive training convergence properties and extends analysis to settings with sampling bias in data collection.

Result: Shows contaminated recursive training converges with rate equal to minimum of baseline model’s convergence rate and fraction of real data used per iteration. Provides first positive theoretical result on recursive training without distributional assumptions, supported by empirical studies.

Conclusion: Recursive training with data contamination is theoretically stable under general conditions, with convergence guaranteed and rate determined by model quality and real data proportion. This addresses concerns about model collapse in complex real-world scenarios.

Abstract: Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with AI-generated material, which is increasingly difficult to separate from naturally generated content. As generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings, where both the real data and the generative model are discrete or Gaussian, and has shown that in these settings such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework with minimal assumptions on the real data distribution and allow the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model’s convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection and support all theoretical results with empirical studies.
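
The contaminated recursive training loop itself is easy to simulate. The sketch below only illustrates the data flow: a fixed fraction of fresh real data is mixed with samples from the previous model at every generation. It deliberately uses a one-dimensional Gaussian "model", which is exactly the simplified setting the paper moves beyond, so it says nothing about the paper's general result. All names and parameter values are assumptions.

```python
import random
import statistics

def recursive_training(real_mean, real_std, alpha, n, generations, seed=0):
    """Refit a Gaussian each generation on alpha*n real and (1-alpha)*n synthetic samples."""
    rng = random.Random(seed)
    # Generation 0 is fit on real data only.
    data = [rng.gauss(real_mean, real_std) for _ in range(n)]
    mu, sigma = statistics.fmean(data), statistics.stdev(data)
    history = [(mu, sigma)]
    n_real = round(alpha * n)
    for _ in range(generations):
        real = [rng.gauss(real_mean, real_std) for _ in range(n_real)]
        # Contamination: the rest of the training set comes from the previous model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        data = real + synthetic
        mu, sigma = statistics.fmean(data), statistics.stdev(data)
        history.append((mu, sigma))
    return history

history = recursive_training(real_mean=0.0, real_std=1.0, alpha=0.5, n=2000, generations=20)
final_mu, final_sigma = history[-1]
print(final_mu, final_sigma)
```

Varying `alpha` in this toy is a quick way to see how the share of real data per iteration affects the drift of the fitted parameters.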

[251] Omni-iEEG: A Large-Scale, Comprehensive iEEG Dataset and Benchmark for Epilepsy Research

Chenda Duan, Yipeng Zhang, Sotaro Kanai, Yuanyi Ding, Atsuro Daida, Pengyue Yu, Tiancheng Zheng, Naoto Kuroda, Shaun A. Hussain, Eishi Asano, Hiroki Nariai, Vwani Roychowdhury

Main category: cs.LG

TL;DR: Omni-iEEG: A large-scale, harmonized intracranial EEG dataset with 302 patients and 178 hours of recordings, including clinical metadata and 36K expert-validated pathological event annotations for reproducible epilepsy research.

Details

Motivation: Current epilepsy research faces challenges with single-center datasets that are inconsistent in format, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance.

Method: Extensive efforts to reconcile heterogeneous iEEG formats, metadata, and recordings across publicly available sources to create a harmonized dataset with clinical metadata (seizure onset zones, resections, surgical outcomes) validated by epileptologists.

Result: Created Omni-iEEG with 302 patients, 178 hours of high-resolution recordings, harmonized clinical metadata, and over 36K expert-validated pathological event annotations. Established clinically meaningful tasks with unified evaluation metrics.

Conclusion: Omni-iEEG serves as a foundation for reproducible, generalizable, and clinically translatable epilepsy research, bridging machine learning and epilepsy research with standardized benchmarks and clinically relevant evaluation settings.

Abstract: Epilepsy affects over 50 million people worldwide, and one-third of patients suffer drug-resistant seizures where surgery offers the best chance of seizure freedom. Accurate localization of the epileptogenic zone (EZ) relies on intracranial EEG (iEEG). Clinical workflows, however, remain constrained by labor-intensive manual review. At the same time, existing data-driven approaches are typically developed on single-center datasets that are inconsistent in format and metadata, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance. With extensive efforts to reconcile heterogeneous iEEG formats, metadata, and recordings across publicly available sources, we present Omni-iEEG, a large-scale, pre-surgical iEEG resource comprising 302 patients and 178 hours of high-resolution recordings. The dataset includes harmonized clinical metadata such as seizure onset zones, resections, and surgical outcomes, all validated by board-certified epileptologists. In addition, Omni-iEEG provides over 36K expert-validated annotations of pathological events, enabling robust biomarker studies. Omni-iEEG serves as a bridge between machine learning and epilepsy research. It defines clinically meaningful tasks with unified evaluation metrics grounded in clinical priors, enabling systematic evaluation of models in clinically relevant settings. Beyond benchmarking, we demonstrate the potential of end-to-end modeling on long iEEG segments and highlight the transferability of representations pretrained on non-neurophysiological domains. Together, these contributions establish Omni-iEEG as a foundation for reproducible, generalizable, and clinically translatable epilepsy research. The project page with dataset and code links is available at omni-ieeg.github.io/omni-ieeg.

[252] Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff

Patrick Pynadath, Ruqi Zhang

Main category: cs.LG

TL;DR: Two-stream attention in any-order autoregressive models addresses a structural-semantic tradeoff rather than just separating position from content, as shown through experiments with Decoupled RoPE.

Details

Motivation: The paper investigates why two-stream attention works well in any-order autoregressive models (AO-ARMs). While typically motivated as separating token content from position, the authors argue it may serve a more subtle role in addressing a fundamental tradeoff between semantic and structural attention requirements.

Method: The authors propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. This isolates the position-content separation from the structural-semantic tradeoff. They test this approach at different sequence lengths to examine performance degradation as semantic and structural proximity diverge.

Result: Decoupled RoPE performs competitively at short sequence lengths where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. This suggests the success of two-stream attention stems from circumventing the structural-semantic tradeoff inherent to any-order generation.

Conclusion: Two-stream attention in AO-ARMs addresses a deeper structural-semantic tradeoff rather than merely separating position from content. The hidden representation must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, which compete for attention capacity but can specialize across two streams.

Abstract: Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths, where semantic and structural proximity coincide, but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.

[253] Axle Sensor Fusion for Online Continual Wheel Fault Detection in Wayside Railway Monitoring

Afonso Lourenço, Francisca Osório, Diogo Risca, Goreti Marreiros

Main category: cs.LG

TL;DR: A semantic-aware continual learning framework for railway fault diagnostics using unsupervised VAE encoding of accelerometer signals fused with AI-extracted semantic metadata from strain sensors, with gradient boosting and replay-based continual learning for adaptation to evolving operational conditions.

Details

Motivation: Railway maintenance needs reliable predictive frameworks that can handle evolving operational patterns without manual feature engineering. Traditional methods degrade in online settings with changing conditions like train types, speeds, loads, and track profiles.

Method: 1) Encode accelerometer signals via Variational AutoEncoder for unsupervised latent representations; 2) Extract semantic metadata (axle counts, wheel indexes, deformations) via AI-driven peak detection on fiber Bragg grating sensors; 3) Fuse VAE embeddings with semantic metadata; 4) Use lightweight gradient boosting classifier for anomaly scoring with minimal labels; 5) Implement replay-based continual learning strategy for adaptation without catastrophic forgetting.

Result: The model successfully detects minor imperfections (flats and polygonization) while adapting to evolving operational conditions using only a single accelerometer and strain gauge in wayside monitoring.

Conclusion: The proposed framework enables reliable, label-efficient fault diagnostics for railways that can adapt to changing operational patterns without catastrophic forgetting, using semantic-aware fusion of sensor data.

Abstract: Reliable and cost-effective maintenance is essential for railway safety, particularly at the wheel-rail interface, which is prone to wear and failure. Predictive maintenance frameworks increasingly leverage sensor-generated time-series data, yet traditional methods require manual feature engineering, and deep learning models often degrade in online settings with evolving operational patterns. This work presents a semantic-aware, label-efficient continual learning framework for railway fault diagnostics. Accelerometer signals are encoded via a Variational AutoEncoder into latent representations capturing the normal operational structure in a fully unsupervised manner. Importantly, semantic metadata, including axle counts, wheel indexes, and strain-based deformations, is extracted via AI-driven peak detection on fiber Bragg grating sensors (resistant to electromagnetic interference) and fused with the VAE embeddings, enhancing anomaly detection under unknown operational conditions. A lightweight gradient boosting supervised classifier stabilizes anomaly scoring with minimal labels, while a replay-based continual learning strategy enables adaptation to evolving domains without catastrophic forgetting. Experiments show the model detects minor imperfections due to flats and polygonization, while adapting to evolving operational conditions, such as changes in train type, speed, load, and track profiles, captured using a single accelerometer and strain gauge in wayside monitoring.
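
The fusion and replay steps summarized above can be illustrated with a minimal sketch. The abstract does not say how the replay memory is maintained, so reservoir sampling is an assumption here, and the names `ReservoirReplayBuffer`, `fuse`, and `mixed_batch` are hypothetical helpers, not the authors' code.

```python
import random


class ReservoirReplayBuffer:
    """Fixed-size memory holding a uniform sample of everything seen so far.

    Assumption: the paper only says "replay-based"; reservoir sampling is one
    common way to maintain such a memory.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def mixed_batch(self, new_items, n_replay):
        """Return the new samples plus up to n_replay stored samples."""
        k = min(n_replay, len(self.items))
        return list(new_items) + self.rng.sample(self.items, k)


def fuse(latent, metadata):
    """Concatenate a VAE latent vector with semantic metadata features."""
    return list(latent) + list(metadata)


# Hypothetical usage: each train passage yields a latent code and metadata
# such as (axle count, wheel index).
buffer = ReservoirReplayBuffer(capacity=3)
for passage in range(10):
    buffer.add(fuse([0.1 * passage, -0.2], [4, passage % 4]))
batch = buffer.mixed_batch([fuse([0.9, 0.0], [4, 1])], n_replay=2)
```

A downstream classifier would then be refit on `batch`, so that older operating conditions stay represented while new ones arrive.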

[254] Feature-based morphological analysis of shape graph data

Murad Hossen, Demetrio Labate, Nicolas Charon

Main category: cs.LG

TL;DR: A computational pipeline for statistical analysis of shape graph datasets that analyzes both connectivity structure and geometric properties of network branches in 2D/3D spaces.

Details

Motivation: Traditional graph analysis focuses only on connectivity structure, but many real-world networks (like urban roads, neuronal traces) have important geometric properties of their branches that need to be analyzed alongside topology.

Method: Extracts a curated set of topological, geometric and directional features with key invariance properties, then uses this feature representation for group comparison, clustering and classification tasks.

Result: Evaluated on real-world datasets including urban road networks, neuronal traces and astrocyte imaging, benchmarked against alternative methods showing effectiveness of the proposed representation.

Conclusion: The proposed computational pipeline enables comprehensive statistical analysis of shape graphs by capturing both topological and geometric properties, and is benchmarked against alternative feature-based and non-feature-based methods on several real-world applications.

Abstract: This paper introduces and demonstrates a computational pipeline for the statistical analysis of shape graph datasets, namely geometric networks embedded in 2D or 3D spaces. Unlike traditional abstract graphs, our purpose is not only to retrieve and distinguish variations in the connectivity structure of the data but also geometric differences of the network branches. Our proposed approach relies on the extraction of a specifically curated and explicit set of topological, geometric and directional features, designed to satisfy key invariance properties. We leverage the resulting feature representation for tasks such as group comparison, clustering and classification on cohorts of shape graphs. The effectiveness of this representation is evaluated on several real-world datasets including urban road/street networks, neuronal traces and astrocyte imaging. These results are benchmarked against several alternative methods, both feature-based and not.

[255] On the Power of Source Screening for Learning Shared Feature Extractors

Leo Wang, Connor Mclaughlin, Lili Su

Main category: cs.LG

TL;DR: The paper proposes source screening methods to identify informative subsets of data sources for optimal shared representation learning, showing that training on carefully selected subsets can achieve minimax optimality even when discarding substantial data.

Details

Motivation: Existing multi-source learning methods typically include all related data sources simultaneously, but sources with low relevance or poor quality can hinder representation learning. The paper investigates which data sources should be learned jointly, focusing on traditionally "good" collections where sources have similar relevance and quality with respect to the true underlying common structure.

Method: The paper focuses on linear settings where sources share a low-dimensional subspace. It formalizes the notion of informative subpopulations, develops algorithms and practical heuristics for identifying such subsets, and shows that training on carefully selected subsets suffices for minimax optimal subspace estimation.

Result: Theoretical analysis and empirical evaluations on synthetic and real-world datasets validate that source screening plays a central role in statistically optimal subspace estimation. For a broad class of problem instances, training on selected subsets achieves minimax optimality even when discarding substantial portions of data.

Conclusion: Source screening is crucial for optimal shared representation learning, and carefully selecting informative subsets of data sources can lead to statistically optimal performance while potentially discarding irrelevant or harmful data.

Abstract: Learning with shared representation is widely recognized as an effective way to separate commonalities from heterogeneity across various heterogeneous sources. Most existing work includes all related data sources via simultaneously training a common feature extractor and source-specific heads. It is well understood that data sources with low relevance or poor quality may hinder representation learning. In this paper, we further dive into the question of which data sources should be learned jointly by focusing on the traditionally deemed "good" collection of sources, in which individual sources have similar relevance and qualities with respect to the true underlying common structure. Towards tractability, we focus on the linear setting where sources share a low-dimensional subspace. We find that source screening can play a central role in statistically optimal subspace estimation. We show that, for a broad class of problem instances, training on a carefully selected subset of sources suffices to achieve minimax optimality, even when a substantial portion of data is discarded. We formalize the notion of an informative subpopulation, develop algorithms and practical heuristics for identifying such subsets, and validate their effectiveness through both theoretical analysis and empirical evaluations on synthetic and real-world datasets.

[256] Investigating GNN Convergence on Large Randomly Generated Graphs with Realistic Node Feature Correlations

Mohammed Zain Ali Ahmed

Main category: cs.LG

TL;DR: A study analyzing GNN convergence on random graphs with correlated node features, showing GNNs may be more expressive than previously thought when applied to realistic graphs.

Details

Motivation: Existing studies on GNN convergence behavior on large random graphs typically don't model correlations between node features, which naturally exist in real-life networks. This leads to derived limitations that don't truly reflect GNN expressive power on realistic graphs.

Method: Introduces a novel method to generate random graphs with correlated node features, where features are sampled to ensure correlation between neighboring nodes. The sampling scheme is motivated by properties exhibited by real-life graphs, particularly those captured by the Barabási-Albert model.

Result: Theoretical analysis indicates convergence can be avoided in some cases, which is empirically validated on large random graphs generated using the novel method. Observed divergent behavior provides evidence that GNNs may be more expressive than initial studies would suggest.

Conclusion: GNNs may be more expressive than previously thought when applied to realistic graphs with correlated node features, challenging limitations derived from studies using uncorrelated feature models.

Abstract: There are a number of existing studies analysing the convergence behaviour of graph neural networks on large random graphs. Unfortunately, the majority of these studies do not model correlations between node features, which would naturally exist in a variety of real-life networks. Consequently, the derived limitations of GNNs, resulting from such convergence behaviour, are not truly reflective of the expressive power of GNNs when applied to realistic graphs. In this paper, we will introduce a novel method to generate random graphs that have correlated node features. The node features will be sampled in such a manner to ensure correlation between neighbouring nodes. As motivation for our choice of sampling scheme, we will appeal to properties exhibited by real-life graphs, particularly properties that are captured by the Barabási-Albert model. A theoretical analysis will strongly indicate that convergence can be avoided in some cases, which we will empirically validate on large random graphs generated using our novel method. The observed divergent behaviour provides evidence that GNNs may be more expressive than initial studies would suggest, especially on realistic graphs.
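
To make the idea of neighbour-correlated features concrete, here is a small sketch that grows a Barabási-Albert graph by preferential attachment. The abstract does not give the paper's sampling scheme, so the rule used here (a new node's feature is the mean of its attachment targets' features plus Gaussian noise) is an assumption, and both function names are hypothetical.

```python
import random


def ba_graph_with_correlated_features(n, m, noise=0.1, seed=0):
    """Grow a Barabási-Albert graph with one scalar feature per node.

    Assumed feature rule (not taken from the paper): a new node's feature is
    the mean of its attachment targets' features plus Gaussian noise.
    """
    rng = random.Random(seed)
    # Seed graph: a complete graph on m + 1 nodes with independent features.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    features = [rng.gauss(0.0, 1.0) for _ in range(m + 1)]
    # Each node appears once per incident edge, so a uniform draw from this
    # list is degree-proportional (preferential attachment).
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        mean = sum(features[t] for t in targets) / m
        features.append(mean + rng.gauss(0.0, noise))
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges, features


def edge_correlation(edges, features):
    """Pearson correlation between the features at the two ends of an edge."""
    # Count every edge in both directions so the measure is symmetric.
    xs = [features[u] for u, v in edges] + [features[v] for u, v in edges]
    ys = [features[v] for u, v in edges] + [features[u] for u, v in edges]
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return cov / var


edges, features = ba_graph_with_correlated_features(n=200, m=2)
print(len(edges), edge_correlation(edges, features))
```

Comparing `edge_correlation` for this graph against one whose features are drawn independently is a quick way to see the property the abstract refers to.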

[257] ASPEN: Spectral-Temporal Fusion for Cross-Subject Brain Decoding

Megan Lee, Seung Ha Hwang, Inhyeok Choi, Shreyas Darade, Mengchun Zhang, Kateryna Shapovalenko

Main category: cs.LG

TL;DR: ASPEN is a hybrid EEG-based BCI architecture that combines spectral and temporal features via multiplicative fusion to improve cross-subject generalization by leveraging the higher cross-subject stability of spectral representations.

Details

Motivation: Cross-subject generalization in EEG-based BCIs is challenging due to individual variability in neural signals. The authors investigate whether spectral representations offer more stable features for cross-subject transfer than temporal waveforms.

Method: Through correlation analyses across three EEG paradigms (SSVEP, P300, Motor Imagery), they found spectral features have higher cross-subject similarity. They then introduced ASPEN, a hybrid architecture combining spectral and temporal feature streams via multiplicative fusion, requiring cross-modal agreement for feature propagation.

Result: Experiments across six benchmark datasets show ASPEN dynamically achieves optimal spectral-temporal balance depending on the paradigm. It achieves best unseen-subject accuracy on three of six datasets and competitive performance on others.

Conclusion: Multiplicative multimodal fusion enables effective cross-subject generalization in EEG-based BCIs by leveraging complementary spectral and temporal information.

Abstract: Cross-subject generalization in EEG-based brain-computer interfaces (BCIs) remains challenging due to individual variability in neural signals. We investigate whether spectral representations offer more stable features for cross-subject transfer than temporal waveforms. Through correlation analyses across three EEG paradigms (SSVEP, P300, and Motor Imagery), we find that spectral features exhibit consistently higher cross-subject similarity than temporal signals. Motivated by this observation, we introduce ASPEN, a hybrid architecture that combines spectral and temporal feature streams via multiplicative fusion, requiring cross-modal agreement for features to propagate. Experiments across six benchmark datasets reveal that ASPEN is able to dynamically achieve the optimal spectral-temporal balance depending on the paradigm. ASPEN achieves the best unseen-subject accuracy on three of six datasets and competitive performance on others, demonstrating that multiplicative multimodal fusion enables effective cross-subject generalization.
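
The phrase "requiring cross-modal agreement for features to propagate" can be illustrated with a toy comparison of multiplicative and additive fusion. This is only a sketch on plain lists; ASPEN's actual fusion presumably acts on learned feature maps, and the function names below are hypothetical.

```python
def multiplicative_fusion(spectral, temporal):
    """Element-wise product: a feature survives only if both streams activate it."""
    return [s * t for s, t in zip(spectral, temporal)]


def additive_fusion(spectral, temporal):
    """Element-wise sum: a feature survives if either stream activates it."""
    return [s + t for s, t in zip(spectral, temporal)]


# Toy activations: the streams agree on feature 0 and disagree on 1 and 2.
spectral = [0.9, 0.8, 0.0]
temporal = [0.7, 0.0, 0.6]

gated = multiplicative_fusion(spectral, temporal)
summed = additive_fusion(spectral, temporal)
# Only feature 0 is nonzero in `gated`; all three are nonzero in `summed`.
print(gated, summed)
```

The product acts as a soft AND gate, which is one way to read the abstract's claim that features propagate only when both modalities agree.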

[258] Differentially Private Non-convex Distributionally Robust Optimization

Difei Xu, Meng Ding, Zebin Ma, Huanyi Xie, Youming Tao, Aicha Slaitane, Di Wang

Main category: cs.LG

TL;DR: DP-DRO framework combining differential privacy with distributionally robust optimization for non-convex losses with ψ-divergence constraints

Details

Motivation: Real-world ML deployments face distribution shifts, group imbalances, and adversarial perturbations where ERM degrades. DRO offers robustness but training data contains sensitive information requiring DP protection. DP-DRO has received little attention due to its minimax structure with uncertainty constraints.

Method: Develop DP optimization methods for finite-sum DRO with ψ-divergence and non-convex loss: 1) DP Double-Spider for general ψ-divergence by reformulating as minimization problem, 2) DP Recursive-Spider for KL-divergence by transforming to compositional finite-sum optimization.

Result: DP Double-Spider achieves a utility bound of O(1/√n + (√(d log(1/δ))/(nε))^{2/3}) in gradient norm. DP Recursive-Spider for KL-divergence achieves O((√(d log(1/δ))/(nε))^{2/3}), matching the best-known result for non-convex DP-ERM. Experimentally, the proposed methods outperform existing DP minimax optimization approaches.

Conclusion: Comprehensive study of DP-DRO with ψ-divergence and non-convex loss, developing novel DP optimization methods with theoretical guarantees and empirical superiority over existing approaches for robust and private ML.

Abstract: Real-world deployments routinely face distribution shifts, group imbalances, and adversarial perturbations, under which the traditional Empirical Risk Minimization (ERM) framework can degrade severely. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case expected loss over an uncertainty set of distributions, offering a principled approach to robustness. Meanwhile, as training data in DRO always involves sensitive information, safeguarding it against leakage under Differential Privacy (DP) is essential. In contrast to classical DP-ERM, DP-DRO has received much less attention due to its minimax optimization structure with uncertainty constraint. To bridge the gap, we provide a comprehensive study of DP-(finite-sum)-DRO with $ψ$-divergence and non-convex loss. First, we study DRO with general $ψ$-divergence by reformulating it as a minimization problem, and develop a novel $(\varepsilon, δ)$-DP optimization method, called DP Double-Spider, tailored to this structure. Under mild assumptions, we show that it achieves a utility bound of $\mathcal{O}(\frac{1}{\sqrt{n}}+ (\frac{\sqrt{d \log (1/δ)}}{n \varepsilon})^{2/3})$ in terms of the gradient norm, where $n$ denotes the data size and $d$ denotes the model dimension. We further improve the utility rate for specific divergences. In particular, for DP-DRO with KL-divergence, by transforming the problem into a compositional finite-sum optimization problem, we develop a DP Recursive-Spider method and show that it achieves a utility bound of $\mathcal{O}((\frac{\sqrt{d \log(1/δ)}}{n\varepsilon})^{2/3} )$, matching the best-known result for non-convex DP-ERM. Experimentally, we demonstrate that our proposed methods outperform existing approaches for DP minimax optimization.
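
The remark that KL-divergence DRO can be turned into a compositional finite-sum problem can be illustrated with the standard Donsker-Varadhan identity: for a KL penalty with weight λ, the worst-case expected loss equals λ log((1/n) Σ exp(ℓ_i/λ)), an outer function (λ log) applied to an inner finite sum. The sketch below assumes this penalized form, which may differ from the paper's exact formulation, and contains no privacy mechanism; the function names are hypothetical.

```python
import math


def kl_dro_objective(losses, lam):
    """Penalized KL-DRO value: lam * log(mean(exp(loss / lam)))."""
    scaled = [loss / lam for loss in losses]
    shift = max(scaled)  # subtract the max for a stable log-sum-exp
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return lam * (shift + math.log(mean_exp))


def worst_case_weights(losses, lam):
    """Maximizing distribution: weights proportional to exp(loss / lam)."""
    scaled = [loss / lam for loss in losses]
    shift = max(scaled)
    raw = [math.exp(s - shift) for s in scaled]
    total = sum(raw)
    return [r / total for r in raw]


losses = [0.2, 0.5, 3.0]
for lam in (100.0, 1.0, 0.01):
    print(lam, kl_dro_objective(losses, lam), worst_case_weights(losses, lam))
```

A large λ recovers (approximately) the average loss of ERM, while a small λ approaches the maximum loss, which shows how the penalty weight trades off robustness against average-case fit.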

[259] HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong

Main category: cs.LG

TL;DR: HiPER: Hierarchical Plan-Execute RL framework for LLM agents that separates high-level planning from low-level execution to improve performance in long-horizon tasks with sparse rewards.

Details

Motivation: Training LLMs as interactive agents for multi-turn decision-making is challenging in long-horizon tasks with sparse rewards, where existing flat RL policies struggle with credit assignment and unstable optimization across extended action sequences.

Method: Proposes HiPER framework with hierarchical policy factorization: high-level planner proposes subgoals and low-level executor carries them out over multiple steps. Introduces hierarchical advantage estimation (HAE) for coordinated credit assignment at both planning and execution levels.

Result: Achieves state-of-the-art performance: 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks.

Conclusion: Explicit hierarchical decomposition is crucial for scalable RL training of multi-turn LLM agents, enabling more efficient credit assignment and stable optimization in sparse-reward, long-horizon interactive tasks.

Abstract: Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
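
As background for the comparison with flat generalized advantage estimation (GAE), the sketch below implements standard GAE and a helper that aggregates per-step rewards into per-subgoal returns. It is not the authors' hierarchical advantage estimation, whose details are not given in the abstract; feeding the aggregated returns into a second GAE pass for the planner is an assumption, and `subgoal_returns` is a hypothetical helper.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    `values` must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def subgoal_returns(rewards, segment_lengths, gamma=0.99):
    """Discounted reward collected while executing each subgoal."""
    returns, start = [], 0
    for length in segment_lengths:
        segment = rewards[start:start + length]
        returns.append(sum(gamma ** k * r for k, r in enumerate(segment)))
        start += length
    return returns


# Toy trajectory: two subgoals lasting 3 and 2 steps, with sparse rewards.
step_rewards = [0.0, 0.0, 1.0, 0.0, 2.0]
executor_adv = gae(step_rewards, values=[0.0] * 6, gamma=1.0, lam=1.0)
planner_rewards = subgoal_returns(step_rewards, [3, 2], gamma=1.0)
planner_adv = gae(planner_rewards, values=[0.0] * 3, gamma=1.0, lam=1.0)
```

With this split, the planner sees a trajectory of length two instead of five, which is the kind of temporal abstraction the abstract credits with easing credit assignment.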

[260] Muon with Spectral Guidance: Efficient Optimization for Scientific Machine Learning

Binghang Lu, Jiahao Zhang, Guang Lin

Main category: cs.LG

TL;DR: SpecMuon: A spectral-aware optimizer combining Muon’s orthogonalized geometry with mode-wise relaxed scalar auxiliary variable (RSAV) mechanism for improved optimization of physics-informed neural networks and neural operators.

Details

Motivation: Physics-informed neural networks and neural operators suffer from optimization difficulties due to ill-conditioned gradients, multi-scale spectral behavior, and stiffness from physical constraints. Existing optimizers like Muon show promise but lack stability guarantees and can be overly aggressive.

Method: Proposes SpecMuon which integrates Muon’s orthogonalized updates in singular-vector basis with a mode-wise relaxed scalar auxiliary variable (RSAV) mechanism. Decomposes matrix-valued gradients into singular modes and applies RSAV updates individually along dominant spectral directions, adaptively regulating step sizes according to global loss energy while preserving scale-balancing properties.

Result: Numerical experiments on physics-informed neural networks, DeepONets, and fractional PINN-DeepONets demonstrate faster convergence and improved stability compared with Adam, AdamW, and original Muon optimizer on benchmark problems like 1D Burgers equation and fractional PDEs.

Conclusion: SpecMuon provides a principled approach to controlling stiff spectral components in physics-informed learning, with rigorous theoretical guarantees including energy dissipation, boundedness, and convergence properties, offering improved optimization for challenging scientific machine learning problems.

Abstract: Physics-informed neural networks and neural operators often suffer from severe optimization difficulties caused by ill-conditioned gradients, multi-scale spectral behavior, and stiffness induced by physical constraints. Recently, the Muon optimizer has shown promise by performing orthogonalized updates in the singular-vector basis of the gradient, thereby improving geometric conditioning. However, its unit-singular-value updates may lead to overly aggressive steps and lack explicit stability guarantees when applied to physics-informed learning. In this work, we propose SpecMuon, a spectral-aware optimizer that integrates Muon’s orthogonalized geometry with a mode-wise relaxed scalar auxiliary variable (RSAV) mechanism. By decomposing matrix-valued gradients into singular modes and applying RSAV updates individually along dominant spectral directions, SpecMuon adaptively regulates step sizes according to the global loss energy while preserving Muon’s scale-balancing properties. This formulation interprets optimization as a multi-mode gradient flow and enables principled control of stiff spectral components. We establish rigorous theoretical properties of SpecMuon, including a modified energy dissipation law, positivity and boundedness of auxiliary variables, and global convergence with a linear rate under the Polyak-Lojasiewicz condition. Numerical experiments on physics-informed neural networks, DeepONets, and fractional PINN-DeepONets demonstrate that SpecMuon achieves faster convergence and improved stability compared with Adam, AdamW, and the original Muon optimizer on benchmark problems such as the one-dimensional Burgers equation and fractional partial differential equations.
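
The "unit-singular-value updates" mentioned above amount to replacing a gradient matrix G = U S Vᵀ by its orthogonal factor U Vᵀ. The sketch below approximates that factor with the classic cubic Newton-Schulz iteration in pure Python. This is an assumption made for simplicity: it covers only the orthogonalization step, not SpecMuon's RSAV mechanism, and practical Muon implementations may use a different polynomial iteration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(grad, steps=15):
    """Approximate U V^T for grad = U S V^T with cubic Newton-Schulz steps."""
    norm = sum(v * v for row in grad for v in row) ** 0.5
    if norm == 0.0:
        return [list(row) for row in grad]
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # inside the convergence region of the iteration.
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 X - 0.5 X X^T X pushes every singular value toward 1.
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x


update = orthogonalize([[2.0, 1.0], [0.0, 1.0]])
print(matmul(update, transpose(update)))  # close to the identity matrix
```

Because every singular value of the result is close to one, all spectral directions receive the same step length, which is the behaviour the abstract describes as potentially too aggressive for stiff problems.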

[261] Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

Main category: cs.LG

TL;DR: DSL (Discrete Stochastic Localization) improves the step-efficiency of masked diffusion language models by training a single SNR-invariant denoiser across a continuum of corruption levels, surpassing the MDLM+ReMDM baseline with ~4x fewer denoiser evaluations and matching autoregressive quality at high step budgets.

Details

Motivation: Non-autoregressive generation suffers from error accumulation and distribution shift under self-generated drafts. While masked diffusion language models with remasking samplers offer iterative refinement, their step-efficiency needs improvement through better training approaches.

Method: Proposes DSL (Discrete Stochastic Localization) which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer architecture.

Result: On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing MDLM+ReMDM baseline with ~4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Shows improved self-correction and uncertainty calibration.

Conclusion: Training improvements alone can substantially enhance step-efficiency of masked diffusion language models, making remasking more compute-efficient while maintaining quality comparable to autoregressive generation.

Abstract: Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.

[262] Towards Secure and Scalable Energy Theft Detection: A Federated Learning Approach for Resource-Constrained Smart Meters

Diego Labate, Dipanwita Thakur, Giancarlo Fortino

Main category: cs.LG

TL;DR: Privacy-preserving federated learning framework for energy theft detection using lightweight MLP with differential privacy on smart meters

Details

Motivation: Energy theft threatens smart grid stability, but traditional centralized ML approaches raise privacy concerns and face computational constraints on resource-limited smart meters

Method: Federated learning framework with lightweight multilayer perceptron (MLP) model suitable for low-power smart meters, integrated with basic differential privacy by injecting Gaussian noise into local model updates before aggregation

Result: Achieves competitive accuracy, precision, recall, and AUC scores on real-world smart meter dataset under both IID and non-IID data distributions while maintaining privacy and efficiency

Conclusion: Proposed solution is practical and scalable for secure energy theft detection in next-generation smart grid infrastructures, balancing privacy, computational efficiency, and detection performance

Abstract: Energy theft poses a significant threat to the stability and efficiency of smart grids, leading to substantial economic losses and operational challenges. Traditional centralized machine learning approaches for theft detection require aggregating user data, raising serious concerns about privacy and data security. These issues are further exacerbated in smart meter environments, where devices are often resource-constrained and lack the capacity to run heavy models. In this work, we propose a privacy-preserving federated learning framework for energy theft detection that addresses both privacy and computational constraints. Our approach leverages a lightweight multilayer perceptron (MLP) model, suitable for deployment on low-power smart meters, and integrates basic differential privacy (DP) by injecting Gaussian noise into local model updates before aggregation. This ensures formal privacy guarantees without compromising learning performance. We evaluate our framework on a real-world smart meter dataset under both IID and non-IID data distributions. Experimental results demonstrate that our method achieves competitive accuracy, precision, recall, and AUC scores while maintaining privacy and efficiency. This makes the proposed solution practical and scalable for secure energy theft detection in next-generation smart grid infrastructures.
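
The step of "injecting Gaussian noise into local model updates before aggregation" can be sketched as follows. Clipping each update to a maximum L2 norm before adding noise is an assumption (it is the usual way to bound sensitivity for the Gaussian mechanism and is not stated in the abstract), and the names `clip`, `privatize`, and `aggregate` are hypothetical.

```python
import math
import random


def clip(update, max_norm):
    """Scale the update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def privatize(update, max_norm, sigma, rng):
    """Clip a local update, then add Gaussian noise with std sigma * max_norm."""
    return [x + rng.gauss(0.0, sigma * max_norm) for x in clip(update, max_norm)]


def aggregate(updates):
    """Server-side average of the (already noised) client updates."""
    n = len(updates)
    return [sum(column) / n for column in zip(*updates)]


# Three simulated smart meters, each holding a two-parameter model update.
rng = random.Random(0)
local_updates = [[0.3, -0.1], [3.0, 4.0], [-0.2, 0.5]]
noisy = [privatize(u, max_norm=1.0, sigma=0.5, rng=rng) for u in local_updates]
global_update = aggregate(noisy)
```

The noise is added on the device, so the server only ever receives perturbed updates; the privacy level actually achieved depends on `sigma`, the clipping norm, and the number of rounds.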

[263] Deep TPC: Temporal-Prior Conditioning for Time Series Forecasting

Filippos Bellos, NaveenJohn Premkumar, Yannis Avrithis, Nam H. Nguyen, Jason J. Corso

Main category: cs.LG

TL;DR: TPC introduces temporal-prior conditioning to treat time as a first-class modality in LLMs for time series, using cross-attention to temporal embeddings at multiple layers for better temporal reasoning.

Details

Motivation: Current LLM-for-time-series methods treat time shallowly by only injecting positional or prompt-based cues at input, which limits temporal reasoning as this information degrades through layers.

Method: Temporal-Prior Conditioning (TPC) attaches a small set of learnable time series tokens to the patch stream; at selected layers these tokens cross-attend to temporal embeddings derived from human-readable temporal descriptors encoded by the same frozen LLM, then feed temporal context back via self-attention.

Result: TPC consistently outperforms both full fine-tuning and shallow conditioning strategies, achieving state-of-the-art performance in long-term forecasting across diverse datasets.

Conclusion: Treating time as a first-class modality with multi-layer conditioning enables better temporal reasoning in LLMs for time series while maintaining low parameter budget.

Abstract: LLM-for-time series (TS) methods typically treat time shallowly, injecting positional or prompt-based cues once at the input of a largely frozen decoder, which limits temporal reasoning as this information degrades through the layers. We introduce Temporal-Prior Conditioning (TPC), which elevates time to a first-class modality that conditions the model at multiple depths. TPC attaches a small set of learnable time series tokens to the patch stream; at selected layers these tokens cross-attend to temporal embeddings derived from compact, human-readable temporal descriptors encoded by the same frozen LLM, then feed temporal context back via self-attention. This disentangles time series signal and temporal information while maintaining a low parameter budget. We show that by training only the cross-attention modules and explicitly disentangling time series signal and temporal information, TPC consistently outperforms both full fine-tuning and shallow conditioning strategies, achieving state-of-the-art performance in long-term forecasting across diverse datasets. Code available at: https://github.com/fil-mp/Deep_tpc

[264] Rethinking Input Domains in Physics-Informed Neural Networks via Geometric Compactification Mappings

Zhenzhen Huang, Haoyu Bian, Jiaquan Zhang, Yibei Liu, Kuien Liu, Caiyan Qin, Guoqing Wang, Yang Yang, Chaoning Zhang

Main category: cs.LG

TL;DR: GC-PINN introduces geometric compactification mappings to reshape input coordinates for solving multi-scale PDEs, improving training stability and accuracy without modifying PINN architecture.

Details

Motivation: Existing PINN methods struggle with multi-scale PDEs that have both smooth low-frequency components and localized high-frequency structures. Fixed coordinate systems cause geometric misalignment with these structures, leading to gradient stiffness and ill-conditioning that hinder convergence.

Method: Proposes a mapping paradigm that reshapes input coordinates through differentiable geometric compactification mappings, coupling geometric structure of PDEs with spectral properties of residual operators. Introduces three mapping strategies: for periodic boundaries, far-field scale expansion, and localized singular structures, without modifying underlying PINN architecture.

Result: Extensive empirical evaluation shows more uniform residual distributions and higher solution accuracy on representative 1D and 2D PDEs, with improved training stability and convergence speed.

Conclusion: Geometric compactification mappings effectively address gradient stiffness and ill-conditioning in PINNs for multi-scale PDEs, providing a general framework for improving PINN performance without architectural changes.

Abstract: Several complex physical systems are governed by multi-scale partial differential equations (PDEs) that exhibit both smooth low-frequency components and localized high-frequency structures. Existing physics-informed neural network (PINN) methods typically train with fixed coordinate system inputs, where geometric misalignment with these structures induces gradient stiffness and ill-conditioning that hinder convergence. To address this issue, we introduce a mapping paradigm that reshapes the input coordinates through differentiable geometric compactification mappings and couples the geometric structure of PDEs with the spectral properties of residual operators. Based on this paradigm, we propose Geometric Compactification (GC)-PINN, a framework that introduces three mapping strategies for periodic boundaries, far-field scale expansion, and localized singular structures in the input domain without modifying the underlying PINN architecture. Extensive empirical evaluation demonstrates that this approach yields more uniform residual distributions and higher solution accuracy on representative 1D and 2D PDEs, while improving training stability and convergence speed.
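
As a rough illustration of the input-mapping idea, the sketch below reshapes a coordinate before it would reach a network. Both functions are assumptions made for illustration: `periodic_embed` is a common circle embedding for periodic coordinates, and `tanh_compactify` is one simple way to squash an unbounded coordinate. Neither is taken from the GC-PINN paper, whose three mappings are not specified in this abstract.

```python
import math

def periodic_embed(x, period=1.0):
    """Map a scalar coordinate onto the unit circle (hypothetical mapping).

    Points that differ by a whole number of periods get the same features,
    so any network fed these features is periodic by construction.
    """
    angle = 2.0 * math.pi * x / period
    return (math.cos(angle), math.sin(angle))

def tanh_compactify(x, scale=1.0):
    """Squash an unbounded coordinate into [-1, 1] (assumed form)."""
    return math.tanh(x / scale)

a = periodic_embed(0.25)
b = periodic_embed(1.25)            # one full period later
far_field = tanh_compactify(1e6)    # a very distant point stays bounded
```

Because the mapping is applied to the inputs only, the network behind it can stay unchanged, which matches the abstract's claim that the PINN architecture is not modified.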

[265] Training-Free Adaptation of Diffusion Models via Doob’s $h$-Transform

Qijie Zhu, Zeqi Ye, Han Liu, Zhaoran Wang, Minshuo Chen

Main category: cs.LG

TL;DR: DOIT is a training-free adaptation method for diffusion models that uses Doob’s h-transform to steer sampling toward high-reward distributions without modifying pre-trained models or requiring differentiable rewards.

Details

Motivation: Existing adaptation methods for diffusion models often require additional training, have high computational overhead, rely on differentiable reward functions, and lack theoretical guarantees. There's a need for efficient, training-free methods that work with generic non-differentiable rewards.

Method: DOIT uses a measure transport formulation to transport pre-trained generative distributions to high-reward target distributions. It leverages Doob’s h-transform to induce dynamic corrections to the diffusion sampling process, enabling simulation-based computation without modifying the pre-trained model.

Result: The method provides theoretical convergence guarantees to target high-reward distributions. Empirically, on D4RL offline RL benchmarks, DOIT consistently outperforms state-of-the-art baselines while preserving sampling efficiency.

Conclusion: DOIT offers an efficient, training-free approach for adapting diffusion models to diverse applications with non-differentiable rewards, backed by theoretical guarantees and strong empirical performance.

Abstract: Adaptation methods have been a workhorse for unlocking the transformative power of pre-trained diffusion models in diverse applications. Existing approaches often abstract adaptation objectives as a reward function and steer diffusion models to generate high-reward samples. However, these approaches can incur high computational overhead due to additional training, or rely on stringent assumptions on the reward such as differentiability. Moreover, despite their empirical success, theoretical justification and guarantees are seldom established. In this paper, we propose DOIT (Doob-Oriented Inference-time Transformation), a training-free and computationally efficient adaptation method that applies to generic, non-differentiable rewards. The key framework underlying our method is a measure transport formulation that seeks to transport the pre-trained generative distribution to a high-reward target distribution. We leverage Doob’s $h$-transform to realize this transport, which induces a dynamic correction to the diffusion sampling process and enables efficient simulation-based computation without modifying the pre-trained model. Theoretically, we establish a high probability convergence guarantee to the target high-reward distribution via characterizing the approximation error in the dynamic Doob’s correction. Empirically, on D4RL offline RL benchmarks, our method consistently outperforms state-of-the-art baselines while preserving sampling efficiency.

[266] Linked Data Classification using Neurochaos Learning

Pooja Honna, Ayush Patravali, Nithin Nagaraj, Nanjangud C. Narendra

Main category: cs.LG

TL;DR: Neurochaos Learning (NL) extended to knowledge graphs via node aggregation, showing better performance on homophilic graphs than heterophilic ones.

Details

Motivation: NL has shown promise for small sample learning with low compute requirements, but previous work focused on separable/time series data. This paper explores extending NL to linked data/knowledge graphs.

Method: Implemented node aggregation on knowledge graphs to extract features, then fed aggregated node features to ChaosNet (simplest NL architecture). Tested on both homophilic and heterophilic graph datasets.

Result: Demonstrated better efficacy on homophilic graphs than on heterophilic graphs. Performance varied with degree of heterophily in datasets.

Conclusion: Successfully integrated linked data into NL framework, showing potential for graph-structured data. Analysis provided with suggestions for future work to improve performance on heterophilic graphs.

Abstract: Neurochaos Learning (NL) has recently shown promise over traditional deep learning due to two key features: the ability to learn from small training samples, and low compute requirements. In prior work, NL has been implemented and extensively tested on separable and time series data, where it demonstrated superior performance on both classification and regression tasks. In this paper, we investigate the next step in NL, viz., applying NL to linked data, in particular, data that is represented in the form of knowledge graphs. We integrate linked data into NL by implementing node aggregation on knowledge graphs, and then feeding the aggregated node features to the simplest NL architecture: ChaosNet. We demonstrate the results of our implementation on homophilic graph datasets as well as heterophilic graph datasets of varying heterophily. We show better efficacy of our approach on homophilic graphs than on heterophilic graphs. While doing so, we also present our analysis of the results, as well as suggestions for future work.
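
The node-aggregation step can be sketched as follows. The abstract does not say which aggregation is used, so mean aggregation with a self-loop is an assumption here, and `mean_aggregate` is a hypothetical helper. The ChaosNet classifier that would consume these features is omitted.

```python
def mean_aggregate(features, edges):
    """Average each node's feature vector with those of its neighbours.

    features: dict mapping node -> list of floats
    edges: iterable of undirected (u, v) pairs
    """
    neighbours = {node: [node] for node in features}  # include the node itself
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    aggregated = {}
    for node, group in neighbours.items():
        dim = len(features[node])
        aggregated[node] = [
            sum(features[n][d] for n in group) / len(group) for d in range(dim)
        ]
    return aggregated

features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
edges = [("a", "b"), ("b", "c")]
agg = mean_aggregate(features, edges)
```

Averaging over neighbours smooths features across edges, which helps when neighbours tend to share a label. That is one plausible reading of why the reported results are better on homophilic graphs than on heterophilic ones.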

[267] Geometric Neural Operators via Lie Group-Constrained Latent Dynamics

Jiaquan Zhang, Fachrina Dewi Puspitasari, Songbo Zhang, Yibei Liu, Kuien Liu, Caiyan Qin, Fan Mo, Peng Wang, Yang Yang, Chaoning Zhang

Main category: cs.LG

TL;DR: MCL is a manifold-constrained neural operator framework that uses Lie group parameterization to enforce geometric inductive biases, improving stability and accuracy for long-horizon PDE predictions.

Details

Motivation: Existing neural operators suffer from instability in multi-layer iteration and long-horizon rollout due to unconstrained Euclidean latent space updates that violate geometric and conservation laws of physical systems.

Method: Proposes MCL (Manifold Constraining based on Lie group) - a plug-and-play module that constrains manifolds with low-rank Lie algebra parameterization, performing group action updates on latent representations to enforce geometric inductive bias.

Result: Extensive experiments on 1-D Burgers and 2-D Navier-Stokes equations show 30-50% reduction in relative prediction error with only 2.26% parameter increase, demonstrating improved long-term prediction fidelity.

Conclusion: MCL provides a scalable solution for improving neural operator stability by addressing principled geometric constraints absent in standard neural operator updates, enabling better long-horizon predictions for physical systems.

Abstract: Neural operators offer an effective framework for learning solutions of partial differential equations for many physical systems in a resolution-invariant and data-driven manner. Existing neural operators, however, often suffer from instability in multi-layer iteration and long-horizon rollout, which stems from unconstrained Euclidean latent space updates that violate geometric and conservation laws. To address this challenge, we propose to constrain manifolds with a low-rank Lie algebra parameterization that performs group action updates on the latent representation. Our method, termed Manifold Constraining based on Lie group (MCL), acts as an efficient plug-and-play module that imposes a geometric inductive bias on existing neural operators. Extensive experiments on various partial differential equations, such as 1-D Burgers and 2-D Navier-Stokes, over a wide range of parameters and steps demonstrate that our method lowers the relative prediction error by 30-50% at the cost of a 2.26% increase in parameters. The results show that our approach provides a scalable solution for improving long-term prediction fidelity by addressing the principled geometric constraints absent in the neural operator updates.
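
To make the notion of a group action update concrete, here is a minimal sketch using the simplest Lie group, SO(2). It is not the low-rank parameterization from the paper. It only shows that updating a latent vector by a group element (a rotation) preserves its norm over a long rollout, which an additive Euclidean update does not guarantee. The function names are hypothetical.

```python
import math

def rotate(z, theta):
    """Apply exp(theta * J), with J = [[0, -1], [1, 0]], to a 2-D latent vector."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * z[0] - s * z[1], s * z[0] + c * z[1])

def rollout(z, theta, steps):
    """Repeat the same group action update for a number of steps."""
    for _ in range(steps):
        z = rotate(z, theta)
    return z

z0 = (3.0, 4.0)
zT = rollout(z0, theta=0.1, steps=1000)
norm0 = math.hypot(*z0)   # norm before the rollout
normT = math.hypot(*zT)   # norm after 1000 updates
```

Here the conserved quantity is the Euclidean norm. In the setting the abstract describes, the constraint would instead encode geometric and conservation properties of the physical system.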

[268] Graph neural network for colliding particles with an application to sea ice floe modeling

Ruibiao Zhu

Main category: cs.LG

TL;DR: A Graph Neural Network approach for sea ice modeling that captures physical interactions between ice pieces as a graph, combining ML with data assimilation for efficient forecasting.

Details

Motivation: Traditional numerical methods for sea ice modeling are computationally intensive and less scalable. There's a need for more efficient approaches that can handle the complex physical interactions in sea ice dynamics, particularly in marginal ice zones.

Method: Proposes a Collision-captured Network (CN) built on Graph Neural Networks, where nodes represent individual ice pieces and edges model physical interactions, including collisions. The concept is developed within a one-dimensional framework as a foundational step, and the model integrates data assimilation techniques to learn and predict sea ice dynamics.

Result: Validated using synthetic data with and without observed data points. The model accelerates simulation of trajectories without compromising accuracy, offering more efficient forecasting in marginal ice zones.

Conclusion: The approach demonstrates the potential of combining machine learning with data assimilation for more effective and efficient sea ice modeling, providing a scalable alternative to traditional numerical methods.

Abstract: This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), utilizing the natural graph structure of sea ice, where nodes represent individual ice pieces, and edges model the physical interactions, including collisions. This concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By utilizing GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to effectively learn and predict sea ice dynamics under various conditions. The approach was validated using synthetic data, both with and without observed data points, and it was found that the model accelerates the simulation of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.

[269] UCTECG-Net: Uncertainty-aware Convolution Transformer ECG Network for Arrhythmia Detection

Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya

Main category: cs.LG

TL;DR: UCTECG-Net: Uncertainty-aware hybrid CNN-Transformer architecture for ECG classification with integrated uncertainty quantification methods for reliable predictions in safety-critical settings.

Details

Motivation: Deep learning has improved ECG classification but lacks insight into prediction reliability, hindering adoption in safety-critical medical applications where uncertainty quantification is crucial.

Method: Proposes UCTECG-Net, a hybrid architecture combining 1D convolutions and Transformer encoders to process raw ECG signals and spectrograms jointly. Integrates three uncertainty quantification methods: Monte Carlo Dropout, Deep Ensembles, and Ensemble Monte Carlo Dropout.

Result: Achieves 98.58% accuracy on MIT-BIH and 99.14% on PTB datasets, outperforming LSTM, CNN1D, and Transformer baselines. Provides more reliable uncertainty estimates, particularly with Ensemble or EMCD methods, enabling better risk-aware decision support.

Conclusion: UCTECG-Net offers a robust uncertainty-aware framework for ECG classification that enhances prediction reliability and provides stronger basis for safety-critical medical applications.

Abstract: Deep learning has improved automated electrocardiogram (ECG) classification, but limited insight into prediction reliability hinders its use in safety-critical settings. This paper proposes UCTECG-Net, an uncertainty-aware hybrid architecture that combines one-dimensional convolutions and Transformer encoders to process raw ECG signals and their spectrograms jointly. Evaluated on the MIT-BIH Arrhythmia and PTB Diagnostic datasets, UCTECG-Net outperforms LSTM, CNN1D, and Transformer baselines in terms of accuracy, precision, recall and F1 score, achieving up to 98.58% accuracy on MIT-BIH and 99.14% on PTB. To assess predictive reliability, we integrate three uncertainty quantification methods (Monte Carlo Dropout, Deep Ensembles, and Ensemble Monte Carlo Dropout) into all models and analyze their behavior using an uncertainty-aware confusion matrix and derived metrics. The results show that UCTECG-Net, particularly with Ensemble or EMCD, provides more reliable and better-aligned uncertainty estimates than competing architectures, offering a stronger basis for risk-aware ECG decision support.
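
Monte Carlo Dropout, one of the three uncertainty methods named above, can be sketched on a toy model: dropout stays active at inference, several stochastic forward passes are averaged, and the entropy of the mean prediction serves as an uncertainty score. The two-class linear classifier, its weights, and all function names below are illustrative assumptions. This is not UCTECG-Net.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(x, weights, drop_p, rng):
    """One forward pass of a toy linear classifier with dropout kept active."""
    kept = [xi / (1.0 - drop_p) if rng.random() >= drop_p else 0.0 for xi in x]
    logits = [sum(w * k for w, k in zip(row, kept)) for row in weights]
    return softmax(logits)

def mc_dropout_predict(x, weights, drop_p=0.5, passes=200, seed=0):
    """Average class probabilities over several stochastic passes."""
    rng = random.Random(seed)
    runs = [stochastic_forward(x, weights, drop_p, rng) for _ in range(passes)]
    n_classes = len(weights)
    mean = [sum(r[c] for r in runs) / passes for c in range(n_classes)]
    # Predictive entropy of the averaged distribution (in nats).
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy

weights = [[2.0, -1.0, 0.5], [-1.0, 2.0, 0.5]]  # toy 2-class, 3-feature model
mean_probs, entropy = mc_dropout_predict([1.0, 0.2, 0.3], weights)
```

A Deep Ensemble would replace the repeated dropout passes with passes through independently trained models. Combining the two gives the Ensemble Monte Carlo Dropout variant mentioned in the abstract.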

[270] Multi-Class Boundary Extraction from Implicit Representations

Jash Vira, Andrew Myers, Simon Ratcliffe

Main category: cs.LG

TL;DR: A 2D boundary extraction algorithm for multi-class implicit neural representations that guarantees topological consistency and water-tightness, with applications in geological modelling.

Details

Motivation: Existing surface extraction methods from implicit neural representations only handle single-class surfaces and lack guarantees for topological correctness and hole-free results in multi-class scenarios.

Method: Introduces a 2D boundary extraction algorithm specifically designed for multi-class implicit representations, focusing on topological consistency and water-tightness, with ability to set minimum detail constraints on approximations.

Result: The algorithm is evaluated using geological modelling data, demonstrating its adaptiveness and ability to honor complex topological structures.

Conclusion: This work establishes foundational methods for topologically consistent, water-tight boundary extraction from multi-class implicit neural representations, particularly valuable for complex domains like geological modelling.

Abstract: Surface extraction from implicit neural representations modelling a single-class surface is a well-known task. However, there exist no surface extraction methods for an implicit representation of multiple classes that guarantee topological correctness and hole-free output. In this work, we lay the groundwork by introducing a 2D boundary extraction algorithm for the multi-class case, focusing on topological consistency and water-tightness, which also allows a minimum-detail constraint to be set on the approximation. Finally, we evaluate our algorithm using geological modelling data, showcasing its adaptiveness and ability to honour complex topology.

[271] Bayesian Quadrature: Gaussian Processes for Integration

Maren Mahsereci, Toni Karvonen

Main category: cs.LG

TL;DR: A comprehensive survey of Bayesian quadrature, covering mathematical foundations, taxonomy, theoretical guarantees, numerical studies, practical challenges, and extensive bibliography across multiple disciplines.

Details

Motivation: Bayesian quadrature is a probabilistic approach to numerical integration that has been used since the 1980s but lacks a systematic and comprehensive treatment. The authors aim to fill this gap by providing a thorough review of the field.

Method: The survey reviews mathematical foundations from different perspectives, creates a systematic taxonomy classifying methods along three axes (modelling, inference, sampling), collects theoretical guarantees, conducts controlled numerical studies, and provides practical assessments.

Result: The paper provides a comprehensive framework for understanding Bayesian quadrature, including its taxonomy, theoretical properties, and practical considerations. The numerical study illustrates how different methodological choices affect performance.

Conclusion: This survey fills a significant gap in the literature by providing the first systematic treatment of Bayesian quadrature, offering researchers a comprehensive reference that covers mathematical foundations, practical applications, and limitations across multiple disciplines.

Abstract: Bayesian quadrature is a probabilistic, model-based approach to numerical integration, the estimation of intractable integrals, or expectations. Although Bayesian quadrature was popularised already in the 1980s, no systematic and comprehensive treatment has been published. The purpose of this survey is to fill this gap. We review the mathematical foundations of Bayesian quadrature from different points of view; present a systematic taxonomy for classifying different Bayesian quadrature methods along the three axes of modelling, inference, and sampling; collect general theoretical guarantees; and provide a controlled numerical study that explores and illustrates the effect of different choices along the axes of the taxonomy. We also provide a realistic assessment of practical challenges and limitations to application of Bayesian quadrature methods and include an up-to-date and nearly exhaustive bibliography that covers not only machine learning and statistics literature but all areas of mathematics and engineering in which Bayesian quadrature or equivalent methods have seen use.
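
A minimal sketch of the basic Bayesian quadrature estimator may help fix ideas. It assumes a zero-mean Gaussian process prior with an RBF kernel and the uniform measure on [0, 1], for which the kernel mean has a closed form. The length scale, jitter, node placement, and the small linear solver are all illustrative choices, not recommendations from the survey.

```python
import math

def rbf(a, b, ell):
    return math.exp(-((a - b) ** 2) / (2.0 * ell ** 2))

def kernel_mean(x, ell):
    """Integral of rbf(t, x) over t in [0, 1], in closed form via erf."""
    c = ell * math.sqrt(math.pi / 2.0)
    s = math.sqrt(2.0) * ell
    return c * (math.erf((1.0 - x) / s) - math.erf((0.0 - x) / s))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def bayesian_quadrature(f, nodes, ell=0.25, jitter=1e-10):
    """Posterior mean of the integral: z^T K^{-1} y."""
    K = [[rbf(a, b, ell) + (jitter if i == j else 0.0)
          for j, b in enumerate(nodes)] for i, a in enumerate(nodes)]
    z = [kernel_mean(x, ell) for x in nodes]
    weights = solve(K, z)  # quadrature weights w = K^{-1} z
    return sum(w * f(x) for w, x in zip(weights, nodes))

nodes = [i / 6.0 for i in range(7)]
estimate = bayesian_quadrature(lambda x: x * x, nodes)  # true value is 1/3
```

In terms of the survey's taxonomy, the kernel and prior belong to the modelling axis, the closed-form posterior mean to the inference axis, and the fixed grid of nodes to the sampling axis.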

[272] SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting

Xu Zhang, Qitong Wang, Peng Wang, Wei Wang

Main category: cs.LG

TL;DR: SEMixer is a lightweight multiscale model for long-term time series forecasting that uses Random Attention Mechanism and Multiscale Progressive Mixing Chain to better capture multi-scale temporal dependencies while addressing redundancy, noise, and semantic gaps between scales.

Details

Motivation: The paper addresses challenges in long-term time series forecasting, particularly the difficulty of efficiently aligning and integrating multi-scale temporal dependencies due to redundancy, noise in time series data, and semantic gaps between non-adjacent scales.

Method: SEMixer introduces two key components: 1) Random Attention Mechanism (RAM) that captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, and 2) Multiscale Progressive Mixing Chain (MPMC) that stacks RAM and MLP-Mixer in a memory-efficient manner for more effective temporal mixing.

Result: SEMixer was validated on 10 public datasets and achieved third place in the 2025 CCF AIOps Challenge, which is based on 21GB of real wireless network data, demonstrating its effectiveness in multiscale modeling and forecasting.

Conclusion: SEMixer provides an effective solution for long-term time series forecasting by addressing multi-scale dependency modeling challenges through lightweight architecture design with RAM and MPMC components.

Abstract: Modeling multiscale patterns is crucial for long-term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non-adjacent scales, make the efficient alignment and integration of multi-scale temporal dependencies challenging. To address this, we propose SEMixer, a lightweight multiscale model designed for long-term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch-level semantics and enabling MLP-Mixer to better model multi-scale dependencies. MPMC further stacks RAM and MLP-Mixer in a memory-efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We validate the effectiveness of SEMixer not only on 10 public datasets but also on the 2025 CCF AIOps Challenge, which is based on 21GB of real wireless network data and in which SEMixer achieves third place. The code is available at https://github.com/Meteor-Stars/SEMixer.

[273] Amortized Predictability-aware Training Framework for Time Series Forecasting and Classification

Xu Zhang, Peng Wang, Yichen Li, Wei Wang

Main category: cs.LG

TL;DR: APTF is a training framework for time series tasks that identifies and penalizes low-predictability samples using hierarchical loss and amortization to improve model performance.

Details

Motivation: Time series data often contain noisy, low-predictability patterns that deviate from normal distributions, causing training instability and poor convergence. Existing deep learning models don't adequately address how to identify and penalize these problematic samples during training.

Method: Proposes APTF with two key components: 1) Hierarchical Predictability-aware Loss (HPL) that dynamically identifies low-predictability samples and progressively increases their loss penalty during training, and 2) an amortization model that mitigates predictability estimation errors caused by model bias.

Result: The framework improves performance for both time series forecasting and classification tasks by enabling models to focus on high-predictability samples while still learning appropriately from low-predictability ones.

Conclusion: APTF provides a general training framework that addresses the challenge of low-predictability samples in time series analysis, improving model robustness and performance across different time series tasks.

Abstract: Time series data are prone to noise in various domains, and training samples may contain low-predictability patterns that deviate from the normal data distribution, leading to training instability or convergence to poor local minima. Therefore, mitigating the adverse effects of low-predictability samples is crucial for time series analysis tasks such as time series forecasting (TSF) and time series classification (TSC). While many deep learning models have achieved promising performance, few consider how to identify and penalize low-predictability samples to improve model performance from the training perspective. To fill this gap, we propose a general Amortized Predictability-aware Training Framework (APTF) for both TSF and TSC. APTF introduces two key designs that enable the model to focus on high-predictability samples while still learning appropriately from low-predictability ones: (i) a Hierarchical Predictability-aware Loss (HPL) that dynamically identifies low-predictability samples and progressively expands their loss penalty as training evolves, and (ii) an amortization model that mitigates predictability estimation errors caused by model bias, further enhancing HPL’s effectiveness. The code is available at https://github.com/Meteor-Stars/APTF.

[274] Factored Latent Action World Models

Zizhao Wang, Chang Shi, Jiaheng Hu, Kevin Rohling, Roberto Martín-Martín, Amy Zhang, Peter Stone

Main category: cs.LG

TL;DR: FLAM introduces a factored dynamics framework that decomposes scenes into independent factors, each with its own latent action, improving video generation and control in complex multi-entity environments compared to monolithic models.

Details

Motivation: Existing latent action models use monolithic inverse/forward dynamics that learn a single latent action to control entire scenes, struggling in complex environments where multiple entities act simultaneously.

Method: FLAM decomposes scenes into independent factors, each inferring its own latent action and predicting its own next-step factor value through a factorized dynamics framework.

Result: FLAM outperforms prior work in prediction accuracy and representation quality on simulation and real-world multi-entity datasets, and facilitates downstream policy learning.

Conclusion: Factorized latent action models provide benefits for modeling complex multi-entity dynamics and improving video generation quality in action-free video settings.

Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. Latent actions provide a natural interface for users to iteratively generate and manipulate videos. However, most existing approaches rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and therefore struggle in complex environments where multiple entities act simultaneously. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors, each inferring its own latent action and predicting its own next-step factor value. This factorized structure enables more accurate modeling of complex multi-entity dynamics and improves video generation quality in action-free video settings compared to monolithic models. Based on experiments on both simulation and real-world multi-entity datasets, we find that FLAM outperforms prior work in prediction accuracy and representation quality, and facilitates downstream policy learning, demonstrating the benefits of factorized latent action models.

[275] Online Prediction of Stochastic Sequences with High Probability Regret Bounds

Matthias Frey, Jonathan H. Manton, Jingge Zhu

Main category: cs.LG

TL;DR: High-probability vanishing regret bounds for universal prediction of stochastic sequences with finite horizon T, complementing existing expectation bounds.

Details

Motivation: The paper investigates whether it's possible to derive vanishing regret bounds that hold with high probability (rather than just in expectation) for universal prediction of stochastic sequences with known finite time horizon T.

Method: The authors propose high-probability bounds for universal prediction of stochastic processes over countable alphabets, deriving convergence rates with probability guarantees, and also provide an impossibility result showing limitations on improving these bounds.

Result: For universal prediction over countable alphabets, they achieve a convergence rate of O(T^{-1/2} δ^{-1/2}) with probability at least 1-δ, compared to prior in-expectation bounds of O(T^{-1/2}). They also prove an impossibility result showing the exponent of δ cannot be improved without additional assumptions.

Conclusion: The paper successfully establishes high-probability vanishing regret bounds for universal prediction, complementing existing expectation bounds, while also identifying fundamental limitations on such bounds through an impossibility result.

Abstract: We revisit the classical problem of universal prediction of stochastic sequences with a finite time horizon $T$ known to the learner. The question we investigate is whether it is possible to derive vanishing regret bounds that hold with high probability, complementing existing bounds from the literature that hold in expectation. We propose such high-probability bounds, which have a form very similar to that of the prior expectation bounds. For the case of universal prediction of a stochastic process over a countable alphabet, our bound states a convergence rate of $\mathcal{O}(T^{-1/2} \delta^{-1/2})$ with probability at least $1-\delta$, compared to prior known in-expectation bounds of the order $\mathcal{O}(T^{-1/2})$. We also propose an impossibility result which proves that it is not possible to improve the exponent of $\delta$ in a bound of the same form without making additional assumptions.
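
The stated rate can be turned into a sufficient horizon for a target accuracy. The short derivation below is only a rearrangement of the bound as quoted in the abstract. The constant $C$ and the target accuracy $\varepsilon$ are hypothetical symbols introduced for illustration, with $C$ standing in for the constant hidden by the $\mathcal{O}(\cdot)$ notation.

```latex
% Assumed form of the bound: with probability at least 1 - \delta,
%   regret <= C T^{-1/2} \delta^{-1/2},  for some constant C > 0.
\[
  C\,T^{-1/2}\,\delta^{-1/2} \le \varepsilon
  \quad\Longleftrightarrow\quad
  T \ge \frac{C^{2}}{\varepsilon^{2}\,\delta}.
\]
% Hence any horizon T >= C^2 / (\varepsilon^2 \delta) guarantees
% regret <= \varepsilon with probability at least 1 - \delta.
```

Under this reading, tightening the confidence level from $\delta = 0.1$ to $\delta = 0.01$ at fixed $\varepsilon$ multiplies the sufficient horizon by 10.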

[276] Prediction of Major Solar Flares Using Interpretable Class-dependent Reward Framework with Active Region Magnetograms and Domain Knowledge

Zixian Wu, Xuebao Li, Yanfang Zheng, Rui Wang, Shunhuang Zhang, Jinfang Wei, Yongshang Lv, Liang Dong, Zamri Zainal Abidin, Noraisyah Mohamed Shah, Hongwei Ye, Pengchao Yan, Xuefeng Li, Xiaojia Ji, Xusheng Huang, Xiaotian Wang, Honglei Jin

Main category: cs.LG

TL;DR: A supervised classification framework with class-dependent rewards (CDR) for predicting solar flares within 24 hours, using both knowledge-informed features and magnetogram data with various deep learning architectures.

Details

Motivation: To develop an improved solar flare prediction system that can accurately forecast ≥M-class flares within 24 hours, addressing limitations in existing methods by incorporating class-dependent rewards and comparing different deep learning architectures.

Method: Developed CDR framework with three deep learning models (CNN, CNN-BiLSTM, Transformer) and their CDR counterparts. Used multiple datasets including knowledge-informed features and line-of-sight magnetograms. Conducted comparative analysis of magnetic field parameters, model architectures, and reward engineering. Applied SHAP for interpretability and compared with NASA/CCMC.

Result: CDR-Transformer achieved best performance overall. Transformer performed better with combined LOS and vector magnetic field data. Knowledge-informed features outperformed magnetograms. CDR-Transformer showed superior predictive capabilities compared to NASA/CCMC under identical conditions.

Conclusion: The CDR-Transformer framework represents an effective approach for solar flare prediction, demonstrating improved performance over existing methods while providing interpretable insights through SHAP analysis.

Abstract: In this work, we develop, for the first time, a supervised classification framework with class-dependent rewards (CDR) to predict $\geq$M flares within 24 hr. We construct multiple datasets, covering knowledge-informed features and line-of-sight (LOS) magnetograms. We also apply three deep learning models (CNN, CNN-BiLSTM, and Transformer) and three CDR counterparts (CDR-CNN, CDR-CNN-BiLSTM, and CDR-Transformer). First, we analyze the importance of LOS magnetic field parameters with the Transformer, then compare its performance using LOS-only, vector-only, and combined magnetic field parameters. Second, we compare flare prediction performance based on CDR models versus deep learning counterparts. Third, we perform sensitivity analysis on reward engineering for CDR models. Fourth, we use the SHAP method for model interpretability. Finally, we conduct a performance comparison between our models and NASA/CCMC. The main findings are: (1) Among LOS feature combinations, R_VALUE and AREA_ACR consistently yield the best results. (2) Transformer achieves better performance with combined LOS and vector magnetic field data than with either alone. (3) Models using knowledge-informed features outperform those using magnetograms. (4) While CNN and CNN-BiLSTM outperform their CDR counterparts on magnetograms, CDR-Transformer is slightly superior to its deep learning counterpart when using knowledge-informed features. Among all models, CDR-Transformer achieves the best performance. (5) The predictive performance of the CDR models is not overly sensitive to the reward choices. (6) Through SHAP analysis, the CDR model tends to regard TOTUSJH as more important, while the Transformer tends to prioritize R_VALUE more. (7) Under identical prediction time and active region (AR) number, the CDR-Transformer shows superior predictive capabilities compared to NASA/CCMC.

[277] Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

Main category: cs.LG

TL;DR: First high-probability regret bound for classical online Q-learning in infinite-horizon discounted MDPs without optimism/bonus terms, analyzing Boltzmann Q-learning and proposing a Smoothed ε-Greedy scheme with gap-robust regret bound.

Details

Motivation: To establish theoretical guarantees for classical online Q-learning algorithms in reinforcement learning, particularly addressing the gap-dependent limitations of existing methods and providing high-probability regret bounds without relying on optimism or bonus terms.

Method: Analyzes Boltzmann Q-learning with decaying temperature, then proposes Smoothed ε-Greedy exploration combining ε-greedy and Boltzmann exploration. Develops a novel high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics.

Result: Shows Boltzmann Q-learning’s regret depends critically on MDP suboptimality gap (sublinear for large gaps, linear for small gaps). Proves gap-robust regret bound of near-Õ(N^{9/10}) for Smoothed ε-Greedy scheme. Provides general concentration bound with contraction factor governed by mixing time.

Conclusion: The paper provides the first high-probability regret bounds for classical online Q-learning without optimism, identifies gap-dependent limitations of Boltzmann Q-learning, proposes an improved Smoothed ε-Greedy algorithm with gap-robust guarantees, and develops novel concentration analysis tools for stochastic approximation.

Abstract: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.
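
To make the exploration schemes above concrete, here is a minimal tabular sketch. The mixture form of `smoothed_eps_greedy_probs` (uniform with probability eps, Boltzmann otherwise), the function names, and the numeric values are assumptions for illustration; the abstract does not spell out the paper's exact scheme or schedules.

```python
import numpy as np

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; lower temperature means greedier behaviour."""
    z = (q_values - q_values.max()) / temperature  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def smoothed_eps_greedy_probs(q_values, eps, temperature):
    """Hypothetical mixture: uniform with probability eps, Boltzmann otherwise."""
    n_actions = len(q_values)
    return eps / n_actions + (1.0 - eps) * boltzmann_probs(q_values, temperature)

def q_update(Q, s, a, r, s_next, step_size, gamma):
    """Classical Q-learning update, with no optimism or bonus term."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += step_size * (target - Q[s, a])
    return Q

q = np.array([1.0, 0.9, 0.0])
probs = smoothed_eps_greedy_probs(q, eps=0.1, temperature=0.05)
```

In this sketch every action keeps probability at least eps / n_actions regardless of how small the gaps between Q-values are, which is one intuition for why mixing in uniform exploration can make behaviour less sensitive to the suboptimality gap.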

[278] Fast KV Compaction via Attention Matching

Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim

Main category: cs.LG

TL;DR: Fast context compaction for long-context LLMs using attention matching to create compact KV caches in latent space, achieving 50x compression with minimal quality loss.

Details

Motivation: Scaling language models to long contexts is bottlenecked by KV cache size. Existing summarization methods are lossy, while previous latent space methods require slow end-to-end optimization. Need fast, high-quality compaction.

Method: Attention Matching approach constructs compact keys and values to reproduce attention outputs and preserve attention mass at per-KV-head level. Formulation decomposes into simple subproblems with efficient closed-form solutions.

Result: Achieves up to 50x compaction in seconds on some datasets with little quality loss, significantly improving Pareto frontier of compaction time versus quality.

Conclusion: Attention Matching enables fast, high-quality context compaction in latent space, overcoming limitations of both token-space summarization and slow end-to-end optimization methods.

Abstract: Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
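
The idea of fitting compact values in closed form can be sketched for a single head as follows. This is an illustrative simplification, not the paper's method: the key-selection heuristic (keep the keys with the most attention mass), the use of a fixed set of reference queries `Q`, and all names are assumptions, and the sketch matches attention outputs only, without any correction for preserved attention mass.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def compact_kv(Q, K, V, m):
    """Keep m keys with the most attention mass, then refit values by least squares."""
    d = K.shape[1]
    A_full = softmax(Q @ K.T / np.sqrt(d))       # (n_queries, n_tokens)
    target = A_full @ V                           # attention outputs to reproduce
    keep = np.argsort(A_full.sum(axis=0))[-m:]    # hypothetical selection heuristic
    K_c = K[keep]
    A_c = softmax(Q @ K_c.T / np.sqrt(d))         # (n_queries, m)
    V_c, *_ = np.linalg.lstsq(A_c, target, rcond=None)  # closed-form value fit
    return K_c, V_c, keep, target

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 16))
K = rng.normal(size=(256, 16))
V = rng.normal(size=(256, 16))
K_c, V_c, keep, target = compact_kv(Q, K, V, m=32)

A_c = softmax(Q @ K_c.T / np.sqrt(16))
err_fit = np.linalg.norm(A_c @ V_c - target)         # refitted values
err_subset = np.linalg.norm(A_c @ V[keep] - target)  # plain token eviction
```

Because the least-squares fit minimizes the output error over all possible compact values, `err_fit` can never exceed `err_subset`, the error obtained by simply keeping the original values of the selected tokens.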

[279] A Graph Meta-Network for Learning on Kolmogorov-Arnold Networks

Guy Bar-Shalom, Ami Tavory, Itay Evron, Maya Bechler-Speicher, Ido Guy, Haggai Maron

Main category: cs.LG

TL;DR: WS-KAN is the first weight-space architecture designed specifically for Kolmogorov-Arnold Networks (KANs) that accounts for their permutation symmetries, outperforming structure-agnostic baselines across diverse tasks.

Details

Motivation: While weight-space models exist for standard neural networks, there are no tailored architectures for KANs that account for their specific symmetries. Prior work leveraged permutation symmetries in MLPs, but no analogous analysis or architecture exists for KANs despite their growing importance.

Method: The authors show KANs share the same permutation symmetries as MLPs, propose KAN-graph (a graph representation of KAN computation), and develop WS-KAN - the first weight-space architecture for KANs that naturally accounts for their symmetry. They analyze WS-KAN’s expressive power and construct a comprehensive “zoo” of trained KANs for benchmarking.

Result: WS-KAN consistently outperforms structure-agnostic baselines across all tasks, often by substantial margins. The method can replicate an input KAN’s forward pass, demonstrating strong expressive power for weight-space learning on KANs.

Conclusion: The work provides the first weight-space architecture specifically designed for KANs, successfully accounting for their permutation symmetries and achieving superior performance compared to naive approaches, advancing weight-space learning for this emerging network architecture.

Abstract: Weight-space models learn directly from the parameters of neural networks, enabling tasks such as predicting their accuracy on new datasets. Naive methods – like applying MLPs to flattened parameters – perform poorly, making the design of better weight-space architectures a central challenge. While prior work leveraged permutation symmetries in standard networks to guide such designs, no analogous analysis or tailored architecture yet exists for Kolmogorov-Arnold Networks (KANs). In this work, we show that KANs share the same permutation symmetries as MLPs, and propose the KAN-graph, a graph representation of their computation. Building on this, we develop WS-KAN, the first weight-space architecture that learns on KANs, which naturally accounts for their symmetry. We analyze WS-KAN’s expressive power, showing it can replicate an input KAN’s forward pass, a standard approach for assessing expressiveness in weight-space architectures. We construct a comprehensive “zoo” of trained KANs spanning diverse tasks, which we use as benchmarks to empirically evaluate WS-KAN. Across all tasks, WS-KAN consistently outperforms structure-agnostic baselines, often by a substantial margin. Our code is available at https://github.com/BarSGuy/KAN-Graph-Metanetwork.
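
The permutation symmetry mentioned above can be checked numerically on a toy model. The sketch below assumes a simplified KAN layer whose edge functions are linear combinations of a fixed sine basis (real KANs typically use B-splines); the names `kan_layer` and `kan_forward` are hypothetical.

```python
import numpy as np

def kan_layer(x, coef):
    """coef[j, i, k]: weight of basis function k on the edge from input i to output j."""
    freqs = np.arange(1, coef.shape[2] + 1)
    basis = np.sin(x[:, None] * freqs[None, :])  # (n_in, n_basis); toy basis, an assumption
    return np.einsum("jik,ik->j", coef, basis)    # each output sums its univariate edge functions

def kan_forward(x, c1, c2):
    return kan_layer(kan_layer(x, c1), c2)

rng = np.random.default_rng(0)
c1 = rng.normal(size=(5, 3, 4))   # 3 inputs -> 5 hidden units
c2 = rng.normal(size=(2, 5, 4))   # 5 hidden units -> 2 outputs
perm = rng.permutation(5)
x = rng.normal(size=3)

y = kan_forward(x, c1, c2)
# Relabel hidden units: permute outgoing edges of layer 1 and incoming edges of layer 2.
y_perm = kan_forward(x, c1[perm], c2[:, perm])
```

The two parameter sets differ as flattened vectors but compute the same function, which is why a model that reads raw flattened parameters has to learn this equivalence from data while a symmetry-aware architecture gets it by construction.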

[280] Guide-Guard: Off-Target Predicting in CRISPR Applications

Joseph Bingham, Netanel Arussy, Saman Zonouz

Main category: cs.LG

TL;DR: Machine learning approach (Guide-Guard) for predicting CRISPR off-target behavior with 84% accuracy, trained on multiple genes simultaneously.

Details

Motivation: CRISPR gene-editing technologies enable genetic research but raise concerns about predicting off-target effects, requiring better computational tools for safety and efficacy.

Method: Data-driven exploration of biological/chemical models combined with machine learning solution called Guide-Guard that predicts CRISPR system behavior given gRNA sequences.

Result: Achieved 84% accuracy in predicting CRISPR off-target behavior, with the system capable of training on multiple genes simultaneously while maintaining accuracy.

Conclusion: Guide-Guard provides an effective machine learning solution for predicting CRISPR off-target effects, addressing safety concerns in gene-editing applications.

Abstract: With the introduction of cyber-physical genome sequencing and editing technologies, such as CRISPR, researchers can more easily access tools to investigate and create remedies for a variety of topics in genetics and health science (e.g. agriculture and medicine). As the field advances and grows, new concerns present themselves in the ability to predict the off-target behavior. In this work, we explore the underlying biological and chemical model from a data-driven perspective. Additionally, we present a machine-learning-based solution named Guide-Guard to predict the behavior of the system given a gRNA in the CRISPR gene-editing process with 84% accuracy. This solution can be trained on multiple different genes at the same time while retaining accuracy.

[281] HAWX: A Hardware-Aware FrameWork for Fast and Scalable ApproXimation of DNNs

Samira Nazari, Mohammad Saeed Almasi, Mahdi Taheri, Ali Azarpeyvand, Ali Mokhtari, Ali Mahani, Christian Herglotz

Main category: cs.LG

TL;DR: HAWX is a hardware-aware framework for exploring approximate computing (AxC) blocks in DNNs using multi-level sensitivity analysis to accelerate configuration search while maintaining accuracy.

Details

Motivation: The motivation is to address the computational complexity of exploring approximate computing configurations in DNNs, which grows exponentially with network size, by developing an efficient hardware-aware search framework.

Method: HAWX employs multi-level sensitivity scoring at operator, filter, layer, and model abstraction levels to guide selective integration of heterogeneous AxC blocks. It uses predictive models for accuracy, power, and area to accelerate evaluation of candidate configurations.

Result: Achieves over 23× speedup in layer-level search and a more than 3-million-fold speedup in filter-level search for LeNet-5 while maintaining accuracy comparable to exhaustive search. Efficiency benefits scale exponentially with network size across benchmarks like VGG-11, ResNet-18, and EfficientNetLite.

Conclusion: HAWX provides an efficient hardware-aware exploration framework for approximate computing in DNNs that significantly accelerates configuration search while preserving accuracy, supporting both spatial and temporal accelerator architectures.

Abstract: This work presents HAWX, a hardware-aware scalable exploration framework that employs multi-level sensitivity scoring at different DNN abstraction levels (operator, filter, layer, and model) to guide selective integration of heterogeneous AxC blocks. Supported by predictive models for accuracy, power, and area, HAWX accelerates the evaluation of candidate configurations, achieving over $23\times$ speedup in a layer-level search with two candidate approximate blocks and more than $3\times10^{6}\times$ speedup in the filter-level search, for LeNet-5 alone, while maintaining accuracy comparable to exhaustive search. Experiments across state-of-the-art DNN benchmarks such as VGG-11, ResNet-18, and EfficientNetLite demonstrate that the efficiency benefits of HAWX scale exponentially with network size. The HAWX hardware-aware search algorithm supports both spatial and temporal accelerator architectures, leveraging either off-the-shelf approximate components or customized designs.

[282] The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Eitan Gronich, Gal Vardi

Main category: cs.LG

TL;DR: The paper studies implicit bias of momentum-based optimizers (Muon, MomentumGD, Signum, Adam) in homogeneous models, showing they converge to KKT points of margin maximization problems with different norm constraints.

Details

Motivation: To understand the implicit regularization properties of momentum-based optimizers in homogeneous models, extending previous work on steepest descent to momentum variants and connecting them to margin maximization problems.

Method: Extends existing steepest descent analysis to normalized steepest descent with learning rate schedules, then shows momentum algorithms (Muon, MomentumGD, Signum, Adam) are approximate steepest descent trajectories under decaying learning rates, proving bias toward KKT points of margin maximization.

Result: Proves momentum-based optimizers have implicit bias toward KKT points of margin maximization problems with different norm constraints (spectral, ℓ₂, ℓ∞ norms), with experimental validation showing optimizer choice determines which margin is maximized.

Conclusion: Momentum-based optimizers in homogeneous models exhibit implicit bias toward solutions of margin maximization problems, with specific norm constraints determined by the optimizer choice, extending previous steepest descent results.

Abstract: We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
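
The correspondence between optimizers and norms in the abstract can be illustrated by the normalized steepest-descent direction under each norm. The sketch below omits momentum and learning-rate schedules entirely; the function name `steepest_direction` is hypothetical.

```python
import numpy as np

def steepest_direction(grad, norm):
    """Unit-norm direction d maximizing <grad, d> under the chosen norm."""
    if norm == "l2":          # normalized gradient descent (Frobenius norm for matrices)
        return grad / np.linalg.norm(grad)
    if norm == "linf":        # sign descent, the update shape used by Signum
        return np.sign(grad)
    if norm == "spectral":    # orthogonalized update, the update shape used by Muon
        U, _, Vt = np.linalg.svd(grad, full_matrices=False)
        return U @ Vt
    raise ValueError(f"unknown norm: {norm}")

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
d_l2 = steepest_direction(G, "l2")
d_inf = steepest_direction(G, "linf")
d_spec = steepest_direction(G, "spectral")
```

Each direction attains the dual norm of the gradient as its inner product with `G`: the Frobenius norm for `l2`, the entrywise ℓ₁ norm for `linf`, and the nuclear norm for `spectral`.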

[283] Explainability for Fault Detection System in Chemical Processes

Georgios Gravanis, Dimitrios Kyriakou, Spyros Voutetakis, Simira Papadopoulou, Konstantinos Diamantaras

Main category: cs.LG

TL;DR: Comparison of two XAI methods (Integrated Gradients and SHAP) for explaining fault diagnosis decisions of an LSTM classifier in a chemical process, showing how they can identify fault locations and comparing their effectiveness.

Details

Motivation: To apply and compare state-of-the-art XAI methods for explaining fault diagnosis decisions in complex industrial processes, helping identify where faults occur in the system and understanding model decisions.

Method: Applied Integrated Gradients (IG) and SHapley Additive exPlanations (SHAP) methods to explain decisions of a highly accurate LSTM classifier trained on the Tennessee Eastman Process (TEP) benchmark dataset for fault detection.

Result: Both XAI methods identified important features for fault diagnosis decisions, with SHAP sometimes providing more informative explanations closer to the root cause of faults. The methods helped identify the subsystem where faults occurred.

Conclusion: XAI methods can effectively explain fault diagnosis decisions and identify fault locations in complex processes. The model-agnostic approach makes it applicable to similar problems beyond the specific chemical process studied.

Abstract: In this work, we apply and compare two state-of-the-art eXplainable Artificial Intelligence (XAI) methods, the Integrated Gradients (IG) and the SHapley Additive exPlanations (SHAP), that explain the fault diagnosis decisions of a highly accurate Long Short-Term Memory (LSTM) classifier. The classifier is trained to detect faults in a benchmark non-linear chemical process, the Tennessee Eastman Process (TEP). It is highlighted how XAI methods can help identify the subsystem of the process where the fault occurred. Using our knowledge of the process, we note that in most cases the same features are indicated as the most important for the decision, while in some cases the SHAP method seems to be more informative and closer to the root cause of the fault. Finally, since the XAI methods used are model-agnostic, the proposed approach is not limited to the specific process and can also be used in similar problems.
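
For readers unfamiliar with Integrated Gradients, the following minimal sketch applies it to a toy logistic-regression classifier with an analytic gradient. The toy model stands in for the paper's LSTM purely for illustration; the weights, input, and all-zeros baseline are assumptions.

```python
import numpy as np

def model(x, w, b):
    """Toy differentiable classifier: logistic regression."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([model_grad(baseline + a * (x - baseline), w, b) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([1.0, 0.5, -2.0])
baseline = np.zeros(3)

attr = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
```

The attributions sum (up to discretization error) to `gap`, the difference between the model output at the input and at the baseline. This completeness property is what lets per-feature attributions be read as shares of a decision.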

[284] Intra-Fairness Dynamics: The Bias Spillover Effect in Targeted LLM Alignment

Eva Paraschou, Line Harder Clemmensen, Sneha Das

Main category: cs.LG

TL;DR: Targeted gender alignment in LLMs can cause bias spillover, worsening fairness across other attributes like physical appearance, sexual orientation, and disability, especially in ambiguous contexts.

Details

Motivation: Current LLM fairness alignment focuses on single attributes, ignoring multidimensional fairness and context-specific values, risking bias spillover where improving one attribute worsens others.

Method: Used Direct Preference Optimization and the BBQ benchmark to evaluate fairness across 9 sensitive attributes in 3 LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B) under ambiguous and disambiguated contexts.

Result: Significant bias spillover observed: while aggregate results improved, context-aware analysis showed degradations in ambiguous contexts, particularly for physical appearance (p<0.001 across all models), sexual orientation, and disability status.

Conclusion: Improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the need for context-aware, multi-attribute fairness evaluation frameworks.

Abstract: Conventional large language model (LLM) fairness alignment largely focuses on mitigating bias along single sensitive attributes, overlooking fairness as an inherently multidimensional and context-specific value. This approach risks creating systems that achieve narrow fairness metrics while exacerbating disparities along untargeted attributes, a phenomenon known as bias spillover. While extensively studied in machine learning, bias spillover remains critically underexplored in LLM alignment. In this work, we investigate how targeted gender alignment affects fairness across nine sensitive attributes in three state-of-the-art LLMs (Mistral 7B, Llama 3.1 8B, Qwen 2.5 7B). Using Direct Preference Optimization and the BBQ benchmark, we evaluate fairness under ambiguous and disambiguated contexts. Our findings reveal noticeable bias spillover: while aggregate results show improvements, context-aware analysis exposes significant degradations in ambiguous contexts, particularly for physical appearance ($p< 0.001$ across all models), sexual orientation, and disability status. We demonstrate that improving fairness along one attribute can inadvertently worsen disparities in others under uncertainty, highlighting the necessity of context-aware, multi-attribute fairness evaluation frameworks.
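
As background on the alignment objective named above, here is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair, computed from sequence log-probabilities. The value `beta=0.1` and the example numbers are assumptions; the abstract does not report the study's hyperparameters.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair under a policy and a frozen reference."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)   # policy equals reference
loss_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)  # policy shifted toward chosen
```

The loss only sees the pairs it is trained on (here, gender-targeted preferences), so nothing in the objective constrains how behaviour shifts on other attributes.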

[285] Optical Inversion and Spectral Unmixing of Spectroscopic Photoacoustic Images with Physics-Informed Neural Networks

Sarkis Ter Martirosyan, Xinyue Huang, David Qin, Anthony Yu, Stanislav Emelianov

Main category: cs.LG

TL;DR: SPOI-AE is an autoencoder that solves spectroscopic photoacoustic optical inversion and spectral unmixing problems without assuming linearity, providing accurate chromophore concentration estimates from in vivo mouse lymph node images.

Details

Motivation: Spectroscopic photoacoustic imaging can reveal structural, functional, and molecular information about physiological processes through chromophore concentration estimation, but this is challenging due to nonlinearities and ill-posedness in sPA imaging.

Method: The Spectroscopic Photoacoustic Optical Inversion Autoencoder (SPOI-AE) was developed to address sPA optical inversion and spectral unmixing without assuming linearity. It was trained and tested on in vivo mouse lymph node sPA images with unknown ground truth concentrations.

Result: SPOI-AE better reconstructs input sPA pixels than conventional algorithms while providing biologically coherent estimates for optical parameters, chromophore concentrations, and tissue oxygen saturation. Validation using simulated mouse lymph node phantom ground truth confirmed its unmixing accuracy.

Conclusion: SPOI-AE successfully addresses the challenging sPA optical inversion problem, providing accurate chromophore concentration estimates without linearity assumptions, with validation demonstrating its effectiveness for in vivo biomedical imaging applications.

Abstract: Accurate estimation of the relative concentrations of chromophores in a spectroscopic photoacoustic (sPA) image can reveal immense structural, functional, and molecular information about physiological processes. However, due to nonlinearities and ill-posedness inherent to sPA imaging, concentration estimation is intractable. The Spectroscopic Photoacoustic Optical Inversion Autoencoder (SPOI-AE) aims to address the sPA optical inversion and spectral unmixing problems without assuming linearity. Herein, SPOI-AE was trained and tested on in vivo mouse lymph node sPA images with unknown ground truth chromophore concentrations. SPOI-AE better reconstructs input sPA pixels than conventional algorithms while providing biologically coherent estimates for optical parameters, chromophore concentrations, and the percent oxygen saturation of tissue. SPOI-AE’s unmixing accuracy was validated using a simulated mouse lymph node phantom ground truth.

[286] Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel, Alon Cohen

Main category: cs.LG

TL;DR: New algorithm for reward-agnostic exploration in MDPs that significantly relaxes accuracy requirements, with tight lower bound for reward-free exploration

Details

Motivation: Previous work on reward-agnostic exploration in episodic MDPs achieved minimax sample complexity but only for restrictively small accuracy parameters ε. The authors aim to develop an algorithm that works for much larger ε values while maintaining theoretical guarantees.

Method: Proposes a novel algorithm using online learning with carefully designed rewards to construct an exploration policy. This policy gathers sufficient data for accurate dynamics estimation, enabling computation of ε-optimal policies once the true reward is revealed.

Result: The algorithm significantly relaxes the requirement on ε compared to prior work, achieving reward-agnostic exploration for much larger accuracy parameters while maintaining theoretical guarantees. Also establishes a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

Conclusion: The proposed algorithm advances reward-agnostic exploration by working for more practical accuracy parameters, and the tight lower bound provides fundamental understanding of reward-free exploration limits.

Abstract: We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

[287] GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

Main category: cs.LG

TL;DR: GICDM corrects hubness bias in generative model evaluation by improving neighborhood estimation in embedding spaces, enhancing reliability of distance-based metrics.

Details

Motivation: Current generative model evaluation relies on high-dimensional embedding spaces where hubness phenomenon distorts nearest neighbor relationships and biases distance-based metrics, leading to unreliable assessments.

Method: Introduces Generative ICDM (GICDM), building on classical Iterative Contextual Dissimilarity Measure, to correct neighborhood estimation for both real and generated data. Includes multi-scale extension for improved empirical behavior.

Result: Extensive experiments on synthetic and real benchmarks show GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human judgment.

Conclusion: GICDM provides a robust solution to hubness bias in generative model evaluation, offering more reliable assessment of multimodal generative models.

Abstract: Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human judgment.
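
Hubness is commonly quantified through the k-occurrence count: how often each point appears in other points' k-nearest-neighbour lists. The sketch below measures it on synthetic Gaussian data. It illustrates the phenomenon only and does not implement ICDM or GICDM; the function names and data sizes are assumptions.

```python
import numpy as np

def k_occurrence(X, k):
    """N_k(i): number of other points that list point i among their k nearest neighbours."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                      # a point is not its own neighbour
    knn = np.argsort(d2, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=len(X))

def skewness(v):
    """Sample skewness; a strongly positive value indicates a few dominant hubs."""
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
counts_low = k_occurrence(rng.normal(size=(500, 2)), k=10)     # low-dimensional data
counts_high = k_occurrence(rng.normal(size=(500, 200)), k=10)  # high-dimensional data
```

Comparing `skewness(counts_low)` with `skewness(counts_high)` typically shows a much heavier right tail in the high-dimensional case: a handful of hub points appear in many neighbour lists, which is the distortion that affects nearest-neighbour-based metrics.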

[288] Easy Data Unlearning Bench

Roy Rinberg, Pol Puigdemont, Martin Pawelczyk, Volkan Cevher

Main category: cs.LG

TL;DR: A unified benchmarking suite for machine unlearning evaluation using KLoM metric with precomputed resources for reproducible comparisons.

Details

Motivation: Current machine unlearning evaluation methods are technically challenging, requiring complex setups and significant engineering overhead, making comparisons difficult.

Method: Introduces a unified benchmarking suite with KLoM (KL divergence of Margins) metric, precomputed model ensembles, oracle outputs, and streamlined infrastructure for out-of-the-box evaluation.

Result: Provides a standardized framework that enables reproducible, scalable, and fair comparison across unlearning methods, with publicly available code and data.

Conclusion: The benchmark serves as a practical foundation for accelerating research and promoting best practices in machine unlearning through standardized evaluation.

Abstract: Evaluating machine unlearning methods remains technically challenging, with recent benchmarks requiring complex setups and significant engineering overhead. We introduce a unified and extensible benchmarking suite that simplifies the evaluation of unlearning algorithms using the KLoM (KL divergence of Margins) metric. Our framework provides precomputed model ensembles, oracle outputs, and streamlined infrastructure for running evaluations out of the box. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods. We aim for this benchmark to serve as a practical foundation for accelerating research and promoting best practices in machine unlearning. Our code and data are publicly available.

[289] Fast and Scalable Analytical Diffusion

Xinyi Shang, Peng Sun, Jingyu Lin, Zhiqiang Shen

Main category: cs.LG

TL;DR: GoldDiff: Training-free analytical diffusion framework that dynamically selects small “Golden Subset” of data for denoising, achieving 71× speedup while matching full-scan performance and scaling to ImageNet-1K.

Details

Motivation: Analytical diffusion models offer mathematical transparency but require scanning the entire dataset at every timestep, making them prohibitively expensive for large datasets. The authors aim to overcome this scalability bottleneck while preserving interpretability.

Method: Proposes Dynamic Time-Aware Golden Subset Diffusion (GoldDiff) that identifies the phenomenon of Posterior Progressive Concentration - the effective support of denoising score shrinks from global manifold to local neighborhood as signal-to-noise ratio increases. Uses coarse-to-fine mechanism to dynamically pinpoint a small “Golden Subset” for inference instead of scanning entire dataset.

Result: Achieves 71× speedup on AFHQ while matching or exceeding full-scan baseline performance. Successfully scales analytical diffusion to ImageNet-1K for the first time, demonstrating scalable, training-free large-scale generative modeling.

Conclusion: GoldDiff enables scalable analytical diffusion by decoupling inference complexity from dataset size through dynamic subset selection, making mathematically transparent generative modeling practical for large-scale applications.

Abstract: Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the “Golden Subset” for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a $71\times$ speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
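
The empirical-Bayes posterior mean and its restriction to a subset can be sketched as follows. The brute-force nearest-neighbour selection in `nearest_subset` is a stand-in for illustration only and is not GoldDiff's coarse-to-fine mechanism; the noise model x_t = alpha * x_0 + sigma * noise, the data, and all names are assumptions.

```python
import numpy as np

def posterior_mean(x_t, data, alpha, sigma, subset=None):
    """Empirical-Bayes denoiser E[x_0 | x_t] for x_t = alpha * x_0 + sigma * noise."""
    pts = data if subset is None else data[subset]
    logits = -((x_t - alpha * pts) ** 2).sum(axis=1) / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())   # softmax weights over training points
    w /= w.sum()
    return w @ pts

def nearest_subset(x_t, data, alpha, k):
    """Stand-in subset selection: the k training points with the largest posterior weight."""
    d2 = ((x_t - alpha * data) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 8))
alpha, sigma = 0.99, 0.1                 # a high signal-to-noise timestep
x_t = alpha * data[0] + sigma * rng.normal(size=8)

full = posterior_mean(x_t, data, alpha, sigma)
subset = nearest_subset(x_t, data, alpha, k=50)
sparse = posterior_mean(x_t, data, alpha, sigma, subset=subset)
```

At this noise level the posterior weights are concentrated on a few nearby training points, so the 50-point estimate agrees with the full 2000-point scan to numerical precision. At low signal-to-noise ratios the weights spread out and a larger subset would be needed.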

[290] Learning with Locally Private Examples by Inverse Weierstrass Private Stochastic Gradient Descent

Jean Dufraiche, Paul Mangold, Michaël Perrot, Marc Tommasi

Main category: cs.LG

TL;DR: The paper proposes a bias-correction method for binary classification under Local Differential Privacy using the Weierstrass transform, and introduces IWP-SGD algorithm that converges to true population risk minimizer.

Details

Motivation: Noninteractive Local Differential Privacy (LDP) enables data reusability but introduces noise that creates bias in subsequent analyses, particularly for binary classification tasks.

Method: Leverages the Weierstrass transform to characterize bias in binary classification under LDP, inverts the transform for bias correction, and builds the Inverse Weierstrass Private SGD (IWP-SGD) algorithm.

Result: IWP-SGD converges to the true population risk minimizer at a rate of $\mathcal{O}(1/n)$, with $n$ the number of examples, and is validated empirically on binary classification tasks using synthetic and real-world datasets.

Conclusion: The proposed method effectively corrects bias in LDP-released data for binary classification, enabling unbiased estimation of nonlinear functions and convergence to optimal solutions.

Abstract: Releasing data once and for all under noninteractive Local Differential Privacy (LDP) enables complete data reusability, but the resulting noise may create bias in subsequent analyses. In this work, we leverage the Weierstrass transform to characterize this bias in binary classification. We prove that inverting this transform leads to a bias-correction method to compute unbiased estimates of nonlinear functions on examples released under LDP. We then build a novel stochastic gradient descent algorithm called Inverse Weierstrass Private SGD (IWP-SGD). It converges to the true population risk minimizer at a rate of $\mathcal{O}(1/n)$, with $n$ the number of examples. We empirically validate IWP-SGD on binary classification tasks using synthetic and real-world datasets.
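
To make the bias-correction idea concrete, the sketch below uses the simplest nonlinear function, f(x) = x², and assumes additive Gaussian noise with known standard deviation `sigma` (the abstract does not state the noise mechanism). Averaging f over Gaussian noise is a Weierstrass-type transform and gives x² + σ², so the inverse transform of f is y² − σ², which is unbiased on the noisy release. All names here are illustrative.

```python
import random

def naive_square(y):
    # Applying f directly to the noisy value is biased by sigma^2.
    return y * y

def corrected_square(y, sigma):
    # Inverse Weierstrass transform of f(x) = x^2 under N(0, sigma^2) noise.
    return y * y - sigma ** 2

def monte_carlo_mean(fn, x, sigma, n, seed=0):
    """Average fn over n noisy releases x + N(0, sigma^2)."""
    rng = random.Random(seed)
    return sum(fn(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n

x_true, sigma, n = 1.5, 1.0, 200_000
naive = monte_carlo_mean(naive_square, x_true, sigma, n)
corrected = monte_carlo_mean(lambda y: corrected_square(y, sigma), x_true, sigma, n)
print(f"target x^2 = {x_true ** 2}, naive = {naive:.3f}, corrected = {corrected:.3f}")
```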

[291] Interpretability-by-Design with Accurate Locally Additive Models and Conditional Feature Effects

Vasilis Gkolemis, Loukas Kavouras, Dimitrios Kyriakopoulos, Konstantinos Tsopelas, Dimitrios Rontogiannis, Giuseppe Casalicchio, Theodore Dalamagas, Christos Diou

Main category: cs.LG

TL;DR: CALMs (Conditionally Additive Local Models) balance interpretability of GAMs with accuracy of GA²Ms by allowing multiple univariate shape functions per feature that are active in different regions defined by simple thresholds on interacting features.

Details

Motivation: Generalized additive models (GAMs) are interpretable but underfit when interactions exist, while GA²Ms add interactions at the cost of interpretability. There's a need for models that maintain interpretability while capturing interactions for better accuracy.

Method: CALMs allow multiple univariate shape functions per feature, each active in different regions defined by simple logical conditions (thresholds) on interacting features. A distillation-based training pipeline identifies homogeneous regions with limited interactions and fits interpretable shape functions via region-aware backfitting.

Result: Experiments on diverse classification and regression tasks show CALMs consistently outperform GAMs and achieve accuracy comparable with GA²Ms while maintaining better interpretability.

Conclusion: CALMs offer a compelling trade-off between predictive accuracy and interpretability, providing locally additive effects that vary across subregions to capture interactions while remaining interpretable.

Abstract: Generalized additive models (GAMs) offer interpretability through independent univariate feature effects but underfit when interactions are present in data. GA$^2$Ms add selected pairwise interactions which improves accuracy, but sacrifices interpretability and limits model auditing. We propose \emph{Conditionally Additive Local Models} (CALMs), a new model class, that balances the interpretability of GAMs with the accuracy of GA$^2$Ms. CALMs allow multiple univariate shape functions per feature, each active in different regions of the input space. These regions are defined independently for each feature as simple logical conditions (thresholds) on the features it interacts with. As a result, effects remain locally additive while varying across subregions to capture interactions. We further propose a principled distillation-based training pipeline that identifies homogeneous regions with limited interactions and fits interpretable shape functions via region-aware backfitting. Experiments on diverse classification and regression tasks show that CALMs consistently outperform GAMs and achieve accuracy comparable with GA$^2$Ms. Overall, CALMs offer a compelling trade-off between predictive accuracy and interpretability.
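
A minimal sketch of how a conditionally additive prediction could be evaluated, assuming a toy two-feature model. The shape functions, thresholds, and intercept below are hypothetical and hand-written; in the paper they are learned by the distillation and backfitting pipeline.

```python
import math

# Hypothetical CALM: feature 0 has two shape functions, selected by a
# threshold on feature 1; feature 1 has a single shape function.
MODEL = {
    0: [
        (lambda x: x[1] <= 0.5, lambda v: 2.0 * v),
        (lambda x: x[1] > 0.5, lambda v: -1.0 * v),
    ],
    1: [
        (lambda x: True, lambda v: math.sin(v)),
    ],
}
INTERCEPT = 0.1

def predict(x):
    """Sum, over features, the one shape function whose region contains x."""
    total = INTERCEPT
    for feature, regions in MODEL.items():
        for condition, shape in regions:
            if condition(x):
                total += shape(x[feature])
                break  # exactly one region is active per feature
    return total

print(predict([1.0, 0.0]))  # feature 0 uses the first shape function
print(predict([1.0, 1.0]))  # feature 0 uses the second shape function
```

Within each region the model is still a sum of univariate effects, so each active shape function can be plotted and audited like a GAM term.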

[292] Beyond SGD, Without SVD: Proximal Subspace Iteration LoRA with Diagonal Fractional K-FAC

Abdulla Jasem Almansoori, Maria Ivanova, Andrey Veprikov, Aleksandr Beznosikov, Samuel Horváth, Martin Takáč

Main category: cs.LG

TL;DR: LoRSum: A memory-efficient optimization method for LoRA fine-tuning that bridges the gap between full-step training with low-rank projections and standard LoRA, using proximal optimization and alternating least squares updates.

Details

Motivation: Address the computational gap between training with full steps using low-rank projections (SVDLoRA) and standard LoRA fine-tuning, while maintaining LoRA's parameter efficiency and avoiding expensive full-matrix SVD projections.

Method: Proposes LoRSum, which casts LoRA optimization as a proximal sub-problem solved efficiently with alternating least squares updates (proven to be an implicit block power method). Also introduces a scaled variant for preconditioned gradient descent using structured metrics like K-FAC and Shampoo with diagonal storage for memory efficiency.

Result: Experiments on synthetic tasks, CIFAR-100, and language-model fine-tuning (GLUE, SQuAD v2, WikiText-103) show LoRSum can match or improve LoRA baselines with modest compute overhead while retaining parameter efficiency and avoiding full-matrix SVD.

Conclusion: LoRSum provides an efficient optimization framework for LoRA fine-tuning that bridges computational gaps, recovers existing preconditioning methods as special cases, and maintains memory efficiency while improving performance.

Abstract: Low-Rank Adaptation (LoRA) fine-tunes large models by learning low-rank updates on top of frozen weights, dramatically reducing trainable parameters and memory. In this work, we address the gap between training with full steps with low-rank projections (SVDLoRA) and LoRA fine-tuning. We propose LoRSum, a memory-efficient subroutine that closes this gap for gradient descent by casting LoRA optimization as a proximal sub-problem and solving it efficiently with alternating least squares updates, which we prove to be an implicit block power method. We recover several recently proposed preconditioning methods for LoRA as special cases, and show that LoRSum can also be used for updating a low-rank momentum. In order to address full steps with preconditioned gradient descent, we propose a scaled variant of LoRSum that uses structured metrics such as K-FAC and Shampoo, and we show that storing the diagonal of these metrics still allows them to perform well while remaining memory-efficient. Experiments on a synthetic task, CIFAR-100, and language-model fine-tuning on GLUE, SQuAD v2, and WikiText-103, show that our method can match or improve LoRA baselines given modest compute overhead, while avoiding full-matrix SVD projections and retaining LoRA-style parameter efficiency.
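
The link between alternating least squares and power iteration can be seen in the rank-1 case. The sketch below is not LoRSum itself: it fits `b a^T` to a fixed matrix `G` (a hypothetical stand-in for a full update step) by alternating exact least-squares solves, which amounts to power iteration on `G^T G`.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank1_als(G, iters=50):
    """Minimize ||G - b a^T||_F by alternating exact least-squares solves."""
    Gt = transpose(G)
    a = [1.0] * len(G[0])
    b = [0.0] * len(G)
    for _ in range(iters):
        b = [x / dot(a, a) for x in matvec(G, a)]   # best b for fixed a
        a = [x / dot(b, b) for x in matvec(Gt, b)]  # best a for fixed b
    return b, a

G = [[3.0, 0.0], [0.0, 1.0]]  # singular values 3 and 1
b, a = rank1_als(G)
approx = [[bi * aj for aj in a] for bi in b]
print(approx)  # approaches the best rank-1 approximation of G
```

No SVD is computed at any point; each half-step is a matrix-vector product followed by a scalar normalization.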

[293] HPMixer: Hierarchical Patching for Multivariate Time Series Forecasting

Jung Min Choi, Vijaya Krishna Yalavarthi, Lars Schmidt-Thieme

Main category: cs.LG

TL;DR: HPMixer: A hierarchical patching mixer for multivariate time series forecasting that decouples periodic and residual components using learnable cycle modules and stationary wavelet transforms.

Details

Motivation: To effectively capture both periodic patterns and residual dynamics in long-term multivariate time series forecasting, which is essential for accurate predictions but challenging within standard deep learning benchmark settings.

Method: Proposes HPMixer with decoupled periodic and residual components: 1) Periodic component uses learnable cycle module enhanced with nonlinear channel-wise MLP, 2) Residual component uses Learnable Stationary Wavelet Transform (LSWT) for stable frequency-domain representations, 3) Channel-mixing encoder for inter-channel dependencies, 4) Two-level non-overlapping hierarchical patching for multi-scale residual variations.

Result: Extensive experiments on standard multivariate benchmarks demonstrate that HPMixer achieves competitive or state-of-the-art performance compared to recent baselines.

Conclusion: HPMixer provides an effective framework for multivariate time series forecasting by integrating decoupled periodicity modeling with structured, multi-scale residual learning.

Abstract: In long-term multivariate time series forecasting, effectively capturing both periodic patterns and residual dynamics is essential. To address this within standard deep learning benchmark settings, we propose the Hierarchical Patching Mixer (HPMixer), which models periodicity and residuals in a decoupled yet complementary manner. The periodic component utilizes a learnable cycle module [7] enhanced with a nonlinear channel-wise MLP for greater expressiveness. The residual component is processed through a Learnable Stationary Wavelet Transform (LSWT) to extract stable, shift-invariant frequency-domain representations. Subsequently, a channel-mixing encoder models explicit inter-channel dependencies, while a two-level non-overlapping hierarchical patching mechanism captures coarse- and fine-scale residual variations. By integrating decoupled periodicity modeling with structured, multi-scale residual learning, HPMixer provides an effective framework. Extensive experiments on standard multivariate benchmarks demonstrate that HPMixer achieves competitive or state-of-the-art performance compared to recent baselines.
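
One plausible reading of two-level non-overlapping patching is sketched below for a single channel. The patch sizes and the helper names are assumptions for illustration; the abstract does not give the paper's exact configuration.

```python
def patch(seq, size):
    """Split seq into consecutive non-overlapping patches of length size."""
    if len(seq) % size != 0:
        raise ValueError("sequence length must be divisible by patch size")
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def hierarchical_patch(seq, coarse, fine):
    """Two-level patching: coarse patches, each split into fine patches."""
    return [patch(p, fine) for p in patch(seq, coarse)]

series = list(range(12))  # toy residual series
levels = hierarchical_patch(series, coarse=6, fine=2)
print(levels)
# [[[0, 1], [2, 3], [4, 5]], [[6, 7], [8, 9], [10, 11]]]
```

The outer level exposes coarse-scale variation across long windows, while the inner level keeps fine-scale structure within each window.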

[294] Synthesis and Verification of Transformer Programs

Hongjian Jiang, Matthew Hague, Philipp Rümmer, Anthony Widjaja Lin

Main category: cs.LG

TL;DR: Develops verification and learning techniques for C-RASP programs, with applications to transformer program optimization and constrained learning.

Details

Motivation: C-RASP is a programming language that captures concepts expressible by transformers, but there's a need for automated verification and learning methods for C-RASP programs to enable transformer program optimization and constrained learning.

Method: Two main contributions: (1) algorithmic techniques for verifying C-RASP by connecting to synchronous dataflow program verification in Lustre, leveraging SMT-solvers; (2) a local search algorithm for learning C-RASP programs from examples.

Result: Implementation demonstrates efficacy on C-RASP benchmarks from literature, particularly for transformer program optimization and constrained learning of transformer programs based on partial specifications.

Conclusion: The paper presents effective verification and learning techniques for C-RASP, enabling practical applications in transformer program optimization and constrained learning scenarios.

Abstract: C-RASP is a simple programming language that was recently shown to capture concepts expressible by transformers. In this paper, we develop new algorithmic techniques for automatically verifying C-RASPs. To this end, we establish a connection to the verification of synchronous dataflow programs in Lustre, which enables us to exploit state-of-the-art model checkers utilizing highly optimized SMT-solvers. Our second contribution addresses learning a C-RASP program in the first place. To this end, we provide a new algorithm for learning a C-RASP from examples using local search. We demonstrate efficacy of our implementation for benchmarks of C-RASPs in the literature, in particular in connection to the following applications: (1) transformer program optimization, and (2) constrained learning of transformer programs (based on a partial specification).

[295] AIFL: A Global Daily Streamflow Forecasting Model Using Deterministic LSTM Pre-trained on ERA5-Land and Fine-tuned on IFS

Maria Luisa Taccari, Kenza Tazi, Oisín M. Morrison, Andreas Grafberger, Juan Colonese, Corentin Carton de Wiart, Christel Prudhomme, Cinzia Mazzetti, Matthew Chantry, Florian Pappenberger

Main category: cs.LG

TL;DR: AIFL is an LSTM-based global streamflow forecasting model that uses a two-stage training strategy to bridge the gap between historical reanalysis data and operational weather forecasts, achieving competitive performance with state-of-the-art systems.

Details

Motivation: Data-driven streamflow forecasting models often suffer performance degradation when transitioning from historical reanalysis data to operational forecast products due to domain shift issues. There's a need for reliable global streamflow forecasting for flood preparedness and water resource management.

Method: AIFL uses an LSTM-based architecture with a novel two-stage training strategy: 1) pre-training on 40 years of ERA5-Land reanalysis data (1980-2019) to learn hydrological processes, and 2) fine-tuning on operational Integrated Forecasting System (IFS) control forecasts (2016-2019) to adapt to operational weather prediction biases. Trained on 18,588 basins from the CARAVAN dataset.

Result: On independent test data (2021-2024), AIFL achieves median modified Kling-Gupta Efficiency (KGE’) of 0.66 and median Nash-Sutcliffe Efficiency (NSE) of 0.53. It’s competitive with state-of-the-art global systems and demonstrates exceptional reliability in extreme-event detection.

Conclusion: AIFL provides a transparent, reproducible, and operationally robust baseline for global hydrological forecasting, successfully bridging the reanalysis-to-forecast domain shift through its two-stage training approach.

Abstract: Reliable global streamflow forecasting is essential for flood preparedness and water resource management, yet data-driven models often suffer from a performance gap when transitioning from historical reanalysis to operational forecast products. This paper introduces AIFL (Artificial Intelligence for Floods), a deterministic LSTM-based model designed for global daily streamflow forecasting. Trained on 18,588 basins curated from the CARAVAN dataset, AIFL utilises a novel two-stage training strategy to bridge the reanalysis-to-forecast domain shift. The model is first pre-trained on 40 years of ERA5-Land reanalysis (1980-2019) to capture robust hydrological processes, then fine-tuned on operational Integrated Forecasting System (IFS) control forecasts (2016-2019) to adapt to the specific error structures and biases of operational numerical weather prediction. To our knowledge, this is the first global model trained end-to-end within the CARAVAN ecosystem. On an independent temporal test set (2021-2024), AIFL achieves high predictive skill with a median modified Kling-Gupta Efficiency (KGE’) of 0.66 and a median Nash-Sutcliffe Efficiency (NSE) of 0.53. Benchmarking results show that AIFL is highly competitive with current state-of-the-art global systems, achieving comparable accuracy while maintaining a transparent and reproducible forcing pipeline. The model demonstrates exceptional reliability in extreme-event detection, providing a streamlined and operationally robust baseline for the global hydrological community.
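
For readers unfamiliar with the two skill scores, the sketch below computes NSE and a modified KGE on toy data. It assumes the commonly used modified form, with correlation `r`, bias ratio `beta` (ratio of means), and variability ratio `gamma` (ratio of coefficients of variation); the abstract does not spell out the exact variant used.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    m = mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - m) ** 2 for o in obs)
    return 1.0 - num / den

def kge_prime(obs, sim):
    """Modified Kling-Gupta Efficiency (assumed definition, see lead-in)."""
    mo, ms = mean(obs), mean(sim)
    so, ss = std(obs), std(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (so * ss)             # linear correlation
    beta = ms / mo                  # bias ratio
    gamma = (ss / ms) / (so / mo)   # variability ratio (ratio of CVs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]   # toy observed discharge
sim = [2.0, 4.0, 6.0, 8.0]   # simulation with a factor-of-two bias
print(nse(obs, sim), kge_prime(obs, sim))
```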

[296] Small molecule retrieval from tandem mass spectrometry: what are we optimizing for?

Gaetan De Waele, Marek Wydmuch, Krzysztof Dembczyński, Wojciech Kotłowski, Willem Waegeman

Main category: cs.LG

TL;DR: Theoretical analysis of loss functions for deep learning-based molecular fingerprint prediction from mass spectrometry data, revealing a fundamental trade-off between fingerprint accuracy and molecular retrieval performance.

Details

Motivation: Deep learning methods for LC-MS/MS data analysis predict molecular fingerprints from mass spectra for compound identification, but the impact of different loss functions on model performance is poorly understood.

Method: Theoretical investigation of commonly used loss functions with novel regret bounds analysis to characterize when Bayes-optimal decisions for different objectives must diverge.

Result: Reveals a fundamental trade-off between fingerprint similarity and molecular retrieval objectives - optimizing for more accurate fingerprint predictions worsens retrieval results, and vice versa.

Conclusion: The trade-off depends on the similarity structure of candidate sets, providing guidance for loss function and fingerprint selection in computational mass spectrometry analysis.

Abstract: One of the central challenges in the computational analysis of liquid chromatography-tandem mass spectrometry (LC-MS/MS) data is to identify the compounds underlying the output spectra. In recent years, this problem has increasingly been tackled using deep learning methods. A common strategy involves predicting a molecular fingerprint vector from an input mass spectrum, which is then used to search for matches in a chemical compound database. While various loss functions are employed in training these predictive models, their impact on model performance remains poorly understood. In this study, we investigate commonly used loss functions, deriving novel regret bounds that characterize when Bayes-optimal decisions for these objectives must diverge. Our results reveal a fundamental trade-off between the two objectives of (1) fingerprint similarity and (2) molecular retrieval. Optimizing for more accurate fingerprint predictions typically worsens retrieval results, and vice versa. Our theoretical analysis shows this trade-off depends on the similarity structure of candidate sets, providing guidance for loss function and fingerprint selection.

[297] Reinforcement Learning for Parameterized Quantum State Preparation: A Comparative Study

Gerhard Stenzel, Isabella Debelic, Michael Kölle, Tobias Rohe, Leo Sünkel, Julian Hager, Claudia Linnhoff-Popien

Main category: cs.LG

TL;DR: Reinforcement learning for parameterized quantum circuit synthesis with continuous rotations, comparing one-stage vs two-stage PPO approaches for state preparation tasks.

Details

Motivation: Extend quantum circuit synthesis from discrete gate selection to continuous parameterized rotations for more flexible quantum state preparation, addressing scalability challenges in quantum computing.

Method: Use reinforcement learning (PPO and A2C) with Gymnasium and PennyLane for quantum circuit synthesis. Compare one-stage agent (jointly selects gate type, qubits, rotation angles) vs two-stage variant (first proposes discrete circuit, then optimizes angles with Adam using parameter-shift gradients).

Result: PPO succeeds with stable hyperparameters while A2C fails. Both approaches reconstruct computational basis states (83-99% success) and Bell states (61-77% success). Scalability saturates at λ≈3-4 and doesn’t extend to 10-qubit targets. Two-stage method offers marginal accuracy gains with 3x runtime.

Conclusion: One-stage PPO policy recommended for practical use under fixed compute budget. Paper provides synthesized circuits and contrasts with classical variational baseline, outlining avenues for improved scalability in quantum circuit synthesis.

Abstract: We extend directed quantum circuit synthesis (DQCS) with reinforcement learning from purely discrete gate selection to parameterized quantum state preparation with continuous single-qubit rotations $R_x$, $R_y$, and $R_z$. We compare two training regimes: a one-stage agent that jointly selects the gate type, the affected qubit(s), and the rotation angle; and a two-stage variant that first proposes a discrete circuit and subsequently optimizes the rotation angles with Adam using parameter-shift gradients. Using Gymnasium and PennyLane, we evaluate Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C) on systems comprising two to ten qubits and on targets of increasing complexity with $\lambda$ ranging from one to five. Whereas A2C does not learn effective policies in this setting, PPO succeeds under stable hyperparameters (one-stage: learning rate approximately $5\times10^{-4}$ with a self-fidelity-error threshold of 0.01; two-stage: learning rate approximately $10^{-4}$). Both approaches reliably reconstruct computational basis states (between 83% and 99% success) and Bell states (between 61% and 77% success). However, scalability saturates for $\lambda$ of approximately three to four and does not extend to ten-qubit targets even at $\lambda=2$. The two-stage method offers only marginal accuracy gains while requiring around three times the runtime. For practicality under a fixed compute budget, we therefore recommend the one-stage PPO policy, provide explicit synthesized circuits, and contrast with a classical variational baseline to outline avenues for improved scalability.
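
The parameter-shift rule used in the two-stage variant can be checked by hand on a single qubit. The sketch below avoids any quantum library: it uses the closed form for the expectation of Z after applying R_y(theta) to |0>, which is cos(theta), and compares the shifted-evaluation gradient with the analytic derivative. The function names are illustrative, not taken from the paper or from PennyLane.

```python
import math

def expval_z(theta):
    """<Z> after applying R_y(theta) to |0>, which equals cos(theta)."""
    return math.cos(theta)

def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Exact gradient for rotation gates from two shifted evaluations."""
    return (f(theta + shift) - f(theta - shift)) / 2.0

theta = 0.3
grad = parameter_shift_grad(expval_z, theta)
print(grad, -math.sin(theta))  # the two values agree
```

Unlike a finite difference, the shift is large (pi/2) and the result is exact for such gates, which is why the rule is usable with circuit evaluations alone.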

[298] Capacity-constrained demand response in smart grids using deep reinforcement learning

Shafagh Abband Pashaki, Sepehr Maleki, Amir Badiee

Main category: cs.LG

TL;DR: A deep reinforcement learning approach for capacity-constrained incentive-based demand response in residential smart grids that uses financial incentives to reduce peak demand and smooth load profiles.

Details

Motivation: To address electricity grid capacity limits and prevent congestion in residential smart grids by financially incentivizing end users to reduce or shift energy consumption during peak periods.

Method: Hierarchical architecture with a service provider adjusting hourly incentive rates based on wholesale electricity prices and aggregated residential load. Uses deep reinforcement learning to learn optimal real-time incentive rates under explicit capacity constraints, with heterogeneous user preferences modeled through appliance-level home energy management systems and dissatisfaction costs.

Result: Simulation with real-world residential electricity consumption and price data from three households shows effective peak demand reduction and smoothed aggregated load profile, achieving approximately 22.82% reduction in peak-to-average ratio compared to no-demand-response case.

Conclusion: The proposed capacity-constrained incentive-based demand response approach using deep reinforcement learning successfully manages residential electricity demand to prevent grid congestion while considering financial interests of both service providers and end users.

Abstract: This paper presents a capacity-constrained incentive-based demand response approach for residential smart grids. It aims to maintain electricity grid capacity limits and prevent congestion by financially incentivising end users to reduce or shift their energy consumption. The proposed framework adopts a hierarchical architecture in which a service provider adjusts hourly incentive rates based on wholesale electricity prices and aggregated residential load. The financial interests of both the service provider and end users are explicitly considered. A deep reinforcement learning approach is employed to learn optimal real-time incentive rates under explicit capacity constraints. Heterogeneous user preferences are modelled through appliance-level home energy management systems and dissatisfaction costs. Using real-world residential electricity consumption and price data from three households, simulation results show that the proposed approach effectively reduces peak demand and smooths the aggregated load profile. This leads to an approximately 22.82% reduction in the peak-to-average ratio compared to the no-demand-response case.
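
The headline metric, the peak-to-average ratio (PAR), is straightforward to compute. The load profiles below are hypothetical numbers chosen only to show the calculation; they are not data from the paper.

```python
def peak_to_average_ratio(load):
    """PAR of a load profile: peak demand divided by mean demand."""
    return max(load) / (sum(load) / len(load))

def relative_reduction(before, after):
    return (before - after) / before

baseline = [2.0, 2.0, 6.0, 2.0]  # hypothetical hourly loads (kW), no demand response
shifted = [3.0, 3.0, 4.0, 2.0]   # same total energy, flatter peak

par_before = peak_to_average_ratio(baseline)
par_after = peak_to_average_ratio(shifted)
print(par_before, par_after, relative_reduction(par_before, par_after))
```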

[299] FEKAN: Feature-Enriched Kolmogorov-Arnold Networks

Sidharth S. Menon, Ameya D. Jagtap

Main category: cs.LG

TL;DR: FEKAN (Feature-Enriched Kolmogorov-Arnold Networks) improves computational efficiency and accuracy of KANs through feature enrichment without increasing trainable parameters, demonstrating superior performance across function approximation, PDE solving, and neural operator tasks.

Details

Motivation: Existing KAN architectures suffer from high computational cost and slow convergence, limiting their scalability and practical applicability despite offering enhanced interpretability via functional decomposition.

Method: Introduces Feature-Enriched KANs (FEKAN) that preserve KAN advantages while improving efficiency through feature enrichment without increasing trainable parameters, accelerating convergence and increasing representation capacity.

Result: FEKAN demonstrates substantially faster convergence and consistently higher approximation accuracy than baseline KAN variants across function approximation, PDE solving, and neural operator tasks.

Conclusion: FEKAN provides a simple yet effective extension to KANs that overcomes computational limitations while maintaining interpretability, with theoretical foundations showing superior representation capacity.

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a compelling alternative to multilayer perceptrons, offering enhanced interpretability via functional decomposition. However, existing KAN architectures, including spline-, wavelet-, and radial-basis variants, suffer from high computational cost and slow convergence, limiting scalability and practical applicability. Here, we introduce Feature-Enriched Kolmogorov-Arnold Networks (FEKAN), a simple yet effective extension that preserves all the advantages of KAN while improving computational efficiency and predictive accuracy through feature enrichment, without increasing the number of trainable parameters. By incorporating these additional features, FEKAN accelerates convergence, increases representation capacity, and substantially mitigates the computational overhead characteristic of state-of-the-art KAN architectures. We investigate FEKAN across a comprehensive set of benchmarks, including function-approximation tasks, physics-informed formulations for diverse partial differential equations (PDEs), and neural operator settings that map between input and output function spaces. For function approximation, we systematically compare FEKAN against a broad family of KAN variants: FastKAN, WavKAN, ReLUKAN, HRKAN, ChebyshevKAN, RBFKAN, and the original SplineKAN. Across all tasks, FEKAN demonstrates substantially faster convergence and consistently higher approximation accuracy than the underlying baseline architectures. We also establish the theoretical foundations for FEKAN, showing its superior representation capacity compared to KAN, which contributes to improved accuracy and efficiency.

[300] A Systematic Evaluation of Sample-Level Tokenization Strategies for MEG Foundation Models

SungJun Cho, Chetan Gohil, Rukuang Huang, Oiwi Parker Jones, Mark W. Woolrich

Main category: cs.LG

TL;DR: Systematic evaluation of tokenization strategies for transformer-based large neuroimaging models on MEG data, comparing learnable vs. non-learnable approaches with comprehensive performance metrics.

Details

Motivation: Growing interest in large-scale foundation models for neuroimaging data requires understanding the impact of different tokenization strategies for continuous neural time series data, which is currently poorly understood.

Method: Systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models applied to MEG data, comparing learnable (novel autoencoder-based approach) and non-learnable tokenizers across multiple evaluation criteria including reconstruction fidelity, token prediction, biological plausibility, subject-specific information preservation, and downstream task performance.

Result: Both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting simple fixed sample-level tokenization strategies can be effectively used in neural foundation model development.

Conclusion: Simple fixed sample-level tokenization strategies are sufficient for developing neural foundation models, as they perform comparably to more complex learnable approaches across multiple evaluation metrics on diverse MEG datasets.

Abstract: Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series data, a process referred to as ‘tokenization’. However, the impact of different tokenization strategies for neural data is currently poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and performance on downstream tasks). For the learnable tokenizer, we introduce a novel approach based on an autoencoder. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
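
As an illustration of what a fixed, non-learnable sample-level tokenizer might look like, the sketch below uses uniform amplitude binning. This is an assumption for illustration: the abstract does not say which non-learnable schemes were evaluated, and the range and vocabulary size here are arbitrary.

```python
def tokenize(signal, lo, hi, n_tokens):
    """Map each sample to the index of its uniform amplitude bin."""
    width = (hi - lo) / n_tokens
    tokens = []
    for x in signal:
        clipped = min(max(x, lo), hi)  # clip out-of-range samples
        tokens.append(min(int((clipped - lo) / width), n_tokens - 1))
    return tokens

def detokenize(tokens, lo, hi, n_tokens):
    """Reconstruct each sample as the centre of its bin."""
    width = (hi - lo) / n_tokens
    return [lo + (t + 0.5) * width for t in tokens]

signal = [-0.9, -0.2, 0.0, 0.35, 0.8]  # toy normalized samples
tokens = tokenize(signal, lo=-1.0, hi=1.0, n_tokens=8)
recon = detokenize(tokens, lo=-1.0, hi=1.0, n_tokens=8)
max_err = max(abs(x - r) for x, r in zip(signal, recon))
print(tokens, max_err)  # error is bounded by half a bin width
```

Reconstruction fidelity for such a scheme is controlled directly by the vocabulary size, since the worst-case error is half a bin width.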

[301] Transfer Learning of Linear Regression with Multiple Pretrained Models: Benefiting from More Pretrained Models via Overparameterization Debiasing

Daniel Boharon, Yehuda Dar

Main category: cs.LG

TL;DR: Theoretical analysis of transfer learning for linear regression using multiple overparameterized pretrained models, with insights on when more models help and a debiasing method to address overparameterization bias.

Details

Motivation: To understand when using multiple pretrained models improves transfer learning performance, particularly in overparameterized settings where pretrained models may have biases due to minimum ℓ₂-norm solutions in high-dimensional spaces.

Method: Formulate target learning as optimization minimizing squared errors on target data with penalty on distance from pretrained models. Analytically derive test error and propose multiplicative correction factor to debias overparameterization bias.

Result: Shows that using more pretrained models can improve transfer learning when they are overparameterized, but overparameterization bias can compromise learning. The proposed debiasing method effectively reduces this bias and enables leveraging more pretrained models.

Conclusion: Sufficiently many overparameterized pretrained models are important for beneficial transfer learning, but overparameterization bias must be addressed via debiasing techniques to fully leverage multiple models.

Abstract: We study transfer learning for a linear regression task using several least-squares pretrained models that can be overparameterized. We formulate the target learning task as an optimization problem that minimizes squared errors on the target dataset with a penalty on the distance of the learned model from the pretrained models. We analytically formulate the test error of the learned target model and provide the corresponding empirical evaluations. Our results elucidate when using more pretrained models can improve transfer learning. Specifically, if the pretrained models are overparameterized, using sufficiently many of them is important for beneficial transfer learning. However, the learning may be compromised by the overparameterization bias of pretrained models, i.e., the minimum $\ell_2$-norm solution’s restriction to a small subspace spanned by the training examples in the high-dimensional parameter space. We propose a simple debiasing via a multiplicative correction factor that can reduce the overparameterization bias and leverage more pretrained models to learn a target predictor.

[302] Almost Sure Convergence of Differential Temporal Difference Learning for Average Reward Markov Decision Processes

Ethan Blaser, Jiuqi Wang, Shangtong Zhang

Main category: cs.LG

TL;DR: The paper proves convergence of differential TD learning algorithms for average reward RL without requiring local clock learning rates, making theory more aligned with practical implementations.

Details

Motivation: Differential TD learning is important for average reward RL but existing convergence proofs require impractical local clock learning rates tied to state visit counts, which aren't used in practice and don't extend beyond tabular settings.

Method: Proves almost sure convergence of on-policy n-step differential TD using standard diminishing learning rates without local clock. Derives three sufficient conditions for off-policy n-step differential TD convergence without local clock.

Result: Establishes convergence guarantees for differential TD learning algorithms using practical learning rate schedules, strengthening theoretical foundations and aligning analysis with real-world implementations.

Conclusion: The results bridge the gap between theoretical convergence analysis and practical implementations of differential TD learning for average reward RL, enabling more robust theoretical understanding.

Abstract: The average reward is a fundamental performance metric in reinforcement learning (RL) focusing on the long-run performance of an agent. Differential temporal difference (TD) learning algorithms are a major advance for average reward RL as they provide an efficient online method to learn the value functions associated with the average reward in both on-policy and off-policy settings. However, existing convergence guarantees require a local clock in learning rates tied to state visit counts, which practitioners do not use and which does not extend beyond tabular settings. We address this limitation by proving the almost sure convergence of on-policy $n$-step differential TD for any $n$ using standard diminishing learning rates without a local clock. We then derive three sufficient conditions under which off-policy $n$-step differential TD also converges without a local clock. These results strengthen the theoretical foundations of differential TD and bring its convergence analysis closer to practical implementations.
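To make the "no local clock" setting concrete, here is a minimal sketch of tabular 1-step on-policy differential TD on a toy two-state cycle. It uses the commonly stated form of the update, with a learning rate that depends only on the global step `t`. The toy chain, the schedule `0.5 / (t + 1) ** 0.6`, and the reward-rate multiplier `eta` are illustrative assumptions, not taken from the paper, which treats general $n$-step updates.

```python
def differential_td(transitions, rewards, steps=20000, eta=1.0):
    """Tabular 1-step differential TD on a deterministic toy chain.

    transitions[s] is the next state; rewards[s] is the reward on leaving s.
    Both the chain and the step-size schedule are illustrative assumptions.
    """
    values = {s: 0.0 for s in transitions}
    avg_reward = 0.0
    state = next(iter(transitions))
    for t in range(steps):
        # Global diminishing schedule: depends on t only, not on state visit counts.
        alpha = 0.5 / (t + 1) ** 0.6
        nxt = transitions[state]
        # Differential TD error: reward minus the current average-reward estimate.
        delta = rewards[state] - avg_reward + values[nxt] - values[state]
        values[state] += alpha * delta
        avg_reward += eta * alpha * delta
        state = nxt
    return values, avg_reward


# Two-state cycle with rewards 1 and 3, so the true average reward is 2.
values, avg_reward = differential_td({0: 1, 1: 0}, {0: 1.0, 1: 3.0})
```

On this toy chain the average-reward estimate settles at 2 and the value difference `values[1] - values[0]` settles at 1, which is the fixed point of the two TD-error equations.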

[303] Vulnerability Analysis of Safe Reinforcement Learning via Inverse Constrained Reinforcement Learning

Jialiang Fan, Shixiong Jiang, Mengyu Liu, Fanxin Kong

Main category: cs.LG

TL;DR: A black-box adversarial attack framework for Safe RL that uses expert demonstrations to learn constraint models and surrogate policies, enabling gradient-based attacks without access to victim policy gradients or ground-truth safety constraints.

Details

Motivation: Most Safe RL methods assume benign environments and are vulnerable to adversarial perturbations in real-world settings. Existing gradient-based attacks require access to policy gradients, which is often impractical. There's a need for attacks that work under limited privileged access.

Method: Proposes an adversarial attack framework that uses expert demonstrations and black-box environment interaction to learn a constraint model and surrogate (learner) policy. This enables gradient-based attack optimization without requiring the victim policy’s internal gradients or ground-truth safety constraints.

Result: Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of the approach under limited privileged access. Theoretical analysis establishes feasibility and derives perturbation bounds.

Conclusion: The framework successfully reveals vulnerabilities in Safe RL policies using black-box attacks with expert demonstrations, addressing practical limitations of existing attack methods.

Abstract: Safe reinforcement learning (Safe RL) aims to ensure policy performance while satisfying safety constraints. However, most existing Safe RL methods assume benign environments, making them vulnerable to adversarial perturbations commonly encountered in real-world settings. In addition, existing gradient-based adversarial attacks typically require access to the policy’s gradient information, which is often impractical in real-world scenarios. To address these challenges, we propose an adversarial attack framework to reveal vulnerabilities of Safe RL policies. Using expert demonstrations and black-box environment interaction, our framework learns a constraint model and a surrogate (learner) policy, enabling gradient-based attack optimization without requiring the victim policy’s internal gradients or the ground-truth safety constraints. We further provide theoretical analysis establishing feasibility and deriving perturbation bounds. Experiments on multiple Safe RL benchmarks demonstrate the effectiveness of our approach under limited privileged access.

[304] RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion

Tianmeng Hu, Yongzheng Cui, Biao Luo, Ke Li

Main category: cs.LG

TL;DR: RIDER is an RNA inverse design framework using reinforcement learning to directly optimize for 3D structural similarity rather than just sequence recovery.

Details

Motivation: Current deep learning approaches for RNA 3D structure inverse design are limited by using native sequence recovery as the main optimization metric, which doesn't guarantee structural fidelity since different sequences can fold into similar structures.

Method: First pre-trains a GNN-based generative diffusion model conditioned on target 3D structure, then fine-tunes with improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics.

Result: The pre-trained diffusion model achieves a 9% improvement in native sequence recovery over SOTA methods. After RL fine-tuning, RIDER improves structural similarity by over 100% across all metrics and discovers designs distinct from native sequences.

Conclusion: RIDER demonstrates that directly optimizing for 3D structural similarity through reinforcement learning leads to better RNA inverse design than traditional sequence recovery approaches.

Abstract: The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using native sequence recovery, which is a limited surrogate for structural fidelity, since different sequences can fold into similar 3D structures and high recovery does not necessarily indicate correct folding. To address this limitation, we propose RIDER, an RNA Inverse DEsign framework with Reinforcement learning that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in native sequence recovery over state-of-the-art methods. Then, we fine-tune the model with an improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.

[305] Illustration of Barren Plateaus in Quantum Computing

Gerhard Stenzel, Tobias Rohe, Michael Kölle, Leo Sünkel, Jonas Stein, Claudia Linnhoff-Popien

Main category: cs.LG

TL;DR: Parameter sharing in variational quantum circuits creates deceptive gradients that mislead classical optimizers, trading off reduced parameter dimensionality for increased optimization difficulty.

Details

Motivation: To investigate the trade-off introduced by parameter sharing in variational quantum circuits, which reduces parameter dimensionality but may create deceptive optimization landscapes that mislead gradient-based optimizers.

Method: Systematic experimental analysis of parameter sharing effects, development of a gradient deceptiveness detection algorithm, and creation of a quantitative framework for measuring optimization difficulty in quantum circuits.

Result: Increased parameter sharing generates more complex solution landscapes with higher gradient magnitudes and deceptiveness ratios, causing degraded convergence for classical optimizers like Adam and SGD.

Conclusion: Parameter sharing improves circuit expressivity but significantly increases landscape deceptiveness, revealing a fundamental mismatch between classical optimization strategies and quantum parameter landscapes shaped by parameter sharing.

Abstract: Variational Quantum Circuits (VQCs) have emerged as a promising paradigm for quantum machine learning in the NISQ era. While parameter sharing in VQCs can reduce the parameter space dimensionality and potentially mitigate the barren plateau phenomenon, it introduces a complex trade-off that has been largely overlooked. This paper investigates how parameter sharing, despite creating better global optima with fewer parameters, fundamentally alters the optimization landscape through deceptive gradients – regions where gradient information exists but systematically misleads optimizers away from global optima. Through systematic experimental analysis, we demonstrate that increasing degrees of parameter sharing generate more complex solution landscapes with heightened gradient magnitudes and measurably higher deceptiveness ratios. Our findings reveal that traditional gradient-based optimizers (Adam, SGD) show progressively degraded convergence as parameter sharing increases, with performance heavily dependent on hyperparameter selection. We introduce a novel gradient deceptiveness detection algorithm and a quantitative framework for measuring optimization difficulty in quantum circuits, establishing that while parameter sharing can improve circuit expressivity by orders of magnitude, this comes at the cost of significantly increased landscape deceptiveness. These insights provide important considerations for quantum circuit design in practical applications, highlighting the fundamental mismatch between classical optimization strategies and quantum parameter landscapes shaped by parameter sharing.

[306] A Scalable Approach to Solving Simulation-Based Network Security Games

Michael Lanier, Yevgeniy Vorobeychik

Main category: cs.LG

TL;DR: MetaDOAR: A meta-controller for scalable multi-agent reinforcement learning on large cyber-networks using partition-aware filtering and Q-value caching

Details

Motivation: To enable scalable multi-agent reinforcement learning on very large cyber-network environments where conventional approaches face scaling issues in terms of memory usage and training time

Method: Uses a lightweight meta-controller with learned partition-aware filtering layer and Q-value caching; learns compact state projection from node embeddings to select subset of devices, then performs focused beam search with critic agent; implements LRU cache with quantized state projection and conservative k-hop cache invalidation

Result: MetaDOAR attains higher player payoffs than state-of-the-art baselines on large network topologies without significant scaling issues in memory usage or training time

Conclusion: Provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems

Abstract: We introduce MetaDOAR, a lightweight meta-controller that augments the Double Oracle / PSRO paradigm with a learned, partition-aware filtering layer and Q-value caching to enable scalable multi-agent reinforcement learning on very large cyber-network environments. MetaDOAR learns a compact state projection from per-node structural embeddings to rapidly score and select a small subset of devices (a top-k partition) on which a conventional low-level actor performs focused beam search utilizing a critic agent. Selected candidate actions are evaluated with batched critic forwards and stored in an LRU cache keyed by a quantized state projection and local action identifiers, dramatically reducing redundant critic computation while preserving decision quality via conservative k-hop cache invalidation. Empirically, MetaDOAR attains higher player payoffs than SOTA baselines on large network topologies, without significant scaling issues in terms of memory usage or training time. This contribution provides a practical, theoretically motivated path to efficient hierarchical policy learning for large-scale networked decision problems.
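The caching idea in the abstract can be sketched as follows. This is a hypothetical illustration, not MetaDOAR's implementation: the class name, the grid width `step`, the `(node, action)` identifier format, and the toy critic are all assumptions, and the computation of the k-hop neighborhood to invalidate is left to the caller.

```python
from collections import OrderedDict


class QuantizedLRUCache:
    """LRU cache for critic values, keyed by (quantized projection, action id)."""

    def __init__(self, capacity, step=0.1):
        self.capacity = capacity
        self.step = step  # grid width used to quantize the state projection
        self._store = OrderedDict()

    def _key(self, projection, action_id):
        # Nearby projections fall into the same grid cell and share a key.
        quantized = tuple(round(x / self.step) for x in projection)
        return quantized, action_id

    def get_or_compute(self, projection, action_id, critic):
        key = self._key(projection, action_id)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        value = critic(projection, action_id)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    def invalidate_nodes(self, nodes):
        """Drop entries whose action targets any node in `nodes`
        (for example, a k-hop neighborhood computed elsewhere)."""
        stale = [k for k in self._store if k[1][0] in nodes]
        for k in stale:
            del self._store[k]


critic_calls = []


def toy_critic(projection, action_id):
    # Stand-in for an expensive critic forward pass (assumption).
    critic_calls.append(action_id)
    return sum(projection)


cache = QuantizedLRUCache(capacity=2)
first = cache.get_or_compute((0.52, 1.01), ("n3", "patch"), toy_critic)
second = cache.get_or_compute((0.49, 0.98), ("n3", "patch"), toy_critic)  # same grid cell
cache.invalidate_nodes({"n3"})
third = cache.get_or_compute((0.52, 1.01), ("n3", "patch"), toy_critic)  # recomputed
```

The second lookup reuses the cached value because both projections quantize to the same cell; after invalidation the critic is called again. A coarser `step` yields more cache hits at the cost of treating more distinct states as identical.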

[307] Steering diffusion models with quadratic rewards: a fine-grained analysis

Ankur Moitra, Andrej Risteski, Dhruv Rohatgi

Main category: cs.LG

TL;DR: The paper provides theoretical analysis of computational tractability for sampling from reward-tilted diffusion models with quadratic rewards, showing efficient algorithms for linear and low-rank positive-definite quadratic tilts, but proving intractability for negative-definite quadratic tilts.

Details

Motivation: Current inference-time algorithms for using pre-trained models as subroutines are heuristics with failure modes, and there's little understanding of when these heuristics can be efficiently improved. The paper aims to provide theoretical foundations for sampling from reward-tilted diffusion models.

Method: Theoretical analysis of computational tractability for sampling from p*(x) ∝ p(x)exp(r(x)) with quadratic rewards r(x) = x^T A x + b^T x. Uses Hubbard-Stratonovich transform as a key ingredient and builds on efficient sampling for linear-reward tilts.

Result: Linear-reward tilts are always efficiently sampleable. Low-rank positive-definite quadratic tilts (rank O(1)) are efficiently sampleable. Negative-definite quadratic tilts are intractable even with rank 1 (with exponentially-large entries).

Conclusion: The paper provides theoretical foundations for inference-time algorithms with diffusion models, establishing clear computational boundaries for different types of quadratic reward functions in reward-tilted sampling.

Abstract: Inference-time algorithms are an emerging paradigm in which pre-trained models are used as subroutines to solve downstream tasks. Such algorithms have been proposed for tasks ranging from inverse problems and guided image generation to reasoning. However, the methods currently deployed in practice are heuristics with a variety of failure modes – and we have very little understanding of when these heuristics can be efficiently improved. In this paper, we consider the task of sampling from a reward-tilted diffusion model – that is, sampling from $p^{\star}(x) \propto p(x) \exp(r(x))$ – given a reward function $r$ and pre-trained diffusion oracle for $p$. We provide a fine-grained analysis of the computational tractability of this task for quadratic rewards $r(x) = x^\top A x + b^\top x$. We show that linear-reward tilts are always efficiently sampleable – a simple result that seems to have gone unnoticed in the literature. We use this as a building block, along with a conceptually new ingredient – the Hubbard-Stratonovich transform – to provide an efficient algorithm for sampling from low-rank positive-definite quadratic tilts, i.e. $r(x) = x^\top A x$ where $A$ is positive-definite and of rank $O(1)$. For negative-definite tilts, i.e. $r(x) = - x^\top A x$ where $A$ is positive-definite, we prove that the problem is intractable even if $A$ is of rank 1 (albeit with exponentially-large entries).
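The role of the Hubbard-Stratonovich transform can be illustrated in the rank-one case. The sketch below uses the standard Gaussian identity $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[e^{tz}] = e^{t^2/2}$; the rank-one choice $A = uu^\top$ and the symbols $u$, $z$, $Z(z)$, and $p_z$ are illustrative assumptions, and the paper's actual algorithm may differ in its details.

```latex
% Rank-one positive quadratic reward: r(x) = x^\top A x with A = u u^\top.
% Apply the Gaussian identity with t = \sqrt{2}\, u^\top x:
\exp\!\big( (u^\top x)^2 \big)
  = \mathbb{E}_{z \sim \mathcal{N}(0,1)}
    \Big[ \exp\!\big( \sqrt{2}\, z\, u^\top x \big) \Big]

% Hence the quadratic tilt is a mixture of linear tilts with b = \sqrt{2}\, z\, u:
p^{\star}(x) \;\propto\; p(x) \exp\!\big( (u^\top x)^2 \big)
  = \int \varphi(z)\, Z(z)\, p_z(x)\, dz,
\qquad
p_z(x) = \frac{p(x) \exp(\sqrt{2}\, z\, u^\top x)}{Z(z)},
\quad
Z(z) = \mathbb{E}_{x \sim p}\big[ \exp(\sqrt{2}\, z\, u^\top x) \big]
```

Here $\varphi$ is the standard normal density. Each component $p_z$ is a linear-reward tilt, which is the case the abstract reports as efficiently sampleable; the mixing weights are proportional to $\varphi(z) Z(z)$ rather than to $\varphi(z)$ alone.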

[308] MoDE-Boost: Boosting Shared Mobility Demand with Edge-Ready Prediction Models

Antonios Tziorvas, George S. Theodoropoulos, Yannis Theodoridis

Main category: cs.LG

TL;DR: Urban demand forecasting for shared micro-mobility using gradient boosting models with classification and regression variations for 5-min to 1-hour predictions.

Details

Motivation: Urban demand forecasting is critical for optimizing routing, dispatching, and congestion management in Intelligent Transportation Systems. The paper aims to address the challenge of predicting spatial and temporal demand patterns for shared micro-mobility services to improve efficiency and sustainability in rapidly urbanizing cities.

Method: Proposes two gradient boosting model variations: one for classification and one for regression, both capable of generating demand forecasts at various temporal horizons (5 minutes to 1 hour). The approach integrates temporal and contextual features for accurate predictions.

Result: Evaluated using open shared mobility data from e-scooter and e-bike networks in five metropolitan areas. The approach was compared with state-of-the-art methods and a Generative AI-based model, demonstrating effectiveness in capturing complexities of modern urban mobility.

Conclusion: The methodology offers novel insights for urban micro-mobility management, helping tackle challenges from rapid urbanization and contributing to more sustainable, efficient, and livable cities.

Abstract: Urban demand forecasting plays a critical role in optimizing routing, dispatching, and congestion management within Intelligent Transportation Systems. By leveraging data fusion and analytics techniques, traffic demand forecasting serves as a key intermediate measure for identifying emerging spatial and temporal demand patterns. In this paper, we tackle this challenge by proposing two gradient boosting model variations, one for classification and one for regression, both capable of generating demand forecasts at various temporal horizons, from 5 minutes up to one hour. Our overall approach effectively integrates temporal and contextual features, enabling accurate predictions that are essential for improving the efficiency of shared (micro-) mobility services. To evaluate its effectiveness, we utilize open shared mobility data derived from e-scooter and e-bike networks in five metropolitan areas. These real-world datasets allow us to compare our approach with state-of-the-art methods as well as a Generative AI-based model, demonstrating its effectiveness in capturing the complexities of modern urban mobility. Ultimately, our methodology offers novel insights on urban micro-mobility management, helping to tackle the challenges arising from rapid urbanization and thus contributing to more sustainable, efficient, and livable cities.

[309] Sequential Membership Inference Attacks

Thomas Michel, Debabrota Basu, Emilie Kaufmann

Main category: cs.LG

TL;DR: Optimal membership inference attack (SeMI*) that exploits model update sequences to detect inserted target data, providing tighter privacy audits than static model attacks.

Details

Motivation: AI models undergo multiple updates during their lifecycle, creating opportunities to exploit model dynamics for stronger membership inference attacks and tighter privacy audits. Existing attacks focus on static models, but using sequences of model updates can increase attack power.

Method: Developed SeMI* (Sequential Membership Inference) attack that uses sequences of model updates to identify presence of target data inserted at specific update steps. Analyzed optimal power for empirical mean computation with finite samples, with and without privacy constraints.

Result: SeMI* avoids dilution of MI signals that occurs in attacks on final models where signals vanish as training data accumulates. Enables adversaries to tune insertion time and canary for tighter privacy audits. Experiments show practical variants outperform baselines across data distributions and DP-SGD trained models.

Conclusion: Exploiting model update sequences provides stronger membership inference attacks and tighter privacy audits than static model approaches, with SeMI* offering optimal attack power and practical effectiveness.

Abstract: Modern AI models are not static. They go through multiple updates in their lifecycles. Thus, exploiting the model dynamics to create stronger Membership Inference (MI) attacks and tighter privacy audits are timely questions. Though the literature empirically shows that using a sequence of model updates can increase the power of MI attacks, rigorous analysis of the ‘optimal’ MI attacks is limited to static models with infinite samples. Hence, we develop an ‘optimal’ MI attack, SeMI*, that uses the sequence of model updates to identify the presence of a target inserted at a certain update step. For the empirical mean computation, we derive the optimal power of SeMI*, while accessing a finite number of samples with or without privacy. Our results retrieve the existing asymptotic analysis. We observe that having access to the model sequence avoids the dilution of MI signals unlike the existing attacks on the final model, where the MI signal vanishes as training data accumulates. Furthermore, an adversary can use SeMI* to tune both the insertion time and the canary to yield tighter privacy audits. Finally, we conduct experiments across data distributions and models trained or fine-tuned with DP-SGD demonstrating that practical variants of SeMI* lead to tighter privacy audits than the baselines.
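The dilution effect mentioned in the abstract can be seen in a noise-free toy version of the empirical-mean setting. The sketch below is not SeMI*; it only shows, under the assumption that the released "model" at each step is the running mean of all data so far, why consecutive releases carry a stronger trace of an inserted point than the final release alone. The function names and data are hypothetical.

```python
def running_means(batches):
    """Return (mean, count) after each batch is added to the accumulated data."""
    total, count, releases = 0.0, 0, []
    for batch in batches:
        total += sum(batch)
        count += len(batch)
        releases.append((total / count, count))
    return releases


def batch_sum_from_updates(prev, curr):
    """Recover the sum of the batch added between two consecutive releases."""
    (prev_mean, prev_count), (curr_mean, curr_count) = prev, curr
    return curr_count * curr_mean - prev_count * prev_mean


# A canary with value 10.0 is inserted in the second of three update steps.
batches = [[0.0] * 100, [0.0] * 99 + [10.0], [0.0] * 800]
releases = running_means(batches)

final_mean = releases[-1][0]  # canary contribution shrinks as data accumulates
recovered = batch_sum_from_updates(releases[0], releases[1])  # canary still visible
```

In the final release the canary shifts the mean by only 10/1000, while the difference between the first two releases isolates the sum of the batch that contained it.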

[310] Predicting The Cop Number Using Machine Learning

Meagan Mann, Christian Muise, Erin Meger

Main category: cs.LG

TL;DR: Machine learning models can predict graph cop numbers from structural properties with high accuracy, with tree-based models and graph neural networks performing well without extensive feature engineering.

Details

Motivation: The cop number problem in graph theory is computationally difficult to solve exactly, especially for large graphs. The paper aims to explore whether machine learning methods can provide accurate predictions of cop numbers from graph structural properties, potentially offering scalable approximations where traditional algorithms are infeasible.

Method: The study investigates both classical machine learning methods (particularly tree-based models) and graph neural networks for predicting graph cop numbers. The approach involves using graph structural properties as features for classical models, while graph neural networks operate directly on graph structures without explicit feature engineering. The paper also includes interpretability analysis to identify which structural properties most strongly influence predictions.

Result: Tree-based machine learning models achieve high accuracy in predicting cop numbers despite class imbalance. Graph neural networks achieve comparable results without requiring explicit feature engineering. Interpretability analysis reveals that the most predictive features relate to node connectivity, clustering, clique structure, and width parameters, which aligns with known theoretical results in graph theory.

Conclusion: Machine learning approaches can complement existing cop number algorithms by providing scalable approximations for large graphs where exact computation is infeasible. The success of both classical models and graph neural networks suggests that structural properties contain sufficient information for accurate cop number prediction.

Abstract: Cops and Robbers is a pursuit evasion game played on a graph, first introduced independently by Quilliot and by Nowakowski and Winkler over four decades ago. A main interest in the recent literature is identifying the cop number of graph families. The cop number of a graph, $c(G)$, is defined as the minimum number of cops required to guarantee capture of the robber. Determining the cop number is computationally difficult and exact algorithms for this are typically restricted to small graph families. This paper investigates whether classical machine learning methods and graph neural networks can accurately predict a graph’s cop number from its structural properties and identify which properties most strongly influence this prediction. Of the classical machine learning models, tree-based models achieve high accuracy in prediction despite class imbalance, whereas graph neural networks achieve comparable results without explicit feature engineering. The interpretability analysis shows that the most predictive features are related to node connectivity, clustering, clique structure, and width parameters, which aligns with known theoretical results. Our findings suggest that machine learning approaches can be used in complement with existing cop number algorithms by offering scalable approximations where computation is infeasible.

[311] Optimizer choice matters for the emergence of Neural Collapse

Jim Zhao, Tin Sum Cheng, Wojciech Masarczyk, Aurelien Lucchi

Main category: cs.LG

TL;DR: Neural Collapse (NC) emergence depends on optimizer choice, particularly weight-decay coupling; NC0 metric reveals SGD, SignGD with coupled weight decay (Adam-like), and SignGD with decoupled weight decay (AdamW-like) have different NC dynamics.

Details

Motivation: Existing Neural Collapse analyses ignore optimizer role, assuming NC is universal across optimization methods. This work challenges that assumption to show optimizer choice critically affects NC emergence.

Method: Introduces NC0 diagnostic metric whose convergence to zero is necessary for NC. Theoretically analyzes NC0 dynamics for SGD, SignGD with coupled weight decay (Adam), and SignGD with decoupled weight decay (AdamW). Conducts 3,900 training runs across datasets, architectures, optimizers, and hyperparameters.

Result: Proves NC cannot emerge under decoupled weight decay in adaptive optimizers (AdamW). Shows SGD, SignGD with coupled weight decay, and SignGD with decoupled weight decay exhibit qualitatively different NC0 dynamics. Demonstrates momentum accelerates NC emergence with SGD. Empirical experiments confirm theoretical findings.

Conclusion: First theoretical explanation for optimizer-dependent NC emergence, highlighting overlooked role of weight-decay coupling in shaping optimizer implicit biases. NC is not universal but depends on optimization method details.

Abstract: Neural Collapse (NC) refers to the emergence of highly symmetric geometric structures in the representations of deep neural networks during the terminal phase of training. Despite its prevalence, the theoretical understanding of NC remains limited. Existing analyses largely ignore the role of the optimizer, thereby suggesting that NC is universal across optimization methods. In this work, we challenge this assumption and demonstrate that the choice of optimizer plays a critical role in the emergence of NC. The phenomenon is typically quantified through NC metrics, which, however, are difficult to track and analyze theoretically. To overcome this limitation, we introduce a novel diagnostic metric, NC0, whose convergence to zero is a necessary condition for NC. Using NC0, we provide theoretical evidence that NC cannot emerge under decoupled weight decay in adaptive optimizers, as implemented in AdamW. Concretely, we prove that SGD, SignGD with coupled weight decay (a special case of Adam), and SignGD with decoupled weight decay (a special case of AdamW) exhibit qualitatively different NC0 dynamics. Also, we show the accelerating effect of momentum on NC (beyond convergence of train loss) when trained with SGD, being the first result concerning momentum in the context of NC. Finally, we conduct extensive empirical experiments consisting of 3,900 training runs across various datasets, architectures, optimizers, and hyperparameters, confirming our theoretical results. This work provides the first theoretical explanation for optimizer-dependent emergence of NC and highlights the overlooked role of weight-decay coupling in shaping the implicit biases of optimizers.
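The distinction between coupled and decoupled weight decay can be written out for SignGD in a few lines. This is a minimal sketch: the function names, learning rate, and weight-decay value are illustrative assumptions, and the update forms follow the usual reading of Adam-style (decay added to the gradient before the sign) versus AdamW-style (decay applied directly to the weights) regularization.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signgd_coupled(weights, grads, lr, wd):
    """Coupled decay (Adam-like): the decay term passes through the sign."""
    return [w - lr * sign(g + wd * w) for w, g in zip(weights, grads)]


def signgd_decoupled(weights, grads, lr, wd):
    """Decoupled decay (AdamW-like): the decay term bypasses the sign."""
    return [w - lr * (sign(g) + wd * w) for w, g in zip(weights, grads)]


weights = [2.0, -1.0]
grads = [0.5, 0.05]
coupled = signgd_coupled(weights, grads, lr=0.1, wd=0.1)
decoupled = signgd_decoupled(weights, grads, lr=0.1, wd=0.1)
```

For the second coordinate the two rules move in opposite directions: with coupled decay the regularization term flips the sign of the step, while with decoupled decay the step follows the gradient sign and the weight is shrunk separately.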

[312] Factorization Machine with Quadratic-Optimization Annealing for RNA Inverse Folding and Evaluation of Binary-Integer Encoding and Nucleotide Assignment

Shuta Kikuchi, Shu Tanaka

Main category: cs.LG

TL;DR: FMQA framework for RNA inverse folding with analysis of nucleotide-to-integer assignments and binary encoding methods, showing domain-wall encoding with boundary assignments improves thermodynamic stability.

Details

Motivation: RNA inverse folding requires efficient methods with limited evaluations; existing approaches need many sequence evaluations, limiting practical application when experimental validation is costly. Need to understand how encoding choices affect optimization performance.

Method: Proposes FMQA (factorization machine with quadratic-optimization annealing) framework for RNA inverse folding. Evaluates all 24 possible nucleotide-to-integer assignments (0-3) combined with four binary-integer encoding methods (one-hot, domain-wall, binary, unary). Analyzes effects on surrogate model structure and search landscape.

Result: One-hot and domain-wall encodings outperform binary and unary encodings in normalized ensemble defect. Domain-wall encoding with nucleotides assigned to boundary integers (0 and 3) promotes enrichment of guanine and cytosine in stem regions, leading to more thermodynamically stable secondary structures than one-hot encoding.

Conclusion: FMQA framework effectively solves RNA inverse folding with limited evaluations. Encoding choices significantly impact solution quality, with domain-wall encoding and strategic nucleotide assignments improving thermodynamic stability through structural optimization.

Abstract: The RNA inverse folding problem aims to identify nucleotide sequences that preferentially adopt a given target secondary structure. While various heuristic and machine learning-based approaches have been proposed, many require a large number of sequence evaluations, which limits their applicability when experimental validation is costly. We propose a method to solve the problem using a factorization machine with quadratic-optimization annealing (FMQA). FMQA is a discrete black-box optimization method reported to obtain high-quality solutions with a limited number of evaluations. Applying FMQA to the problem requires converting nucleotides into binary variables. However, the influence of integer-to-nucleotide assignments and binary-integer encoding on the performance of FMQA has not been thoroughly investigated, even though such choices determine the structure of the surrogate model and the search landscape, and thus can directly affect solution quality. Therefore, this study aims both to establish a novel FMQA framework for RNA inverse folding and to analyze the effects of these assignments and encoding methods. We evaluated all 24 possible assignments of the four nucleotides to the ordered integers (0-3), in combination with four binary-integer encoding methods. Our results demonstrated that one-hot and domain-wall encodings outperform binary and unary encodings in terms of the normalized ensemble defect value. In domain-wall encoding, nucleotides assigned to the boundary integers (0 and 3) appeared with higher frequency. In the RNA inverse folding problem, assigning guanine and cytosine to these boundary integers promoted their enrichment in stem regions, which led to more thermodynamically stable secondary structures than those obtained with one-hot encoding.
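The four binary-integer encodings and the 24 nucleotide assignments can be sketched as below. The definitions follow the conventional forms of these encodings (domain-wall as a run of ones followed by zeros, unary as a count of ones) and are assumed to match the paper's usage; the function names and the example assignment are hypothetical.

```python
from itertools import permutations


def one_hot(k, n=4):
    """n bits with a single 1 at position k."""
    return [1 if i == k else 0 for i in range(n)]


def domain_wall(k, n=4):
    """n - 1 bits: k ones followed by zeros, so 0 and n - 1 are the uniform strings."""
    return [1 if i < k else 0 for i in range(n - 1)]


def binary(k, bits=2):
    """Standard base-2 representation, most significant bit first."""
    return [(k >> i) & 1 for i in reversed(range(bits))]


def unary_decode(bits):
    """Unary: the integer is the number of ones, so several strings share a value."""
    return sum(bits)


# All 4! = 24 assignments of the four nucleotides to the integers 0..3.
assignments = [dict(zip(order, range(4))) for order in permutations("ACGU")]

# One example assignment placing G and C on the boundary integers 0 and 3.
example = {"G": 0, "A": 1, "U": 2, "C": 3}
encoded = {base: domain_wall(k) for base, k in example.items()}
```

In the domain-wall encoding the boundary integers map to the all-zero and all-one strings, which are the values the abstract reports as appearing with higher frequency.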

[313] Neighborhood Stability as a Measure of Nearest Neighbor Searchability

Thomas Vecchiato, Sebastian Bruch

Main category: cs.LG

TL;DR: The paper introduces two stability measures (clustering-NSM and point-NSM) to predict the suitability of clustering-based approximate nearest neighbor search for high-dimensional datasets without requiring actual search experiments.

Details

Motivation: There's a lack of analytical tools to determine whether clustering-based approximate nearest neighbor search (ANNS) will work well for a given dataset. Current approaches require running actual search experiments to evaluate performance, which is time-consuming.

Method: Proposes two measures: 1) Clustering-NSM (internal clustering quality measure) that predicts ANNS accuracy, and 2) Point-NSM (dataset clusterability measure) that predicts clustering-NSM. Both are based on nearest neighbor relationships rather than distances, making them applicable to various distance functions including inner product.

Result: The measures allow determining dataset “searchability” for clustering-based ANNS using only the data points themselves, without running actual search experiments. The relationship-based approach makes the measures broadly applicable across different distance metrics.

Conclusion: The paper provides analytical tools to assess the suitability of clustering-based ANNS for high-dimensional datasets, addressing a significant gap in the field by enabling dataset searchability evaluation without experimental overhead.

Abstract: Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset – what we call “searchability.” To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality – a function of a clustering of a dataset – that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability – a function of the dataset itself – that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.
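The partition-then-probe search described at the start of the abstract can be sketched as follows. This illustrates generic clustering-based ANNS only; it does not implement clustering-NSM or point-NSM, whose definitions are not given here. The function names, the `nprobe` parameter, and the use of squared Euclidean distance are assumptions for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_partitions(points, centroids):
    """Assign each point index to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for idx, p in enumerate(points):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        partitions[nearest].append(idx)
    return partitions


def search(query, points, centroids, partitions, k=1, nprobe=1):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in partitions[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, points[i]))[:k]


def recall(query, points, centroids, partitions, k=1, nprobe=1):
    """Fraction of the exact top-k neighbours found by the approximate search."""
    exact = set(sorted(range(len(points)), key=lambda i: sq_dist(query, points[i]))[:k])
    approx = set(search(query, points, centroids, partitions, k, nprobe))
    return len(exact & approx) / k
```

Measuring `recall` this way requires running queries against the index, which is the experimental overhead the proposed measures are meant to avoid.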

[314] Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition

Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao

Main category: cs.LG

TL;DR: A foundation model for molecular analog generation using matched molecular pairs transformations with controllable prompting and retrieval-augmented guidance.

Details

Motivation: Existing ML approaches for molecular analog generation either lack edit controllability or are limited to small models and restricted settings, while medicinal chemists routinely use local chemical edits (MMPs) for analog design.

Method: Proposes a variable-to-variable formulation for analog generation, trains a foundation model on large-scale MMP transformations (MMPTs), develops prompting mechanisms for user-specified transformation patterns, and introduces MMPT-RAG - a retrieval-augmented framework using external reference analogs as contextual guidance.

Result: Experiments on chemical corpora and patent datasets demonstrate improved diversity, novelty, controllability, and recovery of realistic analog structures in practical discovery scenarios.

Conclusion: The approach enables practical control over molecular analog generation with improved performance over existing methods, showing promise for real-world drug discovery applications.

Abstract: Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let the users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.

[315] Protecting the Undeleted in Machine Unlearning

Aloni Cohen, Refael Kohen, Kobbi Nissim, Uri Stemmer

Main category: cs.LG

TL;DR: Machine unlearning aiming for perfect retraining creates privacy risks for undeleted data, enabling reconstruction attacks; new security definition proposed to protect undeleted data while supporting essential functionalities.

Details

Motivation: The paper addresses privacy risks in machine unlearning approaches that aim for "perfect retraining" (emulating the model that would have been obtained if deleted data were never included). The authors show that such approaches can leak information about undeleted data points, creating significant privacy vulnerabilities.

Method: The authors present a reconstruction attack demonstrating that perfect retraining mechanisms allow adversaries controlling only ω(1) data points to reconstruct almost the entire dataset through deletion requests. They survey existing machine unlearning definitions, analyze their vulnerabilities, and propose a new security definition specifically designed to protect undeleted data from leakage caused by deletions.

Result: The reconstruction attack shows that perfect retraining approaches are vulnerable to privacy breaches. Existing definitions are either susceptible to such attacks or too restrictive for basic functionalities. The proposed new security definition successfully protects undeleted data while permitting essential functionalities like bulletin boards, summations, and statistical learning.

Conclusion: Machine unlearning approaches aiming for perfect retraining carry significant privacy risks for undeleted data. A new security framework is needed that specifically protects undeleted data from leakage caused by deletions, and the proposed definition achieves this while supporting practical functionalities.

Abstract: Machine unlearning aims to remove specific data points from a trained model, often striving to emulate “perfect retraining”, i.e., producing the model that would have been obtained had the deleted data never been included. We demonstrate that this approach, and security definitions that enable it, carry significant privacy risks for the remaining (undeleted) data points. We present a reconstruction attack showing that for certain tasks, which can be computed securely without deletions, a mechanism adhering to perfect retraining allows an adversary controlling merely $\omega(1)$ data points to reconstruct almost the entire dataset simply by issuing deletion requests. We survey existing definitions for machine unlearning, showing they are either susceptible to such attacks or too restrictive to support basic functionalities like exact summation. To address this problem, we propose a new security definition that specifically safeguards undeleted data against leakage caused by the deletion of other points. We show that our definition permits several essential functionalities, such as bulletin boards, summations, and statistical learning.

[316] Causality is Key for Interpretability Claims to Generalise

Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar

Main category: cs.LG

TL;DR: Causal inference framework for LLM interpretability that clarifies what evidence supports different types of claims (associations, interventions, counterfactuals) and how causal representation learning operationalizes this hierarchy.

Details

Motivation: Current LLM interpretability research suffers from findings that don't generalize and causal interpretations that exceed the available evidence. The paper aims to establish a rigorous causal framework to ensure interpretability findings are properly supported by evidence.

Method: Proposes using Pearl’s causal hierarchy to structure interpretability claims, distinguishing between observational associations, interventional effects, and counterfactual claims. Shows how causal representation learning operationalizes this hierarchy by specifying which variables are recoverable from activations and under what assumptions.

Result: Develops a diagnostic framework that helps practitioners select methods and evaluations that match claims to evidence, ensuring findings generalize properly. Clarifies what different types of interpretability studies can actually justify.

Conclusion: A causal inference framework is essential for rigorous LLM interpretability research to avoid over-interpretation and ensure findings are properly supported by evidence and generalize appropriately.

Abstract: Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl’s causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims – i.e., asking what the model output would have been for the same prompt under an unobserved intervention – remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.

[317] Knowledge-Embedded Latent Projection for Robust Representation Learning

Weijing Tang, Ming Yuan, Zongqi Xia, Tianxi Cai

Main category: cs.LG

TL;DR: A knowledge-embedded latent projection model that uses semantic side information to regularize representation learning for high-dimensional discrete data matrices, addressing challenges in imbalanced regimes like EHR data.

Details

Motivation: Latent space models struggle with imbalanced data regimes where one matrix dimension is much larger than the other, particularly in EHR applications where cohort sizes are limited but feature spaces are extremely large due to comprehensive medical coding systems.

Method: Proposes a knowledge-embedded latent projection model that leverages external semantic embeddings (like pre-trained clinical concept embeddings) to regularize representation learning. Models column embeddings as smooth functions of semantic embeddings via RKHS mapping, with a two-step estimation procedure combining kernel PCA for semantically guided subspace construction and scalable projected gradient descent.

Result: Establishes estimation error bounds characterizing the trade-off between statistical error and kernel projection approximation error, provides local convergence guarantees for the non-convex optimization procedure, and demonstrates effectiveness through extensive simulations and a real-world EHR application.

Conclusion: The proposed method effectively addresses imbalanced data challenges in latent space modeling by incorporating semantic side information, offering both theoretical guarantees and practical utility for applications like EHR analysis.

Abstract: Latent space models are widely used for analyzing high-dimensional discrete data matrices, such as patient-feature matrices in electronic health records (EHRs), by capturing complex dependence structures through low-dimensional embeddings. However, estimation becomes challenging in the imbalanced regime, where one matrix dimension is much larger than the other. In EHR applications, cohort sizes are often limited by disease prevalence or data availability, whereas the feature space remains extremely large due to the breadth of medical coding systems. Motivated by the increasing availability of external semantic embeddings, such as pre-trained embeddings of clinical concepts in EHRs, we propose a knowledge-embedded latent projection model that leverages semantic side information to regularize representation learning. Specifically, we model column embeddings as smooth functions of semantic embeddings via a mapping in a reproducing kernel Hilbert space. We develop a computationally efficient two-step estimation procedure that combines semantically guided subspace construction via kernel principal component analysis with scalable projected gradient descent. We establish estimation error bounds that characterize the trade-off between statistical error and approximation error induced by the kernel projection. Furthermore, we provide local convergence guarantees for our non-convex optimization procedure. Extensive simulation studies and a real-world EHR application demonstrate the effectiveness of the proposed method.

[318] Understanding Transformer Optimization via Gradient Heterogeneity

Akiyoshi Tomihari, Issei Sato

Main category: cs.LG

TL;DR: Transformers rely on Adam over SGD due to gradient heterogeneity; Adam’s coordinate-wise normalization makes it less sensitive to gradient variations, acting like soft SignSGD; Post-LN architectures show strong gradient heterogeneity.

Details

Motivation: Transformers are difficult to optimize with SGD and largely rely on Adam, but the reasons behind Adam's superior performance remain poorly understood. The study aims to analyze Transformer optimization through the lens of gradient heterogeneity.

Method: Theoretical analysis of gradient heterogeneity and Hessian heterogeneity effects on optimization methods. Investigation of how Adam’s coordinate-wise normalization makes it less sensitive to gradient variations (acting as soft SignSGD). Analysis of gradient heterogeneity origins in Transformer architectures, particularly layer normalization placement. Experimental validation through fine-tuning Transformers in NLP and vision domains.

Result: Gradient heterogeneity degrades SGD convergence but affects sign-based methods less. Adam’s performance advantage comes from its sign-based nature via coordinate-wise normalization. Post-LN architectures exhibit particularly pronounced gradient heterogeneity. Experimental results validate theoretical analysis across NLP and vision domains.

Conclusion: Adam’s superiority over SGD for Transformers stems from its reduced sensitivity to gradient heterogeneity, which is inherent in Transformer architectures especially with Post-LN designs. Understanding gradient heterogeneity provides insights for better optimization of Transformer models.

Abstract: Transformers are difficult to optimize with stochastic gradient descent (SGD) and largely rely on adaptive optimizers such as Adam. Despite their empirical success, the reasons behind Adam’s superior performance over SGD remain poorly understood. In this study, we analyze the optimization of Transformer models through the lens of gradient heterogeneity, defined as the variation in gradient norms across parameter blocks. We provide a theoretical analysis showing that gradient heterogeneity, together with Hessian heterogeneity, degrades the convergence of gradient-based methods such as SGD, while sign-based methods are substantially less sensitive to this effect. Adam’s coordinate-wise normalization makes its update directions depend mainly on gradient signs, so Adam can be interpreted as a soft variant of SignSGD. Our analysis uses the fact that SGD and SignSGD follow steepest descent directions under different norms, and derives upper bounds on the iteration complexity with implications for learning rate scaling in SignSGD. We further investigate the origin of gradient heterogeneity in Transformer architectures and show that it is strongly influenced by the placement of layer normalization, with Post-LN architectures exhibiting particularly pronounced heterogeneity. Experimental results from fine-tuning Transformers in both NLP and vision domains validate our theoretical analysis. Code is available at https://github.com/tom4649/gradient-heterogeneity.
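The "soft SignSGD" reading of Adam can be checked on a toy gradient. The sketch below is not taken from the linked repository. It writes out the textbook Adam moment updates without bias correction and shows that, with both decay rates and epsilon set to zero, the update direction reduces to the sign of each gradient coordinate. The helper `block_grad_norms` is a hypothetical illustration of measuring per-block gradient norms.

```python
import math


def adam_direction(grad, m, v, beta1, beta2, eps):
    """One Adam moment update (bias correction omitted for brevity).

    Returns the update direction m / (sqrt(v) + eps) and the new moments.
    """
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    direction = [mi / (math.sqrt(vi) + eps) for mi, vi in zip(new_m, new_v)]
    return direction, new_m, new_v


def sign(x):
    """Sign of a scalar: -1, 0, or 1."""
    return (x > 0) - (x < 0)


def block_grad_norms(blocks):
    """L2 norm of the gradient for each named parameter block."""
    return {name: math.sqrt(sum(g * g for g in grads)) for name, grads in blocks.items()}
```

For a gradient such as `[100.0, -0.001, 3.0]`, plain SGD moves almost entirely along the first coordinate, while the normalized direction gives every coordinate a step of unit magnitude. This is the sense in which coordinate-wise normalization is less sensitive to large differences in gradient scale.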

[319] Forget Forgetting: Continual Learning in a World of Abundant Memory

Dongkyu Cho, Taesup Moon, Rumi Chunara, Kyunghyun Cho, Sungmin Cha

Main category: cs.LG

TL;DR: Paper challenges traditional continual learning focus on minimizing memory, proposing weight space consolidation for better stability-plasticity trade-off when memory is abundant but GPU time is limited.

Details

Motivation: Traditional continual learning focuses on minimizing exemplar memory, but modern systems face GPU time as the primary bottleneck, not storage. The paper investigates a more realistic regime where memory is abundant enough to mitigate forgetting but full retraining remains expensive.

Method: Proposes Weight Space Consolidation, a lightweight method combining (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on class-incremental learning with image classifiers and continual instruction tuning with large language models.

Result: The approach outperforms strong baselines while matching the low computational cost of replay. Simple replay baselines outperform state-of-the-art methods at a fraction of GPU cost in this new regime.

Conclusion: Challenges long-standing CL assumptions and establishes a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor, shifting focus from stability to plasticity.

Abstract: Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical “middle ground”, we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full-retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.

[320] GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models

Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren

Main category: cs.LG

TL;DR: GEPC is a training-free OOD detection method for diffusion models that measures score equivariance consistency under group transformations, detecting OOD data through equivariance breaking even when score magnitude remains unchanged.

Details

Motivation: Current diffusion-based OOD detectors focus on score magnitude or local geometry but ignore equivariance properties. The authors aim to exploit the fact that diffusion models often inherit approximate equivariances from ID data and convolutional backbones, and that OOD data may break these equivariances.

Method: GEPC measures how consistently learned scores transform under finite group transformations (flips, rotations, circular shifts). It computes an equivariance-residual functional averaged over the group, requiring only score evaluations without additional training. The method produces interpretable equivariance-breaking maps.

Result: GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines on OOD image benchmarks. On high-resolution synthetic aperture radar imagery, it yields strong target-background separation and interpretable equivariance-breaking maps.

Conclusion: GEPC provides an effective, training-free approach for OOD detection by measuring equivariance consistency in diffusion models, offering computational efficiency and interpretability through equivariance-breaking maps.

Abstract: Diffusion models learn a time-indexed score field $\mathbf{s}_\theta(\mathbf{x}_t,t)$ that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group $\mathcal{G}$, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance-residual functional over $\mathcal{G}$, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps. Code is available at https://github.com/RouzAY/gepc-diffusion/.
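The idea of averaging an equivariance residual over a finite group can be sketched on small 2D arrays. The exact GEPC functional is not given in the abstract, so the residual below, the mean over the group of the squared difference between s(g·x) and g·s(x), is an assumed form chosen for illustration. The time index is dropped, and both score functions are toy stand-ins rather than learned models.

```python
def identity(img):
    return [list(row) for row in img]


def hflip(img):
    """Flip left-right."""
    return [list(row[::-1]) for row in img]


def vflip(img):
    """Flip top-bottom."""
    return [list(row) for row in img[::-1]]


def rot180(img):
    """Rotate by 180 degrees (hflip composed with vflip)."""
    return hflip(vflip(img))


# {identity, hflip, vflip, rot180} is closed under composition, so it is a finite group.
GROUP = [identity, hflip, vflip, rot180]


def equivariance_residual(score_fn, x, group=GROUP):
    """Mean over the group of || s(g.x) - g.s(x) ||^2 (assumed residual form)."""
    total = 0.0
    for g in group:
        lhs = score_fn(g(x))
        rhs = g(score_fn(x))
        total += sum((a - b) ** 2 for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
    return total / len(group)


def gaussian_score(x):
    """Score of a standard Gaussian, s(x) = -x; equivariant to flips and rotations."""
    return [[-v for v in row] for row in x]


def column_biased_score(x):
    """Toy score whose scale depends on the column index; breaks flip equivariance."""
    return [[-v * (1 + j) for j, v in enumerate(row)] for row in x]
```

The equivariant toy score yields a residual of exactly zero, while the column-biased score yields a positive residual. Only score evaluations are needed, matching the training-free setting described above.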

[321] FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels

Seunghun Yu, Jin-Hyun Ahn, Joonhyuk Kang

Main category: cs.LG

TL;DR: FedEFC: A federated learning method that addresses noisy labels through prestopping and loss correction techniques, with theoretical guarantees and empirical improvements.

Details

Motivation: Noisy labels in federated learning degrade model performance due to heterogeneous data distributions and communication constraints, creating a need for robust methods that can handle label noise in decentralized settings.

Method: Proposes FedEFC with two key techniques: (1) prestopping - dynamically halts training at optimal points to prevent overfitting to mislabeled data, and (2) loss correction - adjusts model updates to account for label noise, specifically tailored for FL challenges like data heterogeneity.

Result: Extensive experiments show FedEFC consistently outperforms existing FL techniques, achieving up to 41.64% relative performance improvement over existing loss correction methods, particularly effective under heterogeneous data settings.

Conclusion: FedEFC provides an effective solution for noisy label handling in federated learning with theoretical foundations and practical performance improvements, addressing key challenges in decentralized training with label noise.

Abstract: Federated Learning (FL) is a powerful framework for privacy-preserving distributed learning. It enables multiple clients to collaboratively train a global model without sharing raw data. However, handling noisy labels in FL remains a major challenge due to heterogeneous data distributions and communication constraints, which can severely degrade model performance. To address this issue, we propose FedEFC, a novel method designed to tackle the impact of noisy labels in FL. FedEFC mitigates this issue through two key techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point, and (2) loss correction, which adjusts model updates to account for label noise. In particular, we develop an effective loss correction tailored to the unique challenges of FL, including data heterogeneity and decentralized training. Furthermore, we provide a theoretical analysis, leveraging the composite proper loss property, to demonstrate that the FL objective function under noisy label distributions can be aligned with the clean label distribution. Extensive experimental results validate the effectiveness of our approach, showing that it consistently outperforms existing FL techniques in mitigating the impact of noisy labels, particularly under heterogeneous data settings (e.g., achieving up to 41.64% relative performance improvement over the existing loss correction method).
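For context on the loss-correction component, the sketch below shows the generic forward-correction idea for a single example. It is not FedEFC's enhanced variant, whose details are not given in the abstract. The transition matrix `T`, with `T[i][j]` read as the probability that clean class `i` is observed as noisy class `j`, is assumed to be known here; in practice it would have to be estimated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Plain negative log-likelihood of the given label."""
    return -math.log(softmax(logits)[label])


def forward_corrected_loss(logits, noisy_label, T):
    """Negative log-likelihood of the noisy label under the noise model T.

    The predicted clean-class probabilities are pushed through T before the
    loss is taken, so the model is scored against the noisy label distribution.
    """
    p_clean = softmax(logits)
    p_noisy = sum(p_clean[i] * T[i][noisy_label] for i in range(len(p_clean)))
    return -math.log(p_noisy)
```

With an identity transition matrix the corrected loss coincides with plain cross-entropy. With symmetric noise, a confident prediction that disagrees with the observed label is penalized less, because the label may itself be wrong.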

[322] WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Pashmina Cameron, Tianyi Chen

Main category: cs.LG

TL;DR: WINA is a training-free sparse activation framework for LLMs that jointly considers hidden state magnitudes and weight matrix norms to achieve optimal approximation error bounds and superior performance.

Details

Motivation: To address the computational demands of LLMs, the paper aims to improve training-free sparse activation methods that currently rely only on hidden state magnitudes, leading to high approximation errors and suboptimal accuracy.

Method: Proposes WINA (Weight Informed Neuron Activation), a training-free sparse activation framework that jointly considers both hidden state magnitudes and the column-wise ℓ₂-norms of weight matrices to determine neuron activation.

Result: WINA achieves tighter theoretical approximation error bounds and empirically outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels across diverse LLM architectures and datasets.

Conclusion: WINA establishes a new performance frontier for training-free sparse activation in LLM inference, advancing efficient inference methods and setting a robust baseline for future research.

Abstract: The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. Recent approaches such as Mixture-of-Experts (MoE) leverage selective activation but require specialized training, whereas training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
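The difference between magnitude-only and weight-informed selection can be sketched for a single linear layer y = Wx. This is not the code from the linked repository. Scoring each input coordinate by the product of its absolute value and the ℓ2-norm of the matching weight column is an assumed reading of "jointly considers", and all function names are hypothetical.

```python
import math


def column_norms(W):
    """L2 norm of each column of W, where W is a list of rows (out x in)."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]


def topk_mask(scores, k):
    """0/1 mask keeping the k highest-scoring coordinates."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def magnitude_mask(x, k):
    """Baseline: keep the k largest |x_i|."""
    return topk_mask([abs(v) for v in x], k)


def weight_informed_mask(x, W, k):
    """Assumed criterion: keep the k largest |x_i| * ||W[:, i]||_2."""
    return topk_mask([abs(v) * n for v, n in zip(x, column_norms(W))], k)


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def approx_error(W, x, mask):
    """L2 distance between the dense output and the output with masked inputs."""
    dense = matvec(W, x)
    sparse = matvec(W, [v * m for v, m in zip(x, mask)])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dense, sparse)))
```

In a toy case with `W = [[0.1, 5.0], [0.1, 5.0]]` and `x = [1.0, 0.5]`, magnitude-only selection keeps the first coordinate even though its weight column is tiny, whereas the weight-informed score keeps the second and gives a much smaller output error.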

[323] Experience-based Knowledge Correction for Robust Planning in Minecraft

Seungjoon Lee, Suhwan Kim, Minhyeon Oh, Youngsik Yoon, Jungseul Ok

Main category: cs.LG

TL;DR: XENON is an LLM-based agent that algorithmically corrects flawed knowledge priors through experience, using adaptive dependency graphs and failure-aware action memory to improve long-horizon planning in Minecraft environments.

Details

Motivation: LLM-based planning agents often start with incorrect priors about goal dependencies and feasible actions, and fail to correct them even with feedback. This limits their effectiveness in complex, long-horizon environments like Minecraft where accurate knowledge of item dependencies and actions is crucial.

Method: XENON introduces two algorithmic mechanisms: 1) Adaptive Dependency Graph that corrects item dependencies using past successful experiences, and 2) Failure-aware Action Memory that corrects action knowledge using past failures. These components work together to enable knowledge revision from sparse binary feedback.

Result: XENON outperforms prior agents across multiple Minecraft benchmarks in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, it surpasses agents relying on much larger proprietary models.

Conclusion: Algorithmic knowledge correction from experience enables LLM-based agents to overcome flawed priors and achieve robust performance in complex environments with limited feedback, demonstrating that experience-based learning can compensate for model size limitations.

Abstract: Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models. Project page: https://sjlee-me.github.io/XENON

[324] DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Makoto Shing, Masanori Koyama, Takuya Akiba

Main category: cs.LG

TL;DR: DiffusionBlocks is a framework for training transformer networks as independent blocks using diffusion principles, reducing memory requirements while maintaining performance comparable to end-to-end training.

Details

Motivation: End-to-end backpropagation creates memory bottlenecks that limit model scalability due to storing activations across all layers. Existing block-wise training methods rely on ad-hoc local objectives and haven't been explored beyond classification tasks.

Method: Transforms transformer networks into independent trainable blocks by leveraging residual connections as updates in a dynamical system. Converts these updates to a denoising process where each block can be learned independently using score matching objectives, enabling training with gradients for only one block at a time.

Result: Experiments on various transformer architectures (vision, diffusion, autoregressive, recurrent-depth, masked diffusion) show that DiffusionBlocks training matches end-to-end training performance while enabling scalable block-wise training on practical tasks beyond small-scale classification.

Conclusion: DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures, offering memory-efficient training without sacrificing performance.

Abstract: End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose DiffusionBlocks, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks.

[325] Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic

Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera

Main category: cs.LG

TL;DR: Proposes a method to reorder decoder input tokens (chain of thought steps) into learning-friendly sequences for Transformers on arithmetic tasks, using a hierarchical search approach to identify optimal reasoning step orders.

Details

Motivation: While intermediate reasoning steps in Transformers have been studied extensively, the ordering of these steps has received little attention despite significantly affecting reasoning difficulty. Current approaches don't systematically explore optimal step ordering for learning.

Method: Proposes a pipeline that 1) trains a Transformer on a mixture of target sequences arranged in different orders, 2) identifies benign orders as those with fast loss drops early in training, and 3) uses a two-stage hierarchical approach for inter- and intra-block reordering to cope with the factorial growth of the search space.

Result: Experiments on seven order-sensitive arithmetic tasks show the method identifies a learning-friendly order out of a few billion candidates. Notably, it recovers the reverse-digit order previously reported for the multiplication task.

Conclusion: Step ordering strongly affects how easily Transformers learn arithmetic reasoning, and the proposed method discovers learning-friendly step orders, demonstrating the value of systematically exploring chain-of-thought ordering.

Abstract: The chain of thought, i.e., step-by-step reasoning, is one of the fundamental mechanisms of Transformers. While the design of intermediate reasoning steps has been extensively studied and shown to critically influence performance on mathematical, multi-step reasoning tasks, the ordering of these steps has received little attention, despite its significant effect on the difficulty of reasoning. This study addresses a novel task of unraveling the chain of thought – reordering decoder input tokens into a learning-friendly sequence for Transformers, for learning arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. As the search space grows factorially in sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on seven order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, it recovered the reverse-digit order reported in prior studies for the multiplication task.
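
The selection step described above (keep the orders whose loss falls fastest early in training) can be sketched as follows. This is a minimal illustration, assuming per-order losses were logged during the mixed-order training run; the function names, the `window` parameter, and the toy loss values are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Tuple

Order = Tuple[int, ...]


def early_loss_drop(losses: List[float], window: int) -> float:
    """Loss decrease over the first `window` logged steps (larger means faster learning)."""
    if window < 1 or len(losses) <= window:
        raise ValueError("need more than `window` logged losses")
    return losses[0] - losses[window]


def rank_orders(curves: Dict[Order, List[float]], window: int) -> List[Order]:
    """Sort candidate orders from fastest to slowest early loss drop."""
    return sorted(
        curves,
        key=lambda order: early_loss_drop(curves[order], window),
        reverse=True,
    )


# Toy per-order loss curves (made-up numbers, for illustration only).
curves = {
    (0, 1, 2): [2.30, 2.25, 2.21, 2.18],
    (2, 1, 0): [2.30, 1.90, 1.40, 1.05],
    (1, 0, 2): [2.30, 2.10, 1.95, 1.80],
}
ranking = rank_orders(curves, window=3)
best_order = ranking[0]  # the order whose loss fell fastest
```

In the hierarchical setting, the same ranking could be applied first to block-level permutations and then to permutations within each block, which keeps each search far smaller than the full factorial space.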

[326] Model-Agnostic Dynamic Feature Selection with Uncertainty Quantification

Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez

Main category: cs.LG

TL;DR: A model-agnostic dynamic feature selection framework that works with pre-trained classifiers while addressing uncertainty quantification issues specific to sequential feature acquisition.

Details

Motivation: Existing dynamic feature selection methods require specialized models, limiting compatibility with deployed systems, and lack proper uncertainty quantification which is crucial for high-stakes decisions.

Method: Proposes a model-agnostic DFS framework compatible with pre-trained classifiers using efficient subset reparametrization strategies. Formalizes new uncertainty sources in DFS including model adaptation uncertainty and imputation bias.

Result: Achieves competitive accuracy against state-of-the-art greedy and reinforcement learning-based DFS methods on tabular and image datasets with both neural and rule-based classifiers. Shows identified uncertainty sources persist across existing approaches.

Conclusion: DFS introduces unique uncertainty challenges requiring specialized quantification, and the proposed framework enables uncertainty-aware dynamic feature selection with existing models.

Abstract: Dynamic feature selection (DFS) addresses budget constraints in decision-making by sequentially acquiring features for each instance, making it appealing for resource-limited scenarios. However, existing DFS methods require models specifically designed for the sequential acquisition setting, limiting compatibility with models already deployed in practice. Furthermore, they provide limited uncertainty quantification, undermining trust in high-stakes decisions. In this work, we show that DFS introduces new uncertainty sources compared to the static setting. We formalise how model adaptation to feature subsets induces epistemic uncertainty, how standard imputation strategies bias aleatoric uncertainty estimation, and why predictive confidence fails to discriminate between good and bad selection policies. We also propose a model-agnostic DFS framework compatible with pre-trained classifiers, including interpretable-by-design models, through efficient subset reparametrization strategies. Empirical evaluation on tabular and image datasets demonstrates competitive accuracy against state-of-the-art greedy and reinforcement learning-based DFS methods with both neural and rule-based classifiers. We further show that the identified uncertainty sources persist across most existing approaches, highlighting the need for uncertainty-aware DFS.

[327] Pinet: Optimizing hard-constrained neural networks with orthogonal projection layers

Panagiotis D. Grontas, Antonio Terpin, Efe C. Balta, Raffaello D’Andrea, John Lygeros

Main category: cs.LG

TL;DR: Πnet is a neural network output layer that ensures convex constraint satisfaction via operator splitting for projections and implicit function theorem for backpropagation, enabling fast feasible-by-design solutions for parametric constrained optimization problems.

Details

Motivation: The paper addresses the need for neural networks that can produce solutions satisfying convex constraints, particularly for parametric constrained optimization problems where traditional solvers are slow, especially when solving batches of problems.

Method: Πnet uses operator splitting for rapid and reliable projections in the forward pass to ensure constraint satisfaction, and the implicit function theorem for efficient backpropagation during training.

Result: Πnet obtains modest-accuracy solutions faster than traditional solvers on a single problem and significantly faster on a batch of problems. It surpasses state-of-the-art learning approaches by orders of magnitude in training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times.

Conclusion: Πnet provides a GPU-ready, feasible-by-design optimization proxy implemented in JAX that enables efficient constraint-satisfying solutions for parametric optimization problems, demonstrated on multi-vehicle motion planning with non-convex trajectory preferences.

Abstract: We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, $Π$net, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy $Π$net as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches by orders of magnitude in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide $Π$net as a GPU-ready package implemented in JAX.
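
To make the idea of a projection output layer concrete, here is a small sketch that projects a raw network output onto the convex set {x in [0, 1]^n : sum(x) = 1}. It uses Dykstra's alternating projection algorithm as a simple stand-in; this is not the operator-splitting scheme used by Πnet, and the implicit-function-theorem backward pass is not shown. The constraint set and all function names are assumptions chosen for illustration.

```python
from typing import List


def project_hyperplane(z: List[float], total: float) -> List[float]:
    """Euclidean projection onto {x : sum(x) = total}."""
    shift = (sum(z) - total) / len(z)
    return [v - shift for v in z]


def project_box(z: List[float], lo: float, hi: float) -> List[float]:
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in z]


def dykstra_project(y: List[float], total: float = 1.0, lo: float = 0.0,
                    hi: float = 1.0, iters: int = 200) -> List[float]:
    """Approximate projection of y onto {x in [lo, hi]^n : sum(x) = total}."""
    x = list(y)
    p = [0.0] * len(y)  # correction term carried by the hyperplane step
    q = [0.0] * len(y)  # correction term carried by the box step
    for _ in range(iters):
        a = project_hyperplane([xv + pv for xv, pv in zip(x, p)], total)
        p = [xv + pv - av for xv, pv, av in zip(x, p, a)]
        x = project_box([av + qv for av, qv in zip(a, q)], lo, hi)
        q = [av + qv - xv for av, qv, xv in zip(a, q, x)]
    return x


# A raw "network output" that violates both constraints.
feasible = dykstra_project([0.9, 0.8, -0.5])
```

For this input the iterates approach [0.55, 0.45, 0.0], which is the exact Euclidean projection onto the set. In a trained model such a layer would sit after the last learned layer so that every prediction is feasible by construction.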

[328] FairTabGen: High-Fidelity and Fair Synthetic Health Data Generation from Limited Samples

Nitish Nagesh, Salar Shakibhamedan, Mahdi Bagheri, Ziyu Wang, Nima TaheriNejad, Axel Jantsch, Amir M. Rahmani

Main category: cs.LG

TL;DR: FairTabGen: LLM-based framework for synthetic healthcare tabular data generation using minimal data, achieving competitive utility with improved fairness through bias mitigation.

Details

Motivation: Synthetic healthcare data generation addresses privacy and regulatory constraints in clinical research, but current approaches require specialized generative model knowledge and high computational resources.

Method: LLM-based framework combining in-context learning, prompt curation, and embedding structural constraints for tabular data synthesis using only a small subset of original data.

Result: Uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility on the MIMIC-IV dataset. Bias mitigation in the pre-processing stage improves overall fairness by 10%.

Conclusion: FairTabGen provides an effective LLM-based approach for synthetic healthcare data generation with improved fairness and reduced data/computational requirements.

Abstract: Synthetic healthcare data generation offers a promising solution to research limitations in clinical settings caused by privacy and regulatory constraints. However, current synthetic data generation approaches require specialized knowledge about training generative models and require high computational resources. In this paper, we propose FairTabGen, an LLM-based tabular data generation framework that produces high-quality synthetic healthcare data using only a small subset of the original dataset. Our method combines in-context learning, prompt curation and embedding structural constraints for data synthesis. We evaluate performance on the MIMIC-IV dataset. Our method uses 99% less data and achieves a 50% improvement in fairness through unawareness while maintaining competitive predictive utility. However, we observe that the data distribution of racial groups is skewed, which affects demographic parity. We therefore apply bias mitigation algorithms in the pre-processing stage, improving overall fairness by 10% and highlighting the effectiveness of our approach.
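
The abstract refers to demographic parity without saying how it is scored. As a rough illustration, the standard demographic parity gap (the largest difference in positive-prediction rate between groups) can be computed as below. The function names and the toy predictions are hypothetical, and the paper's own fairness metrics may be defined differently.

```python
from typing import Dict, List


def positive_rate_by_group(preds: List[int], groups: List[str]) -> Dict[str, float]:
    """Fraction of positive (1) predictions within each group."""
    counts: Dict[str, int] = {}
    positives: Dict[str, int] = {}
    for pred, group in zip(preds, groups):
        counts[group] = counts.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred
    return {g: positives[g] / counts[g] for g in counts}


def demographic_parity_gap(preds: List[int], groups: List[str]) -> float:
    """Largest difference in positive rate between any two groups (0 means parity)."""
    rates = positive_rate_by_group(preds, groups)
    return max(rates.values()) - min(rates.values())


# Toy predictions for two groups (made-up data).
preds = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)  # group a: 0.75, group b: 0.25
```

A skewed group distribution in the training data tends to widen this gap, which is the kind of effect a pre-processing mitigation step aims to reduce.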

[329] SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Main category: cs.LG

TL;DR: SNAP-UQ: A single-pass, label-free uncertainty estimation method for TinyML that predicts next activations in backbone layers to compute uncertainty scores with minimal resource overhead.

Details

Motivation: Current uncertainty estimation methods (deep ensembles, MC dropout, early exits) are impractical for TinyML due to multiple passes, extra branches, or state requirements that exceed strict flash/latency budgets on microcontrollers.

Method: Uses depth-wise next-activation prediction with tiny int8 heads to predict mean and scale of next activation from low-rank projection of previous activation. Standardized prediction error forms depth-wise surprisal signal aggregated through lightweight monotone calibrator into uncertainty score.

Result: Across vision and audio backbones, reduces flash footprint and latency relative to early-exit and deep-ensemble baselines (typically ~40-60% smaller and ~25-35% faster). Improves accuracy-drop event detection on corrupted streams and maintains strong failure detection (AUROC ≈ 0.9) in a single forward pass.

Conclusion: SNAP-UQ provides resource-efficient uncertainty estimation for TinyML by grounding uncertainty in layer-to-layer dynamics rather than output confidence, enabling robust on-device monitoring with minimal overhead.

Abstract: Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (deep ensembles, MC dropout, early exits, temporal buffering) typically require multiple passes, extra branches, or state that is impractical on milliwatt hardware. This paper proposes a novel and practical method, SNAP-UQ, for single-pass, label-free uncertainty estimation based on depth-wise next-activation prediction. SNAP-UQ taps a small set of backbone layers and uses tiny int8 heads to predict the mean and scale of the next activation from a low-rank projection of the previous one; the resulting standardized prediction error forms a depth-wise surprisal signal that is aggregated and mapped through a lightweight monotone calibrator into an actionable uncertainty score. The design introduces no temporal buffers or auxiliary exits and preserves state-free inference, while increasing deployment footprint by only a few tens of kilobytes. Across vision and audio backbones, SNAP-UQ reduces flash and latency relative to early-exit and deep-ensemble baselines (typically $\sim$40–60% smaller and $\sim$25–35% faster), with several competing methods at similar accuracy often exceeding MCU memory limits. On corrupted streams, it improves accuracy-drop event detection by multiple AUPRC points and maintains strong failure detection (AUROC $\approx 0.9$) in a single forward pass. By grounding uncertainty in layer-to-layer dynamics rather than solely in output confidence, SNAP-UQ offers a novel, resource-efficient basis for robust TinyML monitoring. Our code is available at: https://github.com/Ism-ail11/SNAP-UQ
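
The scoring path described above (standardized next-activation error per tapped layer, aggregated and passed through a monotone map) might look roughly like this in floating point. It is a sketch under stated assumptions: the mean-squared z-score, the plain average over depth, and the logistic calibrator with its `weight` and `bias` values are illustrative choices, and the int8 heads and low-rank projections are omitted.

```python
import math
from typing import Sequence


def layer_surprisal(actual: Sequence[float], mean: Sequence[float],
                    scale: Sequence[float]) -> float:
    """Mean squared standardized error between observed and predicted activations."""
    z2 = [((a - m) / s) ** 2 for a, m, s in zip(actual, mean, scale)]
    return sum(z2) / len(z2)


def uncertainty_score(surprisals: Sequence[float], weight: float = 1.0,
                      bias: float = -1.0) -> float:
    """Average the depth-wise surprisals, then apply a logistic map (monotone for weight > 0)."""
    agg = sum(surprisals) / len(surprisals)
    return 1.0 / (1.0 + math.exp(-(weight * agg + bias)))


# Predicted mean/scale of the next activation at one tapped layer (toy values).
pred_mean, pred_scale = [0.0, 0.0], [1.0, 1.0]

# An activation close to the prediction versus a strongly shifted one.
low = uncertainty_score([layer_surprisal([0.1, -0.2], pred_mean, pred_scale)])
high = uncertainty_score([layer_surprisal([3.0, -2.5], pred_mean, pred_scale)])
```

Because everything is computed from activations already produced in the forward pass, no second pass or temporal buffer is needed, which matches the single-pass, state-free framing of the abstract.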

[330] Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu

Main category: cs.LG

TL;DR: EVOL-RL is a label-free self-improvement framework for LLMs that balances majority stability with novelty exploration to prevent entropy collapse and improve reasoning diversity.

Details

Motivation: Existing self-improvement approaches for LLMs rely on self-confirmation signals that drive models toward over-confident, majority-favored solutions, causing entropy collapse that degrades pass@n and reasoning complexity. There's a need for methods that can self-improve without labels or external judges while maintaining diversity.

Method: EVOL-RL mirrors evolutionary principles of balancing selection with variation. It retains majority-voted answers as anchors for stability but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses (majority-for-stability + novelty-for-exploration).

Result: EVOL-RL consistently outperforms majority-only baselines: training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. It prevents in-domain diversity collapse and improves out-of-domain generalization (e.g., MMLU-Pro and BBEH).

Conclusion: EVOL-RL provides an effective label-free self-improvement framework that balances stability and exploration, preventing entropy collapse while improving reasoning diversity and generalization capabilities across domains.

Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from baseline’s 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., MMLU-Pro and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
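
The majority-for-stability plus novelty-for-exploration rule can be sketched as a per-response reward. The abstract does not give the exact reward shaping, so the form below (a +1/-1 majority term plus a weighted novelty bonus) is an assumption, as are the use of embedding vectors to measure reasoning similarity and the name `novelty_weight`.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def majority_novelty_rewards(answers: List[str], embeddings: List[Sequence[float]],
                             novelty_weight: float = 0.5) -> List[float]:
    """Reward = majority agreement (+1 / -1) plus a bonus for dissimilar reasoning."""
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = []
    for i, (answer, emb) in enumerate(zip(answers, embeddings)):
        sims = [cosine(emb, other) for j, other in enumerate(embeddings) if j != i]
        novelty = 1.0 - sum(sims) / len(sims)  # high when reasoning differs from peers
        base = 1.0 if answer == majority else -1.0
        rewards.append(base + novelty_weight * novelty)
    return rewards


# Four sampled responses: three agree on "42", one of those reasons differently.
answers = ["42", "42", "42", "7"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rewards = majority_novelty_rewards(answers, embeddings)
```

Here the third response gets the highest reward because it reaches the majority answer through different reasoning, while the minority answer stays negative. That is the intended effect: selection anchors on the majority, and novelty keeps the sampled solutions from collapsing onto one path.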

[331] Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

Shane Bergsma, Nolan Dey, Joel Hestness

Main category: cs.LG

TL;DR: TREC (training re-evaluation curve) is a diagnostic tool that evaluates how well a trained model retains training data based on when it was encountered, enabling better data curriculum design for LLM training.

Details

Motivation: Current LLM training lacks clear principles for optimal data placement in curricula, despite data curriculums being central to successful training. There's a need for better understanding of how data timing affects model retention and performance.

Method: Introduces TREC (training re-evaluation curve) that retrospectively evaluates training batches using final model weights. Shows TREC can be predicted in advance from AdamW’s implicit EMA coefficients. Uses TREC analysis on models from 111M to 3.9B parameters to optimize data placement.

Result: Placing high-quality data at low points on the TREC significantly improves performance. TREC predictions explain prior ablations and reveal suboptimal data placements in published recipes. Improved continual pre-training of a 3.9B-parameter LLM trained on 900B tokens by aligning high-quality data with TREC minima.

Conclusion: TREC provides a principled approach to data curriculum design, enabling proactive optimization of data placement based on when models best retain information during training.

Abstract: Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the training re-evaluation curve (TREC), a diagnostic that retrospectively evaluates training batches using the final model weights. The TREC characterizes how well a trained model retains training data as a function of when the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be predicted in advance from AdamW’s implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
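
The "implicit EMA coefficients" can be made concrete with the standard AdamW recursion. With decoupled weight decay, each step computes θ_t = (1 - η_t λ) θ_{t-1} - η_t u_t, so the update from step i reaches the final weights scaled by η_i times the product of (1 - η_j λ) over all later steps j. The sketch below computes those coefficients for a given learning-rate schedule; how the paper maps them to a predicted TREC is not stated in the abstract, so that step is not attempted, and the function name is hypothetical.

```python
from typing import List


def ema_coefficients(lrs: List[float], weight_decay: float) -> List[float]:
    """Weight of each step's update in the final parameters under decoupled weight decay."""
    coeffs = [0.0] * len(lrs)
    decay_after = 1.0  # product of (1 - lr_j * weight_decay) over steps j after step i
    for i in range(len(lrs) - 1, -1, -1):
        coeffs[i] = lrs[i] * decay_after
        decay_after *= 1.0 - lrs[i] * weight_decay
    return coeffs


# Constant learning rate: later updates carry more weight than earlier ones.
coeffs = ema_coefficients([0.1, 0.1, 0.1], weight_decay=0.5)
```

With a decaying learning-rate schedule the coefficients are no longer monotone, which suggests why the best position for high-quality data need not be the very end of training.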

[332] Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective

Feilong Liu

Main category: cs.LG

TL;DR: MoE architectures reduce local function sensitivity and flatten curvature through expert-local partitioning, with routing strategies modulating representation geometry and expert alignment.

Details

Motivation: To understand the geometric effects of Mixture-of-Experts architectures on learned functions and representations, particularly how routing mechanisms shape local function sensitivity and representation structure.

Method: Introduces Dual Jacobian-PCA spectral probe: analyzes local function geometry via Jacobian singular value spectra and representation geometry via weighted PCA of routed hidden states. Uses controlled MLP-MoE setting with exact Jacobian computation, comparing dense, Top-k, and fully soft routing under matched capacity.

Result: MoE routing consistently reduces local sensitivity (smaller leading singular values, faster spectral decay). Expert-local representations show higher effective rank with variance distributed across more principal directions. Low alignment among expert Jacobians suggests decomposition into expert-specific transformations. Top-k routing yields more concentrated, lower-rank structure, while fully soft routing produces broader, higher-rank representations.

Conclusion: MoEs act as soft partitionings of function space that flatten local curvature while redistributing representation variance, with implications for expert scaling, hallucination reduction, and ensemble diversity.

Abstract: Mixture-of-Experts (MoE) architectures are widely used for efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly understood. We study MoEs through a geometric lens, interpreting routing as soft partitioning into overlapping expert-local charts. We introduce a Dual Jacobian-PCA spectral probe that analyzes local function geometry via Jacobian singular value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting with exact Jacobian computation, we compare dense, Top-k, and fully soft routing under matched capacity. Across random seeds, MoE routing consistently reduces local sensitivity: expert-local Jacobians show smaller leading singular values and faster spectral decay than dense baselines. Weighted PCA reveals that expert-local representations distribute variance across more principal directions, indicating higher effective rank. We further observe low alignment among expert Jacobians, suggesting decomposition into low-overlap expert-specific transformations. Routing sharpness modulates these effects: Top-k routing yields more concentrated, lower-rank expert structure, while fully soft routing produces broader, higher-rank representations. Experiments on a 3-layer transformer with WikiText confirm curvature reduction on natural language and show lower cross-expert alignment for Top-k routing. These findings support interpreting MoEs as soft partitionings of function space that flatten local curvature while redistributing representation variance, yielding testable predictions for expert scaling, hallucination reduction, and ensemble diversity.
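
Two quantities from this summary are easy to illustrate: the gate weights under fully soft versus Top-k routing, and an effective-rank measure for a variance spectrum. The sketch below assumes the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized spectrum), which may differ from the measure used in the paper; all function names and the example logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax (fully soft routing weights)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits: List[float], k: int) -> List[float]:
    """Keep the k largest logits, renormalize over them, and zero out the rest."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = softmax([logits[i] for i in keep])
    gates = [0.0] * len(logits)
    for i, g in zip(keep, kept):
        gates[i] = g
    return gates


def effective_rank(variances: List[float]) -> float:
    """exp(entropy) of the normalized spectrum: 1 for a single direction, n if uniform."""
    total = sum(variances)
    probs = [v / total for v in variances if v > 0]
    return math.exp(-sum(p * math.log(p) for p in probs))


router_logits = [2.0, 1.0, 0.0, -1.0]
soft = softmax(router_logits)         # every expert receives some weight
sparse = top_k_gates(router_logits, 2)  # only the two strongest experts are active
```

Feeding the PCA variances of gate-weighted hidden states into `effective_rank` would give one number per expert, which is a simple way to compare how concentrated or spread out the representations are under each routing rule.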

[333] Transformers can do Bayesian Clustering

Prajit Bhaskaran, Tom Viering

Main category: cs.LG

TL;DR: Cluster-PFN is a Transformer-based model for Bayesian clustering that learns from synthetic GMM data to estimate posterior distributions over cluster counts and assignments, handling missing data efficiently.

Details

Motivation: Bayesian clustering is computationally expensive at scale, and real-world datasets often have missing values. Simple imputation ignores uncertainty, leading to suboptimal results. There's a need for scalable Bayesian clustering that can handle missing data effectively.

Method: Extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering using Transformer architecture. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model prior. Learns to estimate posterior distribution over both number of clusters and cluster assignments.

Result: Estimates number of clusters more accurately than AIC, BIC and Variational Inference. Achieves clustering quality competitive with VI while being orders of magnitude faster. Outperforms imputation-based baselines on real-world genomic datasets with high missingness.

Conclusion: Cluster-PFN provides scalable and flexible Bayesian clustering that can handle missing data effectively, offering computational efficiency while maintaining accuracy.

Abstract: Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets at high missingness. These results show that Cluster-PFN can provide scalable and flexible Bayesian clustering.
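
Training on a synthetic prior means repeatedly sampling whole labeled datasets. A minimal sketch of one such draw from a finite GMM prior is shown below. The specific prior choices (uniform cluster count, Dirichlet(1, ..., 1) mixture weights, Gaussian means with standard deviation 3, unit-variance components) are assumptions for illustration and are not the paper's hyperparameters.

```python
import random
from typing import List, Tuple


def sample_gmm_dataset(n_points: int, max_clusters: int, dim: int,
                       rng: random.Random) -> Tuple[List[List[float]], List[int], int]:
    """Draw one synthetic dataset (points, cluster labels, cluster count) from a GMM prior."""
    k = rng.randint(1, max_clusters)  # number of clusters, uniform on 1..max_clusters
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    weights = [w / sum(raw) for w in raw]  # normalized gammas give Dirichlet(1, ..., 1)
    means = [[rng.gauss(0.0, 3.0) for _ in range(dim)] for _ in range(k)]
    labels = rng.choices(range(k), weights=weights, k=n_points)
    points = [[rng.gauss(means[z][d], 1.0) for d in range(dim)] for z in labels]
    return points, labels, k


rng = random.Random(0)
points, labels, k = sample_gmm_dataset(n_points=100, max_clusters=5, dim=2, rng=rng)
```

Each draw supplies both the inputs (`points`) and the supervision targets (`labels`, `k`), so a model can be trained to predict cluster count and assignments without any real labeled data. Missing values could be simulated by masking entries of `points` at sampling time.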

[334] Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training

Ipsita Ghosh, Ethan Nguyen, Christian Kümmerle

Main category: cs.LG

TL;DR: Q3R is a novel low-rank training method using quadratic reweighted rank regularization that enables training models to prescribed low ranks while maintaining performance comparable to dense models.

Details

Motivation: Existing low-rank optimization methods work well for fine-tuning but fail for low-rank pre-training, where maintaining low-rank structure while optimizing task objectives remains challenging.

Method: Proposes Quadratic Reweighted Rank Regularizer (Q3R) based on quadratic regularizer term that majorizes smoothed log-determinant rank surrogate, inspired by Iteratively Reweighted Least Squares framework.

Result: Q3R trains weight matrices to prescribed low target ranks with predictive performance comparable to dense models, small computational overhead, and full compatibility with existing architectures. A ViT-Tiny experiment on CIFAR-10 shows absolute accuracy drops of only 1.3% and 4% when the model is truncated to 60% and 80% of its parameters, respectively.

Conclusion: Q3R is an effective low-rank training method demonstrated across vision and language tasks, including low-rank fine-tuning, enabling efficient model compression while maintaining performance.

Abstract: Parameter-efficient training based on low-rank optimization has become a highly successful tool for fine-tuning large deep learning models. However, these methods often fail for low-rank pre-training, where simultaneously maintaining low-rank weight structure and optimizing the task objective remains challenging. We propose the $\textit{Quadratic Reweighted Rank Regularizer}$ ($\texttt{Q3R}$), which leads to a novel low-rank-inducing training strategy inspired by the Iteratively Reweighted Least Squares (IRLS) framework. $\texttt{Q3R}$ is based on a quadratic regularizer term that majorizes a smoothed log-determinant rank surrogate. Unlike other low-rank training techniques, $\texttt{Q3R}$ can train weight matrices to prescribed low target ranks while achieving predictive performance comparable to dense models, with small computational overhead and full compatibility with existing architectures. For example, we demonstrate a $\texttt{Q3R}$-regularized ViT-Tiny experiment where truncating the model to $60\%$ and $80\%$ of its parameters results in only minor absolute accuracy drops of $1.3\%$ and $4\%$, respectively, on CIFAR-10. We confirm the efficacy of $\texttt{Q3R}$ on Transformers across both vision and language tasks, including low-rank fine-tuning.
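
The phrase "a quadratic regularizer term that majorizes a smoothed log-determinant rank surrogate" can be unpacked with a generic IRLS-style derivation. The following is a sketch assuming the common surrogate $\log\det(W^\top W + \epsilon I)$; the exact smoothing and reweighting operator used by Q3R may differ, and the symbols $W_k$ (current iterate) and $\epsilon$ (smoothing parameter) are introduced here for illustration.

```latex
% Smoothed log-determinant rank surrogate, with smoothing parameter \epsilon > 0:
F_\epsilon(W) = \log\det\!\left(W^\top W + \epsilon I\right)

% Concavity of \log\det on positive definite matrices gives, for any reference point W_k:
F_\epsilon(W) \le F_\epsilon(W_k)
  + \operatorname{tr}\!\left[\left(W_k^\top W_k + \epsilon I\right)^{-1}
    \left(W^\top W - W_k^\top W_k\right)\right]

% The only W-dependent term of the bound is quadratic in W:
\operatorname{tr}\!\left[\left(W_k^\top W_k + \epsilon I\right)^{-1} W^\top W\right]
  = \left\| W \left(W_k^\top W_k + \epsilon I\right)^{-1/2} \right\|_F^2
```

Minimizing the quadratic upper bound and then refreshing $W_k$ is the usual reweighting loop: directions in which $W_k$ has small singular values receive large weights, so they are pushed further toward zero.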

[335] StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths

Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron

Main category: cs.LG

TL;DR: StableQAT: A unified QAT framework using Fourier analysis to stabilize ultra-low bitwidth training with lightweight gradient surrogates that generalize STE.

Details

Motivation: Quantization-aware training is crucial for deploying large models under memory/latency constraints, but existing approaches (STE, soft quantizers) suffer from gradient mismatch, instability, or high computational overhead at ultra-low bitwidths.

Method: Proposes StableQAT framework with novel lightweight surrogate for backpropagation derived from discrete Fourier analysis of rounding operator. Generalizes STE as special case, providing smooth, bounded, inexpensive gradients.

Result: StableQAT exhibits stable and efficient QAT at 2-4 bit regimes with improved training stability, robustness, and superior performance against standard QAT techniques, with negligible training overhead.

Conclusion: StableQAT provides a theoretically grounded, unified framework for stable quantization-aware training at ultra-low bitwidths, addressing key limitations of existing approaches.

Abstract: Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE as the latter arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT at 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead against standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.
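
One way to see how a Fourier view of rounding can generalize STE is the classical sawtooth series: round(x) = x - Σ_{k≥1} (-1)^{k+1} sin(2πkx) / (πk) away from half-integers. Truncating the sum gives a smooth surrogate whose derivative is bounded for any finite number of terms, and keeping zero terms gives the identity, whose derivative is the constant 1 used by STE. The sketch below implements that truncated series; it is an assumption-laden illustration and not necessarily the surrogate family defined in StableQAT, and `n_terms` is a hypothetical parameter.

```python
import math


def fourier_round(x: float, n_terms: int) -> float:
    """Truncated Fourier series of round(x): x minus a partial sum of the sawtooth."""
    saw = sum(
        (-1) ** (k + 1) * math.sin(2 * math.pi * k * x) / (math.pi * k)
        for k in range(1, n_terms + 1)
    )
    return x - saw


def fourier_round_grad(x: float, n_terms: int) -> float:
    """Derivative of fourier_round; with n_terms=0 it is the constant 1 used by STE."""
    return 1.0 - 2.0 * sum(
        (-1) ** (k + 1) * math.cos(2 * math.pi * k * x)
        for k in range(1, n_terms + 1)
    )


ste_like = fourier_round_grad(0.3, n_terms=0)  # no Fourier terms: identity gradient
approx = fourier_round(1.3, n_terms=2000)      # many terms: close to round(1.3)
```

With few terms the gradient stays close to the STE value everywhere; adding terms concentrates gradient mass near the half-integer rounding thresholds, at the cost of larger gradient magnitudes there.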

[336] Cardinality-Preserving Attention Channels for Graph Transformers in Molecular Property Prediction

Abhijit Gupta

Main category: cs.LG

TL;DR: CardinalGraphFormer: A graph transformer with query-conditioned cardinality-preserving attention for molecular property prediction, combining structured sparse attention with Graphormer biases and dual-objective self-supervised pretraining.

Details

Motivation: Molecular property prediction faces challenges with scarce labeled data. Existing methods may lose dynamic support-size signals that complement static centrality embeddings, limiting their effectiveness in drug discovery applications.

Method: Proposes CardinalGraphFormer with query-conditioned cardinality-preserving attention (CPA) channel that retains dynamic support-size signals. Combines structured sparse attention with Graphormer-inspired biases (shortest-path distance, centrality, direct-bond features) and unified dual-objective self-supervised pretraining (masked reconstruction and contrastive alignment of augmented views).

Result: Evaluation on 11 public benchmarks (MoleculeNet, OGB, TDC ADMET) shows consistent improvements over protocol-matched baselines under matched pretraining, optimization, and hyperparameter tuning. Rigorous ablations confirm CPA’s contributions and rule out simple size shortcuts.

Conclusion: CardinalGraphFormer effectively addresses molecular property prediction with scarce labeled data by preserving dynamic support-size signals through CPA attention, demonstrating superior performance across multiple benchmarks with validated contributions from the proposed attention mechanism.

Abstract: Molecular property prediction is crucial for drug discovery when labeled data are scarce. This work presents CardinalGraphFormer, a graph transformer augmented with a query-conditioned cardinality-preserving attention (CPA) channel that retains dynamic support-size signals complementary to static centrality embeddings. The approach combines structured sparse attention with Graphormer-inspired biases (shortest-path distance, centrality, direct-bond features) and unified dual-objective self-supervised pretraining (masked reconstruction and contrastive alignment of augmented views). Evaluation on 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET demonstrates consistent improvements over protocol-matched baselines under matched pretraining, optimization, and hyperparameter tuning. Rigorous ablations confirm CPA’s contributions and rule out simple size shortcuts. Code and reproducibility artifacts are provided.

[337] Amortized Bayesian Workflow

Chengkun Li, Aki Vehtari, Paul-Christian Bürkner, Stefan T. Radev, Luigi Acerbi, Marvin Schmitt

Main category: cs.LG

TL;DR: Adaptive workflow combining fast amortized inference with accurate MCMC for Bayesian inference on many datasets

Details

Motivation: Bayesian inference faces trade-off between computational speed and sampling accuracy; need to balance both when performing inference on many datasets

Method: Adaptive workflow using principled diagnostics to guide choice of inference method for each dataset, moving along Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed, with computation reuse across steps

Result: Demonstrated effectiveness on synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality

Conclusion: Integrated approach synergizes amortized and MCMC-based inference to achieve favorable combination of speed and accuracy for Bayesian inference on many datasets

Abstract: Bayesian inference often faces a trade-off between computational speed and sampling accuracy. We propose an adaptive workflow that integrates rapid amortized inference with gold-standard MCMC techniques to achieve a favorable combination of both speed and accuracy when performing inference on many observed datasets. Our approach uses principled diagnostics to guide the choice of inference method for each dataset, moving along the Pareto front from fast amortized sampling via generative neural networks to slower but guaranteed-accurate MCMC when needed. By reusing computations across steps, our workflow synergizes amortized and MCMC-based inference. We demonstrate the effectiveness of this integrated approach on several synthetic and real-world problems with tens of thousands of datasets, showing efficiency gains while maintaining high posterior quality.
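
The per-dataset routing described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `amortized_sampler`, `diagnostic`, `mcmc_sampler`, and `threshold` are hypothetical placeholders, and passing the amortized draws to MCMC as an initialization is only one assumed way of reusing computation across steps.

```python
def run_workflow(datasets, amortized_sampler, diagnostic, mcmc_sampler, threshold):
    """Route each dataset to the cheapest inference method whose draws pass a check.

    All four callables are hypothetical stand-ins supplied by the caller.
    """
    results = []
    for data in datasets:
        # Step 1: fast amortized draws (for example, from a trained generative network).
        draws = amortized_sampler(data)
        # Step 2: keep them only if the diagnostic value is acceptable.
        if diagnostic(data, draws) <= threshold:
            results.append(("amortized", draws))
        else:
            # Step 3: fall back to the slower sampler, reusing the draws as a
            # starting point (an assumed form of computation reuse).
            results.append(("mcmc", mcmc_sampler(data, init=draws)))
    return results
```

Only the datasets that fail the check pay the cost of the slower sampler, which is where the efficiency gain on many datasets would come from.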

[338] Zero-Shot Temporal Resolution Domain Adaptation for Spiking Neural Networks

Sanja Karilanova, Maxime Fabre, Emre Neftci, Ayça Özçelikkale

Main category: cs.LG

TL;DR: Novel domain adaptation methods for Spiking Neural Networks to handle temporal resolution mismatches between training and deployment data without retraining, achieving significant accuracy improvements on audio and neuromorphic vision tasks.

Details

Motivation: SNNs are sensitive to temporal resolution changes, causing performance drops when deployment data has different temporal resolution than training data, especially when fine-tuning isn't possible during deployment.

Method: Three novel domain adaptation methods based on mapping SNN neuron dynamics to State Space Models, adapting neuron parameters to account for time resolution changes without retraining on target resolution.

Result: Significant improvements over baseline scaling method: on SHD audio dataset, accuracy improved from 53.0% to 89.5%; on MSWC audio dataset from 38.8% to 93.6%; on NMNIST neuromorphic vision dataset from 97.2% to 98.5% when target resolution is double source resolution.

Conclusion: Proposed methods effectively adapt SNNs to temporal resolution mismatches, enabling high accuracy on high temporal resolution data through time-efficient training on lower resolution data, with applications to audio and neuromorphic vision tasks.

Abstract: Spiking Neural Networks (SNNs) are biologically-inspired deep neural networks that efficiently extract temporal information while offering promising gains in terms of energy efficiency and latency when deployed on neuromorphic devices. SNN parameters are sensitive to temporal resolution, leading to significant performance drops when the temporal resolution of target data during deployment is not the same as that of the source data used for training, especially when fine-tuning with the target data is not possible during deployment. To address this challenge, we propose three novel domain adaptation methods for adapting neuron parameters to account for the change in time resolution without re-training on target time resolution. The proposed methods are based on a mapping between neuron dynamics in SNNs and State Space Models (SSMs) and are applicable to general neuron models. We evaluate the proposed methods under spatio-temporal data tasks, namely the audio keyword spotting datasets SHD and MSWC, and the neuromorphic image NMNIST dataset. Our methods provide an alternative to, and in most cases significantly outperform, the existing reference method that consists of scaling only the time constant. Notably, when the temporal resolution of the target data is double that of the source data, applying one of our proposed methods instead of the benchmark achieves classification accuracy of 89.5% instead of 53.0% on SHD, 93.6% instead of 38.8% on MSWC and 98.5% instead of 97.2% on NMNIST. Moreover, our results show that high accuracy on high temporal resolution data can be obtained by time-efficient training on lower temporal resolution data.
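
To make the resolution mismatch concrete, the sketch below compares two ways of adapting a scalar linear state update x[t+1] = a*x[t] + b*u[t] when the time step changes. It is an illustration under stated assumptions and not necessarily one of the paper's three methods: the neuron is reduced to its subthreshold dynamics (no spiking or reset), the zero-order-hold discretization is a standard SSM choice assumed here, and the function names are hypothetical.

```python
def rescale_decay_only(a_d, b_d, ratio):
    """Adapt only the per-step decay, which is equivalent to rescaling the time constant.

    ratio = new_step / old_step, so doubling the temporal resolution gives 0.5.
    """
    return a_d ** ratio, b_d


def rescale_zoh(a_d, b_d, ratio):
    """Adapt decay and input gain together, assuming a zero-order-hold discretization.

    Requires a_d != 1 (a leaky state).
    """
    a_new = a_d ** ratio
    b_new = b_d * (a_new - 1.0) / (a_d - 1.0)
    return a_new, b_new


def run(a, b, inputs, x0=0.0):
    """Iterate x <- a*x + b*u over the inputs and return the final state."""
    x = x0
    for u in inputs:
        x = a * x + b * u
    return x
```

With a constant input, two half-length steps using `rescale_zoh` land on the same state as one original step, whereas `rescale_decay_only` overshoots because the input gain is applied twice without being reduced.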

[339] Channel Dependence, Limited Lookback Windows, and the Simplicity of Datasets: How Biased is Time Series Forecasting?

Ibram Abdelmalak, Kiran Madhusudhanan, Jungmin Choi, Christian Kloetergens, Vijaya Krishna Yalavarit, Maximilian Stubbemann, Lars Schmidt-Thieme

Main category: cs.LG

TL;DR: Lookback window tuning is critical for fair LTSF model comparisons; CI models excel on standard benchmarks due to weak inter-channel correlations, but CD models outperform on datasets with strong cross-channel dependencies.

Details

Motivation: Current LTSF research often sets lookback window arbitrarily, leading to unfair model comparisons. The paper aims to establish proper evaluation protocols and understand when CI vs CD models are truly superior.

Method: Systematic experiments tuning lookback window per task, Granger causality analysis to measure channel correlations, and testing on ODE datasets with implicit channel correlations to compare CI and CD models.

Result: Failing to tune lookback window can invert performance rankings. CI models (like PatchTST) excel on standard benchmarks due to weak inter-channel correlations, but CD models significantly outperform CI models on datasets with strong cross-channel dependencies.

Conclusion: Four recommendations: 1) tune lookback window per task, 2) examine CI architectures for standard datasets, 3) use statistical analysis to choose between CI/CD, 4) prefer CD models with limited data.

Abstract: In Long-term Time Series Forecasting (LTSF), the lookback window is a critical hyperparameter often set arbitrarily, undermining the validity of model evaluations. We argue that the lookback window must be tuned on a per-task basis to ensure fair comparisons. Our empirical results show that failing to do so can invert performance rankings, particularly when comparing univariate and multivariate methods. Experiments on standard benchmarks reposition Channel-Independent (CI) models, such as PatchTST, as state-of-the-art methods. However, we reveal this superior performance is largely an artifact of weak inter-channel correlations and simplicity of patterns within these specific datasets. Using Granger causality analysis and ODE datasets (with implicit channel correlations), we demonstrate that the true strength of multivariate Channel-Dependent (CD) models emerges on datasets with strong, inherent cross-channel dependencies, where they significantly outperform CI models. We conclude with four key recommendations for improving TSF research: (i) consider the lookback window as a key hyperparameter to tune, (ii) for standard datasets, examining CI architectures is advantageous, (iii) leverage statistical analysis of datasets to guide the choice between CI and CD architectures, and (iv) prefer CD models in scenarios with limited data.
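
Recommendation (i) amounts to treating the lookback window like any other hyperparameter and selecting it on held-out data. The sketch below shows that selection loop with a deliberately simple stand-in forecaster (the mean of the last `lookback` values); the forecaster, the function names, and the one-step-ahead validation scheme are assumptions for illustration, not the paper's protocol.

```python
def mean_forecast(history, lookback):
    """Toy one-step forecaster: average of the most recent `lookback` values."""
    window = history[-lookback:]
    return sum(window) / len(window)


def validation_mse(series, lookback, val_start):
    """One-step-ahead mean squared error over series[val_start:]."""
    errors = []
    for t in range(val_start, len(series)):
        pred = mean_forecast(series[:t], lookback)
        errors.append((series[t] - pred) ** 2)
    return sum(errors) / len(errors)


def tune_lookback(series, candidates, val_start):
    """Return the candidate lookback with the lowest validation error, plus all scores."""
    scores = {lb: validation_mse(series, lb, val_start) for lb in candidates}
    best = min(scores, key=scores.get)
    return best, scores
```

On a series with period 4, a window of 4 averages out the cycle while shorter windows chase it, so comparing models at an arbitrary fixed window can misstate how well each one can do.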

[340] Random Scaling of Emergent Capabilities

Rosie Zhao, Tian Qin, David Alvarez-Melis, Sham Kakade, Naomi Saphra

Main category: cs.LG

TL;DR: Breakthrough capabilities in language models are driven by continuous changes in probability distributions of training outcomes across random seeds, not by discrete emergence at specific scales.

Details

Motivation: To resolve the debate between "emergence" (capabilities unlocked at specific scales) vs. metric thresholding effects as explanations for sudden performance breakthroughs in language models.

Method: Analyzed training outcomes across random seeds in synthetic length generalization tasks, multiple choice question answering, and grammatical generalization. Examined probability distributions of performance metrics and their relationship to model scale.

Result: Different random seeds produce either smooth or emergent scaling trends. Sharp breakthroughs in metrics result from underlying continuous changes in their distribution across seeds. Distributions become abruptly bimodal at capacity thresholds, but these thresholds appear at scales well before most seeds achieve breakthrough.

Conclusion: Random variation must be considered when predicting model performance from scale. Breakthroughs are driven by continuous probability distribution changes rather than discrete emergence, even under continuous loss metrics.

Abstract: Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. Advocates of “emergence” view these capabilities as unlocked at a specific scale, but others attribute breakthroughs to superficial metric thresholding effects. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. We show that different random seeds can produce either smooth or emergent scaling trends in synthetic length generalization tasks, multiple choice question answering, and grammatical generalization. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. These distributions may become abruptly bimodal at a capacity threshold, but this threshold appears at scales well before most seeds achieve breakthrough. Our observations hold true even under continuous loss metrics, confirming that random variation must be considered when predicting a model’s performance from its scale.

[341] FedMerge: Federated Personalization via Model Merging

Shutong Chen, Tianyi Zhou, Guodong Long, Jing Jiang, Chengqi Zhang

Main category: cs.LG

TL;DR: FedMerge: Personalized federated learning approach that creates customized models for each client by optimally merging multiple global models with automatically learned weights, eliminating need for local fine-tuning.

Details

Motivation: Traditional federated learning with one global model fails to serve clients with non-IID data distributions. Existing multi-model FL approaches provide limited choices, still requiring local fine-tuning. Need for personalized models that better align with each client's specific task and data distribution.

Method: Proposes FedMerge which jointly optimizes multiple global models and client-specific merging weights. Instead of broadcasting global models, server sends customized merged models to each client. Uses automatic optimization of merging weights to create personalized models without local fine-tuning.

Result: FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts methods, across three different non-IID settings with diverse tasks and data types. The merged models also smooth the local-global gap between consecutive rounds that is caused by client drift.

Conclusion: FedMerge enables effective personalization in federated learning by creating customized models through optimal merging of multiple global models, addressing non-IID challenges without requiring local fine-tuning.

Abstract: One global model in federated learning (FL) might not be sufficient to serve many clients with non-IID tasks and distributions. While there have been advances in FL to train multiple global models for better personalization, they only provide limited choices to clients, so local finetuning is still indispensable. In this paper, we propose a novel “FedMerge” approach that can create a personalized model per client by simply merging multiple global models with automatically optimized and customized weights. In FedMerge, a few global models can serve many non-IID clients, even without further local finetuning. We formulate this problem as a joint optimization of global models and the merging weights for each client. Unlike existing FL approaches where the server broadcasts one or multiple global models to all clients, the server only needs to send a customized, merged model to each client. Moreover, instead of periodically interrupting the local training and re-initializing it to a global model, the merged model aligns better with each client’s task and data distribution, smoothening the local-global gap between consecutive rounds caused by client drift. We evaluate FedMerge on three different non-IID settings applied to different domains with diverse tasks and data types, in which FedMerge consistently outperforms existing FL approaches, including clustering-based and mixture-of-experts (MoE) based methods.
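
The server-side merge step can be pictured as a weighted combination of the global models' parameters, one weight vector per client. The sketch below assumes a plain weighted sum over flattened parameter lists and softmax-normalized weights; both choices, and the function names, are illustrative assumptions, and the joint optimization of models and weights is not shown.

```python
import math


def softmax(logits):
    """Turn per-client logits into non-negative merging weights that sum to 1 (assumed normalization)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_models(global_models, weights):
    """Weighted sum of flattened parameter vectors, one vector per global model."""
    n_params = len(global_models[0])
    return [
        sum(w * model[i] for w, model in zip(weights, global_models))
        for i in range(n_params)
    ]
```

Under this reading, each client receives a single merged parameter vector rather than the full set of global models, so its download cost does not grow with the number of global models.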

[342] Closing the Distribution Gap in Adversarial Training for LLMs

Chengzhi Hu, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn

Main category: cs.LG

TL;DR: Distributional Adversarial Training (DAT) improves LLM robustness by using Diffusion LLMs to approximate the true data distribution and generate diverse adversarial samples for better generalization.

Details

Motivation: Current adversarial training methods for LLMs fail to cover the full data distribution, leaving models vulnerable to simple in-distribution attacks like tense changes or translations. This persistent fragility stems from inadequate distributional coverage during training.

Method: Proposes Distributional Adversarial Training (DAT) that leverages Diffusion LLMs to approximate the true joint distribution of prompts and responses. This enables generation of diverse, high-likelihood samples to address generalization failures, combined with continuous adversarial training over the data distribution.

Result: DAT achieves substantially higher adversarial robustness than previous methods by better covering the data distribution and addressing generalization failures.

Conclusion: Distributional coverage is crucial for adversarial robustness in LLMs, and DAT provides an effective approach by combining diffusion models with adversarial training to improve generalization against diverse attacks.

Abstract: Adversarial training for LLMs is one of the most promising methods to reliably improve robustness against adversaries. However, despite significant progress, models remain vulnerable to simple in-distribution exploits, such as rewriting prompts in the past tense or translating them into other languages. We argue that this persistent fragility stems from a fundamental limitation in current adversarial training algorithms: they minimize adversarial loss on their training set but inadequately cover the data distribution, resulting in vulnerability to seemingly simple attacks. To bridge this gap, we propose Distributional Adversarial Training, DAT. We leverage Diffusion LLMs to approximate the true joint distribution of prompts and responses, enabling generation of diverse, high-likelihood samples that address generalization failures. By combining optimization over the data distribution provided by the diffusion model with continuous adversarial training, DAT achieves substantially higher adversarial robustness than previous methods.

[343] ReaCritic: Reasoning Transformer-based DRL Critic-model Scaling For Wireless Networks

Feiran You, Hongyang Du

Main category: cs.LG

TL;DR: ReaCritic introduces a reasoning transformer-based critic model for DRL that uses horizontal and vertical reasoning to improve decision-making in dynamic wireless HetNets and control tasks.

Details

Motivation: Existing DRL methods struggle with decision complexity in HetNets due to diverse user requirements and time-varying conditions. Conventional critic models use shallow architectures that limit multi-task handling, while LLMs show that intermediate reasoning steps improve decision quality.

Method: ReaCritic uses a reasoning transformer-based critic model with horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It’s compatible with value-based and actor-critic DRL algorithms.

Result: Extensive experiments show ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks.

Conclusion: ReaCritic successfully brings reasoning-like ability into DRL, enhancing generalization in dynamic wireless environments and improving performance across diverse tasks.

Abstract: Heterogeneous Networks (HetNets) pose critical challenges for intelligent management due to the diverse user requirements and time-varying wireless conditions. These factors introduce significant decision complexity, which limits the adaptability of existing Deep Reinforcement Learning (DRL) methods. In many DRL algorithms, especially those involving value-based or actor-critic structures, the critic component plays a key role in guiding policy learning by estimating value functions. However, conventional critic models often use shallow architectures that map observations directly to scalar estimates, limiting their ability to handle multi-task complexity. In contrast, recent progress in inference-time scaling of Large Language Models (LLMs) has shown that generating intermediate reasoning steps can significantly improve decision quality. Motivated by this, we propose ReaCritic, a reasoning transformer-based critic-model scaling scheme that brings reasoning-like ability into DRL. ReaCritic performs horizontal reasoning over parallel state-action inputs and vertical reasoning through deep transformer stacks. It is compatible with a broad range of value-based and actor-critic DRL algorithms and enhances generalization in dynamic wireless environments. Extensive experiments demonstrate that ReaCritic improves convergence speed and final performance across various HetNet settings and standard OpenAI Gym control tasks. The code of ReaCritic is available at https://github.com/NICE-HKU/ReaCritic.

[344] Non-Asymptotic Analysis of (Sticky) Track-and-Stop

Riccardo Poiani, Martino Bernasconi, Andrea Celli

Main category: cs.LG

TL;DR: Non-asymptotic sample complexity guarantees for Track-and-Stop and Sticky Track-and-Stop algorithms in pure exploration problems

Details

Motivation: Existing asymptotic optimality guarantees for pure exploration algorithms lack non-asymptotic bounds, which are important for practical applications with finite sample regimes

Method: Analyzes Track-and-Stop (for single-valued answer maps) and Sticky Track-and-Stop (for multi-valued answer maps) algorithms, providing non-asymptotic sample complexity guarantees

Result: Derives non-asymptotic upper bounds on sample complexity for both algorithms, showing they maintain good performance in finite sample regimes

Conclusion: The paper bridges the gap between asymptotic optimality and practical finite-sample performance for pure exploration algorithms

Abstract: In pure exploration problems, a statistician sequentially collects information to answer a question about some stochastic and unknown environment. The probability of returning a wrong answer should not exceed a maximum risk parameter $δ$ and good algorithms make as few queries to the environment as possible. The Track-and-Stop algorithm is a pioneering method to solve these problems. Specifically, it is well-known that it enjoys asymptotically optimal sample complexity guarantees for $δ\to 0$ whenever the map from the environment to its correct answers is single-valued (e.g., best-arm identification with a unique optimal arm). The Sticky Track-and-Stop algorithm extends these results to settings where, for each environment, there might exist multiple correct answers (e.g., $ε$-optimal arm identification). Although both methods are optimal in the asymptotic regime, their non-asymptotic guarantees remain unknown. In this work, we fill this gap and provide non-asymptotic guarantees for both algorithms.

[345] On the Expressive Power of Mixture-of-Experts for Structured Complex Tasks

Mingze Wang, Weinan E

Main category: cs.LG

TL;DR: Theoretical analysis of mixture-of-experts networks’ expressive power for modeling complex tasks with low-dimensional and sparse structural priors.

Details

Motivation: Despite empirical success of MoEs in deep learning, their theoretical foundations for modeling complex tasks remain poorly understood. The paper aims to systematically study MoEs' expressive power in handling tasks with structural priors like low-dimensionality and sparsity.

Method: Theoretical analysis of shallow and deep MoE architectures. For shallow MoEs: prove they can efficiently approximate functions on low-dimensional manifolds. For deep MoEs: analyze how L-layer MoEs with E experts per layer can approximate piecewise functions with E^L pieces exhibiting compositional sparsity.

Result: Shallow MoEs overcome curse of dimensionality for low-dimensional manifold functions. Deep MoEs can represent exponentially many structured tasks (E^L pieces) with compositional sparsity. Analysis reveals roles of gating mechanisms, expert networks, number of experts/layers.

Conclusion: MoEs have strong theoretical expressive power for complex tasks with structural priors. The analysis provides insights into architectural components and offers suggestions for MoE variants.

Abstract: Mixture-of-experts networks (MoEs) have demonstrated remarkable efficiency in modern deep learning. Despite their empirical success, the theoretical foundations underlying their ability to model complex tasks remain poorly understood. In this work, we conduct a systematic study of the expressive power of MoEs in modeling complex tasks with two common structural priors: low-dimensionality and sparsity. For shallow MoEs, we prove that they can efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. For deep MoEs, we show that $\mathcal{O}(L)$-layer MoEs with $E$ experts per layer can approximate piecewise functions comprising $E^L$ pieces with compositional sparsity, i.e., they can exhibit an exponential number of structured tasks. Our analysis reveals the roles of critical architectural components and hyperparameters in MoEs, including the gating mechanism, expert networks, the number of experts, and the number of layers, and offers natural suggestions for MoE variants.
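
The $E^L$ figure can be read as a counting argument: if each of $L$ layers routes an input to exactly one of its $E$ experts, every distinct sequence of expert choices can act as a different piece. The sketch below only enumerates those sequences; top-1 routing and the function name are assumptions for illustration, and it says nothing about the approximation guarantees themselves.

```python
from itertools import product


def routing_paths(num_experts, num_layers):
    """Enumerate every sequence of expert indices, one index per layer."""
    return list(product(range(num_experts), repeat=num_layers))


paths = routing_paths(num_experts=4, num_layers=3)
print(len(paths))  # 64, which equals 4 ** 3
```

The number of paths grows exponentially in depth while the number of experts, and hence parameters, grows only linearly as E times L.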

[346] Navigating the Deep: End-to-End Extraction on Deep Neural Networks

Haolin Liu, Adrien Siproudhis, Samuel Experton, Peter Lorenz, Christina Boura, Thomas Peyrin

Main category: cs.LG

TL;DR: Improved polynomial-time neural network model extraction attack that overcomes limitations of prior work to extract deeper networks (8+ layers vs previous 3 layers) by addressing rank deficiency, noise propagation, and low-confidence neuron issues.

Details

Motivation: Existing model extraction attacks have critical limitations: Carlini et al.'s approach only works on shallow networks and has exponential time complexity, while recent improvements still fail on low-confidence neurons and assume successful signature extraction. There's a need for practical, polynomial-time extraction that works on deeper networks.

Method: 1) Refined signature extraction with algorithmic solutions for rank deficiency and noise propagation from deeper layers; 2) Improved numerical precision in signature extraction; 3) Enhanced sign extraction combining two polynomial methods to avoid exponential search for low-confidence neurons; 4) End-to-end polynomial-time extraction pipeline.

Result: Successfully extracts at least 8 layers of ReLU-based neural networks trained on MNIST and CIFAR-10 datasets, significantly outperforming previous works that could barely extract the first 3 layers of similar-width networks.

Conclusion: Proposes the first practical polynomial-time end-to-end model extraction attack that works on much deeper networks than previously possible, addressing fundamental limitations in both signature and sign extraction phases.

Abstract: Neural network model extraction has recently emerged as an important security concern, as adversaries attempt to recover a network’s parameters via black-box queries. Carlini et al. proposed in CRYPTO'20 a model extraction approach, consisting of two steps: signature extraction and sign extraction. However, in practice this signature-extraction method is limited to very shallow networks only, and the proposed sign-extraction method is exponential in time. Recently, Canales-Martinez et al. (Eurocrypt'24) proposed a polynomial-time sign-extraction method, but it assumes the corresponding signatures have already been successfully extracted and can fail on so-called low-confidence neurons. In this work, we first revisit and refine the signature extraction process by systematically identifying and addressing for the first time critical limitations of Carlini et al.’s signature-extraction method. These limitations include rank deficiency and noise propagation from deeper layers. To overcome these challenges, we propose efficient algorithmic solutions for each of the identified issues. Our approach permits the extraction of much deeper networks than previously possible. In addition, we propose new methods to improve numerical precision in signature extraction, and enhance the sign extraction part by combining two polynomial methods to avoid exponential exhaustive search in the case of low-confidence neurons. This leads to the very first end-to-end model extraction method that runs in polynomial time. We validate our attack through extensive experiments on ReLU-based neural networks, demonstrating significant improvements in extraction depth. For instance, our attack extracts consistently at least eight layers of neural networks trained on either the MNIST or CIFAR-10 datasets, while previous works could barely extract the first three layers of networks of similar width.

[347] Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks

Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček

Main category: cs.LG

TL;DR: A benchmark for fairness-constrained training of deep neural networks using US Census data, with implementation and comparison of three recent algorithms.

Details

Motivation: There's no standard method for constrained training of DNNs to improve fairness, and existing algorithms lack comprehensive evaluation on real-world large-scale tasks.

Method: Created a challenging benchmark using US Census (Folktables) data, implemented three recently proposed fairness-constrained algorithms, and compared them on optimization performance and fairness improvement.

Result: The benchmark enables systematic evaluation of fairness-constrained algorithms; comparison results show varying performance of the three implemented algorithms.

Conclusion: The released benchmark provides a standardized testbed for fairness-constrained DNN training, facilitating future research and algorithm development in this area.

Abstract: The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at https://github.com/humancompatible/train.

[348] KnowIt: Deep Time Series Modeling and Interpretation

M. W. Theunissen, R. Rabe, H. L. Potgieter, M. H. Davel

Main category: cs.LG

TL;DR: KnowIt is a Python toolkit for building and interpreting deep learning models for time series data, offering flexible interfaces for datasets, architectures, and interpretability techniques.

Details

Motivation: The paper addresses the need for a flexible framework that allows users to easily build deep time series models and interpret them, particularly for knowledge discovery in complex time series data where existing tools impose restrictive assumptions.

Method: KnowIt is implemented as a Python toolkit with well-defined interfaces that decouple dataset definitions, neural network architectures, and interpretability techniques, allowing on-the-fly modeling and interpretation of time series data with minimal assumptions about task specifications.

Result: The framework provides an environment where users can import new datasets, create custom architectures, and define different interpretability paradigms while maintaining flexibility for modeling and interpreting various aspects of their own time series data.

Conclusion: KnowIt aims to become a trusted platform for advancing deep time series modeling through ongoing development and collaboration, addressing an underexplored field in knowledge discovery from complex time series data.

Abstract: KnowIt (Knowledge discovery in time series data) is a flexible framework for building deep time series models and interpreting them. It is implemented as a Python toolkit, with source code and documentation available from https://must-deep-learning.github.io/KnowIt. It imposes minimal assumptions about task specifications and decouples the definition of dataset, deep neural network architecture, and interpretability technique through well defined interfaces. This ensures the ease of importing new datasets, custom architectures, and the definition of different interpretability paradigms while maintaining on-the-fly modeling and interpretation of different aspects of a user’s own time series data. KnowIt aims to provide an environment where users can perform knowledge discovery on their own complex time series data through building powerful deep learning models and explaining their behavior. With ongoing development, collaboration and application our goal is to make this a platform to progress this underexplored field and produce a trusted tool for deep time series modeling.

[349] Robust Causal Discovery in Real-World Time Series with Power-Laws

Matteo Tusoni, Giuseppe Masi, Andrea Coletta, Aldo Glielmo, Viviana Arrigoni, Novella Bartolini

Main category: cs.LG

TL;DR: A robust causal discovery method for time series that leverages power-law spectral features to amplify genuine causal signals and reduce noise sensitivity.

Details

Motivation: Causal discovery in stochastic time series is challenging due to high sensitivity to noise, leading to spurious inferences. Many real-world time series exhibit power-law spectral distributions due to self-organizing behavior, which can be leveraged for more robust causal analysis.

Method: The method extracts power-law spectral features from time series data to amplify genuine causal signals. It focuses on frequency spectra that follow power-law distributions, using this inherent property to distinguish true causal relationships from noise.

Result: The method consistently outperforms state-of-the-art alternatives on both synthetic benchmarks and real-world datasets with known causal structures, demonstrating robustness and practical relevance.

Conclusion: Leveraging power-law spectral features provides a robust approach to causal discovery in time series, addressing the noise sensitivity issues of existing methods and improving accuracy in real-world applications.

Abstract: Exploring causal relationships in stochastic time series is a challenging yet crucial task with a vast range of applications, including finance, economics, neuroscience, and climate science. Many algorithms for Causal Discovery (CD) have been proposed; however, they often exhibit a high sensitivity to noise, resulting in spurious causal inferences in real data. In this paper, we observe that the frequency spectra of many real-world time series follow a power-law distribution, notably due to an inherent self-organizing behavior. Leveraging this insight, we build a robust CD method based on the extraction of power-law spectral features that amplify genuine causal signals. Our method consistently outperforms state-of-the-art alternatives on both synthetic benchmarks and real-world datasets with known causal structures, demonstrating its robustness and practical relevance.
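
The abstract does not say how the power-law spectral features are extracted. As a rough illustration only (not the authors' method), one common way to quantify a power-law spectrum S(f) ~ f^(-beta) is to fit a straight line to the log-log periodogram. The function names and the toy signals below are assumptions made for this sketch.

```python
import cmath
import math
import random


def periodogram(series):
    """Return (frequency, power) pairs from a naive O(n^2) DFT of the mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [value - mean for value in series]
    spectrum = []
    for k in range(1, n // 2):  # skip the zero frequency, stay below Nyquist
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * t / n)
            for t, value in enumerate(centered)
        )
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum


def power_law_exponent(series):
    """Estimate beta in S(f) ~ f**(-beta) by least squares on the log-log periodogram."""
    points = [(math.log(f), math.log(p)) for f, p in periodogram(series) if p > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return -slope


# Toy signals (assumed for illustration): white noise and its cumulative sum.
random.seed(0)
white_noise = [random.gauss(0.0, 1.0) for _ in range(256)]
random_walk = []
position = 0.0
for step in white_noise:
    position += step
    random_walk.append(position)

beta_white = power_law_exponent(white_noise)  # roughly flat spectrum
beta_walk = power_law_exponent(random_walk)   # much steeper spectrum
```

In this toy setting the random walk yields a clearly larger exponent than white noise, which is the kind of spectral signature a power-law-based feature could exploit.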

[350] SoK: Data Minimization in Machine Learning

Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski

Main category: cs.LG

TL;DR: First systematization of knowledge for Data Minimization in Machine Learning (DMML), providing a unified framework to connect DM principles with ML privacy/security research.

Details

Motivation: Data minimization is a foundational privacy principle in regulations like GDPR/CPRA, but ML applications typically use large datasets, creating tension. Existing ML privacy/security research addresses DM concerns without explicit connections, causing confusion for practitioners trying to implement DM principles.

Method: Introduces a general DMML framework with unified data pipeline, adversarial models, and minimization points. Systematically reviews data minimization literature and DM-adjacent methodologies, analyzing them through a DM-centric lens.

Result: Provides structured overview to help practitioners/researchers adopt DM principles in ML by identifying relevant techniques and understanding assumptions/trade-offs through DM perspective.

Conclusion: This SoK bridges the gap between data minimization principles and ML privacy/security research, offering practical guidance for implementing DM in ML systems while addressing regulatory compliance.

Abstract: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, we present the first systematization of knowledge (SoK) for DMML. We introduce a general framework for DMML, encompassing a unified data pipeline, adversarial models, and points of minimization. This framework allows us to systematically review data minimization literature as well as DM-adjacent methodologies whose link to DM was often overlooked. Our structured overview is designed to help practitioners and researchers effectively adopt and apply DM principles in ML, by helping them identify relevant techniques and understand underlying assumptions and trade-offs through a DM-centric lens.

[351] Universal Properties of Activation Sparsity in Modern Large Language Models

Filip Szatkowski, Patryk Będkowski, Alessio Devoto, Jan Dubiński, Pasquale Minervini, Mikołaj Piórczyński, Simone Scardapane, Bartosz Wójcik

Main category: cs.LG

TL;DR: A systematic study of activation sparsity in modern LLMs, revealing universal properties across models and scales, with potential for efficiency gains that increases with model size.

Details

Motivation: Activation sparsity has benefits for efficiency, robustness, and interpretability in neural networks, but existing methods don't apply well to modern LLMs, leading to fragmented approaches and lack of general understanding.

Method: Introduces a general framework for evaluating sparsity robustness in contemporary LLMs and conducts systematic investigation of activation sparsity in feedforward layers across diverse model families and scales.

Result: Uncovers universal properties of activation sparsity across different LLM families and scales, showing that potential for effective activation sparsity grows with model size, and includes first study of activation sparsity in diffusion-based LLMs.

Conclusion: Provides comprehensive perspective and practical guidance for leveraging activation sparsity in LLM design and acceleration, highlighting its increasing relevance as models scale.

Abstract: Activation sparsity is an intriguing property of deep neural networks that has been extensively studied in ReLU-based models, due to its advantages for efficiency, robustness, and interpretability. However, methods relying on exact zero activations do not directly apply to modern Large Language Models (LLMs), leading to fragmented, model-specific strategies for LLM activation sparsity and a gap in its general understanding. In this work, we introduce a general framework for evaluating sparsity robustness in contemporary LLMs and conduct a systematic investigation of this phenomenon in their feedforward (FFN) layers. Our results uncover universal properties of activation sparsity across diverse model families and scales. Importantly, we observe that the potential for effective activation sparsity grows with model size, highlighting its increasing relevance as models scale. Furthermore, we present the first study of activation sparsity in diffusion-based LLMs. Overall, our work provides a comprehensive perspective and practical guidance for harnessing activation sparsity in LLM design and acceleration.
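
The abstract does not describe the evaluation framework in detail. A minimal sketch of one way sparsity robustness could be probed, assuming magnitude-based top-k sparsification of an FFN hidden vector (an assumption of this sketch, not necessarily the paper's procedure): zero all but the largest-magnitude activations and measure how much the down-projection output changes. The toy dimensions and random weights are hypothetical.

```python
import math
import random


def sparsify_top_k(hidden, keep_ratio):
    """Zero all but the largest-magnitude entries of a hidden activation vector."""
    k = max(1, int(round(keep_ratio * len(hidden))))
    threshold = sorted((abs(h) for h in hidden), reverse=True)[k - 1]
    return [h if abs(h) >= threshold else 0.0 for h in hidden]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relative_error(reference, approx):
    """L2 error of approx relative to the L2 norm of reference."""
    num = math.sqrt(sum((r - a) ** 2 for r, a in zip(reference, approx)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    return num / den


# Toy FFN slice (hypothetical sizes): heavy-tailed activations and a random down-projection.
random.seed(0)
d_hidden, d_model = 64, 16
hidden = [random.gauss(0.0, 1.0) ** 3 for _ in range(d_hidden)]
w_down = [[random.gauss(0.0, 1.0) for _ in range(d_hidden)] for _ in range(d_model)]

dense_out = matvec(w_down, hidden)
errors = {
    ratio: relative_error(dense_out, matvec(w_down, sparsify_top_k(hidden, ratio)))
    for ratio in (0.1, 0.25, 0.5, 1.0)
}
```

Plotting such errors against the keep ratio for layers of different models would give a simple robustness curve; the choice of top-k thresholding here is only one of several possible sparsification rules.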

[352] Stage-wise Dynamics of Classifier-Free Guidance in Diffusion Models

Cheng Jin, Qitan Shi, Yuantao Gu

Main category: cs.LG

TL;DR: CFG in diffusion models improves conditional fidelity but reduces diversity; analysis reveals three-stage sampling dynamics with multimodal conditionals, explaining trade-offs and suggesting time-varying guidance schedules.

Details

Motivation: Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood, especially for multimodal conditional distributions. Prior studies provide only partial insights, limiting understanding of the diversity-fidelity trade-off.

Method: Theoretical analysis of CFG under multimodal conditional distributions, examining sampling dynamics through three successive stages: Direction Shift, Mode Separation, and Concentration. Experiments validate predictions and test time-varying guidance schedules.

Result: Analysis reveals that early strong guidance erodes global diversity by suppressing weaker modes, while late strong guidance suppresses fine-grained variation. Time-varying guidance schedules consistently improve both quality and diversity.

Conclusion: CFG’s sampling dynamics unfold in three stages that explain the diversity-fidelity trade-off. Time-varying guidance schedules offer a practical solution to balance semantic alignment and diversity in diffusion models.

Abstract: Classifier-Free Guidance (CFG) is widely used to improve conditional fidelity in diffusion models, but its impact on sampling dynamics remains poorly understood. Prior studies, often restricted to unimodal conditional distributions or simplified cases, provide only a partial picture. We analyze CFG under multimodal conditionals and show that the sampling process unfolds in three successive stages. In the Direction Shift stage, guidance accelerates movement toward the weighted mean, introducing initialization bias and norm growth. In the Mode Separation stage, local dynamics remain largely neutral, but the inherited bias suppresses weaker modes, reducing global diversity. In the Concentration stage, guidance amplifies within-mode contraction, diminishing fine-grained variability. This unified view explains a widely observed phenomenon: stronger guidance improves semantic alignment but inevitably reduces diversity. Experiments support these predictions, showing that early strong guidance erodes global diversity, while late strong guidance suppresses fine-grained variation. Moreover, our theory naturally suggests a time-varying guidance schedule, and empirical results confirm that it consistently improves both quality and diversity.
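
The abstract does not give the form of the time-varying schedule. The sketch below pairs the standard CFG combination rule with a hypothetical schedule that keeps guidance weak at the start and end of sampling and strongest in the middle, which is one reading of the finding that strong guidance hurts both early and late. The sine-squared shape and the `w_min` / `w_max` defaults are assumptions, not the authors' schedule.

```python
import math


def cfg_combine(eps_uncond, eps_cond, weight):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by the guidance weight."""
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guidance_weight(progress, w_min=1.0, w_max=7.5):
    """Hypothetical schedule. progress runs from 0.0 (start of sampling) to 1.0 (end);
    the weight is w_min at both ends and peaks at w_max midway."""
    return w_min + (w_max - w_min) * math.sin(math.pi * progress) ** 2


# Guidance weights over a 10-step sampling run (step count is arbitrary).
num_steps = 10
weights = [guidance_weight(step / (num_steps - 1)) for step in range(num_steps)]
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], weight=2.0)
```

A constant-weight baseline is recovered by setting `w_min == w_max`, which makes it easy to compare the two under the same sampler.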

[353] Still Competitive: Revisiting Recurrent Models for Irregular Time Series Prediction

Ankitkumar Joshi, Milos Hauskrecht

Main category: cs.LG

TL;DR: GRUwE: A simple RNN-based model with exponential basis functions for irregularly sampled multivariate time series prediction, achieving competitive performance with SOTA methods while being more efficient and easier to implement.

Details

Motivation: To address the challenge of modeling irregularly sampled multivariate time series in domains like healthcare and sensor networks, and to determine whether simpler RNN-based approaches can compete with complex architectures that have emerged for this problem.

Method: Proposes GRUwE (Gated Recurrent Unit with Exponential basis functions) that maintains a Markov state representation updated via two reset mechanisms: observation-triggered reset for new observations, and time-triggered reset using learnable exponential decays for continuous-time predictions.

Result: GRUwE achieves competitive or superior performance compared to recent state-of-the-art methods on next-observation and next-event prediction tasks across several real-world benchmarks.

Conclusion: Simple RNN-based architectures with appropriate modifications (like exponential basis functions) can effectively handle irregularly sampled time series while offering advantages in implementation simplicity, reduced hyperparameter tuning, and computational efficiency.

Abstract: Modeling irregularly sampled multivariate time series is a persistent challenge in domains like healthcare and sensor networks. While recent works have explored a variety of complex learning architectures to solve the prediction problems for irregularly sampled time series, it remains unclear what the true benefits of some of these architectures are, and whether clever modifications of simpler and more efficient RNN-based algorithms are still competitive, i.e. they are on par with or even superior to these methods. In this work, we propose and study GRUwE: Gated Recurrent Unit with Exponential basis functions, that builds upon RNN-based architectures for observations made at irregular times. GRUwE supports both regression-based and event-based predictions in continuous time. GRUwE works by maintaining a Markov state representation of the time series that updates with the arrival of irregular observations. The Markov state update relies on two reset mechanisms: (i) observation-triggered reset to account for the new observation, and (ii) time-triggered reset that relies on learnable exponential decays, to support the predictions in continuous time. Our empirical evaluations across several real-world benchmarks on next-observation and next-event prediction tasks demonstrate that GRUwE can indeed achieve competitive or superior performance compared to the recent state-of-the-art (SOTA) methods. Thanks to its simplicity, GRUwE offers compelling advantages: it is easy to implement, requires minimal hyper-parameter tuning efforts, and significantly reduces the computational overhead in the online deployment.
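
The two reset mechanisms can be sketched as follows. This is a simplified illustration, not the GRUwE equations: the time-triggered reset is modeled as a per-dimension exponential decay, and the observation-triggered reset is replaced by a fixed convex blend standing in for a learned GRU-style update. All function names, decay rates, and gate values are hypothetical.

```python
import math


def time_triggered_reset(state, decay_rates, elapsed):
    """Decay each state dimension exponentially over the elapsed time
    (in a trained model the rates would be learned)."""
    return [h * math.exp(-rate * elapsed) for h, rate in zip(state, decay_rates)]


def observation_triggered_reset(state, observation, gate):
    """Blend the current state with a new observation; a stand-in for a learned update."""
    return [(1.0 - g) * h + g * x for h, x, g in zip(state, observation, gate)]


def run(events, decay_rates, gate, query_time):
    """events: time-sorted list of (timestamp, observation vector).
    Returns the state propagated to query_time."""
    state = [0.0] * len(decay_rates)
    last_time = events[0][0]
    for timestamp, observation in events:
        state = time_triggered_reset(state, decay_rates, timestamp - last_time)
        state = observation_triggered_reset(state, observation, gate)
        last_time = timestamp
    return time_triggered_reset(state, decay_rates, query_time - last_time)


# Three observations with irregular gaps; dimension 0 never decays, dimension 1 halves per time unit.
events = [(0.0, [1.0, 1.0]), (0.4, [0.0, 2.0]), (3.1, [1.0, 0.0])]
state_at_query = run(events, decay_rates=[0.0, math.log(2.0)], gate=[0.5, 0.5], query_time=4.1)
```

Because the state depends only on its previous value, the elapsed time, and the newest observation, it can be updated online in constant time per event, which is consistent with the low deployment overhead the abstract highlights.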

[354] Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

Daniel Bethell, Simos Gerasimou, Radu Calinescu, Calum Imrie

Main category: cs.LG

TL;DR: USC (Uncertain Safety Critic) is a novel safe RL approach that uses uncertainty-aware modulation to balance safety constraints and task performance by concentrating conservatism in uncertain/costly regions while preserving sharp gradients in safe areas.

Details

Motivation: Existing safe RL methods struggle to balance safety and performance - overly conservative methods cripple task performance while reward-focused methods frequently violate safety constraints, creating diffuse cost landscapes that flatten gradients and stall policy improvement.

Method: Introduces Uncertain Safety Critic (USC) that integrates uncertainty-aware modulation and refinement into critic training, concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas.

Result: USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%.

Conclusion: USC breaks the prevailing trade-off between safety and performance in RL, enabling effective reward-safety trade-offs and paving the way for scalable safe RL.

Abstract: Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward leave safety constraints frequently violated, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.

[355] Transformers Provably Learn Algorithmic Solutions for Graph Connectivity, But Only with the Right Data

Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan

Main category: cs.LG

TL;DR: Transformers often learn brittle heuristics instead of generalizable algorithms. Using graph connectivity as a testbed, the paper shows that whether Transformers learn algorithmic solutions depends on whether training instances are within their capacity (diameter ≤ 3^L). Within-capacity graphs drive algorithmic learning, while beyond-capacity graphs lead to simple heuristics.

Details

Motivation: Transformers frequently fail to learn generalizable algorithms and instead rely on brittle heuristics. The paper aims to explain this phenomenon theoretically and empirically using graph connectivity as a testbed to understand when and why Transformers learn algorithmic solutions versus simple heuristics.

Method: The paper uses a simplified Transformer architecture called the Disentangled Transformer. It proves theoretically that an L-layer model can compute connectivity in graphs with diameters up to 3^L, implementing an algorithm equivalent to computing powers of the adjacency matrix. The analysis examines training dynamics to determine when models learn algorithmic solutions versus heuristics.

Result: Theoretical analysis shows that whether Transformers learn algorithmic solutions depends on whether most training instances are within model capacity. Within-capacity graphs (diameter ≤ 3^L) drive learning of algorithmic solutions, while beyond-capacity graphs lead to learning of simple heuristics based on node degrees. Empirically, restricting training data to stay within model capacity enables both standard and Disentangled Transformers to learn exact algorithms.

Conclusion: Transformers’ tendency to learn heuristics versus algorithms depends critically on whether training data is within their capacity. By controlling training data complexity to match model capacity, Transformers can be made to learn generalizable algorithmic solutions rather than brittle heuristics.

Abstract: Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the Disentangled Transformer, and prove that an $L$-layer model can compute connectivity in graphs with diameters up to $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. By analyzing training dynamics, we prove that whether the model learns this strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of the algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically show that restricting training data to stay within a model’s capacity makes both standard and Disentangled Transformers learn the exact algorithm.
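
To see why a capacity of $3^L$ is compatible with an algorithm based on adjacency-matrix powers, consider the following sketch. It assumes, purely for illustration, that each layer cubes a boolean reachability matrix that starts as the adjacency matrix plus self-loops; this is not the Disentangled Transformer itself, and the helper names are hypothetical.

```python
def bool_matmul(a, b):
    """Boolean matrix product: entry (i, j) is True if some k links i to j."""
    n = len(a)
    return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def reachability_after_layers(adjacency, num_layers):
    """Start from (I + A), which covers paths of length <= 1, and cube once per layer.
    After L layers the matrix covers paths of length <= 3**L."""
    n = len(adjacency)
    reach = [[bool(adjacency[i][j]) or i == j for j in range(n)] for i in range(n)]
    for _ in range(num_layers):
        reach = bool_matmul(bool_matmul(reach, reach), reach)
    return reach


def path_graph(num_nodes):
    """Undirected path 0 - 1 - ... - (num_nodes - 1); its diameter is num_nodes - 1."""
    adjacency = [[False] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes - 1):
        adjacency[i][i + 1] = adjacency[i + 1][i] = True
    return adjacency


within_capacity = reachability_after_layers(path_graph(10), 2)[0][9]   # diameter 9 = 3**2
beyond_capacity = reachability_after_layers(path_graph(11), 2)[0][10]  # diameter 10 > 3**2
```

With two layers the endpoints of the 10-node path are found to be connected, while the endpoints of the 11-node path are missed; a third layer recovers them. This mirrors the within-capacity versus beyond-capacity distinction in the abstract.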

[356] Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs

Xiaoke Huang, Ningsen Wang, Hui Liu, Xianfeng Tang, Yuyin Zhou

Main category: cs.LG

TL;DR: MedVLSynther is a framework that generates high-quality medical VQA questions from biomedical literature using a generator-verifier pipeline, creating MedSynVQA dataset to train multimodal models.

Details

Motivation: Training general medical VQA systems is hindered by lack of large, open, high-quality datasets. Existing medical VQA datasets are limited in size and quality, making it difficult to train effective multimodal models for medical applications.

Method: A rubric-guided generator-verifier framework that synthesizes multiple-choice VQA items from biomedical literature (figures, captions, in-text references). Generator produces stems and options in JSON schema; multi-stage verifier enforces quality gates (self-containment, single correct answer, clinical validity, image-text consistency) with point-based scoring.

Result: Created MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training LMMs with this data improved accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA.

Conclusion: MedVLSynther provides an auditable, reproducible, privacy-preserving path to scalable medical VQA training data using open literature and open-weight models, with both generation and verification being essential for quality.

Abstract: Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions. Training open-weight LMMs with reinforcement learning using verifiable rewards improves accuracy across six medical VQA benchmarks, achieving averages of 55.85 (3B) and 58.15 (7B), with up to 77.57 on VQA-RAD and 67.76 on PathVQA, outperforming strong medical LMMs. Ablations verify that both generation and verification are necessary and that more verified data consistently helps, and a targeted contamination analysis detects no leakage from evaluation suites. By operating entirely on open literature and open-weight models, MedVLSynther offers an auditable, reproducible, and privacy-preserving path to scalable medical VQA training data.

[357] Shrinking the Variance: Shrinkage Baselines for Reinforcement Learning with Verifiable Rewards

Guanning Zeng, Zhaoyi Zhou, Daman Arora, Andrea Zanette

Main category: cs.LG

TL;DR: Proposes shrinkage estimators for RLVR baselines that combine per-prompt and across-prompt means to reduce variance in policy-gradient training of large reasoning models.

Details

Motivation: Current RLVR methods use empirical mean rewards as baselines to stabilize training, but these can be inaccurate with limited generations per prompt. Stein's paradox suggests shrinkage estimators could improve mean estimation accuracy in low-generation regimes.

Method: Develops shrinkage estimators that combine per-prompt empirical means with across-prompt means to create better baselines. These are theoretically proven to yield lower-variance policy-gradient estimators and require no additional hyperparameters or computation.

Result: Shrinkage baselines consistently outperform standard empirical-mean baselines, producing lower-variance gradient updates and improved training stability in RLVR applications.

Conclusion: Shrinkage estimators provide a simple, effective drop-in replacement for standard baselines in RLVR, offering theoretical guarantees and practical improvements for training large reasoning models.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for post-training large reasoning models (LRMs) using policy-gradient methods such as GRPO. To stabilize training, these methods typically center trajectory rewards by subtracting the empirical mean reward for each prompt. Statistically, this centering acts as a control variate (baseline), reducing the variance of the policy-gradient estimator. In practice, the mean reward is estimated using per-prompt empirical averages computed from the generations for each prompt in a batch. Motivated by Stein’s paradox, we propose shrinkage estimators that combine per-prompt and across-prompt means to improve per-prompt mean estimation accuracy, especially in the low-generation regime typical of RLVR. Theoretically, we construct a shrinkage-based baseline that provably yields lower-variance policy-gradient estimators across algorithms. Our baseline is a drop-in replacement for standard per-prompt mean baselines and requires no additional hyperparameters or computation. Empirically, shrinkage baselines consistently outperform empirical-mean baselines, producing lower-variance gradient updates and improved training stability.
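
The abstract does not state the exact estimator. A minimal sketch of the general idea, assuming an empirical-Bayes style shrinkage coefficient computed from the data (an assumption of this sketch, not the paper's formula): each per-prompt mean is pulled toward the across-prompt mean, more strongly when per-prompt means are noisy relative to their spread. Function names and the toy rewards are hypothetical.

```python
def sample_variance(values):
    """Unbiased sample variance; requires at least two values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def shrinkage_baselines(rewards_per_prompt):
    """Shrink each per-prompt mean reward toward the across-prompt mean.

    Assumes at least two prompts and at least two generations per prompt.
    """
    prompt_means = [sum(r) / len(r) for r in rewards_per_prompt]
    grand_mean = sum(prompt_means) / len(prompt_means)
    # Average sampling noise of a per-prompt mean (variance of the mean).
    noise = sum(sample_variance(r) / len(r) for r in rewards_per_prompt) / len(rewards_per_prompt)
    # Observed spread of the per-prompt means around the grand mean.
    spread = sum((m - grand_mean) ** 2 for m in prompt_means) / (len(prompt_means) - 1)
    shrink = 1.0 if spread == 0.0 else min(1.0, noise / spread)
    return [grand_mean + (1.0 - shrink) * (m - grand_mean) for m in prompt_means]


# Binary verifiable rewards for three prompts, eight generations each (toy data).
rewards = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
]
baselines = shrinkage_baselines(rewards)
advantages = [[r - b for r in rs] for rs, b in zip(rewards, baselines)]
```

The coefficient is derived from the batch itself, so the sketch introduces no extra hyperparameter, in line with the drop-in property claimed in the abstract.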

[358] A Versatile Variational Quantum Kernel Framework for Non-Trivial Classification

Jiang Yuhan, Matthew Otten

Main category: cs.LG

TL;DR: Quantum kernel methods benchmarked on diverse real-world datasets show competitive performance with classical kernels, demonstrating potential for practical quantum machine learning applications.

Details

Motivation: To address the gap in evaluating quantum kernel methods on diverse, high-dimensional real-world data, moving beyond limited low-dimensional or synthetic datasets that have prevented thorough assessment of their practical potential.

Method: Developed an algorithmic framework for variational quantum kernels using resource-efficient ansätze for complex classification tasks, introduced parameter scaling technique to accelerate convergence, and benchmarked on eight challenging real-world datasets covering tabular, image, time series, and graph data.

Result: The proposed quantum kernels demonstrate competitive classification accuracy compared to standard classical kernels (like RBF kernel) in classical simulation, showing that properly designed quantum kernels can function as versatile, high-performance tools.

Conclusion: This work demonstrates that quantum kernels can be effective for real-world machine learning applications, laying a foundation for quantum-enhanced applications, though further research is needed to fully assess practical performance of quantum methods.

Abstract: Quantum kernel methods are a promising branch of quantum machine learning, yet their effectiveness on diverse, high-dimensional, real-world data remains unverified. Current research has largely been limited to low-dimensional or synthetic datasets, preventing a thorough evaluation of their potential. To address this gap, we developed an algorithmic framework for variational quantum kernels utilizing resource-efficient ansätze for complex classification tasks and introduced a parameter scaling technique to accelerate convergence. We conducted a comprehensive benchmark of this framework on eight challenging, real-world, high-dimensional datasets covering tabular, image, time series, and graph data. Our results show that, in classical simulation, the proposed quantum kernels achieve classification accuracy competitive with standard classical kernels such as the radial basis function (RBF) kernel. This work demonstrates that properly designed quantum kernels can function as versatile, high-performance tools, laying a foundation for quantum-enhanced applications in real-world machine learning. Further research is needed to fully assess the practical performance of quantum methods.
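
The abstract does not specify the ansatz. As a toy illustration of what a quantum kernel computes in classical simulation, the sketch below evaluates a fidelity kernel for the simplest possible feature map: one qubit per feature, each rotated by RY(scale * feature), with no entanglement. The per-feature `scales` play the role of trainable parameters; this circuit and these names are assumptions of the sketch, not the paper's framework.

```python
import math


def fidelity_kernel(x, y, scales):
    """Kernel |<phi(x)|phi(y)>|^2 for a product state with one RY rotation per qubit."""
    value = 1.0
    for xi, yi, s in zip(x, y, scales):
        # RY(a)|0> = [cos(a/2), sin(a/2)], so the single-qubit overlap is cos((a - b) / 2).
        value *= math.cos(s * (xi - yi) / 2.0) ** 2
    return value


def gram_matrix(points, scales):
    """Pairwise kernel values, usable as a precomputed kernel in a classical SVM."""
    return [[fidelity_kernel(p, q, scales) for q in points] for p in points]


# Three 2-feature points and hypothetical scale parameters.
points = [[0.0, 0.0], [0.5, 1.0], [3.0, -1.0]]
gram = gram_matrix(points, scales=[1.0, 0.5])
```

For this product-state map the kernel has a closed form, so no state-vector simulation is needed; richer ansätze with entangling gates would require simulating the circuit.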

[359] Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

Houtan Ghaffari, Lukas Rauch, Paul Devos

Main category: cs.LG

TL;DR: A three-stage training pipeline for birdsong syllable annotation using self-supervised learning, supervised training with augmentations, and semi-supervised post-training, demonstrated on complex Canary songs.

Details

Motivation: Birdsongs are used as proxy models in various research fields, but developing models requires precisely annotated syllable data. There's a need for automated, data-efficient methods to reduce annotation costs and expert labor.

Method: 1) Proposes Residual-MLP-RNN architecture for birdsong annotation. 2) Three-stage training pipeline: self-supervised learning (masked prediction or online clustering), supervised training with data augmentations for frame-level detection, and semi-supervised post-training aligned with downstream task.

Result: Demonstrated performance on complex Canary songs in extreme label-scarcity scenarios. Canary’s difficult song implicitly validates the method for other birds. Also assessed self-supervised embeddings for linear probing and unsupervised birdsong analysis.

Conclusion: Presents a robust, data-efficient approach for birdsong syllable annotation that minimizes expert labor through a combination of self-supervised, supervised, and semi-supervised learning techniques.

Abstract: Many studies in bioacoustics, neuroscience, and linguistics use birdsongs as proxy models to acquire knowledge in diverse areas. Developing models generally requires precisely annotated data at the level of syllables. Hence, automated and data-efficient methods that reduce annotation costs are in demand. This work presents a lightweight yet performant neural network architecture for birdsong annotation called Residual-MLP-RNN. Then, it presents a robust three-stage training pipeline for developing reliable deep birdsong syllable detectors with minimal expert labor. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentations to create a robust model for frame-level syllable detection. The third stage is semi-supervised post-training, which leverages the unlabeled data again. However, unlike the initial phase, this time it is aligned with the downstream task. The performance of this data-efficient approach is demonstrated for the complex song of the canary in extreme label-scarcity scenarios. The canary has one of the most difficult songs to annotate, which implicitly validates the method for other birds. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.

[360] Watch Out for the Lifespan: Evaluating Backdoor Attacks Against Federated Model Adaptation

Bastien Vuillod, Pierre-Alain Moellic, Jean-Max Dutertre

Main category: cs.LG

TL;DR: Analysis of how LoRA (Low-Rank Adaptation) affects backdoor attack persistence in Federated Learning, showing lower LoRA rank leads to longer backdoor lifespan after optimal injection.

Details

Motivation: Federated Learning with large models using Parameter-Efficient Fine-Tuning (like LoRA) faces security threats, particularly backdoor attacks. There's a need to understand how LoRA affects backdoor persistence in FL systems to improve security evaluations.

Method: Analyzes influence of LoRA on state-of-the-art backdoor attacks targeting model adaptation in FL. Focuses on backdoor lifespan as critical characteristic, examining how LoRA rank affects persistence after attack injection.

Result: Key finding: For optimally injected backdoors, persistence is longer when LoRA’s rank is lower. Also highlights evaluation issues of backdoor attacks against FL and contributes to more robust evaluation methods.

Conclusion: LoRA rank significantly impacts backdoor lifespan in FL systems. Lower ranks lead to more persistent backdoors. The work improves backdoor attack evaluation methodologies for FL security assessments.

Abstract: Adapting large models through Federated Learning (FL) addresses a wide range of use cases and is enabled by Parameter-Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA). However, this distributed learning paradigm faces several security threats, particularly to its integrity, such as backdoor attacks that aim to inject malicious behavior during the local training steps of certain clients. We present the first analysis of the influence of LoRA on state-of-the-art backdoor attacks targeting model adaptation in FL. Specifically, we focus on backdoor lifespan, a critical characteristic in FL that can vary depending on the attack scenario and the attacker’s ability to effectively inject the backdoor. A key finding in our experiments is that for an optimally injected backdoor, the backdoor persistence after the attack is longer when the LoRA rank is lower. Importantly, our work highlights evaluation issues of backdoor attacks against FL and contributes to the development of more robust and fair evaluations of backdoor attacks, enhancing the reliability of risk assessments for critical FL systems. Our code is publicly available.
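
For readers unfamiliar with the rank parameter studied here, the following sketch shows a LoRA-adapted linear layer: the frozen weight is augmented by a low-rank product scaled by alpha / rank, and only the two small factors are trained (and, in FL, exchanged). It illustrates LoRA itself, not the paper's attack or federated setup; the helper names and toy values are hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute W x + (alpha / rank) * B (A x).

    lora_a has shape (rank, d_in); lora_b has shape (d_out, rank).
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


def trainable_parameters(d_in, d_out, rank):
    """Number of trainable LoRA parameters for one adapted weight matrix."""
    return rank * (d_in + d_out)


x = [1.0, 2.0]
w_frozen = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 1.0]]                      # rank 1
zero_b = [[0.0], [0.0]]                    # B starts at zero, so the adapter is a no-op
trained_b = [[1.0], [0.0]]                 # hypothetical value after some training

initial_output = lora_forward(x, w_frozen, lora_a, zero_b, alpha=2.0)
adapted_output = lora_forward(x, w_frozen, lora_a, trained_b, alpha=2.0)
```

A lower rank means fewer trainable parameters per layer, which is the quantity the abstract relates to backdoor persistence.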

[361] Adaptive Aggregation with Two Gains in QFL

S Nanayakkara

Main category: cs.LG

TL;DR: A2G framework for quantum federated learning with adaptive aggregation using geometry and QoS gains to handle quantum network heterogeneity.

Details

Motivation: Federated learning in quantum-enabled heterogeneous networks suffers from performance degradation due to uneven client quality, stochastic teleportation fidelity, device instability, and geometric mismatch between classical and quantum models. Classical aggregation rules are inadequate for quantum federated systems.

Method: Introduces A2G (Adaptive Aggregation with Two Gains) - a dual gain framework with geometry gain for regulating geometric blending and QoS gain derived from teleportation fidelity, latency, and instability for modulating client importance.

Result: The abstract reports no empirical results; it only describes the proposed adaptive aggregation framework.

Conclusion: A2G provides a solution for quantum federated learning systems by addressing geometric and quality-of-service challenges in heterogeneous quantum networks.

Abstract: Federated learning (FL) deployed over quantum-enabled and heterogeneous classical networks faces significant performance degradation due to uneven client quality, stochastic teleportation fidelity, device instability, and geometric mismatch between local and global models. Classical aggregation rules assume Euclidean topology and uniform communication reliability, limiting their suitability for emerging quantum federated systems. This paper introduces A2G (Adaptive Aggregation with Two Gains), a dual-gain framework that jointly regulates geometric blending through a geometry gain and modulates client importance using a QoS gain derived from teleportation fidelity, latency, and instability.

[362] Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs

David Graber, Victor Armegioiu, Rebecca Buller, Siddhartha Mishra

Main category: cs.LG

TL;DR: A probabilistic OOD detection framework for 3D graph data using diffusion models with unified continuous diffusion over both coordinates and discrete features, validated on protein-ligand complexes.

Details

Motivation: Machine learning models degrade on out-of-distribution data, but OOD detection is challenging for irregular 3D graphs that combine continuous geometry with categorical identities. Reliable deployment requires robust OOD detection for such complex data structures.

Method: Uses a diffusion model that learns training distribution density unsupervised. Introduces unified continuous diffusion over 3D coordinates and discrete features: categorical identities embedded in continuous space with cross-entropy training, diffusion score obtained analytically via posterior-mean interpolation. Creates single self-consistent probability-flow ODE (PF-ODE) producing per-sample log-likelihoods.

Result: PF-ODE likelihoods identify held-out protein families as OOD and correlate strongly with prediction errors of independent binding-affinity model (GEMS). Multi-scale PF-ODE trajectory statistics (path tortuosity, flow stiffness, vector-field instability) provide complementary OOD information. Joint modeling of trajectory features yields high-sensitivity detector improving separation over likelihood-only baselines.

Conclusion: The framework offers principled OOD detection for 3D graph data, enabling a priori reliability estimates and label-free OOD quantification workflow for geometric deep learning applications like protein-ligand complexes.

Abstract: Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. Reliable deployment therefore requires robust OOD detection, yet this is particularly challenging for irregular 3D graphs that combine continuous geometry with categorical identities and are unordered by construction. Here, we present a probabilistic OOD detection framework for complex 3D graph data built on a diffusion model that learns a density of the training distribution in a fully unsupervised manner. A key ingredient we introduce is a unified continuous diffusion over both 3D coordinates and discrete features: categorical identities are embedded in a continuous space and trained with cross-entropy, while the corresponding diffusion score is obtained analytically via posterior-mean interpolation from predicted class probabilities. This yields a single self-consistent probability-flow ODE (PF-ODE) that produces per-sample log-likelihoods, providing a principled typicality score for distribution shift. We validate the approach on protein-ligand complexes and construct strict OOD datasets by withholding entire protein families from training. PF-ODE likelihoods identify held-out families as OOD and correlate strongly with prediction errors of an independent binding-affinity model (GEMS), enabling a priori reliability estimates on new complexes. Beyond scalar likelihoods, we show that multi-scale PF-ODE trajectory statistics - including path tortuosity, flow stiffness, and vector-field instability - provide complementary OOD information. Modeling the joint distribution of these trajectory features yields a practical, high-sensitivity detector that improves separation over likelihood-only baselines, offering a label-free OOD quantification workflow for geometric deep learning.

[363] Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback

Tomas Ortega, Chun-Yin Huang, Xiaoxiao Li, Hamid Jafarkhani

Main category: cs.LG

TL;DR: Novel compression frameworks CAFe and CAFe-S enable biased compression in federated learning without client-side state, addressing communication bottlenecks while maintaining privacy and compatibility with stateless clients.

Details

Motivation: Federated Learning faces communication bottlenecks, especially in uplink transmission. Biased compression helps but requires error feedback mechanisms that rely on client-specific control variates, which violate privacy and are incompatible with stateless clients common in large-scale FL.

Method: Two frameworks: 1) CAFe uses globally aggregated update from previous round as shared control variate for all clients. 2) CAFe-S extends this for scenarios where server has small private dataset, generating server-guided candidate update as more accurate predictor. Both avoid client-side state.

Result: Analytical proof shows CAFe’s superiority over Distributed Compressed Gradient Descent with biased compression in non-convex regime with bounded gradient dissimilarity. CAFe-S converges to stationary point with rate improving as server’s data become more representative. Experimental results validate superiority over existing compression schemes.

Conclusion: Proposed frameworks enable efficient biased compression in FL without client-side state, addressing communication bottlenecks while maintaining privacy and compatibility with stateless clients, with CAFe-S offering additional benefits when server has representative data.

Abstract: Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the communication cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased compression techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when compression is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased compression without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe’s superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server’s data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing compression schemes.
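
The aggregate-feedback idea can be illustrated with a minimal sketch. It assumes a top-k compressor and a server that adds the mean compressed difference back onto the previous aggregate; the function names (`top_k`, `cafe_round`, `dcgd_round`) and the exact server rule are illustrative assumptions, not the authors' implementation.

```python
def top_k(vec, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def cafe_round(client_updates, prev_aggregate, k):
    """One round with aggregate feedback (assumed form, for illustration).

    Every client compresses its update relative to the same shared vector,
    the previous round's aggregate, so no per-client state is stored.
    """
    n, dim = len(client_updates), len(prev_aggregate)
    messages = [
        top_k([u[j] - prev_aggregate[j] for j in range(dim)], k)
        for u in client_updates
    ]
    mean_msg = [sum(m[j] for m in messages) / n for j in range(dim)]
    # The server re-adds the shared control variate to the mean message.
    return [prev_aggregate[j] + mean_msg[j] for j in range(dim)]


def dcgd_round(client_updates, k):
    """Baseline: compress the raw client updates directly."""
    n, dim = len(client_updates), len(client_updates[0])
    messages = [top_k(u, k) for u in client_updates]
    return [sum(m[j] for m in messages) / n for j in range(dim)]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


prev = [1.0, 1.0, 1.0, 1.0]                      # aggregate from the previous round
updates = [[1.1, 1.0, 1.0, 0.9], [0.9, 1.0, 1.0, 1.1]]
true_mean = [sum(u[j] for u in updates) / len(updates) for j in range(4)]

cafe_err = squared_error(cafe_round(updates, prev, k=1), true_mean)
dcgd_err = squared_error(dcgd_round(updates, k=1), true_mean)
print(cafe_err, dcgd_err)
```

When consecutive aggregates are similar, the differences sent by clients are small, so aggressive compression discards less of the signal than compressing the raw updates.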

[364] Inverting Non-Injective Functions with Twin Neural Network Regression

Sebastian J. Wetzel

Main category: cs.LG

TL;DR: Twin Neural Network Regression is a deterministic method for learning inverse mappings of non-injective functions by anchoring predictions to locally invertible regions around known anchor points.

Details

Motivation: Non-injective functions are not globally invertible, but can be restricted to locally injective subdomains. Current probabilistic inversion methods are limited, and there's a need for deterministic approaches to resolve multi-valued inverse mappings.

Method: Reformulates inverse learning as collection of locally invertible problems. Uses Twin Neural Network Regression to predict local inverse corrections around known anchor points, anchoring predictions to points within same locally invertible region to consistently select valid branch of inverse.

Result: Demonstrated on problems defined by mathematical equations or data, including multi-solution toy problems and robot arm inverse kinematics. Provides deterministic framework for resolving multi-valued inverse mappings.

Conclusion: Twin Neural Network Regression offers deterministic approach for learning inverse mappings of non-injective functions by leveraging local invertibility around anchor points, addressing limitations of probabilistic methods.

Abstract: Non-injective functions are not globally invertible. However, they can often be restricted to locally injective subdomains where the inversion is well-defined. In many settings a preferred solution can be selected even when multiple valid preimages exist or input and output dimensions differ. This manuscript describes a natural reformulation of the inverse learning problem for non-injective functions as a collection of locally invertible problems. More precisely, Twin Neural Network Regression is trained to predict local inverse corrections around known anchor points. By anchoring predictions to points within the same locally invertible region, the method consistently selects a valid branch of the inverse. In contrast to current probabilistic state-of-the-art inversion methods, Inverse Twin Neural Network Regression is a deterministic framework for resolving multi-valued inverse mappings. I demonstrate the approach on problems that are defined by mathematical equations or by data, including multi-solution toy problems and robot arm inverse kinematics.
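
A toy sketch may help show why anchoring selects a branch. It inverts f(x) = x^2, which has two preimages for every positive output. The learned twin network is replaced here by an analytic first-order correction, which is a stand-in chosen for illustration and not the paper's model; all names are hypothetical.

```python
def f(x):
    return x * x


def f_prime(x):
    return 2.0 * x


def local_inverse(y_query, x_anchor):
    """Stand-in for the twin model: anchor input plus a predicted correction.

    A trained twin network would predict the correction from
    (y_query, y_anchor, x_anchor); here a first-order estimate is used.
    """
    y_anchor = f(x_anchor)
    correction = (y_query - y_anchor) / f_prime(x_anchor)
    return x_anchor + correction


def invert_on_branch(y_query, anchors):
    """Pick the anchor with the closest output; all anchors share one branch."""
    x_anchor = min(anchors, key=lambda x: abs(f(x) - y_query))
    return local_inverse(y_query, x_anchor)


positive_branch = [0.5, 1.0, 1.5, 2.0, 2.5]
negative_branch = [-x for x in positive_branch]

x_pos = invert_on_branch(4.41, positive_branch)   # close to +2.1
x_neg = invert_on_branch(4.41, negative_branch)   # close to -2.1
print(x_pos, x_neg)
```

The same query output yields different, equally valid inverses depending only on which set of anchors is supplied, which is the deterministic branch selection the abstract describes.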

[365] Imitation Learning for Combinatorial Optimisation under Uncertainty

Prakash Gawas, Antoine Legrain, Louis-Martin Rousseau

Main category: cs.LG

TL;DR: Systematic taxonomy and framework for expert construction in imitation learning for combinatorial optimization under uncertainty, with evaluation on dynamic physician-patient assignment problem.

Details

Motivation: Existing imitation learning approaches for combinatorial optimization use diverse expert constructions without a unifying framework to characterize their assumptions, computational properties, and impact on learning performance.

Method: Proposes a systematic taxonomy of experts along three dimensions: (1) treatment of uncertainty, (2) level of optimality (task-optimal vs. approximate), and (3) interaction mode with learner. Introduces generalized DAgger framework supporting multiple expert queries, aggregation, and flexible interaction strategies.

Result: Evaluation on dynamic physician-to-patient assignment problem shows policies learned from stochastic experts outperform those from deterministic/full-information experts. Interactive learning improves solution quality with fewer demonstrations. Aggregated deterministic experts provide effective alternative when stochastic optimization is computationally challenging.

Conclusion: Provides systematic framework for expert construction in imitation learning for combinatorial optimization, demonstrating importance of expert choice and interaction strategies for learning performance.

Abstract: Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the expert that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance. This paper introduces a systematic taxonomy of experts for imitation learning in combinatorial optimisation under uncertainty. The literature is classified along three principal dimensions: (i) treatment of uncertainty; (ii) level of optimality, distinguishing task-optimal and approximate experts; and (iii) interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. We further identify additional categories capturing other relevant expert characteristics. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) framework that accommodates multiple expert queries, expert aggregation, and flexible interaction strategies. The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.

[366] Reinforcement Unlearning via Group Relative Policy Optimization

Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci

Main category: cs.LG

TL;DR: PURGE is a novel LLM unlearning method that uses relative group policy optimization to efficiently remove sensitive/copyrighted data while maintaining model fluency and robustness, achieving better performance than existing approaches.

Details

Motivation: LLMs inadvertently memorize sensitive/copyrighted data during pretraining, creating compliance challenges under GDPR and EU AI Act. Existing unlearning methods often leak data, sacrifice fluency/robustness, or require costly external reward models.

Method: PURGE (Policy Unlearning through Relative Group Erasure) uses Group Relative Policy Optimization framework with intrinsic reward signal that penalizes mentions of forbidden concepts, formulating unlearning as a verifiable problem.

Result: Achieves up to 46x lower token usage per target than SOTA methods, improves fluency by +5.48% and adversarial robustness by +12.02% over base model. On RWKU benchmark: 11% unlearning effectiveness while preserving 98% of original utility.

Conclusion: Framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction combining theoretical guarantees, improved safety, and practical deployment efficiency.

Abstract: During pretraining, LLMs inadvertently memorize sensitive or copyrighted data, posing significant compliance challenges under legal frameworks like the GDPR and the EU AI Act. Fulfilling these mandates demands techniques that can remove information from a deployed model without retraining from scratch. Existing unlearning approaches attempt to address this need, but often leak the very data they aim to erase, sacrifice fluency and robustness, or depend on costly external reward models. We introduce PURGE (Policy Unlearning through Relative Group Erasure), a novel method grounded in the Group Relative Policy Optimization framework that formulates unlearning as a verifiable problem. PURGE uses an intrinsic reward signal that penalizes any mention of forbidden concepts, allowing safe and consistent unlearning. Our approach achieves up to 46x lower token usage per target than state-of-the-art methods, while improving fluency by +5.48% and adversarial robustness by +12.02% over the base model. Extensive evaluation on the Real World Knowledge Unlearning (RWKU) benchmark shows that PURGE reaches 11% unlearning effectiveness while preserving 98% of original utility. PURGE shows that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, suggesting a promising new direction for unlearning research that combines theoretical guarantees, improved safety, and practical deployment efficiency.
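
A verifiable reward of this kind can be sketched as follows. The binary reward with case-insensitive substring matching is an assumption made for illustration; the advantage normalization (reward minus group mean, divided by group standard deviation) is the standard group-relative form. The example strings and the forbidden term are hypothetical.

```python
import statistics


def intrinsic_reward(completion, forbidden_terms):
    """Assumed reward: 1.0 if no forbidden term is mentioned, else 0.0."""
    text = completion.lower()
    return 0.0 if any(term.lower() in text for term in forbidden_terms) else 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


forbidden = ["Jane Example"]                      # hypothetical forget target
group = [
    "Jane Example was born in 1970.",
    "I do not have information about that person.",
    "I cannot help with that request.",
    "The author jane example wrote several novels.",
]

rewards = [intrinsic_reward(c, forbidden) for c in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Completions that avoid the target receive positive advantages and those that mention it receive negative ones, with no external reward model involved; if every completion in a group earns the same reward, all advantages are zero.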

[367] Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer

Main category: cs.LG

TL;DR: QVG is a training-free KV cache quantization framework for autoregressive video diffusion models that reduces memory usage by up to 7x with minimal latency overhead while maintaining generation quality.

Details

Motivation: KV cache memory bottleneck in autoregressive video diffusion models limits deployability on widely available hardware and degrades long-horizon consistency in identity, layout, and motion due to constrained working memory.

Method: QVG uses Semantic Aware Smoothing to leverage video spatiotemporal redundancy for low-magnitude quantization-friendly residuals, and Progressive Residual Quantization - a coarse-to-fine multi-stage scheme that reduces quantization error while enabling quality-memory trade-offs.

Result: QVG reduces KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead, consistently outperforming existing baselines in generation quality across LongCat Video, HY WorldPlay, and Self Forcing benchmarks.

Conclusion: QVG establishes a new Pareto frontier between quality and memory efficiency for autoregressive video diffusion models, addressing the critical KV cache memory bottleneck without requiring retraining.

Abstract: Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while enabling a smooth quality-memory trade-off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead while consistently outperforming existing baselines in generation quality.
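
The coarse-to-fine residual idea can be sketched in a few lines. This is a simplified illustration that assumes a uniform min-max 2-bit quantizer at each stage and omits the smoothing step entirely; it is not the QVG implementation, and the function names are hypothetical.

```python
def quantize_uniform(values, bits=2):
    """Uniform min-max quantizer; returns the dequantized approximation."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def progressive_residual_quantize(values, stages=2, bits=2):
    """Each stage quantizes what the previous stages failed to capture."""
    approx = [0.0] * len(values)
    residual = list(values)
    for _ in range(stages):
        stage = quantize_uniform(residual, bits)
        approx = [a + s for a, s in zip(approx, stage)]
        residual = [v - a for v, a in zip(values, approx)]
    return approx


def max_abs_error(values, approx):
    return max(abs(v - a) for v, a in zip(values, approx))


values = [0.0, 0.1, 0.35, 0.8, 1.0]
err_one_stage = max_abs_error(values, progressive_residual_quantize(values, stages=1))
err_two_stages = max_abs_error(values, progressive_residual_quantize(values, stages=2))
print(err_one_stage, err_two_stages)
```

Each extra stage stores another set of low-bit codes, so the number of stages acts as the knob that trades memory for reconstruction quality.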

[368] Adaptive Exploration for Latent-State Bandits

Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang

Main category: cs.LG

TL;DR: State-model-free bandit algorithms that use lagged context and coordinated probing to handle hidden time-varying states without explicit state modeling

Details

Motivation: Classical multi-armed bandit algorithms fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection due to unobserved confounders causing biased reward estimates and limited state information.

Method: Introduces state-model-free bandit algorithms leveraging lagged contextual features and coordinated probing strategies to implicitly track latent states and disambiguate state-dependent reward patterns without explicit state modeling.

Result: Empirical results across diverse settings demonstrate superior performance over classical approaches, with methods combining computational efficiency with robust adaptation to non-stationary rewards.

Conclusion: The state-model-free approach effectively handles hidden time-varying states, and practical recommendations are provided for algorithm selection in real-world applications.

Abstract: The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.

[369] Align and Adapt: Multimodal Multiview Human Activity Recognition under Arbitrary View Combinations

Duc-Anh Nguyen, Nhien-An Le-Khac

Main category: cs.LG

TL;DR: AliAd: A multimodal multiview learning model for human activity recognition that handles arbitrary view combinations using contrastive learning and mixture-of-experts, with linear computational complexity.

Details

Motivation: Existing multimodal multiview learning approaches struggle with flexible view configurations including arbitrary view combinations, numbers of views, and heterogeneous modalities, especially in human activity recognition tasks.

Method: Combines multiview contrastive learning with mixture-of-experts module using adjusted center contrastive loss for self-supervised representation learning and view alignment, reducing complexity from O(V²) to O(V). Includes specialized load balancing strategy for mixture-of-experts.

Result: Validated on four datasets with inertial and human pose modalities (3-9 views), demonstrating performance and flexibility in handling arbitrary view availability during training and inference.

Conclusion: AliAd effectively addresses flexible view configurations in multimodal multiview learning for human activity recognition through contrastive learning and mixture-of-experts approach.

Abstract: Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose AliAd, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. AliAd is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.
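
The complexity reduction from pairwise to center-based alignment can be illustrated with a simplified sketch. It computes only a weighted squared-distance alignment term to a view center, with no negatives or temperature, so it is not the paper's adjusted center contrastive loss; the names and the use of `None` for a missing view are assumptions.

```python
def weighted_center(views, weights):
    """Weighted mean of the available view embeddings (None marks a missing view)."""
    present = [(z, w) for z, w in zip(views, weights) if z is not None]
    total = sum(w for _, w in present)
    dim = len(present[0][0])
    return [sum(w * z[j] for z, w in present) / total for j in range(dim)]


def center_alignment(views, weights):
    """Weighted squared distance of each available view to the center.

    Returns the loss and the number of terms, which is one per available view.
    """
    center = weighted_center(views, weights)
    loss, terms = 0.0, 0
    for z, w in zip(views, weights):
        if z is None:
            continue  # missing views are skipped, not reconstructed
        loss += w * sum((a - b) ** 2 for a, b in zip(z, center))
        terms += 1
    return loss, terms


def pairwise_term_count(num_views):
    """Number of view pairs an all-pairs alignment would compare."""
    return num_views * (num_views - 1) // 2


views = [[1.0, 0.0], [1.0, 0.0], None, [0.0, 1.0]]   # third view is missing
weights = [1.0, 1.0, 1.0, 1.0]
loss, terms = center_alignment(views, weights)
print(loss, terms, pairwise_term_count(9))
```

With nine views, an all-pairs scheme compares 36 pairs while the center-based term needs at most nine comparisons, and lowering a view's weight reduces its pull on the center.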

[370] Feature salience - not task-informativeness - drives machine learning model explanations

Benedict Clark, Marta Oliveira, Rick Wilming, Stefan Haufe

Main category: cs.LG

TL;DR: XAI methods attribute importance primarily to visually salient features rather than statistically informative ones, as shown through watermark experiments in image classification.

Details

Motivation: To investigate whether XAI methods truly identify informative features or are biased by visual salience, statistical suppression, or novelty effects, challenging the assumption that XAI highlights features containing target information.

Method: Trained deep learning models on binary image classification with three watermark conditions: absent, class-dependent confounds, or class-independent noise. Evaluated five popular attribution methods and compared to edge detection filters.

Result: XAI methods showed substantially elevated importance in watermarked areas regardless of training setting (R² ≥ .45), while class-dependence had minimal effect (R² ≤ .03). Importance attribution resembled edge detection and was sensitive to feature value encoding.

Conclusion: XAI importance attribution is driven more by test-time feature salience than learned statistical associations, suggesting previous XAI evaluations may be confounded and feature attribution workflows need scrutiny.

Abstract: Explainable AI (XAI) promises to provide insight into machine learning models’ decision processes, where one goal is to identify failures such as shortcut learning. This promise relies on the field’s assumption that input features marked as important by an XAI must contain information about the target variable. However, it is unclear whether informativeness is indeed the main driver of importance attribution in practice, or if other data properties such as statistical suppression, novelty at test-time, or high feature salience substantially contribute. To clarify this, we trained deep learning models on three variants of a binary image classification task, in which translucent watermarks are either absent, act as class-dependent confounds, or represent class-independent noise. Results for five popular attribution methods show substantially elevated relative importance in watermarked areas (RIW) for all models regardless of the training setting ($R^2 \geq .45$). By contrast, whether the presence of watermarks is class-dependent or not only has a marginal effect on RIW ($R^2 \leq .03$), despite a clear impact on model performance and generalisation ability. XAI methods show similar behaviour to model-agnostic edge detection filters and attribute substantially less importance to watermarks when bright image intensities are encoded by smaller instead of larger feature values. These results indicate that importance attribution is most strongly driven by the salience of image structures at test time rather than statistical associations learned by machine learning models. Previous studies demonstrating successful XAI application should be reevaluated with respect to a possibly spurious concurrency of feature salience and informativeness, and workflows using feature attribution methods as building blocks should be scrutinised.

[371] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana

Main category: cs.LG

TL;DR: RLFR uses interpretable model features as reward signals for reinforcement learning to reduce hallucinations in language models, enabling scalable supervision for open-ended tasks.

Details

Motivation: Language models learn abstract features that encode concepts like factuality, but these features are typically only used for monitoring or steering. The paper aims to use these interpretable features as scalable supervision for open-ended tasks like hallucination reduction.

Method: Develops RLFR (Reinforcement Learning from Feature Rewards) pipeline that uses model features as reward functions. Includes a probing framework to identify candidate hallucinated claims, teaches models to intervene and correct completions when uncertain, and enables scalable test-time compute guided by reward features.

Result: Applied to Gemma-3-12B-IT, the resulting policy is 58% less likely to hallucinate compared to the original model (when used with their probing harness), while preserving performance on standard benchmarks.

Conclusion: By grounding supervision in interpretable features, this work introduces a novel paradigm for using interpretability to learn open-ended tasks, demonstrating effective hallucination reduction through feature-based reinforcement learning.

Abstract: Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model (when run in tandem with our probing harness), while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.

[372] Stochastic Parroting in Temporal Attention – Regulating the Diagonal Sink

Victoria Hankemeier, Malte Schilling

Main category: cs.LG

TL;DR: Theoretical analysis shows temporal attention suffers from diagonal attention sink bias, with proposed regularization methods to address it

Details

Motivation: Spatio-temporal models are prone to information degeneration between space and time, with prior work showing over-squashing in causal attention creates bias on first tokens. The paper aims to analyze if similar bias exists in temporal attention mechanisms.

Method: Derived sensitivity bounds on expected value of Jacobian of temporal attention layer, theoretically analyzed how off-diagonal attention scores depend on sequence length, identified diagonal attention sink problem, and proposed regularization methods.

Result: Theoretical analysis demonstrates temporal attention matrices suffer from diagonal attention sink, and experimental results show effectiveness of proposed regularization methods in addressing this bias.

Conclusion: Temporal attention mechanisms exhibit diagonal attention sink bias similar to issues in causal attention, but regularization methods can effectively mitigate this problem.

Abstract: Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration among space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias on the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.

[373] Efficient Analysis of the Distilled Neural Tangent Kernel

Jamie Mahowald, Brian Bell, Alex Ho, Michael Geyer

Main category: cs.LG

TL;DR: NTK computation is accelerated via dataset distillation to compress data dimension, preserving kernel structure while reducing computational complexity by up to 5 orders of magnitude.

Details

Motivation: Neural tangent kernel methods face computational bottlenecks due to large Jacobian evaluations across many data points. Existing approaches use projection/sketching, but data dimension compression remains unexplored.

Method: Proposes distilled neural tangent kernel (DNTK) that combines NTK-tuned dataset distillation with projection methods. Shows neural tangent space can be induced by dataset distillation, achieving 20-100× Jacobian reduction while preserving low effective rank of per-class NTK matrices.

Result: Achieves up to 5 orders of magnitude reduction in NTK computational complexity while maintaining kernel structure and predictive performance comparable to full NTK.

Conclusion: Dataset distillation provides an effective alternative to projection methods for NTK acceleration, enabling efficient NTK computation without sacrificing performance.

Abstract: Neural tangent kernel (NTK) methods are computationally limited by the need to evaluate large Jacobians across many data points. Existing approaches reduce this cost primarily through projecting and sketching the Jacobian. We show that NTK computation can also be reduced by compressing the data dimension itself using NTK-tuned dataset distillation. We demonstrate that the neural tangent space spanned by the input data can be induced by dataset distillation, yielding a 20-100$\times$ reduction in required Jacobian calculations. We further show that per-class NTK matrices have low effective rank that is preserved by this reduction. Building on these insights, we propose the distilled neural tangent kernel (DNTK), which combines NTK-tuned dataset distillation with state-of-the-art projection methods to reduce NTK computational complexity by up to five orders of magnitude while preserving kernel structure and predictive performance.
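
For orientation, the standard empirical NTK and the place where distillation saves work can be written as follows. The symbols $N$ (original data points), $M$ (distilled data points), $C$ (outputs), and $P$ (parameters) are notation assumed here, not taken from the paper; only the 20-100$\times$ figure comes from the abstract.

```latex
\[
K(x, x') \;=\; J(x)\, J(x')^{\top},
\qquad
J(x) \;=\; \frac{\partial f(x;\theta)}{\partial \theta} \in \mathbb{R}^{C \times P}
\]
\[
\text{full kernel: } N \text{ Jacobian evaluations}
\quad\longrightarrow\quad
\text{distilled kernel: } M \text{ Jacobian evaluations},
\qquad
\frac{N}{M} \approx 20\text{--}100
\]
```

Projection and sketching methods shrink the $P$ dimension of each $J(x)$, whereas distillation reduces how many $J(x)$ are needed, so the two kinds of savings can be combined.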

[374] Boundary Point Jailbreaking of Black-Box LLMs

Xander Davies, Giorgi Giglemiani, Edmund Lau, Eric Winsor, Geoffrey Irving, Yarin Gal

Main category: cs.LG

TL;DR: BPJ is a black-box jailbreak attack that evades strong LLM safeguards by using boundary point optimization and curriculum learning, requiring only binary classifier feedback.

Details

Motivation: Current LLM safeguards have become robust against traditional jailbreak attacks, surviving extensive human red teaming. There's a need for automated attacks that can bypass these strong defenses without relying on white/grey-box access or existing jailbreak libraries.

Method: BPJ uses a fully black-box approach requiring only binary classifier feedback (flagged/not flagged). It converts target harmful strings into curriculum of intermediate attack targets, then actively selects boundary points - evaluation points that best detect small changes in attack strength. This allows optimization without direct access to classifier scores or gradients.

Result: BPJ successfully develops universal jailbreaks against Constitutional Classifiers and is the first automated attack to succeed against GPT-5’s input classifier without human attack seeds. It demonstrates effectiveness against industry-deployed safeguards.

Conclusion: BPJ represents a significant advancement in automated jailbreak attacks, showing that current single-interaction defenses are insufficient. Effective defense requires supplementing individual interaction monitoring with batch-level analysis to detect optimization patterns.

Abstract: Frontier LLMs are safeguarded against attempts to extract harmful information via adversarial prompts known as “jailbreaks”. Recently, defenders have developed classifier-based systems that have survived thousands of hours of human red teaming. We introduce Boundary Point Jailbreaking (BPJ), a new class of automated jailbreak attacks that evade the strongest industry-deployed safeguards. Unlike previous attacks that rely on white/grey-box assumptions (such as classifier scores or gradients) or libraries of existing jailbreaks, BPJ is fully black-box and uses only a single bit of information per query: whether or not the classifier flags the interaction. To achieve this, BPJ addresses the core difficulty in optimising attacks against robust real-world defences: evaluating whether a proposed modification to an attack is an improvement. Instead of directly trying to learn an attack for a target harmful string, BPJ converts the string into a curriculum of intermediate attack targets and then actively selects evaluation points that best detect small changes in attack strength (“boundary points”). We believe BPJ is the first fully automated attack algorithm that succeeds in developing universal jailbreaks against Constitutional Classifiers, as well as the first automated attack algorithm that succeeds against GPT-5’s input classifier without relying on human attack seeds. BPJ is difficult to defend against in individual interactions but incurs many flags during optimisation, suggesting that effective defence requires supplementing single-interaction methods with batch-level monitoring.

cs.MA

[375] Evaluating Collective Behaviour of Hundreds of LLM Agents

Richard Willis, Jianing Zhao, Yali Du, Joel Z. Leibo

Main category: cs.MA

TL;DR: LLM-powered autonomous agents in social dilemmas show that newer models produce worse societal outcomes than older ones when prioritizing individual gain, with cultural evolution simulations revealing risks of poor societal equilibria.

Details

Motivation: As LLM-powered autonomous agents become more prevalent in society, understanding their collective behavior in social dilemmas is crucial for responsible deployment and avoiding negative societal outcomes.

Method: Developed an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection before deployment and scaling to populations of hundreds of agents. Used cultural evolution to model user selection of agents in simulations.

Result: More recent LLM models tend to produce worse societal outcomes compared to older models when agents prioritize individual gain over collective benefits. Simulations show significant risk of convergence to poor societal equilibria, especially when cooperation benefits diminish and population sizes increase.

Conclusion: There’s a concerning trend where newer LLMs may lead to worse collective outcomes in social dilemmas, highlighting the need for evaluation frameworks to assess emergent collective behavior before deployment.

Abstract: As autonomous agents powered by LLMs are increasingly deployed in society, understanding their collective behaviour in social dilemmas becomes critical. We introduce an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection prior to deployment and scaling to populations of hundreds of agents – substantially larger than in previous work. We find that more recent models tend to produce worse societal outcomes compared to older models when agents prioritise individual gain over collective benefits. Using cultural evolution to model user selection of agents, our simulations reveal a significant risk of convergence to poor societal equilibria, particularly when the relative benefit of cooperation diminishes and population sizes increase. We release our code as an evaluation suite for developers to assess the emergent collective behaviour of their models.
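
To make "strategies encoded as algorithms" concrete, here is a minimal sketch in which each strategy is a plain function that can be read before it is run. The choice of a public goods game, the two example strategies, and all parameter values are assumptions for illustration; the abstract does not specify which social dilemmas or payoffs the framework uses.

```python
def always_defect(history):
    """Hypothetical strategy: never contribute."""
    return 0.0

def conditional_cooperator(history):
    """Hypothetical strategy: contribute fully at first, then match last round's mean."""
    if not history:
        return 1.0
    last_round = history[-1]
    return sum(last_round) / len(last_round)

def play_public_goods(strategies, rounds=5, multiplier=1.6, endowment=1.0):
    """Run a repeated public goods game and return each agent's total payoff.

    Each round, contributions are pooled, multiplied, and shared equally.
    """
    history = []
    payoffs = [0.0] * len(strategies)
    for _ in range(rounds):
        # Clamp each strategy's output to a valid fraction of the endowment.
        contribs = [min(max(s(history), 0.0), 1.0) * endowment for s in strategies]
        share = multiplier * sum(contribs) / len(strategies)
        for i, c in enumerate(contribs):
            payoffs[i] += endowment - c + share
        history.append(contribs)
    return payoffs

all_defect = play_public_goods([always_defect] * 4)
all_cooperate = play_public_goods([conditional_cooperator] * 4)
mixed = play_public_goods([conditional_cooperator] * 3 + [always_defect])
print(all_defect, all_cooperate, mixed)
```

Because strategies are ordinary functions, a population of hundreds is just a longer list, and the dilemma is visible directly: the lone defector in `mixed` earns more than its cooperating peers, while a population of defectors earns less than a population of cooperators.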

[376] Consensus Based Task Allocation for Angles-Only Local Catalog Maintenance of Satellite Systems

Harrison Perone, Christopher W. Hays

Main category: cs.MA

TL;DR: Decentralized task allocation algorithm for satellite constellations to coordinate observations of space objects using angles-only limited FOV measurements, improving fuel efficiency and catalog uncertainty.

Details

Motivation: Close proximity satellites need accurate relative state estimates of all objects (satellites and debris) for safe operations. Ground-based tracking may be insufficient, requiring space-based sensors with coordinated observations among multiple communicating satellites.

Method: Developed a decentralized task allocation algorithm for scheduling and coordinating observations among multiple satellites, each maintaining local catalogs of communicating and non-communicating objects using angles-only limited field of view measurements.

Result: The new method significantly outperforms the uncertainty-fuel Pareto frontier formed by current approaches, demonstrating improved fuel usage and overall catalog uncertainty in numerical simulations.

Conclusion: Decentralized coordination among space-based sensors enables more efficient observation scheduling for space situational awareness, achieving better trade-offs between fuel consumption and catalog accuracy than existing methods.

Abstract: In order for close proximity satellites to safely perform their missions, the relative states of all satellites and pieces of debris must be well understood. This presents a problem for ground based tracking and orbit determination since it may not be practical to achieve the required accuracy. Using space-based sensors allows for more accurate relative state estimates, especially if multiple satellites are allowed to communicate. Of interest to this work is the case where several communicating satellites each need to maintain a local catalog of communicating and non-communicating objects using angles-only limited field of view (FOV) measurements. However, this introduces the problem of efficiently scheduling and coordinating observations among the agents. This paper presents a decentralized task allocation algorithm to address this problem and quantifies its performance in terms of fuel usage and overall catalog uncertainty via numerical simulation. It was found that the new method significantly outperforms the uncertainty-fuel Pareto frontier formed by current approaches.

[377] Fairness Dynamics in Digital Economy Platforms with Biased Ratings

J. Martin Smit, Fernando P. Santos

Main category: cs.MA

TL;DR: Digital platforms with rating systems can perpetuate discrimination against marginalized groups; evolutionary game theory shows trade-off between user experience and fairness, with demographic-based interventions as effective solutions.

Details

Motivation: Digital platforms rely on rating systems to establish trust, but these systems can perpetuate negative biases against marginalized groups. The paper aims to investigate how to design platforms around biased reputation systems to reduce discrimination while maintaining incentives for high-quality service.

Method: The authors introduce an evolutionary game theoretical model to study how digital platforms can perpetuate or counteract rating-based discrimination. They focus on platforms’ decisions to promote service providers based on high reputations or protected group membership.

Result: Results demonstrate a fundamental trade-off between user experience and fairness: promoting highly-rated providers benefits users but lowers demand for marginalized providers facing rating bias. Demographic-based interventions in search results are highly effective at reducing unfairness with minimal user impact. Even without precise bias measurements, improvements over systems ignoring protected characteristics are possible.

Conclusion: The model highlights the benefits of proactive anti-discrimination design in rating-based systems, showing that platforms can implement effective interventions to reduce discrimination while maintaining service quality incentives.

Abstract: The digital services economy consists of online platforms that facilitate interactions between service providers and consumers. This ecosystem is characterized by short-term, often one-off, transactions between parties that have no prior familiarity. To establish trust among users, platforms employ rating systems which allow users to report on the quality of their previous interactions. However, while arguably crucial for these platforms to function, rating systems can perpetuate negative biases against marginalised groups. This paper investigates how to design platforms around biased reputation systems, reducing discrimination while maintaining incentives for all service providers to offer high quality service for users. We introduce an evolutionary game theoretical model to study how digital platforms can perpetuate or counteract rating-based discrimination. We focus on the platforms’ decisions to promote service providers who have high reputations or who belong to a specific protected group. Our results demonstrate a fundamental trade-off between user experience and fairness: promoting highly-rated providers benefits users, but lowers the demand for marginalised providers against which the ratings are biased. Our results also provide evidence that intervening by tuning the demographics of the search results is a highly effective way of reducing unfairness while minimally impacting users. Furthermore, we show that even when precise measurements of the level of rating bias affecting marginalised service providers are unavailable, there is still potential to improve upon a recommender system which ignores protected characteristics. Altogether, our model highlights the benefits of proactive anti-discrimination design in systems where ratings are used to promote cooperative behaviour.

cs.MM

[378] Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

Rong Fu, Ziming Wang, Shuo Yin, Wenxin Zhang, Haiyun Wei, Kun Liu, Xianda Li, Zeli Su, Simon Fong

Main category: cs.MM

TL;DR: EC-Net is a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling that uses Poincare-ball embeddings and hypergraph fusion with contrastive learning in hyperbolic space.

Details

Motivation: Emotional expression is crucial for natural communication and effective human-computer interaction. Current multimodal emotion modeling approaches need better representation of modality hierarchies and more robust fusion mechanisms, especially when dealing with partial or noisy modality data.

Method: EC-Net uses Poincare-ball embeddings to represent modality hierarchies in hyperbolic space, performs fusion through a hypergraph mechanism with bidirectional message passing between nodes and hyperedges, and employs contrastive learning in hyperbolic space with decoupled radial and angular objectives. It preserves high-order semantic relations via adaptive hyperedge construction across time steps and modalities.

Result: Empirical results on standard multimodal emotion benchmarks show EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise.

Conclusion: Explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding, demonstrating the value of hyperbolic representations and hypergraph structures for emotion modeling.

Abstract: Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
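
The Poincare-ball embeddings mentioned above rely on the standard hyperbolic distance in the unit ball, which can be sketched as follows. This is the textbook formula only; how EC-Net builds its embeddings, hyperedges, and radial or angular losses is not shown here, and the function name is an assumption.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    if sq_norm(u) >= 1.0 or sq_norm(v) >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))   # ln(3), about 1.0986
print(poincare_distance(origin, [0.99, 0.0]))  # much larger: distances blow up near the boundary
```

Distances grow rapidly near the boundary, which is why points at small radius can act as general, high-level nodes and points at large radius as specific ones when representing a hierarchy.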

eess.AS

[379] Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

Main category: eess.AS

TL;DR: Resp-Agent: A multimodal system using an Active Adversarial Curriculum Agent to address respiratory auscultation challenges by weaving EHR data with audio tokens and synthesizing hard-to-diagnose samples via adapted LLM.

Details

Motivation: Address two fundamental challenges in deep learning-based respiratory auscultation: (1) inherent information loss when converting signals to spectrograms (discards transient acoustic events and clinical context), and (2) limited data availability exacerbated by severe class imbalance.

Method: Proposes Resp-Agent with three key components: (1) Active Adversarial Curriculum Agent (Thinker-A²CA) as central controller to identify diagnostic weaknesses and schedule targeted synthesis; (2) Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors; (3) Flow Matching Generator that adapts text-only LLM via modality injection to synthesize hard-to-diagnose samples by decoupling pathological content from acoustic style.

Result: Introduces Resp-229k benchmark corpus of 229k recordings with LLM-distilled clinical narratives. Extensive experiments show Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance.

Conclusion: Resp-Agent effectively addresses representation and data gaps in respiratory auscultation through multimodal integration and adaptive synthesis, demonstrating superior performance in challenging diagnostic scenarios.

Abstract: Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.

[380] How Much Does Machine Identity Matter in Anomalous Sound Detection at Test Time?

Kevin Wilkinghoff, Keisuke Imoto, Zheng-Hua Tan

Main category: eess.AS

TL;DR: The paper proposes a modified evaluation protocol for anomalous sound detection (ASD) that removes the assumption of known machine identity at test time, revealing performance degradations and method-specific robustness differences that are hidden under standard machine-wise evaluation.

Details

Motivation: Realistic monitoring scenarios involve multiple machines operating concurrently where test recordings may not be reliably attributable to specific machines. Standard ASD benchmarks assume machine identity is known at test time, which imposes deployment constraints and doesn't reflect real-world conditions where machine identity may be uncertain.

Method: The authors propose a minimal modification to ASD evaluation: test recordings from multiple machines are merged and evaluated jointly without access to machine identity at inference time. Training data and evaluation metrics remain unchanged, and machine identity labels are used only for post hoc evaluation. This reveals how methods perform when machine identity is uncertain.

Result: Experiments with representative ASD methods show that relaxing the machine identity assumption reveals performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation. These degradations are strongly related to implicit machine identification accuracy.

Conclusion: The standard machine-wise evaluation protocol for ASD hides important performance characteristics that become apparent when machine identity is uncertain. The proposed evaluation framework better reflects realistic deployment scenarios and reveals method-specific robustness to machine identity uncertainty.

Abstract: Anomalous sound detection (ASD) benchmarks typically assume that the identity of the monitored machine is known at test time and that recordings are evaluated in a machine-wise manner. However, in realistic monitoring scenarios with multiple known machines operating concurrently, test recordings may not be reliably attributable to a specific machine, and requiring machine identity imposes deployment constraints such as dedicated sensors per machine. To reveal performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, we consider a minimal modification of the ASD evaluation protocol in which test recordings from multiple machines are merged and evaluated jointly without access to machine identity at inference time. Training data and evaluation metrics remain unchanged, and machine identity labels are used only for post hoc evaluation. Experiments with representative ASD methods show that relaxing this assumption reveals performance degradations and method-specific differences in robustness that are hidden under standard machine-wise evaluation, and that these degradations are strongly related to implicit machine identification accuracy.
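
The effect of withholding machine identity can be illustrated with a toy detector. In the sketch below, each machine's normal sound is modelled by a single 1-D centroid, and the identity-free score is the distance to the nearest centroid; both choices, along with all feature values, are assumptions for illustration and are not the ASD methods evaluated in the paper. As in the proposed protocol, machine labels are used only afterwards to compute per-machine AUC.

```python
def auc(scores, labels):
    """Probability that an anomalous clip outscores a normal one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

CENTROIDS = {"fan": 0.0, "pump": 8.0}  # hypothetical 1-D models of normal sound

CLIPS = [  # (feature, label, machine); label 1 = anomalous
    (0.5, 0, "fan"), (-1.0, 0, "fan"), (3.0, 1, "fan"), (7.5, 1, "fan"),
    (8.5, 0, "pump"), (7.0, 0, "pump"), (11.0, 1, "pump"), (0.5, 1, "pump"),
]

def score_with_identity(x, machine):
    """Standard setting: compare against the known machine's model."""
    return abs(x - CENTROIDS[machine])

def score_without_identity(x):
    """Identity-free setting: compare against whichever model is closest."""
    return min(abs(x - c) for c in CENTROIDS.values())

def per_machine_auc(use_identity):
    """Post hoc, machine-wise AUC; identity is used here only for grouping."""
    result = {}
    for machine in CENTROIDS:
        subset = [(x, y) for x, y, m in CLIPS if m == machine]
        scores = [
            score_with_identity(x, machine) if use_identity else score_without_identity(x)
            for x, _ in subset
        ]
        result[machine] = auc(scores, [y for _, y in subset])
    return result

print(per_machine_auc(use_identity=True))   # {'fan': 1.0, 'pump': 1.0}
print(per_machine_auc(use_identity=False))  # {'fan': 0.625, 'pump': 0.625}
```

The drop comes from anomalies that resemble another machine's normal sound (the fan clip at 7.5 and the pump clip at 0.5): without identity they are matched to the wrong model, which mirrors the reported link between degradation and implicit machine identification accuracy.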

[381] Color-based Emotion Representation for Speech Emotion Recognition

Ryotaro Nagase, Ryoichi Takashima, Yoichi Yamashita

Main category: eess.AS

TL;DR: This paper proposes using color attributes (hue, saturation, value) as continuous and interpretable representations for speech emotion recognition, moving beyond traditional categorical/dimensional labels.

Details

Motivation: Traditional speech emotion recognition (SER) relies on categorical or dimensional labels, which are limited in representing both the diversity and interpretability of emotions. The authors seek a more continuous and interpretable representation.

Method: 1) Annotated an emotional speech corpus with color attributes via crowdsourcing; 2) Built regression models for color attributes in SER using machine learning and deep learning; 3) Explored multitask learning combining color attribute regression and emotion classification.

Result: Demonstrated the relationship between color attributes and emotions in speech, successfully developed color attribute regression models for SER, and showed that multitask learning improved performance for each task.

Conclusion: Color attributes provide a promising continuous and interpretable representation for speech emotion recognition, with multitask learning offering performance benefits over single-task approaches.

Abstract: Speech emotion recognition (SER) has traditionally relied on categorical or dimensional labels. However, this technique is limited in representing both the diversity and interpretability of emotions. To overcome this limitation, we focus on color attributes, such as hue, saturation, and value, to represent emotions as continuous and interpretable scores. We annotated an emotional speech corpus with color attributes via crowdsourcing and analyzed them. Moreover, we built regression models for color attributes in SER using machine learning and deep learning, and explored the multitask learning of color attribute regression and emotion classification. As a result, we demonstrated the relationship between color attributes and emotions in speech, and successfully developed color attribute regression models for SER. We also showed that multitask learning improved the performance of each task.
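
A small sketch of how a color annotation could be turned into regression targets, using Python's standard `colorsys` module. The abstract does not say how annotated colors were encoded, so the RGB input, the `[0, 1]` scaling, and the circular encoding of hue are all assumptions.

```python
import colorsys
import math

def rgb255_to_hsv_targets(r, g, b):
    """Convert an 8-bit RGB color into hue, saturation, and value in [0, 1]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return {"hue": h, "saturation": s, "value": v}

def hue_to_unit_circle(hue):
    """Encode hue as (cos, sin) so that hues near 0 and near 1 stay close together.

    Hue is an angle, so regressing it as a plain scalar would treat two
    nearly identical reds (0.01 and 0.99) as far apart. This encoding is a
    hypothetical design choice, not something stated in the paper.
    """
    angle = 2.0 * math.pi * hue
    return (math.cos(angle), math.sin(angle))

red = rgb255_to_hsv_targets(255, 0, 0)
blue = rgb255_to_hsv_targets(0, 0, 255)
print(red)   # {'hue': 0.0, 'saturation': 1.0, 'value': 1.0}
print(blue)
print(hue_to_unit_circle(red["hue"]))  # (1.0, 0.0)
```

In a multitask setup, these continuous targets would be predicted alongside the categorical emotion label from a shared speech encoder.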

[382] Multi-Channel Replay Speech Detection using Acoustic Maps

Michael Neri, Tuomas Virtanen

Main category: eess.AS

TL;DR: Acoustic maps as spatial features for replay attack detection in speaker verification systems using multi-channel recordings and lightweight CNN.

Details

Motivation: Replay attacks are a critical vulnerability for automatic speaker verification systems, especially in real-time voice assistant applications, requiring robust detection methods.

Method: Proposes acoustic maps as novel spatial feature representation derived from classical beamforming over discrete azimuth and elevation grids. Uses lightweight convolutional neural network (approx. 6k parameters) to operate on these directional energy distributions.

Result: Achieves competitive performance on ReMASC dataset. Acoustic maps provide compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

Conclusion: Acoustic maps offer an effective spatial feature representation for detecting replay attacks in speaker verification systems, with good performance and interpretability.

Abstract: Replay attacks remain a critical vulnerability for automatic speaker verification systems, particularly in real-time voice assistant applications. In this work, we propose acoustic maps as a novel spatial feature representation for replay speech detection from multi-channel recordings. Derived from classical beamforming over discrete azimuth and elevation grids, acoustic maps encode directional energy distributions that reflect physical differences between human speech radiation and loudspeaker-based replay. A lightweight convolutional neural network is designed to operate on this representation, achieving competitive performance on the ReMASC dataset with approximately 6k trainable parameters. Experimental results show that acoustic maps provide a compact and physically interpretable feature space for replay attack detection across different devices and acoustic environments.

[383] Online Single-Channel Audio-Based Sound Speed Estimation for Robust Multi-Channel Audio Control

Andreas Jonas Fuglsig, Mads Græsbøll Christensen, Jesper Rindom Jensen

Main category: eess.AS

TL;DR: Online sound speed estimation using a single microphone during audio playback to improve the robustness of spatial audio control against environmental variations.

Details

Motivation: Environmental variations, especially changes in sound speed, cause systematic mismatches in acoustic propagation models that degrade spatial audio control performance. Existing methods are impractical for minimal sensing systems as they require known sound speed, multiple microphones, or separate calibration.

Method: Proposes an online sound speed estimator that operates during general multichannel audio playback using only a single observation microphone. The method exploits the structured effect of sound speed on reproduced signals and estimates it by minimizing mismatch between measured audio and a parametric acoustic model.

Result: Simulations show accurate tracking of sound speed for diverse input signals and improved spatial control performance when estimates are used to compensate propagation errors in a sound zone control framework.

Conclusion: The proposed single-microphone online sound speed estimation method enables robust spatial audio control in varying environments without requiring multiple sensors or separate calibration procedures.

Abstract: Robust spatial audio control relies on accurate acoustic propagation models, yet environmental variations, especially changes in the speed of sound, cause systematic mismatches that degrade performance. Existing methods either assume known sound speed, require multiple microphones, or rely on separate calibration, making them impractical for systems with minimal sensing. We propose an online sound speed estimator that operates during general multichannel audio playback and requires only a single observation microphone. The method exploits the structured effect of sound speed on the reproduced signal and estimates it by minimizing the mismatch between the measured audio and a parametric acoustic model. Simulations show accurate tracking of sound speed for diverse input signals and improved spatial control performance when the estimates are used to compensate propagation errors in a sound zone control framework.
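
The idea of estimating sound speed by minimizing the mismatch between measured audio and a parametric model can be sketched with a grid search. The free-field, delay-only propagation model, the pure-tone playback signals, the distances, and the candidate grid are all simplifying assumptions; the paper's acoustic model and optimizer are not described in the abstract.

```python
import math

DISTANCES = [2.0, 3.5]   # loudspeaker-to-microphone distances in metres (hypothetical)
FREQS = [400.0, 650.0]   # one known test tone per loudspeaker in Hz (hypothetical)
FS = 8000                # sample rate in Hz

def model_signal(c, n_samples=400):
    """Microphone signal predicted for sound speed c (delay-only model, no attenuation)."""
    out = []
    for n in range(n_samples):
        t = n / FS
        out.append(sum(
            math.sin(2.0 * math.pi * f * (t - d / c))
            for d, f in zip(DISTANCES, FREQS)
        ))
    return out

def estimate_sound_speed(measured, candidates):
    """Return the candidate speed whose predicted signal best matches the measurement."""
    def mismatch(c):
        predicted = model_signal(c, len(measured))
        return sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return min(candidates, key=mismatch)

candidates = [330.0 + 0.5 * k for k in range(41)]  # 330.0 to 350.0 m/s
measured = model_signal(343.0)                     # simulated, noise-free observation
print(estimate_sound_speed(measured, candidates))  # 343.0
```

The search range is kept narrow on purpose: with pure tones, a wrong speed whose delays differ by a whole period would match equally well, so broadband playback or a physically plausible prior range is needed to keep the minimum unique.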

[384] SELEBI: Percussion-aware Time Stretching via Selective Magnitude Spectrogram Compression by Nonstationary Gabor Transform

Natsuki Akaishi, Nicki Holighaus, Kohei Yatabe

Main category: eess.AS

TL;DR: SELEBI is a signal-adaptive phase vocoder algorithm that reduces percussion smearing in audio time-stretching by using nonstationary Gabor transforms with dynamic window lengths.

Details

Motivation: Conventional phase vocoder time-stretching suffers from "percussion smearing" artifacts that degrade percussive audio quality, caused by temporal mismatch between smeared magnitude spectrograms and localized phase.

Method: Uses nonstationary Gabor transform with dynamically adapted analysis window lengths - short windows for percussive intervals, longer windows for stationary content. This creates temporally localized magnitude spectrograms directly from time-domain signals.

Result: Experimental results show effective mitigation of percussion smearing and natural sound quality while preserving perfect reconstruction property and stability.

Conclusion: SELEBI provides a principled solution to percussion smearing in audio time-stretching through signal-adaptive windowing, outperforming heuristic approaches while maintaining perfect reconstruction.

Abstract: Phase vocoder-based time-stretching is a widely used technique for the time-scale modification of audio signals. However, conventional implementations suffer from “percussion smearing,” a well-known artifact that significantly degrades the quality of percussive components. We attribute this artifact to a fundamental time-scale mismatch between the temporally smeared magnitude spectrogram and the localized, newly generated phase. To address this, we propose SELEBI, a signal-adaptive phase vocoder algorithm that significantly reduces percussion smearing while preserving stability and the perfect reconstruction property. Unlike conventional methods that rely on heuristic processing or component separation, our approach leverages the nonstationary Gabor transform. By dynamically adapting analysis window lengths to assign short windows to intervals containing significant energy associated with percussive components, we directly compute a temporally localized magnitude spectrogram from the time-domain signal. This approach ensures greater consistency between the temporal structures of the magnitude and phase. Furthermore, the perfect reconstruction property of the nonstationary Gabor transform guarantees stable, high-fidelity signal synthesis, in contrast to previous heuristic approaches. Experimental results demonstrate that the proposed method effectively mitigates percussion smearing and yields natural sound quality.

[385] BEST-STD2.0: Balanced and Efficient Speech Tokenizer for Spoken Term Detection

Anup Singh, Vipul Arora, Kris Demuynck

Main category: eess.AS

TL;DR: Proposes noise-augmented training and optimal transport regularization for token-based spoken term detection to improve robustness and token efficiency.

Details

Motivation: Token-based spoken term detection systems are efficient but struggle with robustness to noise/reverberation and inefficient token utilization, limiting their practical application for voice search.

Method: 1) Noise and reverberation-augmented training strategy to improve tokenizer robustness; 2) Optimal transport-based regularization for balanced token usage and enhanced efficiency; 3) TF-IDF-based search mechanism for faster retrieval.

Result: The proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.

Conclusion: The approach successfully addresses robustness and efficiency challenges in token-based spoken term detection, making it more practical for real-world voice search applications.

Abstract: Fast and accurate spoken content retrieval is vital for applications such as voice search. Query-by-Example Spoken Term Detection (STD) involves retrieving matching segments from an audio database given a spoken query. Token-based STD systems, which use discrete speech representations, enable efficient search but struggle with robustness to noise and reverberation, and with inefficient token utilization. We address these challenges by proposing a noise and reverberation-augmented training strategy to improve tokenizer robustness. In addition, we introduce optimal transport-based regularization to ensure balanced token usage and enhance token efficiency. To further speed up retrieval, we adopt a TF-IDF-based search mechanism. Empirical evaluations demonstrate that the proposed method outperforms STD baselines across various distortion levels while maintaining high search efficiency.
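
The TF-IDF-based search step can be illustrated with a minimal sketch. It assumes each utterance has already been converted to a sequence of integer token ids, treats single tokens as terms, and ranks by cosine similarity; the summary does not state the paper's exact term definition or scoring, so the function names and those choices are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector (dict) for one token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    # Tokens that never appeared at indexing time get no weight.
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def build_index(docs):
    """docs maps an utterance name to its list of token ids (assumed input)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    # Smoothed inverse document frequency, always positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    vectors = {name: tfidf_vector(tokens, idf) for name, tokens in docs.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_tokens, idf, vectors, top_k=3):
    """Return the top_k (score, name) pairs for a tokenized spoken query."""
    query = tfidf_vector(query_tokens, idf)
    scored = sorted(((cosine(query, v), name) for name, v in vectors.items()),
                    reverse=True)
    return scored[:top_k]


# Toy database of three tokenized utterances (made-up token ids).
database = {"utt_a": [1, 2, 3, 1], "utt_b": [4, 5, 6], "utt_c": [1, 2, 7]}
idf, vectors = build_index(database)
results = search([4, 5], idf, vectors)
```

Because the vectors are sparse, a query only touches the tokens it contains, which is one plausible reason a TF-IDF lookup is faster than aligning the query against every stored sequence.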

[386] Spatial Interpolation of Room Impulse Responses based on Deeper Physics-Informed Neural Networks with Residual Connections

Ken Kurata, Gen Sato, Izumi Tsunokuni, Yusuke Ikeda

Main category: eess.AS

TL;DR: Deep residual PINN with sinusoidal activations achieves best RIR estimation accuracy for interpolation/extrapolation, enabling stable training and better reflection component estimation.

Details

Motivation: Room impulse response (RIR) estimation from limited measurements is crucial for sound propagation analysis. While physics-informed neural networks (PINNs) have been introduced for accurate RIR estimation, the role of network depth hasn't been systematically investigated, and deeper architectures need exploration for improved performance.

Method: Developed deeper PINN architecture with residual connections and analyzed how network depth affects estimation performance. Compared activation functions including tanh and sinusoidal activations. Focused on interpolation and extrapolation of RIRs.

Result: Residual PINN with sinusoidal activations achieves highest accuracy for both interpolation and extrapolation of RIRs. The architecture enables stable training as depth increases and yields notable improvements in estimating reflection components.

Conclusion: Provides practical guidelines for designing deep and stable PINNs for acoustic-inverse problems, showing that deeper architectures with appropriate activation functions can significantly improve RIR estimation performance.

Abstract: The room impulse response (RIR) characterizes sound propagation in a room from a loudspeaker to a microphone under the linear time-invariant assumption. Estimating RIRs from a limited number of measurement points is crucial for sound propagation analysis and visualization. Physics-informed neural networks (PINNs) have recently been introduced for accurate RIR estimation by embedding governing physical laws into deep learning models; however, the role of network depth has not been systematically investigated. In this study, we developed a deeper PINN architecture with residual connections and analyzed how network depth affects estimation performance. We further compared activation functions, including tanh and sinusoidal activations. Our results indicate that the residual PINN with sinusoidal activations achieves the highest accuracy for both interpolation and extrapolation of RIRs. Moreover, the proposed architecture enables stable training as the depth increases and yields notable improvements in estimating reflection components. These results provide practical guidelines for designing deep and stable PINNs for acoustic-inverse problems.

eess.IV

[387] Foundation Models for Medical Imaging: Status, Challenges, and Directions

Chuang Niu, Pengwei Wu, Bruno De Man, Ge Wang

Main category: eess.IV

TL;DR: A comprehensive review of foundation models in medical imaging, covering design principles, applications, and future challenges for developing trustworthy clinical AI systems.

Details

Motivation: Medical imaging is transitioning from task-specific models to large, general-purpose foundation models that can adapt across modalities, anatomies, and clinical tasks, requiring a synthesis of current progress and future directions.

Method: Review paper that synthesizes the emerging landscape of medical imaging foundation models along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities.

Result: Provides a technically grounded, clinically aware roadmap for developing foundation models that are powerful, versatile, trustworthy, and ready for responsible clinical translation.

Conclusion: Foundation models are reshaping medical imaging, and the review offers a comprehensive framework for advancing the field toward clinically relevant, trustworthy AI systems.

Abstract: Foundation models (FMs) are rapidly reshaping medical imaging, shifting the field from narrowly trained, task-specific networks toward large, general-purpose models that can be adapted across modalities, anatomies, and clinical tasks. In this review, we synthesize the emerging landscape of medical imaging FMs along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities. Taken together, this review provides a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are not only powerful and versatile but also trustworthy and ready for responsible translation into clinical practice.

[388] ROIX-Comp: Optimizing X-ray Computed Tomography Imaging Strategy for Data Reduction and Reconstruction

Amarjit Singh, Kento Sato, Kohei Yoshida, Kentaro Uesugi, Yasumasa Joti, Takaki Hatsui, Andrès Rubio Proaño

Main category: eess.IV

TL;DR: ROIX-Comp: ROI-driven compression framework for X-ray CT data that reduces storage/transmission needs while preserving essential features through error-bounded quantization and combined lossless/lossy compression.

Details

Motivation: HPC environments like synchrotron facilities generate massive X-ray CT datasets with high dimensionality and volume, creating computational/storage challenges that limit real-time processing and workflow efficiency.

Method: Region-of-interest driven extraction framework with error-bounded quantization at pre-processing stage, followed by object extraction combined with multiple state-of-the-art lossless and lossy compressors.

Result: Achieved 12.34x relative compression ratio improvement compared to standard compression across seven X-CT datasets while preserving critical information for downstream processing.

Conclusion: ROIX-Comp effectively addresses computational and storage constraints in HPC environments for X-ray CT data through intelligent ROI-based compression that maintains essential information.

Abstract: In high-performance computing (HPC) environments, particularly in synchrotron radiation facilities, vast amounts of X-ray images are generated. Processing large-scale X-ray Computed Tomography (X-CT) datasets presents significant computational and storage challenges due to their high dimensionality and data volume. Traditional approaches often require extensive storage capacity and high transmission bandwidth, limiting real-time processing capabilities and workflow efficiency. To address these constraints, we introduce a region-of-interest (ROI)-driven extraction framework (ROIX-Comp) that intelligently compresses X-CT data by identifying and retaining only essential features. Our work reduces data volume while preserving critical information for downstream processing tasks. At the pre-processing stage, we utilize error-bounded quantization to reduce the amount of data to be processed and thereby improve computational efficiency. At the compression stage, our methodology combines object extraction with multiple state-of-the-art lossless and lossy compressors, resulting in significantly improved compression ratios. We evaluated this framework on seven X-CT datasets and observed a relative compression ratio improvement of 12.34x compared to standard compression.
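
As a rough illustration of the error-bounded quantization used in pre-processing, the sketch below applies uniform scalar quantization with an absolute error bound. This is a generic textbook scheme chosen for illustration, not necessarily the quantizer used in ROIX-Comp; the function names and the sample values are hypothetical.

```python
def quantize(values, error_bound):
    """Map each value to an integer code so the reconstruction error
    stays within +/- error_bound (absolute error bound)."""
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    step = 2.0 * error_bound  # bin width; bin centres are multiples of step
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct approximate values from the integer codes."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


# Made-up sample intensities, quantized with an absolute bound of 0.05.
samples = [0.0, 0.13, -1.27, 3.999, 100.5]
bound = 0.05
codes = quantize(samples, bound)
restored = dequantize(codes, bound)
max_error = max(abs(a - b) for a, b in zip(samples, restored))
```

The integer codes typically repeat far more often than the raw floating-point values, which is what makes them easier for a downstream lossless compressor to shrink.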

[389] Automated Assessment of Kidney Ureteroscopy Exploration for Training

Fangjie Li, Nicholas Kavoussi, Charan Mohan, Matthieu Chabanas, Jie Ying Wu

Main category: eess.IV

TL;DR: A ureteroscope video-based localization framework for automated feedback in kidney phantom training, using reference reconstruction from prior exploration to identify missed calyces.

Details

Motivation: Current clinical training for kidney ureteroscopic navigation requires one-on-one expert feedback in operating rooms, creating limited training opportunities. There's a need for phantom training systems with automated feedback to expand training access outside the OR.

Method: Proposes a purely ureteroscope video-based scope localization framework that uses a slow, thorough prior exploration video to generate a reference reconstruction of the kidney phantom. This reference is then used to localize any subsequent exploration video of the same phantom and identify missed calyces.

Result: Achieved 69 out of 74 calyces correctly classified across 15 exploration videos, with <4mm camera pose localization error. The system takes 10 minutes to process a typical 1-2 minute exploration video after reference reconstruction.

Conclusion: Demonstrates a novel camera localization framework that provides accurate, automated feedback for kidney phantom explorations, enabling out-of-OR training without expert supervision.

Abstract: Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly expand training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve < 4mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minute long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.

[390] RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion

Kavyansh Tyagi, Vishwas Rathi, Puneet Goyal

Main category: eess.IV

TL;DR: RefineFormer3D is a lightweight hierarchical transformer for 3D medical image segmentation that balances accuracy and efficiency through GhostConv3D patch embedding, MixFFN3D modules, and cross-attention fusion decoder.

Details

Motivation: Transformer-based architectures for 3D medical image segmentation offer superior global contextual modeling but suffer from excessive parameters and memory demands, limiting clinical deployment. There's a need for accurate yet computationally efficient solutions for practical clinical workflows.

Method: Proposes RefineFormer3D with three key components: 1) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, 2) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and 3) cross-attention fusion decoder enabling adaptive multi-scale skip connection integration.

Result: Achieves 93.44% average Dice score on ACDC and 85.9% on BraTS benchmarks, outperforming or matching state-of-the-art methods with only 2.94M parameters. Fast inference (8.35 ms per volume on GPU) with low memory requirements.

Conclusion: RefineFormer3D establishes an effective and scalable solution for practical 3D medical image segmentation, balancing accuracy and computational efficiency for clinical deployment.

Abstract: Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44% and 85.9% average Dice scores respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.

[391] Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model

Ahmet Halici, Ece Tugba Cebeci, Musa Balci, Mustafa Cini, Serkan Sokmen

Main category: eess.IV

TL;DR: A hierarchical vision-language framework for generating diagnostic text from gigapixel histopathology whole slide images using a frozen pathology foundation model with multi-resolution patch selection and retrieval-based verification.

Details

Motivation: Generating diagnostic text from histopathology whole slide images is challenging due to their gigapixel scale and the need for precise, domain-specific medical language.

Method: Uses hierarchical processing with multi-resolution pyramidal patch selection (downsampling factors 2^3 to 2^6), background/artifact removal, UNI Vision Transformer for feature extraction, 6-layer Transformer decoder with cross-attention, BioGPT tokenization, and retrieval-based verification using Sentence BERT embeddings.

Result: The framework generates diagnostic text from histopathology images with improved reliability through retrieval-based verification that can replace generated reports with ground truth references when high similarity matches are found.

Conclusion: Proposes a comprehensive vision-language framework for medical report generation that addresses the challenges of gigapixel image processing and domain-specific language requirements in histopathology.

Abstract: Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain-specific language. We propose a hierarchical vision-language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi-resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV-based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6-layer Transformer decoder that generates diagnostic text via cross-attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings; if a high-similarity match is found, the generated report is replaced with the retrieved ground-truth reference to improve reliability.
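
The Laplacian-variance criterion mentioned above can be sketched as follows. The sketch assumes a patch is given as a 2D list of grayscale intensities, uses a 4-neighbour Laplacian, and applies a hypothetical threshold of 50.0; the paper's actual kernel and threshold are not given here, and the HSV-based criteria are omitted.

```python
# Downsampling factors stated in the abstract: 2^3 .. 2^6.
PYRAMID_FACTORS = [2 ** k for k in range(3, 7)]


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the patch interior.
    Low values suggest flat background or out-of-focus tissue."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] +
                   gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def keep_patch(gray, threshold=50.0):
    """Keep a patch only if it has enough high-frequency detail.
    The threshold is an illustrative assumption."""
    return laplacian_variance(gray) >= threshold


# A uniform patch (like slide background) versus a high-contrast checkerboard.
flat_patch = [[200] * 6 for _ in range(6)]
textured_patch = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
```

In this toy case the flat patch is rejected and the textured one is kept; on real slides the threshold would need tuning per magnification level.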

[392] Less is More: Skim Transformer for Light Field Image Super-resolution

Zeke Zexi Hu, Haodong Chen, Hui Ye, Xiaoming Chen, Vera Yuk Ying Chung, Yiran Shen, Weidong Cai

Main category: eess.IV

TL;DR: Skim Transformer architecture for light field image super-resolution that selectively attends to subsets of sub-aperture images based on disparity ranges, reducing redundancy and improving efficiency.

Details

Motivation: Light field images contain significant data redundancy from spatial and angular information. Existing methods use all sub-aperture images indiscriminately, leading to disparity entanglement and inefficiency in processing.

Method: Proposes Skim Transformer with multi-branch structure where each branch attends to specific disparity ranges using skimmed subsets of SAIs rather than all SAIs. Implements SkimLFSR for light field super-resolution.

Result: SkimLFSR achieves state-of-the-art results while requiring only 67% of the prior leading method's parameters, surpassing the best existing method by 0.63 dB and 0.35 dB PSNR on the 2x and 4x super-resolution tasks, respectively.

Conclusion: The “less is more” approach of selective attention to disparity-specific subsets of SAIs provides efficient and effective light field image processing with good generalizability across angular resolutions.

Abstract: A light field image captures scenes through its micro-lens array, providing a rich representation that encompasses spatial and angular information. While this richness comes at the cost of significant data redundancy, most existing methods tend to indiscriminately utilize all the information from sub-aperture images (SAIs) in an attempt to harness every visual cue regardless of their disparity significance. However, this paradigm inevitably leads to disparity entanglement, a fundamental cause of inefficiency in light field image processing. To address this limitation, we introduce the Skim Transformer, a novel architecture inspired by the “less is more” philosophy. It features a multi-branch structure where each branch is dedicated to a specific disparity range by constructing its attention score matrix over a skimmed subset of SAIs, rather than all of them. Building upon it, we present SkimLFSR, an efficient yet powerful network for light field image super-resolution. Requiring only 67% of the prior leading method’s parameters, SkimLFSR achieves state-of-the-art results surpassing the best existing method by 0.63 dB and 0.35 dB PSNR at the 2x and 4x tasks, respectively. Through in-depth analyses, we reveal that SkimLFSR, guided by the predefined skimmed SAI sets as prior knowledge, demonstrates distinct disparity-aware behaviors in attending to visual cues. Last but not least, we conduct an experiment to validate SkimLFSR’s generalizability across different angular resolutions, where it achieves competitive performance on a larger angular resolution without any retraining or major network modifications. These findings highlight its effectiveness and adaptability as a promising paradigm for light field image processing.

[393] Filter2Noise: A Framework for Interpretable and Zero-Shot Low-Dose CT Image Denoising

Yipeng Sun, Linda-Sophie Schneider, Siyuan Mei, Jinhua Wang, Ge Hu, Mingxuan Gu, Chengze Ye, Fabian Wagner, Lan Song, Siming Bayer, Andreas Maier

Main category: eess.IV

TL;DR: F2N is a self-supervised, interpretable zero-shot denoising framework for low-dose CT using an attention-guided bilateral filter with only 3.6k parameters, achieving state-of-the-art results without paired training data.

Details

Motivation: Current deep learning denoising methods for low-dose CT require impractical paired data (supervised) or use opaque, parameter-heavy networks (self-supervised) that limit clinical trust and interpretability.

Method: Proposes Filter2Noise (F2N) using an Attention-Guided Bilateral Filter as a transparent mathematical operator. A lightweight attention module predicts spatially varying filter parameters. Uses multi-scale self-supervised loss with Euclidean Local Shuffle to disrupt noise patterns while preserving anatomy.

Result: Achieves state-of-the-art results on Mayo Clinic LDCT Challenge, outperforming competing zero-shot methods by up to 3.68 dB PSNR with only 3.6k parameters (orders of magnitude fewer than competitors). Validated on clinical photon-counting CT data.

Conclusion: F2N combines high performance with transparency, user control, and parameter efficiency, offering a trustworthy solution for LDCT enhancement that addresses clinical trust concerns.

Abstract: Noise in low-dose computed tomography (LDCT) can obscure important diagnostic details. While deep learning offers powerful denoising, supervised methods require impractical paired data, and self-supervised alternatives often use opaque, parameter-heavy networks that limit clinical trust. We propose Filter2Noise (F2N), a novel self-supervised framework for interpretable, zero-shot denoising from a single LDCT image. Instead of a black-box network, its core is an Attention-Guided Bilateral Filter, a transparent, content-aware mathematical operator. A lightweight attention module predicts spatially varying filter parameters, making the process transparent and allowing interactive radiologist control. To learn from a single image with correlated noise, we introduce a multi-scale self-supervised loss coupled with Euclidean Local Shuffle (ELS) to disrupt noise patterns while preserving anatomical integrity. On the Mayo Clinic LDCT Challenge, F2N achieves state-of-the-art results, outperforming competing zero-shot methods by up to 3.68 dB in PSNR. It accomplishes this with only 3.6k parameters, orders of magnitude fewer than competing models, which accelerates inference and simplifies deployment. By combining high performance with transparency, user control, and high parameter efficiency, F2N offers a trustworthy solution for LDCT enhancement. We further demonstrate its applicability by validating it on clinical photon-counting CT data. Code is available at: https://github.com/sypsyp97/Filter2Noise.
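
For readers unfamiliar with the operator at the core of F2N, here is a minimal sketch of a classical bilateral filter on a 2D grayscale image. It uses fixed, hand-picked sigma values as a simplifying assumption; in F2N the filter parameters are described as spatially varying and predicted by an attention module, which this sketch does not attempt to reproduce.

```python
import math


def bilateral_filter(img, radius=1, sigma_space=1.0, sigma_range=10.0):
    """Classical bilateral filter: each output pixel is a weighted mean of its
    neighbours, weighted by spatial closeness and intensity similarity."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            num = 0.0
            den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        diff = img[yy][xx] - img[y][x]
                        weight = math.exp(
                            -(dx * dx + dy * dy) / (2.0 * sigma_space ** 2)
                            - (diff * diff) / (2.0 * sigma_range ** 2)
                        )
                        num += weight * img[yy][xx]
                        den += weight
            # den >= 1 because the centre pixel always has weight 1.
            out[y][x] = num / den
    return out


# A sharp vertical edge: the range term keeps the two sides from mixing.
edge_image = [[0.0, 0.0, 100.0, 100.0] for _ in range(4)]
filtered = bilateral_filter(edge_image)
```

Every weight in the filter is an explicit function of distance and intensity difference, which is the sense in which such an operator is more inspectable than a generic convolutional network.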

[394] Rotterdam artery-vein segmentation (RAV) dataset

Jose Vargas Quiros, Bart Liefers, Karin van Garderen, Jeroen Vermeulen, Eyened Reading Center, Caroline Klaver

Main category: eess.IV

TL;DR: A diverse retinal fundus image dataset with high-quality artery-vein segmentation annotations for developing ML algorithms in ophthalmology.

Details

Motivation: To create a comprehensive dataset for retinal vascular analysis that addresses limitations of existing datasets by including diverse image qualities, connectivity-validated annotations, and real-world variability to support robust ML development.

Method: Collected color fundus images from the longitudinal Rotterdam Study, annotated using a custom interface with separate artery/vein/unknown vessel layers, starting from initial vessel segmentation masks with connectivity verification using connected component visualization tools.

Result: Created dataset with 1024x1024-pixel PNG images in three modalities: original RGB, contrast-enhanced versions, and RGB-encoded artery-vein masks, including challenging samples typically excluded by automated quality assessment but containing valuable vascular information.

Conclusion: The dataset provides a rich, heterogeneous source of fundus images with high-quality segmentations that supports robust benchmarking and training of ML models under real-world variability in image quality and acquisition settings.

Abstract: Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

[395] Learning to Select Like Humans: Explainable Active Learning for Medical Imaging

Ifrat Ikhtear Uddin, Longwei Wang, Xiao Qin, Yang Zhou, KC Santosh

Main category: eess.IV

TL;DR: Explainability-guided active learning framework for medical imaging that combines classification uncertainty with attention misalignment to select samples that improve both predictive performance and clinical interpretability.

Details

Motivation: Medical image analysis requires expensive expert annotation. Traditional active learning methods only consider predictive uncertainty, ignoring whether models learn clinically meaningful features, which is critical for clinical deployment.

Method: Proposes a dual-criterion selection strategy: (1) classification uncertainty to identify informative examples, and (2) attention misalignment with radiologist-defined ROIs using Dice similarity between Grad-CAM attention maps and expert annotations.

Result: Using only 570 strategically selected samples, the approach outperforms random sampling across three medical imaging datasets: 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID-19. Grad-CAM visualizations confirm models focus on diagnostically relevant regions.

Conclusion: Incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability, demonstrating that explainability-guided active learning can enhance both predictive performance and spatial interpretability in medical imaging.

Abstract: Medical image analysis requires substantial labeled data for model training, yet expert annotation is expensive and time-consuming. Active learning (AL) addresses this challenge by strategically selecting the most informative samples for annotation, but traditional methods rely solely on predictive uncertainty while ignoring whether models learn from clinically meaningful features, a critical requirement for clinical deployment. We propose an explainability-guided active learning framework that integrates spatial attention alignment into the sample acquisition process. Our approach advocates for a dual-criterion selection strategy combining: (i) classification uncertainty to identify informative examples, and (ii) attention misalignment with radiologist-defined regions-of-interest (ROIs) to target samples where the model focuses on incorrect features. By measuring misalignment between Grad-CAM attention maps and expert annotations using Dice similarity, our acquisition function judiciously identifies samples that enhance both predictive performance and spatial interpretability. We evaluate the framework using three expert-annotated medical imaging datasets, namely, BraTS (MRI brain tumors), VinDr-CXR (chest X-rays), and SIIM-COVID-19 (chest X-rays). Using only 570 strategically selected samples, our explainability-guided approach consistently outperforms random sampling across all the datasets, achieving 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID-19. Grad-CAM visualizations confirm that the models trained by our dual-criterion selection focus on diagnostically relevant regions, demonstrating that incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability.
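
The dual-criterion selection can be sketched as a single acquisition score. The sketch assumes the Grad-CAM map has already been thresholded into a binary mask, measures uncertainty as normalized entropy, and combines the two terms with a weighted sum controlled by a hypothetical parameter alpha; the abstract specifies only that Dice similarity and classification uncertainty are used, not how they are combined.

```python
import math


def dice(mask_a, mask_b):
    """Dice similarity between two flat binary masks (lists of 0/1)."""
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    if total == 0:
        return 1.0  # both masks empty: treat as perfectly aligned
    return 2.0 * intersection / total


def normalized_entropy(probs):
    """Predictive entropy scaled to [0, 1] by the maximum log(n_classes)."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def acquisition_score(probs, cam_mask, roi_mask, alpha=0.5):
    """Higher score = more worth annotating. alpha is an assumed weight."""
    uncertainty = normalized_entropy(probs)
    misalignment = 1.0 - dice(cam_mask, roi_mask)
    return alpha * uncertainty + (1.0 - alpha) * misalignment


# Uncertain prediction whose attention misses the expert ROI entirely.
hard_case = acquisition_score([0.5, 0.5], [1, 1, 0, 0], [0, 0, 1, 1])
# Confident prediction whose attention matches the expert ROI.
easy_case = acquisition_score([1.0, 0.0], [1, 0], [1, 0])
```

Samples would then be ranked by this score and the highest-scoring ones sent for annotation, so that a confident model looking at the wrong region is still flagged.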

Editor’s Picks

[1] BAT: Better Audio Transformer Guided by Convex Gated Probing

[2] Spatial Audio Question Answering and Reasoning on Dynamic Source Movements

[3] Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Today’s Research Highlights

Table of Contents

cs.CL

[1] The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts

[2] Language Model Representations for Efficient Few-Shot Tabular Classification

[3] KD4MT: A Survey of Knowledge Distillation for Machine Translation

[4] Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs

[5] Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models

[6] Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling

[7] Preference Optimization for Review Question Generation Improves Writing Quality

[8] Large Language Models for Assisting American College Applications

[9] Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

[10] Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

[11] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

[12] A Lightweight Explainable Guardrail for Prompt Safety

[13] Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

[14] Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective

[15] Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach

[16] State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

[17] From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants

[18] Reranker Optimization via Geodesic Distances on k-NN Manifolds

[19] CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

[20] Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

[21] Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

[22] NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey

[23] Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

[24] Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

[25] Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches

[26] VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

[27] CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content

[28] P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA

[29] Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

[30] Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

[31] Doc-to-LoRA: Learning to Instantly Internalize Contexts

[32] Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens

[33] Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

[34] MultiCube-RAG for Multi-hop Question Answering

[35] DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

[36] A Curious Class of Adpositional Multiword Expressions in Korean

[37] CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

[38] Surgical Activation Steering via Generative Causal Mediation

[39] Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs

[40] Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities

[41] Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis

[42] Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

[43] LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers

[44] Beyond Learning: A Training-Free Alternative to Model Adaptation

[45] The Validity of Coreference-based Evaluations of Natural Language Understanding

[46] Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications

[47] Are LLMs Ready to Replace Bangla Annotators?

[48] Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation

[49] MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

[50] MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

[51] Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

[52] Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents

[53] TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

[54] IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models

[55] Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

[56] Learning to Learn from Language Feedback with Social Meta-Learning

[57] Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

[58] From Growing to Looping: A Unified View of Iterative Computation in LLMs

[59] Optimizing Soft Prompt Tuning via Structural Evolution

[60] Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

[61] Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

[62] CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes

[63] ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

[64] Who can we trust? LLM-as-a-jury for Comparative Assessment

[65] AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

[66] Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

[67] Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

[68] Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

[69] Reinforced Fast Weights with Next-Sequence Prediction

[70] Evaluating Language Model Agency through Negotiations

[71] Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

[72] When Stereotypes GTG: The Impact of Predictive Text Suggestions on Gender Bias in Human-AI Co-Writing

[73] Integrating Chain-of-Thought and Retrieval Augmented Generation Enhances Rare Disease Diagnosis from Clinical Notes