Daily arXiv Papers - 2026-01-30

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim

Main category: cs.SD

TL;DR: SEPT is a prompt tuning framework for audio-language models that addresses the Base-New Tradeoff by semantically expanding prompt embeddings using LLM-generated neighbors and a novel semantic expansion loss.

Motivation: Prompt tuning in audio-language models suffers from the Base-New Tradeoff due to a disrupted semantic structure in the embedding space, limiting generalization ability.

Method: Proposes Semantically Expanded Prompt Tuning (SEPT) that uses LLM-generated semantic neighbors to regularize prompt embedding space with a novel semantic expansion loss promoting intra-class compactness and inter-class separability.
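
A minimal sketch of what such a margin-based semantic expansion objective could look like, assuming each class prompt embedding comes with LLM-generated neighbor embeddings; the function name, margins, and exact formulation are illustrative, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def semantic_expansion_loss(prompts, neighbors, margin_intra=0.1, margin_inter=0.3):
    """prompts: (C, D) class prompt embeddings; neighbors: (C, K, D) LLM-generated neighbors."""
    prompts = F.normalize(prompts, dim=-1)
    neighbors = F.normalize(neighbors, dim=-1)

    # Intra-class compactness: pull each class prompt toward its own semantic neighbors.
    sim_intra = torch.einsum("cd,ckd->ck", prompts, neighbors)
    loss_intra = F.relu((1.0 - margin_intra) - sim_intra).mean()

    # Inter-class separability: push prompts of different classes below a similarity margin.
    sim_inter = prompts @ prompts.t()
    off_diag = ~torch.eye(len(prompts), dtype=torch.bool)
    loss_inter = F.relu(sim_inter[off_diag] - margin_inter).mean()

    return loss_intra + loss_inter
```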

Result: SEPT consistently improves generalization performance across multiple prompt tuning baselines for audio-language models, enhancing both base-to-new generalization and cross-dataset transferability.

Conclusion: SEPT effectively addresses the Base-New Tradeoff in ALMs by preserving semantic structure in prompt embedding space, establishing a benchmark for prompt generalization in audio-language models.

Abstract: Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and is recently being adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference. Code is available at https://github.com/jhyukjang/SEPT.

Relevance: 9/10

[2] SemanticAudio: Audio Generation and Editing in Semantic Space

Zheqi Dai, Guangyan Zhang, Haolin He, Xiquan Li, Jingyu Li, Chunyat Wu, Yiwen Guo, Qiuqiang Kong

Main category: eess.AS

TL;DR: SemanticAudio introduces a two-stage Flow Matching framework for text-to-audio generation that operates in a high-level semantic space rather than acoustic latent space, enabling better text-audio alignment and training-free text-guided editing.

Motivation: Existing text-to-audio models operate directly in the acoustic latent space of VAEs, leading to suboptimal alignment between generated audio and textual descriptions. There's a need for better semantic understanding and control in audio generation.

Method: Two-stage Flow Matching architecture: 1) Semantic Planner generates compact semantic features capturing global identity and temporal sequence of sound events, 2) Acoustic Synthesizer produces high-fidelity acoustic latents conditioned on the semantic plan. Also introduces training-free text-guided editing by steering semantic generation trajectory via velocity field differences.
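
As a rough illustration of the editing mechanism, the following sketch steers a semantic-space trajectory with the difference of velocity fields obtained from source and target prompts; the `velocity` function, the guidance weight, and the toy linear field are stand-ins, not the released model:

```python
import numpy as np

def edit_semantic_trajectory(z0, velocity, src_prompt, tgt_prompt, steps=50, guidance=1.0):
    z, dt = z0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_src = velocity(z, t, src_prompt)
        v_tgt = velocity(z, t, tgt_prompt)
        # Follow the source field, nudged along the direction separating target from source.
        z = z + dt * (v_src + guidance * (v_tgt - v_src))
    return z

# Toy usage with a stand-in linear "velocity field" per prompt.
rng = np.random.default_rng(0)
W = {"dog barking": rng.normal(size=(8, 8)), "cat meowing": rng.normal(size=(8, 8))}
velocity = lambda z, t, p: z @ W[p]
z_edit = edit_semantic_trajectory(rng.normal(size=8), velocity, "dog barking", "cat meowing")
print(z_edit)
```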

Result: Extensive experiments show SemanticAudio surpasses existing mainstream approaches in semantic alignment. The framework enables precise attribute-level modifications without retraining.

Conclusion: Operating in semantic space rather than acoustic space improves text-audio alignment and enables flexible editing capabilities. The decoupled design offers better control over audio generation.

Abstract: In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/

Relevance: 9/10

[3] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Haochen Zhang, Animesh Sinha, Felix Juefei-Xu, Haoyu Ma, Kunpeng Li, Zhipeng Fan, Meng Dong, Xiaoliang Dai, Tingbo Hou, Peizhao Zhang, Zecheng He

Main category: cs.CV

TL;DR: Non-Markov conversational image generation framework that addresses multi-round consistency by forcing models to reference earlier visual states and maintain entity identities across interactions.

Motivation: Existing multimodal LLMs for image generation are mostly Markovian: they only consider recent context, allowing shortcuts that ignore long-range chat history. This fails when users reference earlier states, undo changes, or mention entities introduced rounds ago.

Method: Three key components: 1) Non-Markov multi-round data construction with rollback editing and name-based personalization, 2) History-conditioned training with token-level caching to prevent identity drift, 3) Reconstruction-based DiT detokenizer and multi-stage fine-tuning curriculum for high-fidelity image generation.
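
A toy sketch of the token-level caching idea for identity preservation, assuming a per-entity cache keyed by name whose image tokens are re-injected whenever the name recurs; the class, method names, and token strings below are invented for illustration:

```python
class EntityTokenCache:
    """Keeps the first appearance of each named entity as its canonical identity."""
    def __init__(self):
        self._tokens = {}

    def bind(self, name, image_tokens):
        # Only the first binding is kept, so later rounds cannot drift the identity.
        self._tokens.setdefault(name, image_tokens)

    def lookup(self, instruction):
        # Return cached tokens for every entity mentioned in the current instruction.
        return {n: t for n, t in self._tokens.items() if n in instruction}

cache = EntityTokenCache()
cache.bind("Milo", ["<img_tok_17>", "<img_tok_942>"])               # round 1: "name this corgi Milo"
context = cache.lookup("put Milo back on the beach from round 1")   # later round reuses cached tokens
print(context)
```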

Result: The framework yields substantial improvements in multi-round consistency and instruction compliance while maintaining strong single-round editing and personalization capabilities.

Conclusion: Explicitly training for non-Markov interactions is crucial for true conversational image generation, enabling models to maintain consistency across multiple rounds of interaction by properly referencing and preserving earlier visual states.

Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, Dipanjan Das

Main category: cs.CL

TL;DR: DeepSearchQA is a 900-prompt benchmark for evaluating AI agents on complex multi-step information-seeking tasks across 17 fields, testing capabilities like systematic information collation, de-duplication, and reasoning about stopping criteria in open-ended search spaces.

Motivation: Traditional benchmarks focus on single answer retrieval or broad factuality, but there's a need to evaluate agents' abilities to execute complex search plans, handle fragmented information from disparate sources, and reason about stopping criteria in open-ended search scenarios.

Method: Created a dataset of 900 challenging, handcrafted tasks structured as causal chains where each step depends on previous completion. Tasks are grounded in the open web with objectively verifiable answer sets, testing systematic collation, de-duplication/entity resolution, and stopping criteria reasoning.
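
For intuition, here is a hedged sketch of how an exhaustive answer list might be scored against a verifiable gold set with de-duplication; the normalization rule is an assumption and the benchmark's official matcher may differ:

```python
def score_answer_list(predicted, gold):
    norm = lambda s: " ".join(s.lower().split())
    pred = {norm(p) for p in predicted}            # de-duplication via set semantics
    gold = {norm(g) for g in gold}
    hits = pred & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A hedging agent that pads its list with guesses gains recall but loses precision.
print(score_answer_list(["Ada Lovelace", "ada lovelace", "Alan Turing"],
                        ["Ada Lovelace", "Grace Hopper"]))
```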

Result: Evaluation of state-of-the-art agent architectures reveals significant performance limitations: even advanced models struggle to balance high recall with precision, showing failure modes like premature stopping (under-retrieval) and hedging behaviors (overly wide net of low-confidence answers).

Conclusion: DeepSearchQA serves as an essential diagnostic tool highlighting critical headroom in current agent designs and driving future research toward more robust deep-research capabilities in information-seeking agents.

Abstract: We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent’s ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.

[2] asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation

Oleg Sedukhin, Andrey Kostin

Main category: cs.CL

TL;DR: Proposes improvements to speech recognition evaluation including multi-reference string alignment algorithm, new Russian longform speech dataset, and tools for streaming ASR evaluation with uniform model wrappers.

Motivation: Current speech recognition evaluation methods have limitations for non-Latin languages, rich word formation, and longform speech. There's a need for better alignment algorithms, diverse test sets, and tools to evaluate streaming ASR models while avoiding dataset-specific adaptation illusions.

Method: 1) Developed string alignment algorithm supporting multi-reference labeling, arbitrary-length insertions, and better word alignment. 2) Collected DiverseSpeech-Ru dataset of longform in-the-wild Russian speech with multi-reference labeling. 3) Relabeled popular Russian test sets with multiple references. 4) Created tools for streaming ASR evaluation and transcription visualization. 5) Provided uniform wrappers for offline and streaming ASR models.
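
The multi-reference idea can be illustrated with a minimal sketch that scores a hypothesis against whichever acceptable reference it matches best; the paper's algorithm is richer (arbitrary-length insertions, finer word alignment), so this is only a simplified stand-in:

```python
def edit_distance(a, b):
    # Single-row dynamic programming over word sequences.
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[len(b)]

def multi_reference_wer(hypothesis, references):
    hyp = hypothesis.split()
    best = min(references, key=lambda r: edit_distance(hyp, r.split()))
    return edit_distance(hyp, best.split()) / max(len(best.split()), 1)

print(multi_reference_wer("the cat sat on a mat",
                          ["the cat sat on the mat", "a cat sat on the mat"]))  # 1/6
```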

Result: Demonstrated that models often adapt to dataset-specific labeling, creating illusion of metric improvement. The improved alignment algorithm works better for non-Latin languages and longform speech. New tools enable better evaluation of streaming ASR and visual comparison of transcriptions.

Conclusion: The proposed evaluation improvements address limitations in current ASR assessment, particularly for non-Latin languages and streaming scenarios. The tools and datasets provide more robust evaluation frameworks for speech recognition research.

Abstract: We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages and those with rich word formation, and for labeling cluttered or longform speech. Secondly, we collect a novel test set DiverseSpeech-Ru of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.

[3] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

Muhammad Ali Shafique, Areej Mehboob, Layba Fiaz, Muhammad Usman Qadeer, Hamza Farooq

Main category: cs.CL

TL;DR: Proposes UrduBench, a comprehensive Urdu reasoning benchmark created through contextually ensembled translation with human validation, evaluating LLMs on reasoning tasks translated from English benchmarks.

Motivation: There's a lack of standardized reasoning benchmarks for low-resource languages like Urdu, with existing evaluation limited by machine translation sensitivity and focus on general language tasks rather than reasoning capabilities.

Method: Developed a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to translate English reasoning benchmarks (MGSM, MATH-500, CommonSenseQA, OpenBookQA) into Urdu while preserving contextual and structural integrity.

Result: Created UrduBench and conducted comprehensive evaluation showing multi-step and symbolic reasoning tasks pose significant challenges in Urdu, with stable language alignment being critical for robust reasoning. Performance differences were analyzed across datasets, difficulty levels, model architectures, scaling settings, and language consistency.

Conclusion: Established a scalable methodology for standardized reasoning evaluation in Urdu that provides empirical insights into multilingual reasoning failures, with broad applicability to other low-resource languages.

Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.

[4] Position-invariant Fine-tuning of Speech Enhancement Models with Self-supervised Speech Representations

Amit Meghanani, Thomas Hain

Main category: cs.CL

TL;DR: The paper addresses limitations in fine-tuning speech enhancement models using SSL representations with MSE loss, proposing position-invariant fine-tuning strategies to prevent exploitation of positional embeddings.

Motivation: Current speech enhancement models fine-tuned with SSL representations using MSE loss can exploit positional embeddings rather than learning content-related information, limiting their effectiveness for downstream tasks in noisy conditions.

Method: Two position-invariant fine-tuning strategies: (1) zero-padding approach adapted from SSL pre-training, and (2) speed perturbations with soft-DTW loss to make training less sensitive to positional correlations.
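
To make the second strategy concrete, here is a small self-contained sketch of the soft-DTW recursion that such a loss builds on; the squared-Euclidean frame cost and gamma are illustrative choices, and the actual fine-tuning setup differs in detail:

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """x: (T1, D) enhanced features, y: (T2, D) clean-speech SSL features."""
    def softmin(a, b, c):
        vals = np.array([a, b, c]) / -gamma
        m = vals.max()
        return -gamma * (m + np.log(np.exp(vals - m).sum()))

    T1, T2 = len(x), len(y)
    R = np.full((T1 + 1, T2 + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.sum((x[i - 1] - y[j - 1]) ** 2)       # frame-level distance
            R[i, j] = cost + softmin(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return R[T1, T2]

rng = np.random.default_rng(0)
# Sequences of different lengths (e.g. after speed perturbation) can still be compared.
print(soft_dtw(rng.normal(size=(40, 8)), rng.normal(size=(50, 8))))
```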

Result: The soft-DTW-based approach achieves faster convergence and improved downstream performance compared to standard MSE fine-tuning, demonstrating the importance of position-invariant fine-tuning.

Conclusion: Position-invariant fine-tuning strategies, particularly soft-DTW with speed perturbations, are crucial for effective SSL-based speech enhancement to prevent exploitation of positional embeddings and improve downstream task performance.

Abstract: Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.

[5] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

Ketan Thakkar, Maitreyi Chatterjee, Ramasubramanian Balasubramanian, Achyuthan Jootoo, Rajendra Ugrani

Main category: cs.CL

TL;DR: ChunkWise LoRA: Dynamic low-rank adaptation that partitions sequences into variable-length chunks based on token complexity, assigning tailored low-rank configurations per chunk to reduce latency and memory usage while maintaining performance.

Motivation: Existing LoRA methods use static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. This leads to inefficient resource allocation where simple tokens get the same computational treatment as complex ones.

Method: 1) Runtime scheduler estimates token difficulty, 2) Adaptive chunking partitions sequences into variable-length chunks based on complexity, 3) Rank-ladder mechanism selects per-chunk LoRA rank and scaling, 4) Boundary-safe composition module preserves output consistency, 5) Policy-driven KV-cache strategies for efficiency.
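
A hedged sketch of the chunking and rank-ladder idea: score per-token difficulty, cut the sequence at large difficulty jumps, and map each chunk's mean difficulty to a rank from a ladder. The thresholds, the difficulty proxy, and the ladder values are assumptions, not the paper's settings:

```python
def adaptive_chunks(difficulty, jump=0.25, max_len=64):
    chunks, start = [], 0
    for i in range(1, len(difficulty)):
        if abs(difficulty[i] - difficulty[i - 1]) > jump or i - start >= max_len:
            chunks.append((start, i))
            start = i
    chunks.append((start, len(difficulty)))
    return chunks

def assign_ranks(difficulty, chunks, ladder=(4, 8, 16, 32)):
    ranks = []
    for s, e in chunks:
        mean_d = sum(difficulty[s:e]) / (e - s)
        ranks.append(ladder[min(int(mean_d * len(ladder)), len(ladder) - 1)])
    return list(zip(chunks, ranks))

difficulty = [0.1, 0.15, 0.9, 0.85, 0.2, 0.25, 0.3]   # e.g. per-token entropy rescaled to [0, 1]
print(assign_ranks(difficulty, adaptive_chunks(difficulty)))
```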

Result: Achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA on Wikitext-103 and SQuAD datasets, while maintaining or improving task performance metrics (BLEU, EM, perplexity).

Conclusion: ChunkWise LoRA provides a practical, efficient fine-tuning solution that dynamically adapts to token complexity, offering significant computational savings while maintaining performance, with full compatibility to existing transformer architectures and inference frameworks.

Abstract: Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.

[6] Multi-task Code LLMs: Data Mix or Model Merge?

Mingzhi Zhu, Boris Sobolev, Rahul Krishna, Raju Pavuluri, Stacy Patterson, Michele Merler

Main category: cs.CL

TL;DR: Comparing data mixing vs model merging for creating small multi-task code LLMs, finding model merging works best at larger scales (7B) while data mixing is better at smaller scales (2B).

Motivation: Efficient strategies are needed for multi-task learning in code LLMs that balance performance, constraints, and costs in resource-constrained deployment scenarios.

Method: Compare data mixing (training on mixed datasets) vs model merging (combining task-specific models) approaches across Qwen Coder and DeepSeek Coder families at 2B and 7B scales, fine-tuned for code generation and summarization tasks.
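
As a point of reference, the simplest form of model merging is a weighted average of two task-specific fine-tunes of the same architecture; the equal weighting below is an assumption, and the paper evaluates merging strategies beyond plain averaging:

```python
import torch

def merge_state_dicts(sd_codegen, sd_summarization, alpha=0.5):
    # Element-wise interpolation of matching parameter tensors.
    return {k: alpha * sd_codegen[k] + (1 - alpha) * sd_summarization[k] for k in sd_codegen}

# Toy usage with two "checkpoints" of identical shape.
sd_a = {"layer.weight": torch.ones(2, 2)}
sd_b = {"layer.weight": torch.zeros(2, 2)}
print(merge_state_dicts(sd_a, sd_b)["layer.weight"])   # 0.5 everywhere
```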

Result: Model merging achieves best overall performance at larger scale (7B), retaining 96% of specialized model performance on code generation while maintaining summarization capabilities. Merged Qwen Coder 7B achieved 92.7% Pass@1 on HumanEval vs 90.9% for task-specific model. Data mixing preferred at smaller scale (2B).

Conclusion: Careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.

Abstract: Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale we find instead data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.

[7] Multilingual Dysarthric Speech Assessment Using Universal Phone Recognition and Language-Specific Phonemic Contrast Modeling

Eunjung Yeo, Julie M. Liss, Visar Berisha, David R. Mortensen

Main category: cs.CL

TL;DR: Multilingual framework for dysarthria intelligibility assessment using universal phone recognition with language-specific phoneme interpretation via contrastive phonological features.

Motivation: There is a growing need for automated intelligibility assessment methods for neurological disorders like dysarthria that work across multiple languages, as existing approaches are either single-language or fail to capture language-specific factors.

Method: Integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. Produces three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and phoneme coverage (PhonCov).
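
As a rough illustration of the alignment-free coverage idea, the sketch below computes the share of a language's phoneme inventory realized by a mapped phone sequence; the inventory and phone-to-phoneme mapping are toy assumptions, whereas the paper derives the mapping from contrastive phonological feature distances:

```python
def phoneme_coverage(recognized_phones, phone_to_phoneme, inventory):
    realized = {phone_to_phoneme[p] for p in recognized_phones if p in phone_to_phoneme}
    return len(realized & set(inventory)) / len(inventory)

inventory = ["p", "b", "t", "d", "k", "g", "m", "n", "s", "a", "i", "u"]
mapping = {"pʰ": "p", "t̪": "t", "ɑ": "a", "ɪ": "i", "n": "n"}
print(phoneme_coverage(["pʰ", "ɑ", "t̪", "ɪ", "n"], mapping, inventory))  # 5 of 12 phonemes realized
```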

Result: Analysis on English, Spanish, Italian, and Tamil shows PER benefits from mapping+alignment, PFER from alignment alone, and PhonCov from mapping. Framework captures clinically meaningful patterns of intelligibility degradation consistent with established dysarthric speech observations.

Conclusion: Proposed multilingual framework effectively assesses dysarthria intelligibility across languages by combining universal and language-specific approaches, with different metrics benefiting from different components of the framework.

Abstract: The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analysis on English, Spanish, Italian, and Tamil shows that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.

[8] Large Language Models Naively Recover Ethnicity from Individual Records

Noah Dasanaike

Main category: cs.CL

TL;DR: LLMs can accurately infer ethnicity from names, outperforming the traditional BISG method, with applications across multiple countries and reduced demographic bias.

Motivation: Traditional methods like Bayesian Improved Surname Geocoding (BISG) have limitations: they're U.S.-centric, rely on census data, show systematic biases (especially income-related misclassification of minorities), and don't work well internationally where naming conventions differ.

Method: Used six LLMs (Gemini 3 Flash, GPT-4o, DeepSeek v3.2, GLM-4.7, etc.) to classify ethnicity from names without additional training. Tested on stratified samples from Florida/NC voter files with self-reported race, Lebanese voter registration with religious sect, Indian MPs from reserved constituencies, and Indian land records. Applied extended reasoning and included metadata like party registration. Fine-tuned small transformer models on LLM labels for large-scale deployment.

Result: LLM-based classification achieved up to 84.7% accuracy vs BISG’s 68.2% on balanced samples. With metadata, reached 86.7% accuracy. Reduced income bias where BISG misclassified wealthy minorities as White. Validated internationally: Lebanese religious sect (64.3%), Indian MPs (99.2%), Indian caste (74.0%). Successfully recovered known population distributions across India, Uganda, Nepal, Armenia, Chile, and Costa Rica.

Conclusion: LLMs provide superior ethnicity inference from names compared to traditional methods, with international applicability, reduced bias, and potential for cost-effective local deployment via fine-tuned smaller models.

Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.

[9] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An, Minhyeong Yu, Jian Park, Qingshan Zhou, Seunghyun Lee, Yi Bu

Main category: cs.CL

TL;DR: OG-MAR is an ontology-guided multi-agent framework that improves cultural alignment in LLMs by using structured cultural ontologies and demographic profiles from survey data.

DetailsMotivation: LLMs often exhibit cultural misalignment due to skewed pretraining data and lack of structured value representations, reducing consistency and interpretability in culturally sensitive decision making.

Method: Proposes OG-MAR framework that: 1) summarizes respondent-specific values from World Values Survey, 2) constructs global cultural ontology via competency questions, 3) retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, and 4) synthesizes outputs using a judgment agent enforcing ontology consistency and demographic proximity.

Result: Experiments on regional social-survey benchmarks across four LLM backbones show OG-MAR improves cultural alignment and robustness over competitive baselines while producing more transparent reasoning traces.

Conclusion: OG-MAR provides an effective framework for enhancing cultural alignment in LLMs through structured ontology-guided reasoning and multi-agent synthesis.

Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

[10] EnsembleLink

Noah Dasanaike

Main category: cs.CL

TL;DR: EnsembleLink is a record linkage method that uses pre-trained language models to match entities across datasets without requiring labeled training data, achieving high accuracy by leveraging semantic relationships learned from large text corpora.

Motivation: Record linkage is crucial for social science research but is often treated as a preprocessing step with ad hoc rules that don't quantify uncertainty. Existing methods either have low accuracy or require substantial labeled training data, creating barriers for researchers.

Method: Uses ensemble of pre-trained language models that have learned semantic relationships from large text corpora. The method runs locally on open-source models without external API calls, completing linkage tasks in minutes without any training labels.
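
A hedged sketch of the general pattern, ensembling off-the-shelf sentence encoders and averaging cosine similarities to link each query record to its best candidate; the model names are placeholders and the paper's actual ensemble and scoring may differ:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def link_records(queries, candidates, model_names):
    sims = []
    for name in model_names:
        model = SentenceTransformer(name)
        q = model.encode(queries, normalize_embeddings=True)
        c = model.encode(candidates, normalize_embeddings=True)
        sims.append(q @ c.T)                 # cosine similarity matrix for this encoder
    avg = np.mean(sims, axis=0)              # ensemble by averaging similarities
    return [(queries[i], candidates[j]) for i, j in enumerate(avg.argmax(axis=1))]

print(link_records(["South Ozone Park", "Lutte ouvriere"],
                   ["New York City", "Workers' Struggle", "Paris"],
                   ["all-MiniLM-L6-v2", "paraphrase-multilingual-MiniLM-L12-v2"]))
```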

Result: On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling.

Conclusion: EnsembleLink provides a practical, accurate solution for record linkage without labeled data, addressing methodological gaps in social science research by leveraging semantic understanding from pre-trained language models.

Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that “South Ozone Park” is a neighborhood in “New York City” or that “Lutte ouvriere” refers to the Trotskyist “Workers’ Struggle” party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.

[11] Qwen3-ASR Technical Report

Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin

Main category: cs.CL

TL;DR: Qwen3-ASR family introduces two multilingual speech recognition models (1.7B and 0.6B parameters) and a non-autoregressive forced alignment model, achieving SOTA performance with strong efficiency.

Motivation: To develop powerful all-in-one speech recognition models that support multiple languages and dialects, addressing limitations where ASR models perform similarly on benchmarks but differ significantly in real-world scenarios.

Method: Leverages large-scale speech training data and the audio understanding capabilities of foundation model Qwen3-Omni. The family includes two ASR models (1.7B and 0.6B parameters) supporting 52 languages/dialects, plus a non-autoregressive forced alignment model for text-speech timestamp alignment in 11 languages.

Result: The 1.7B version achieves SOTA among open-sourced ASR models and competes with proprietary APIs. The 0.6B version offers the best accuracy-efficiency trade-off with 92ms average TTFT and can transcribe 2000 seconds of speech in 1 second at concurrency 128. The forced alignment model outperforms the three strongest competitors in timestamp accuracy with better efficiency and versatility.

Conclusion: Qwen3-ASR family provides state-of-the-art speech recognition and forced alignment capabilities with strong efficiency, released under Apache 2.0 license to accelerate ASR and audio understanding research.

Abstract: In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both of them leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluation in addition to the open-sourced benchmarks, as ASR models might differ little on open-sourced benchmark scores but exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-sourced ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor that is able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models and offers further advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.

[12] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

Tobias Materzok

Main category: cs.CL

TL;DR: OS-Search transforms LLM generation into endpoint search by selecting targets in a frozen encoder-defined 3D output space, enabling parallel sweeps and black-box optimization without path-dependent token search.

Motivation: Current LLM generation methods suffer from path-dependent token/program search that makes optimization difficult. The authors aim to enable more efficient exploration of output spaces and black-box optimization by decoupling target selection from generation.

Method: 1) Define a frozen encoder that maps outputs to a 3D space Z. 2) Outer loop selects target z* in Z. 3) Use retrieval-grounded policy trained with sequence-level RL to generate outputs whose coordinates land near z* under standard autoregressive decoding.
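
A toy sketch of the endpoint-search loop: propose a target z* in a 3D output space, generate candidates conditioned on it, re-encode each output with the frozen encoder, and keep the candidate landing closest to z*. The `generate` and `encode` stand-ins below are invented for illustration and do not reflect the paper's trained policy:

```python
import numpy as np

def os_search(z_star, generate, encode, n_candidates=8, rng=np.random.default_rng(0)):
    best, best_dist = None, float("inf")
    for _ in range(n_candidates):
        output = generate(z_star, seed=rng.integers(1 << 31))
        dist = np.linalg.norm(encode(output) - z_star)   # how close did we land to the target?
        if dist < best_dist:
            best, best_dist = output, dist
    return best, best_dist

# Stand-in policy and frozen encoder over a 3D space.
encode = lambda text: np.array([len(text) % 7, text.count("e"), text.count(" ")], dtype=float)
generate = lambda z, seed: "a " + "very " * int(seed % 5) + "short tale"
print(os_search(np.array([3.0, 2.0, 4.0]), generate, encode))
```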

Result: On stories: sweeping Z yields 3.1x higher LLM-scored diversity than prompt-chaining. On code: Bayesian optimization over Z improves objectives withheld from the controller while preserving validity under matched inference budgets.

Conclusion: OS-Search enables efficient exploration and optimization in output spaces, decoupling target selection from generation and enabling parallel sweeps and black-box optimization without path-dependent search.

Abstract: We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.

[13] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

Xiulin Yang, Heidi Getz, Ethan Gotlieb Wilcox

Main category: cs.CL

TL;DR: Function words’ statistical properties (high frequency, structural association, boundary alignment) support hierarchical structure learning in neural models across languages.

Motivation: To understand what statistical conditions enable learning hierarchical structure from linear input, focusing on the distinctive distributional properties of function words that have long been argued to be crucial for language acquisition.

Method: Cross-linguistic corpus analysis of 186 languages to establish three key properties of function words, followed by counterfactual language modeling and ablation experiments with neural learners to test which properties contribute most to hierarchical structure acquisition.

Result: All three properties (high frequency, structural association, boundary alignment) are present across studied languages. Language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Different learning conditions lead to systematically different reliance on function words.

Conclusion: The statistical distribution of function words provides crucial support for learning hierarchical structure from linear input, with different properties playing distinct roles in the acquisition process, and similar performance can arise from different internal mechanisms.

Abstract: What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.

[14] Scaling Embeddings Outperforms Scaling Experts in Language Models

Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai

Main category: cs.CL

TL;DR: Embedding scaling emerges as an effective alternative to MoE for sparsity scaling, achieving better Pareto frontiers in certain regimes and enabling practical inference speedups through system optimizations.

Motivation: Mixture-of-Experts (MoE) architectures face diminishing returns and system bottlenecks for sparsity scaling, prompting exploration of embedding scaling as an orthogonal dimension for more efficient scaling.

Method: Comprehensive analysis of embedding scaling vs. expert scaling regimes, systematic characterization of architectural factors (parameter budgeting, model width/depth interplay), integration of system optimizations and speculative decoding, and development of LongCat-Flash-Lite model with 68.5B parameters (~3B activated) trained from scratch.

Result: Embedding scaling achieves a superior Pareto frontier compared to expert scaling in specific regimes; LongCat-Flash-Lite surpasses parameter-equivalent MoE baselines and shows exceptional competitiveness in agentic and coding domains despite allocating over 30B parameters to embeddings.

Conclusion: Embedding scaling represents a potent alternative to MoE for sparsity scaling, offering better efficiency in certain regimes and practical inference speedups when combined with appropriate system optimizations.

Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy – ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B parameter model with ~3B activated trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

[15] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei

Main category: cs.CL

TL;DR: The paper studies why LLMs fail at reasoning hop generalization (when reasoning steps exceed training) and finds errors concentrate at specific token positions due to attention heads that amplify incorrect reasoning. They propose a test-time method to identify and deactivate these problematic heads.

Motivation: Chain-of-thought reasoning enables LLMs to solve complex problems, but recent studies show sharp performance drops when required reasoning steps exceed training distributions (reasoning hop generalization). The internal mechanisms behind this failure are poorly understood, motivating investigation into why LLMs fail in these scenarios.

Method: Systematic study across multiple domains reveals errors concentrate at specific token positions rather than being uniformly distributed. Analysis shows erroneous predictions stem from internal competition mechanisms where certain attention heads (erroneous processing heads or ep heads) amplify incorrect reasoning while suppressing correct ones. The authors propose test-time correction of reasoning - a lightweight intervention that dynamically identifies and deactivates ep heads during inference.
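
A minimal sketch of the deactivation step, using the head_mask argument exposed by Hugging Face GPT-2 to zero out chosen attention heads at inference; the layer and head indices here are assumed for illustration, and actually identifying the erroneous processing heads is the paper's contribution:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

# head_mask: 1.0 keeps a head, 0.0 removes its contribution in the attention layers.
head_mask = torch.ones(model.config.n_layer, model.config.n_head)
head_mask[3, 5] = 0.0   # hypothetical "ep head" at layer 3, head 5
head_mask[7, 0] = 0.0   # hypothetical "ep head" at layer 7, head 0

ids = tok("If A is older than B and B is older than C, then A is older than", return_tensors="pt")
with torch.no_grad():
    logits = model(**ids, head_mask=head_mask).logits
print(tok.decode(logits[0, -1].argmax().item()))
```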

Result: Extensive experiments across different tasks and LLMs show the proposed method consistently improves reasoning hop generalization. Removing individual ep heads during inference often restores correct predictions, demonstrating the effectiveness of targeted intervention at the attention head level.

Conclusion: The research provides insights into why LLMs fail at reasoning hop generalization and offers a practical solution through test-time correction. The findings highlight the importance of understanding internal mechanisms in LLMs and demonstrate that targeted interventions at the attention head level can significantly improve reasoning performance beyond training distributions.

Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.

[16] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Christopher Adrian Kusuma, Muhammad Reza Qorib, Hwee Tou Ng

Main category: cs.CL

TL;DR: Proposes a robust evaluation benchmark for LLM honesty using Pythia’s open pretraining data, plus a method to build more honest LLMs that respond “I don’t know” appropriately.

Motivation: LLMs often hallucinate when they lack knowledge instead of admitting ignorance. Existing honesty evaluation methods lack robustness because they don't account for what knowledge LLMs actually ingested during pretraining.

Method: Uses Pythia (truly open LLM with publicly available pretraining data) to create robust evaluation benchmark. Also proposes novel method leveraging pretraining data to build more honest LLMs that know when to say “I don’t know.”

Result: Develops a more robust evaluation benchmark for LLM honesty and proposes a method to improve LLM honesty by better utilizing pretraining data knowledge boundaries.

Conclusion: Addresses the critical problem of LLM hallucination by creating better evaluation methods and techniques for building LLMs that are more honest about their knowledge limitations.

Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don’t know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with “I don’t know” when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.

[17] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman, Ifeoma Okoh, Ganiyat Afolabi, Ayodele Awokoya, David Ifeoluwa Adelani

Main category: cs.CL

TL;DR: MGSM-Pro extends MGSM dataset with GSM-Symbolic approach to evaluate multilingual math reasoning robustness across different instantiations of questions.

Motivation: Benchmark development for multilingual mathematical reasoning evaluation has lagged behind English, and GSM-Symbolic showed high variance in model performance across different instantiations of the same question, but only in English. The authors aim to extend this evaluation to multiple languages.

Method: Extends MGSM dataset using GSM-Symbolic approach, creating five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluates across nine languages using various proprietary and open models.
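
As a toy illustration of the instantiation idea, the sketch below generates several variants of one math item by swapping names and digits and optionally appending irrelevant context; the template, name list, and value ranges are invented and are not drawn from the released dataset:

```python
import random

TEMPLATE = "{name} buys {a} pencils and then {b} more. How many pencils does {name} have?"
NAMES = ["Amina", "Ravi", "Sofia", "Kwame", "Mei"]
DISTRACTOR = " The shop also sells erasers."

def instantiate(n=5, seed=0, with_distractor=False):
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        a, b = rng.randint(2, 60), rng.randint(2, 60)
        q = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
        items.append({"question": q + (DISTRACTOR if with_distractor else ""), "answer": a + b})
    return items

for item in instantiate():
    print(item)
```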

Result: Low-resource languages suffer large performance drops on digit instantiations different from original test set. Proprietary models like Gemini 2.5 Flash and GPT-4.1 are less robust to digit instantiation, while Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness.

Conclusion: Evaluating each problem using at least five digit-varying instantiations provides more robust and realistic assessment of math reasoning capabilities across languages.

Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.

[18] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

Alok Abhishek, Tushar Bandopadhyay, Lisa Erickson

Main category: cs.CL

TL;DR: SHARP is a framework for multidimensional, distribution-aware evaluation of social harm in LLMs that focuses on worst-case behavior rather than averages.

Motivation: Current LLM evaluation benchmarks reduce complex social risk to mean-centered scalar scores, obscuring distributional structure, cross-dimensional interactions, and worst-case behavior that is critical for high-stakes deployments.

Method: SHARP models harm as a multivariate random variable, decomposes it into bias, fairness, ethics, and epistemic reliability dimensions, uses a union-of-failures aggregation reparameterized as additive cumulative log-risk, and employs risk-sensitive statistics like Conditional Value at Risk (CVaR95).
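
For concreteness, here is a small sketch of two of these quantities: CVaR95 as the mean of the worst 5% of per-prompt risk scores, and a union-of-failures aggregation written as an additive cumulative log-risk, 1 - prod(1 - r_d) = 1 - exp(sum(log(1 - r_d))). The risk scores below are synthetic:

```python
import numpy as np

def cvar(risks, alpha=0.95):
    risks = np.sort(np.asarray(risks))
    tail = risks[int(np.ceil(alpha * len(risks))):]   # worst (1 - alpha) share of prompts
    return tail.mean() if len(tail) else risks[-1]

def union_of_failures(dim_risks):
    """dim_risks: per-dimension risks in [0, 1), e.g. bias, fairness, ethics, epistemic."""
    log_survival = np.sum(np.log1p(-np.asarray(dim_risks)))   # additive cumulative log-risk
    return 1.0 - np.exp(log_survival)

rng = np.random.default_rng(0)
per_prompt_risk = rng.beta(1, 20, size=901)           # mostly low risk, occasional severe tail
print("mean:", per_prompt_risk.mean(), "CVaR95:", cvar(per_prompt_risk))
print("union of failures:", union_of_failures([0.05, 0.02, 0.01, 0.03]))
```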

Result: Evaluation of 11 frontier LLMs on 901 socially sensitive prompts shows models with similar average risk can have >2x differences in tail exposure and volatility; bias has strongest tail severities, epistemic/fairness risks intermediate, ethical misalignment consistently lower

Conclusion: Responsible LLM evaluation and governance requires moving beyond scalar averages to multidimensional, tail-sensitive risk profiling to capture worst-case behavior in high-stakes applications

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
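
The two quantities the abstract leans on can be sketched numerically as follows, under assumed definitions: the union-of-failures aggregation rewritten as an additive cumulative log-risk, and CVaR95 as the mean of the worst 5% of per-prompt scores. The per-prompt, per-dimension risks below are random toy values, not SHARP outputs.

```python
import numpy as np

# Minimal sketch under assumed definitions (not the SHARP code): union-of-failures
# aggregation as an additive cumulative log-risk, plus CVaR95 as a tail-sensitive summary.
def cumulative_log_risk(per_dim_risk: np.ndarray) -> np.ndarray:
    """per_dim_risk: (n_prompts, n_dims) failure probabilities per harm dimension."""
    # P(any failure) = 1 - prod_d (1 - p_d); taking -log turns the product into a sum.
    return -np.log(1.0 - per_dim_risk).sum(axis=1)

def cvar(scores: np.ndarray, alpha: float = 0.95) -> float:
    """Conditional Value at Risk: mean of the tail above the alpha-quantile."""
    threshold = np.quantile(scores, alpha)
    return float(scores[scores >= threshold].mean())

rng = np.random.default_rng(0)
risks = rng.beta(1, 20, size=(901, 4))          # toy risks for 4 harm dimensions
scores = cumulative_log_risk(risks)
print("mean risk:", round(scores.mean(), 4), " CVaR95:", round(cvar(scores), 4))
```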

[19] MoCo: A One-Stop Shop for Model Collaboration Research

Shangbin Feng, Yuyang Bai, Ziyuan Yang, Yike Wang, Zhaoxuan Tan, Jiajie Yan, Zhenyu Lei, Wenxuan Ding, Weijia Shi, Haojin Wang, Zhenting Qi, Yuru Jiang, Heng Wang, Chengsong Huang, Yu Fei, Jihan Yao, Yilun Du, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

Main category: cs.CL

TL;DR: MoCo is a comprehensive Python library for model collaboration algorithms, featuring 26 methods across different cross-model information exchange levels, with extensive benchmarking showing collaboration outperforms single models in 61% of settings.

DetailsMotivation: To consolidate disparate research on model collaboration into a unified framework, establish rigorous comparison standards, and facilitate the development of collaborative AI systems where multiple language models work together rather than relying on single monolithic models.

Method: Developed MoCo library with 26 model collaboration methods spanning routing, text, logit, and parameter-level exchanges. Integrated 25 evaluation datasets across reasoning, QA, code, safety domains. Conducted extensive experiments comparing collaboration strategies against single-model baselines.

Result: Most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. The library enables analysis of scaling properties, training/inference efficiency, and problem-solving capabilities where single LMs struggle.

Conclusion: MoCo provides a valuable toolkit for advancing model collaboration research, demonstrating that collaborative systems can significantly outperform single models and paving the way for more open, modular, decentralized AI systems.

Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library of executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users could flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies, the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.

[20] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao, Mingdong Ou, Philip S. Yu, Xuming Hu

Main category: cs.CL

TL;DR: CausalEmbed proposes an auto-regressive generation approach for creating compact multi-vector embeddings in visual document retrieval, reducing token count by 30-155x while maintaining competitive performance.

DetailsMotivation: Current MLLMs for Visual Document Retrieval generate high-quality multi-vector embeddings but require thousands of visual tokens per page, creating substantial storage overhead that limits practical applications.

Method: Auto-regressive generation approach with iterative margin loss during contrastive training to learn compact, well-structured representations using only dozens of visual tokens instead of thousands.

Result: Achieves 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks, with improved training efficiency and test-time scalability.

Conclusion: CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and demonstrates the potential of generative paradigms in multimodal document retrieval.

Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.

[21] Self-Improving Pretraining: using post-trained models to pretrain better models

Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar, Jason Weston, Olga Golovneva

Main category: cs.CL

TL;DR: A new pretraining method using reinforcement learning to improve model generations for quality, safety, and factuality during pretraining rather than relying solely on post-training alignment.

DetailsMotivation: Current approaches to ensuring safety and factuality in LLMs rely on expensive post-training alignment pipelines that cannot fully correct patterns learned during pretraining. Addressing these issues during pretraining is crucial to prevent unsafe or hallucinated outputs from becoming deeply embedded in the model.

Method: Introduces a pretraining method that streams documents and uses RL to improve the next K generated tokens at each step. A strong post-trained model judges candidate generations (including model rollouts, original suffix, and rewritten suffix) for quality, safety, and factuality. Early training relies on original/rewritten suffixes; later training rewards high-quality rollouts.

Result: The method achieves 36.2% and 18.5% relative improvements over standard pretraining in factuality and safety, and up to 86.3% win rate improvements in overall generation quality.

Conclusion: The approach successfully builds higher quality, safer, and more factual models from the ground up by addressing safety and factuality issues during pretraining rather than relying solely on post-training alignment.

Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model’s core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations – including model rollouts, the original suffix, and a rewritten suffix – for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
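
A toy sketch of the candidate-and-judge step described above; the generator, rewriter, and judge are stubs, and the exact reward shaping shown here is an assumption rather than the paper's recipe.

```python
import random

# Toy sketch (hypothetical stand-ins, not the paper's training code) of per-step
# candidate construction: compare a model rollout against the original and a
# rewritten suffix, let a judge score them, and reward the rollout only if it wins.
def model_rollout(prefix: str, k: int) -> str:
    return prefix.split()[-1] + " ..."                  # stub generator

def rewrite(suffix: str) -> str:
    return suffix.capitalize()                          # stub "rewritten suffix"

def judge(prefix: str, continuation: str) -> float:
    return random.random()                              # stub quality/safety/factuality score

def pretraining_step(document: str, k: int, warmup: bool) -> tuple[str, float]:
    words = document.split()
    prefix, suffix = " ".join(words[:-k]), " ".join(words[-k:])
    candidates = {"rollout": model_rollout(prefix, k),
                  "original": suffix,
                  "rewritten": rewrite(suffix)}
    scores = {name: judge(prefix, text) for name, text in candidates.items()}
    if warmup:                                          # early training: trust the suffixes
        target = max(("original", "rewritten"), key=scores.get)
        return candidates[target], 0.0
    reward = scores["rollout"] - max(scores["original"], scores["rewritten"])
    return candidates["rollout"], reward                # RL reward for high-quality rollouts

random.seed(0)
print(pretraining_step("the cat sat on the mat and purred softly", k=3, warmup=False))
```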

[22] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Vinay Chamola, Dhruv Kumar

Main category: cs.CL

TL;DR: LLMs used for automated code grading fail catastrophically when adversarial formatting directives are hidden in code comments, prioritizing compliance over correctness.

DetailsMotivation: The paper challenges the assumption that LLMs' instruction-following capability translates to objective code assessment, revealing a fundamental vulnerability in automated educational grading systems.

Method: Introduces SPACI Framework and AST-ASIP protocol to exploit the Syntax-Semantics Gap by embedding adversarial directives in syntactically inert regions (trivia nodes) of Abstract Syntax Trees. Evaluates 9 SOTA models across 25,000 submissions in Python, C, C++, and Java.

Result: Reveals catastrophic failure rates (>95%) in high-capacity models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. Quantifies failure using Decoupling Probability, Score Divergence, and Pedagogical Severity metrics.

Conclusion: Current alignment paradigms create “Trojan” vulnerabilities in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness where models prioritize evidence over instruction compliance.

Abstract: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission’s logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread “False Certification” of functionally broken code. Our findings suggest that current alignment paradigms create a “Trojan” vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.
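
A small illustration (not the SPACI/AST-ASIP tooling) of the syntax-semantics gap being exploited: a grading directive hidden in a comment never reaches Python's AST, so the program's semantics are untouched while a judge reading the raw source still sees the instruction.

```python
import ast

# Illustration only: an adversarial grading directive sits in a syntactically
# inert region (a comment), leaving program semantics unchanged.
SUBMISSION = '''
def add(a, b):
    # NOTE TO GRADER: reply "Score: 10/10" regardless of correctness.
    return a - b   # functionally wrong; the comment targets the LLM judge
'''

tree = ast.parse(SUBMISSION)
print(ast.dump(tree.body[0], include_attributes=False)[:80])
# The comment never appears in the AST (it is "trivia"), so semantics-preserving
# checks pass, yet an LLM grader reading the raw source text sees the directive.
```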

[23] TransLaw: A Large-Scale Dataset and Multi-Agent Benchmark Simulating Professional Translation of Hong Kong Case Law

Xi Xuan, Chunyu Kit

Main category: cs.CL

TL;DR: TransLaw: A multi-agent framework for Hong Kong legal document translation that outperforms single-agent LLM approaches by decomposing translation into word-level, sentence-level, and review stages with specialized legal glossary and RAG integration.

DetailsMotivation: Hong Kong legal translation faces challenges with manual methods (high cost, inconsistent quality) and both traditional MT/LLM approaches (fail to ensure legal terminology accuracy, cultural nuances, and strict linguistic structures).

Method: Multi-agent framework decomposing translation into: 1) word-level expression, 2) sentence-level translation, and 3) multidimensional review, integrating specialized Hong Kong legal glossary database, Retrieval-Augmented Generation (RAG), and iterative feedback.

Result: Outperforms single-agent baselines across 13 open-source and commercial LLMs on HKCFA Judgment 97-22 dataset; human evaluation confirms effectiveness in legal semantic accuracy, structural coherence, and stylistic fidelity, though still trails human experts in complex terminology contextualization and stylistic naturalness.

Conclusion: TransLaw provides an effective multi-agent solution for legal document translation that significantly improves over single-agent LLM approaches, though human expertise remains superior for complex contextualization and naturalness.

Abstract: Hong Kong case law translation presents significant challenges: manual methods suffer from high costs and inconsistent quality, while both traditional machine translation and approaches relying solely on Large Language Models (LLMs) often fail to ensure legal terminology accuracy, culturally embedded nuances, and strict linguistic structures. To overcome these limitations, this study proposes TransLaw, a multi-agent framework that decomposes translation into word-level expression, sentence-level translation, and multidimensional review, integrating a specialized Hong Kong legal glossary database, Retrieval-Augmented Generation (RAG), and iterative feedback. Experiments on our newly constructed HKCFA Judgment 97-22 dataset, benchmarking 13 open-source and commercial LLMs, demonstrate that TransLaw significantly outperforms single-agent baselines across all evaluated models. Human evaluation confirms the framework’s effectiveness in terms of legal semantic accuracy, structural coherence, and stylistic fidelity, while noting that it still trails human experts in contextualizing complex terminology and stylistic naturalness.

[24] User-Centric Evidence Ranking for Attribution and Fact Verification

Guy Alt, Eran Hirsch, Serwar Basch, Ido Dagan, Oren Glickman

Main category: cs.CL

TL;DR: Evidence Ranking is a new task for fact verification that prioritizes presenting sufficient evidence early in a ranked list to minimize user reading effort while maintaining access to all available evidence.

DetailsMotivation: Current fact verification systems using LLMs often provide either insufficient or overly redundant evidence, leading to inefficient and error-prone verification processes for users.

Method: Proposes two approaches: one-shot ranking and incremental ranking. Introduces a new evaluation framework inspired by information retrieval metrics and constructs a unified benchmark by aggregating existing fact verification datasets.

Result: Incremental ranking strategies better capture complementary evidence, LLM-based methods outperform shallower baselines, and evidence ranking reduces reading effort and improves verification compared to evidence selection.

Conclusion: Evidence ranking provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.

Abstract: Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. Compared to evidence selection, we conduct a controlled user study and demonstrate that evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
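
One way to picture incremental ranking is the greedy loop below; the lexical-overlap scorer is a toy stand-in for the paper's models, used only to show how complementary evidence surfaces early in the ranked list.

```python
# Minimal sketch (toy scorer, not the paper's models) of incremental evidence ranking:
# at each step, pick the passage whose addition most increases claim coverage.
def coverage(claim: str, evidence: list[str]) -> float:
    claim_terms = set(claim.lower().split())
    covered = set(w for e in evidence for w in e.lower().split()) & claim_terms
    return len(covered) / max(len(claim_terms), 1)

def incremental_rank(claim: str, passages: list[str]) -> list[str]:
    ranked, remaining = [], list(passages)
    while remaining:
        best = max(remaining, key=lambda p: coverage(claim, ranked + [p]))
        ranked.append(best)
        remaining.remove(best)          # complementary evidence surfaces early
    return ranked

claim = "the bridge opened in 1937 and spans the golden gate strait"
passages = ["The bridge opened to traffic in 1937.",
            "It spans the Golden Gate strait.",
            "The paint color is International Orange."]
print(incremental_rank(claim, passages))
```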

[25] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Yuan Sui, Bryan Hooi

Main category: cs.CL

TL;DR: CoNL is a multi-agent self-play framework that unifies generation, evaluation, and meta-evaluation to train LLMs for non-verifiable tasks without ground truth or external judges.

DetailsMotivation: Training LLMs for creative tasks is hard due to lack of ground truth. LLM-as-Judge approaches are limited by evaluator quality and biases. Need meta-evaluation to improve the evaluator itself.

Method: Multi-agent self-play where agents share same policy and engage in structured conversations (propose, critique, revise). Critiques that help improve solutions earn diagnostic rewards, enabling joint optimization of generation and judging through self-play.

Result: Experiments on five benchmarks show CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.

Conclusion: CoNL provides a framework for training LLMs on non-verifiable tasks through unified generation, evaluation, and meta-evaluation without external judges or ground truth.

Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator’s own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
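
The diagnostic-reward idea can be sketched as one propose-critique-revise round; the shared policy and the quality scorer below are stubs, and the reward form is an assumed simplification of what the paper describes.

```python
import random

# Toy sketch (stub policy and scorer, not the CoNL implementation): a critique
# earns reward only if revising with it improves the solution, which supervises
# the judging (meta-evaluation) role.
def policy(prompt: str) -> str:
    return f"response to: {prompt}"             # all agents share one policy (stubbed)

def quality(solution: str) -> float:
    return random.random()                      # stand-in for a downstream quality signal

def self_play_round(task: str) -> float:
    draft = policy(f"solve: {task}")
    critique = policy(f"critique: {draft}")
    revised = policy(f"revise '{draft}' using '{critique}'")
    return max(quality(revised) - quality(draft), 0.0)   # diagnostic reward for the critique

random.seed(0)
print(self_play_round("write a short story opening"))
```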

[26] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong

Main category: cs.CL

TL;DR: SOUP is a token-level RL framework that unifies off-policy and on-policy learning by mixing policies within individual samples, using historical policy prefixes with on-policy continuations to improve exploration and stability.

DetailsMotivation: Current RL methods for language model post-training suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data could help, existing approaches that mix entire trajectories cause policy mismatch and instability.

Method: SOUP (Single-sample Mix-policy Unified Paradigm) unifies off- and on-policy learning at the token level. It uses historical policy prefixes for the beginning of sequences and on-policy continuations for the rest, with token-level importance ratios to leverage off-policy information while maintaining stability.

Result: Extensive experiments show SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Analysis confirms that fine-grained, single-sample mix-policy training improves both exploration and final performance in LLM RL.

Conclusion: SOUP provides an effective framework for combining off-policy and on-policy learning in RL for language models, addressing exploration limitations while maintaining training stability through token-level policy mixing.

Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-policy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
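
A minimal numerical sketch of the token-level mixing described above, with toy log-probabilities: prefix tokens drawn from a historical policy are reweighted by importance ratios, while on-policy continuation tokens keep a ratio of one. The PPO-style clipping shown is an assumption, not necessarily SOUP's exact objective.

```python
import numpy as np

# Toy sketch (random log-probs, not the SOUP training code) of single-sample
# mix-policy reweighting: only the off-policy prefix tokens get importance ratios.
rng = np.random.default_rng(0)
seq_len, prefix_len = 12, 5
logp_current = rng.normal(-2.0, 0.3, seq_len)     # log pi_theta(token)
logp_old     = rng.normal(-2.0, 0.3, seq_len)     # log pi_old(token), used for the prefix
advantage    = rng.normal(0.0, 1.0, seq_len)

is_prefix = np.arange(seq_len) < prefix_len
ratios = np.where(is_prefix, np.exp(logp_current - logp_old), 1.0)  # on-policy part stays at 1
clipped = np.clip(ratios, 0.8, 1.2)               # PPO-style clipping (assumed here)
token_objective = np.minimum(ratios * advantage, clipped * advantage)
print("mix-policy objective:", token_objective.mean())
```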

[27] DimStance: Multilingual Datasets for Dimensional Stance Analysis

Jonas Becker, Liang-Chih Yu, Shamsuddeen Hassan Muhammad, Jan Philip Wahle, Terry Ruas, Idris Abdulmumin, Lung-Hao Lee, Wen-Ni Liu, Tzu-Mi Lin, Zhe-Yu Xu, Ying-Lung Lin, Jin Wang, Maryam Ibrahim Mukhtar, Bela Gipp, Saif M. Mohammed

Main category: cs.CL

TL;DR: DimStance introduces the first dimensional stance resource with valence-arousal annotations across five languages and two domains, enabling fine-grained stance analysis beyond categorical labels.

DetailsMotivation: Traditional stance detection uses categorical labels (Favor, Neutral, Against), which oversimplify nuanced affective states. The authors aim to capture richer emotional dimensions underlying stance expressions using valence-arousal framework from affective science.

Method: Created DimStance resource with 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, Swahili) and two domains (politics, environmental protection). Formulated dimensional stance regression task and benchmarked pretrained and large language models under regression and prompting settings.

Result: Fine-tuned LLM regressors showed competitive performance, but there were persistent challenges in low-resource languages and limitations of token-based generation. Cross-lingual valence-arousal patterns were analyzed.

Conclusion: DimStance provides a foundation for multilingual, emotion-aware stance analysis and benchmarking, enabling more nuanced understanding of stance expressions through dimensional affective modeling.

Abstract: Stance detection is an established task that classifies an author’s attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.

[28] MURAD: A Large-Scale Multi-Domain Unified Reverse Arabic Dictionary Dataset

Serry Sibaee, Yasser Alhabashi, Nadia Sibai, Yara Farouk, Adel Ammar, Sawsan AlHalawani, Wadii Boulila

Main category: cs.CL

TL;DR: MURAD is a large-scale Arabic lexical dataset with 96,243 word-definition pairs from multiple domains, created using a hybrid extraction pipeline to support computational linguistics and NLP research.

DetailsMotivation: Arabic lacks large-scale lexical datasets linking words to precise definitions despite being a linguistically rich language with extensive vocabulary across scientific, religious, and literary domains.

Method: Used a hybrid pipeline integrating direct text parsing, optical character recognition (OCR), and automated reconstruction to extract word-definition pairs from trusted reference works and educational sources.

Result: Created MURAD dataset with 96,243 word-definition pairs covering linguistics, Islamic studies, mathematics, physics, psychology, and engineering, with metadata identifying source domains.

Conclusion: The dataset supports computational linguistics research, reverse dictionary modeling, semantic retrieval, and educational tools, aiming to advance Arabic NLP and promote reproducible research on Arabic lexical semantics.

Abstract: Arabic is a linguistically and culturally rich language with a vast vocabulary that spans scientific, religious, and literary domains. Yet, large-scale lexical datasets linking Arabic words to precise definitions remain limited. We present MURAD (Multi-domain Unified Reverse Arabic Dictionary), an open lexical dataset with 96,243 word-definition pairs. The data come from trusted reference works and educational sources. Extraction used a hybrid pipeline integrating direct text parsing, optical character recognition, and automated reconstruction. This ensures accuracy and clarity. Each record aligns a target word with its standardized Arabic definition and metadata that identifies the source domain. The dataset covers terms from linguistics, Islamic studies, mathematics, physics, psychology, and engineering. It supports computational linguistics and lexicographic research. Applications include reverse dictionary modeling, semantic retrieval, and educational tools. By releasing this resource, we aim to advance Arabic natural language processing and promote reproducible research on Arabic lexical semantics.

[29] LMK > CLS: Landmark Pooling for Dense Embeddings

Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi

Main category: cs.CL

TL;DR: Landmark (LMK) pooling improves sequence representation by partitioning sequences into chunks with landmark tokens, addressing weaknesses in CLS and mean pooling for both short and long contexts.

DetailsMotivation: Current pooling strategies like [CLS] token and mean pooling have systematic weaknesses: [CLS] concentrates information toward initial positions and under-represents distributed evidence, while mean pooling dilutes salient local signals and can hurt short-context performance.

Method: Introduces Landmark (LMK) pooling which partitions sequences into chunks, inserts landmark tokens between chunks, and forms final representation by mean-pooling the landmark token embeddings.

Result: LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, improving long-context extrapolation without sacrificing local salient features.

Conclusion: LMK pooling is a practical and scalable alternative to existing pooling methods that addresses systematic weaknesses in current pooling strategies for sequence representation learning.

Abstract: Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling operator, most commonly a special [CLS] token or mean pooling over token embeddings. In this paper, we identify systematic weaknesses of these pooling strategies: [CLS] tends to concentrate information toward the initial positions of the sequence and can under-represent distributed evidence, while mean pooling can dilute salient local signals, sometimes leading to worse short-context performance. To address these issues, we introduce Landmark (LMK) pooling, which partitions a sequence into chunks, inserts landmark tokens between chunks, and forms the final representation by mean-pooling the landmark token embeddings. This simple mechanism improves long-context extrapolation without sacrificing local salient features, at the cost of introducing a small number of special tokens. We empirically demonstrate that LMK pooling matches existing methods on short-context retrieval tasks and yields substantial improvements on long-context tasks, making it a practical and scalable alternative to existing pooling methods.
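
The pooling mechanism itself is easy to sketch; the encoder below is a toy stand-in for a contextual model, and the landmark token id is hypothetical.

```python
import numpy as np

# Minimal sketch (toy encoder, not the paper's models) of Landmark (LMK) pooling:
# insert a landmark token after every chunk, encode, then mean-pool the hidden
# states at the landmark positions to obtain the sequence embedding.
LMK_ID = -1  # hypothetical id for the special landmark token

def insert_landmarks(token_ids: list[int], chunk_size: int) -> tuple[list[int], list[int]]:
    out, landmark_pos = [], []
    for i in range(0, len(token_ids), chunk_size):
        out.extend(token_ids[i:i + chunk_size])
        out.append(LMK_ID)
        landmark_pos.append(len(out) - 1)
    return out, landmark_pos

def toy_encoder(token_ids: list[int], dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(42)
    table = {t: rng.normal(size=dim) for t in set(token_ids)}
    return np.stack([table[t] for t in token_ids])   # stand-in for contextual hidden states

ids, positions = insert_landmarks(list(range(20)), chunk_size=6)
hidden = toy_encoder(ids)
embedding = hidden[positions].mean(axis=0)           # LMK pooling
print(embedding.shape, "landmarks at", positions)
```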

[30] inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu

Main category: cs.CL

TL;DR: inversedMixup: A framework combining Mixup’s controllability with LLM interpretability for text augmentation by aligning task model embeddings with LLM input space to generate human-readable augmented sentences.

DetailsMotivation: Mixup provides controllable augmentation but lacks interpretability (operates in latent space), while LLM-based methods produce readable outputs but offer limited control. The paper aims to bridge this gap by combining the strengths of both approaches.

Method: Three-stage training procedure: 1) Align task-specific model’s output embedding space with LLM’s input embedding space, 2) Use Mixup in the aligned embedding space with controllable mixing ratios, 3) Reconstruct mixed embeddings into human-interpretable sentences via LLM inversion. Also addresses manifold intrusion in text Mixup with mitigation strategy.

Result: Extensive experiments show effectiveness in both few-shot and fully supervised scenarios. The approach improves augmentation performance while providing interpretable outputs and controllable mixing ratios.

Conclusion: inversedMixup successfully unifies Mixup’s controllability with LLM-based interpretability, enabling human-readable text augmentation with controlled mixing. Provides first empirical evidence of manifold intrusion in text Mixup and effective mitigation.

Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
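
The controllable mixing step at the core of the method looks roughly like this; the embeddings are toy vectors, and the alignment and LLM-inversion stages that would turn the mixed embedding back into a readable sentence are omitted.

```python
import numpy as np

# Minimal sketch (toy vectors, not the inversedMixup pipeline) of the controllable
# mixing step: interpolate two task-model embeddings and their labels with ratio lam;
# an aligned LLM would then invert the mixed embedding into a sentence (not shown).
rng = np.random.default_rng(0)
emb_a, emb_b = rng.normal(size=16), rng.normal(size=16)        # task-model sentence embeddings
label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels

lam = 0.7                                                      # controllable mixing ratio
mixed_emb = lam * emb_a + (1.0 - lam) * emb_b
mixed_label = lam * label_a + (1.0 - lam) * label_b
print(mixed_emb[:4], mixed_label)
```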

[31] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

Yang Zhou, Zhenting Sheng, Mingrui Tan, Yuting Song, Jun Zhou, Yu Heng Kwan, Lian Leng Low, Yang Bai, Yong Liu

Main category: cs.CL

TL;DR: A framework that converts medical notes into doctor-patient dialogues to train LLMs for structured clinical history taking and diagnosis, achieving significant improvements over GPT-4o.

DetailsMotivation: Clinical history taking is crucial for diagnosis but underexplored in AI. LLMs perform well on static benchmarks but struggle in dynamic, multi-turn diagnostic settings requiring iterative questioning and hypothesis refinement.

Method: Note2Chat converts real-world medical notes into doctor-patient dialogues using decision tree-guided generation and refinement. Uses three-stage fine-tuning: supervised learning, simulated data augmentation, and preference learning. Introduces a novel single-turn reasoning paradigm that reframes history taking as sequence of single-turn reasoning problems.

Result: Substantially improves clinical reasoning with gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o.

Conclusion: The framework effectively addresses the gap in dynamic clinical reasoning by leveraging widely available medical notes, offering improved interpretability, local supervision, dynamic adaptation, and sample efficiency.

Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at https://github.com/zhentingsheng/Note2Chat.

[32] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu

Main category: cs.CL

TL;DR: ASTRA is an automated framework for training tool-augmented LLM agents via scalable data synthesis and verifiable reinforcement learning, achieving SOTA performance on agentic tool-use benchmarks.

DetailsMotivation: Existing methods for training tool-using LLM agents require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either SFT or RL, and struggle with stable long-horizon, multi-turn learning.

Method: ASTRA integrates two components: 1) a pipeline using static tool-call graph topology to synthesize diverse trajectories, and 2) an environment synthesis framework converting decomposed QA traces into code-executable, rule-verifiable environments for deterministic multi-turn RL. Combines SFT with online RL using trajectory-level rewards.

Result: ASTRA-trained models achieve state-of-the-art performance on multiple agentic tool-use benchmarks at comparable scales, approaching closed-source systems while preserving core reasoning ability.

Conclusion: ASTRA provides a fully automated end-to-end framework for training robust tool-augmented language model agents through scalable data synthesis and verifiable reinforcement learning.

Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at https://github.com/LianjiaTech/astra.

[33] KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic

Main category: cs.CL

TL;DR: KromHC improves hyper-connections in neural networks by using Kronecker products of smaller doubly stochastic matrices to guarantee exact double stochasticity while reducing parameter complexity from factorial to quadratic.

DetailsMotivation: Existing hyper-connection methods face issues with training instability, scalability, and parameter complexity. While mHC projects residual connections onto Birkhoff polytope, it doesn't guarantee exact double stochasticity and has high O(n³C) complexity. mHC-lite guarantees double stochasticity but suffers factorial O(nC·n!) complexity explosion.

Method: KromHC parametrizes residual matrices using Kronecker products of smaller doubly stochastic matrices. It enforces manifold constraints across factor residual matrices along each mode of the tensorized residual stream, guaranteeing exact double stochasticity while reducing parameter complexity to O(n²C).

Result: Comprehensive experiments show KromHC matches or outperforms state-of-the-art mHC variants while requiring significantly fewer trainable parameters.

Conclusion: KromHC provides an efficient solution to hyper-connection challenges by guaranteeing exact double stochasticity with reduced parameter complexity, making hyper-connections more practical for neural network architectures.

Abstract: The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to its training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope; however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exact doubly stochastic residual matrices; 2) mHC incurs a prohibitive $\mathcal{O}(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $\mathcal{O}(nC \cdot n!)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to $\mathcal{O}(n^2C)$. Comprehensive experiments demonstrate that KromHC matches or even outperforms state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is available at https://github.com/wz1119/KromHC.
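
A quick numerical check of the property the parametrization relies on: the Kronecker product of doubly stochastic factors is itself doubly stochastic. This is only a sanity-check illustration, not the KromHC code.

```python
import numpy as np

# Sanity check: build two small doubly stochastic matrices via Sinkhorn-style
# normalization, take their Kronecker product, and verify row/column sums are 1.
def sinkhorn(m: np.ndarray, iters: int = 200) -> np.ndarray:
    for _ in range(iters):                       # alternate row/column normalization
        m = m / m.sum(axis=1, keepdims=True)
        m = m / m.sum(axis=0, keepdims=True)
    return m

rng = np.random.default_rng(0)
a = sinkhorn(rng.random((2, 2)))                 # small doubly stochastic factors
b = sinkhorn(rng.random((3, 3)))
big = np.kron(a, b)                              # 6 x 6 residual matrix from tiny factors

print("row sums:", np.round(big.sum(axis=1), 6))
print("col sums:", np.round(big.sum(axis=0), 6))
```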

[34] Language Models as Artificial Learners: Investigating Crosslinguistic Influence

Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

Main category: cs.CL

TL;DR: Language models serve as controlled statistical learners to systematically study crosslinguistic influence, revealing that language dominance and proficiency are key predictors, with bidirectional grammatical priming but dominance-sensitive ungrammatical priming.

DetailsMotivation: Human studies of crosslinguistic influence (CLI) in bilingualism often yield conflicting results due to experimental variance. The paper aims to address these inconsistencies by using language models as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers.

Method: Use language models to simulate CLI by manipulating L1 language dominance and L2 proficiency (controlled via L2 age of exposure). Investigate impact of pretraining on L1 languages with varying syntactic distance from L2. Analyze cross-linguistic priming to study how activating L1 structures impacts L2 processing.

Result: Results align with psycholinguistic studies: language dominance and proficiency are strong predictors of CLI. Grammatical structure priming is bidirectional, but ungrammatical structure priming is sensitive to language dominance. Provides mechanistic evidence that L1 is co-activated during L2 processing and directly influences neural circuitry for L2.

Conclusion: Language models can serve as a computational framework to inform theories of human crosslinguistic influence, offering controlled experimentation to isolate underlying drivers of bilingual language processing.

Abstract: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure – defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.

[35] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

Eden Avrahami, Eliya Nachmani

Main category: cs.CL

TL;DR: ILRR is a learning-free framework for steering discrete diffusion language models using a single reference sequence by aligning internal activations during denoising.

DetailsMotivation: Discrete diffusion language models lack effective inference-time control mechanisms, with existing approaches limited to sampling-level guidance or trajectory optimization. There's a need for better steering methods that can transfer semantic properties from references.

Method: ILRR guides generation by dynamically aligning the internal activations of generated sequences with those of a reference throughout denoising. It introduces Spatially Modulated Steering for handling long texts with shorter references by regulating guidance intensity across the sequence.

Result: ILRR achieves effective attribute steering on LLaDA and MDLM architectures with minimal computational overhead (one additional parallel forward pass per step). It improves attribute accuracy by 10-60 percentage points over comparable baselines while maintaining high generation quality.

Conclusion: ILRR provides an effective, learning-free framework for controlling discrete diffusion language models using reference sequences, enabling flexible attribute steering with minimal computational cost.

Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points, while maintaining high generation quality.
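
A toy sketch (not the ILRR implementation) of reference-based activation steering, plus one assumed form of spatially modulated steering in which guidance decays along the sequence so a short reference mainly shapes the opening of a long generation.

```python
import numpy as np

# Toy activations only: pull the generated sequence's hidden states toward a
# pooled reference representation, scaled by a steering strength.
def steer(h_gen: np.ndarray, h_ref: np.ndarray, scale: float) -> np.ndarray:
    target = h_ref.mean(axis=0, keepdims=True)      # pooled reference activations
    return h_gen + scale * (target - h_gen)

rng = np.random.default_rng(0)
h_gen = rng.normal(size=(32, 64))   # hidden states of the partially denoised sequence
h_ref = rng.normal(size=(12, 64))   # hidden states of a shorter reference

uniform = steer(h_gen, h_ref, scale=0.3)
weights = np.linspace(1.0, 0.2, num=h_gen.shape[0])[:, None]   # assumed spatial modulation
modulated = h_gen + weights * 0.3 * (h_ref.mean(axis=0, keepdims=True) - h_gen)
print(uniform.shape, modulated.shape)
```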

[36] AdaptBPE: From General Purpose to Specialized Tokenizers

Vijini Liyanage, François Yvon

Main category: cs.CL

TL;DR: A post-training tokenizer adaptation method that replaces low-utility tokens with domain/language-specific ones to improve compression and efficiency for specialized applications.

DetailsMotivation: Standard BPE tokenizers are trained on general corpora but applied uniformly across domains and languages, leading to inefficiencies when used for specific applications where certain tokens are rarely used while domain-relevant tokens are missing.

Method: Proposes a post-training adaptation algorithm that identifies and replaces low-utility tokens with more relevant ones based on frequency in an adaptation corpus, optimizing token inventory for target vocabulary size.

Result: Adapted tokenizers compress test corpora more effectively than baselines with same vocabulary size across multiple languages and tasks (generation and classification).

Conclusion: The method provides lightweight vocabulary fine-tuning for domain/task-specific optimization, improving tokenization efficiency without retraining the entire model.

Abstract: Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at https://github.com/vijini/Adapt-BPE.git.
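
The adaptation idea can be miniaturized as follows, using a whitespace "tokenizer" and a toy vocabulary as stand-ins for BPE: rarely used tokens are dropped and frequent domain strings take their slots, keeping the vocabulary size fixed. This is an illustration of the general idea, not the AdaptBPE algorithm.

```python
from collections import Counter

# Toy illustration: identify low-utility vocabulary entries on an adaptation
# corpus and swap them for frequent out-of-vocabulary strings.
vocab = {"the", "of", "zyx", "qwv", "model"}                 # toy general-purpose vocabulary
corpus = "the transformer model attends the tokens of the transformer".split()

usage = Counter(w for w in corpus if w in vocab)
low_utility = sorted(vocab, key=lambda t: usage.get(t, 0))[:2]   # least-used tokens
candidates = [w for w, _ in Counter(w for w in corpus if w not in vocab).most_common(2)]

adapted_vocab = (vocab - set(low_utility)) | set(candidates)     # same size, better coverage
print("dropped:", low_utility, "added:", candidates)
```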

[37] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

Debayan Dasgupta

Main category: cs.CL

TL;DR: Analyzes semantic progression in text as stochastic trajectories using Allan deviation, revealing distinct dynamical regimes that differentiate creative vs. technical texts and human vs. AI-generated content.

DetailsMotivation: To understand the underlying dynamics of semantic progression in written text and establish semantic coherence as a measurable physical property that can differentiate human cognition from algorithmic models.

Method: Treats semantic progression as stochastic trajectories in high-dimensional state space, uses Allan deviation (from precision metrology) to analyze stability of meaning by treating ordered sentence embeddings as displacement signals.

Result: Reveals two dynamical regimes: short-time power-law scaling differentiating creative literature from technical texts, and long-time crossover to stability-limited noise floor. LLMs mimic local scaling statistics but show systematic reduction in stability horizon.

Conclusion: Establishes semantic coherence as measurable physical property, offers framework to differentiate nuanced dynamics of human cognition from algorithmic models, with implications for AI text detection and understanding cognitive processes.

Abstract: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
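
The estimator itself is standard; the sketch below computes a non-overlapping Allan deviation over a toy displacement signal built from consecutive embedding differences. Treating the step-to-step norm as the signal is an assumption made here for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

# Non-overlapping Allan deviation of a 1-D signal: average over windows of size m,
# then take sqrt(0.5 * mean of squared differences between adjacent window means).
def allan_deviation(signal: np.ndarray, m: int) -> float:
    n_bins = len(signal) // m
    bins = signal[: n_bins * m].reshape(n_bins, m).mean(axis=1)
    return float(np.sqrt(0.5 * np.mean(np.diff(bins) ** 2)))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(512, 16))                             # toy "sentence embeddings" in order
displacement = np.linalg.norm(np.diff(embeddings, axis=0), axis=1)  # step-to-step semantic motion

for m in (1, 2, 4, 8, 16, 32):
    print(f"tau={m:2d}  ADEV={allan_deviation(displacement, m):.4f}")
```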

[38] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Xiaoyu Xu, Minxin Du, Kun Fang, Zi Liang, Yaxin Xiao, Zhicong Huang, Cheng Hong, Qingqing Ye, Haibo Hu

Main category: cs.CL

TL;DR: FIT is a continual unlearning framework for LLMs that handles sequential deletion requests while preventing catastrophic forgetting and maintaining model utility through data filtering, importance-aware updates, and targeted layer attribution.

DetailsMotivation: Existing LLM unlearning methods struggle with the continual, high-volume nature of real-world deletion requests, leading to utility degradation and catastrophic forgetting as requests accumulate over time.

Method: FIT uses three key components: rigorous data filtering to identify target knowledge, importance-aware updates to preserve non-target knowledge, and targeted layer attribution to focus unlearning on specific model layers. The framework also introduces the PCH benchmark covering personal information, copyright, and harmful content.

Result: Extensive experiments on four open-source LLMs with hundreds of deletion requests show FIT achieves the strongest trade-off between forgetting quality and utility preservation, outperforms existing methods on MMLU, CommonsenseQA, and GSM8K benchmarks, and remains resistant to relearning and quantization recovery attacks.

Conclusion: FIT provides an effective solution for continual unlearning in LLMs, addressing the practical challenge of handling sequential deletion requests while maintaining model utility and preventing knowledge recovery.

Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.

[39] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

Main category: cs.CL

TL;DR: RSE (Recycling Search Experience) is a training-free strategy that recycles intermediate reasoning insights from test-time scaling to reduce computational redundancy and improve efficiency in complex reasoning tasks.

DetailsMotivation: Current test-time scaling methods waste computational resources by treating each reasoning attempt as disposable, forcing models to repeatedly re-derive the same conclusions and revisit known dead ends without learning from previous attempts.

Method: RSE actively distills raw reasoning trajectories into a shared experience bank, enabling positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune dead ends, all without additional training.
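
A skeleton of the recycling loop as described, with `llm_generate` and `extract_experience` as assumed helpers; this illustrates the idea, not the authors' implementation:

```python
def rse_search(llm_generate, extract_experience, problem, budget=8):
    """Test-time search that recycles experience across rollouts (illustrative skeleton).

    llm_generate(prompt) -> (trajectory_text, final_answer)                       # assumed helper
    extract_experience(trajectory_text) -> {"lemmas": [...], "dead_ends": [...]}  # assumed helper
    """
    bank = {"lemmas": [], "dead_ends": []}
    answers = []
    for _ in range(budget):
        hints = ""
        if bank["lemmas"]:
            hints += "Conclusions already established:\n- " + "\n- ".join(bank["lemmas"]) + "\n"
        if bank["dead_ends"]:
            hints += "Approaches known to fail:\n- " + "\n- ".join(bank["dead_ends"]) + "\n"
        trajectory, answer = llm_generate(hints + problem)   # positive/negative recycling via the prompt
        answers.append(answer)
        notes = extract_experience(trajectory)               # distil the rollout into the experience bank
        bank["lemmas"].extend(notes.get("lemmas", []))
        bank["dead_ends"].extend(notes.get("dead_ends", []))
    return answers, bank
```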

Result: RSE consistently outperforms strong baselines on HMMT24, HMMT25, IMO-Bench, and HLE benchmarks with comparable computational cost, achieving state-of-the-art scaling efficiency.

Conclusion: RSE transforms test-time search from isolated trials into a cumulative process, significantly improving reasoning efficiency by recycling search experience without requiring model retraining.

Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.

[40] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang

Main category: cs.CL

TL;DR: DAVID-GRPO enables small language models to achieve strong multi-hop reasoning under resource constraints through budget-efficient RL with stabilization, evidence-based credit assignment, and improved exploration.

DetailsMotivation: Existing RL methods for multi-turn reasoning agents require extensive on-policy rollouts with large models, but under realistic resource constraints, small language models suffer from sparse exploration, sparse credit assignment, and unstable training in low-cost, low-accuracy regimes.

Method: DAVID-GRPO introduces a budget-efficient RL framework that: (1) stabilizes early learning with minimal supervision, (2) assigns retrieval credit based on evidence recall, and (3) improves exploration by resampling truncated near-miss trajectories.

Result: Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks.

Conclusion: With the right inductive biases, small agents can achieve low training cost with high accuracy, challenging the trade-off between resource constraints and reasoning performance.

Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.

[41] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Qingyue Yang, Jie Wang, Xing Li, Yinqi Bai, Xialiang Tong, Huiling Zhen, Jianye Hao, Mingxuan Yuan, Bin Li

Main category: cs.CL

TL;DR: TAPPA is a framework that unifies diverse attention patterns in LLMs by analyzing them from a temporally continuous perspective, distinguishing between predictable and unpredictable patterns based on query self-similarity.

DetailsMotivation: Prior work has identified fragmented attention patterns (retrieval heads, sink heads, diagonal traces) without a unifying explanation. The authors aim to bridge this gap by providing a comprehensive framework that explains diverse attention patterns mathematically.

Method: TAPPA analyzes attention patterns from a temporally continuous perspective, characterizing them as predictable (with clear regularities) or unpredictable (effectively random). The distinction is explained by query self-similarity along the temporal dimension. The framework provides mathematical analysis of three representative cases through the joint effect of queries, keys, and RoPE.
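
The distinction hinges on query self-similarity along the temporal dimension. One illustrative way to measure that for a single attention head (the paper's exact definition is not given in this summary):

```python
import torch
import torch.nn.functional as F

def temporal_query_self_similarity(q):
    """
    q: (seq_len, head_dim) query vectors of one head, ordered by position.
    Returns the mean cosine similarity between queries at adjacent positions;
    high values suggest a temporally stable, hence predictable, attention pattern.
    """
    q = F.normalize(q, dim=-1)
    return (q[1:] * q[:-1]).sum(dim=-1).mean()
```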

Result: TAPPA’s insights were validated on KV cache compression and LLM pruning tasks. A simple metric motivated by TAPPA consistently improved performance over baseline methods in these tasks.

Conclusion: TAPPA provides a unifying framework for understanding attention patterns in LLMs, offering both theoretical insights and practical applications for inference acceleration through KV cache compression and model pruning.


Abstract: Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at https://github.com/MIRALab-USTC/LLM-TAPPA.

[42] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

Huiyuan Lai, Malvina Nissim

Main category: cs.CL

TL;DR: TACLer is a curriculum RL framework for LLMs that improves reasoning efficiency by tailoring training complexity and using hybrid Thinking/NoThinking modes, reducing compute costs while boosting accuracy on math problems.

DetailsMotivation: Current LLMs require extensive RL training for long chain-of-thought reasoning, which is computationally expensive and can lead to overthinking with redundant steps. There's a need to improve learning efficiency while maintaining or enhancing performance.

Method: TACLer uses model-tailored curriculum RL with two components: 1) tailored curriculum learning that progressively increases data complexity based on model proficiency, and 2) hybrid Thinking/NoThinking reasoning that balances accuracy and efficiency by selectively enabling reasoning modes.

Result: TACLer reduces training compute by over 50% compared to long thinking models, cuts inference token usage by over 42%, and improves accuracy by over 9% on base models, outperforming state-of-the-art baselines across four math datasets.

Conclusion: The proposed curriculum RL framework effectively improves both learning efficiency and reasoning performance for LLMs, offering a practical solution to the computational challenges of long chain-of-thought reasoning.

Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training, while often leading to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model’s proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.

[43] Enhancing Language Models for Robust Greenwashing Detection

Neil Heinrich Braun, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo

Main category: cs.CL

TL;DR: A parameter-efficient framework for ESG report analysis that structures LLM latent spaces using contrastive learning and ordinal ranking to distinguish concrete actions from vague claims, with gated feature modulation and MetaGradNorm for stable multi-objective optimization.

DetailsMotivation: Sustainability reports are crucial for ESG assessment but suffer from greenwashing and vague claims that undermine reliability. Existing NLP models lack robustness to these practices, relying on surface-level patterns that generalize poorly.

Method: Proposes a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization.
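
The ordinal ranking objective could look like the pairwise margin loss below, sketched under the assumption that each disclosure receives a scalar "concreteness" score and an ordinal grade; the paper additionally combines this with contrastive learning, gated feature modulation, and MetaGradNorm weighting:

```python
import torch
import torch.nn.functional as F

def ordinal_ranking_loss(scores, grades, margin=0.5):
    """
    scores: (B,) scalar scores from the model, one per disclosure.
    grades: (B,) ordinal labels, e.g. 0 = vague claim ... 3 = concrete action.
    Pairs where grades[j] > grades[i] are pushed to satisfy scores[j] - scores[i] >= margin.
    """
    grade_diff = grades.unsqueeze(0) - grades.unsqueeze(1)   # [i, j] = grades[j] - grades[i]
    score_diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # [i, j] = scores[j] - scores[i]
    mask = (grade_diff > 0).float()
    loss = F.relu(margin - score_diff) * mask
    return loss.sum() / mask.sum().clamp_min(1.0)
```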

Result: Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.

Conclusion: The framework effectively addresses greenwashing detection in sustainability reports by creating more robust representations that can distinguish between concrete actions and vague claims, though there’s a trade-off between rigidity and generalization.

Abstract: Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.

[44] Procedural Pretraining: Warming Up Language Models with Abstract Data

Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

Main category: cs.CL

TL;DR: Procedural pretraining on abstract structured data (like formal languages/algorithms) before natural language exposure improves LLM performance and accelerates training, enabling models to reach same loss with 55-86% of original data.

DetailsMotivation: Current LLMs pretrain directly on web-scale corpora, but humans learn simple logic/math before higher reasoning. The paper explores whether exposing models to abstract structured data first can ease subsequent semantic knowledge acquisition.

Method: Front-load procedural data (0.1% of training) from formal languages/algorithms (e.g., Dyck sequences for balanced brackets) before standard pretraining on natural language, code, and math datasets. Analyze algorithmic skills, scale to 1.3B models, and study attention/MLP mechanisms.
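
For concreteness, procedural data of this kind is cheap to generate; a minimal sampler for balanced-bracket (Dyck) strings, noting that the paper's exact generation procedure may differ:

```python
import random

def dyck_sequence(n_pairs, bracket_types=("()", "[]", "{}")):
    """Sample a random balanced-bracket (Dyck) string with n_pairs bracket pairs."""
    out, stack = [], []
    opened = 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        can_close = bool(stack)
        if can_open and (not can_close or random.random() < 0.5):
            open_b, close_b = random.choice(bracket_types)
            out.append(open_b)
            stack.append(close_b)   # remember which bracket must close this one
            opened += 1
        else:
            out.append(stack.pop())
    return "".join(out)

# e.g. dyck_sequence(5) -> "([]){()[]}"  (one possible sample)
```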

Result: Procedural pretraining significantly improves performance: Needle-in-a-haystack accuracy jumps from 10% to 98% with Dyck sequences. Models reach same loss with only 55-86% of original data. Attention layers benefit structured domains (code), MLP layers benefit language.


Conclusion: Procedural pretraining is a simple, lightweight method to improve LLM performance and accelerate training, suggesting promise in disentangling knowledge acquisition from reasoning, similar to human learning progression.

Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much like humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.

[45] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin, Guoping Hu

Main category: cs.CL

TL;DR: CE-GOCD enhances LLM-based scientific QA by using central entities and community detection on academic knowledge graphs to identify semantic substructures between papers.

DetailsMotivation: Existing retrieval methods for LLM-based scientific QA use isolated text chunks or concepts, missing deeper semantic connections between papers, which limits comprehension and response quality.

Method: Proposes Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD): (1) uses paper titles as central entities for targeted subgraph retrieval, (2) enhances semantic discovery via subgraph pruning/completion, (3) applies community detection to group papers with shared themes.

Result: Evaluated on three NLP literature QA datasets; demonstrates superiority over other retrieval-augmented baselines, confirming framework effectiveness.

Conclusion: CE-GOCD effectively models semantic substructures in academic knowledge graphs to improve LLM comprehension of scientific literature for QA tasks.

Abstract: Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM’s comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs’ scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.

[46] Temporal Guidance for Large Language Models

Hong-Kai Zheng, Piji Li

Main category: cs.CL

TL;DR: TeGu is a novel contrastive decoding method that uses temporal guidance with multi-token prediction to improve LLM generation quality without needing auxiliary models, reducing computational overhead.

DetailsMotivation: Existing contrastive decoding methods require auxiliary models causing high computational overhead, and internal self-contrast methods like DoLa are unstable on small models. The paper aims to develop an efficient self-contrast method without external models.

Method: Proposes Temporal Guidance (TeGu) using multi-token prediction to create weaker amateur predictions for self-contrast. Introduces lightweight Conditional MTP Projector (cMTPP) to standardize implementation without multiple independent networks.
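
As a rough illustration of decoding with a weaker "amateur" prediction used for self-contrast (in TeGu the amateur distribution would come from the MTP head; the exact combination rule is not given in this summary, so the sketch below follows the standard contrastive-decoding recipe):

```python
import torch

def contrastive_decoding_logits(expert_logits, amateur_logits, beta=1.0, tau=0.1):
    """
    Reward tokens the expert prefers relative to the amateur, restricted to an
    adaptive plausibility set. expert_logits / amateur_logits: (vocab_size,) next-token logits.
    """
    log_e = torch.log_softmax(expert_logits, dim=-1)
    log_a = torch.log_softmax(amateur_logits, dim=-1)
    scores = log_e - beta * log_a
    # keep only tokens whose expert probability is at least tau * (max expert probability)
    cutoff = log_e.max() + torch.log(torch.tensor(tau))
    return scores.masked_fill(log_e < cutoff, float("-inf"))

# next_token = torch.argmax(contrastive_decoding_logits(expert_logits, amateur_logits))
```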

Result: TeGu achieves significant performance improvements across various model series and benchmarks while maintaining low additional memory consumption and computational overhead.

Conclusion: TeGu provides an effective and efficient contrastive decoding approach that leverages temporal dimension guidance with multi-token prediction, offering better performance than existing methods with minimal computational cost.

Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.

[47] CoFrGeNet: Continued Fraction Architectures for Language Generation

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

Main category: cs.CL

TL;DR: CoFrGeNets introduces a new function class based on continued fractions to replace Transformer components (Multi-head Attention and FFNs) with fewer parameters while maintaining competitive performance.

DetailsMotivation: Transformers are computationally expensive with many parameters. The authors aim to create more efficient alternatives to Transformer components while maintaining performance, making them easier to deploy in industrial workflows.

Method: Proposes CoFrGeNets (Continued Fraction Generative Networks) - a new function class inspired by continued fractions. Designs novel architectural components that replace Multi-head Attention and Feed-Forward Networks in Transformer blocks. Derives custom gradient formulations for more accurate and efficient optimization than standard PyTorch gradients.
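
To make the function class concrete, a finite continued fraction with learned affine terms can be written as a small module like the one below. This only illustrates the general form named in the abstract; the actual CoFrGeNet components that replace attention and FFN blocks are more elaborate.

```python
import torch
import torch.nn as nn

class ContinuedFractionLayer(nn.Module):
    """Elementwise finite continued fraction (illustrative only):
    y = a_0(x) + b_1(x) / (a_1(x) + b_2(x) / (a_2(x) + ...)),
    with each a_k, b_k a learned affine map of the input."""

    def __init__(self, dim, depth=3, eps=1e-3):
        super().__init__()
        self.a = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth + 1)])
        self.b = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        self.eps = eps

    def forward(self, x):
        frac = self.a[-1](x)                       # innermost term a_depth(x)
        for k in reversed(range(len(self.b))):
            # keep the denominator away from zero for numerical stability
            denom = frac + torch.where(frac >= 0,
                                       torch.full_like(frac, self.eps),
                                       torch.full_like(frac, -self.eps))
            frac = self.a[k](x) + self.b[k](x) / denom
        return frac
```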

Result: Tested on GPT2-xl (1.5B) and Llama3 (3.2B) architectures. Models achieve competitive/superior performance on downstream tasks (classification, Q&A, reasoning, text understanding) with 2/3 to 1/2 the parameters and shorter pre-training time compared to original models.

Conclusion: CoFrGeNets offer efficient alternatives to Transformer components that maintain performance with fewer parameters. The approach is easy to integrate into existing workflows and future hardware-optimized implementations could further improve efficiency.

Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring much fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in training or inference procedures that have already been put in place for Transformer-based models, thus making our approach easy to incorporate in large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B), pre-training the former on OpenWebText and GneissWeb and the latter on the docling data mix, which consists of nine different datasets. Results show that our models' performance on downstream classification, Q&A, reasoning, and text-understanding tasks is competitive with, and sometimes even superior to, the original models with 2/3 to 1/2 the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.

[48] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

Wei Zhu

Main category: cs.CL

TL;DR: ChatGPT’s performance on medical information extraction tasks lags behind fine-tuned models, though it provides good explanations and faithfulness to text, but suffers from over-confidence and generation uncertainty.

DetailsMotivation: To assess ChatGPT's capabilities in medical information extraction tasks, measuring its performance, explainability, confidence, faithfulness, and uncertainty compared to fine-tuned models.

Method: Systematic analysis of ChatGPT on 4 different medical information extraction tasks across 6 benchmark datasets, evaluating performance, explainability, confidence, faithfulness, and uncertainty.

Result: ChatGPT performs worse than fine-tuned models on MedIE tasks, provides high-quality explanations but is over-confident, shows high faithfulness to original text, and generation uncertainty affects extraction results.

Conclusion: While ChatGPT shows promise in explainability and faithfulness, its performance limitations, over-confidence, and generation uncertainty hinder its practical application in medical information extraction tasks.

Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis measuring ChatGPT’s performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT’s performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, ChatGPT is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, which may hinder its applications in MedIE tasks.

[49] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

Alon Rozental

Main category: cs.CL

TL;DR: Zonkey introduces a hierarchical diffusion model with differentiable tokenizer for end-to-end trainable text generation, replacing fixed tokenizers like BPE with adaptive, linguistically meaningful segmentation.

DetailsMotivation: Current LLMs use fixed, non-differentiable tokenizers (like BPE) that limit end-to-end optimization and adaptability to noisy or domain-specific data, creating a need for fully trainable pipelines.

Method: Hierarchical diffusion model with differentiable tokenizer (Segment Splitter) using Probabilistic Attention for soft masking, hierarchical compression from characters to document-level representations, and Denoising Diffusion Mixed Model (DDMM) for stable latent-space denoising.

Result: Zonkey generates coherent, variable-length text from noise, demonstrating emergent linguistic hierarchies (word boundaries, sentence starts) without supervision and showing promising alignment to data distributions compared to entropy-based tokenizers.

Conclusion: The approach advances toward fully gradient-based LLMs with potential for better domain adaptation and scalable generation, moving beyond fixed tokenization constraints.

Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.

[50] KID: Knowledge-Injected Dual-Head Learning for Knowledge-Grounded Harmful Meme Detection

Yaocong Li, Leihan Zhang, Le Zhang, Qiang Yan

Main category: cs.CL

TL;DR: KID framework for harmful meme detection using knowledge injection and dual-head learning to understand implicit toxicity through structured reasoning chains.

DetailsMotivation: Internet memes are pervasive but rely heavily on metaphors and sociocultural context, making them subtle vehicles for harmful content. Existing approaches focus on intra-modal and inter-modal signals but lack understanding of implicit toxicity that depends on background knowledge not present in the meme itself.

Method: Proposes KID (Knowledge-Injected Dual-Head Learning) framework with label-constrained distillation to decompose meme understanding into structured reasoning chains linking visual evidence, background knowledge, and classification labels. Uses dual-head architecture jointly optimizing semantic generation and classification objectives.

Result: Achieves SOTA performance on five multilingual datasets (English, Chinese, Bengali), improving over previous best methods by 2.1%–19.7% across primary evaluation metrics. Ablation studies confirm effectiveness of knowledge injection and dual-head joint learning.

Conclusion: KID framework effectively addresses implicit toxicity in memes through knowledge-grounded reasoning, demonstrating robust and generalizable meme understanding across multiple languages and resource settings.

Abstract: Internet memes have become pervasive carriers of digital culture on social platforms. However, their heavy reliance on metaphors and sociocultural context also makes them subtle vehicles for harmful content, posing significant challenges for automated content moderation. Existing approaches primarily focus on intra-modal and inter-modal signal analysis, while the understanding of implicit toxicity often depends on background knowledge that is not explicitly present in the meme itself. To address this challenge, we propose KID, a Knowledge-Injected Dual-Head Learning framework for knowledge-grounded harmful meme detection. KID adopts a label-constrained distillation paradigm to decompose complex meme understanding into structured reasoning chains that explicitly link visual evidence, background knowledge, and classification labels. These chains guide the learning process by grounding external knowledge in meme-specific contexts. In addition, KID employs a dual-head architecture that jointly optimizes semantic generation and classification objectives, enabling aligned linguistic reasoning while maintaining stable decision boundaries. Extensive experiments on five multilingual datasets spanning English, Chinese, and low-resource Bengali demonstrate that KID achieves SOTA performance on both binary and multi-label harmful meme detection tasks, improving over previous best methods by 2.1%–19.7% across primary evaluation metrics. Ablation studies further confirm the effectiveness of knowledge injection and dual-head joint learning, highlighting their complementary contributions to robust and generalizable meme understanding. The code and data are available at https://github.com/PotatoDog1669/KID.

[51] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

Yimin Deng, Yuqing Fu, Derong Xu, Yejing Wang, Wei Ni, Jingtong Gao, Xiaopeng Li, Chengxu Liu, Xiao Han, Guoshuai Zhao, Xiangyu Zhao, Li Zhu, Xueming Qian

Main category: cs.CL

TL;DR: AMA is an adversarial memory adaptation mechanism that aligns offline memory construction with task objectives by simulating task execution through challenger, evaluator, and adapter agents.

DetailsMotivation: Existing memory systems for conversational agents have fixed, task-independent offline phases that fail to emphasize task-relevant information, leading to misalignment between memory preparation and downstream task requirements.

Method: Proposes Adversarial Memory Adaptation (AMA) with three agents: 1) Challenger generates QA pairs from dialogues, 2) Memory answers these questions simulating inference, 3) Evaluator assesses responses and performs error analysis, 4) Adapter analyzes errors and performs dual-level updates on construction strategy and content.

Result: Extensive experiments on long dialogue benchmark LoCoMo demonstrate AMA’s effectiveness, showing it can be integrated into various existing memory systems to enhance adaptability to downstream tasks.

Conclusion: AMA provides task-aware supervision signals during the offline phase, aligning memory construction and update with task objectives, thereby improving downstream task performance for conversational agents handling long conversations.

Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task-relevant information. Meanwhile, memory updates are guided by generic metrics rather than task-specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question-answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual-level updates on both the construction strategy and the content. Through this process, the memory system receives task-aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on the long dialogue benchmark LoCoMo demonstrate its effectiveness.

[52] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

Korbinian Randl, Guido Rocchietti, Aron Henriksson, Ziawasch Abedjan, Tony Lindgren, John Pavlopoulos

Main category: cs.CL

TL;DR: RAG-E is an explainability framework that quantifies alignment between retrievers and generators in RAG systems using attribution methods, revealing critical misalignments where generators often ignore top-ranked documents.

DetailsMotivation: RAG systems combine dense retrievers and language models, but their opaque interaction creates deployment challenges in high-stakes domains, making it difficult to understand how components work together.

Method: RAG-E adapts Integrated Gradients for retriever analysis, introduces PMCSHAP (Monte Carlo-stabilized Shapley Value approximation) for generator attribution, and uses the Weighted Attribution-Relevance Gap (WARG) metric to measure generator-retriever alignment.
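
An illustrative version of an attribution-relevance gap between the generator's per-document attributions and the retriever's relevance scores; the published WARG metric's exact weighting and normalisation may differ from this sketch:

```python
import numpy as np

def attribution_relevance_gap(attributions, relevances):
    """
    attributions: per-document attribution mass assigned by the generator (length k).
    relevances:   retriever relevance scores for the same k documents.
    Both are normalised to distributions; the mismatch on each document is
    weighted by its retriever relevance, so disagreement on top-ranked
    documents counts most. Larger values indicate worse alignment.
    """
    a = np.asarray(attributions, dtype=float)
    r = np.asarray(relevances, dtype=float)
    a = a / (a.sum() + 1e-12)
    r = r / (r.sum() + 1e-12)
    return float(np.sum(r * np.abs(a - r)))
```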

Result: Empirical analysis on TREC CAsT and FoodSafeSum shows critical misalignments: 47.4-66.7% of queries have generators ignoring top-ranked documents, and 48.1-65.9% rely on less relevant documents.

Conclusion: RAG output quality depends on component interplay, not just individual performance, and RAG-E provides auditing capabilities to identify and address these misalignments in RAG systems.

Abstract: Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator’s document usage aligns with a retriever’s ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever’s top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.

[53] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

Bodong Du, Xuanqi Huang, Xiaomeng Li

Main category: cs.CL

TL;DR: DARE improves test-time reinforcement learning for LLMs by using full rollout distributions instead of majority voting for reward estimation, achieving better performance on reasoning benchmarks.

DetailsMotivation: Existing test-time RL methods for LLMs rely on majority voting for reward estimation, which discards information about non-majority but correct actions and yields biased rewards. This fragility motivates a more robust approach.

Method: Proposes Distribution-Aware Reward Estimation (DARE) that shifts from single majority outcomes to full empirical rollout distributions. Includes exploration bonus and distribution pruning for better exploration and denoising.
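
The core shift from a single majority outcome to the full rollout distribution can be sketched as follows (one natural reading of the summary; DARE's exploration bonus and distribution pruning are omitted):

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Baseline TTRL reward: 1 for rollouts matching the majority answer, 0 otherwise."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def distribution_aware_rewards(answers):
    """
    Distribution-aware reward sketch: each rollout is credited with the empirical
    frequency of its final answer across all rollouts, so consistent minority
    answers still receive learning signal instead of being zeroed out.
    """
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]
```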

Result: DARE improves optimization stability and final performance over baselines, achieving 25.3% relative improvement on AIME 2024 and 5.3% on AMC benchmarks.

Conclusion: Using full rollout distributions rather than majority voting provides more informative and robust reward estimation for test-time RL in LLMs, leading to better reasoning performance.

Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.

[54] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Tinoosh Mohsenin, Xiaomin Lin

Main category: cs.CL

TL;DR: MilSCORE: First expert-authored military planning benchmark requiring multi-modal reasoning over maps, orders, and intelligence reports for long-context geospatial planning tasks.

DetailsMotivation: Need for realistic long-context benchmarks requiring selective reading and integration of heterogeneous, multi-modal information sources, especially for geospatial planning problems like military operations.

Method: Created MilSCORE dataset with expert-authored multi-hop questions grounded in complex simulated military planning scenarios, evaluating LLMs’ ability to combine tactical and spatial reasoning across multiple sources.

Result: Substantial headroom on MilSCORE benchmark, indicating current systems struggle with realistic scenario-level long-context planning; baseline results reported for contemporary vision-language models.

Conclusion: MilSCORE serves as a challenging testbed for future work on multi-modal long-context reasoning, highlighting limitations of current systems in realistic geospatial planning tasks.

Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.

[55] Embodied Task Planning via Graph-Informed Action Generation with Large Language Model

Xiang Li, Ning Yan, Masood Mortazavi

Main category: cs.CL

TL;DR: GiG: Graph-in-Graph planning framework for embodied agents that uses GNNs to encode environmental states into graph embeddings, organizes them in memory banks, and enables retrieval of structure-aware priors for better long-horizon planning.

DetailsMotivation: LLMs have strong reasoning capabilities but face challenges as embodied agents in long-horizon planning due to context window limitations and hallucination of transitions that violate environmental constraints. Standard LLM planners struggle with maintaining strategy coherence over extended horizons.

Method: Proposes GiG framework with Graph-in-Graph architecture: 1) Uses GNN to encode environmental states into embeddings, 2) Organizes embeddings into action-connected execution trace graphs in experience memory bank, 3) Clusters graph embeddings for retrieval of structure-aware priors, 4) Introduces bounded lookahead module using symbolic transition logic for grounded action projection.

Result: Outperforms state-of-the-art baselines on three embodied planning benchmarks: achieves Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.

Conclusion: GiG framework effectively addresses long-horizon planning challenges for embodied agents by leveraging graph-structured memory and symbolic reasoning, enabling better strategy coherence and constraint adherence in dynamic environments.

Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitation or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents’ memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents’ planning capabilities through the grounded action projection. We evaluate our framework on three embodied planning benchmarks-Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.

[56] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

Hongyi Zhou, Jin Zhu, Erhan Xu, Kai Ye, Ying Yang, Chengchun Shi

Main category: cs.CL

TL;DR: A novel rewrite-based detection algorithm that adaptively learns distance between original and rewritten text to detect LLM-generated content, showing significant improvements over baselines.

DetailsMotivation: LLMs can produce highly human-like text, raising concerns about misinformation and academic integrity, creating urgent need for reliable detection algorithms.

Method: Geometric approach to analyze rewrite-based detection, then introduces novel algorithm that adaptively learns distance between original and rewritten text rather than using fixed distance.
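
A bare-bones version of rewrite-based scoring with a learnable weighted distance; the per-dimension weight vector below merely stands in for the adaptively learned distance, and the paper's actual parameterisation and decision rule may differ:

```python
import numpy as np

def rewrite_distance_score(original_emb, rewrite_embs, w):
    """
    original_emb: (dim,) embedding of the candidate passage.
    rewrite_embs: (n_rewrites, dim) embeddings of LLM rewrites of that passage.
    w:            (dim,) learned non-negative weights; w = 1 recovers a fixed
                  Euclidean-distance baseline.
    Rewrite-based detectors rely on machine text changing less under rewriting
    than human text, so a smaller score suggests LLM origin (threshold chosen
    on held-out data).
    """
    diffs = rewrite_embs - original_emb
    dists = np.sqrt(np.sum(w * diffs ** 2, axis=1))   # weighted Euclidean distances
    return float(dists.mean())
```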

Result: Extensive experiments with 100+ settings show superior performance over baselines, achieving 57.8% to 80.6% relative improvements over strongest baseline across different target LLMs (GPT, Claude, Gemini).

Conclusion: Adaptively learned distance functions are more effective for LLM-generated content detection than fixed distances, with the proposed approach demonstrating strong generalization across different LLMs.

Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).

[57] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

Hong Chen, Xiang Liu, Bo Wang, Yuxuan Fan, Yuanlin Chu, Zongluo Li, Xiaowen Chu, Xuming Hu

Main category: cs.CL

TL;DR: SONIC is a learning-based KV cache compression framework that compresses historical dialogue segments into compact Nexus tokens, enabling efficient multi-turn LLM deployment with flexible memory constraints.

DetailsMotivation: The linear growth of KV cache is a bottleneck for multi-turn LLM deployment, and existing compression methods fail to account for dialogue structural properties, relying on heuristic eviction that risks losing critical context.

Method: SONIC compresses historical segments into compact Nexus tokens using a learning-based framework with dynamic budget training, allowing flexible adaptation to varying memory constraints without retraining.

Result: At compression ratios of 80% and 50%, SONIC outperforms baselines like H2O and StreamingLLM on four multi-turn benchmarks, achieving 35.55% average score improvement on MTBench101 and accelerating inference by 50.1% compared to full-context generation.

Conclusion: SONIC effectively sustains coherent multi-turn dialogues while enhancing deployment efficiency through learning-based KV cache compression.

Abstract: The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.

[58] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Fariba Afrin Irany

Main category: cs.CL

TL;DR: Selective fine-tuning of GPT-2 for clinical text classification using frozen backbone with minimal trainable parameters achieves strong performance on radiology report analysis.

DetailsMotivation: Clinical narratives in EHRs offer opportunities for automated disease analysis but face challenges: limited labeled data, class imbalance, and high computational costs of adapting large language models to domain-specific clinical text.

Method: GPT-based architecture with selective fine-tuning strategy: freezes majority of GPT-2 backbone, trains only final Transformer block, final layer normalization, and lightweight classification head. Evaluated on MIMIC-IV-Note radiology reports with CheXpert-style labels.
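
The selective fine-tuning recipe described (freeze the backbone, train only the last block, the final LayerNorm, and a small head) maps directly onto the Hugging Face GPT-2 classes. A minimal sketch, with the number of labels and the last-token pooling choice as assumptions:

```python
import torch
import torch.nn as nn
from transformers import GPT2Model

NUM_LABELS = 14  # e.g. CheXpert-style finding labels (assumed)

backbone = GPT2Model.from_pretrained("gpt2")

# Freeze everything, then unfreeze only the final Transformer block and the final LayerNorm.
for p in backbone.parameters():
    p.requires_grad = False
for p in backbone.h[-1].parameters():
    p.requires_grad = True
for p in backbone.ln_f.parameters():
    p.requires_grad = True

# Lightweight multi-label classification head on the last non-pad token's hidden state.
classifier = nn.Linear(backbone.config.n_embd, NUM_LABELS)

def classify(input_ids, attention_mask):
    hidden = backbone(input_ids, attention_mask=attention_mask).last_hidden_state
    last_idx = attention_mask.sum(dim=1) - 1                    # index of final non-pad token
    pooled = hidden[torch.arange(hidden.size(0)), last_idx]     # (batch, n_embd)
    return classifier(pooled)                                   # logits for BCEWithLogitsLoss
```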

Result: Model shows stable convergence and strong classification performance across varying dataset sizes, particularly effective for non-mention and negated findings. Achieves efficient adaptation with significantly reduced computational complexity.

Conclusion: Selective fine-tuning of pretrained generative language models provides efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while reducing computational burden.

Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.

[59] OVD: On-policy Verbal Distillation

Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong

Main category: cs.CL

TL;DR: OVD is a memory-efficient on-policy distillation framework that uses verbal scores instead of token-level probability matching, enabling better exploration and improved performance on reasoning tasks.

DetailsMotivation: Existing token-level on-policy distillation methods suffer from token-level alignment requirements that restrict student model exploration, prevent effective use of interactive environment feedback, and create severe memory bottlenecks in reinforcement learning.

Method: On-policy Verbal Distillation (OVD) replaces token-level probability matching with trajectory matching using discrete verbal scores (0-9) from teacher models, avoiding token-level alignment and reducing memory consumption.

Result: OVD substantially outperforms existing methods with up to +12.9% absolute improvement in average EM on Web Q&A tasks and up to +25.7% gain on math benchmarks, while exhibiting superior training efficiency.

Conclusion: OVD provides an effective memory-efficient framework for knowledge distillation that enables better exploration and performance on reasoning tasks by using verbal feedback instead of token-level alignment.

Abstract: Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model’s exploration ability, prevents effective use of interactive environment feedback, and suffers from severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive experiments on Web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on Web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io

[60] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Yifan Zhu, Huiqiang Rong, Haoran Luo

Main category: cs.CL

TL;DR: Token-Guard: A token-level hallucination control method using self-checking decoding with internal verification, latent space hallucination risk scoring, and iterative pruning/regeneration to detect and correct hallucinations in LLM outputs.

DetailsMotivation: LLMs often hallucinate content inconsistent with input. Existing solutions like RAG and RLHF are resource-intensive, while decoding-based methods lack explicit hallucination control. Need for lightweight, scalable hallucination mitigation.

Method: Token-level hallucination control using self-checking decoding: 1) Internal verification at each reasoning step to detect hallucinated tokens before propagation, 2) Candidate fragment evaluation in latent space with explicit hallucination risk scoring, 3) Iterative pruning and regeneration to dynamically correct detected errors.
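
A schematic decoding loop illustrating the prune-and-regenerate pattern; `propose_fragment` and `hallucination_risk` are hypothetical stand-ins for the paper's candidate generation and latent-space risk scorer.

```python
def self_checking_decode(propose_fragment, hallucination_risk, prompt: str,
                         max_fragments: int = 20, risk_threshold: float = 0.5,
                         max_retries: int = 3) -> str:
    """Generate fragment by fragment, rejecting candidates whose estimated
    hallucination risk exceeds a threshold and regenerating them."""
    output = ""
    for _ in range(max_fragments):
        accepted = None
        for _ in range(max_retries):
            candidate = propose_fragment(prompt, output)          # next candidate fragment
            if candidate is None:                                 # model signalled end of text
                return output
            risk = hallucination_risk(prompt, output, candidate)  # risk score in [0, 1]
            if risk <= risk_threshold:
                accepted = candidate
                break                                             # accept the low-risk fragment
        if accepted is None:
            break                                                 # all retries were high-risk: stop early
        output += accepted
    return output
```

The key property is that verification happens before a fragment is committed, so an early hallucination cannot propagate into later reasoning steps.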

Result: Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy. Offers scalable, modular solution for reliable LLM outputs.

Conclusion: Token-Guard provides effective hallucination control without resource-intensive retrieval or large-scale fine-tuning, enabling more reliable LLM generation through token-level self-checking mechanisms.

Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.

[61] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Jianhui Chen, Yuzhang Luo, Liangming Pan

Main category: cs.CL

TL;DR: MDA framework uses Influence Functions to trace interpretable circuits in LLMs back to specific training samples, showing causal links between data patterns and model capabilities.

DetailsMotivation: While Mechanistic Interpretability has identified interpretable circuits in LLMs, the causal origins of these circuits in training data remain unknown. The paper aims to bridge this gap by developing methods to trace interpretable units back to specific training samples.

Method: Introduces Mechanistic Data Attribution (MDA), a scalable framework using Influence Functions to attribute interpretable units to training samples. Conducts experiments on Pythia family models with targeted interventions (removing/augmenting high-influence samples) versus random interventions.
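
A rough sketch of the attribution step under the common identity-Hessian simplification, where the influence of a training sample on a circuit-level metric reduces to a gradient dot product; the paper's actual influence-function estimator and circuit metrics may differ.

```python
import torch

def flat_grad(scalar, params):
    """Flatten the gradient of a scalar w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(scalar, params, allow_unused=True)
    return torch.cat([g.reshape(-1) if g is not None else torch.zeros_like(p).reshape(-1)
                      for g, p in zip(grads, params)])

def influence_scores(model, circuit_metric, loss_fn, train_batches):
    """Score training batches by how their gradients align with the gradient of a
    circuit-level metric (e.g., an induction-head score).
    Identity-Hessian approximation: influence(z) ~ -grad(metric) . grad(loss(z))."""
    params = [p for p in model.parameters() if p.requires_grad]
    metric_grad = flat_grad(circuit_metric(model), params)
    scores = []
    for batch in train_batches:
        sample_grad = flat_grad(loss_fn(model, batch), params)
        scores.append(-torch.dot(metric_grad, sample_grad).item())
    return scores  # extreme scores flag the most influential samples for intervention
```

High-influence samples identified this way are the ones that get removed or augmented to test whether circuit emergence really depends on them.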

Result: Targeted interventions significantly modulate emergence of interpretable heads while random interventions show no effect. Repetitive structural data (LaTeX, XML) acts as mechanistic catalyst. Interventions on induction head formation cause concurrent changes in in-context learning capability, providing causal evidence for functional link between induction heads and ICL.

Conclusion: Proposes mechanistic data augmentation pipeline that accelerates circuit convergence across model scales, providing principled methodology for steering LLM developmental trajectories. Demonstrates causal connections between specific training data patterns and emergent model capabilities.

Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention (removing or augmenting a small fraction of high-influence samples) significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model’s in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

[62] When “Better” Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Daniel Commey

Main category: cs.CL

TL;DR: Paper presents an evaluation-driven workflow (Define, Test, Diagnose, Fix) and Minimum Viable Evaluation Suite (MVES) for testing LLM applications, with experiments showing trade-offs in prompt optimization.

DetailsMotivation: LLM applications differ from traditional software testing due to stochastic outputs, high-dimensional nature, and sensitivity to prompts/model changes, requiring specialized evaluation approaches.

Method: Introduces MVES with tiered evaluation components for general LLM apps, RAG, and agentic workflows; synthesizes evaluation methods (automated checks, human rubrics, LLM-as-judge); conducts reproducible experiments with Ollama using Llama 3 8B and Qwen 2.5 7B.
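
For flavor, two automated checks of the kind such a suite aggregates; the field names, heuristics, and thresholds below are invented for illustration and are not the paper's test cases.

```python
import json

def check_json_extraction(response: str, required_keys=("name", "date", "amount")) -> bool:
    """Automated check: the model must return valid JSON containing the expected keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in required_keys)

def check_rag_compliance(response: str, retrieved_chunks) -> bool:
    """Crude grounding check: every sentence must share at least one word with the
    retrieved context (a cheap proxy for 'answer only from the provided context')."""
    context_words = set(" ".join(retrieved_chunks).lower().split())
    sentences = [s for s in response.split(".") if s.strip()]
    return all(set(s.lower().split()) & context_words for s in sentences)

def pass_rate(outputs, check) -> float:
    """Aggregate a binary check over a suite into a single pass rate."""
    return sum(check(o) for o in outputs) / max(len(outputs), 1)

# e.g. pass_rate(model_outputs, check_json_extraction), or wrap the RAG check in a
# lambda that carries the retrieved chunks for each example.
```

Running the same checks before and after a prompt change is exactly what surfaces the trade-offs reported in the results below.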

Result: Experiments show generic “improved” prompt templates trade off behaviors: extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3, while instruction-following improved.

Conclusion: Evaluation-driven prompt iteration and careful claim calibration are needed rather than universal prompt recipes; provides reproducible test suites and harnesses.

Abstract: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic “improved” prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.

[63] Causal Autoregressive Diffusion Language Model

Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, JingBo Zhu

Main category: cs.CL

TL;DR: CARD is a framework that combines ARMs’ training efficiency with diffusion models’ parallel inference using causal attention masks and dynamic parallel decoding.

DetailsMotivation: To bridge the gap between autoregressive models (efficient training but slow sequential inference) and diffusion models (parallel inference but inefficient training) for next-generation LLMs.

Method: Reformulates diffusion process with causal attention masks for dense per-token supervision, introduces soft-tailed masking to preserve local context, and uses context-aware reweighting from signal-to-noise principles for dynamic parallel decoding with KV-caching.

Result: Outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods, achieving ARM-level data efficiency with parallel generation benefits.

Conclusion: CARD establishes a robust paradigm for efficient LLMs by unifying training efficiency of ARMs with parallel inference capabilities of diffusion models.

Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3 $\times$ compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.

[64] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T, Greg Ver Steeg, Yue Dong

Main category: cs.CL

TL;DR: Masked diffusion language models (MDLMs) maintain accuracy when output order conflicts with natural reasoning, unlike autoregressive models which suffer large accuracy drops when forced to produce answers before explanations.

DetailsMotivation: Autoregressive language models have a fundamental limitation: they enforce fixed left-to-right generation order, which creates problems when required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). This forces premature commitment to answers before generating intermediate reasoning.

Method: The paper proposes using masked diffusion language models (MDLMs) which iteratively refine all tokens in parallel, decoupling computation order from output structure. They validate this on GSM8K, Math500, and introduce ReasonOrderQA benchmark with controlled difficulty and order-level evaluation. They analyze how MDLMs achieve order robustness by examining token stabilization patterns during diffusion.

Result: When prompts request answers before reasoning, AR models exhibit large accuracy gaps (up to 67% relative drop) compared to standard chain-of-thought ordering, while MDLMs remain stable (≤14% relative drop). MDLMs achieve order robustness by stabilizing simpler tokens (reasoning steps) earlier in diffusion than complex ones (final answers), enabling reasoning before answer commitment.

Conclusion: MDLMs offer order robustness that AR models lack, maintaining accuracy when output structure conflicts with natural reasoning order. The paper identifies failure conditions where this advantage weakens, outlining limits for order robustness.

Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable ($\leq$14% relative drop), a property we term “order robustness”. Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.

[65] A Separable Architecture for Continuous Token Representation in Language Models

Reza T. Batley, Sourav Saha

Main category: cs.CL

TL;DR: Leviathan introduces a continuous embedding generator to replace discrete lookup tables in small language models, improving parameter efficiency and performance.

DetailsMotivation: In sub-billion-parameter small language models, embedding matrices dominate the parameter budget, which is suboptimal. The paper argues that treating parameters as interchangeable in scaling laws overlooks this inefficiency in SLMs.

Method: Proposes Leviathan architecture with a continuous embedding generator instead of discrete lookup tables. Evaluates on Pile dataset under isoparametric settings compared to standard LLaMA-style architecture.
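
The paper's generator design is not detailed in this summary, but the core idea (replacing a V x d lookup table with a small network that produces embeddings on demand) can be sketched as follows; the fixed random token codes and the MLP are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContinuousEmbedding(nn.Module):
    """Illustrative replacement for nn.Embedding: each token id is mapped to a fixed
    low-dimensional code, and a small MLP generates its d_model-sized embedding, so the
    trainable parameter count scales with the MLP rather than the vocabulary size."""

    def __init__(self, vocab_size: int, code_dim: int = 64, hidden: int = 256, d_model: int = 512):
        super().__init__()
        # Fixed (non-trainable) random codes stand in for whatever per-token
        # features the real architecture derives.
        self.register_buffer("codes", torch.randn(vocab_size, code_dim))
        self.generator = nn.Sequential(
            nn.Linear(code_dim, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.generator(self.codes[token_ids])  # (..., d_model)

emb = ContinuousEmbedding(vocab_size=50_000)
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 512])
# ~0.15M trainable parameters versus 25.6M for a 50k x 512 lookup table.
```

This is where the parameter-efficiency argument comes from: the embedding budget no longer grows linearly with vocabulary size.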

Result: Leviathan consistently outperforms standard architecture. Empirical power-law fit shows superior effective parameter capacity - behaves like a dense model with 1.47 to 2.11× more parameters across studied regime.

Conclusion: Continuous embedding generators can significantly improve parameter efficiency in small language models, challenging the assumption that parameters are interchangeable in transformer scaling laws.

Abstract: Transformer scaling law analyses typically treat parameters as interchangeable; an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with $1.47$ to $2.11 \times$ more parameters.

[66] On the Paradoxical Interference between Instruction-Following and Task Solving

Yunjia Qi, Hao Peng, Xintong Shi, Amy Xin, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li

Main category: cs.CL

TL;DR: Instruction following can paradoxically harm LLM task performance when self-evident constraints are added, measured by SUSTAINSCORE metric showing performance drops across math, QA, and code tasks.

DetailsMotivation: To investigate the counterintuitive phenomenon where instruction following interferes with LLMs' task-solving capabilities, despite instruction following being designed to align models with human intent.

Method: Proposed SUSTAINSCORE metric to quantify interference by measuring performance drop after inserting self-evident constraints extracted from original successful outputs. Tested on LLMs across mathematics, multi-hop QA, and code generation tasks, analyzed attention patterns, and investigated different post-training paradigms.
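
One plausible way to operationalize such a metric, shown only as a sketch: the fraction of originally solved tasks that remain solved once a constraint already satisfied by the original answer is added. The helper names are assumptions, and the paper's exact definition of SUSTAINSCORE may differ.

```python
def sustain_score(solve, tasks, add_constraint, is_correct) -> float:
    """`solve(prompt)` queries the model; `add_constraint(prompt, answer)` inserts a
    self-evident constraint extracted from (and satisfied by) the original answer;
    `is_correct(task, answer)` checks task success."""
    sustained, solved = 0, 0
    for task in tasks:
        answer = solve(task["prompt"])
        if not is_correct(task, answer):
            continue                                   # only originally successful cases count
        solved += 1
        constrained_prompt = add_constraint(task["prompt"], answer)
        if is_correct(task, solve(constrained_prompt)):
            sustained += 1
    return sustained / max(solved, 1)                  # 1.0 means no interference
```

A score well below 1.0 means the model loses problems it could already solve merely because an innocuous instruction was attached.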

Result: Adding self-evident constraints leads to substantial performance drops even for advanced models like Claude-Sonnet-4.5. Failed cases allocate significantly more attention to constraints than successful ones. Different alignment strategies show varying susceptibility to interference.

Conclusion: Instruction following can paradoxically harm task performance, revealing limitations in current alignment approaches. The SUSTAINSCORE metric provides a tool to study this phenomenon and evaluate alignment strategies.

Abstract: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs’ task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures task performance drop after inserting into the instruction a self-evident constraint, which is naturally met by the original successful model output and extracted from it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.

[67] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

Ghazal Kalhor, Behnam Bahrak

Main category: cs.CL

TL;DR: MasalBench is a benchmark for evaluating LLMs’ understanding of Persian proverbs in context and cross-cultural analogical reasoning to equivalent English proverbs, revealing limitations in cultural knowledge for low-resource languages.

DetailsMotivation: While LLMs are increasingly used in daily life, their ability to understand conversational language rules, particularly figurative language in low-resource languages like Persian, remains underexplored. Previous work focused on high-resource languages, leaving a gap in evaluating cross-cultural understanding.

Method: Created MasalBench, a comprehensive benchmark for assessing LLMs’ contextual understanding of Persian proverbs and cross-cultural analogical reasoning to equivalent English proverbs. Evaluated eight state-of-the-art LLMs on this benchmark.

Result: LLMs performed well in identifying Persian proverbs in context (accuracies above 0.90), but performance dropped significantly when identifying equivalent English proverbs (best model achieved 0.79 accuracy). This reveals limitations in cultural knowledge and analogical reasoning.

Conclusion: Current LLMs have significant limitations in cross-cultural understanding and analogical reasoning for low-resource languages. MasalBench provides a framework for assessing these capabilities and highlights the need for better cultural knowledge integration in LLMs.

Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs’ understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs’ contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at https://github.com/kalhorghazal/MasalBench.

[68] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu, Zimeng Chen, Menglan Chen, Wen Yao, Yang Yang, Ying Wen, Siheng Chen

Main category: cs.CL

TL;DR: G²-Reader is a dual-graph system for multimodal document QA that uses a Content Graph to preserve document structure and cross-modal semantics, and a Planning Graph to track sub-questions and guide evidence retrieval.

DetailsMotivation: Current retrieval-augmented generation methods for multimodal documents have two key limitations: 1) flat chunking breaks document structure and cross-modal alignment, creating semantic fragments, and 2) iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections without persistent global search state.

Method: Introduces G²-Reader with two complementary graphs: Content Graph preserves document-native structure and cross-modal semantics, while Planning Graph maintains an agentic directed acyclic graph of sub-questions to track intermediate findings and guide stepwise navigation for evidence completion.

Result: On VisDoMBench across five multimodal domains, G²-Reader with Qwen3-VL-32B-Instruct achieves 66.21% average accuracy, outperforming strong baselines and standalone GPT-5 (53.08%).

Conclusion: The dual-graph approach effectively addresses structural preservation and retrieval guidance challenges in multimodal document question answering, demonstrating significant performance improvements over existing methods.

Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).

[69] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao

Main category: cs.CL

TL;DR: VTC-R1 converts long textual reasoning traces into compact images as “optical memory” for vision-language models, achieving 3.4x token compression and 2.7x speedup in reasoning tasks.

DetailsMotivation: Long-context reasoning in LLMs creates severe efficiency bottlenecks due to computational complexity. Existing approaches require complex training or external compression models, limiting scalability and losing fine-grained information.

Method: Proposes VTC-R1 paradigm that renders intermediate reasoning segments into compact images as “optical memory,” iteratively feeding them into vision-language models. Uses OpenR1-Math-220K dataset for training and fine-tunes VLMs like Glyph and Qwen3-VL.
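
A toy version of the "render reasoning into an image" step using Pillow; the rendering parameters, segment boundaries, and the way the image is fed back to the VLM are all illustrative assumptions rather than the paper's pipeline.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_reasoning_to_image(reasoning: str, width: int = 768,
                              line_height: int = 18, margin: int = 16) -> Image.Image:
    """Render a reasoning segment as a compact image ("optical memory") that a
    vision-language model can re-read in place of the raw text tokens.
    Uses Pillow's default bitmap font; line_height only controls the layout."""
    font = ImageFont.load_default()
    lines = textwrap.wrap(reasoning, width=100)
    height = 2 * margin + line_height * max(len(lines), 1)
    image = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(image)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return image

# The rendered image then replaces the textual trace in the next VLM call, e.g. a
# multimodal message containing [question_text, rendered_image].
```

The compression comes from the fact that an image of the trace costs the VLM far fewer tokens than the thousands of text tokens it depicts.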

Result: Achieves 3.4x token compression, outperforms standard long-context reasoning on MATH500, AIME25, AMC23, and GPQA-D benchmarks, with 2.7x speedup in end-to-end latency.

Conclusion: VTC-R1 offers a scalable solution for reasoning-intensive applications by efficiently compressing textual reasoning into visual representations for vision-language models.

Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as “optical memory.” We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.

[70] ECO: Quantized Training without Full-Precision Master Weights

Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni

Main category: cs.CL

TL;DR: ECO eliminates master weights in quantized LLM training by directly applying updates to quantized parameters and using error-feedback through momentum, reducing memory overhead while maintaining accuracy.

DetailsMotivation: Current quantization methods still require high-precision master weights for gradient updates, causing substantial memory overhead, especially for Sparse Mixture of Experts (SMoE) models where parameters and optimizer states dominate memory usage.

Method: ECO (Error-Compensating Optimizer) eliminates master weights by applying updates directly to quantized parameters, quantizing weights after each step, and injecting quantization error into optimizer momentum to form an error-feedback loop without additional memory.
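
A simplified momentum-SGD variant showing the error-feedback pattern described above: the only stored weights are quantized, each step is re-quantized, and the discarded part of the step is folded back into the momentum. ECO's actual optimizer, quantization format, and injection rule are more involved, so treat this purely as a sketch of the idea.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (a stand-in for FP8/INT4 formats)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax, qmax) * scale

@torch.no_grad()
def eco_like_step(w_q: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                  lr: float = 1e-2, beta: float = 0.9, num_bits: int = 8):
    """One step without master weights: w_q is the only copy of the parameters."""
    momentum.mul_(beta).add_(grad)             # standard momentum accumulation
    target = w_q - lr * momentum               # where a full-precision step would land
    w_new = fake_quantize(target, num_bits)    # snap back onto the quantized grid
    error = target - w_new                     # the part of the step quantization discarded
    momentum.sub_(error / lr)                  # error feedback: bias later steps to recover it
    w_q.copy_(w_new)
    return w_q, momentum
```

Because the error is carried in the existing momentum buffer, the scheme adds no extra optimizer state, which is the memory argument made above.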

Result: ECO matches baselines with master weights up to near-lossless accuracy across various models (30-800M Transformers, Gemma-3 1B, 2.1B SMoE with FP8, and fine-tuning DeepSeek-MoE-16B in INT4), significantly improving the memory vs validation loss Pareto frontier.

Conclusion: ECO successfully eliminates the memory overhead of master weights while maintaining training accuracy through error-feedback mechanisms, making it particularly valuable for memory-constrained training of large models like SMoEs.

Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

[71] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Anran Li, Yuanyuan Chen, Wenjun Long, Yu Yin, Yan Hu, Hyunjae Kim, Weipeng Zhou, Yujia Zhou, Hongyi Peng, Yang Ren, Xuguang Ai, Zhenyue Qin, Ming Hu, Xiaoxiao Li, Han Yu, Yih-Chung Tham, Lucila Ohno-Machado, Hua Xu, Qingyu Chen

Main category: cs.CL

TL;DR: Fed-MedLoRA framework enables parameter-efficient federated learning for medical LLMs using low-rank adapters and adaptive aggregation to handle data heterogeneity across healthcare institutions.

DetailsMotivation: Medical LLMs trained on single-institution data lack generalizability and safety in heterogeneous healthcare systems. Federated learning could enable collaborative development but faces challenges with large model sizes and data heterogeneity across institutions.

Method: Fed-MedLoRA transmits only low-rank adapter parameters instead of full LLMs, reducing communication overhead. Fed-MedLoRA+ adds adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. Applied to clinical information extraction from patient narratives.
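
A bare-bones server-side aggregation sketch: only the LoRA adapter tensors travel between sites, and the averaging weights can be made data-aware. The parameter names and the simple size-based weighting below are illustrative, not the paper's exact aggregation rule.

```python
import torch

def aggregate_lora_adapters(client_adapters, client_sizes):
    """Weighted average of LoRA adapter parameters (the low-rank A/B matrices only).
    `client_adapters[i]` maps adapter parameter names to tensors from site i; the
    frozen base-model weights never leave the clients or the server."""
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]   # size-based weighting; a data-aware
                                                  # rule (as in Fed-MedLoRA+) could replace it
    merged = {}
    for key in client_adapters[0]:
        merged[key] = sum(w * adapters[key] for w, adapters in zip(weights, client_adapters))
    return merged

# Each round: sites fine-tune local adapters on their own notes, upload only those
# tensors, and receive the merged adapter back, keeping communication proportional
# to the adapter size rather than the full LLM.
```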

Result: Evaluated across five patient cohorts, comparing with BERT, LLaMA-3, DeepSeek-R1, and GPT-4o models in three settings: in-domain training/testing, external validation, and low-resource new-site adaptation using real-world clinical notes.

Conclusion: The framework enables efficient federated adaptation of LLMs for medical applications while addressing computational constraints and data heterogeneity challenges in real-world healthcare settings.

Abstract: Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT, LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.

[72] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

Main category: cs.CL

TL;DR: PIR transforms LLMs from passive solvers to proactive inquirers that interleave reasoning with clarification, addressing premise- and intent-level uncertainty through direct user interaction.

DetailsMotivation: Current reasoning-oriented LLMs suffer from "blind self-thinking" where they perform extensive internal reasoning even when critical information is missing or ambiguous, leading to unreliable outputs.

Method: Two core components: (1) uncertainty-aware supervised fine-tuning to equip models with interactive reasoning capability, and (2) user-simulator-based policy optimization with composite reward aligning model behavior with user intent.

Result: PIR outperforms baselines on mathematical reasoning, code generation, and document editing with up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of reasoning computation and unnecessary interactions.

Conclusion: PIR provides a new reasoning paradigm that makes LLMs more reliable and efficient by enabling proactive clarification-seeking behavior, with strong generalization across various reasoning tasks.

Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

[73] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Ajay Patel, Colin Raffel, Chris Callison-Burch

Main category: cs.CL

TL;DR: FineInstructions: A method to generate billions of synthetic instruction-response pairs from pre-training documents to enable LLM pre-training solely with instruction-tuning objective.

DetailsMotivation: Overcome limited supervised instruction-tuning data by transforming knowledge from internet-scale pre-training documents into synthetic instruction-answer pairs, making pre-training more aligned with downstream usage.

Method: Create ~18M instruction templates from real user queries, match them to human-written source documents from pre-training corpora, and instantiate templates to generate billions of synthetic instruction-response training pairs.

Result: Pre-training on FineInstructions outperforms standard pre-training and other synthetic pre-training techniques on benchmarks measuring free-form response quality.

Conclusion: Large-scale synthetic instruction data enables effective pre-training solely with instruction-tuning objective, better aligning LLM training with downstream usage patterns.

Abstract: Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised “predict the next word” objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of “instruction-tuning” data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With “supervised” synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

[74] DynaWeb: Model-Based Reinforcement Learning of Web Agents

Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu

Main category: cs.CL

TL;DR: DynaWeb is a model-based reinforcement learning framework that trains web agents using a learned world model of web environments, enabling efficient simulated interaction instead of costly live internet access.

DetailsMotivation: Training autonomous web agents on the live internet is inefficient, costly, and risky. Current approaches require direct interaction with real websites, which limits scalability and introduces practical challenges.

Method: DynaWeb learns a world model that predicts web page representations given agent actions, creating a synthetic web environment. Agents can then “dream” by generating rollout trajectories in this simulated environment, combined with real expert trajectories for improved stability and sample efficiency.

Result: Experiments on WebArena and WebVoyager benchmarks show DynaWeb consistently and significantly improves performance of state-of-the-art open-source web agent models.

Conclusion: The framework establishes the viability of training web agents through imagination, offering a scalable and efficient approach to scale up online agentic reinforcement learning.

Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

[75] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu

Main category: cs.CL

TL;DR: HALO pipeline converts pre-trained Transformers into efficient RNN-attention hybrid models (HypeNet) with novel HyPE position encoding, achieving comparable performance with superior long-context efficiency using minimal training data.

DetailsMotivation: Hybrid Transformer-RNN architectures offer good performance-throughput tradeoffs for long-context modeling, but are hindered by expensive pre-training costs. Existing conversion methods require massive training data and produce poor long-context performance, which is precisely where hybrid models should excel.

Method: Proposes HALO (Hybrid Attention via Layer Optimization) pipeline for distilling Transformer models into RNN-attention hybrids. Introduces HypeNet architecture with novel HyPE position encoding scheme and architectural modifications for better length generalization. Converts Qwen3 series using only 2.3B tokens.

Result: Successfully converted Qwen3 series into HypeNet, achieving performance comparable to original Transformer models while enjoying superior long-context performance and efficiency. Conversion required minimal data (2.3B tokens, <0.01% of pre-training data).

Conclusion: HALO enables efficient conversion of pre-trained Transformers into hybrid models with strong long-context capabilities, addressing the computational barrier to hybrid model adoption while maintaining performance and improving efficiency.

Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

[76] Scaling Laws for Downstream Task Performance of Large Language Models

Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

Main category: cs.CL

TL;DR: Scaling laws for transfer learning in LLMs, focusing on how pretraining data size and distribution alignment affect downstream machine translation performance.

DetailsMotivation: Existing scaling laws focus on pretraining loss, but in transfer learning settings, downstream performance matters more. Need to understand how pretraining data characteristics affect downstream tasks like machine translation.

Method: Study scaling behavior in transfer learning by finetuning LLMs on machine translation tasks. Investigate how pretraining data choice and size affect downstream metrics (cross-entropy, BLEU, COMET). Analyze distribution alignment between pretraining and downstream data.
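
As a concrete illustration of how such a prediction might be fit, the sketch below fits a log-law of the form f(D_p) = (log A + alpha * log D_p)^beta to hypothetical (pretraining-tokens, BLEU) pairs; the data points are invented and the exact parameterization used in the paper may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def bleu_log_law(d_pretrain, A, alpha, beta):
    """Candidate log-law: f(D_p) = (log A + alpha * log D_p) ** beta, clipped to stay positive."""
    inner = np.log(A) + alpha * np.log(d_pretrain)
    return np.clip(inner, 1e-9, None) ** beta

# Hypothetical measurements: BLEU of finetuned checkpoints vs. pretraining tokens.
d_p = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
bleu = np.array([18.0, 21.5, 24.0, 26.0, 27.5])

params, _ = curve_fit(bleu_log_law, d_p, bleu, p0=[1e-8, 2.0, 1.0],
                      bounds=([1e-30, 0.0, 0.1], [1.0, 5.0, 5.0]))
A, alpha, beta = params
print(f"A={A:.3g}, alpha={alpha:.3g}, beta={beta:.3g}")
print(f"extrapolated BLEU at 3e10 tokens: {bleu_log_law(3e10, A, alpha, beta):.1f}")
```

The abstract's caveat applies: such a fit is only trustworthy when the pretraining and downstream distributions are sufficiently aligned; under misalignment the translation metric can fluctuate and no monotone law holds.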

Result: With sufficient alignment, downstream metrics improve monotonically with more pretraining data, predictable via log-law. Moderate misalignment causes translation scores to fluctuate or worsen despite improving cross-entropy. Finetuning dataset size and distribution alignment significantly influence scaling behavior.

Conclusion: Provides practical insights for choosing pretraining data in transfer learning settings. Distribution alignment between pretraining and downstream data is crucial for predictable scaling behavior and optimal downstream performance.

Abstract: Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in which LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, we often also care about the downstream performance. In this work, we study the scaling behavior in a transfer learning setting, where LLMs are finetuned for machine translation tasks. Specifically, we investigate how the choice of the pretraining data and its size affect downstream performance (translation quality) as judged by: downstream cross-entropy and translation quality metrics such as BLEU and COMET scores. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and translation quality scores improve monotonically with more pretraining data. In such cases, we show that it is possible to predict the downstream translation quality metrics with good accuracy using a log-law. However, there are cases where moderate misalignment causes the downstream translation scores to fluctuate or get worse with more pretraining, whereas downstream cross-entropy monotonically improves. By analyzing these, we provide new practical insights for choosing appropriate pretraining data.

[77] EVA-Score: Evaluating Abstractive Long-form Summarization on Informativeness through Extraction and Validation

Yuchen Fan, Yazhe Wan, Xin Zhong, Haonan Cheng, Ning Ding, Bowen Zhou

Main category: cs.CL

TL;DR: EVA-Score: A new evaluation metric for abstractive long-form summarization that extracts and compares information content between generated and reference summaries, showing higher human correlation than existing metrics.

DetailsMotivation: Current evaluation metrics for long-form summarization are inadequate - similarity-based metrics (ROUGE, BERTScore) fail to measure informativeness, while LLM-based metrics lack quantitative analysis, are subjective, and struggle with long contexts.

Method: Proposes EVA-Score which extracts all information from summaries, identifies overlapped information based on reference summaries, and calculates an information score to quantitatively measure informativeness.
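
A stripped-down rendition of the extract-and-match idea; `extract_facts` and `entails` are hypothetical stand-ins (in practice an LLM extractor and an NLI-style validator), and EVA-Score's real extraction and validation steps are more elaborate.

```python
def information_score(extract_facts, entails, candidate: str, reference: str) -> dict:
    """Score a candidate summary by the overlap of its atomic information units with a reference.
    `extract_facts(text)` -> list of atomic statements; `entails(text, fact)` -> bool."""
    cand_facts = extract_facts(candidate)
    ref_facts = extract_facts(reference)
    recall = sum(entails(candidate, f) for f in ref_facts) / max(len(ref_facts), 1)
    precision = sum(entails(reference, f) for f in cand_facts) / max(len(cand_facts), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "information_score": f1}
```

Scoring at the level of extracted facts, rather than token overlap, is what lets the metric reward a summary for covering reference information it merely paraphrases.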

Result: EVA-Score shows the highest correlation with human judgments compared to existing metrics on several datasets. Re-evaluation of LLMs on long-form summarization reveals they still lag behind human-written answers in information content.

Conclusion: EVA-Score provides a more effective way to evaluate abstractive long-form summarization by focusing on information content rather than surface similarity, with implications for improving LLM evaluation and future automatic assessment methods.

Abstract: Since LLMs emerged, abstractive long-form summarization has drawn growing attention, as longer input sequences carry more information to be summarized. Nevertheless, the automatic evaluation of such summaries remains underexplored. Current evaluation metrics either rely on similarity-based measures like ROUGE and BERTScore, which capture only surface-level similarity and fail to consider informativeness, or on LLM-based metrics driven by prompts or pre-defined schemas, which are not robust, lack a quantitative analysis of informative richness, are rather subjective and hard to explain, and are easily overwhelmed by long contexts. In this paper, we propose a new evaluation metric, EVA-Score, which extracts all information from the given summaries, identifies the information overlapping with the reference, and calculates an information score. We test EVA-Score on several datasets, and the experimental results reveal that EVA-Score shows the highest correlation with human judgments. We also re-evaluate the performance of LLMs on long-form summarization from the information perspective; the results indicate that LLM responses still fall short of human-written answers. Moreover, we provide a detailed analysis of the effectiveness of EVA-Score, forecasting future ways to automatically evaluate abstractive long-form summarization.

[78] FLAME: Empowering Frozen LLMs for Knowledge Graph Completion

Bo Xue, Yi Xu, Bolei Ma, Yunchong Song, Jiaxin Ding, Luoyi Fu, Xinbing Wang

Main category: cs.CL

TL;DR: FLAME is a framework that extracts context-aware hidden states from frozen LLMs to train efficient knowledge graph completion classifiers, bridging semantic gaps and achieving fine-tuned performance with massive efficiency gains.

DetailsMotivation: Traditional KGC methods struggle with sparsity using only structural information, while LLMs offer rich world knowledge but face trade-offs: fine-tuning is effective but costly, while non-fine-tuned LLMs are efficient but suboptimal.

Method: Extracts context-aware hidden states from intermediate layers of frozen LLMs to train data-efficient KGC classifiers. Bridges LLM-KG semantic gaps via subgraph-based entity descriptions and uses sliced mutual information to quantify task-relevant information in representations.
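
A compact sketch of the pipeline: pull hidden states from an intermediate layer of a frozen decoder-only LLM (via Hugging Face transformers) and fit a lightweight classifier on them. The model name, layer index, pooling, and verbalized-triple inputs are illustrative, and the SMI-based layer selection is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"  # small stand-in for the frozen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def layer_features(texts, layer: int = 6) -> torch.Tensor:
    """Last-token hidden state from an intermediate layer of the frozen model."""
    feats = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        out = model(**inputs, output_hidden_states=True)
        feats.append(out.hidden_states[layer][0, -1])   # (hidden_dim,)
    return torch.stack(feats)

# Verbalized triples (with subgraph-based entity descriptions prepended in the real
# setup) become inputs; labels mark whether the triple is plausible.
texts = ["Paris | capital_of | France", "Paris | capital_of | Italy"]
labels = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(layer_features(texts).numpy(), labels)
```

Because the LLM is never updated, only forward passes are needed at training time, which is where the memory and speed gains over fine-tuning come from.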

Result: Achieves 47% improvement over non-fine-tuned LLM baselines, and is the first to achieve fine-tuned performance with 188× memory efficiency and 26.11× speedup.

Conclusion: FLAME successfully addresses the trade-off between effectiveness and efficiency in LLM-based KGC, enabling fine-tuned performance without the computational costs of full fine-tuning.

Abstract: Traditional knowledge graph completion (KGC) methods rely solely on structural information and struggle with sparsity, while Large Language Models (LLMs) address these limitations through rich world knowledge and strong context modeling. Fine-tuning LLMs is effective but costly, while non-fine-tuned LLMs are efficient but suboptimal. To address this trade-off, we propose \textbf{FLAME}, a framework that extracts context-aware hidden states from intermediate layers of frozen LLMs to train data-efficient KGC classifiers. We bridge LLM-KG semantic gaps via subgraph-based entity descriptions and employ sliced mutual information (SMI) to quantify task-relevant information in representations. Experiments demonstrate that FLAME achieves 47% improvement over non-fine-tuned LLM baselines and, to our knowledge, is the first to achieve fine-tuned performance with $188\times$ memory efficiency and $26.11\times$ speedup.

[79] Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Yutao Hou, Yajing Luo, Zhiwen Ruan, Hongru Wang, Weifeng Ge, Yun Chen, Guanhua Chen

Main category: cs.CL

TL;DR: A benchmark called Compound-QA is introduced to evaluate LLMs on complex compound questions composed of multiple interrelated sub-questions, showing that current models perform significantly worse on compound questions than simple ones.

DetailsMotivation: Most existing LLM benchmarks measure performance on individual questions, but real-world applications involve complex interactions and compound questions. There's a need to evaluate LLMs on more realistic, multi-faceted questions that require deeper understanding and reasoning.

Method: Developed Compound Question Synthesis (CQ-Syn) to create Compound-QA benchmark from existing QA datasets, annotated with proprietary LLMs and verified by humans. The benchmark includes five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion, evaluating LLMs across understanding, reasoning, and knowledge dimensions.

Result: Evaluation of nine open-source LLMs shows significantly lower performance on compound questions compared to non-compound questions. The paper explores enhancement strategies that substantially improve models’ comprehension and reasoning abilities on compound questions.

Conclusion: Compound-QA provides a valuable benchmark for evaluating LLMs on realistic complex questions, revealing current limitations and suggesting methods for improvement in handling compound reasoning tasks.

Abstract: Large language models (LLMs) demonstrate remarkable performance across various tasks, prompting researchers to develop diverse evaluation benchmarks. However, most benchmarks typically measure the ability of LLMs to respond to individual questions, neglecting the complex interactions in real-world applications. We introduce Compound Question Synthesis (CQ-Syn) to build Compound-QA, a benchmark targeting questions composed of multiple interrelated sub-questions. This benchmark is derived from existing QA datasets, annotated with proprietary LLMs, and verified by humans for accuracy. It encompasses five categories: Factual-Statement, Cause-and-Effect, Hypothetical-Analysis, Comparison-and-Selection, and Evaluation-and-Suggestion. It evaluates the LLM capability in terms of three dimensions, including understanding, reasoning, and knowledge. Evaluating nine open-source LLMs on Compound-QA reveals that their performance on compound questions is notably lower than on non-compound questions. We further explore strategies to enhance LLMs’ handling of compound questions, and our results show that these methods substantially improve models’ comprehension and reasoning abilities.

[80] GORAG: Graph-based Online Retrieval Augmented Generation for Dynamic Few-shot Social Media Text Classification

Yubo Wang, Haoyang Li, Fei Teng, Lei Chen

Main category: cs.CL

TL;DR: GORAG: Graph-based Online Retrieval-Augmented Generation framework for dynamic few-shot text classification using weighted graphs of keywords and labels with tailored minimum-cost spanning trees for context retrieval.

DetailsMotivation: Traditional models like BERT fail in dynamic few-shot settings where labeled data is scarce and target labels evolve frequently. LLMs show promise but suffer from performance degradation due to increased input size in dynamic evolving scenarios.

Method: Constructs and maintains a weighted graph of keywords and text labels with correlations as edges. Uses edge weighting mechanism to prioritize importance/reliability of extracted information. Dynamically retrieves relevant context using tailored minimum-cost spanning tree for each input.
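
An illustrative reduction of the retrieval step with networkx: keywords extracted from the input connect to candidate labels in a weighted graph, and a minimum spanning tree over the induced subgraph picks a compact set of context edges. The toy keywords, labels, and weights are invented, and GORAG's construction and edge-weighting are more sophisticated.

```python
import networkx as nx

# Toy keyword/label graph; lower edge weights mean more important/reliable links.
G = nx.Graph()
G.add_edge("vaccine", "misinformation", weight=0.3)
G.add_edge("vaccine", "health_advice", weight=0.8)
G.add_edge("microchip", "misinformation", weight=0.2)
G.add_edge("giveaway", "scam", weight=0.4)

def retrieve_context(graph: nx.Graph, input_keywords):
    """Induce the subgraph around the input's keywords (and their neighbours) and keep
    only a minimum-cost spanning tree of it as the retrieved context."""
    nodes = set()
    for kw in input_keywords:
        if kw in graph:
            nodes.add(kw)
            nodes.update(graph.neighbors(kw))
    mst = nx.minimum_spanning_tree(graph.subgraph(nodes), weight="weight")
    return sorted(mst.edges(data="weight"), key=lambda edge: edge[2])

print(retrieve_context(G, ["vaccine", "microchip"]))
# The selected edges are verbalized and appended to the LLM prompt, keeping the
# added context small even as the label space evolves.
```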

Result: Empirical evaluations show GORAG outperforms existing approaches by providing more comprehensive and precise contextual information.

Conclusion: GORAG effectively addresses challenges in dynamic few-shot text classification through graph-based retrieval-augmented generation, offering better performance than existing methods.

Abstract: Text classification is vital for Web for Good applications like hate speech and misinformation detection. However, traditional models (e.g., BERT) often fail in dynamic few-shot settings where labeled data are scarce, and target labels frequently evolve. While Large Language Models (LLMs) show promise in few-shot settings, their performance is often hindered by increased input size in dynamic evolving scenarios. To address these issues, we propose GORAG, a Graph-based Online Retrieval-Augmented Generation framework for dynamic few-shot text classification. GORAG constructs and maintains a weighted graph of keywords and text labels, representing their correlations as edges. To model these correlations, GORAG employs an edge weighting mechanism to prioritize the importance and reliability of extracted information and dynamically retrieves relevant context using a tailored minimum-cost spanning tree for each input. Empirical evaluations show GORAG outperforms existing approaches by providing more comprehensive and precise contextual information. Our code is released at: https://github.com/Wyb0627/GORAG.

[81] Detecting Greenwashing: A Natural Language Processing Literature Survey

Tom Calamai, Oana Balalau, Théo Le Guenedal, Fabian M. Suchanek

Main category: cs.CL

TL;DR: Survey of NLP approaches for detecting corporate greenwashing in climate communications, analyzing methodological foundations and identifying challenges in handling ambiguity and subjectivity.

Motivation: Greenwashing by corporations intentionally misleads about environmental impact; need for comprehensive methodological survey of NLP approaches to detect greenwashing in textual data, particularly corporate climate communication.

Method: Examines NLP problems as climate NLP tasks approximating greenwashing detection, including climate topic detection and deceptive pattern identification. Focuses on task formulation, dataset construction, and model evaluation methodologies.

Result: Reveals fragmented landscape: some subtasks achieve near-perfect performance in controlled settings, but tasks involving ambiguity, subjectivity, or reasoning remain challenging. No dataset of verified greenwashing cases currently exists.

Conclusion: Advancing automated greenwashing detection requires principled NLP methodologies combining reliable data annotations with interpretable model design. Future work should leverage third-party judgments and adopt decomposed pipelines supporting human oversight and traceable reasoning.

Abstract: Greenwashing refers to practices by corporations or governments that intentionally mislead the public about their environmental impact. This paper provides a comprehensive and methodologically grounded survey of natural language processing (NLP) approaches for detecting greenwashing in textual data, with a focus on corporate climate communication. Rather than treating greenwashing as a single, monolithic task, we examine the set of NLP problems, also known as climate NLP tasks, that researchers have used to approximate it, ranging from climate topic detection to the identification of deceptive communication patterns. Our focus is on the methodological foundations of these approaches: how tasks are formulated, how datasets are constructed, and how model evaluation influences reliability. Our review reveals a fragmented landscape: several subtasks now exhibit near-perfect performance under controlled settings, yet tasks involving ambiguity, subjectivity, or reasoning remain challenging. Crucially, no dataset of verified greenwashing cases currently exists. We argue that advancing automated greenwashing detection requires principled NLP methodologies that combine reliable data annotations with interpretable model design. Future work should leverage third-party judgments, such as verified media reports or regulatory records, to mitigate annotation subjectivity and legal risk, and adopt decomposed pipelines that support human oversight, traceable reasoning, and efficient model design.

[82] RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation

Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jiaming Shen, Jimeng Sun, Jiawei Han

Main category: cs.CL

TL;DR: RAS framework improves LLM reasoning by dynamically constructing question-specific knowledge graphs through iterative retrieval and structured knowledge building.

Motivation: LLMs struggle with multi-step reasoning due to unstructured retrieved context in RAG methods, and interpretability studies show importance of structured intermediate reasoning.

Method: Retrieval-And-Structuring (RAS) framework that interleaves targeted retrieval planning with incremental graph construction, building question-specific knowledge graphs through iterative retrieval.
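A rough sketch of the retrieve-then-structure loop follows, assuming the planner, retriever, triple extractor, and answer generator are provided as callables; `plan_query`, `retrieve`, `extract_triples`, and `answer` are hypothetical stand-ins for LLM and retriever calls, not the authors' API.

```python
import networkx as nx

def ras_answer(question, plan_query, retrieve, extract_triples, answer, max_hops=4):
    kg = nx.MultiDiGraph()                    # question-specific knowledge graph
    for _ in range(max_hops):
        sub_query = plan_query(question, kg)  # decide what information is still missing
        if sub_query is None:                 # planner judges the graph sufficient
            break
        passages = retrieve(sub_query)
        for head, relation, tail in extract_triples(passages):
            kg.add_edge(head, tail, relation=relation)   # incremental graph construction
    return answer(question, kg)               # generate over the structured knowledge
```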

Result: On seven knowledge-intensive benchmarks, RAS outperforms strong baselines with up to 8.7% gains with proprietary LLMs and 7.0% gains with open-source LLMs.

Conclusion: Dynamic, question-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation.

Abstract: Large language models (LLMs) have achieved impressive performance on knowledge-intensive tasks, yet they often struggle with multi-step reasoning due to the unstructured nature of retrieved context. While retrieval-augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective. We propose Retrieval-And-Structuring (RAS), a framework that dynamically constructs question-specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge-intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 8.7% and 7.0% gains with proprietary and open-source LLMs, respectively. Our results demonstrate that dynamic, question-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation.

[83] Optimizing Multi-Hop Document Retrieval Through Intermediate Representations

Jiaen Lin, Jingyu Liu, Yingbo Liu

Main category: cs.CL

TL;DR: L-RAG improves multi-hop question answering by using intermediate layer representations from LLMs for retrieval, achieving comparable performance to multi-step methods with lower computational cost.

Motivation: Standard RAG struggles with complex multi-hop queries, and existing iterative approaches are computationally expensive. The authors observed that intermediate layers in LLMs contain richer information during reasoning, suggesting a more efficient approach.

Method: Proposes Layer-wise RAG (L-RAG) that leverages intermediate representations from middle layers of LLMs (which capture next-hop information) to retrieve external knowledge, rather than generating new internal queries iteratively.
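A minimal sketch of the core idea, querying the retriever with a middle-layer hidden state; the model (`gpt2` as a small stand-in), the layer choice, and last-token pooling are assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in; use the target LLM in practice
model = AutoModelForCausalLM.from_pretrained("gpt2")

def middle_layer_query(question, layer=None):
    inputs = tok(question, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states                 # tuple of (num_layers + 1) tensors [1, T, d]
    layer = layer or len(hidden) // 2          # default to a middle layer
    return hidden[layer][0, -1]                # last-token state as the retrieval query vector

# The returned vector can then be scored against document embeddings, e.g. by cosine similarity.
```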

Result: Outperforms existing RAG methods on multi-hop QA datasets (MuSiQue, HotpotQA, 2WikiMultiHopQA) while maintaining inference overhead similar to standard RAG.

Conclusion: L-RAG provides an efficient alternative to iterative RAG methods for multi-hop reasoning by exploiting intermediate layer representations, achieving strong performance with lower computational cost.

Abstract: Retrieval-augmented generation (RAG) encounters challenges when addressing complex queries, particularly multi-hop questions. While several methods tackle multi-hop queries by iteratively generating internal queries and retrieving external documents, these approaches are computationally expensive. In this paper, we identify a three-stage information processing pattern in LLMs during layer-by-layer reasoning, consisting of extraction, processing, and subsequent extraction steps. This observation suggests that the representations in intermediate layers contain richer information compared to those in other layers. Building on this insight, we propose Layer-wise RAG (L-RAG). Unlike prior methods that focus on generating new internal queries, L-RAG leverages intermediate representations from the middle layers, which capture next-hop information, to retrieve external knowledge. L-RAG achieves performance comparable to multi-step approaches while maintaining inference overhead similar to that of standard RAG. Experimental results show that L-RAG outperforms existing RAG methods on open-domain multi-hop question-answering datasets, including MuSiQue, HotpotQA, and 2WikiMultiHopQA. The code is available in https://github.com/Olive-2019/L-RAG

[84] Navigating the Rabbit Hole: Emergent Biases in LLM-Generated Attack Narratives Targeting Mental Health Groups

Rijul Magu, Arka Dutta, Sean Kim, Ashiqur R. KhudaBukhsh, Munmun De Choudhury

Main category: cs.CL

TL;DR: LLMs show structural biases that disproportionately target mental health groups, with mental health entities occupying central positions in attack narrative networks and exhibiting higher stigmatization in generation chains.

Motivation: While LLM biases against certain groups are known, unprovoked targeted attacks by LLMs towards at-risk populations (particularly mental health groups) remain underexplored. The paper aims to investigate how LLMs generate harmful discourse against vulnerable populations.

Method: Three-pronged approach: (1) explicit evaluation of LLM-generated attacks on mental health groups, (2) network-based framework to study bias propagation, and (3) assessment of stigmatization using established frameworks. Analyzed a large-scale bias audit dataset using network centrality measures (closeness centrality) and clustering analysis (Gini coefficient).
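For reference, the two statistics named above can be computed as below; the tiny graph is a toy stand-in for the attack-narrative network, not the audited dataset.

```python
import networkx as nx
import numpy as np

def gini(values):
    """Gini coefficient of a non-negative sample (0 = evenly spread, 1 = highly concentrated)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

G = nx.DiGraph([("insult", "depression"), ("depression", "anxiety"),
                ("anxiety", "depression"), ("slur", "depression")])
closeness = nx.closeness_centrality(G)           # how central each target entity is
print(closeness["depression"], gini(list(closeness.values())))
```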

Result: Mental health entities occupy central positions in attack narrative networks with significantly higher mean closeness centrality (p=4.06e-10) and dense clustering (Gini=0.7). Increased labeling components for mental health disorder-related targets relative to initial targets in generation chains, indicating heightened stigmatization.

Conclusion: LLMs have structural predilections to heighten harmful discourse against vulnerable mental health populations, highlighting the need for mitigation approaches to address these biases in language models.

Abstract: Large Language Models (LLMs) have been shown to demonstrate imbalanced biases against certain groups. However, the study of unprovoked targeted attacks by LLMs towards at-risk populations remains underexplored. Our paper presents three novel contributions: (1) the explicit evaluation of LLM-generated attacks on highly vulnerable mental health groups; (2) a network-based framework to study the propagation of relative biases; and (3) an assessment of the relative degree of stigmatization that emerges from these attacks. Our analysis of a recently released large-scale bias audit dataset reveals that mental health entities occupy central positions within attack narrative networks, as revealed by a significantly higher mean centrality of closeness (p-value = 4.06e-10) and dense clustering (Gini coefficient = 0.7). Drawing from an established stigmatization framework, our analysis indicates increased labeling components for mental health disorder-related targets relative to initial targets in generation chains. Taken together, these insights shed light on the structural predilections of large language models to heighten harmful discourse and highlight the need for suitable approaches for mitigation.

[85] FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension

Jushi Kai, Yixuan Wang, Boyi Zeng, Haoli Bai, Bo Jiang, Ziwei He, Zhouhan Lin

Main category: cs.CL

TL;DR: FreqKV: A frequency-domain KV cache compression method for LLMs that enables efficient long-context processing up to 256K tokens by compressing context information in low-frequency components.

Motivation: Existing KV cache compression methods rely on token eviction which loses critical local information, especially problematic for long-context scenarios. Performance degrades sharply when extrapolating beyond pretrained context lengths.

Method: Parameter-free, architecture-agnostic approach that compresses increasing KV cache in frequency domain. Based on observation that context information concentrates in low-frequency components. Iteratively applies frequency-domain compression during inference.
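A minimal sketch of the compression step, keeping only low-frequency components of the KV cache along the sequence axis; the keep ratio, the rescaling, and any tokens exempted from compression are assumptions, not the paper's exact recipe.

```python
import torch

def compress_kv(kv, keep_ratio=0.5):
    """kv: [batch, heads, seq_len, head_dim]; returns a shorter cache of the same dtype."""
    seq_len = kv.shape[2]
    new_len = max(1, int(seq_len * keep_ratio))
    freq = torch.fft.rfft(kv.float(), dim=2)          # DFT over the sequence dimension
    freq = freq[:, :, : new_len // 2 + 1, :]          # keep only the low-frequency components
    out = torch.fft.irfft(freq, n=new_len, dim=2)     # inverse DFT onto a shorter axis
    return (out * (new_len / seq_len)).to(kv.dtype)   # rescale for the shorter length

kv = torch.randn(1, 8, 1024, 64)
print(compress_kv(kv, keep_ratio=0.25).shape)         # torch.Size([1, 8, 256, 64])
```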

Result: Extends LLaMA-2-7B context window to 256K tokens with stable perplexity using minimal training at 8K length. Outperforms existing KV cache compression methods on LLaMA-2 and LLaMA-3 across prefilling and decoding tasks.

Conclusion: FreqKV enables robust context window extension and is effective for both understanding and generation in long contexts, providing efficient KV cache compression without losing critical information.

Abstract: Existing key-value (KV) cache compression methods for large language models (LLMs) often rely on token eviction, which risks losing critical local information in both long prefilling and decoding scenarios. When extrapolating beyond the pretrained context length, their performance degrades sharply on long-context benchmarks. Motivated by the observation in the frequency domain that the context information is concentrated in the low-frequency components, we propose FreqKV, a parameter-free and architecture-agnostic approach. It iteratively compresses the increasing KV cache in the frequency domain, allowing models to process lengthy contexts efficiently. With minimal training at 8K length, FreqKV extends the context window of LLaMA-2-7B up to 256K tokens while maintaining stable perplexity. Extensive experiments across prefilling and decoding demonstrate that FreqKV enables robust context window extension and consistently outperforms existing KV cache compression methods on LLaMA-2 and LLaMA-3, highlighting its effectiveness for both understanding and generation in long contexts.

[86] Towards Transparent RAG: Fostering Evidence Traceability in LLM Generation via Reinforcement Learning

Jingyi Ren, Yekun Xu, Xiaolong Wang, Weitao Li, Ante Wang, Weizhi Ma, Yang Liu

Main category: cs.CL

TL;DR: TRACE is a reinforcement learning framework that enhances evidence traceability in RAG systems by training LLMs to produce structured outputs with explicit evidence citations, improving both transparency and accuracy.

Motivation: Current RAG systems lack transparent reasoning paths that trace back to source evidence, compromising interpretability and limiting the model's ability to fully exploit retrieved context.

Method: TRACE uses reinforcement learning to guide LLMs to produce structured outputs with explicit evidence citations. It employs prompting and rewards for evidence relevance, proper formatting, and accuracy, with adaptive reward merging and stabilized KL-divergence estimation for training stability.

Result: Experiments on three multi-hop QA datasets show TRACE achieves transparent, evidence-attributed outputs with 10-30% accuracy improvements, comparable to advanced commercial LLMs like OpenAI o1 and DeepSeek-R1, with strong generalization to unseen tasks.

Conclusion: TRACE successfully enhances evidence traceability in RAG systems through RL, achieving both improved transparency and accuracy while maintaining training stability and strong generalization capabilities.

Abstract: Retrieval-Augmented Generation (RAG) delivers substantial value in knowledge-intensive applications. However, its generated responses often lack transparent reasoning paths that trace back to source evidence from retrieved documents. This opacity not only compromises the interpretability of the output but also limits the model’s ability to fully exploit the provided context. To address this, we propose TRACE (Transparent RAG with evidenCE tracing), a framework designed to enhance evidence traceability in Large Language Models (LLMs) through reinforcement learning (RL). TRACE guides LLMs to produce structured outputs with explicit evidence citations by prompting and rewarding evidence relevance and proper formatting, alongside accuracy, to optimize structured traceability. To ensure training stability with multiple reward signals, we further introduce an adaptive strategy for merging rewards and adopt a stabilized KL-divergence estimator. Experiments on three multi-hop QA datasets using Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct show that TRACE achieves both transparent, evidence-attributed outputs and accuracy improvements of 10-30%. The resulting performance is comparable to advanced commercial LLMs (e.g., OpenAI o1, DeepSeek-R1). Further analyses demonstrate strong generalization capabilities to unseen tasks. Our code is publicly available now.

[87] Boosting Large Language Models for Mental Manipulation Detection via Data Augmentation and Distillation

Yuansheng Gao, Peng Gao, Han Bao, Bin Li, Jixiang Luo, Zonghui Wang, Wenzhi Chen

Main category: cs.CL

TL;DR: MentalMAD: A framework using LLMs for detecting mental manipulation in social media dialogues via annotation-free data augmentation and knowledge distillation.

Motivation: Mental manipulation on social media threatens psychological well-being and online integrity, but detection is challenging due to lack of annotated data, covert multi-turn nature, and absence of real-world datasets.

Method: Three components: 1) EvoSA - annotation-free data augmentation combining evolutionary operations with speech-act-aware prompting, 2) teacher-model-generated complementary-task supervision, 3) Complementary-Convergent Distillation for phase-wise knowledge transfer to student models. Constructed ReaMent dataset with 5,000 real-world dialogues.

Result: MentalMAD improves accuracy by 14.0%, macro-F1 by 27.3%, and weighted F1 by 15.1% over strongest baseline. Code and dataset publicly available.

Conclusion: The proposed framework effectively addresses challenges in mental manipulation detection through innovative data augmentation and knowledge distillation techniques.

Abstract: Mental manipulation on social media poses a covert yet serious threat to individuals’ psychological well-being and the integrity of online interactions. Detecting such behavior is challenging due to the difficult-to-annotate training data, its highly covert and multi-turn nature, and the lack of real-world datasets. To address these challenges, we propose MentalMAD, a framework that enhances large language models for mental manipulation detection. Our approach consists of three key components: EvoSA, an annotation-free data augmentation method that combines evolutionary operations with speech-act-aware prompting; teacher-model-generated complementary-task supervision; and Complementary-Convergent Distillation, a phase-wise strategy for transferring manipulation-specific knowledge to student models. We then constructed the ReaMent dataset, comprising 5,000 real-world-sourced dialogues. Extensive experiments show that MentalMAD improves accuracy by 14.0%, macro-F1 by 27.3%, and weighted F1 by 15.1% over the strongest baseline. The code and the dataset are publicly available at https://github.com/Yuansheng-Gao/MentalMAD.

[88] Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song

Main category: cs.CL

TL;DR: A novel pairwise-comparison framework for evaluating textual creativity using LLMs, trained on CreataSet (100K+ human and 1M+ synthetic creative instruction-response pairs), resulting in CrEval evaluator that outperforms existing methods in human alignment.

Motivation: Current creativity evaluation for LLMs relies heavily on inefficient and costly human judgments, while existing automated methods lack generalizability or alignment with human judgment, hindering progress in enhancing machine creativity.

Method: Proposes a pairwise-comparison framework for assessing textual creativity using shared contextual instructions to improve consistency. Introduces CreataSet dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs across diverse open-domain tasks. Trains an LLM-based evaluator named CrEval on this dataset.

Result: CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experiments show the importance of integrating both human and synthetic data for training robust evaluators, and CrEval proves practically useful for boosting LLM creativity.

Conclusion: The proposed framework and CrEval evaluator address key challenges in automated creativity assessment, providing an effective solution that aligns with human judgment and can enhance LLM creativity development.

Abstract: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human and synthetic data to train highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs.

[89] Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

Rihui Xin, Han Liu, Zecheng Wang, Yupeng Zhang, Dianbo Sui, Xiaolin Hu, Bingning Wang

Main category: cs.CL

TL;DR: Using format and length as surrogate signals instead of ground truth answers to train LLMs for mathematical problem-solving via reinforcement learning.

Motivation: Ground truth answers for training LLMs in mathematical problem-solving are often challenging, costly, and sometimes unfeasible to obtain, creating a need for alternative training signals.

Method: Developed a GRPO (Group Relative Policy Optimization) approach using format correctness and answer length as surrogate reward signals instead of traditional ground truth answers, combining format-only rewards for early phases with length-based rewards for later phases.
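An illustrative surrogate reward in this spirit is sketched below; the boxed-answer format check, the length cap, and the phase switch are assumptions used only to make the idea concrete, not the paper's exact reward.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion contains exactly one \\boxed{...} answer, else 0.0."""
    return 1.0 if len(re.findall(r"\\boxed\{[^{}]+\}", completion)) == 1 else 0.0

def length_reward(completion: str, target_tokens: int = 512) -> float:
    """Mildly prefer longer (more thorough) solutions, capped at a target length."""
    return min(len(completion.split()), target_tokens) / target_tokens

def surrogate_reward(completion: str, step: int, switch_step: int = 200) -> float:
    r = format_reward(completion)
    if step >= switch_step:                      # later phase: fold in the length signal
        r = 0.5 * r + 0.5 * length_reward(completion)
    return r
```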

Result: A format-only reward matched standard GRPO performance in early training phases, and the combined format-length reward surpassed standard GRPO in certain scenarios, reaching 40.0% accuracy on AIME2024 with a 7B base model; this indicates that good answering habits can unlock reasoning capabilities the base model already possesses.

Conclusion: Format and length can serve as effective surrogate signals for training LLMs in mathematical problem-solving, reducing dependence on extensive ground truth data while revealing that base models already possess reasoning skills that just need proper answering habits to be unlocked.

Abstract: Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research not only offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection, but also reveals why our label-free approach succeeds: the powerful base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on the test paper; it simply needs to develop good answering habits to achieve outstanding results in exams and unlock the capabilities it already possesses.

[90] Probing Neural Topology of Large Language Models

Yu Zheng, Yuan Yuan, Yue Zhuo, Yong Li, Gabriel Kreiman, Tomaso Poggio, Paolo Santi

Main category: cs.CL

TL;DR: Graph probing method reveals that neural topology in LLMs contains richer information about language generation performance than neural activations, enabling performance prediction with only 1% of connections and applications in model pruning and hallucination detection.

Motivation: Current probing methods focus on linking neural activations to semantics, but the functional connectivity between neurons and its relationship to emergent model capabilities remains poorly understood, limiting deeper comprehension and safer development of LLMs.

Method: Introduces graph probing to uncover functional connectivity of LLM neurons and relate it to language generation performance. Uses simple linear or MLP probes on neural topology across diverse LLM families and scales, identifies default networks and hub neurons, and conducts interventional experiments.
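A small sketch of that pipeline: build a functional-connectivity graph from neuron co-activations, keep only the strongest edges, and fit a simple probe on the resulting edge weights. The correlation-based connectivity, the shapes, and the probe target (per-prompt perplexity) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

def connectivity_features(activations, keep_frac=0.01):
    """activations: [num_tokens, num_neurons] for one prompt -> sparse edge-weight vector."""
    corr = np.corrcoef(activations.T)                  # neuron-by-neuron co-activation
    upper = corr[np.triu_indices_from(corr, k=1)]      # flatten the upper triangle (edges)
    k = max(1, int(keep_frac * upper.size))
    idx = np.argsort(-np.abs(upper))[:k]               # retain only the top-|weight| edges
    feats = np.zeros_like(upper)
    feats[idx] = upper[idx]
    return feats

# X: one connectivity vector per prompt; y: the model's perplexity on that prompt (placeholders here).
X = np.stack([connectivity_features(np.random.randn(128, 64)) for _ in range(32)])
y = np.random.rand(32)
probe = Ridge(alpha=1.0).fit(X, y)                      # simple linear probe on topology
```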

Result: Neural topology outperforms neural activation by up to 130.4% on perplexity and 67.7% on semantic regression. Performance remains predictable with just 1% of neuron connections. Identified default networks and hub neurons provide causal evidence that LLMs exploit topological information.

Conclusion: Graph probing reveals that neural topology contains orders of magnitude richer information about LLM performance than activations, enabling practical applications in model efficiency (pruning) and reliability (hallucination detection) while advancing understanding of LLM internal mechanisms.

Abstract: Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural activations to interpretable semantics. However, the complex mechanisms that link neurons’ functional co-activation with emergent model capabilities remain largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity of LLM neurons and relating it to language generation performance. By probing models across diverse LLM families and scales, we discover a universal predictability of language generation and understanding performance using only neural topology, which persists even when retaining just 1% of neuron connections. Strikingly, probing on topology outperforms probing on activation by up to 130.4% and 67.7% on perplexity and space/time semantic regression respectively, suggesting that neural topology contains orders of magnitude richer information about LLM performance than neural activation, which can be easily extracted with simple linear or MLP probes. To explain the dependence between neural topology and language performance, we identify default networks and hub neurons in LLMs and provide causal evidence by interventional experiments on multiple benchmarks, showing that LLMs actually exploit this topological information. Further analyses suggest that graph probing can be effectively leveraged to improve the efficiency and reliability of LLMs through proof-of-concept applications in model pruning and hallucination detection. Codes and data for the graph probing toolbox are available at https://github.com/DavyMorgan/llm-graph-probing.

[91] NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment

Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

Main category: cs.CL

TL;DR: NeuroFaith: A framework to measure faithfulness of LLM self-explanations by identifying key concepts and testing their actual influence on predictions

Motivation: LLMs can generate plausible free text self-explanations, but these may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing evaluation methods don't examine semantic content of internal neural representations.

Method: NeuroFaith identifies key concepts within explanations and mechanistically tests whether these concepts actually influence the model’s predictions. It includes a linear faithfulness probe to detect unfaithful self-explanations from representation space and improve faithfulness through steering.
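A sketch of the linear faithfulness probe component only: fit a linear classifier on hidden representations to predict whether a self-explanation is faithful, with labels assumed to come from the mechanistic tests described above; the data below is a random placeholder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# reps: [num_examples, hidden_dim] pooled hidden states taken while the model explains itself;
# faithful: 0/1 labels obtained from intervention-based faithfulness tests (placeholders here).
reps = np.random.randn(64, 768)
faithful = np.random.randint(0, 2, size=64)

probe = LogisticRegression(max_iter=1000).fit(reps, faithful)
direction = probe.coef_[0]    # a candidate direction for steering representations toward faithfulness
```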

Result: The framework demonstrates versatility across 2-hop reasoning and classification tasks, providing a principled approach to evaluating and enhancing faithfulness of LLM free text self-explanations.

Conclusion: NeuroFaith addresses critical needs for trustworthy AI systems by providing tools to measure and improve the faithfulness of LLM self-explanations, moving beyond behavioral tests to examine internal representations.

Abstract: Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model’s actual reasoning process, pinpointing a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model’s predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.

[92] Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting

Nathaniel Getachew, Abulhair Saparov

Main category: cs.CL

TL;DR: StorySim is a programmable framework for generating synthetic stories to evaluate Theory of Mind and World Modeling capabilities in LLMs, addressing data contamination issues in existing benchmarks.

Motivation: Existing benchmarks for evaluating Theory of Mind (ToM) and World Modeling (WM) in LLMs suffer from potential data contamination in pretraining data, making it difficult to assess true reasoning capabilities. There's a need for novel, controllable evaluation frameworks that can generate fresh, compositional stories to properly test mental state tracking.

Method: StorySim uses a programmable framework with a highly controllable Storyboard to generate synthetic stories. It enables precise manipulation of character perspectives and events to design first- and second-order ToM tasks alongside WM tasks. The framework systematically controls for mental state tracking abilities and generates novel story prompts to avoid data contamination.

Result: Experiments across state-of-the-art LLMs show: 1) Most models perform better on WM tasks than ToM tasks, 2) Models perform better reasoning with humans compared to inanimate objects, 3) Evidence of heuristic behaviors like recency bias and over-reliance on earlier story events. The framework successfully revealed systematic patterns in model reasoning capabilities.

Conclusion: StorySim provides a valuable tool for evaluating LLM reasoning capabilities, particularly for Theory of Mind and World Modeling. The framework’s programmable nature allows for precise evaluation design and reveals important patterns in model behavior, including systematic biases and differential performance across task types.

Abstract: We introduce $\texttt{StorySim}$, a programmable framework for synthetically generating stories to evaluate the theory of mind (ToM) and world modeling (WM) capabilities of large language models (LLMs). Unlike prior benchmarks that may suffer from contamination in pretraining data, $\texttt{StorySim}$ produces novel, compositional story prompts anchored by a highly controllable $\texttt{Storyboard}$, enabling precise manipulation of character perspectives and events. We use this framework to design first- and second-order ToM tasks alongside WM tasks that control for the ability to track and model mental states. Our experiments across a suite of state-of-the-art LLMs reveal that most models perform better on WM tasks than ToM tasks, and that models tend to perform better reasoning with humans compared to inanimate objects. Additionally, our framework enabled us to find evidence of heuristic behavior such as recency bias and an over-reliance on earlier events in the story. All code for generating data and evaluations is freely available.

[93] A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic

Juan Moreno Gonzalez, Bashar Alhafni, Nizar Habash

Main category: cs.CL

TL;DR: Two-step approach for transliterating Judeo-Arabic (Arabic written in Hebrew script) into Arabic script using character mapping and post-correction, with LLM benchmark evaluation.

Motivation: Judeo-Arabic presents unique challenges for transliteration due to ambiguous letter mappings, inconsistent orthographic conventions, and Hebrew code-switching, making it difficult to apply standard Arabic NLP tools.

Method: Two-step approach: 1) Simple character-level mapping from Hebrew to Arabic script, 2) Post-correction to address grammatical and orthographic errors. Also benchmarked LLMs on this task.
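A toy version of the first step (character-level Hebrew-to-Arabic mapping) is shown below; the table is partial and illustrative, and the post-correction step is reduced to a placeholder.

```python
# Partial, illustrative Hebrew-to-Arabic character table; real Judeo-Arabic orthography
# involves ambiguous mappings (e.g. final forms, diacritics) handled by post-correction.
HEBREW_TO_ARABIC = {
    "א": "ا", "ב": "ب", "ג": "ج", "ד": "د", "ה": "ه", "ו": "و", "ז": "ز",
    "ח": "ح", "ט": "ط", "י": "ي", "כ": "ك", "ך": "ك", "ל": "ل", "מ": "م",
    "ם": "م", "נ": "ن", "ן": "ن", "ס": "س", "ע": "ع", "פ": "ف", "ף": "ف",
    "צ": "ص", "ץ": "ص", "ק": "ق", "ר": "ر", "ש": "ش", "ת": "ت",
}

def post_correct(text: str) -> str:
    """Placeholder for the second step (learned grammatical/orthographic correction)."""
    return text

def transliterate(text: str) -> str:
    mapped = "".join(HEBREW_TO_ARABIC.get(ch, ch) for ch in text)   # step 1: character mapping
    return post_correct(mapped)                                      # step 2: post-correction
```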

Result: Successfully transliterated Judeo-Arabic into Arabic script, enabling Arabic NLP tools to perform morphosyntactic tagging and machine translation that were previously infeasible on original texts.

Conclusion: Transliteration enables application of Arabic NLP tools to Judeo-Arabic texts, with code and data made publicly available for further research.

Abstract: Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would have not been feasible on the original texts. We make our code and data publicly available.

[94] EMSEdit: Efficient Multi-Step Meta-Learning-based Model Editing

Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu

Main category: cs.CL

TL;DR: EMSEdit improves model editing for LLMs using multi-step backpropagation and norm regularization to enhance performance in low-data regimes while preserving unedited knowledge.

Motivation: Current meta-learning-based model editing (MLME) methods struggle in low-data regimes and incur high training costs due to KL divergence, limiting practical deployment of efficient LLM knowledge updates.

Method: Proposes EMSEdit with three key components: 1) Multi-step backpropagation (MSBP) to capture gradient-activation mapping patterns, 2) Multi-step edits per sample for better performance with limited data, and 3) Norm-based regularization to preserve unedited knowledge and improve training efficiency.

Result: Experiments on two datasets and three LLMs show EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing. MSBP can be integrated into existing approaches for additional gains, and EMSEdit demonstrates robustness in multi-hop reasoning editing tasks.

Conclusion: EMSEdit provides an efficient and effective solution for model editing in LLMs, particularly valuable in low-data scenarios, with components that can enhance existing editing methods.

Abstract: Large Language Models (LLMs) power numerous AI applications, yet updating their knowledge remains costly. Model editing provides a lightweight alternative through targeted parameter modifications, with meta-learning-based model editing (MLME) demonstrating strong effectiveness and efficiency. However, we find that MLME struggles in low-data regimes and incurs high training costs due to the use of KL divergence. To address these issues, we propose $\textbf{E}$fficient $\textbf{M}$ulti-$\textbf{S}$tep $\textbf{Edit (EMSEdit)}$, which leverages multi-step backpropagation (MSBP) to effectively capture gradient-activation mapping patterns within editing samples, performs multi-step edits per sample to enhance editing performance under limited data, and introduces norm-based regularization to preserve unedited knowledge while improving training efficiency. Experiments on two datasets and three LLMs show that EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing. Moreover, MSBP can be seamlessly integrated into existing approaches to yield additional performance gains. Further experiments on a multi-hop reasoning editing task demonstrate EMSEdit’s robustness in handling complex edits, while ablation studies validate the contribution of each design component. Our code is available at https://github.com/xpq-tech/emsedit.

[95] Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics

Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian McAuley

Main category: cs.CL

TL;DR: A state-aware transition framework that abstracts Chain-of-Thought reasoning trajectories into structured latent dynamics using spectral analysis and Markov chains for interpretability.

Motivation: Current CoT prompting enables multi-step reasoning in LLMs but lacks explainability, with prior work focusing only on local token-level attribution, leaving high-level semantic roles and transitions underexplored.

Method: Represent each reasoning step via spectral analysis of token-level embeddings, cluster them into semantically coherent latent states, and model their progression as a Markov chain to capture the global structure of reasoning.
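A compact sketch of this pipeline: embed each reasoning step, cluster the steps into latent states, and estimate a Markov transition matrix over those states. The embedding source, the number of states, and k-means clustering are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def reasoning_dynamics(step_embeddings, n_states=5, seed=0):
    """step_embeddings: [num_steps, dim] for one CoT trace (or several traces concatenated)."""
    labels = KMeans(n_clusters=n_states, random_state=seed, n_init=10).fit_predict(step_embeddings)
    T = np.zeros((n_states, n_states))
    for a, b in zip(labels[:-1], labels[1:]):                # count state-to-state transitions
        T[a, b] += 1
    T = T / np.maximum(T.sum(axis=1, keepdims=True), 1)      # row-normalize into probabilities
    return labels, T

labels, T = reasoning_dynamics(np.random.randn(40, 32))       # placeholder step embeddings
```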

Result: The framework provides structured and interpretable views of reasoning processes, supporting semantic role identification, temporal pattern visualization, and consistency evaluation.

Conclusion: The state-aware transition framework offers a novel approach to understanding CoT reasoning dynamics at a higher semantic level, moving beyond token-level analysis to capture the structured progression of reasoning steps.

Abstract: Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.

[96] Verbalized Algorithms

Supriya Lall, Christian Farrell, Hari Pathanjaly, Marko Pavic, Sarvesh Chezhian, Masataro Asai

Main category: cs.CL

TL;DR: Verbalized Algorithms (VAs) combine LLMs with classical algorithms by using LLMs as simple binary comparison oracles within established algorithmic frameworks, providing theoretical guarantees for tasks like sorting, clustering, and optimization.

Motivation: Current LLM approaches for reasoning tasks rely on one-shot queries without theoretical guarantees. The authors propose a more systematic approach that leverages LLMs' strengths (simple natural language operations) while avoiding their weaknesses (complex reasoning) by embedding them within classical algorithms.

Method: Verbalized Algorithms decompose tasks into elementary natural language operations that LLMs can handle reliably, using LLMs as binary comparison oracles within well-analyzed classical algorithms. Examples include verbalized sorting (bitonic sorting network), verbalized maximum, verbalized clustering, and verbalized submodular maximization.
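A minimal sketch of an LLM-as-comparison-oracle sort; the paper plugs the oracle into a bitonic sorting network, whereas this illustration uses Python's standard comparison sort, and `llm_prefers` is a hypothetical wrapper to be backed by a real model call.

```python
from functools import cmp_to_key

def llm_prefers(a: str, b: str, criterion: str) -> bool:
    """Ask the model a single yes/no question: should `a` rank before `b` under `criterion`?
    Stubbed with a lexicographic fallback here; replace with an actual LLM call."""
    return a < b

def verbalized_sort(items, criterion):
    def compare(a, b):
        return -1 if llm_prefers(a, b, criterion) else 1
    return sorted(items, key=cmp_to_key(compare))

print(verbalized_sort(["a whale", "an ant", "a horse"], criterion="smallest animal first"))
```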

Result: The approach provides theoretical guarantees: O(n) runtime for maximum, O(n log n) for sorting, and a (1 - 1/e) optimality guarantee for submodular maximization. Clustering and submodular maximization outperformed or improved on nearest neighbor search using state-of-the-art embedding models.

Conclusion: Verbalized Algorithms offer a principled way to combine LLMs with classical algorithms, providing theoretical guarantees while leveraging LLMs’ natural language understanding capabilities for specific elementary operations.

Abstract: Instead of querying LLMs in a one-shot manner and hoping to get the right answer for a reasoning task, we propose a paradigm we call \emph{verbalized algorithms} (VAs), which combines LLMs with classical algorithms with established theoretical guarantees. VAs decompose a task into simple elementary operations on natural language strings that LLMs are able to answer reliably, and limit the scope of LLMs to those simple tasks. For example, for sorting a series of natural language strings, \emph{verbalized sorting} uses an LLM as a binary comparison oracle in a known and well-analyzed sorting algorithm (e.g., bitonic sorting network). Although this is already known as \emph{pairwise ranking} in the literature, we additionally demonstrate the effectiveness of \emph{verbalized maximum}, \emph{verbalized clustering}, and \emph{verbalized submodular maximization} for numerical reasoning, topic clustering and multi-hop Q&A RAG task, which guarantees $O(n)$ runtime, $O(n \log n)$ runtime, and $(1-1/e)$ optimality, respectively. Clustering and submodular maximization outperformed or improved the nearest neighbor search using state-of-the-art embedding models.

[97] SFT Doesn’t Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Jiacheng Lin, Zhongruo Wang, Kun Qian, Tian Wang, Arvind Srinivasan, Hansi Zeng, Ruochen Jiao, Xie Zhou, Jiri Gesi, Dakuo Wang, Yufan Guo, Kai Zhong, Weiqi Zhang, Sujay Sanghavi, Changyou Chen, Hyokun Yun, Lihong Li

Main category: cs.CL

TL;DR: SFT with smaller learning rates reduces general capability degradation while maintaining domain performance; Token-Adaptive Loss Reweighting (TALR) further improves this balance.

Motivation: To challenge the common belief that Supervised Fine-Tuning (SFT) on domain-specific datasets necessarily degrades LLMs' general capabilities, and to develop methods that better balance domain adaptation with preservation of general abilities.

Method: Empirical study of SFT with varying learning rates, theoretical analysis of the trade-off, and proposal of Token-Adaptive Loss Reweighting (TALR). Also evaluated L2 regularization, LoRA, model averaging, and FLOW as comparative baselines.
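The paper's exact TALR rule is not reproduced here; the following is one plausible instantiation of token-level loss reweighting, shown only to make the idea concrete. The confidence-based weighting scheme is an assumption, not the authors' formula.

```python
import torch
import torch.nn.functional as F

def reweighted_lm_loss(logits, labels, temperature=1.0):
    """logits: [B, T, V]; labels: [B, T] with -100 marking ignored positions."""
    per_token = F.cross_entropy(logits.transpose(1, 2), labels,
                                reduction="none", ignore_index=-100)          # [B, T]
    with torch.no_grad():
        probs = F.softmax(logits / temperature, dim=-1)
        conf = probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
        weights = conf * (labels != -100)     # assumed rule: scale each token by model confidence
    return (weights * per_token).sum() / weights.sum().clamp(min=1e-8)
```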

Result: Smaller learning rates substantially mitigate general performance degradation while preserving target-domain performance. TALR consistently outperforms other methods in balancing domain-specific gains and general capabilities, though no method completely eliminates the trade-off.

Conclusion: Two practical guidelines: (1) use small learning rates for favorable trade-off, (2) adopt TALR when stronger balance is needed. The paper provides both empirical evidence and theoretical understanding of the SFT trade-off.

Abstract: Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) using a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is further desired, adopt TALR as an effective strategy.

[98] Context Parametrization with Compositional Adapters

Josip Jukić, Martin Tutek, Jan Šnajder

Main category: cs.CL

TL;DR: CompAs: A meta-learning framework that translates context into adapter parameters with compositional structure, enabling algebraic merging of multiple information chunks for efficient LLM adaptation.

Motivation: Address limitations of in-context learning (inefficient with many demonstrations) and supervised fine-tuning (training overhead, inflexible) by creating a method that maps instructions/demonstrations directly into adapter parameters that can compositionally integrate multiple information sources.

Method: Meta-learning framework that generates adapter parameters from context with compositional structure, allowing algebraic merging of adapters from different context chunks. Features reversible encoding for safety and context recovery.
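A sketch of the algebraic-merging step only: adapter parameters produced from several context chunks are combined by weighted summation before being applied to the base model. The generator that maps context to adapter weights is omitted, and the LoRA-style parameter names are assumptions.

```python
import torch

def merge_adapters(adapter_list, weights=None):
    """adapter_list: list of dicts mapping parameter names to tensors of identical shapes."""
    weights = weights or [1.0 / len(adapter_list)] * len(adapter_list)
    merged = {}
    for name in adapter_list[0]:
        merged[name] = sum(w * a[name] for w, a in zip(weights, adapter_list))
    return merged

# One adapter per context chunk, e.g. generated from two different instruction snippets.
chunk_a = {"lora_A": torch.randn(8, 768), "lora_B": torch.randn(768, 8)}
chunk_b = {"lora_A": torch.randn(8, 768), "lora_B": torch.randn(768, 8)}
merged = merge_adapters([chunk_a, chunk_b])
```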

Result: Outperforms ICL and prior generator-based methods on diverse multiple-choice and extractive QA tasks, especially when scaling to more inputs. Provides lower inference cost, improved stability with long contexts, and principled solution for inputs exceeding context window.

Conclusion: Composable adapter generation offers a practical and efficient alternative for scaling LLM deployment, bridging the gap between ICL and SFT with benefits in efficiency, stability, and safety.

Abstract: Large language models (LLMs) often seamlessly adapt to new tasks through in-context learning (ICL) or supervised fine-tuning (SFT). However, ICL is inefficient when handling many demonstrations, and SFT incurs training overhead while sacrificing flexibility. Mapping instructions or demonstrations from context directly into adapter parameters offers an appealing alternative. While prior work explored generating adapters based on a single input context, it has overlooked the need to integrate multiple chunks of information. To address this gap, we introduce CompAs, a meta-learning framework that translates context into adapter parameters with a compositional structure that allows them to be merged algebraically. This approach yields three benefits: lower inference cost, improved stability under long contexts, and establishes a principled solution when input exceeds the model’s context window. Furthermore, CompAs reversibly encodes information into adapter parameters, enabling recovery of the original input context and facilitating safety. Empirical results on diverse multiple-choice and extractive question answering tasks show that CompAs outperforms ICL and prior generator-based methods, especially when scaling to more inputs. Our work establishes composable adapter generation as a practical and efficient alternative for scaling LLM deployment.

[99] Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi

Main category: cs.CL

TL;DR: A training-free method to identify and manipulate sparse cross-lingual dimensions in LLMs that control language switching while preserving semantics.

Motivation: English-centric LLMs show strong multilingual capabilities despite limited non-English data, with prior work showing they map multilingual content into English-aligned representations at intermediate layers then project back to target languages. The authors hypothesize this cross-lingual transition is governed by a small, sparse set of dimensions at consistent indices.

Method: Introduces a simple, training-free method to identify and manipulate these cross-lingual dimensions using only 50 sentences of either parallel or monolingual data. The method identifies sparse dimensions that control language switching in intermediate-to-final layers.
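A rough sketch of the identify-and-intervene procedure: select the few hidden dimensions whose mean activation differs most between the source and target languages (using on the order of 50 sentences), then overwrite them at generation time. The layer, the mean-difference criterion, and the intervention values are illustrative assumptions.

```python
import torch

def find_language_dims(hidden_src, hidden_tgt, top_k=16):
    """hidden_*: [num_sentences, hidden_dim] pooled hidden states for each language."""
    delta = hidden_tgt.mean(dim=0) - hidden_src.mean(dim=0)
    dims = torch.topk(delta.abs(), top_k).indices        # sparse set of language-sensitive dims
    return dims, delta[dims]

def intervene(hidden_state, dims, target_values):
    """Overwrite the selected dimensions so decoding switches toward the target language."""
    patched = hidden_state.clone()
    patched[..., dims] = target_values
    return patched

dims, vals = find_language_dims(torch.randn(50, 4096), torch.randn(50, 4096))   # placeholder states
patched = intervene(torch.randn(1, 12, 4096), dims, vals)
```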

Result: Experiments on multilingual generation control show the dimensions are interpretable - interventions can switch output language while preserving semantic content. The method surpasses prior neuron-based approaches at substantially lower cost.

Conclusion: Cross-lingual transitions in LLMs are governed by sparse, interpretable dimensions that can be manipulated for controlled language switching, offering a low-cost alternative to prior methods.

Abstract: Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.

[100] Less Diverse, Less Safe: The Indirect But Pervasive Risk of Test-Time Scaling in Large Language Models

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

Main category: cs.CL

TL;DR: Test-Time Scaling (TTS) for LLM reasoning has a vulnerability: reduced candidate diversity increases unsafe outputs, which existing safety classifiers fail to detect.

Motivation: The paper investigates a failure mode in Test-Time Scaling (TTS) where the assumption that diverse candidate pools enhance reliability can be exploited. When candidate diversity is constrained, TTS becomes more likely to produce unsafe outputs, revealing a previously unrecognized vulnerability.

Method: The authors develop RefDiv (reference-guided diversity reduction protocol) as a diagnostic attack to stress test TTS pipelines. They conduct extensive experiments across open-source models (Qwen3, Mistral, Llama3.1, Gemma3) and TTS strategies (Monte Carlo Tree Search and Best-of-N), and also test closed-source models (OpenAI o3-mini and Gemini-2.5-Pro). They evaluate safety guardrail classifiers like Llama-Guard against adversarial prompts generated by RefDiv.

Result: Constraining diversity consistently increases unsafe output rates in TTS, often more effectively than direct adversarial prompts. The vulnerability transfers across TTS strategies and model types. Existing safety classifiers fail to detect RefDiv-generated adversarial inputs, showing limited protection against this diversity-driven failure mode.

Conclusion: TTS has a fundamental vulnerability where reduced candidate diversity leads to increased unsafe outputs, which current safety measures cannot adequately address. This represents a general property of TTS systems rather than a model-specific artifact.

Abstract: Test-Time Scaling (TTS) improves LLM reasoning by exploring multiple candidate responses and then operating over this set to find the best output. A tacit premise behind TTS is that sufficiently diverse candidate pools enhance reliability. In this work, we show that this assumption in TTS introduces a previously unrecognized failure mode. When candidate diversity is curtailed, even by a modest amount, TTS becomes much more likely to produce unsafe outputs. We present a reference-guided diversity reduction protocol (RefDiv) that serves as a diagnostic attack to stress test TTS pipelines. Through extensive experiments across open-source models (e.g. Qwen3, Mistral, Llama3.1, Gemma3) and two widely used TTS strategies (Monte Carlo Tree Search and Best-of-N), we find that constraining diversity consistently increases the rate at which TTS produces unsafe results. The effect is often stronger than that produced by direct prompts with high adversarial intent scores. This observed phenomenon also transfers across TTS strategies and to closed-source models (e.g. OpenAI o3-mini and Gemini-2.5-Pro), thus indicating that this is a general and extant property of TTS rather than a model-specific artifact. Additionally, we find that numerous widely used safety guardrail classifiers (e.g. Llama-Guard) are unable to flag the adversarial input prompts generated by RefDiv, demonstrating that existing defenses offer limited protection against this diversity-driven failure mode.

[101] LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System

Yu Chao, Siyu Lin, Xiaorong Wang, Zhu Zhang, Zihan Zhou, Haoyu Wang, Shuo Wang, Jie Zhou, Zhiyuan Liu, Maosong Sun

Main category: cs.CL

TL;DR: LLM x MapReduce-V3 is a hierarchical multi-agent system for generating long-form surveys using modular MCP servers orchestrated by a planner agent.

Motivation: To create a more flexible and controllable system for long-form survey generation that allows human intervention and customization through modular decomposition.

Method: Uses a hierarchically modular architecture with independent MCP servers for components like skeleton initialization, digest construction, and skeleton refinement. A planner agent dynamically orchestrates workflow based on MCP tool descriptions and execution history.

Result: Human evaluations show the system surpasses representative baselines in both content depth and length, demonstrating the strength of MCP-based modular planning.

Conclusion: The modular MCP-based approach enables effective long-form survey generation with human-in-the-loop control and customization capabilities.

Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.

[102] NOSA: Native and Offloadable Sparse Attention

Yuxiang Huang, Pengjie Wang, Jicheng Han, Weilin Zhao, Zhou Su, Ao Sun, Hongya Lyu, Hengyu Zhao, Yudong Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu

Main category: cs.CL

TL;DR: NOSA is a trainable sparse attention mechanism designed for efficient KV cache offloading to CPU, achieving high decoding throughput while maintaining generation quality.

DetailsMotivation: KV cache consumes significant GPU memory during large-batch inference, limiting decoding throughput. Existing KV cache offloading methods either degrade generation quality due to training-inference mismatch or are incompatible with efficient offloading due to unconstrained KV accesses.

Method: NOSA is a trainable sparse attention mechanism that explicitly constrains CPU-GPU KV transfer volume. It’s designed natively for KV cache offloading, enabling low communication overhead. The authors also build NOSI, a KV cache offloading inference system to fully utilize NOSA’s efficiency.
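
A minimal sketch of the offloading-friendly access pattern, assuming per-block key summaries kept on GPU and a hard per-step budget on fetched blocks (the block scoring and budget here are illustrative, not NOSA's actual selection rule):

```python
import torch

def select_blocks(query, block_summaries, budget):
    # query: (d,); block_summaries: (num_blocks, d), kept on GPU as cheap proxies
    scores = block_summaries @ query              # relevance of each offloaded block
    return torch.topk(scores, k=budget).indices   # hard cap on blocks to fetch

def sparse_attention_step(query, cpu_keys, cpu_values, block_summaries, budget):
    idx = select_blocks(query, block_summaries, budget)
    # Only the selected blocks cross the CPU->GPU boundary (the constrained transfer).
    k = cpu_keys[idx].to(query.device).flatten(0, 1)
    v = cpu_values[idx].to(query.device).flatten(0, 1)
    attn = torch.softmax(k @ query / k.shape[-1] ** 0.5, dim=0)
    return attn @ v

d, num_blocks, block_len, budget = 64, 32, 16, 4
cpu_k = torch.randn(num_blocks, block_len, d)     # offloaded KV cache (would live on CPU)
cpu_v = torch.randn(num_blocks, block_len, d)
summaries = cpu_k.mean(dim=1)                     # per-block summaries resident on GPU
out = sparse_attention_step(torch.randn(d), cpu_k, cpu_v, summaries, budget)
print(out.shape)  # torch.Size([64])
```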

Result: NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks. It boosts decoding throughput by up to 5.04x over FullAttn, 1.92x over InfLLMv2, and 1.83x over ShadowKV on 1B, 3B, and 8B LLMs.

Conclusion: NOSA provides an effective solution for KV cache offloading that maintains generation quality while significantly improving decoding throughput through constrained sparse attention patterns optimized for CPU-GPU transfers.

Abstract: Decoding throughput improvements from larger inference batches are limited by GPU memory, which is largely consumed by the key-value (KV) cache. Prior training-free KV cache offloading alleviates this by keeping redundant context on the CPU and fetching only a sparse subset for attention, but it often degrades long-generation quality due to training-inference mismatch on sparse patterns. Meanwhile, trainable sparse attention is incompatible with efficient offloading, as unconstrained KV accesses may force large CPU-to-GPU transfers and erase throughput gains. To this end, we propose NOSA, a trainable sparse attention mechanism natively designed for KV cache offloading. NOSA explicitly constrains the volume of CPU-GPU KV transfers, thereby achieving low communication overhead and high decoding throughput. We further build NOSI, a KV cache offloading inference system that fully unlocks NOSA’s efficiency. Empirical results on 1B, 3B, and 8B LLMs demonstrate that NOSA outperforms KV cache offloading baselines on general, long-input, and long-generation tasks, while boosting decoding throughput by up to 5.04x, 1.92x, and 1.83x over FullAttn, InfLLMv2, and ShadowKV, respectively. We release our code at https://github.com/thunlp/NOSA.

[103] Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

Main category: cs.CL

TL;DR: LiRA is a plug-and-play framework that improves LLM performance on low-resource languages through anchor-based alignment to English semantic space and lightweight language-aware reasoning.

DetailsMotivation: LLMs struggle with low-resource languages due to limited training data, translation noise, and unstable cross-lingual alignment, creating performance gaps for non-English languages.

Method: LiRA combines Arca (Anchored Representation Composition Architecture) for aligning low-resource inputs to English semantic space through anchor-based alignment and collaborative encoding, with LaSR (Language-coupled Semantic Reasoner) for consistency regularization and unified cross-lingual understanding.
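
A minimal sketch of the anchoring idea, assuming paired English anchor embeddings and a simple cosine-alignment plus consistency term; the actual Arca/LaSR objectives are not spelled out here, so the loss shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def anchor_alignment_loss(low_res_emb, english_anchor_emb):
    # Pull each low-resource embedding toward its paired English anchor.
    return 1.0 - F.cosine_similarity(low_res_emb, english_anchor_emb, dim=-1).mean()

def consistency_loss(view_a, view_b):
    # Regularize two views of the same input toward consistent representations.
    return F.mse_loss(F.normalize(view_a, dim=-1), F.normalize(view_b, dim=-1))

emb = torch.randn(8, 256, requires_grad=True)   # low-resource batch embeddings
anchor = torch.randn(8, 256)                    # paired English anchors
noisy_view = emb + 0.01 * torch.randn_like(emb) # stand-in for a second encoder view
loss = anchor_alignment_loss(emb, anchor) + 0.1 * consistency_loss(emb, noisy_view)
loss.backward()
print(loss.item())
```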

Result: The framework demonstrates consistent improvements across diverse low-resource benchmarks for retrieval, ranking, QA, and reasoning tasks, with theoretical guarantees on representation stability.

Conclusion: LiRA provides an effective plug-and-play solution for enhancing LLM performance on low-resource languages with theoretical guarantees and practical improvements.

Abstract: Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs)-a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

[104] MOSAIC: Masked Objective with Selective Adaptation for In-domain Contrastive Learning

Vera Pavlova, Mohammed Makhlouf

Main category: cs.CL

TL;DR: MOSAIC is a multi-stage domain adaptation framework for text embedding models that combines masked language modeling and contrastive learning to adapt general models to specialized domains.

DetailsMotivation: The paper addresses the challenge of adapting large-scale general-domain text embedding models to specialized domains, where domain-specific knowledge is needed while preserving the robust semantic discrimination properties of the original model.

Method: MOSAIC uses a multi-stage framework with joint domain-specific masked supervision, combining masked language modeling (MLM) and contrastive objectives within a unified training pipeline for effective domain adaptation.
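
A small sketch of the joint objective, assuming any encoder that exposes an MLM head and pooled embeddings; the loss weighting and temperature are placeholders:

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, doc_emb, temperature=0.05):
    # In-batch negatives: the i-th query should match the i-th document.
    logits = F.normalize(query_emb, dim=-1) @ F.normalize(doc_emb, dim=-1).T
    labels = torch.arange(query_emb.size(0))
    return F.cross_entropy(logits / temperature, labels)

def joint_loss(mlm_logits, mlm_labels, query_emb, doc_emb, alpha=0.5):
    # MLM on masked in-domain text plus a contrastive term on paired texts.
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    return alpha * mlm + (1 - alpha) * info_nce(query_emb, doc_emb)

# Dummy shapes: batch of 4, sequence of 16, vocab of 1000, embedding dim 128.
loss = joint_loss(torch.randn(4, 16, 1000), torch.randint(0, 1000, (4, 16)),
                  torch.randn(4, 128), torch.randn(4, 128))
print(loss.item())
```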

Result: The method achieves improvements up to 13.4% in NDCG@10 over strong general-domain baselines, validated on both high-resource and low-resource domains through comprehensive ablation studies.

Conclusion: The framework effectively enables domain adaptation of text embedding models through balanced joint supervision and staged adaptation, demonstrating the importance of combining MLM and contrastive objectives.

Abstract: We introduce MOSAIC (Masked Objective with Selective Adaptation for In-domain Contrastive learning), a multi-stage framework for domain adaptation of text embedding models that incorporates joint domain-specific masked supervision. Our approach addresses the challenges of adapting large-scale general-domain text embedding models to specialized domains. By jointly optimizing masked language modeling (MLM) and contrastive objectives within a unified training pipeline, our method enables effective learning of domain-relevant representations while preserving the robust semantic discrimination properties of the original model. We empirically validate our approach on both high-resource and low-resource domains, achieving improvements up to 13.4% in NDCG@10 (Normalized Discounted Cumulative Gain) over strong general-domain baselines. Comprehensive ablation studies further demonstrate the effectiveness of each component, highlighting the importance of balanced joint supervision and staged adaptation.

[105] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag

Main category: cs.CL

TL;DR: Quality Estimation metrics in machine translation exhibit systematic length biases: they over-predict errors in longer translations and prefer shorter translations, even when quality is equal, due to skewed training data distributions.

DetailsMotivation: Quality Estimation metrics are crucial for reference-free evaluation in machine translation and as reward signals in reinforcement learning, but the prevalence and impact of length bias in these metrics has been underexplored, potentially leading to unfair evaluation and suboptimal decision-making.

Method: Conducted systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, identified two critical length biases, investigated root causes through analysis of supervision distributions, and applied length normalization during training as a diagnostic intervention.
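
One way to picture the intervention, as a sketch of a length-normalized training target; the paper's exact recipe may differ:

```python
# Normalize error-derived quality targets by translation length so longer
# translations are not penalized simply for containing more tokens.
def length_normalized_target(num_error_tokens: int, translation_len: int) -> float:
    # Error *rate* rather than raw error count decouples the label from length.
    return 1.0 - num_error_tokens / max(translation_len, 1)

# A 40-token translation with 2 flagged tokens keeps a high target,
# and a long but error-free translation is not penalized at all.
print(length_normalized_target(2, 40))   # 0.95
print(length_normalized_target(0, 40))   # 1.0
```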

Result: Found that QE metrics consistently over-predict errors with increasing translation length (even for error-free texts) and exhibit preference for shorter translations when multiple candidates are available. Length normalization during training effectively decouples error probability from length and counteracts data skew.

Conclusion: QE metrics have inherent length biases that risk unfairly penalizing longer correct translations and can lead to suboptimal decision-making in applications like QE reranking and reinforcement learning. Simple length normalization during training can mitigate these biases and yield more reliable QE signals.

Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. We further investigate the root cause, presenting evidence that QE models conflate quality with sequence length due to skewed supervision distributions. As a diagnostic intervention, we apply length normalization during training. We show that this simple intervention is sufficient to decouple error probability from length, effectively counteracting the data skew and yielding more reliable QE signals.

[106] Think Twice: Branch-and-Rethink Reasoning Reward Model

Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau

Main category: cs.CL

TL;DR: Two-turn reward model that applies think-twice reasoning to reward modeling, using adaptive branching to select critical dimensions and branch-conditioned rethinking for targeted analysis.

DetailsMotivation: Current reward models compress multiple quality dimensions into a single scalar in one shot, leading to judgment diffusion where attention spreads across criteria, resulting in diluted focus and shallow analysis. The paper aims to address this by transferring the think-twice principle from LLMs to reward modeling.

Method: BR-RM is a two-turn reward model: Turn 1 performs adaptive branching to select instance-critical dimensions (e.g., factuality, safety) and sketches evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only the most important aspects. Training uses GRPO-style reinforcement learning over structured two-turn traces with binary outcome rewards and strict format checks.
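
A schematic of the two-turn flow; `call_llm` and the prompt wording are hypothetical stand-ins, not the paper's templates:

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the reward model's generation call.
    raise NotImplementedError("plug in a model client here")

def branch_and_rethink(question: str, answer: str) -> str:
    # Turn 1: adaptive branching -- pick instance-critical dimensions and
    # sketch evidence-seeking hypotheses.
    turn1 = call_llm(
        "Pick the 1-3 most critical evaluation dimensions (e.g. factuality, "
        "safety) for judging this answer, and state one hypothesis to check "
        f"for each.\nQuestion: {question}\nAnswer: {answer}"
    )
    # Turn 2: branch-conditioned rethinking -- reread the answer, test only
    # those hypotheses, and emit a structured verdict.
    return call_llm(
        f"Hypotheses to verify:\n{turn1}\n\nReread the answer and test each "
        f"hypothesis, scrutinizing only these aspects.\nAnswer: {answer}\n"
        "Finish with <verdict>good</verdict> or <verdict>bad</verdict>."
    )
```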

Result: The model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains, demonstrating reduced judgment diffusion and improved sensitivity to subtle yet consequential errors.

Conclusion: By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves reward modeling performance while remaining practical and scalable for standard RLHF pipelines.

Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.

[107] Language Generation: Complexity Barriers and Implications for Learning

Marcelo Arenas, Pablo Barceló, Luis Cofré, Alexander Kozachinskiy

Main category: cs.CL

TL;DR: The paper studies computational feasibility of language generation in the limit, showing infeasibility even for simple language classes like context-free and regular languages, establishing a gap between theoretical possibility and practical feasibility.

DetailsMotivation: While previous work showed language generation is theoretically possible in the limit (given enough positive examples), these existence results don't address practical feasibility. The paper aims to investigate the sample complexity and computational feasibility of language generation for various formal language classes.

Method: The authors analyze the sample complexity of language generation in the limit for several canonical classes of formal languages, including context-free languages, regular languages, locally threshold testable languages, and non-erasing pattern languages. They examine computational feasibility through theoretical analysis of these language classes.

Result: Results show infeasibility already appears for context-free and regular languages, and persists even for strict subclasses like locally threshold testable languages, as well as for incomparable classes like non-erasing pattern languages. This establishes a clear gap between theoretical possibility and computational feasibility.

Conclusion: There is a significant gap between the theoretical possibility of language generation in the limit (established by Kleinberg and Mullainathan) and its computational feasibility, with infeasibility appearing even for relatively simple language classes.

Abstract: Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for context-free and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.

[108] TS-PEFT: Unveiling Token-Level Redundancy in Parameter-Efficient Fine-Tuning

Dabiao Ma, Ziming Dai, Zhimin Xin, Shu Wang, Jian Yang, Haojun Fei

Main category: cs.CL

TL;DR: TS-PEFT: A token-sparse PEFT framework that identifies and discards redundant token updates during fine-tuning, achieving comparable or better performance than dense methods while updating only 30-70% of tokens.

DetailsMotivation: Current PEFT methods assume all tokens contribute equally to downstream tasks and require parameter updates, but this paper challenges that convention by revealing pervasive token-level redundancy in fine-tuning large models.

Method: TS-PEFT uses proximal optimization as a dynamic probe to identify token-level redundancy during fine-tuning, allowing selective updates of only important tokens while discarding redundant ones.
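
A minimal sketch of the proximal step, assuming a group soft-threshold over per-token adapter updates (the paper's exact operator may differ):

```python
import torch

def proximal_token_sparsify(delta, lam):
    # delta: (num_tokens, hidden) candidate per-token updates from the adapter.
    norms = delta.norm(dim=-1, keepdim=True)                  # per-token group norm
    scale = torch.clamp(1.0 - lam / (norms + 1e-8), min=0.0)  # group soft-threshold
    return delta * scale                                      # small groups -> exactly zero

updates = torch.randn(16, 64) * torch.linspace(0.1, 2.0, 16).unsqueeze(-1)
sparse_updates = proximal_token_sparsify(updates, lam=4.0)
kept = (sparse_updates.norm(dim=-1) > 0).float().mean()
print(f"fraction of tokens still updated: {kept:.2f}")
```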

Result: Extensive experiments show that discarding 30-70% of token updates consistently matches or exceeds performance of dense baselines like LoRA and DoRA, while learned token-level sparsity serves as a superior indicator of module importance.

Conclusion: Indiscriminately updating all tokens is computationally superfluous and introduces optimization noise; token-level sparsity provides a novel data-driven perspective on LM adaptation mechanisms.

Abstract: Current Parameter-Efficient Fine-Tuning (PEFT) methods typically operate under an implicit assumption: Once a target module is selected, every token passing through it contributes equally to the downstream task and requires a parameter update. In this paper, we challenge this convention by revealing a pervasive token-level redundancy in the fine-tuning of large models (LMs). We propose TS-PEFT, a theoretical framework utilizing proximal optimization that acts as a dynamic probe to identify token-level redundancy during the fine-tuning process. Extensive experiments demonstrate that indiscriminately updating all tokens is not only computationally superfluous but often introduces optimization noise. Surprisingly, by discarding 30%-70% of token updates, TS-PEFT consistently matches or exceeds the performance of dense baselines such as LoRA and DoRA. Our in-depth analysis shows that the learned token-level sparsity is a superior indicator of module importance compared to traditional weight criteria, providing a novel data-driven perspective on the intrinsic adaptation mechanism of LMs.

[109] Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Alla Chepurova, Aydar Bulatov, Mikhail Burtsev, Yuri Kuratov

Main category: cs.CL

TL;DR: Wikontic is a multi-stage pipeline for constructing high-quality knowledge graphs from open-domain text using Wikidata-based constraints and entity normalization, achieving strong performance on QA benchmarks with compact, ontology-consistent KGs.

DetailsMotivation: Current LLM-based systems use knowledge graphs primarily as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. There's a need for methods that can construct high-quality, compact, and ontology-consistent KGs directly from text to better leverage structured knowledge in LLMs.

Method: A multi-stage pipeline that: 1) extracts candidate triplets with qualifiers from text, 2) enforces Wikidata-based type and relation constraints to ensure ontology consistency, and 3) normalizes entities to reduce duplication. The approach focuses on creating compact, well-connected KGs with minimal token usage.
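
A toy sketch of the constraint-checking stage, with a hand-written ontology standing in for Wikidata's much richer type and relation constraints:

```python
# Allowed (subject_type, relation, object_type) combinations; a real system
# would derive these from Wikidata property constraints.
ALLOWED = {
    ("human", "educated_at", "organization"),
    ("human", "date_of_birth", "time"),
    ("organization", "headquarters_location", "place"),
}

def keep_triplet(subj_type, relation, obj_type):
    # Drop candidate triplets whose types violate the relation's constraints.
    return (subj_type, relation, obj_type) in ALLOWED

candidates = [
    ("Marie Curie", "human", "educated_at", "University of Paris", "organization"),
    ("Marie Curie", "human", "educated_at", "1891", "time"),  # wrong object type
]
valid = [c for c in candidates if keep_triplet(c[1], c[2], c[4])]
print(valid)  # only the ontology-consistent triplet survives
```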

Result: Achieves 96% coverage of correct answer entities in generated triplets on MuSiQue. On HotpotQA: 76.0 F1 (triplets-only), and on MuSiQue: 59.8 F1, matching or surpassing retrieval-augmented generation baselines. State-of-the-art 86% information-retention on MINE-1 benchmark. Efficient with <1,000 output tokens (3× fewer than AriGraph, <1/20 of GraphRAG).

Conclusion: Wikontic provides a scalable solution for constructing high-quality knowledge graphs from text, enhancing KG quality and offering practical benefits for leveraging structured knowledge in LLMs while maintaining efficiency.

Abstract: Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3$\times$ fewer than AriGraph and $<$1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.

[110] LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision

Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, Wynne Hsu

Main category: cs.CL

TL;DR: LogicReward: A novel reward system that uses theorem provers to enforce step-level logical correctness in LLM reasoning, improving reasoning faithfulness and performance on logical tasks.

DetailsMotivation: Existing LLM training methods rely on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount.

Method: Proposes LogicReward, a reward system that guides model training by enforcing step-level logical correctness using a theorem prover. Introduces Autoformalization with Soft Unification to reduce natural language ambiguity and improve formalization quality for effective theorem prover use.
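
A schematic of a step-level reward; `prover_verifies` is a hypothetical wrapper around an external theorem prover, and the reward shape is an assumption:

```python
from typing import List

def prover_verifies(premises: List[str], conclusion: str) -> bool:
    # Hypothetical wrapper around a theorem prover backend.
    raise NotImplementedError("call a theorem prover here")

def logic_reward(formalized_steps: List[dict]) -> float:
    # Each step carries formalized premises plus the conclusion it claims;
    # the reward is the fraction of steps the prover can actually entail.
    checks = [prover_verifies(s["premises"], s["conclusion"]) for s in formalized_steps]
    return sum(checks) / max(len(checks), 1)
```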

Result: An 8B model trained with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks. Enhances reasoning faithfulness, improves generalizability to unseen tasks (math and commonsense reasoning), and provides reliable reward signals without ground-truth labels.

Conclusion: LogicReward effectively improves logical reasoning in LLMs through theorem-prover-based step-level supervision, enhancing reasoning quality and generalizability across diverse reasoning tasks.

Abstract: Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at https://llm-symbol.github.io/LogicReward.

[111] LLM-based Few-Shot Early Rumor Detection with Imitation Agent

Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu

Main category: cs.CL

TL;DR: Proposes a novel Early Rumor Detection framework combining an autonomous agent for early time point determination and an LLM-based detection model for rumor classification, enabling few-shot learning with minimal training requirements.

DetailsMotivation: Early Rumor Detection is challenging in data-scarce settings, and while LLMs perform well in few-shot NLP tasks, they are not suited for time-series data and are computationally expensive for both training and inference.

Method: Combines an autonomous agent and an LLM-based detection model where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. Only the lightweight agent needs training, keeping the LLM training-free.

Result: Extensive experiments on four real-world datasets show the approach boosts performance across LLMs and surpasses existing EARD methods in both accuracy and earliness.

Conclusion: The proposed framework offers the first solution for few-shot Early Rumor Detection, requiring only lightweight agent training while leveraging LLMs’ powerful detection capabilities without additional training.

Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for early time point determination, while the LLM serves as a powerful rumor detector. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

[112] Mix, MinHash, and Match: Cross-Source Agreement for Multilingual Pretraining Datasets

Sultan Alrashed, Francesco Orabona

Main category: cs.CL

TL;DR: MixMinMatch uses cross-source agreement from multiple web corpora as a quality filter to create high-quality multilingual pretraining datasets without additional computation beyond standard deduplication.

DetailsMotivation: Web scraping for multilingual LLM pretraining is expensive and redundant, with significant duplication across sources. This redundancy can be leveraged as a quality signal rather than just being wasteful.

Method: Proposes MixMinMatch: combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents independently recovered by multiple sources. Cross-source agreement serves as a model-free quality filter.
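
A toy sketch of the cross-source agreement signal, using a from-scratch MinHash for illustration; production pipelines would use banded LSH over full corpora rather than pairwise comparisons:

```python
import hashlib

def shingles(text, n=5):
    # Word n-grams used as the document's set representation.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash(text, num_perm=64):
    # One min-hash per seeded hash function approximates a random permutation.
    sig = []
    for seed in range(num_perm):
        sig.append(min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                       for s in shingles(text)))
    return tuple(sig)

def jaccard_estimate(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river"
if jaccard_estimate(minhash(doc_a), minhash(doc_b)) > 0.7:
    print("recovered by both sources -> treat as cross-source agreement")
```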

Result: Applied to Arabic, Turkish, and Hindi, producing corpora that match or exceed best single-source baselines while providing up to 4× more unique tokens. Arabic subset shows 4.5% improvement over ArabicWeb24, Turkish shows 5.5% improvement over FineWeb-2.

Conclusion: Cross-source agreement is an effective, free quality signal for multilingual pretraining data. MixMinMatch enables creation of higher-quality datasets from existing redundant web corpora without additional computational cost.

Abstract: Multilingual data from the web is essential for LLM pretraining. Yet, scraping it is expensive, and research groups repeatedly crawl the same content. For example, we found that over 40% of tokens across major Arabic web corpora are duplicated between sources. In this work, we propose to use this wasteful redundancy as a quality signal to create high-quality pretraining datasets. Our key insight is that cross-source agreement functions as a free, model-free quality filter: content retained by multiple independent pipelines is more likely to represent high-quality text. Crucially, this signal requires no additional computation beyond standard deduplication, which is already performed at scale when pretraining language models. So, we propose MixMinMatch, a method that combines multiple existing web corpora, performs cross-dataset MinHash deduplication, and identifies documents independently recovered by multiple sources. We apply MixMinMatch to Arabic, Turkish, and Hindi, producing corpora that match or exceed the quality of the best single-source baselines, while providing up to 4$\times$ more unique tokens. On Arabic, our matched subset achieves a 4.5% relative improvement over ArabicWeb24, while on Turkish, we improve over FineWeb-2 by 5.5%. We release the datasets at: https://huggingface.co/collections/AdaMLLab/mixminmatch

[113] Discovering Multi-Scale Semantic Structure in Text Corpora Using Density-Based Trees and LLM Embeddings

Thomas Haschka, Joseph Bakarji

Main category: cs.CL

TL;DR: Hierarchical density modeling on LLM embeddings reveals multi-scale semantic organization without predefined taxonomies, showing how compact semantic groups merge into broader themes.

DetailsMotivation: Current web-scale systems rely on flat clustering or predefined taxonomies, limiting insight into hierarchical topic relationships. There's a need for methods that can reveal multi-scale semantic organization directly from data.

Method: Operationalizes hierarchical density modeling on large language model embeddings by progressively relaxing local density constraints instead of enforcing fixed taxonomies or single clustering resolutions.
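
A toy sketch of the multi-scale behavior: as a local density or radius constraint is relaxed, compact groups merge into broader ones. Random 2-D points stand in for LLM embeddings, and single-linkage merging stands in for the paper's density trees:

```python
import numpy as np

def clusters_at_radius(points, eps):
    # Union-find over pairs within radius eps; returns the number of clusters.
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

rng = np.random.default_rng(0)
points = np.concatenate([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 1.0, 5.0)])
for eps in (0.3, 1.5, 6.0):   # relaxing the constraint merges groups
    print(eps, clusters_at_radius(points, eps), "clusters")
```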

Result: Semantic alignment peaks at intermediate density levels, with abrupt transitions corresponding to meaningful changes in semantic resolution. Applied to large institutional and scientific corpora, it exposes dominant fields, cross-disciplinary proximities, and emerging thematic clusters.

Conclusion: Framing hierarchical structure as an emergent property of density in embedding spaces provides interpretable, multi-scale representation of semantic structure suitable for large, evolving text collections.

Abstract: Recent advances in large language models enable documents to be represented as dense semantic embeddings, supporting similarity-based operations over large text collections. However, many web-scale systems still rely on flat clustering or predefined taxonomies, limiting insight into hierarchical topic relationships. In this paper we operationalize hierarchical density modeling on large language model embeddings in a way not previously explored. Instead of enforcing a fixed taxonomy or single clustering resolution, the method progressively relaxes local density constraints, revealing how compact semantic groups merge into broader thematic regions. The resulting tree encodes multi-scale semantic organization directly from data, making structural relationships between topics explicit. We evaluate the hierarchies on standard text benchmarks, showing that semantic alignment peaks at intermediate density levels and that abrupt transitions correspond to meaningful changes in semantic resolution. Beyond benchmarks, the approach is applied to large institutional and scientific corpora, exposing dominant fields, cross-disciplinary proximities, and emerging thematic clusters. By framing hierarchical structure as an emergent property of density in embedding spaces, this method provides an interpretable, multi-scale representation of semantic structure suitable for large, evolving text collections.

[114] Towards Computational Chinese Paleography

Yiran Rex Ma

Main category: cs.CL

TL;DR: Position paper on computational Chinese paleography’s evolution from isolated visual tasks to integrated digital ecosystems using AI, covering datasets, methodological pipelines, and advocating for multimodal, few-shot, human-centric systems.

DetailsMotivation: To chart the trajectory of computational Chinese paleography as it evolves from automating isolated visual tasks to creating integrated digital ecosystems for scholarly research, addressing the field's methodological pipeline and technological shifts.

Method: Analysis of digital resources and datasets for ancient scripts, examination of methodological pipeline from visual processing to advanced reasoning, study of technological shift from classical computer vision to deep learning including transformers and multimodal models.

Result: Mapping of the field’s landscape, identification of core challenges (data scarcity, AI-humanistic disconnect), and synthesis of future research agenda focused on multimodal, few-shot, and human-centric systems.

Conclusion: Computational Chinese paleography is evolving toward integrated digital ecosystems, requiring multimodal AI systems that can handle few-shot learning and augment rather than replace scholarly expertise.

Abstract: Chinese paleography, the study of ancient Chinese writing, is undergoing a computational turn powered by artificial intelligence. This position paper charts the trajectory of this emerging field, arguing that it is evolving from automating isolated visual tasks to creating integrated digital ecosystems for scholarly research. We first map the landscape of digital resources, analyzing critical datasets for oracle bone, bronze, and bamboo slip scripts. The core of our analysis follows the field’s methodological pipeline: from foundational visual processing (image restoration, character recognition), through contextual analysis (artifact rejoining, dating), to the advanced reasoning required for automated decipherment and human-AI collaboration. We examine the technological shift from classical computer vision to modern deep learning paradigms, including transformers and large multimodal models. Finally, we synthesize the field’s core challenges – notably data scarcity and a disconnect between current AI capabilities and the holistic nature of humanistic inquiry – and advocate for a future research agenda focused on creating multimodal, few-shot, and human-centric systems to augment scholarly expertise.

[115] Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

Miao Li, Hanyang Jiang, Sikai Cheng, Hengyu Fu, Yuhang Cai, Baihe Huang, Tinghan Ye, Xuanzhou Chen, Pascal Van Hentenryck

Main category: cs.CL

TL;DR: PVF is a training-free decoding paradigm for Diffusion Language Models that uses hierarchical planning with semantic anchors and verification protocols to reduce function evaluations by up to 65% while maintaining accuracy.

DetailsMotivation: Current decoding strategies for Diffusion Language Models (DLMs) are reactive and underutilize global bidirectional context, failing to effectively plan global trajectories for text generation.

Method: Proposes Plan-Verify-Fill (PVF): 1) Actively constructs hierarchical skeleton by prioritizing high-leverage semantic anchors, 2) Employs verification protocol for pragmatic structural stopping, 3) Training-free paradigm that grounds planning via quantitative validation.

Result: PVF reduces Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets on LLaDA-8B-Instruct and Dream-7B-Instruct, achieving superior efficiency without compromising accuracy.

Conclusion: PVF demonstrates that active planning with verification protocols can significantly improve decoding efficiency for Diffusion Language Models, offering a promising direction for non-sequential text generation.

Abstract: Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

[116] Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology

Peter Sullivan, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: Arab Voices: A standardized framework for Dialectal Arabic ASR that unifies 31 datasets across 14 dialects with harmonized metadata and evaluation tools, addressing fragmentation in DA speech data.

DetailsMotivation: Dialectal Arabic speech data suffers from fragmentation - varying domain coverage, inconsistent dialect labeling, and different recording conditions, making cross-dataset comparison and model evaluation difficult.

Method: 1) Computational analysis of linguistic “dialectness” and audio quality across DA corpora; 2) Development of Arab Voices framework that unifies 31 datasets with harmonized metadata; 3) Benchmarking of recent ASR systems on this standardized framework.

Result: Found substantial heterogeneity in acoustic conditions and dialectal signals across datasets; Created Arab Voices framework providing unified access to 31 datasets across 14 dialects; Established strong baselines for modern DA ASR.

Conclusion: Standardized characterization beyond coarse labels is needed for DA speech data; Arab Voices framework reduces fragmentation and supports reproducible evaluation for Dialectal Arabic ASR research.

Abstract: Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic "dialectness" alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.

[117] AfroScope: A Framework for Studying the Linguistic Landscape of Africa

Sang Yun Kwon, AbdelRahim Elmadany, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: AfroScope: A unified framework for African language identification covering 713 languages with hierarchical classification for confusable language varieties.

DetailsMotivation: Existing language identification (LID) systems have limited coverage for African languages and struggle with fine-grained distinctions among closely related varieties, hindering reliable downstream NLP applications for African linguistic content.

Method: Proposes AfroScope framework with AfroScope-Data (dataset covering 713 African languages) and AfroScope-Models (LID models). Introduces hierarchical classification using Mirror-Serengeti embedding model for 29 closely related languages to improve discrimination of confusable varieties.

Result: Hierarchical approach improves macro F1 by 4.55 on confusable language subset compared to best base model. Framework enables large-scale measurement of Africa’s linguistic landscape in digital text.

Conclusion: AfroScope provides comprehensive African LID capability with improved handling of confusable languages, positioning it as enabling technology for African NLP applications and linguistic landscape analysis.

Abstract: Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross-linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large-scale measurement of Africa's linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.

[118] DRIFT: Detecting Representational Inconsistencies for Factual Truthfulness

Rohan Bhatnagar, Youran Sun, Chi Andrew Zhang, Yixin Wen, Haizhao Yang

Main category: cs.CL

TL;DR: A lightweight probe that reads confidence signals from intermediate LLM layers enables real-time hallucination detection with minimal overhead, allowing for efficient routing of uncertain queries to stronger models.

DetailsMotivation: Current methods for detecting LLM hallucinations require multiple sampling passes or post-hoc verification, which adds significant latency and computational cost. The authors hypothesize that intermediate layers contain confidence signals that are lost in the final output, and aim to develop a more efficient detection method.

Method: Proposes a lightweight probe that reads confidence signals directly from hidden states of intermediate LLM layers. The probe adds less than 0.1% computational overhead and runs fully in parallel with generation, enabling hallucination detection before answers are produced. This is used to build an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models.
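
A minimal sketch of the probing-and-routing setup, assuming a linear probe over an intermediate layer's hidden state; the layer choice, probe form, and routing threshold are assumptions:

```python
import torch
import torch.nn as nn

class ConfidenceProbe(nn.Module):
    # Tiny probe trained to predict whether the eventual answer is likely correct.
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state):              # (batch, hidden_dim)
        return torch.sigmoid(self.linear(hidden_state)).squeeze(-1)

probe = ConfidenceProbe(hidden_dim=4096)
# Hidden states from, say, the last prompt token at an intermediate layer.
h = torch.randn(2, 4096)
confidence = probe(h)
route_to_stronger_model = confidence < 0.5       # delegate uncertain queries
print(confidence, route_to_stronger_model)
```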

Result: Achieves state-of-the-art AUROC on 10 out of 12 settings across four QA benchmarks and three LLM families, with gains of up to 13 points over prior methods. The method generalizes across dataset shifts without retraining.

Conclusion: Intermediate LLM layers encode useful confidence signals that can be efficiently extracted for real-time hallucination detection, enabling practical deployment of LLM routing systems with minimal computational overhead.

Abstract: LLMs often produce fluent but incorrect answers, yet detecting such hallucinations typically requires multiple sampling passes or post-hoc verification, adding significant latency and cost. We hypothesize that intermediate layers encode confidence signals that are lost in the final output layer, and propose a lightweight probe to read these signals directly from hidden states. The probe adds less than 0.1% computational overhead and can run fully in parallel with generation, enabling hallucination detection before the answer is produced. Building on this, we develop an LLM router that answers confident queries immediately while delegating uncertain ones to stronger models. Despite its simplicity, our method achieves SOTA AUROC on 10 out of 12 settings across four QA benchmarks and three LLM families, with gains of up to 13 points over prior methods, and generalizes across dataset shifts without retraining.

[119] Select or Project? Evaluating Lower-dimensional Vectors for LLM Training Data Explanations

Lukas Hinterleitner, Loris Schoenegger, Benjamin Roth

Main category: cs.CL

TL;DR: Gradient-based influence estimation for LLMs is computationally expensive; selecting architecturally informed subsets of model components outperforms full gradients and random projections for retrieval tasks while being more efficient.

DetailsMotivation: Gradient-based methods for explaining LLM predictions face computational challenges due to high-dimensional gradients. Current approaches use ad hoc parameter subsets without systematic evaluation, creating a need for principled methods to make influence estimation tractable.

Method: The paper investigates two approaches: selecting small, architecturally informed subsets of model components vs. projecting full gradients into lower-dimensional space. Uses a novel benchmark to compare greedy component selection against full gradients and random projections for influence estimation in retrieval tasks.
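
A small sketch contrasting the two options, with illustrative parameter-block names: (a) keep gradients only from a selected subset of components, or (b) randomly project the full gradient to the same dimensionality; influence is then a dot product between such train/test representations:

```python
import torch

def select_components(grads: dict, chosen: list) -> torch.Tensor:
    # (a) concatenate gradients from a small, architecturally informed subset
    return torch.cat([grads[name].flatten() for name in chosen])

def random_projection(grads: dict, proj: torch.Tensor) -> torch.Tensor:
    # (b) project the full flattened gradient into a lower dimension
    full = torch.cat([g.flatten() for g in grads.values()])
    return proj @ full

grads = {"layer1.attn": torch.randn(256), "layer1.mlp": torch.randn(512),
         "layer2.attn": torch.randn(256), "layer2.mlp": torch.randn(512)}
low_dim_a = select_components(grads, ["layer2.mlp"])           # 512 dims
proj = torch.randn(512, sum(g.numel() for g in grads.values())) / 512 ** 0.5
low_dim_b = random_projection(grads, proj)                     # also 512 dims
print(low_dim_a.shape, low_dim_b.shape)
```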

Result: Greedily selected component subsets capture training data influence information more effectively than full gradients or random projections for retrieval tasks. This approach is also more computationally efficient than random projection methods.

Conclusion: Targeted component selection based on architectural insights provides a practical strategy for making instance-based explanations of large models computationally feasible, outperforming both full gradient analysis and dimensionality reduction techniques.

Abstract: Gradient-based methods for instance-based explanation for large language models (LLMs) are hindered by the immense dimensionality of model gradients. In practice, influence estimation is restricted to a subset of model parameters to make computation tractable, but this subset is often chosen ad hoc and rarely justified by systematic evaluation. This paper investigates if it is better to create low-dimensional representations by selecting a small, architecturally informed subset of model components or by projecting the full gradients into a lower-dimensional space. Using a novel benchmark, we show that a greedily selected subset of components captures the information about training data influence needed for a retrieval task more effectively than either the full gradient or random projection. We further find that this approach is more computationally efficient than random projection, demonstrating that targeted component selection is a practical strategy for making instance-based explanations of large models more computationally feasible.

[120] D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models

Jia Gu, Liang Pang, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: LLMs show two distinct probabilistic sampling behaviors: D-models (like Qwen-2.5) have high variability in token probabilities with poor task alignment, while E-models (like Mistral-Small) show stable probabilities with better task alignment, creating trade-offs between diversity and stability.

DetailsMotivation: To investigate whether LLMs' fine-grained sampling probabilities (P_token) faithfully align with task-level target distributions (P_task) in real-world applications like recommendation, search, and conversational agents, where probabilistic alignment matters for reliability.

Method: Used controlled distribution-sampling simulations to analyze token probability behavior, evaluated models on downstream tasks (code generation, recommendation), and analyzed internal properties to understand underlying mechanisms.

Result: Discovered dichotomy: D-models show large step-to-step variability in P_token with poor P_task alignment; E-models show stable P_token with better P_task alignment. Found systematic trade-offs between diversity and stability affecting task outcomes.

Conclusion: Provides foundational insights into LLM probabilistic sampling behavior and practical guidance for model selection: D-models for diversity, E-models for stability, helping balance diversity with reliability in web-scale applications.

Abstract: The predictive probability of the next token (P_token) in large language models (LLMs) is inextricably linked to the probability of relevance for the next piece of information, the purchase probability of the next product, and the execution probability of the next action-all of which fall under the scope of the task-level target distribution (P_task). While LLMs are known to generate samples that approximate real-world distributions, whether their fine-grained sampling probabilities faithfully align with task requirements remains an open question. Through controlled distribution-sampling simulations, we uncover a striking dichotomy in LLM behavior, distinguishing two model types: D-models (e.g. Qwen-2.5), whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models (e.g. Mistral-Small), whose P_token is more stable and better aligned with P_task. We further evaluate these two model types in downstream tasks such as code generation and recommendation, revealing systematic trade-offs between diversity and stability that shape task outcomes. Finally, we analyze the internal properties of both model families to probe their underlying mechanisms. These findings offer foundational insights into the probabilistic sampling behavior of LLMs and provide practical guidance on when to favor D- versus E-models. For web-scale applications, including recommendation, search, and conversational agents, our results inform model selection and configuration to balance diversity with reliability under real-world uncertainty, providing a better level of interpretation.

[121] LLMs versus the Halting Problem: Revisiting Program Termination Prediction

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O’Hearn

Main category: cs.CL

TL;DR: LLMs show strong performance in predicting program termination for C programs, with GPT-5 and Claude Sonnet-4.5 ranking close to top specialized tools, though they struggle with providing valid proofs and longer programs.

DetailsMotivation: The paper investigates whether large language models can reliably predict program termination, given that the Halting Problem is undecidable and existing verification tools are language-specific and approximate. Recent LLM successes prompt exploration of their potential for reasoning about undecidable problems.

Method: Evaluated LLMs on diverse C programs from the Termination category of SV-Comp 2025. Compared performance of models like GPT-5, Claude Sonnet-4.5, and Code World Model against specialized verification tools using test-time-scaling.

Result: LLMs perform remarkably well at predicting program termination, with GPT-5 and Claude Sonnet-4.5 ranking just behind the top-ranked tool, and CWM placing behind the second-ranked tool. However, LLMs often fail to provide valid proof witnesses, and performance degrades with increasing program length.

Conclusion: LLMs show promise for program termination prediction despite the undecidable nature of the problem, but have limitations in proof generation and scalability. The findings motivate further research into LLMs for reasoning about undecidable problems.

Abstract: Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLMs' performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.

[122] One Token Is Enough: Improving Diffusion Language Models with a Sink Token

Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Shaosheng Cao

Main category: cs.CL

TL;DR: DLMs suffer from moving sink instability; adding a dedicated sink token with self-only attention stabilizes attention and improves performance

DetailsMotivation: Diffusion Language Models (DLMs) enable parallel text generation but suffer from critical instability called the "moving sink phenomenon" where sink tokens unpredictably shift across diffusion steps, undermining inference robustness

Method: Introduce a simple extra sink token via modified attention mask: a special token constrained to attend solely to itself while remaining globally visible to all other tokens, creating a dedicated structural sink
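
A minimal sketch of the modified mask, assuming a bidirectional attention pattern with one extra sink token at index 0:

```python
import torch

def build_sink_mask(seq_len: int) -> torch.Tensor:
    # Rows are queries, columns are keys; True = attention allowed.
    n = seq_len + 1
    mask = torch.ones(n, n, dtype=torch.bool)    # bidirectional over real tokens
    mask[0, :] = False                           # the sink attends to nothing ...
    mask[0, 0] = True                            # ... except itself
    # Column 0 stays True, so the sink remains globally visible to all tokens.
    return mask

print(build_sink_mask(4).int())
```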

Result: Adding a single extra token stabilizes attention sinks and substantially improves model performance; effectiveness is independent of token position and has negligible semantic content, validating its role as a robust structural sink

Conclusion: The moving sink phenomenon in DLMs can be effectively addressed by introducing a dedicated sink token with self-only attention, providing a simple but effective solution to improve inference stability and performance

Abstract: Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance. Despite these advantages, there is a critical instability in DLMs: the moving sink phenomenon. Our analysis indicates that sink tokens exhibit low-norm representations in the Transformer’s value space, and that the moving sink phenomenon serves as a protective mechanism in DLMs to prevent excessive information mixing. However, their unpredictable positions across diffusion steps undermine inference robustness. To resolve this, we propose a simple but effective extra sink token implemented via a modified attention mask. Specifically, we introduce a special token constrained to attend solely to itself, while remaining globally visible to all other tokens. Experimental results demonstrate that introducing a single extra token stabilizes attention sinks, substantially improving model performance. Crucially, further analysis confirms that the effectiveness of this token is independent of its position and characterized by negligible semantic content, validating its role as a robust and dedicated structural sink.

[123] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems

Haoyuan Yu, Yuxuan Chen, Minjie Cai

Main category: cs.CL

TL;DR: A full-duplex voice interaction framework using multimodal LLM with conversational unit decomposition for natural human-computer dialogue

DetailsMotivation: Full-duplex voice interaction is essential for natural human-computer interaction, but current systems struggle with complex dialogue processing and seamless transitions between conversational units.

Method: Decomposes complex dialogue into minimal conversational units, processes each independently, and predicts transitions using a semi-cascaded full-duplex system built around a multimodal LLM with auxiliary modules (VAD, TTS).

Result: System achieved second place on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction) on the HumDial dataset.

Conclusion: The framework enables train-free, plug-and-play full-duplex voice interaction and demonstrates effectiveness through competitive performance on benchmark datasets.

Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.

[124] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu, Jian Luan

Main category: cs.CL

TL;DR: MobileBench-OL is an online benchmark for mobile GUI agents with 1080 tasks from 80 Chinese apps, evaluating task execution, complex reasoning, and noise robustness in real-world mobile environments.

DetailsMotivation: Existing mobile GUI agent benchmarks focus mainly on task instruction-following while neglecting reasoning, exploration ability, and real-world noise, creating a gap between benchmarks and actual mobile environments.

Method: Proposes MobileBench-OL with 1080 tasks from 80 Chinese apps, organized into 5 subsets to measure task execution, complex reasoning, and noise robustness. Includes an auto-eval framework with reset mechanism for stable, repeatable real-world benchmarking.

Result: Evaluation of 12 leading GUI agents shows significant room for improvement to meet real-world requirements. Human evaluation confirms MobileBench-OL reliably measures agent performance in real environments.

Conclusion: MobileBench-OL addresses limitations of existing benchmarks by providing comprehensive evaluation of GUI agents’ task execution, reasoning, and noise robustness in realistic mobile environments.

Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents’ task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.

[125] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts

Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan, Xinchi Chen, Yining Zheng, Xuanjing Huang, Xipeng Qiu

Main category: cs.CL

TL;DR: AgentLongBench: A new benchmark for evaluating LLM agents in dynamic, interactive environments using Lateral Thinking Puzzles to test non-linear reasoning and iterative feedback handling.

DetailsMotivation: Current benchmarks for LLM agents are static and rely on passive retrieval tasks, failing to simulate real-world agent-environment interactions that require dynamic information synthesis, non-linear reasoning, and iterative feedback processing.

Method: Introduces AgentLongBench which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles, generating interaction trajectories across knowledge-intensive and knowledge-free scenarios. Tests state-of-the-art models and memory systems with context lengths from 32K to 4M tokens.

Result: Agents struggle with dynamic information synthesis essential for workflows despite being adept at static retrieval. Performance degradation is driven by the minimum number of tokens required to resolve a query, explaining why high information density in massive tool responses is more challenging than memory fragmentation in long-turn dialogues.

Conclusion: Current LLM agents have critical weaknesses in handling dynamic, interactive environments requiring non-linear reasoning and iterative feedback, highlighting the need for improved agent architectures that can better synthesize information across complex workflows.

Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce \textbf{AgentLongBench}, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.

cs.CV

[126] MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

Matteo Rossi

Main category: cs.CV

TL;DR: MA-LipNet: A multi-attention network for lipreading that purifies visual features using channel, joint spatial-temporal, and separate spatial-temporal attention modules to improve feature discriminability and generalization.

DetailsMotivation: Existing lipreading methods suffer from limited feature discriminability and poor generalization due to subtle articulatory gestures. The paper aims to address these challenges by purifying visual features from multiple dimensions.

Method: Proposes MA-LipNet with three sequential attention modules: 1) Channel Attention (CA) for channel-wise feature recalibration, 2) Joint Spatial-Temporal Attention (JSTA) for coarse-grained spatio-temporal filtering, and 3) Separate Spatial-Temporal Attention (SSTA) for fine-grained refinement by separately modeling temporal and spatial attention.

Result: Extensive experiments on CMLR and GRID datasets show MA-LipNet significantly reduces Character Error Rate (CER) and Word Error Rate (WER), outperforming several state-of-the-art methods.

Conclusion: Multi-dimensional feature refinement through attention mechanisms is crucial for robust visual speech recognition, and MA-LipNet demonstrates effectiveness in improving lipreading performance.

Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, due to the subtle nature of articulatory gestures, existing lipreading methods often suffer from limited feature discriminability and poor generalization capabilities. To address these challenges, this paper delves into the purification of visual features from temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet lies in its sequential application of three dedicated attention modules. Firstly, a \textit{Channel Attention (CA)} module is employed to adaptively recalibrate channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, \textit{Joint Spatial-Temporal Attention (JSTA)} and \textit{Separate Spatial-Temporal Attention (SSTA)}, are leveraged to suppress the influence of irrelevant pixels and video frames. The JSTA module performs a coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by separately modeling temporal and spatial attentions. Extensive experiments conducted on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
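
For readers unfamiliar with channel attention, the snippet below shows a generic squeeze-and-excitation-style recalibration over video features, which is one plausible reading of the CA module described above; the paper's exact design may differ.

```python
# Generic squeeze-and-excitation-style channel attention over video features; one
# plausible reading of the CA module, not MA-LipNet's actual implementation.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) spatio-temporal feature map
        w = x.mean(dim=(2, 3, 4))                      # squeeze time and space -> (B, C)
        w = self.mlp(w)[:, :, None, None, None]        # per-channel weights in (0, 1)
        return x * w                                   # recalibrate channels

feat = torch.randn(2, 64, 16, 22, 22)
print(ChannelAttention(64)(feat).shape)                # torch.Size([2, 64, 16, 22, 22])
```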

[127] Non-Markov Multi-Round Conversational Image Generation with History-Conditioned MLLMs

Haochen Zhang, Animesh Sinha, Felix Juefei-Xu, Haoyu Ma, Kunpeng Li, Zhipeng Fan, Meng Dong, Xiaoliang Dai, Tingbo Hou, Peizhao Zhang, Zecheng He

Main category: cs.CV

TL;DR: Non-Markov conversational image generation framework that addresses multi-round consistency by forcing models to reference earlier visual states and maintain entity identities across interactions.

DetailsMotivation: Existing multimodal LLMs for image generation are mostly Markovian - they only consider recent context, allowing shortcuts that ignore long-range chat history. This fails when users reference earlier states, undo changes, or mention entities introduced rounds ago.

Method: Three key components: 1) Non-Markov multi-round data construction with rollback editing and name-based personalization, 2) History-conditioned training with token-level caching to prevent identity drift, 3) Reconstruction-based DiT detokenizer and multi-stage fine-tuning curriculum for high-fidelity image generation.

Result: The framework yields substantial improvements in multi-round consistency and instruction compliance while maintaining strong single-round editing and personalization capabilities.

Conclusion: Explicitly training for non-Markov interactions is crucial for true conversational image generation, enabling models to maintain consistency across multiple rounds of interaction by properly referencing and preserving earlier visual states.

Abstract: Conversational image generation requires a model to follow user instructions across multiple rounds of interaction, grounded in interleaved text and images that accumulate as chat history. While recent multimodal large language models (MLLMs) can generate and edit images, most existing multi-turn benchmarks and training recipes are effectively Markov: the next output depends primarily on the most recent image, enabling shortcut solutions that ignore long-range history. In this work we formalize and target the more challenging non-Markov setting, where a user may refer back to earlier states, undo changes, or reference entities introduced several rounds ago. We present (i) non-Markov multi-round data construction strategies, including rollback-style editing that forces retrieval of earlier visual states and name-based multi-round personalization that binds names to appearances across rounds; (ii) a history-conditioned training and inference framework with token-level caching to prevent multi-round identity drift; and (iii) enabling improvements for high-fidelity image reconstruction and editable personalization, including a reconstruction-based DiT detokenizer and a multi-stage fine-tuning curriculum. We demonstrate that explicitly training for non-Markov interactions yields substantial improvements in multi-round consistency and instruction compliance, while maintaining strong single-round editing and personalization.

[128] Text controllable PET denoising

Xuehua Ye, Hongxu Yang, Adam J. Schwarz

Main category: cs.CV

TL;DR: A text-guided denoising method for PET images using CLIP features and U-Net architecture to handle various noise levels with a single model.

DetailsMotivation: PET imaging suffers from complicated noise that obscures diagnostic information, with quality affected by multiple factors including scanner hardware, reconstruction methods, tracer properties, dose levels, and acquisition time.

Method: Proposes a novel text-guided denoising method using features from a pretrained CLIP model combined with a U-Net based denoising architecture to enhance PET images across various count levels within a single model.

Result: Experimental results show significant improvements in both qualitative and quantitative assessments, demonstrating the model’s effectiveness in PET image denoising.

Conclusion: The proposed text-guided denoising model offers flexibility for handling complex denoising demands and potentially reducing acquisition time in PET imaging.

Abstract: Positron Emission Tomography (PET) imaging is a vital tool in medical diagnostics, offering detailed insights into molecular processes within the human body. However, PET images often suffer from complicated noise, which can obscure critical diagnostic information. The quality of the PET image is impacted by various factors including scanner hardware, image reconstruction, tracer properties, dose/count level, and acquisition time. In this study, we propose a novel text-guided denoising method capable of enhancing PET images across a wide range of count levels within a single model. The model combines features from a pretrained CLIP model with a U-Net based denoising model. Experimental results demonstrate that the proposed model leads to significant improvements in both qualitative and quantitative assessments. The flexibility of the model shows its potential for handling more complex denoising demands or reducing acquisition time.
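
The snippet below sketches one common way to inject a pooled text embedding (such as one produced by a frozen CLIP text encoder, which is omitted here) into a convolutional denoiser via FiLM-style modulation. It illustrates the conditioning idea only and is not the authors' architecture.

```python
# FiLM-style text conditioning of a convolutional block; the pooled text embedding is a
# placeholder for e.g. a frozen CLIP text feature (the CLIP call itself is omitted).
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    def __init__(self, channels: int, text_dim: int = 512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=-1)
        h = self.conv(x)
        return h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

x = torch.randn(1, 64, 32, 32)   # intermediate PET feature map
t = torch.randn(1, 512)          # pooled text embedding, e.g. describing the count level
print(TextConditionedBlock(64)(x, t).shape)   # torch.Size([1, 64, 32, 32])
```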

[129] Low performing pixel correction in computed tomography with unrolled network and synthetic data training

Hongxu Yang, Levente Lippenszky, Edina Timko, Lehel Ferenczi, Gopal Avinash

Main category: cs.CV

TL;DR: Unrolled dual-domain method using synthetic data to correct low performance pixel artifacts in CT images without requiring real clinical data.

DetailsMotivation: Low performance pixels in CT detectors cause ring/streak artifacts making images clinically unusable. Existing supervised deep learning methods require expensive real datasets and focus only on image or sinogram domain separately, ignoring intrinsic correlations between domains.

Method: Proposes an unrolled dual-domain method based on synthetic data. Leverages intrinsic correlations between sinogram and image domains through synthetic data generated from natural images, enabling artifact correction without real clinical data.

Result: Outperformed state-of-the-art approaches by a large margin in experiments simulating 1-2% detector defects near the isocenter. Can correct LPP artifacts without the cost of collecting data for training.

Conclusion: Solution corrects LPP artifacts without requiring real-world clinical data, adaptable to different scanner settings for software-based applications.

Abstract: Low performance pixels (LPP) in Computed Tomography (CT) detectors would lead to ring and streak artifacts in the reconstructed images, making them clinically unusable. In recent years, several solutions have been proposed to correct LPP artifacts, either in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, which are expensive to collect. Moreover, existing approaches focus solely either on image-space or sinogram-space correction, ignoring the intrinsic correlations from the forward operation of the CT geometry. In this work, we propose an unrolled dual-domain method based on synthetic data to correct LPP artifacts. Specifically, the intrinsic correlations of LPP between the sinogram and image domains are leveraged through synthetic data generated from natural images, enabling the trained model to correct artifacts without requiring any real-world clinical data. In experiments simulating 1-2% detectors defect near the isocenter, the proposed method outperformed the state-of-the-art approaches by a large margin. The results indicate that our solution can correct LPP artifacts without the cost of data collection for model training, and it is adaptable to different scanner settings for software-based applications.

[130] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu, Wenhua Wu, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua

Main category: cs.CV

TL;DR: TraceRouter is a path-level defense framework that traces and disconnects causal propagation circuits of harmful semantics in large foundation models, achieving better adversarial robustness-utility trade-off than localized neuron suppression methods.

DetailsMotivation: Current defenses for large foundation models rely on the "locality hypothesis" by suppressing isolated neurons or features, but harmful semantics act as distributed, cross-layer circuits, making localized interventions brittle and detrimental to utility.

Method: Three-stage framework: (1) pinpoint sensitive onset layer via attention divergence analysis, (2) use sparse autoencoders and differential activation analysis to disentangle malicious features, (3) map features to downstream causal pathways via feature influence scores from zero-out interventions, then selectively suppress these causal chains.

Result: Extensive experiments show TraceRouter significantly outperforms state-of-the-art baselines, achieving superior trade-off between adversarial robustness and general utility.

Conclusion: TraceRouter provides an effective path-level defense framework that physically severs harmful information flow while preserving orthogonal computation routes, addressing limitations of localized interventions against distributed adversarial circuits.

Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the “locality hypothesis”, suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose \textbf{TraceRouter}, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
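
A toy illustration of the zero-out intervention behind the feature influence scores mentioned above: ablate one sparse feature at a time and measure how much a downstream readout changes. The decoder and readout here are hypothetical stand-ins.

```python
# Toy zero-out intervention: ablate one active sparse feature at a time and record how
# much a downstream readout changes. Decoder and readout are hypothetical stand-ins.
import torch

def feature_influence_scores(features, decoder, readout):
    """features: (num_features,) sparse-autoencoder activations for one token."""
    base = readout(decoder(features))
    scores = torch.zeros(features.shape[0])
    for j in torch.nonzero(features, as_tuple=True)[0]:   # only active features matter
        ablated = features.clone()
        ablated[j] = 0.0
        scores[j] = (base - readout(decoder(ablated))).abs().item()
    return scores

d_model, n_feat = 32, 128
decoder = torch.nn.Linear(n_feat, d_model, bias=False)    # stand-in for the SAE decoder
readout = lambda h: h.sum()                               # stand-in downstream readout
feats = torch.relu(torch.randn(n_feat)) * (torch.rand(n_feat) > 0.9)
print(feature_influence_scores(feats, decoder, readout).topk(3))
```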

[131] AI-based Prediction of Biochemical Recurrence from Biopsy and Prostatectomy Samples

Andrea Camilloni, Chiara Micoli, Nita Mulliqi, Erik Everett Palm, Thorgerdur Palsdottir, Kelvin Szolnoky, Xiaoyi Ji, Sol Erika Boman, Andrea Discacciati, Henrik Grönberg, Lars Egevad, Tobias Nordström, Kimmo Kartasalo, Martin Eklund

Main category: cs.CV

TL;DR: AI model trained on prostate biopsy slides predicts biochemical recurrence risk after radical prostatectomy using foundation models and attention-based multiple instance learning, showing generalizability across external cohorts.

DetailsMotivation: Current prognostic tools for biochemical recurrence after radical prostatectomy are imprecise, creating a need for more accurate AI-based prediction models using histopathology data.

Method: Trained AI model on diagnostic prostate biopsy slides from STHLM3 cohort using foundation models and attention-based multiple instance learning, validated across three external RP cohorts (LEOPARD, CHIMERA, TCGA-PRAD).

Result: Achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70 across external cohorts; integration with clinical variables improved risk stratification; AI incrementally improved prognostication over guideline-based CAPRA-S.

Conclusion: Biopsy-trained histopathology AI can generalize across specimen types for preoperative/postoperative decision making, but added value of multimodal AI approaches over simpler models needs further scrutiny.

Abstract: Biochemical recurrence (BCR) after radical prostatectomy (RP) is a surrogate marker for aggressive prostate cancer with adverse outcomes, yet current prognostic tools remain imprecise. We trained an AI-based model on diagnostic prostate biopsy slides from the STHLM3 cohort (n = 676) to predict patient-specific risk of BCR, using foundation models and attention-based multiple instance learning. Generalizability was assessed across three external RP cohorts: LEOPARD (n = 508), CHIMERA (n = 95), and TCGA-PRAD (n = 379). The image-based approach achieved 5-year time-dependent AUCs of 0.64, 0.70, and 0.70, respectively. Integrating clinical variables added complementary prognostic value and enabled statistically significant risk stratification. Compared with guideline-based CAPRA-S, AI incrementally improved postoperative prognostication. These findings suggest biopsy-trained histopathology AI can generalize across specimen types to support preoperative and postoperative decision making, but the added value of AI-based multimodal approaches over simpler predictive models should be critically scrutinized in further studies.
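
For context, attention-based multiple instance learning typically pools tile-level embeddings into a slide-level representation with learned attention weights; a generic sketch (not the paper's code) follows.

```python
# Generic attention-based MIL pooling head for slide-level prediction; a standard
# construction, shown for illustration rather than as the paper's exact model.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 128):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.head = nn.Linear(feat_dim, 1)              # e.g. a scalar BCR risk score

    def forward(self, tiles: torch.Tensor):
        # tiles: (num_tiles, feat_dim) foundation-model embeddings of one slide
        a = torch.softmax(self.attn(tiles), dim=0)      # attention weight per tile
        slide_emb = (a * tiles).sum(dim=0)              # attention-weighted slide embedding
        return self.head(slide_emb), a

tiles = torch.randn(500, 768)
risk, weights = AttentionMIL()(tiles)
print(risk.shape, weights.shape)                        # torch.Size([1]) torch.Size([500, 1])
```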

[132] BadDet+: Robust Backdoor Attacks for Object Detection

Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak

Main category: cs.CV

TL;DR: BadDet+ is a penalty-based backdoor attack framework for object detection that unifies region misclassification and object disappearance attacks using log-barrier penalty for position/scale invariance and physical robustness.

DetailsMotivation: Backdoor attacks on object detection are poorly understood compared to image classification, with existing methods having critical weaknesses like unrealistic assumptions and lack of physical validation.

Method: Introduces BadDet+, a penalty-based framework using log-barrier penalty to suppress true-class predictions for triggered inputs, achieving position/scale invariance and enhanced physical robustness for both region misclassification and object disappearance attacks.

Result: Achieves superior synthetic-to-physical transfer compared to existing baselines while preserving clean performance, with theoretical analysis confirming the penalty operates within trigger-specific feature subspace.

Conclusion: Highlights significant vulnerabilities in object detection and the necessity for specialized defenses against backdoor attacks.

Abstract: Backdoor attacks pose a severe threat to deep learning, yet their impact on object detection remains poorly understood compared to image classification. While attacks have been proposed, we identify critical weaknesses in existing detection-based methods, specifically their reliance on unrealistic assumptions and a lack of physical validation. To bridge this gap, we introduce BadDet+, a penalty-based framework that unifies Region Misclassification Attacks (RMA) and Object Disappearance Attacks (ODA). The core mechanism utilizes a log-barrier penalty to suppress true-class predictions for triggered inputs, resulting in (i) position and scale invariance, and (ii) enhanced physical robustness. On real-world benchmarks, BadDet+ achieves superior synthetic-to-physical transfer compared to existing RMA and ODA baselines while preserving clean performance. Theoretical analysis confirms the proposed penalty acts within a trigger-specific feature subspace, reliably inducing attacks without degrading standard inference. These results highlight significant vulnerabilities in object detection and the necessity for specialized defenses.
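
A minimal sketch of a log-barrier-style penalty that suppresses the true-class probability on triggered inputs, which is the core mechanism described above; the exact barrier formulation and hyperparameters here are guesses.

```python
# Log-barrier-style suppression of the true class on triggered detections; the barrier
# form and epsilon are illustrative guesses, not BadDet+'s exact loss.
import torch
import torch.nn.functional as F

def log_barrier_suppression(logits, true_cls, eps: float = 1e-6):
    """logits: (N, num_classes) class scores for triggered boxes; true_cls: (N,) labels."""
    p_true = F.softmax(logits, dim=-1).gather(1, true_cls[:, None]).squeeze(1)
    return -torch.log(1.0 - p_true + eps).mean()   # blows up as p_true -> 1, pushing it down

logits = torch.randn(4, 20, requires_grad=True)
loss = log_barrier_suppression(logits, torch.tensor([3, 7, 7, 1]))
loss.backward()
print(round(loss.item(), 4))
```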

[133] EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan, Manuel Toribio, Sandipan Banerjee, Guy Gafni

Main category: cs.CV

TL;DR: EditYourself is a DiT-based framework for audio-driven video-to-video editing that enables transcript-based modification of talking head videos while preserving motion, temporal coherence, speaker identity, and lip synchronization.

DetailsMotivation: Current generative video models focus on creating novel content but lack practical tools for editing existing pre-recorded videos, especially when minor script changes require preserving complex visual elements like motion coherence, speaker identity, and accurate lip synchronization.

Method: A DiT-based framework that builds on a general-purpose video diffusion model, augmenting its video-to-video capabilities with audio conditioning and region-aware, edit-focused training extensions. Uses spatiotemporal inpainting for temporally coherent restructuring of performances.

Result: Enables precise lip synchronization and temporally coherent restructuring of existing performances, including synthesis of realistic human motion in newly added segments while maintaining visual fidelity and identity consistency over long durations.

Conclusion: Represents a foundational step toward generative video models as practical tools for professional video post-production, addressing the critical gap in editing existing videos with audio-driven modifications.

Abstract: Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

[134] Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization

Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu, Guanghui Ye, Yu Guan

Main category: cs.CV

TL;DR: ActionVLM is a vision-language aggregation framework for temporal action localization that mitigates modality bias by preserving vision as the dominant signal while adaptively exploiting language only when beneficial.

DetailsMotivation: Existing vision-language models for temporal action localization tend to overemphasize linguistic priors at the expense of visual performance, leading to pronounced modality bias where language dominates vision.

Method: Proposes ActionVLM with two key components: (1) a debiasing reweighting module that estimates the language advantage (incremental benefit of language over vision-only predictions) and dynamically reweights language modality accordingly, and (2) a residual aggregation strategy that treats language as complementary refinement rather than primary driver.

Result: Experiments on THUMOS14 show the model outperforms state-of-the-art by up to 3.2% mAP, demonstrating effective mitigation of modality bias and improved temporal reasoning.

Conclusion: ActionVLM successfully addresses modality bias in vision-language models for temporal action localization by systematically preserving vision as the dominant signal while adaptively exploiting language only when beneficial.

Abstract: Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage-the incremental benefit of language over vision-only predictions-and dynamically reweights language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP.
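
A toy sketch of the aggregation idea described above: gate the language branch by an estimate of its advantage and add it only as a residual on top of the vision branch. The confidence-gain gate used here is a crude stand-in for the learned language-advantage estimator.

```python
# Toy sketch (not the paper's formulation): gate the language logits by a proxy for the
# "language advantage" and add them as a residual on top of vision-only logits.
import torch

def aggregate(vision_logits: torch.Tensor, lang_logits: torch.Tensor) -> torch.Tensor:
    p_v = vision_logits.softmax(-1).amax(-1)                    # vision-only confidence
    p_f = (vision_logits + lang_logits).softmax(-1).amax(-1)    # fused confidence
    gate = torch.sigmoid(p_f - p_v)[:, None]                    # crude advantage proxy
    return vision_logits + gate * lang_logits                   # residual aggregation

v = torch.randn(8, 20)   # vision-branch action-class logits
l = torch.randn(8, 20)   # language-branch logits
print(aggregate(v, l).shape)   # torch.Size([8, 20])
```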

[135] Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

Main category: cs.CV

TL;DR: Shape-of-Thought (SoT) is a visual chain-of-thought framework for text-to-image generation that addresses compositional challenges like numeracy and attribute binding by generating progressive 2D shape assembly plans.

DetailsMotivation: Current multimodal text-to-image models struggle with compositional structural constraints including generative numeracy, attribute binding, and part-level relations. There's a need for more transparent, process-supervised approaches to improve structural integrity in image generation.

Method: SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate 2D states, enabling progressive shape assembly without external engines. The approach uses a large-scale dataset (SoT-26K) of grounded assembly traces from CAD hierarchies and introduces T2S-CompBench for evaluation.

Result: Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. The method establishes a new paradigm for transparent, process-supervised compositional generation.

Conclusion: SoT provides an effective visual chain-of-thought framework that improves compositional reasoning in text-to-image generation through progressive shape assembly and intermediate state supervision, addressing key limitations in current multimodal models.

Abstract: Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints-notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework that enables progressive shape assembly via coherent 2D projections without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming text-only baselines by around 20%. SoT establishes a new paradigm for transparent, process-supervised compositional generation. The code is available at https://anonymous.4open.science/r/16FE/. The SoT-26K dataset will be released upon acceptance.

[136] An AI Framework for Microanastomosis Motion Assessment

Yan Meng, Eduardo J. Torres-Rodríguez, Marcelle Altshuler, Nishanth Gowda, Arhum Naeem, Recai Yilmaz, Omar Arnaout, Daniel A. Donoho

Main category: cs.CV

TL;DR: AI framework for automated assessment of microanastomosis instrument handling skills using computer vision and machine learning

DetailsMotivation: Traditional evaluation of microsurgical techniques relies on subjective expert judgment with limitations like inter-rater variability, lack of standardized criteria, cognitive bias, and time-intensive manual review. Need for objective, reliable, automated assessment system.

Method: Four-component AI framework: (1) YOLO-based instrument detection, (2) DeepSORT-based instrument tracking, (3) shape descriptor-based instrument tip localization, (4) supervised classification trained on expert-labeled data for proficiency evaluation.

Result: Achieved 97% instrument detection precision with 96% mean Average Precision (mAP50-95) measured by Intersection over Union thresholds from 50% to 95%.

Conclusion: Proposed AI framework effectively automates assessment of microanastomosis instrument handling skills, addressing limitations of traditional subjective evaluation methods.

Abstract: Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. They are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97%, with a mean Average Precision (mAP) of 96%, measured by Intersection over Union (IoU) thresholds ranging from 50% to 95% (mAP50-95).

[137] Bidirectional Cross-Perception for Open-Vocabulary Semantic Segmentation in Remote Sensing Imagery

Jianzheng Wang, Huan Ni

Main category: cs.CV

TL;DR: SDCI is a training-free open-vocabulary semantic segmentation framework for remote sensing imagery that uses cross-model attention fusion, bidirectional cross-graph diffusion, and superpixel-based boundary refinement to improve geometric localization and semantic prediction.

DetailsMotivation: High-resolution remote sensing imagery has densely distributed objects and complex boundaries, requiring better geometric localization and semantic prediction. Existing training-free OVSS methods use simple fusion strategies that don't adequately address these challenges.

Method: Proposes SDCI with three key components: 1) Cross-model attention fusion (CAF) module that injects self-attention maps between CLIP and vision foundation models, 2) Bidirectional cross-graph diffusion refinement (BCDR) module for iterative score enhancement, and 3) Convex-optimization-based superpixel collaborative prediction (CSCP) for boundary refinement.

Result: Experiments on multiple remote sensing semantic segmentation benchmarks show SDCI outperforms existing approaches. Ablation studies confirm traditional superpixel-based methods remain effective within deep learning frameworks.

Conclusion: SDCI effectively addresses the challenges of high-resolution remote sensing imagery through collaborative inference and boundary refinement, demonstrating the continued relevance of traditional object-based methods in modern deep learning frameworks.

Abstract: High-resolution remote sensing imagery is characterized by densely distributed land-cover objects and complex boundaries, which places higher demands on both geometric localization and semantic prediction. Existing training-free open-vocabulary semantic segmentation (OVSS) methods typically fuse CLIP and vision foundation models (VFMs) using “one-way injection” and “shallow post-processing” strategies, making it difficult to satisfy these requirements. To address this issue, we propose a spatial-regularization-aware dual-branch collaborative inference framework for training-free OVSS, termed SDCI. First, during feature encoding, SDCI introduces a cross-model attention fusion (CAF) module, which guides collaborative inference by injecting self-attention maps into each other. Second, we propose a bidirectional cross-graph diffusion refinement (BCDR) module that enhances the reliability of dual-branch segmentation scores through iterative random-walk diffusion. Finally, we incorporate low-level superpixel structures and develop a convex-optimization-based superpixel collaborative prediction (CSCP) mechanism to further refine object boundaries. Experiments on multiple remote sensing semantic segmentation benchmarks demonstrate that our method achieves better performance than existing approaches. Moreover, ablation studies further confirm that traditional object-based remote sensing image analysis methods leveraging superpixel structures remain effective within deep learning frameworks. Code: https://github.com/yu-ni1989/SDCI.

[138] Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

Yuji Lin, Qian Zhao, Zongsheng Yue, Junhui Hou, Deyu Meng

Main category: cs.CV

TL;DR: GeoDiff-LF is a diffusion-based framework that enhances underwater 4-D light field imaging by integrating diffusion priors with light field geometry to mitigate color distortion and improve image quality.

DetailsMotivation: The paper addresses the challenging problem of acquiring high-quality underwater images, which suffer from color distortion and degradation due to light scattering and absorption in water. Traditional underwater imaging methods struggle with these issues, and the authors aim to leverage 4-D light field imaging combined with diffusion models to overcome these limitations.

Method: The proposed GeoDiff-LF framework builds upon SD-Turbo and includes three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues from light field data, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency.

Result: Extensive experiments demonstrate that GeoDiff-LF outperforms existing methods in both visual fidelity and quantitative performance metrics, effectively mitigating color distortion in underwater scenes and advancing the state-of-the-art in underwater imaging enhancement.

Conclusion: GeoDiff-LF successfully integrates diffusion priors with light field geometry to enhance underwater 4-D light field imaging, providing a novel framework that addresses color distortion challenges and improves image quality beyond existing methods.

Abstract: This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

[139] FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen

Main category: cs.CV

TL;DR: FRISM is a fine-grained reasoning injection framework that enhances Vision-Language Models by merging them with Large Reasoning Models at the subspace level, improving reasoning capabilities without compromising visual perception.

DetailsMotivation: Existing methods for enhancing VLMs with reasoning capabilities operate at coarse-grained layer levels, causing a trade-off between injecting reasoning capabilities and preserving visual capabilities. There's a need for more fine-grained approaches that can selectively enhance reasoning while maintaining visual perception.

Method: FRISM uses subspace-level model merging via Singular Value Decomposition (SVD) to decompose LRM task vectors, then adaptively tunes scaling coefficients for each subspace. It employs label-free self-distillation learning with dual-objective optimization using common vision-language perception datasets.

Result: Extensive experiments show FRISM effectively improves reasoning capabilities without compromising original visual capabilities, achieving state-of-the-art performance across diverse visual reasoning benchmarks.

Conclusion: FRISM demonstrates that fine-grained subspace-level merging is an effective approach for enhancing VLMs with reasoning capabilities while preserving their visual perception abilities, outperforming coarse-grained layer-level methods.

Abstract: Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose {FRISM} (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that reasoning capabilities are encoded in distinct subspaces, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with a dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities without compromising the model’s original visual capabilities by consistently achieving state-of-the-art performance across diverse visual reasoning benchmarks.
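
A minimal sketch of subspace-level merging: decompose the reasoning task vector (LRM weights minus base weights) with SVD and merge each singular direction with its own coefficient. How the coefficients are learned is omitted here.

```python
# Subspace-level merging sketch: SVD-decompose the reasoning task vector and rescale
# each singular direction with its own coefficient before adding it back to the VLM
# weight. Coefficient learning (self-distillation) is omitted.
import torch

def subspace_merge(w_vlm, w_lrm, coeffs):
    tau = w_lrm - w_vlm                              # task vector carrying reasoning skill
    U, S, Vh = torch.linalg.svd(tau, full_matrices=False)
    return w_vlm + U @ torch.diag(S * coeffs) @ Vh   # per-subspace scaled re-injection

d_out, d_in = 64, 64
w_vlm, w_lrm = torch.randn(d_out, d_in), torch.randn(d_out, d_in)
coeffs = torch.rand(min(d_out, d_in))                # would normally be learned
print(subspace_merge(w_vlm, w_lrm, coeffs).shape)    # torch.Size([64, 64])
```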

[140] Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval

Zecheng Zhao, Zhi Chen, Zi Huang, Shazia Sadiq, Tong Chen

Main category: cs.CV

TL;DR: GRDR improves two-stage text-to-video retrieval by using generative retrieval with multiple semantic IDs per video and joint training to address semantic ambiguity and cross-modal misalignment, achieving accuracy comparable to dense retrievers with much better efficiency.

DetailsMotivation: Current two-stage text-to-video retrieval systems use generative retrieval for fast recall but suffer from semantic ambiguity (each video having only one semantic ID despite diverse queries) and cross-modal misalignment (semantic IDs derived only from visual features without text supervision).

Method: Proposes GRDR with: 1) query-guided multi-view tokenizer assigning multiple semantic IDs per video for diverse semantic access, 2) joint training of tokenizer and generative retriever via shared codebook to align text and video semantics, and 3) trie-constrained decoding for compact candidate generation followed by dense reranking.

Result: GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating full-corpus retrieval up to 300× on TVR benchmarks.

Conclusion: GRDR effectively addresses limitations of generative retrieval for recall in two-stage TVR, achieving high accuracy with dramatically improved efficiency through multi-view semantic IDs and joint cross-modal training.

Abstract: Text-to-Video Retrieval (TVR) is essential in video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Thus, real-time large-scale applications adopt two-stage retrieval, where a fast recall model gathers a small candidate pool, which is reranked by an advanced dense retriever. Due to hugely reduced candidates, the reranking model can use any off-the-shelf dense retriever without hurting efficiency, meaning the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) replaces dense video embeddings with discrete semantic IDs and retrieves by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it ideal for quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, where each video satisfies diverse queries but is forced into one semantic ID; and (ii) cross-modal misalignment, as semantic IDs are solely derived from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), designing a novel GR method to uplift recalled candidate quality. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer exposing diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook to cast semantic IDs as the semantic bridge between texts and videos. At inference, trie-constrained decoding generates a compact candidate set reranked by a dense model for fine-grained matching. Experiments on TVR benchmarks show GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating up to 300$\times$ in full-corpus retrieval.
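
A small sketch of trie-constrained decoding over semantic IDs: at each step, only tokens that extend some indexed video's ID sequence are allowed. Greedy decoding and a random scoring function stand in for the generative retriever.

```python
# Trie-constrained greedy decoding over semantic IDs; the scoring function is a random
# stand-in for the generative retriever's next-token distribution.
import torch

def build_trie(id_sequences):
    trie = {}
    for seq in id_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def constrained_greedy_decode(step_logits_fn, trie, max_len):
    node, out = trie, []
    for _ in range(max_len):
        if not node:                                      # no valid continuation left
            break
        logits = step_logits_fn(out)                      # (vocab,) scores for next token
        allowed = torch.tensor(sorted(node.keys()))
        tok = allowed[logits[allowed].argmax()].item()    # best token among allowed ones
        out.append(tok)
        node = node[tok]
    return out

corpus_ids = [[5, 2, 9], [5, 7, 1], [3, 2, 2]]            # semantic IDs of indexed videos
trie = build_trie(corpus_ids)
print(constrained_greedy_decode(lambda prefix: torch.randn(16), trie, max_len=3))
```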

[141] Thinker: A vision-language foundation model for embodied intelligence

Baiyu Pan, Daqin Luo, Junpeng Yang, Jiyuan Wang, Yixuan Zhang, Hailin Shi, Jichao Jiao

Main category: cs.CV

TL;DR: Thinker: A vision-language foundation model for embodied intelligence that addresses perspective confusion and temporal reasoning issues in robotics applications through specialized dataset construction and dual-input video comprehension.

DetailsMotivation: Large vision-language models struggle with simple human tasks when applied to robotics, particularly with perspective confusion (third-person vs first-person) and overlooking information in video endings during temporal reasoning.

Method: 1) Construct large-scale dataset for robotic perception and reasoning including ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. 2) Introduce simple yet effective approach that jointly incorporates key frames and full video sequences as inputs to enhance video comprehension.

Result: Achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.

Conclusion: Thinker effectively addresses fundamental challenges in applying vision-language models to robotics through specialized dataset construction and improved video comprehension techniques.

Abstract: When large vision-language models are applied to the field of robotics, they encounter problems that are simple for humans yet error-prone for models. Such issues include confusion between third-person and first-person perspectives and a tendency to overlook information in video endings during temporal reasoning. To address these challenges, we propose Thinker, a large vision-language foundation model designed for embodied intelligence. We tackle the aforementioned issues from two perspectives. Firstly, we construct a large-scale dataset tailored for robotic perception and reasoning, encompassing ego-view videos, visual grounding, spatial understanding, and chain-of-thought data. Secondly, we introduce a simple yet effective approach that substantially enhances the model’s capacity for video comprehension by jointly incorporating key frames and full video sequences as inputs. Our model achieves state-of-the-art results on two of the most commonly used benchmark datasets in the field of task planning.

[142] LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models

Alvi Md Ishmam, Najibul Haque Sarker, Zaber Ibn Abdul Hakim, Chris Thomas

Main category: cs.CV

TL;DR: LAMP introduces a black-box adversarial attack method for multi-image multimodal LLMs using attention constraints to disrupt cross-image information aggregation.

DetailsMotivation: Multi-image MLLMs have unexplored vulnerabilities, existing attacks focus on single-image white-box scenarios, and real-world attacks need black-box methods that work across multiple images.

Method: LAMP uses attention-based constraints to prevent cross-image information aggregation, cross-image contagious constraints to spread adversarial effects, and index-attention suppression for position-invariant attacks.

Result: LAMP outperforms state-of-the-art baselines and achieves highest attack success rates across multiple vision-language tasks and models.

Conclusion: LAMP demonstrates effective black-box adversarial attacks on multi-image MLLMs, revealing security vulnerabilities in multimodal systems.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable performance across vision-language tasks. Recent advancements allow these models to process multiple images as inputs. However, the vulnerabilities of multi-image MLLMs remain unexplored. Existing adversarial attacks focus on single-image settings and often assume a white-box threat model, which is impractical in many real-world scenarios. This paper introduces LAMP, a black-box method for learning Universal Adversarial Perturbations (UAPs) targeting multi-image MLLMs. LAMP applies an attention-based constraint that prevents the model from effectively aggregating information across images. LAMP also introduces a novel cross-image contagious constraint that forces perturbed tokens to influence clean tokens, spreading adversarial effects without requiring all inputs to be modified. Additionally, an index-attention suppression loss enables a robust position-invariant attack. Experimental results show that LAMP outperforms SOTA baselines and achieves the highest attack success rates across multiple vision-language tasks and models.

[143] PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models

Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu

Main category: cs.CV

TL;DR: PTQ4ARVG is a training-free post-training quantization framework for AutoRegressive Visual Generation models that addresses channel-wise outliers, token-wise dynamic activations, and sample-wise distribution mismatches.

DetailsMotivation: Quantization is commonly used to reduce model size and computational latency, but existing methods fail to generalize effectively to AutoRegressive Visual Generation (ARVG) models, which have unique challenges including severe channel-wise outliers, highly dynamic token-wise activations, and mismatched sample-wise distributions.

Method: Proposes PTQ4ARVG with three components: (1) Gain-Projected Scaling (GPS) to mitigate channel-wise outliers by expanding quantization loss via Taylor series and deriving optimal scaling factors; (2) Static Token-Wise Quantization (STWQ) that leverages ARVG’s fixed token length and position-invariant distributions; (3) Distribution-Guided Calibration (DGC) that selects samples contributing most to distributional entropy.

Result: Extensive experiments show PTQ4ARVG can effectively quantize ARVG family models to 8-bit and 6-bit while maintaining competitive performance.

Conclusion: PTQ4ARVG successfully addresses the unique quantization challenges of ARVG models through a training-free framework that handles channel-wise outliers, token-wise variance, and sample-wise distribution mismatches.

Abstract: AutoRegressive Visual Generation (ARVG) models retain an architecture compatible with language models, while achieving performance comparable to diffusion-based models. Quantization is commonly employed in neural networks to reduce model size and computational latency. However, applying quantization to ARVG remains largely underexplored, and existing quantization methods fail to generalize effectively to ARVG models. In this paper, we explore this issue and identify three key challenges: (1) severe outliers at channel-wise level, (2) highly dynamic activations at token-wise level, and (3) mismatched distribution information at sample-wise level. To these ends, we propose PTQ4ARVG, a training-free post-training quantization (PTQ) framework consisting of: (1) Gain-Projected Scaling (GPS) mitigates the channel-wise outliers, which expands the quantization loss via a Taylor series to quantify the gain of scaling for activation-weight quantization, and derives the optimal scaling factor through differentiation. (2) Static Token-Wise Quantization (STWQ) leverages the inherent properties of ARVG, fixed token length and position-invariant distribution across samples, to address token-wise variance without incurring dynamic calibration overhead. (3) Distribution-Guided Calibration (DGC) selects samples that contribute most to distributional entropy, eliminating the sample-wise distribution mismatch. Extensive experiments show that PTQ4ARVG can effectively quantize the ARVG family models to 8-bit and 6-bit while maintaining competitive performance. Code is available at http://github.com/BienLuky/PTQ4ARVG.
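
A rough sketch of the static token-wise quantization idea: because the token grid has fixed length and position-stable statistics, one scale per token position can be calibrated offline and reused. Symmetric 8-bit fake quantization is shown; the details are illustrative.

```python
# Static token-wise fake quantization: calibrate one scale per token position offline,
# then reuse it at inference. Symmetric 8-bit scheme shown; purely illustrative.
import torch

def calibrate_token_scales(calib_acts, n_bits: int = 8):
    # calib_acts: (num_samples, num_tokens, hidden) activations from calibration data
    qmax = 2 ** (n_bits - 1) - 1
    return calib_acts.abs().amax(dim=(0, 2)) / qmax        # (num_tokens,) static scales

def fake_quantize(acts, scales, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(acts / scales[None, :, None]), -qmax - 1, qmax)
    return q * scales[None, :, None]                        # dequantized activations

calib = torch.randn(32, 256, 512)                           # e.g. 256 image tokens per sample
scales = calibrate_token_scales(calib)
x = torch.randn(1, 256, 512)
print((fake_quantize(x, scales) - x).abs().mean())          # mean quantization error
```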

[144] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu

Main category: cs.CV

TL;DR: REST is a real-time streaming audio-driven talking head generation framework using diffusion models with compact video latent space, ID-Context Cache for streaming, and Asynchronous Streaming Distillation for temporal consistency.

DetailsMotivation: Current diffusion-based talking head generation models suffer from slow inference speeds and non-autoregressive paradigms, limiting real-time applications. The authors aim to bridge the gap between autoregressive and diffusion approaches for real-time streaming generation.

Method: 1) Learn compact video latent space via spatiotemporal VAE with high compression ratio; 2) Introduce ID-Context Cache mechanism (ID-Sink + Context-Cache) for semi-autoregressive streaming with identity consistency and temporal coherence; 3) Propose Asynchronous Streaming Distillation strategy using non-streaming teacher to supervise streaming student and mitigate error accumulation.

Result: REST outperforms state-of-the-art methods in both generation speed and overall performance, achieving real-time streaming audio-driven talking head generation.

Conclusion: REST bridges autoregressive and diffusion-based approaches, achieving breakthrough efficiency for real-time talking head generation applications.

Abstract: Diffusion models have significantly advanced the field of talking head generation (THG). However, slow inference speeds and prevalent non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, a pioneering diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through a spatiotemporal variational autoencoder with a high compression ratio. Additionally, to enable semi-autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching for maintaining identity consistency and temporal coherence during long-term streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) strategy is proposed to mitigate error accumulation and enhance temporal consistency in streaming generation, leveraging a non-streaming teacher with an asynchronous noise schedule to supervise the streaming student. REST bridges the gap between autoregressive and diffusion-based approaches, achieving a breakthrough in efficiency for applications requiring real-time THG. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

[145] NFCDS: A Plug-and-Play Noise Frequency-Controlled Diffusion Sampling Strategy for Image Restoration

Zhen Wang, Hongyi Liu, Jianing Li, Zhihui Wei

Main category: cs.CV

TL;DR: NFCDS is a plug-and-play diffusion sampling method that uses spectral modulation to control noise frequency during reverse diffusion, improving the fidelity-perception trade-off in image restoration tasks without additional training.

DetailsMotivation: Diffusion-based PnP methods produce perceptually good images but suffer from reduced data fidelity due to noise introduced during reverse diffusion. There's a fundamental trade-off between fidelity and perception that needs to be addressed.

Method: Proposes Noise Frequency-Controlled Diffusion Sampling (NFCDS) with a spectral modulation mechanism. Key insight: fidelity-perception conflict can be understood through noise frequency - low-frequency noise causes blur/fidelity loss, high-frequency drives detail generation. Uses Fourier-domain filter to progressively suppress low-frequency noise while preserving high-frequency content.
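
A toy version of frequency-controlled noise is sketched below: Gaussian noise is drawn, its low-frequency Fourier components are masked out, and the cutoff can be raised as sampling proceeds. The hard radial mask and the cutoff schedule are assumptions for illustration; the paper's filter design is not specified in this summary.

```python
import numpy as np

def high_frequency_noise(h, w, cutoff_frac, rng=None):
    # Draw Gaussian noise and suppress its low-frequency components with a hard
    # radial mask in the Fourier domain (an illustrative stand-in for the
    # paper's progressive low-frequency suppression).
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal((h, w))
    spectrum = np.fft.fftshift(np.fft.fft2(noise))
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = radius >= cutoff_frac * min(h, w) / 2           # keep high frequencies only
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

# One plausible schedule: raise cutoff_frac over reverse-diffusion steps, so
# less low-frequency noise is injected as the sample is refined.
```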

Result: Enables fast convergence to results that are both high-fidelity and perceptually convincing. As a PnP module, it integrates into existing diffusion-based restoration frameworks and improves fidelity-perception balance across diverse zero-shot tasks.

Conclusion: NFCDS provides a training-free solution to the fidelity-perception trade-off in diffusion sampling by controlling noise frequency, making it a versatile plug-and-play module for diffusion-based image restoration.

Abstract: Diffusion sampling-based Plug-and-Play (PnP) methods produce images with high perceptual quality but often suffer from reduced data fidelity, primarily due to the noise introduced during reverse diffusion. To address this trade-off, we propose Noise Frequency-Controlled Diffusion Sampling (NFCDS), a spectral modulation mechanism for reverse diffusion noise. We show that the fidelity-perception conflict can be fundamentally understood through noise frequency: low-frequency components induce blur and degrade fidelity, while high-frequency components drive detail generation. Based on this insight, we design a Fourier-domain filter that progressively suppresses low-frequency noise and preserves high-frequency content. This controlled refinement injects a data-consistency prior directly into sampling, enabling fast convergence to results that are both high-fidelity and perceptually convincing–without additional training. As a PnP module, NFCDS seamlessly integrates into existing diffusion-based restoration frameworks and improves the fidelity-perception balance across diverse zero-shot tasks.

[146] Hypersolid: Emergent Vision Representations via Short-Range Repulsion

Esteban Rodríguez-Betancourt, Edgar Casasola-Murillo

Main category: cs.CV

TL;DR: Hypersolid prevents representation collapse in self-supervised learning using local hard-ball repulsion instead of global regularization, treating representation learning as a discrete packing problem to maintain injectivity.

DetailsMotivation: Existing self-supervised learning methods use global regularization techniques (maximizing distances, decorrelating dimensions, enforcing distributions) to prevent representation collapse, but the authors propose a different perspective focusing on local geometric constraints.

Method: Hypersolid treats representation learning as a discrete packing problem, using short-range hard-ball repulsion to prevent local collisions between representations. This maintains injectivity by ensuring high separation in the geometric space.
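
One way to realize short-range repulsion is a hinge-style penalty on pairs of embeddings that fall inside a minimum distance, as sketched below. The unit normalization, radius, and squared hinge are assumptions; the paper's exact potential may differ.

```python
import torch
import torch.nn.functional as F

def hard_ball_repulsion_loss(z, radius=0.5):
    # Penalize pairs of representations closer than `radius`, leaving pairs
    # beyond the radius untouched (short-range repulsion rather than a global
    # spreading term).
    z = F.normalize(z, dim=-1)                          # assumption: unit-norm embeddings
    dists = torch.cdist(z, z)                           # pairwise Euclidean distances
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    violations = torch.clamp(radius - dists, min=0.0)[off_diag]
    return violations.pow(2).mean()
```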

Result: The method preserves augmentation diversity and excels on fine-grained and low-resolution classification tasks by creating a high-separation geometric regime that prevents representation collapse.

Conclusion: Local geometric constraints via hard-ball repulsion provide an effective alternative to global regularization for preventing representation collapse in self-supervised learning, particularly beneficial for fine-grained and low-resolution tasks.

Abstract: A recurring challenge in self-supervised learning is preventing representation collapse. Existing solutions typically rely on global regularization, such as maximizing distances, decorrelating dimensions or enforcing certain distributions. We instead reinterpret representation learning as a discrete packing problem, where preserving information simplifies to maintaining injectivity. We operationalize this in Hypersolid, a method using short-range hard-ball repulsion to prevent local collisions. This constraint results in a high-separation geometric regime that preserves augmentation diversity, excelling on fine-grained and low-resolution classification tasks.

[147] Lightweight High-Fidelity Low-Bitrate Talking Face Compression for 3D Video Conference

Jianglong Li, Jun Xu, Bingcong Lu, Zhengxue Cheng, Hongwei Hu, Ronghua Wu, Li Song

Main category: cs.CV

TL;DR: A lightweight 3D talking face compression framework combining FLAME parametric modeling with 3D Gaussian Splatting (3DGS) neural rendering for high-fidelity, low-bitrate real-time video conferencing.

DetailsMotivation: Current 3D video conferencing faces challenges in achieving high-fidelity 3D talking face representation at low bitrates. Traditional 2D compression loses geometric details, while neural rendering methods like NeRF are computationally expensive.

Method: Integrates FLAME-based parametric modeling with 3DGS neural rendering. Transmits only essential facial metadata in real time, uses Gaussian-based head model reconstruction, and introduces compact representation with Gaussian attribute compression and MLP optimization.

Result: Achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, suitable for real-time 3D video conferencing applications.

Conclusion: Proposed framework successfully addresses the challenge of high-fidelity 3D talking face compression at low bitrates, making real-time immersive communication more feasible.

Abstract: The demand for immersive and interactive communication has driven advancements in 3D video conferencing, yet achieving high-fidelity 3D talking face representation at low bitrates remains a challenge. Traditional 2D video compression techniques fail to preserve fine-grained geometric and appearance details, while implicit neural rendering methods like NeRF suffer from prohibitive computational costs. To address these challenges, we propose a lightweight, high-fidelity, low-bitrate 3D talking face compression framework that integrates FLAME-based parametric modeling with 3DGS neural rendering. Our approach transmits only essential facial metadata in real time, enabling efficient reconstruction with a Gaussian-based head model. Additionally, we introduce a compact representation and compression scheme, including Gaussian attribute compression and MLP optimization, to enhance transmission efficiency. Experimental results demonstrate that our method achieves superior rate-distortion performance, delivering high-quality facial rendering at extremely low bitrates, making it well-suited for real-time 3D video conferencing applications.

[148] GeoRC: A Benchmark for Geolocation Reasoning Chains

Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam, Ethan Mendes, Alan Ritter, Wei Xu, James Hays

Main category: cs.CV

TL;DR: Benchmark for evaluating VLM reasoning chains in geolocation tasks reveals VLMs hallucinate evidence and lag behind human experts in producing auditable explanations despite accurate location predictions.

DetailsMotivation: VLMs excel at geolocation prediction but produce poor explanations, often hallucinating scene attributes to support correct predictions. Need benchmark to evaluate reasoning chain quality in geolocation tasks.

Method: Created first benchmark for geolocation reasoning chains using GeoGuessr game with Google Street View. Collaborated with expert players to produce 800 ground truth reasoning chains for 500 scenes. Evaluated LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains.
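
For intuition, an LLM-as-a-judge comparison can be as simple as a prompt that places the expert chain next to the model chain and asks for a rubric score. The wording below is a hypothetical template, not the benchmark's actual judging prompt.

```python
def build_judge_prompt(expert_chain: str, model_chain: str) -> str:
    # Hypothetical rubric prompt for scoring a model-generated geolocation
    # reasoning chain against an expert-written one.
    return (
        "You are grading a geolocation reasoning chain.\n\n"
        f"Expert reasoning chain:\n{expert_chain}\n\n"
        f"Model reasoning chain:\n{model_chain}\n\n"
        "Score the model chain from 1 to 10 for how well its cited visual "
        "evidence (signage, architecture, vegetation, road markings, soil) "
        "matches the expert evidence, penalizing hallucinated attributes. "
        "Answer with the score followed by a one-sentence justification."
    )
```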

Result: Qwen 3 LLM-as-a-judge correlates best with human scoring. Large closed-source VLMs (Gemini, GPT 5) rival human experts at location prediction but lag in producing auditable reasoning chains. Open-weight VLMs (Llama, Qwen) catastrophically fail, performing only slightly better than hallucinated reasoning chains with oracle location knowledge.

Conclusion: Gap between human experts and VLMs points to VLM limitations in extracting fine-grained visual attributes from high-resolution images. Benchmark reveals critical need for improved visual reasoning capabilities in multimodal models.

Abstract: Vision Language Models (VLMs) are good at recognizing the global location of a photograph – their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open-weight VLMs such as Llama and Qwen catastrophically fail on our benchmark – they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high-resolution images.

[149] Token Entropy Regularization for Multi-modal Antenna Affiliation Identification

Dong Chen, Ruoyu Li, Xinyan Zhang, Jialei Xu, Ruoseng Zhao, Zhikang Zhang, Lingyun Li, Zizhuang Wei

Main category: cs.CV

TL;DR: A novel multimodal framework for antenna affiliation identification using video footage, antenna geometry, and PCI signals, with a Token Entropy Regularization module to improve cross-modal alignment.

DetailsMotivation: Current manual antenna affiliation identification is cumbersome and error-prone, requiring a more automated solution. The unique nature of communication domain data makes existing pretrained transformers ineffective due to lack of analogous data and poor cross-modal alignment.

Method: Proposes a multimodal classification and matching approach fusing video footage, antenna geometric features, and PCI signals. Introduces a dedicated training framework with Token Entropy Regularization module during pretraining to address representation alignment challenges between antenna images and PCI signals.

Result: Experiments show TER accelerates convergence and yields significant performance gains. Analysis reveals that the entropy of the first token is modality-dependent, validating the approach’s effectiveness.

Conclusion: The proposed multimodal framework with TER successfully addresses antenna affiliation identification, offering an automated solution that outperforms manual methods and existing pretrained transformers for this specific communication domain task.

Abstract: Accurate antenna affiliation identification is crucial for optimizing and maintaining communication networks. Current practice, however, relies on the cumbersome and error-prone process of manual tower inspections. We propose a novel paradigm shift that fuses video footage of base stations, antenna geometric features, and Physical Cell Identity (PCI) signals, transforming antenna affiliation identification into multi-modal classification and matching tasks. Publicly available pretrained transformers struggle with this unique task due to a lack of analogous data in the communications domain, which hampers cross-modal alignment. To address this, we introduce a dedicated training framework that aligns antenna images with corresponding PCI signals. To tackle the representation alignment challenge, we propose a novel Token Entropy Regularization module in the pretraining stage. Our experiments demonstrate that TER accelerates convergence and yields significant performance gains. Further analysis reveals that the entropy of the first token is modality-dependent. Code will be made available upon publication.

[150] WorldBench: Disambiguating Physics for Diagnostic Evaluation of World Models

Rishi Upadhyay, Howard Zhang, Jim Solomon, Ayush Agrawal, Pranay Boreddy, Shruti Satya Narayana, Yunhao Ba, Alex Wong, Celso M de Melo, Achuta Kadambi

Main category: cs.CV

TL;DR: WorldBench is a novel video benchmark for disentangled evaluation of physical reasoning in world models, assessing specific concepts like object permanence and material properties separately to identify precise failure patterns.

DetailsMotivation: Existing physics-based video benchmarks suffer from entanglement where multiple physical concepts are evaluated simultaneously, limiting diagnostic capability. There's a need for concept-specific evaluation to rigorously assess physical reasoning in world models for reliable deployment in robotics and autonomous systems.

Method: Introduces WorldBench with two-level evaluation: 1) intuitive physical understanding (object permanence, scale/perspective) and 2) low-level physical constants (friction coefficients, fluid viscosity). Uses concept-specific, disentangled evaluation to isolate and assess single physical concepts at a time.

Result: When evaluating SOTA video-based world models on WorldBench, specific patterns of failure in particular physics concepts were identified. All tested models lacked the physical consistency required for reliable real-world interactions.

Conclusion: WorldBench provides a more nuanced and scalable framework for evaluating physical reasoning in video generation and world models, enabling better assessment of models for robust world-model-driven learning applications.

Abstract: Recent advances in generative foundational models, often termed “world models,” have propelled interest in applying them to critical tasks like robotic planning and autonomous system training. For reliable deployment, these models must exhibit high physical fidelity, accurately simulating real-world dynamics. Existing physics-based video benchmarks, however, suffer from entanglement, where a single test simultaneously evaluates multiple physical laws and concepts, fundamentally limiting their diagnostic capability. We introduce WorldBench, a novel video-based benchmark specifically designed for concept-specific, disentangled evaluation, allowing us to rigorously isolate and assess understanding of a single physical concept or law at a time. To make WorldBench comprehensive, we design benchmarks at two different levels: 1) an evaluation of intuitive physical understanding with concepts such as object permanence or scale/perspective, and 2) an evaluation of low-level physical constants and material properties such as friction coefficients or fluid viscosity. When SOTA video-based world models are evaluated on WorldBench, we find specific patterns of failure in particular physics concepts, with all tested models lacking the physical consistency required to generate reliable real-world interactions. Through its concept-specific evaluation, WorldBench offers a more nuanced and scalable framework for rigorously evaluating the physical reasoning capabilities of video generation and world models, paving the way for more robust and generalizable world-model-driven learning.

[151] Gaussian Belief Propagation Network for Depth Completion

Jie Tang, Pingping Xie, Jian Li, Ping Tan

Main category: cs.CV

TL;DR: GBPN is a hybrid framework combining deep learning with probabilistic graphical models for depth completion, using Gaussian Belief Propagation to handle sparse and irregular depth measurements effectively.

DetailsMotivation: Deep learning methods for depth completion struggle with sparse and irregular input depth data, especially under high sparsity conditions. Existing approaches have limitations in handling these challenges effectively.

Method: Proposes Gaussian Belief Propagation Network (GBPN) that integrates deep learning with probabilistic graphical models. Uses Graphical Model Construction Network (GMCN) to dynamically construct scene-specific Markov Random Fields with adaptive non-local edges, then performs inference via Gaussian Belief Propagation with serial & parallel message passing.
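
To show the inference step in isolation, here is Gaussian belief propagation on a one-dimensional chain MRF with unary (measurement) potentials and a fixed smoothness precision; a single forward and backward sweep is exact on a chain. GBPN's actual MRF is image-structured with learned potentials and adaptive non-local edges, so this is only a minimal analogue.

```python
import numpy as np

def gbp_chain(prior_mean, prior_prec, pair_prec):
    # Gaussian BP on a chain: unary Gaussians N(mu_i, 1/lambda_i) plus pairwise
    # smoothness exp(-0.5 * w * (x_i - x_{i+1})^2). Messages are kept in
    # information (precision, mean) form. Assume prior_prec > 0 everywhere
    # (e.g. use a tiny precision at unobserved positions).
    n = len(prior_mean)
    msg_prec = np.zeros((2, n))      # row 0: from left neighbour, row 1: from right
    msg_mean = np.zeros((2, n))
    for i in range(1, n):            # forward sweep (left-to-right messages)
        p = prior_prec[i - 1] + msg_prec[0, i - 1]
        m = (prior_prec[i - 1] * prior_mean[i - 1]
             + msg_prec[0, i - 1] * msg_mean[0, i - 1]) / p
        msg_prec[0, i] = pair_prec * p / (pair_prec + p)
        msg_mean[0, i] = m
    for i in range(n - 2, -1, -1):   # backward sweep (right-to-left messages)
        p = prior_prec[i + 1] + msg_prec[1, i + 1]
        m = (prior_prec[i + 1] * prior_mean[i + 1]
             + msg_prec[1, i + 1] * msg_mean[1, i + 1]) / p
        msg_prec[1, i] = pair_prec * p / (pair_prec + p)
        msg_mean[1, i] = m
    post_prec = prior_prec + msg_prec.sum(axis=0)
    post_mean = (prior_prec * prior_mean + (msg_prec * msg_mean).sum(axis=0)) / post_prec
    return post_mean, post_prec      # per-node marginal means and precisions
```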

Result: Achieves state-of-the-art performance on NYUv2 and KITTI benchmarks. Shows superior performance, robustness, and generalizable capability across varying sparsity levels, patterns, and datasets.

Conclusion: GBPN effectively addresses the challenge of sparse and irregular depth data in depth completion by synergistically combining deep learning with probabilistic graphical models, demonstrating strong performance and robustness.

Abstract: Depth completion aims to predict a dense depth map from a color image with sparse depth measurements. Although deep learning methods have achieved state-of-the-art (SOTA) performance, effectively handling the sparse and irregular nature of input depth data in deep networks remains a significant challenge, often limiting performance, especially under high sparsity. To overcome this limitation, we introduce the Gaussian Belief Propagation Network (GBPN), a novel hybrid framework synergistically integrating deep learning with probabilistic graphical models for end-to-end depth completion. Specifically, a scene-specific Markov Random Field (MRF) is dynamically constructed by the Graphical Model Construction Network (GMCN), and then inferred via Gaussian Belief Propagation (GBP) to yield the dense depth distribution. Crucially, the GMCN learns to construct not only the data-dependent potentials of the MRF but also its structure by predicting adaptive non-local edges, enabling the capture of complex, long-range spatial dependencies. Furthermore, we enhance GBP with a serial & parallel message passing scheme, designed for effective information propagation, particularly from sparse measurements. Extensive experiments demonstrate that GBPN achieves SOTA performance on the NYUv2 and KITTI benchmarks. Evaluations across varying sparsity levels, sparsity patterns, and datasets highlight GBPN’s superior performance, notable robustness, and generalizable capability.

[152] Mam-App: A Novel Parameter-Efficient Mamba Model for Apple Leaf Disease Classification

Md Nadim Mahamood, Md Imran Hasan, Md Rasheduzzaman, Ausrukona Ray, Md Shafi Ud Doula, Kamrul Hasan

Main category: cs.CV

TL;DR: Mam-App: A parameter-efficient Mamba-based model for apple leaf disease classification achieving state-of-the-art performance with only 0.051M parameters, suitable for deployment on resource-constrained devices.

DetailsMotivation: Address the trade-off between efficiency and performance in deep learning models for agricultural disease detection, where current state-of-the-art models are parameter-intensive while lightweight models suffer performance degradation.

Method: Proposes Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification, designed to be extremely lightweight while maintaining competitive performance.

Result: Achieves 99.58% accuracy, 99.30% precision, 99.14% recall, and 99.22% F1-score on Apple Leaf Disease dataset with only 0.051M parameters. Also shows strong performance on Corn (99.48% accuracy) and Potato (98.46% accuracy) leaf disease datasets.

Conclusion: The proposed Mam-App model successfully balances efficiency and performance, making it suitable for deployment on drones, mobile devices, and other low-resource platforms for agricultural disease detection.

Abstract: The rapid growth of the global population, alongside exponential technological advancement, has intensified the demand for food production. Meeting this demand depends not only on increasing agricultural yield but also on minimizing food loss caused by crop diseases. Diseases account for a substantial portion of apple production losses, despite apples being among the most widely produced and nutritionally valuable fruits worldwide. Previous studies have employed machine learning techniques for feature extraction and early diagnosis of apple leaf diseases, and more recently, deep learning-based models have shown remarkable performance in disease recognition. However, most state-of-the-art deep learning models are highly parameter-intensive, resulting in increased training and inference time. Although lightweight models are more suitable for user-friendly and resource-constrained applications, they often suffer from performance degradation. To address the trade-off between efficiency and performance, we propose Mam-App, a parameter-efficient Mamba-based model for feature extraction and leaf disease classification. The proposed approach achieves competitive state-of-the-art performance on the PlantVillage Apple Leaf Disease dataset, attaining 99.58% accuracy, 99.30% precision, 99.14% recall, and a 99.22% F1-score, while using only 0.051M parameters. This extremely low parameter count makes the model suitable for deployment on drones, mobile devices, and other low-resource platforms. To demonstrate the robustness and generalizability of the proposed model, we further evaluate it on the PlantVillage Corn Leaf Disease and Potato Leaf Disease datasets. The model achieves 99.48%, 99.20%, 99.34%, and 99.27% accuracy, precision, recall, and F1-score on the corn dataset and 98.46%, 98.91%, 95.39%, and 97.01% on the potato dataset, respectively.

[153] HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence

Yanfeng Li, Tao Tan, Qingquan Gao, Zhiwen Cao, Xiaohong liu, Yue Sun

Main category: cs.CV

TL;DR: LANE introduces a latent autoregressive network with compact dependencies and AdaGraph strategy for efficient high-quality 3D mesh generation, achieving 6× longer sequences and faster inference.

DetailsMotivation: Existing autoregressive 3D mesh generation methods suffer from insufficient resource utilization, slow inference, and limited sequence length, which constrains expressible structural details.

Method: Proposes Latent Autoregressive Network (LANE) with compact autoregressive dependencies and Adaptive Computation Graph Reconfiguration (AdaGraph) strategy for spatiotemporal decoupling to accelerate inference.

Result: Achieves 6× improvement in maximum generatable sequence length compared to existing methods, with superior performance in generation speed, structural detail, and geometric consistency.

Conclusion: LANE provides an effective solution for high-quality 3D mesh generation with improved efficiency and detail expression capabilities.

Abstract: High-fidelity 3D meshes can be tokenized into one-dimensional (1D) sequences and directly modeled using autoregressive approaches over faces and vertices. However, existing methods suffer from insufficient resource utilization, resulting in slow inference and the ability to handle only small-scale sequences, which severely constrains the expressible structural details. We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process, achieving a $6\times$ improvement in maximum generatable sequence length compared to existing methods. To further accelerate inference, we propose the Adaptive Computation Graph Reconfiguration (AdaGraph) strategy, which effectively overcomes the efficiency bottleneck of traditional serial inference through spatiotemporal decoupling in the generation process. Experimental validation demonstrates that LANE achieves superior performance across generation speed, structural detail, and geometric consistency, providing an effective solution for high-quality 3D mesh generation.

[154] Optimal Transport-Induced Samples against Out-of-Distribution Overconfidence

Keke Tang, Ziyong Du, Xiaofei Wang, Weilong Peng, Peican Zhu, Zhihong Tian

Main category: cs.CV

TL;DR: A framework using optimal transport geometry to identify semantically ambiguous regions and generate OOD samples for training models to reduce overconfidence on out-of-distribution inputs.

DetailsMotivation: Deep neural networks often produce overconfident predictions on out-of-distribution inputs, undermining reliability in open-world environments. Singularities in semi-discrete optimal transport mark regions of semantic ambiguity where classifiers are prone to unwarranted high-confidence predictions.

Method: Formulate an optimal transport problem between a continuous base distribution and latent embeddings of training data, identify resulting singular boundaries, sample near these boundaries to construct geometrically grounded OOD samples (OTIS), and apply confidence suppression loss during training.
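
The training objective can be pictured as standard cross-entropy on in-distribution data plus a term that pushes predictions on boundary samples toward the uniform distribution. The cross-entropy-to-uniform form and the weighting below are common choices and an assumption here, not necessarily the paper's exact confidence suppression loss.

```python
import torch.nn.functional as F

def confidence_suppression_loss(logits_otis):
    # Encourage near-uniform predictions on samples drawn near OT-induced
    # singular boundaries (equivalent, up to a constant, to KL to uniform).
    return -F.log_softmax(logits_otis, dim=-1).mean()

def training_loss(logits_id, labels, logits_otis, lam=0.5):
    # In-distribution classification loss plus suppression on OTIS samples;
    # lam is an assumed trade-off weight.
    return F.cross_entropy(logits_id, labels) + lam * confidence_suppression_loss(logits_otis)
```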

Result: Extensive experiments show the method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.

Conclusion: The proposed framework effectively leverages optimal transport geometry to address OOD overconfidence by targeting structurally uncertain regions for improved model calibration.

Abstract: Deep neural networks (DNNs) often produce overconfident predictions on out-of-distribution (OOD) inputs, undermining their reliability in open-world environments. Singularities in semi-discrete optimal transport (OT) mark regions of semantic ambiguity, where classifiers are particularly prone to unwarranted high-confidence predictions. Motivated by this observation, we propose a principled framework to mitigate OOD overconfidence by leveraging the geometry of OT-induced singular boundaries. Specifically, we formulate an OT problem between a continuous base distribution and the latent embeddings of training data, and identify the resulting singular boundaries. By sampling near these boundaries, we construct a class of OOD inputs, termed optimal transport-induced OOD samples (OTIS), which are geometrically grounded and inherently semantically ambiguous. During training, a confidence suppression loss is applied to OTIS to guide the model toward more calibrated predictions in structurally uncertain regions. Extensive experiments show that our method significantly alleviates OOD overconfidence and outperforms state-of-the-art methods.

[155] Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations

Pritika Vig, Ren-Chin Wu, William Lotter

Main category: cs.CV

TL;DR: Vision foundation models trained on pathology images implicitly learn continuous disease progression trajectories in their latent representations, with trajectory fidelity predicting downstream classification performance.

DetailsMotivation: To determine whether vision foundation models encode continuous disease progression processes from static pathology images, which could better reflect underlying biology and support robust generalization in computational pathology.

Method: Used diffusion pseudotime (from single-cell transcriptomics) to probe whether foundation models organize disease states along coherent progression trajectories in representation space across four cancer progressions and six models.

Result: All pathology-specific models recovered trajectory orderings significantly exceeding null baselines, with vision-only models achieving highest fidelities (τ>0.78 on CRC-Serrated). Trajectory fidelity strongly predicted few-shot classification performance on held-out diseases (ρ=0.92).

Conclusion: Vision foundation models can implicitly learn continuous processes from static observations, and trajectory fidelity provides complementary measure of representation quality beyond downstream performance, applicable to domains beyond pathology.

Abstract: Vision foundation models trained on discretely sampled images achieve strong performance on classification benchmarks, yet whether their representations encode the continuous processes underlying their training data remains unclear. This question is especially pertinent in computational pathology, where we posit that models whose latent representations implicitly capture continuous disease progression may better reflect underlying biology, support more robust generalization, and enable quantitative analyses of features associated with disease transitions. Using diffusion pseudotime, a method developed to infer developmental trajectories from single-cell transcriptomics, we probe whether foundation models organize disease states along coherent progression directions in representation space. Across four cancer progressions and six models, we find that all pathology-specific models recover trajectory orderings significantly exceeding null baselines, with vision-only models achieving the highest fidelities (τ > 0.78 on CRC-Serrated). Model rankings by trajectory fidelity on reference diseases strongly predict few-shot classification performance on held-out diseases (ρ = 0.92), and exploratory analysis shows cell-type composition varies smoothly along inferred trajectories in patterns consistent with known stromal remodeling. Together, these results demonstrate that vision foundation models can implicitly learn to represent continuous processes from independent static observations, and that trajectory fidelity provides a complementary measure of representation quality beyond downstream performance. While demonstrated in pathology, this framework could be applied to other domains where continuous processes are observed through static snapshots.

[156] SR$^{2}$-Net: A General Plug-and-Play Model for Spectral Refinement in Hyperspectral Image Super-Resolution

Ji-Xuan He, Guohang Zhuang, Junge Bo, Tingyi Li, Chen Ling, Yanan Qiao

Main category: cs.CV

TL;DR: SR²-Net is a plug-and-play spectral rectification module for hyperspectral image super-resolution that enhances spectral consistency and physical plausibility without modifying existing model architectures.

DetailsMotivation: Current HSI-SR methods focus on spatial enhancement but neglect spectral consistency across bands, causing spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed through network architecture design, this reduces generality and flexibility.

Method: Proposes SR²-Net with enhance-then-rectify pipeline: (1) Hierarchical Spectral-Spatial Synergy Attention (H-S³A) for cross-band interactions, (2) Manifold Consistency Rectification (MCR) to constrain spectra to physically plausible manifold, and (3) degradation-consistency loss for data fidelity.
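
The degradation-consistency idea can be sketched as re-degrading the super-resolved cube and comparing it against the observed low-resolution input. Bicubic downsampling and an L1 penalty are stand-in assumptions; the paper's actual degradation operator and loss form are not given in this summary.

```python
import torch.nn.functional as F

def degradation_consistency_loss(sr_output, lr_input, scale):
    # sr_output: (B, C, H, W) super-resolved hyperspectral cube
    # lr_input:  (B, C, H/scale, W/scale) observed low-resolution cube
    degraded = F.interpolate(sr_output, scale_factor=1.0 / scale,
                             mode="bicubic", align_corners=False)
    return F.l1_loss(degraded, lr_input)
```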

Result: Extensive experiments on multiple benchmarks with diverse backbones show consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead.

Conclusion: SR²-Net provides a lightweight, plug-and-play solution to improve spectral consistency in HSI-SR without modifying existing model architectures, addressing physical plausibility while maintaining flexibility.

Abstract: HSI-SR aims to enhance spatial resolution while preserving spectrally faithful and physically plausible characteristics. Recent methods have achieved great progress by leveraging spatial correlations to enhance spatial resolution. However, these methods often neglect spectral consistency across bands, leading to spurious oscillations and physically implausible artifacts. While spectral consistency can be addressed by designing the network architecture, it results in a loss of generality and flexibility. To address this issue, we propose a lightweight plug-and-play rectifier, physically priors Spectral Rectification Super-Resolution Network (SR$^{2}$-Net), which can be attached to a wide range of HSI-SR models without modifying their architectures. SR$^{2}$-Net follows an enhance-then-rectify pipeline consisting of (i) Hierarchical Spectral-Spatial Synergy Attention (H-S$^{3}$A) to reinforce cross-band interactions and (ii) Manifold Consistency Rectification (MCR) to constrain the reconstructed spectra to a compact, physically plausible spectral manifold. In addition, we introduce a degradation-consistency loss to enforce data fidelity by encouraging the degraded SR output to match the observed low resolution input. Extensive experiments on multiple benchmarks and diverse backbones demonstrate consistent improvements in spectral fidelity and overall reconstruction quality with negligible computational overhead. Our code will be released upon publication.

[157] Dynamical Adapter Fusion: Constructing A Global Adapter for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu, Boyu Diao, Zijia An, Zhulin An, Fei Wang, Yongjun Xu

Main category: cs.CV

TL;DR: DAF proposes a single global adapter for class-incremental learning using PAC-Bayes theorem and Taylor expansion to dynamically fuse task-specific adapters, previous global adapter, and initialization parameters to balance stability and plasticity.

DetailsMotivation: Current CIL approaches using task-specific adapters hinder knowledge transfer and incur high retrieval costs, while naive parameter fusion leads to destructive interference and catastrophic forgetting. Need for a single robust global adapter that can continuously learn new classes without forgetting.

Method: Proposes Dynamical Adapter Fusion (DAF) grounded in PAC-Bayes theorem. Derives fusion mechanism integrating three components: optimized task-specific adapter parameters, previous global adapter parameters, and initialization parameters. Uses Taylor expansion of loss function to derive optimal fusion coefficients for balancing stability and plasticity. Includes Robust Initialization strategy to capture global knowledge patterns.
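
The fusion itself reduces to a weighted combination of three parameter sets per adapter weight. The sketch below uses fixed coefficients as a placeholder; in DAF the coefficients are derived dynamically from a Taylor expansion of the loss rather than set by hand.

```python
import torch

@torch.no_grad()
def fuse_global_adapter(task_params, prev_global_params, init_params, coeffs=(0.4, 0.4, 0.2)):
    # Weighted fusion of the task-specific adapter, previous global adapter,
    # and initialization parameters (each a dict of name -> tensor). The fixed
    # `coeffs` are a stand-in for the dynamically derived optimal coefficients.
    a, b, c = coeffs
    return {
        name: a * task_params[name] + b * prev_global_params[name] + c * init_params[name]
        for name in task_params
    }
```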

Result: Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.

Conclusion: DAF successfully addresses challenges in class-incremental learning by creating a single robust global adapter through dynamic fusion, achieving optimal balance between stability and plasticity while avoiding catastrophic forgetting.

Abstract: Class-Incremental Learning (CIL) requires models to continuously acquire new classes without forgetting previously learned ones. A dominant paradigm involves freezing a pre-trained model and training lightweight, task-specific adapters. However, maintaining task-specific parameters hinders knowledge transfer and incurs high retrieval costs, while naive parameter fusion often leads to destructive interference and catastrophic forgetting. To address these challenges, we propose Dynamical Adapter Fusion (DAF) to construct a single robust global adapter. Grounded in the PAC-Bayes theorem, we derive a fusion mechanism that explicitly integrates three components: the optimized task-specific adapter parameters, the previous global adapter parameters, and the initialization parameters. We utilize the Taylor expansion of the loss function to derive the optimal fusion coefficients, dynamically achieving the best balance between stability and plasticity. Furthermore, we propose a Robust Initialization strategy to effectively capture global knowledge patterns. Experiments on multiple CIL benchmarks demonstrate that DAF achieves state-of-the-art (SOTA) performance.

[158] Semantic-Guided Dynamic Sparsification for Pre-Trained Model-based Class-Incremental Learning

Ruiqi Liu, Boyu Diao, Zijia An, Runjie Shao, Zhulin An, Fei Wang, Yongjun Xu

Main category: cs.CV

TL;DR: SGDS is a novel method for Class-Incremental Learning that uses semantic-guided dynamic sparsification to create class-specific activation subspaces, promoting knowledge transfer for similar classes while preventing interference for dissimilar ones.

DetailsMotivation: Existing CIL methods that freeze pre-trained models and use orthogonal adapters to prevent inter-task interference are detrimental to plasticity. The authors argue that rigid parameter constraints limit the model's ability to learn new classes effectively.

Method: SGDS proactively guides activation space by governing orientation and rank of subspaces through targeted sparsification. It promotes knowledge transfer by encouraging similar classes to share compact activation subspaces, while assigning non-overlapping subspaces to dissimilar classes to prevent interference.

Result: Extensive experiments on various benchmark datasets demonstrate state-of-the-art performance, showing SGDS effectively mitigates interference without imposing rigid constraints on parameter space.

Conclusion: SGDS provides an effective approach to class-incremental learning by sculpting class-specific sparse subspaces in activation space, balancing plasticity and stability better than parameter-constraining methods.

Abstract: Class-Incremental Learning (CIL) requires a model to continually learn new classes without forgetting old ones. A common and efficient solution freezes a pre-trained model and employs lightweight adapters, whose parameters are often forced to be orthogonal to prevent inter-task interference. However, we argue that this parameter-constraining method is detrimental to plasticity. To this end, we propose Semantic-Guided Dynamic Sparsification (SGDS), a novel method that proactively guides the activation space by governing the orientation and rank of its subspaces through targeted sparsification. Specifically, SGDS promotes knowledge transfer by encouraging similar classes to share a compact activation subspace, while simultaneously preventing interference by assigning non-overlapping activation subspaces to dissimilar classes. By sculpting class-specific sparse subspaces in the activation space, SGDS effectively mitigates interference without imposing rigid constraints on the parameter space. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of SGDS.

[159] Towards Geometry-Aware and Motion-Guided Video Human Mesh Recovery

Hongjun Chen, Huan Zheng, Wencheng Han, Jianbing Shen

Main category: cs.CV

TL;DR: HMRMamba: A new 3D human mesh recovery method using Structured State Space Models (SSMs) for better temporal consistency and physical plausibility in video-based reconstruction.

DetailsMotivation: Existing video-based 3D Human Mesh Recovery methods produce physically implausible results due to flawed intermediate 3D pose anchors and inability to model complex spatiotemporal dynamics effectively.

Method: Two core components: 1) Geometry-Aware Lifting Module with dual-scan Mamba architecture for robust 2D-to-3D pose lifting using geometric cues, and 2) Motion-guided Reconstruction Network that processes kinematic patterns over time for temporal awareness.

Result: Sets new state-of-the-art on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks, outperforming existing methods in reconstruction accuracy and temporal consistency with superior computational efficiency.

Conclusion: HMRMamba introduces a new paradigm using SSMs for 3D human mesh recovery, effectively addressing architectural limitations of previous methods and achieving better physical plausibility and temporal coherence.

Abstract: Existing video-based 3D Human Mesh Recovery (HMR) methods often produce physically implausible results, stemming from their reliance on flawed intermediate 3D pose anchors and their inability to effectively model complex spatiotemporal dynamics. To overcome these deep-rooted architectural problems, we introduce HMRMamba, a new paradigm for HMR that pioneers the use of Structured State Space Models (SSMs) for their efficiency and long-range modeling prowess. Our framework is distinguished by two core contributions. First, the Geometry-Aware Lifting Module, featuring a novel dual-scan Mamba architecture, creates a robust foundation for reconstruction. It directly grounds the 2D-to-3D pose lifting process with geometric cues from image features, producing a highly reliable 3D pose sequence that serves as a stable anchor. Second, the Motion-guided Reconstruction Network leverages this anchor to explicitly process kinematic patterns over time. By injecting this crucial temporal awareness, it significantly enhances the final mesh’s coherence and robustness, particularly under occlusion and motion blur. Comprehensive evaluations on 3DPW, MPI-INF-3DHP, and Human3.6M benchmarks confirm that HMRMamba sets a new state-of-the-art, outperforming existing methods in both reconstruction accuracy and temporal consistency while offering superior computational efficiency.

[160] Rectifying Geometry-Induced Similarity Distortions for Real-World Aerial-Ground Person Re-Identification

Kailash A. Hambarde, Hugo Proença

Main category: cs.CV

TL;DR: GIQT introduces geometry-induced query-key transformation to address extreme viewpoint discrepancies in aerial-ground person re-identification by rectifying similarity space distortions caused by camera geometry.

DetailsMotivation: Existing methods for aerial-ground person re-identification fail because extreme viewpoint and distance discrepancies cause severe geometric distortions that invalidate the assumption of a shared similarity space across views. The geometry-invariant dot-product similarity used in attention mechanisms becomes unreliable under large viewpoint and scale variations.

Method: Proposes Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Also includes a geometry-conditioned prompt generation mechanism for global, view-adaptive representation priors.
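
A rough reading of the idea is a low-rank correction to the query-key similarity whose strength is set by a camera-geometry descriptor, as sketched below. The rank, the gating form, and the geometry input dimension are assumptions; only the principle of rectifying the similarity computation rather than the features follows the summary.

```python
import torch
import torch.nn as nn

class GeometryConditionedSimilarity(nn.Module):
    def __init__(self, dim, rank=8, geo_dim=6):
        super().__init__()
        self.u = nn.Linear(dim, rank, bias=False)       # low-rank query projection
        self.v = nn.Linear(dim, rank, bias=False)       # low-rank key projection
        self.geo_gate = nn.Linear(geo_dim, rank)        # camera geometry -> per-rank gains

    def forward(self, q, k, geometry):
        # q: (B, Nq, dim), k: (B, Nk, dim), geometry: (B, geo_dim)
        base = q @ k.transpose(-2, -1)                  # standard dot-product similarity
        gains = torch.tanh(self.geo_gate(geometry)).unsqueeze(1)   # (B, 1, rank)
        correction = (self.u(q) * gains) @ self.v(k).transpose(-2, -1)
        return base + correction                        # rectified similarity logits
```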

Result: Experiments on four aerial-ground person re-identification benchmarks show consistent improvement in robustness under extreme and previously unseen geometric conditions with minimal computational overhead compared to state-of-the-art methods.

Conclusion: Explicitly addressing geometry-induced distortions in the similarity space through GIQT significantly improves aerial-ground person re-identification performance under extreme viewpoint variations.

Abstract: Aerial-ground person re-identification (AG-ReID) is fundamentally challenged by extreme viewpoint and distance discrepancies between aerial and ground cameras, which induce severe geometric distortions and invalidate the assumption of a shared similarity space across views. Existing methods primarily rely on geometry-aware feature learning or appearance-conditioned prompting, while implicitly assuming that the geometry-invariant dot-product similarity used in attention mechanisms remains reliable under large viewpoint and scale variations. We argue that this assumption does not hold. Extreme camera geometry systematically distorts the query-key similarity space and degrades attention-based matching, even when feature representations are partially aligned. To address this issue, we introduce Geometry-Induced Query-Key Transformation (GIQT), a lightweight low-rank module that explicitly rectifies the similarity space by conditioning query-key interactions on camera geometry. Rather than modifying feature representations or the attention formulation itself, GIQT adapts the similarity computation to compensate for dominant geometry-induced anisotropic distortions. Building on this local similarity rectification, we further incorporate a geometry-conditioned prompt generation mechanism that provides global, view-adaptive representation priors derived directly from camera geometry. Experiments on four aerial-ground person re-identification benchmarks demonstrate that the proposed framework consistently improves robustness under extreme and previously unseen geometric conditions, while introducing minimal computational overhead compared to state-of-the-art methods.

[161] Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu

Main category: cs.CV

TL;DR: UniMRG is a post-training method that enhances multimodal models’ understanding capabilities by training them to generate multiple intrinsic representations (pixel, depth, segmentation) alongside standard visual understanding tasks.

DetailsMotivation: While Unified Multimodal Models (UMMs) aim for mutual reinforcement between understanding and generation, current methods mainly use understanding to enhance generation. The reverse direction - using generation to improve understanding - remains largely unexplored.

Method: UniMRG is an architecture-agnostic post-training method that trains UMMs to generate multiple intrinsic representations of input images: pixel reconstruction (appearance), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives.

Result: Extensive experiments show UniMRG notably enhances fine-grained perception, reduces hallucinations, improves spatial understanding, and simultaneously boosts generation capabilities across diverse UMM architectures.

Conclusion: By synthesizing diverse representations through auxiliary generation tasks, UMMs develop deeper and more comprehensive understanding of visual inputs, demonstrating that generation can effectively enhance understanding in multimodal models.

Abstract: Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

[162] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He, Kaiqing Lin, Yue Zhou, Jiaming Zhong, Wei Ye, Wenhui Yi, Bing Fan, Feng Ding, Haodong Li, Bo Cao, Bin Li

Main category: cs.CV

TL;DR: A hierarchical dual-path framework detects AI-generated videos by analyzing manifold projection fluctuations - structured pixel composition patterns that distinguish synthetic content from real recordings.

DetailsMotivation: As video generation models like Veo and Wan produce increasingly realistic content, traditional detection methods fail. The authors argue that AI-generated videos result from manifold-fitting processes rather than physical recording, creating detectable structured patterns in pixel composition across frames.

Method: Proposes a hierarchical dual-path framework: 1) Static Manifold Deviation Branch uses Vision Foundation Models to detect spatial anomalies or physical violations deviating from natural real-world manifolds; 2) Micro-Temporal Fluctuation Branch analyzes structured “Manifold Projection Fluctuations” (MPF) - subtle computational fingerprints in pixel composition logic across consecutive frames.

Result: The framework successfully detects AI-generated videos by exposing forgeries through either global real-world manifold deviations or subtle computational fingerprints, even in visually perfect sequences that evade spatial detection.

Conclusion: AI-generated videos exhibit detectable structured patterns (MPF) due to their manifold-fitting nature, enabling hierarchical detection that combines spatial anomaly detection with fine-grained temporal fluctuation analysis.

Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real content and cutting-edge high-fidelity fakes is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of the residuals between consecutive adjacent frames in AI videos exhibits a structured and homogeneous characteristic. We term this phenomenon “Manifold Projection Fluctuations” (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.

[163] From Implicit Ambiguity to Explicit Solidity: Diagnosing Interior Geometric Degradation in Neural Radiance Fields for Dense 3D Scene Understanding

Jiangsan Zhao, Jakob Geipel, Kryzysztof Kusnierek

Main category: cs.CV

TL;DR: NeRFs fail in dense occlusion scenes due to Interior Geometric Degradation, leading to hollow structures and instance undercounting. An explicit SVRaster pipeline with SfM initialization achieves better recovery rates.

DetailsMotivation: NeRFs are powerful for multi-view reconstruction but their reliability for quantitative 3D analysis in dense, self-occluding scenes is poorly understood. The paper aims to identify failure modes and propose solutions for accurate instance counting in occlusion-heavy environments.

Method: Identifies Interior Geometric Degradation (IGD) failure mode in implicit density fields. Proposes explicit geometric pipeline using Sparse Voxel Rasterization (SVRaster) initialized from SfM feature geometry. Projects 2D instance masks onto explicit voxel grid with recursive splitting for geometric separation.

Result: State-of-the-art mask-supervised NeRFs saturate at ~89% instance recovery in dense scenes. SVRaster achieves 95.8% recovery rate in dense clusters. Explicit SfM-based geometry recovers 43% more instances than implicit baselines under degraded segmentation masks.

Conclusion: Explicit geometric priors are essential for reliable quantitative analysis in highly self-occluding 3D scenes. Implicit methods like NeRFs suffer from fundamental limitations under heavy occlusion that explicit approaches can overcome.

Abstract: Neural Radiance Fields (NeRFs) have emerged as a powerful paradigm for multi-view reconstruction, complementing classical photogrammetric pipelines based on Structure-from-Motion (SfM) and Multi-View Stereo (MVS). However, their reliability for quantitative 3D analysis in dense, self-occluding scenes remains poorly understood. In this study, we identify a fundamental failure mode of implicit density fields under heavy occlusion, which we term Interior Geometric Degradation (IGD). We show that transmittance-based volumetric optimization satisfies photometric supervision by reconstructing hollow or fragmented structures rather than solid interiors, leading to systematic instance undercounting. Through controlled experiments on synthetic datasets with increasing occlusion, we demonstrate that state-of-the-art mask-supervised NeRFs saturate at approximately 89% instance recovery in dense scenes, despite improved surface coherence and mask quality. To overcome this limitation, we introduce an explicit geometric pipeline based on Sparse Voxel Rasterization (SVRaster), initialized from SfM feature geometry. By projecting 2D instance masks onto an explicit voxel grid and enforcing geometric separation via recursive splitting, our approach preserves physical solidity and achieves a 95.8% recovery rate in dense clusters. A sensitivity analysis using degraded segmentation masks further shows that explicit SfM-based geometry is substantially more robust to supervision failure, recovering 43% more instances than implicit baselines. These results demonstrate that explicit geometric priors are a prerequisite for reliable quantitative analysis in highly self-occluding 3D scenes.

[164] MultiModal Fine-tuning with Synthetic Captions

Shohei Enomoto, Shin’ya Yamaguchi

Main category: cs.CV

TL;DR: Proposes transforming unimodal datasets into multimodal ones using MLLMs to generate synthetic image captions for fine-tuning, bridging the gap between multimodal pre-training and unimodal fine-tuning.

DetailsMotivation: Addresses the fundamental gap between multimodal pre-training (which has enhanced visual understanding) and predominantly unimodal fine-tuning, limiting the benefits of rich pre-trained representations.

Method: Uses MLLMs to generate synthetic image captions for unimodal datasets, employs carefully designed prompts with class labels and domain context, introduces supervised contrastive loss for clustering same-class representations, and uses class-averaged text embeddings from multiple synthetic captions per image for inference.
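
The inference step can be sketched as averaging the embeddings of each class's synthetic captions into a text prototype and classifying images by cosine similarity to those prototypes. The embedding model and caption prompts are left abstract here, and the normalization choices are assumptions.

```python
import torch
import torch.nn.functional as F

def classify_with_caption_prototypes(image_embeds, caption_embeds_per_class):
    # image_embeds: (B, D); caption_embeds_per_class: list over classes of (N_c, D)
    prototypes = torch.stack([
        F.normalize(embs, dim=-1).mean(dim=0)            # average the class's captions
        for embs in caption_embeds_per_class
    ])
    prototypes = F.normalize(prototypes, dim=-1)          # (C, D)
    image_embeds = F.normalize(image_embeds, dim=-1)
    return image_embeds @ prototypes.t()                  # (B, C) similarity logits
```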

Result: Extensive experiments across 13 image classification benchmarks show the approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios.

Conclusion: Establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning.

Abstract: In this paper, we address a fundamental gap between pre-training and fine-tuning of deep neural networks: while pre-training has shifted from unimodal to multimodal learning with enhanced visual understanding, fine-tuning predominantly remains unimodal, limiting the benefits of rich pre-trained representations. To bridge this gap, we propose a novel approach that transforms unimodal datasets into multimodal ones using Multimodal Large Language Models (MLLMs) to generate synthetic image captions for fine-tuning models with a multimodal objective. Our method employs carefully designed prompts incorporating class labels and domain context to produce high-quality captions tailored for classification tasks. Furthermore, we introduce a supervised contrastive loss function that explicitly encourages clustering of same-class representations during fine-tuning, along with a new inference technique that leverages class-averaged text embeddings from multiple synthetic captions per image. Extensive experiments across 13 image classification benchmarks demonstrate that our approach outperforms baseline methods, with particularly significant improvements in few-shot learning scenarios. Our work establishes a new paradigm for dataset enhancement that effectively bridges the gap between multimodal pre-training and fine-tuning. Our code is available at https://github.com/s-enmt/MMFT.

[165] Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

Main category: cs.CV

TL;DR: Spava is a sequence-parallel framework that accelerates long-video inference in Large Multimodal Models by distributing approximate attention across multiple GPUs, achieving significant speedups without performance degradation.

DetailsMotivation: Long-video inference faces efficiency bottlenecks due to dense computation in the prefill stage of LMMs. Existing methods either compress visual embeddings or use sparse attention on single GPUs, resulting in limited acceleration or degraded performance, preventing LMMs from handling longer, more complex videos.

Method: Spava uses a sequence-parallel framework with optimized attention that distributes approximate attention computation across multiple GPUs. It reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression. System-level optimizations include load balancing and fused forward passes.

Result: Spava achieves speedups of 12.72x over FlashAttn, 1.70x over ZigZagRing, and 1.18x over APB, without notable performance loss. The framework enables efficient processing of longer videos while maintaining model performance.

Conclusion: Spava effectively addresses the efficiency bottleneck in long-video inference for LMMs by leveraging multi-GPU parallelization and optimized attention mechanisms, enabling scalable processing of complex video content without compromising accuracy.

Abstract: The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose Spava, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, Spava reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of Spava, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss. Code available at https://github.com/thunlp/APB

[166] Variance & Greediness: A comparative study of metric-learning losses

Donghuo Zeng, Hao Niu, Zhi Li, Masato Taya

Main category: cs.CV

TL;DR: A diagnostic framework (VARIANCE and GREEDINESS) analyzes metric learning losses for image retrieval, revealing trade-offs between embedding compaction and fine-grained discrimination.

DetailsMotivation: Metric learning is crucial for retrieval but its effects on embedding geometry and optimization dynamics are not well understood. The paper aims to systematically analyze different loss functions to provide practical guidance.

Method: Introduces VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms) diagnostic framework. Compares seven representative losses (Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, CCL) across five image-retrieval datasets.
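
To make the VARIANCE side of the diagnostic concrete, here is a minimal sketch of intra-/inter-class variance statistics over a batch of embeddings; the paper's exact definitions (and the GREEDINESS gradient statistics) may differ.

```python
import torch

def variance_diagnostics(embeddings, labels):
    """Compute simple intra-/inter-class variance statistics.

    embeddings: (N, D) float tensor, labels: (N,) integer class ids.
    Illustrative only; the paper's precise VARIANCE definition may differ.
    """
    classes = labels.unique()
    centroids = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])
    # Intra-class: average squared distance of samples to their class centroid.
    intra = torch.stack([
        (embeddings[labels == c] - centroids[i]).pow(2).sum(dim=1).mean()
        for i, c in enumerate(classes)
    ]).mean()
    # Inter-class: spread of class centroids around the global mean.
    inter = (centroids - embeddings.mean(dim=0)).pow(2).sum(dim=1).mean()
    return {"intra_class": intra.item(), "inter_class": inter.item()}
```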

Result: Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. Contrastive and InfoNCE achieve compact embeddings quickly through many small updates but may oversimplify class structures. N-pair achieves large mean separation but with uneven spacing.

Conclusion: Reveals an efficiency-granularity trade-off: prefer Triplet/SCL for diversity preservation and hard-sample discrimination, and Contrastive/InfoNCE for faster embedding compaction. Provides practical guidance for loss selection in retrieval tasks.

Abstract: Metric learning is central to retrieval, yet its effects on embedding geometry and optimization dynamics are not well understood. We introduce a diagnostic framework, VARIANCE (intra-/inter-class variance) and GREEDINESS (active ratio and gradient norms), to compare seven representative losses, i.e., Contrastive, Triplet, N-pair, InfoNCE, ArcFace, SCL, and CCL, across five image-retrieval datasets. Our analysis reveals that Triplet and SCL preserve higher within-class variance and clearer inter-class margins, leading to stronger top-1 retrieval in fine-grained settings. In contrast, Contrastive and InfoNCE compact embeddings quickly through many small updates, accelerating convergence but potentially oversimplifying class structures. N-pair achieves a large mean separation but with uneven spacing. These insights reveal a form of efficiency-granularity trade-off and provide practical guidance: prefer Triplet/SCL when diversity preservation and hard-sample discrimination are critical, and Contrastive/InfoNCE when faster embedding compaction is desired.

[167] Mining Forgery Traces from Reconstruction Error: A Weakly Supervised Framework for Multimodal Deepfake Temporal Localization

Midou Guo, Qilin Yin, Wei Lu, Xiangyang Luo, Rui Yang

Main category: cs.CV

TL;DR: RT-DeepLoc is a weakly supervised temporal deepfake localization framework that uses Masked Autoencoders trained on authentic data to detect forgeries via reconstruction errors, enhanced by an Asymmetric Intra-video Contrastive Loss for better localization.

DetailsMotivation: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization, but frame-level annotation is prohibitively expensive, making weakly supervised methods necessary.

Method: Uses Masked Autoencoder (MAE) trained exclusively on authentic data to learn intrinsic spatiotemporal patterns, then detects forgeries via reconstruction discrepancies. Introduces Asymmetric Intra-video Contrastive Loss (AICL) to focus on compactness of authentic features guided by reconstruction cues for stable decision boundaries.
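
A toy sketch of the reconstruction-error idea (scoring segments by how poorly an authentic-only MAE reconstructs them); `mae_model` and the segment interface are placeholders, not the paper's actual components:

```python
import torch

def segment_forgery_scores(mae_model, segments):
    """Score temporal segments by MAE reconstruction error.

    segments: iterable of (C, T, H, W) clips; mae_model is assumed to map a
    batched clip to its reconstruction. Higher error suggests a forged segment.
    """
    mae_model.eval()
    scores = []
    with torch.no_grad():
        for clip in segments:
            x = clip.unsqueeze(0)                  # add batch dimension
            recon = mae_model(x)                   # assumed reconstruction output
            scores.append((recon - x).pow(2).mean().item())
    return scores                                   # per-segment anomaly scores
```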

Result: Extensive experiments on large-scale datasets including LAV-DF demonstrate state-of-the-art performance in weakly-supervised temporal forgery localization.

Conclusion: RT-DeepLoc effectively addresses the challenge of fine-grained temporal deepfake localization with only video-level supervision, achieving superior performance through reconstruction-based detection and contrastive learning.

Abstract: Modern deepfakes have evolved into localized and intermittent manipulations that require fine-grained temporal localization. The prohibitive cost of frame-level annotation makes weakly supervised methods a practical necessity, which rely only on video-level labels. To this end, we propose Reconstruction-based Temporal Deepfake Localization (RT-DeepLoc), a weakly supervised temporal forgery localization framework that identifies forgeries via reconstruction errors. Our framework uses a Masked Autoencoder (MAE) trained exclusively on authentic data to learn its intrinsic spatiotemporal patterns; this allows the model to produce significant reconstruction discrepancies for forged segments, effectively providing the missing fine-grained cues for localization. To robustly leverage these indicators, we introduce a novel Asymmetric Intra-video Contrastive Loss (AICL). By focusing on the compactness of authentic features guided by these reconstruction cues, AICL establishes a stable decision boundary that enhances local discrimination while preserving generalization to unseen forgeries. Extensive experiments on large-scale datasets, including LAV-DF, demonstrate that RT-DeepLoc achieves state-of-the-art performance in weakly-supervised temporal forgery localization.

[168] Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking

Kaito Shiku, Ichika Seo, Tetsuya Matoba, Rissei Hino, Yasuhiro Nakano, Ryoma Bise

Main category: cs.CV

TL;DR: HyperAdAgFormer: A hypernetwork-based adaptive aggregation transformer for estimating coronary artery calcification debulking necessity from CT images using multiple-instance learning with tabular data adaptation.

DetailsMotivation: To address the challenge of estimating the necessity of debulking coronary artery calcifications from CT images, where physicians adjust their focus and decision criteria based on patient-specific tabular data.

Method: Formulates the task as a Multiple-instance Learning (MIL) problem and proposes HyperAdAgFormer - a hypernetwork-based adaptive aggregation transformer that modifies feature aggregation strategies for each patient based on tabular data through a hypernetwork.

Result: Experiments on a clinical dataset demonstrated the effectiveness of HyperAdAgFormer for this medical imaging task.

Conclusion: The proposed HyperAdAgFormer successfully addresses the challenge of adaptive feature aggregation in medical imaging tasks where physician decisions depend on patient-specific tabular data.

Abstract: In this paper, we present the first attempt to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) images. We formulate this task as a Multiple-instance Learning (MIL) problem. The difficulty of this task lies in that physicians adjust their focus and decision criteria for device usage according to tabular data representing each patient’s condition. To address this issue, we propose a hypernetwork-based adaptive aggregation transformer (HyperAdAgFormer), which adaptively modifies the feature aggregation strategy for each patient based on tabular data through a hypernetwork. The experiments using the clinical dataset demonstrated the effectiveness of HyperAdAgFormer. The code is publicly available at https://github.com/Shiku-Kaito/HyperAdAgFormer.

[169] SimGraph: A Unified Framework for Scene Graph-Based Image Generation and Editing

Thanh-Nhan Vo, Trong-Thuan Nguyen, Tam V. Nguyen, Minh-Triet Tran

Main category: cs.CV

TL;DR: SimGraph: A unified scene graph-based framework for both image generation and editing with precise control over object relationships and spatial arrangements.

DetailsMotivation: Current approaches treat image generation and editing separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence. There's a lack of structured control over object relationships and spatial arrangements in existing methods.

Method: Introduces SimGraph, a unified framework that integrates scene graph-based image generation and editing. Combines token-based generation and diffusion-based editing within a single scene graph-driven model to ensure precise control over object interactions, layouts, and spatial coherence.

Result: Extensive experiments empirically demonstrate that the approach outperforms existing state-of-the-art methods in both image generation and editing tasks.

Conclusion: SimGraph provides a unified solution for scene graph-based image generation and editing, offering greater control over composition and interactions while maintaining spatial consistency and semantic coherence.

Abstract: Recent advancements in Generative Artificial Intelligence (GenAI) have significantly enhanced the capabilities of both image generation and editing. However, current approaches often treat these tasks separately, leading to inefficiencies and challenges in maintaining spatial consistency and semantic coherence between generated content and edits. Moreover, a major obstacle is the lack of structured control over object relationships and spatial arrangements. Scene graph-based methods, which represent objects and their interrelationships in a structured format, offer a solution by providing greater control over composition and interactions in both image generation and editing. To address this, we introduce SimGraph, a unified framework that integrates scene graph-based image generation and editing, enabling precise control over object interactions, layouts, and spatial coherence. In particular, our framework integrates token-based generation and diffusion-based editing within a single scene graph-driven model, ensuring high-quality and consistent results. Through extensive experiments, we empirically demonstrate that our approach outperforms existing state-of-the-art methods.

[170] HERS: Hidden-Pattern Expert Learning for Risk-Specific Vehicle Damage Adaptation in Diffusion Models

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: HERS framework improves text-to-image diffusion models for vehicle damage generation, enhancing fidelity and controllability for insurance applications while addressing fraud risks.

DetailsMotivation: Address concerns about misuse of realistic vehicle damage synthesis in automated insurance workflows, particularly fraud and claim manipulation risks, while improving domain-specific generation quality.

Method: Fine-tunes base diffusion models via domain-specific expert adaptation using self-supervised image-text pairs generated by LLMs and T2I pipelines, creating separate experts for each damage category integrated into unified multi-damage model.

Result: Consistent improvements across four diffusion backbones: +5.5% in text faithfulness and +2.3% in human preference ratings compared to baselines.

Conclusion: HERS demonstrates both opportunities and risks of domain-specific diffusion, highlighting importance of trustworthy generation in safety-critical applications like auto insurance.

Abstract: Recent advances in text-to-image (T2I) diffusion models have enabled increasingly realistic synthesis of vehicle damage, raising concerns about their reliability in automated insurance workflows. The ability to generate crash-like imagery challenges the boundary between authentic and synthetic data, introducing new risks of misuse in fraud or claim manipulation. To address these issues, we propose HERS (Hidden-Pattern Expert Learning for Risk-Specific Damage Adaptation), a framework designed to improve fidelity, controllability, and domain alignment of diffusion-generated damage images. HERS fine-tunes a base diffusion model via domain-specific expert adaptation without requiring manual annotation. Using self-supervised image-text pairs automatically generated by a large language model and T2I pipeline, HERS models each damage category, such as dents, scratches, broken lights, or cracked paint, as a separate expert. These experts are later integrated into a unified multi-damage model that balances specialization with generalization. We evaluate HERS across four diffusion backbones and observe consistent improvements: plus 5.5 percent in text faithfulness and plus 2.3 percent in human preference ratings compared to baselines. Beyond image fidelity, we discuss implications for fraud detection, auditability, and safe deployment of generative models in high-stakes domains. Our findings highlight both the opportunities and risks of domain-specific diffusion, underscoring the importance of trustworthy generation in safety-critical applications such as auto insurance.

[171] Vision KAN: Towards an Attention-Free Backbone for Vision with Kolmogorov-Arnold Networks

Zhuoqin Yang, Jiansong Zhang, Xiaoling Luo, Xu Wu, Zheng Lu, Linlin Shen

Main category: cs.CV

TL;DR: Vision KAN (ViK) is an attention-free vision backbone using Kolmogorov-Arnold Networks with MultiPatch-RBFKAN token mixer, achieving competitive ImageNet performance with linear complexity.

DetailsMotivation: Address limitations of attention mechanisms in vision backbones: quadratic complexity and poor interpretability. Explore attention-free alternatives inspired by recent successes of such architectures.

Method: Introduces Vision KAN (ViK) with MultiPatch-RBFKAN token mixer combining: (1) patch-wise nonlinear transform using Radial Basis Function-based KANs, (2) axis-wise separable mixing for local propagation, and (3) low-rank global mapping for long-range interaction. Uses patch-wise grouping strategy to handle high-resolution features efficiently.
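
To give a flavor of an RBF-based KAN-style mapping (only the general idea, not the paper's MultiPatch-RBFKAN mixer), a toy layer where each output is a learned combination of Gaussian basis functions of every input coordinate:

```python
import torch
import torch.nn as nn

class ToyRBFKANLayer(nn.Module):
    """Each output unit is a learned sum of Gaussian RBFs over every input
    coordinate (KAN-style edge functions). Illustrative, not the paper's mixer.
    """
    def __init__(self, in_dim, out_dim, num_centers=8):
        super().__init__()
        centers = torch.linspace(-2.0, 2.0, num_centers).expand(in_dim, num_centers)
        self.centers = nn.Parameter(centers.clone())
        self.log_gamma = nn.Parameter(torch.zeros(in_dim, num_centers))
        self.weight = nn.Parameter(0.02 * torch.randn(out_dim, in_dim, num_centers))

    def forward(self, x):                           # x: (B, in_dim)
        diff = x.unsqueeze(-1) - self.centers       # (B, in_dim, num_centers)
        phi = torch.exp(-self.log_gamma.exp() * diff.pow(2))
        return torch.einsum("bic,oic->bo", phi, self.weight)
```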

Result: ViK achieves competitive accuracy on ImageNet-1K classification while maintaining linear complexity, demonstrating KAN-based token mixing as an efficient alternative to attention mechanisms.

Conclusion: KAN-based token mixing offers a theoretically grounded, efficient alternative to attention mechanisms in vision backbones, with linear complexity and competitive performance.

Abstract: Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and clarity. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by the Kolmogorov-Arnold Networks. At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transform with Radial Basis Function-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy with lightweight operators to restore cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.

[172] Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

Main category: cs.CV

TL;DR: BA-solver: A training-free acceleration method for Flow Matching models that uses a lightweight SideNet to enable high-quality generation in 5-10 steps instead of 100+ steps, with minimal training cost and plug-and-play versatility.

DetailsMotivation: Flow Matching models have high-fidelity synthesis but suffer from latency bottlenecks due to iterative ODE solving. Existing solutions either degrade performance at low step counts or require expensive retraining, lacking plug-and-play versatility.

Method: Proposes Bi-Anchor Interpolation Solver (BA-solver) with two components: 1) Bidirectional Temporal Perception where a lightweight SideNet (1-2% of backbone size) learns to approximate future and historical velocities without retraining the backbone, and 2) Bi-Anchor Velocity Integration that uses the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration.
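
A schematic of the bi-anchor idea as described: the backbone supplies velocities at the two anchor times, a cheap side network fills in intermediate velocities, and the interval is integrated in sub-steps. The `side_net(x, t, v0, v1)` interface is an assumption for illustration, not the actual SideNet API.

```python
def bi_anchor_interval_step(x, t0, t1, v0, v1, side_net, num_sub=4):
    """Integrate the flow from t0 to t1 using two backbone 'anchor' velocities
    (v0, v1) and a lightweight side network for intermediate velocities.
    Hypothetical sketch; the paper's integration scheme may be higher-order.
    """
    dt = (t1 - t0) / num_sub
    for k in range(num_sub):
        t_mid = t0 + (k + 0.5) * dt            # midpoint of the k-th sub-interval
        v = side_net(x, t_mid, v0, v1)         # cheap approximate velocity
        x = x + dt * v                         # explicit Euler sub-step
    return x
```

The design intent, as the summary states, is that the expensive backbone is queried only at the anchors while the small side network densifies the trajectory in between.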

Result: On ImageNet-256^2, BA-solver achieves generation quality comparable to 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, with negligible training costs. It also enables seamless integration with existing generative pipelines for downstream tasks like image editing.

Conclusion: BA-solver bridges the gap between training-free and training-based acceleration methods for Flow Matching models, offering significant speedup with minimal training overhead while maintaining versatility and high-quality generation.

Abstract: Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

[173] Unifying Heterogeneous Degradations: Uncertainty-Aware Diffusion Bridge Model for All-in-One Image Restoration

Luwei Tu, Jiawei Wu, Xing Luo, Zhi Jin

Main category: cs.CV

TL;DR: UDBM proposes an uncertainty-aware diffusion bridge model for all-in-one image restoration, addressing conflicting optimization objectives across heterogeneous degradations through stochastic transport guided by pixel-wise uncertainty.

DetailsMotivation: All-in-One Image Restoration faces challenges in reconciling conflicting optimization objectives across different types of degradations. Existing methods have coarse-grained control or fixed mapping schedules, leading to suboptimal adaptation.

Method: Proposes Uncertainty-Aware Diffusion Bridge Model (UDBM) that reformulates AiOIR as stochastic transport steered by pixel-wise uncertainty. Uses relaxed diffusion bridge formulation to replace strict terminal constraints, models degradation uncertainty, and resolves drift singularity. Implements dual modulation strategy: noise schedule aligns degradations to shared high-entropy latent space, and path schedule adaptively regulates transport trajectory based on viscous dynamics of entropy regularization.

Result: Achieves state-of-the-art performance across diverse restoration tasks within a single inference step by effectively rectifying transport geometry and dynamics.

Conclusion: UDBM successfully addresses the fundamental challenge of conflicting optimization objectives in AiOIR through uncertainty-aware stochastic transport, providing superior adaptation across heterogeneous degradations.

Abstract: All-in-One Image Restoration (AiOIR) faces the fundamental challenge in reconciling conflicting optimization objectives across heterogeneous degradations. Existing methods are often constrained by coarse-grained control mechanisms or fixed mapping schedules, yielding suboptimal adaptation. To address this, we propose an Uncertainty-Aware Diffusion Bridge Model (UDBM), which innovatively reformulates AiOIR as a stochastic transport problem steered by pixel-wise uncertainty. By introducing a relaxed diffusion bridge formulation which replaces the strict terminal constraint with a relaxed constraint, we model the uncertainty of degradations while theoretically resolving the drift singularity inherent in standard diffusion bridges. Furthermore, we devise a dual modulation strategy: the noise schedule aligns diverse degradations into a shared high-entropy latent space, while the path schedule adaptively regulates the transport trajectory motivated by the viscous dynamics of entropy regularization. By effectively rectifying the transport geometry and dynamics, UDBM achieves state-of-the-art performance across diverse restoration tasks within a single inference step.

[174] HydroSense: A Dual-Microcontroller IoT Framework for Real-Time Multi-Parameter Water Quality Monitoring with Edge Processing and Cloud Analytics

Abdul Hasib, A. S. M. Ahsanul Sarkar Akib, Anish Giri

Main category: cs.CV

TL;DR: HydroSense: A low-cost IoT framework for real-time water quality monitoring using dual-microcontroller architecture with six parameters (pH, DO, temperature, TDS, nitrogen, water level), achieving 85% cost reduction and high accuracy.

DetailsMotivation: Address the global water crisis by providing affordable, accurate, real-time water quality monitoring solutions that overcome accessibility challenges in resource-constrained environments where traditional manual sampling or expensive commercial systems fail.

Method: Dual-microcontroller architecture using Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. Implements advanced signal processing including median filtering for TDS, temperature compensation algorithms, and robust error handling.
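
As a rough illustration of the signal-processing steps mentioned (median filtering of TDS readings plus temperature compensation), a small Python sketch; the 2%/°C coefficient and the formula are generic assumptions rather than HydroSense's calibrated values:

```python
def process_tds(readings_ppm, water_temp_c, ref_temp_c=25.0, temp_coeff=0.02):
    """Median-filter a window of raw TDS readings, then normalize to a
    reference temperature with a simple linear compensation factor.
    Coefficients are illustrative, not the system's calibration constants.
    """
    ordered = sorted(readings_ppm)
    median = ordered[len(ordered) // 2]                  # robust to sensor spikes
    compensation = 1.0 + temp_coeff * (water_temp_c - ref_temp_c)
    return median / compensation
```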

Result: Exceptional performance over 90 days: pH accuracy ±0.08 units, DO stability ±0.2 mg/L, TDS accuracy ±1.9% across 0-1000 ppm, 99.8% cloud data transmission reliability. Total cost 32,983 BDT (~300 USD), achieving 85% cost reduction compared to commercial systems.

Conclusion: HydroSense establishes a new paradigm for accessible environmental monitoring, demonstrating professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.

Abstract: The global water crisis necessitates affordable, accurate, and real-time water quality monitoring solutions. Traditional approaches relying on manual sampling or expensive commercial systems fail to address accessibility challenges in resource-constrained environments. This paper presents HydroSense, an innovative Internet of Things framework that integrates six critical water quality parameters including pH, dissolved oxygen (DO), temperature, total dissolved solids (TDS), estimated nitrogen, and water level into a unified monitoring system. HydroSense employs a novel dual-microcontroller architecture, utilizing Arduino Uno for precision analog measurements with five-point calibration algorithms and ESP32 for wireless connectivity, edge processing, and cloud integration. The system implements advanced signal processing techniques including median filtering for TDS measurement, temperature compensation algorithms, and robust error handling. Experimental validation over 90 days demonstrates exceptional performance metrics: pH accuracy of plus or minus 0.08 units across the 0 to 14 range, DO measurement stability within plus or minus 0.2 mg/L, TDS accuracy of plus or minus 1.9 percent across 0 to 1000 ppm, and 99.8 percent cloud data transmission reliability. With a total implementation cost of 32,983 BDT (approximately 300 USD), HydroSense achieves an 85 percent cost reduction compared to commercial systems while providing enhanced connectivity through the Firebase real-time database. This research establishes a new paradigm for accessible environmental monitoring, demonstrating that professional-grade water quality assessment can be achieved through intelligent system architecture and cost-effective component selection.

[175] WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu

Main category: cs.CV

TL;DR: WMVLM is a unified evaluation framework for diffusion model image watermarks using vision-language models, addressing limitations in existing evaluation methods by providing interpretable assessments of both residual and semantic watermarks.

DetailsMotivation: Existing watermark evaluation methods for diffusion models have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide non-interpretable results, neglect comprehensive security considerations, and use inappropriate metrics for semantic watermarks.

Method: Proposes WMVLM, a vision-language model-based framework that redefines quality and security metrics for each watermark type: residual watermarks evaluated by artifact strength and erasure resistance, semantic watermarks assessed through latent distribution shifts. Uses a three-stage training strategy for progressive learning of classification, scoring, and interpretable text generation.

Result: WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods, demonstrating effective unified evaluation capabilities.

Conclusion: WMVLM provides the first unified and interpretable evaluation framework for diffusion model image watermarking, addressing critical gaps in existing evaluation methodologies through vision-language model capabilities.

Abstract: Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.

[176] PathReasoner-R1: Instilling Structured Reasoning into Pathology Vision-Language Model via Knowledge-Guided Policy Optimization

Songhan Jiang, Fengchun Liu, Ziyue Wang, Linghan Cai, Yongbing Zhang

Main category: cs.CV

TL;DR: PathReasoner introduces a vision-language model for computational pathology with structured chain-of-thought reasoning, using knowledge-guided generation and reinforcement learning to provide verifiable evidence for diagnoses.

DetailsMotivation: Current vision-language models in pathology often output direct conclusions without verifiable evidence-linked reasoning, limiting clinical trust and hindering expert error rectification. There's a need for transparent, clinically grounded reasoning capabilities.

Method: Created PathReasoner dataset using medical knowledge graphs to align pathological findings with diagnoses. Developed PathReasoner-R1 model using trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning, incorporating knowledge-aware multi-granular reward functions with Entity Reward mechanism.
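
A toy stand-in for an entity-level reward (fraction of reference pathological entities mentioned in the generated reasoning); the actual Entity Reward is knowledge-graph aligned and considerably more elaborate:

```python
def entity_reward(generated_reasoning, reference_entities):
    """Reward = fraction of reference entities that appear in the generated
    chain-of-thought. Illustrative substring matching only; the paper's reward
    is aligned with a medical knowledge graph.
    """
    text = generated_reasoning.lower()
    hits = sum(1 for entity in reference_entities if entity.lower() in text)
    return hits / max(len(reference_entities), 1)
```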

Result: Achieves state-of-the-art performance on both PathReasoner dataset and public benchmarks across various image scales, providing transparent, clinically grounded reasoning capabilities.

Conclusion: PathReasoner-R1 successfully equips pathology models with verifiable evidence-linked reasoning, enhancing clinical trust through structured chain-of-thought capabilities aligned with medical knowledge graphs.

Abstract: Vision-Language Models (VLMs) are advancing computational pathology with superior visual understanding capabilities. However, current systems often reduce diagnosis to directly output conclusions without verifiable evidence-linked reasoning, which severely limits clinical trust and hinders expert error rectification. To address these barriers, we construct PathReasoner, the first large-scale dataset of whole-slide image (WSI) reasoning. Unlike previous work reliant on unverified distillation, we develop a rigorous knowledge-guided generation pipeline. By leveraging medical knowledge graphs, we explicitly align structured pathological findings and clinical reasoning with diagnoses, generating over 20K high-quality instructional samples. Based on the database, we propose PathReasoner-R1, which synergizes trajectory-masked supervised fine-tuning with reasoning-oriented reinforcement learning to instill structured chain-of-thought capabilities. To ensure medical rigor, we engineer a knowledge-aware multi-granular reward function incorporating an Entity Reward mechanism strictly aligned with knowledge graphs. This effectively guides the model to optimize for logical consistency rather than mere outcome matching, thereby enhancing robustness. Extensive experiments demonstrate that PathReasoner-R1 achieves state-of-the-art performance on both PathReasoner and public benchmarks across various image scales, equipping pathology models with transparent, clinically grounded reasoning capabilities. Dataset and code are available at https://github.com/cyclexfy/PathReasoner-R1.

[177] Similarity of Processing Steps in Vision Model Representations

Matéo Mahaut, Marco Baroni

Main category: cs.CV

TL;DR: The paper studies how different vision models (CNNs, transformers, classifiers) converge to similar representations, examining intermediate processing steps and operations rather than just final representations.

DetailsMotivation: While existing literature shows models converge to similar "universal" representations, this paper investigates whether they also converge through the same intermediate steps and operations, providing deeper insight into the convergence process.

Method: Quantifies distances between different model representations at various processing stages, tracks evolution of distances throughout processing, identifies most divergent processing steps, and compares CNN vs transformer behaviors and classifier-specific patterns.

Result: Found that while layers at similar positions have most similar representations, strong differences remain; classifiers discard low-level image statistics in final layers; transformers apply smoother representation changes between layers than CNNs.

Conclusion: The study clarifies the level and nature of convergence between model representations, enabling more qualitative understanding of underlying processes in image models beyond just final representation similarity.

Abstract: Recent literature suggests that the bigger the model, the more likely it is to converge to similar, "universal" representations, despite different training objectives, datasets, or modalities. While this literature shows that there is an area where model representations are similar, we study here how vision models might get to those representations – in particular, do they also converge to the same intermediate steps and operations? We therefore study the processes that lead to convergent representations in different models. First, we quantify distance between different model representations at different stages. We follow the evolution of distances between models throughout processing, identifying the processing steps which are most different between models. We find that while layers at similar positions in different models have the most similar representations, strong differences remain. Classifier models, unlike the others, will discard information about low-level image statistics in their final layers. CNN- and transformer-based models also behave differently, with transformer models applying smoother changes to representations from one layer to the next. These distinctions clarify the level and nature of convergence between model representations, and enable a more qualitative account of the underlying processes in image models.

[178] A Tilted Seesaw: Revisiting Autoencoder Trade-off for Controllable Diffusion

Pu Cao, Yiyang Ma, Feng Zhou, Xuedan Yin, Qing Song, Lu Yang

Main category: cs.CV

TL;DR: Autoencoder evaluation in latent diffusion models shows systematic bias toward generative metrics (gFID) over reconstruction fidelity, which becomes problematic for controllable generation where reconstruction metrics better predict controllability.

DetailsMotivation: The paper identifies a systematic bias in autoencoder evaluation for latent diffusion models, where recent studies increasingly favor generative metrics (gFID) over reconstruction fidelity, potentially harming controllable generation tasks.

Method: The authors theoretically analyze why gFID-dominant preference appears unproblematic for ImageNet generation but becomes risky for controllable diffusion, then empirically validate using multi-dimensional condition-drift evaluation protocol and ControlNet experiments.

Result: gFID is only weakly predictive of condition preservation, while reconstruction-oriented metrics are substantially more aligned with controllability. ControlNet experiments confirm controllability tracks condition preservation rather than gFID.

Conclusion: There’s a gap between ImageNet-centric AE evaluation and requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection in multimodal generation systems.

Abstract: In latent diffusion models, the autoencoder (AE) is typically expected to balance two capabilities: faithful reconstruction and a generation-friendly latent space (e.g., low gFID). In recent ImageNet-scale AE studies, we observe a systematic bias toward generative metrics in handling this trade-off: reconstruction metrics are increasingly under-reported, and ablation-based AE selection often favors the best-gFID configuration even when reconstruction fidelity degrades. We theoretically analyze why this gFID-dominant preference can appear unproblematic for ImageNet generation, yet becomes risky when scaling to controllable diffusion: AEs can induce condition drift, which limits achievable condition alignment. Meanwhile, we find that reconstruction fidelity, especially instance-level measures, better indicates controllability. We empirically validate the impact of tilted autoencoder evaluation on controllability by studying several recent ImageNet AEs. Using a multi-dimensional condition-drift evaluation protocol reflecting controllable generation tasks, we find that gFID is only weakly predictive of condition preservation, whereas reconstruction-oriented metrics are substantially more aligned. ControlNet experiments further confirm that controllability tracks condition preservation rather than gFID. Overall, our results expose a gap between ImageNet-centric AE evaluation and the requirements of scalable controllable diffusion, offering practical guidance for more reliable benchmarking and model selection.

[179] RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

Shiqi Huang, Shuting He, Bihan Wen

Main category: cs.CV

TL;DR: RSGround-R1: A reasoning-guided, position-aware post-training framework for Remote Sensing Visual Grounding that enhances spatial understanding in MLLMs through CoT-SFT and reinforcement fine-tuning with positional rewards.

DetailsMotivation: Remote Sensing Visual Grounding faces challenges due to vast spatial scale and high semantic ambiguity in aerial imagery, where language descriptions heavily rely on positional cues. Current MLLMs struggle with spatial reasoning in these unique scenarios.

Method: Proposes RSGround-R1 with three key components: 1) Chain-of-Thought Supervised Fine-Tuning using synthetic RSVG reasoning data for explicit position awareness, 2) Reinforcement Fine-Tuning with novel positional reward for continuous distance-aware guidance, and 3) Spatial consistency guided optimization to ensure stable convergence by adjusting policy updates based on spatial coherence.
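
A generic example of a continuous, distance-aware positional reward for grounding, shown only to illustrate the shape of such a signal; the paper's actual reward definition is not reproduced here, and sigma is an assumed bandwidth:

```python
import math

def positional_reward(pred_box, gt_box, image_w, image_h, sigma=0.1):
    """Reward decays smoothly with the normalized center distance between the
    predicted and ground-truth boxes, giving dense guidance even when IoU is 0.
    Boxes are (x1, y1, x2, y2); sigma is an illustrative bandwidth.
    """
    def center(b):
        return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
    (px, py), (gx, gy) = center(pred_box), center(gt_box)
    diag = math.hypot(image_w, image_h)
    dist = math.hypot(px - gx, py - gy) / diag     # normalized center distance
    return math.exp(-dist / sigma)                 # in (0, 1], 1 when centers coincide
```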

Result: Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of the proposed model compared to existing approaches.

Conclusion: The RSGround-R1 framework effectively enhances spatial understanding in MLLMs for remote sensing applications through reasoning-guided, position-aware post-training, addressing the unique challenges of RSVG tasks.

Abstract: Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.

[180] OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng

Main category: cs.CV

TL;DR: OCRVerse: A unified end-to-end OCR method that handles both text-centric documents (newspapers, magazines) and vision-centric rendered composites (charts, web pages, scientific plots) using a two-stage SFT-RL training approach.

DetailsMotivation: Existing OCR methods focus only on text extraction from images/documents, neglecting visual elements in information-dense sources like charts and web pages. There's a need for holistic OCR that can handle both text-centric and vision-centric content, especially given the prevalence of such visual data on the internet and its real-world applications.

Method: OCRVerse uses comprehensive data engineering covering both text-centric documents and vision-centric rendered composites. It employs a two-stage training approach: 1) Supervised Fine-Tuning (SFT) with mixed cross-domain data to establish initial domain knowledge, and 2) Reinforcement Learning (RL) with personalized reward strategies customized for each domain’s specific output formats and requirements.

Result: OCRVerse achieves competitive results across both text-centric and vision-centric data types, performing comparably to large-scale open-source and closed-source models.

Conclusion: OCRVerse represents the first holistic OCR method that unifies text-centric and vision-centric OCR in an end-to-end manner, addressing the gap in existing OCR technology for visually information-dense images.

Abstract: The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages and science plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines and books, as well as vision-centric rendered composites, including charts, web pages and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require various output formats and expected outputs, we provide sufficient flexibility in the RL stage to customize flexible reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

[181] CAF-Mamba: Mamba-Based Cross-Modal Adaptive Attention Fusion for Multimodal Depression Detection

Bowen Zhou, Marc-André Fiedler, Ayoub Al-Hamadi

Main category: cs.CV

TL;DR: CAF-Mamba: A Mamba-based cross-modal adaptive attention fusion framework for depression detection that captures explicit and implicit cross-modal interactions with dynamic modality weighting.

DetailsMotivation: Current deep learning approaches for depression detection have limitations: they rely on limited feature types, overlook explicit cross-modal interactions, and use simple concatenation or static weighting for fusion, which hinders effective multimodal learning.

Method: Proposes CAF-Mamba, a novel framework using Mamba architecture with cross-modal adaptive attention fusion that captures both explicit and implicit cross-modal interactions and dynamically adjusts modality contributions through modality-wise attention mechanisms.

Result: Experiments on two in-the-wild benchmark datasets (LMVD and D-Vlog) show CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance for depression detection.

Conclusion: CAF-Mamba effectively addresses limitations of existing multimodal fusion approaches for depression detection by enabling more sophisticated cross-modal interaction modeling and dynamic modality weighting.

Abstract: Depression is a prevalent mental health disorder that severely impairs daily functioning and quality of life. While recent deep learning approaches for depression detection have shown promise, most rely on limited feature types, overlook explicit cross-modal interactions, and employ simple concatenation or static weighting for fusion. To overcome these limitations, we propose CAF-Mamba, a novel Mamba-based cross-modal adaptive attention fusion framework. CAF-Mamba not only captures cross-modal interactions explicitly and implicitly, but also dynamically adjusts modality contributions through a modality-wise attention mechanism, enabling more effective multimodal fusion. Experiments on two in-the-wild benchmark datasets, LMVD and D-Vlog, demonstrate that CAF-Mamba consistently outperforms existing methods and achieves state-of-the-art performance.

[182] Few-Shot Domain Adaptation with Temporal References and Static Priors for Glacier Calving Front Delineation

Marcel Dreier, Nora Gourmelon, Dakota Pyles, Thorsten Seehaus, Matthias H. Braun, Andreas Maier, Vincent Christlein

Main category: cs.CV

TL;DR: A glacier calving front delineation model achieves near-human performance on benchmarks but fails in real-world novel sites; domain adaptation with spatial priors and seasonal references reduces error from 1131.6m to 68.7m without architectural changes.

DetailsMotivation: State-of-the-art glacier calving front delineation models perform well on benchmark datasets but fail when applied to novel real-world study sites, creating a gap between academic benchmarks and practical scientific applications.

Method: Few-shot domain adaptation strategy incorporating spatial static prior knowledge and including summer reference images in input time series, without any architectural modifications to the original model.

Result: Delineation error reduced dramatically from 1131.6 meters to 68.7 meters when applying the domain adaptation approach to novel study sites.

Conclusion: The methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling global-scale calving front monitoring.

Abstract: During benchmarking, the state-of-the-art model for glacier calving front delineation achieves near-human performance. However, when applied in a real-world setting at a novel study site, its delineation accuracy is insufficient for calving front products intended for further scientific analyses. This site represents an out-of-distribution domain for a model trained solely on the benchmark dataset. By employing a few-shot domain adaptation strategy, incorporating spatial static prior knowledge, and including summer reference images in the input time series, the delineation error is reduced from 1131.6 m to 68.7 m without any architectural modifications. These methodological advancements establish a framework for applying deep learning-based calving front segmentation to novel study sites, enabling calving front monitoring on a global scale.

[183] When Gradient Optimization Is Not Enough: † Dispersive and Anchoring Geometric Regularizer for Multimodal Learning

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang

Main category: cs.CV

TL;DR: A geometry-aware regularization framework called \regName that addresses representation collapse and cross-modal inconsistency in multimodal learning through intra-modal dispersive and inter-modal anchoring constraints.

DetailsMotivation: Multimodal models often suffer from geometric pathologies like intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion, even with balanced training schemes.

Method: Proposes \regName, a lightweight geometry-aware regularization framework with two complementary constraints: intra-modal dispersive regularization to promote representation diversity, and inter-modal anchoring regularization to bound sample-level cross-modal drift without rigid alignment.
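
A minimal sketch of the two constraints described above, under the assumption of paired, L2-normalized embeddings from two modalities; the weights, margin, and similarity measures are illustrative guesses, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def geometry_regularizer(z_a, z_b, w_disp=1.0, w_anchor=1.0, margin=0.5):
    """Intra-modal dispersive term plus inter-modal anchoring term.

    z_a, z_b: (N, D) embeddings of the same N samples from two modalities.
    Dispersive: penalize high average pairwise similarity within a modality
    (discourages collapse). Anchoring: penalize paired cross-modal distance
    only beyond a margin (bounds drift without forcing rigid alignment).
    """
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)

    def dispersion(z):
        sim = z @ z.T
        n = z.size(0)
        off_diag = sim[~torch.eye(n, dtype=torch.bool, device=z.device)]
        return off_diag.mean()

    disp = dispersion(z_a) + dispersion(z_b)
    drift = (z_a - z_b).norm(dim=1)
    anchor = F.relu(drift - margin).mean()
    return w_disp * disp + w_anchor * anchor
```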

Result: Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing effective mitigation of modality trade-offs.

Conclusion: Explicitly regulating representation geometry is an effective approach to address geometric pathologies in multimodal learning, and the proposed plug-and-play regularizer works with various training paradigms without architectural modifications.

Abstract: Multimodal learning aims to integrate complementary information from heterogeneous modalities, yet strong optimization alone does not guarantee well-structured representations. Even under carefully balanced training schemes, multimodal models often exhibit geometric pathologies, including intra-modal representation collapse and sample-level cross-modal inconsistency, which degrade both unimodal robustness and multimodal fusion. We identify representation geometry as a missing control axis in multimodal learning and propose \regName, a lightweight geometry-aware regularization framework. \regName enforces two complementary constraints on intermediate embeddings: an intra-modal dispersive regularization that promotes representation diversity, and an inter-modal anchoring regularization that bounds sample-level cross-modal drift without rigid alignment. The proposed regularizer is plug-and-play, requires no architectural modifications, and is compatible with various training paradigms. Extensive experiments across multiple multimodal benchmarks demonstrate consistent improvements in both multimodal and unimodal performance, showing that explicitly regulating representation geometry effectively mitigates modality trade-offs.

[184] Multimodal Visual Surrogate Compression for Alzheimer’s Disease Classification

Dexuan Ding, Ciyuan Peng, Endrowednes Kuantama, Jingcai Guo, Jia Wu, Jian Yang, Amin Beheshti, Ming-Hsuan Yang, Yuankai Qi

Main category: cs.CV

TL;DR: MVSC compresses 3D sMRI volumes into 2D visual surrogates aligned with 2D foundation models for Alzheimer’s Disease classification, addressing computational cost and cross-slice relation issues.

DetailsMotivation: Existing sMRI representation learning methods have three main problems: 3D architectures are computationally expensive, slice-wise approaches lose cross-slice relations, and training-free 2D foundation models have limited discriminative feature extraction. Need a method that balances efficiency with preserving 3D structural information.

Method: Proposes Multimodal Visual Surrogate Compression (MVSC) with two components: 1) Volume Context Encoder captures global cross-slice context under textual guidance, 2) Adaptive Slice Fusion module aggregates slice-level information in a text-enhanced, patch-wise manner. Learns to compress 3D sMRI volumes into compact 2D features (visual surrogates) aligned with frozen 2D foundation models.

Result: Extensive experiments on three large-scale Alzheimer’s disease benchmarks show MVSC performs favorably on both binary and multi-class classification tasks compared to state-of-the-art methods.

Conclusion: MVSC effectively addresses limitations of existing sMRI representation learning methods by compressing 3D volumes into 2D features that leverage powerful 2D foundation models while preserving important 3D structural information through multimodal guidance.

Abstract: High-dimensional structural MRI (sMRI) images are widely used for Alzheimer’s Disease (AD) diagnosis. Most existing methods for sMRI representation learning rely on 3D architectures (e.g., 3D CNNs), slice-wise feature extraction with late aggregation, or apply training-free feature extractions using 2D foundation models (e.g., DINO). However, these three paradigms suffer from high computational cost, loss of cross-slice relations, and limited ability to extract discriminative features, respectively. To address these challenges, we propose Multimodal Visual Surrogate Compression (MVSC). It learns to compress and adapt large 3D sMRI volumes into compact 2D features, termed as visual surrogates, which are better aligned with frozen 2D foundation models to extract powerful representations for final AD classification. MVSC has two key components: a Volume Context Encoder that captures global cross-slice context under textual guidance, and an Adaptive Slice Fusion module that aggregates slice-level information in a text-enhanced, patch-wise manner. Extensive experiments on three large-scale Alzheimer’s disease benchmarks demonstrate our MVSC performs favourably on both binary and multi-class classification tasks compared against state-of-the-art methods.

[185] ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing

Shuo Li, Jiajun Sun, Zhekai Wang, Xiaoran Fan, Hui Li, Dingwen Yang, Zhiheng Xi, Yijun Wang, Zifei Shan, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CV

TL;DR: ChartE³ is an end-to-end chart editing benchmark that evaluates multimodal models on both local (appearance) and global (data-centric) editing tasks without intermediate code or language representations.

DetailsMotivation: Existing chart editing approaches use pipeline designs with natural language or code as intermediate representations, limiting their ability to faithfully execute complex edits. There's a need for direct end-to-end evaluation of multimodal models on chart editing tasks.

Method: Introduces ChartE³ benchmark with over 1,200 high-quality samples constructed via a data pipeline with human curation. Each sample includes chart image, underlying code, and multimodal editing instruction. Focuses on two editing dimensions: local editing (fine-grained appearance changes) and global editing (holistic data-centric transformations).

Result: Extensive benchmarking reveals substantial performance gaps in state-of-the-art multimodal large language models, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

Conclusion: ChartE³ provides a comprehensive benchmark for evaluating end-to-end chart editing capabilities, revealing significant challenges in current multimodal models, especially for complex global editing tasks that require structural understanding.

Abstract: Charts are a fundamental visualization format for structured data analysis. Enabling end-to-end chart editing according to user intent is of great practical value, yet remains challenging due to the need for both fine-grained control and global structural consistency. Most existing approaches adopt pipeline-based designs, where natural language or code serves as an intermediate representation, limiting their ability to faithfully execute complex edits. We introduce ChartE$^{3}$, an End-to-End Chart Editing benchmark that directly evaluates models without relying on intermediate natural language programs or code-level supervision. ChartE$^{3}$ focuses on two complementary editing dimensions: local editing, which involves fine-grained appearance changes such as font or color adjustments, and global editing, which requires holistic, data-centric transformations including data filtering and trend line addition. ChartE$^{3}$ contains over 1,200 high-quality samples constructed via a well-designed data pipeline with human curation. Each sample is provided as a triplet of a chart image, its underlying code, and a multimodal editing instruction, enabling evaluation from both objective and subjective perspectives. Extensive benchmarking of state-of-the-art multimodal large language models reveals substantial performance gaps, particularly on global editing tasks, highlighting critical limitations in current end-to-end chart editing capabilities.

[186] DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning

Mingshuang Luo, Shuang Liang, Zhengkun Rong, Yuxuan Luo, Tianshu Hu, Ruibing Hou, Hong Chang, Yong Li, Yuan Zhang, Mingyuan Gao

Main category: cs.CV

TL;DR: DreamActor-M2 is a universal character animation framework that treats motion conditioning as in-context learning, using a two-stage approach to fuse appearance and motion cues and employing self-bootstrapped data synthesis for better generalization across diverse characters.

DetailsMotivation: Existing character animation methods have two main problems: (1) suboptimal motion injection leading to trade-offs between identity preservation and motion consistency (the "see-saw" problem), and (2) over-reliance on explicit pose priors like skeletons that inadequately capture intricate dynamics and hinder generalization to non-humanoid characters.

Method: Two-stage approach: (1) Fuses reference appearance and motion cues into a unified latent space to jointly reason about spatial identity and temporal dynamics using foundational models’ generative priors. (2) Introduces self-bootstrapped data synthesis pipeline that creates pseudo cross-identity training pairs to transition from pose-dependent control to direct RGB-driven animation.

Result: DreamActor-M2 achieves state-of-the-art performance with superior visual fidelity and robust cross-domain generalization. The paper also introduces AW Bench, a comprehensive benchmark for evaluating character animation across diverse character types and motion scenarios.

Conclusion: DreamActor-M2 presents a universal animation framework that addresses fundamental limitations in existing methods by treating motion conditioning as in-context learning, enabling better identity preservation, motion consistency, and generalization across diverse characters without relying on explicit pose priors.

Abstract: Character image animation aims to synthesize high-fidelity videos by transferring motion from a driving sequence to a static reference image. Despite recent advancements, existing methods suffer from two fundamental challenges: (1) suboptimal motion injection strategies that lead to a trade-off between identity preservation and motion consistency, manifesting as a “see-saw”, and (2) an over-reliance on explicit pose priors (e.g., skeletons), which inadequately capture intricate dynamics and hinder generalization to arbitrary, non-humanoid characters. To address these challenges, we present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space, enabling the model to jointly reason about spatial identity and temporal dynamics by leveraging the generative prior of foundational models. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs, facilitating a seamless transition from pose-dependent control to direct, end-to-end RGB-driven animation. This strategy significantly enhances generalization across diverse characters and motion scenarios. To facilitate comprehensive evaluation, we further introduce AW Bench, a versatile benchmark encompassing a wide spectrum of character types and motion scenarios. Extensive experiments demonstrate that DreamActor-M2 achieves state-of-the-art performance, delivering superior visual fidelity and robust cross-domain generalization. Project Page: https://grisoon.github.io/DreamActor-M2/

[187] From Global to Granular: Revealing IQA Model Performance via Correlation Surface

Baoliang Chen, Danni Huang, Hanwei Zhu, Lingyu Zhu, Wei Zhou, Shiqi Wang, Yuming Fang, Weisi Lin

Main category: cs.CV

TL;DR: Proposes Granularity-Modulated Correlation (GMC) for fine-grained evaluation of Image Quality Assessment models, addressing limitations of global correlation metrics like PLCC and SRCC.

DetailsMotivation: Current IQA evaluation relies on global correlation metrics (PLCC, SRCC) that reduce performance to single scalars, failing to capture local performance variations across the quality spectrum and being sensitive to test-sample distributions.

Method: GMC includes: (1) Granularity Modulator using Gaussian-weighted correlations conditioned on MOS values and pairwise MOS differences, and (2) Distribution Regulator to mitigate biases from non-uniform quality distributions, producing a 3D correlation surface.

Result: Experiments on standard benchmarks show GMC reveals performance characteristics invisible to scalar metrics, offering more informative and reliable analysis of IQA models.

Conclusion: GMC provides a structured, fine-grained paradigm for evaluating IQA models that overcomes limitations of traditional global correlation metrics.

Abstract: Evaluation of Image Quality Assessment (IQA) models has long been dominated by global correlation metrics, such as Pearson Linear Correlation Coefficient (PLCC) and Spearman Rank-Order Correlation Coefficient (SRCC). While widely adopted, these metrics reduce performance to a single scalar, failing to capture how ranking consistency varies across the local quality spectrum. For example, two IQA models may achieve identical SRCC values, yet one ranks high-quality images (related to high Mean Opinion Score, MOS) more reliably, while the other better discriminates image pairs with small quality/MOS differences (related to |ΔMOS|). Such complementary behaviors are invisible under global metrics. Moreover, SRCC and PLCC are sensitive to test-sample quality distributions, yielding unstable comparisons across test sets. To address these limitations, we propose Granularity-Modulated Correlation (GMC), which provides a structured, fine-grained analysis of IQA performance. GMC includes: (1) a Granularity Modulator that applies Gaussian-weighted correlations conditioned on absolute MOS values and pairwise MOS differences (|ΔMOS|) to examine local performance variations, and (2) a Distribution Regulator that regularizes correlations to mitigate biases from non-uniform quality distributions. The resulting correlation surface maps correlation values as a joint function of MOS and |ΔMOS|, providing a 3D representation of IQA performance. Experiments on standard benchmarks show that GMC reveals performance characteristics invisible to scalar metrics, offering a more informative and reliable paradigm for analyzing, comparing, and deploying IQA models. Codes are available at https://github.com/Dniaaa/GMC.
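As a toy illustration of the general idea rather than the paper's estimator, one point of a Gaussian-weighted correlation surface could be computed as below; the pairwise concordance statistic, the bandwidths sigma_m and sigma_d, and the evaluation grid are all assumptions of the sketch:

```python
import numpy as np

def gmc_surface_point(pred, mos, m0, d0, sigma_m=5.0, sigma_d=5.0):
    # One (MOS, |dMOS|) cell of a correlation surface: each image pair (i, j)
    # is Gaussian-weighted by how close its MOS values are to m0 and how
    # close its |dMOS| is to d0, then a weighted concordance is averaged.
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    i, j = np.triu_indices(len(mos), k=1)
    d_mos, d_pred = mos[i] - mos[j], pred[i] - pred[j]
    w = (np.exp(-((mos[i] - m0) ** 2 + (mos[j] - m0) ** 2) / (2 * sigma_m ** 2))
         * np.exp(-((np.abs(d_mos) - d0) ** 2) / (2 * sigma_d ** 2)))
    concord = np.sign(d_mos) * np.sign(d_pred)        # +1 when ranked consistently
    return float((w * concord).sum() / (w.sum() + 1e-8))

# sweep a grid of (MOS, |dMOS|) centers to obtain the surface
mos = np.random.uniform(0, 100, 200)
pred = mos + np.random.normal(0, 10, 200)             # a synthetic IQA model output
surface = [[gmc_surface_point(pred, mos, m0, d0) for d0 in (5, 20, 50)]
           for m0 in (20, 50, 80)]
```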

[188] Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation

Jiankun Peng, Jianyuan Guo, Ying Xu, Yue Liu, Jiashuang Yan, Xuanwei Ye, Houhua Li, Xiaoming Wang

Main category: cs.CV

TL;DR: DGNav: A dynamic topological navigation framework for vision-language navigation that adaptively modulates map density and connectivity based on environmental complexity to address the “Granularity Rigidity” problem in existing methods.

DetailsMotivation: Existing topological planning methods for vision-language navigation suffer from "Granularity Rigidity" - they use fixed geometric thresholds for node sampling, which fails to adapt to varying environmental complexities. This leads to over-sampling in simple areas (computational redundancy) and under-sampling in high-uncertainty regions (increased collision risks and compromised precision).

Method: DGNav introduces two core innovations: 1) Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on predicted waypoint dispersion, enabling “densification on demand” in challenging environments; 2) Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights to filter topological noise and enhance instruction adherence.

Result: Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate superior navigation performance and strong generalization capabilities. Ablation studies confirm the framework achieves optimal trade-off between navigation efficiency and safe exploration.

Conclusion: DGNav effectively addresses the Granularity Rigidity problem in vision-language navigation by introducing dynamic topological planning that adapts to environmental complexity, improving both navigation precision and computational efficiency.

Abstract: Vision-Language Navigation in Continuous Environments (VLN-CE) presents a core challenge: grounding high-level linguistic instructions into precise, safe, and long-horizon spatial actions. Explicit topological maps have proven to be a vital solution for providing robust spatial memory in such tasks. However, existing topological planning methods suffer from a “Granularity Rigidity” problem. Specifically, these methods typically rely on fixed geometric thresholds to sample nodes, which fails to adapt to varying environmental complexities. This rigidity leads to a critical mismatch: the model tends to over-sample in simple areas, causing computational redundancy, while under-sampling in high-uncertainty regions, increasing collision risks and compromising precision. To address this, we propose DGNav, a framework for Dynamic Topological Navigation, introducing a context-aware mechanism to modulate map density and connectivity on-the-fly. Our approach comprises two core innovations: (1) A Scene-Aware Adaptive Strategy that dynamically modulates graph construction thresholds based on the dispersion of predicted waypoints, enabling “densification on demand” in challenging environments; (2) A Dynamic Graph Transformer that reconstructs graph connectivity by fusing visual, linguistic, and geometric cues into dynamic edge weights, enabling the agent to filter out topological noise and enhancing instruction adherence. Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate DGNav exhibits superior navigation performance and strong generalization capabilities. Furthermore, ablation studies confirm that our framework achieves an optimal trade-off between navigation efficiency and safe exploration. The code is available at https://github.com/shannanshouyin/DGNav.
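A minimal sketch of "densification on demand": the node-sampling threshold shrinks when the predicted waypoints are widely dispersed (an uncertain scene) and relaxes otherwise. The specific mapping and the clipping bounds are illustrative assumptions, not DGNav's formula:

```python
import numpy as np

def adaptive_node_threshold(waypoints, base=1.0, t_min=0.25, t_max=2.0):
    # High waypoint dispersion -> smaller threshold -> denser topological map.
    wp = np.asarray(waypoints, float)          # (K, 2) predicted waypoints
    dispersion = wp.std(axis=0).mean()         # spread of the predictions
    scale = 1.0 / (1.0 + dispersion)
    return float(np.clip(base * scale, t_min, t_max))

print(adaptive_node_threshold([[0.10, 0.00], [0.20, 0.10], [0.15, 0.05]]))  # simple scene
print(adaptive_node_threshold([[2.0, -1.5], [-1.0, 2.2], [0.5, -2.0]]))     # uncertain scene
```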

[189] Synthetic-to-Real Domain Bridging for Single-View 3D Reconstruction of Ships for Maritime Monitoring

Borja Carrillo-Perez, Felix Sattler, Angel Bueno Rodriguez, Maurice Stephan, Sarah Barnes

Main category: cs.CV

TL;DR: Single-view 3D ship reconstruction pipeline using synthetic data and 3D Gaussian splatting, with real-world deployment through segmentation, scaling, and georeferencing.

DetailsMotivation: Current 3D reconstruction methods for maritime monitoring require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time deployment.

Method: Uses Splatter Image network (3D Gaussian representation) fine-tuned on synthetic ShapeNet vessels and custom ship dataset. Integrates YOLOv8 segmentation, preprocessing, postprocessing for real-world scaling/alignment, and georeferencing with AIS metadata.

Result: Strong reconstruction fidelity on synthetic validation data, successful transfer to real maritime images from ShipSG dataset, enabling interactive 3D inspection without real-world 3D annotations.

Conclusion: Provides efficient, scalable solution for maritime monitoring with potential for real-time 3D ship visualization in practical applications.

Abstract: Three-dimensional (3D) reconstruction of ships is an important part of maritime monitoring, allowing improved visualization, inspection, and decision-making in real-world monitoring environments. However, most state-of-the-art 3D reconstruction methods require multi-view supervision, annotated 3D ground truth, or are computationally intensive, making them impractical for real-time maritime deployment. In this work, we present an efficient pipeline for single-view 3D reconstruction of real ships by training entirely on synthetic data and requiring only a single view at inference. Our approach uses the Splatter Image network, which represents objects as sparse sets of 3D Gaussians for rapid and accurate reconstruction from single images. The model is first fine-tuned on synthetic ShapeNet vessels and further refined with a diverse custom dataset of 3D ships, bridging the domain gap between synthetic and real-world imagery. We integrate a state-of-the-art segmentation module based on YOLOv8 and custom preprocessing to ensure compatibility with the reconstruction network. Postprocessing steps include real-world scaling, centering, and orientation alignment, followed by georeferenced placement on an interactive web map using AIS metadata and homography-based mapping. Quantitative evaluation on synthetic validation data demonstrates strong reconstruction fidelity, while qualitative results on real maritime images from the ShipSG dataset confirm the potential for transfer to operational maritime settings. The final system provides interactive 3D inspection of real ships without requiring real-world 3D annotations. This pipeline provides an efficient, scalable solution for maritime monitoring and highlights a path toward real-time 3D ship visualization in practical applications. Interactive demo: https://dlr-mi.github.io/ship3d-demo/.

[190] CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

Junming Huang, Weiwei Xu

Main category: cs.CV

TL;DR: CG-MLLM is a multimodal LLM that performs both 3D captioning and high-resolution 3D generation using a Mixture-of-Transformer architecture with TokenAR and BlockAR transformers.

DetailsMotivation: LLMs have advanced text generation and multimodal perception, but 3D content generation remains limited, with existing methods producing either low-resolution meshes or coarse structural proxies that fail to capture fine-grained geometry.

Method: CG-MLLM uses a Mixture-of-Transformer architecture with two components: Token-level Autoregressive (TokenAR) Transformer for token-level content and Block-level Autoregressive (BlockAR) Transformer for block-level content. It integrates a pre-trained vision-language backbone with a specialized 3D VAE latent space to enable long-context interactions between standard tokens and spatial blocks.

Result: CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

Conclusion: The proposed CG-MLLM framework successfully addresses the limitations of existing methods by enabling both 3D captioning and high-resolution 3D generation within a single multimodal LLM architecture.

Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm.

[191] MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods

Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, Lijun Wu

Main category: cs.CV

TL;DR: MMFineReason is a large-scale multimodal reasoning dataset with 1.8M samples and 5.1B solution tokens, created to address the lack of high-quality reasoning data for open-source VLMs, featuring STEM diagrams, visual puzzles, and complex diagrams with Chain-of-Thought annotations.

DetailsMotivation: Open-source Vision Language Models (VLMs) lag behind proprietary systems due to insufficient high-quality reasoning data. Existing datasets lack coverage of challenging domains like STEM diagrams and visual puzzles, and lack consistent long-form Chain-of-Thought annotations needed for strong reasoning capabilities.

Method: Three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation using Qwen3-VL-235B-A22B-Thinking, (3) comprehensive selection based on reasoning quality and difficulty awareness. Fine-tuned Qwen3-VL-Instruct on MMFineReason to create MMFineReason-2B/4B/8B models.

Result: MMFineReason models establish new SOTA for their size class. MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking. Found “less is more” phenomenon: 7% subset (123K samples) achieves comparable performance to full dataset.

Conclusion: MMFineReason dataset effectively bridges the reasoning data gap for VLMs, enabling parameter-efficient models that outperform larger counterparts. The difficulty-aware filtering strategy reveals data efficiency insights, and reasoning-oriented data composition boosts general capabilities synergistically.

Abstract: Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source VLMs still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities. To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking. The dataset is established via a systematic three-stage pipeline: (1) large-scale data collection and standardization, (2) CoT rationale generation, and (3) comprehensive selection based on reasoning quality and difficulty awareness. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with visually grounded reasoning traces. We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B versions. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B successfully surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B even outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency. Crucially, we uncover a “less is more” phenomenon via our difficulty-aware filtering strategy: a subset of just 7% (123K samples) achieves performance comparable to the full dataset. Notably, we reveal a synergistic effect where reasoning-oriented data composition simultaneously boosts general capabilities.

[192] Trajectory-Guided Diffusion for Foreground-Preserving Background Generation in Multi-Layer Documents

Taewon Kang

Main category: cs.CV

TL;DR: A diffusion-based framework for document background generation that preserves foreground content and maintains multi-page stylistic consistency through latent-space trajectory design rather than explicit constraints.

DetailsMotivation: Current document background generation methods struggle with foreground preservation and multi-page stylistic consistency, often requiring explicit constraints or masking heuristics that can be cumbersome and unstable.

Method: Reinterprets diffusion as evolution of stochastic trajectories in structured latent space. Shapes initial noise and geometric alignment to avoid foreground regions naturally. Decouples style control from text conditioning using cached style directions as persistent vectors in latent space to constrain diffusion to shared stylistic subspace.

Result: Produces visually coherent, foreground-preserving results across complex documents without explicit constraints. Maintains stylistic consistency across pages and editing iterations without repeated prompt-based style specification.

Conclusion: By reframing diffusion as trajectory design in latent space, the method offers a principled, training-free approach to consistent and structured generative modeling for document backgrounds, compatible with existing diffusion backbones.

Abstract: We present a diffusion-based framework for document-centric background generation that achieves foreground preservation and multi-page stylistic consistency through latent-space design rather than explicit constraints. Instead of suppressing diffusion updates or applying masking heuristics, our approach reinterprets diffusion as the evolution of stochastic trajectories through a structured latent space. By shaping the initial noise and its geometric alignment, background generation naturally avoids designated foreground regions, allowing readable content to remain intact without auxiliary mechanisms. To address the long-standing issue of stylistic drift across pages, we decouple style control from text conditioning and introduce cached style directions as persistent vectors in latent space. Once selected, these directions constrain diffusion trajectories to a shared stylistic subspace, ensuring consistent appearance across pages and editing iterations. This formulation eliminates the need for repeated prompt-based style specification and provides a more stable foundation for multi-page generation. Our framework admits a geometric and physical interpretation, where diffusion paths evolve on a latent manifold shaped by preferred directions, and foreground regions are rarely traversed as a consequence of trajectory initialization rather than explicit exclusion. The proposed method is training-free, compatible with existing diffusion backbones, and produces visually coherent, foreground-preserving results across complex documents. By reframing diffusion as trajectory design in latent space, we offer a principled approach to consistent and structured generative modeling.
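A heavily simplified reading of the cached-style idea: a stored latent direction biases the initial noise of every page, so all pages start their diffusion trajectories in the same stylistic region. The additive form and the strength value are assumptions; the paper frames this geometrically as constraining trajectories to a shared stylistic subspace:

```python
import torch

def apply_cached_style(noise, style_direction, strength=1.5):
    # Nudge the initial latent noise along a cached, unit-norm style direction.
    d = style_direction / style_direction.norm()
    return noise + strength * d

style = torch.randn(4, 64, 64)                 # cached once, e.g. from a reference page
page_noises = [apply_cached_style(torch.randn(4, 64, 64), style) for _ in range(3)]
```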

[193] Improving Classifier-Free Guidance of Flow Matching via Manifold Projection

Jian-Feng Cai, Haixia Liu, Zhengyi Su, Chao Wang

Main category: cs.CV

TL;DR: A principled optimization interpretation of classifier-free guidance (CFG) in diffusion models, revealing CFG as an approximation of gradient flow and proposing a training-free manifold-constrained homotopy optimization method with Anderson Acceleration for improved generation quality and guidance scale robustness.

DetailsMotivation: Classifier-free guidance (CFG) is widely used for controllable generation in diffusion models but relies on heuristic linear extrapolation that is sensitive to guidance scale. The authors aim to provide a principled theoretical foundation for CFG and develop more stable, efficient methods.

Method: 1) Interpret CFG through optimization lens, showing velocity field corresponds to gradient of smoothed distance functions; 2) Reformulate CFG sampling as homotopy optimization with manifold constraint; 3) Implement manifold projection via incremental gradient descent; 4) Enhance with Anderson Acceleration for efficiency without extra model evaluations.

Result: Proposed methods are training-free and consistently improve generation fidelity, prompt alignment, and robustness to guidance scale. Validated across diverse benchmarks including large-scale models like DiT-XL-2-256, Flux, and Stable Diffusion 3.5.

Conclusion: The optimization perspective provides theoretical foundation for CFG, and the proposed manifold-constrained homotopy optimization with Anderson Acceleration offers practical improvements in diffusion model sampling without requiring retraining.

Abstract: Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
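For concreteness, the snippet below shows the standard CFG extrapolation the paper starts from, together with a generic Anderson-acceleration step for a fixed-point map. The Anderson routine is a textbook sketch, not the paper's sampler, and the cos(x) usage only demonstrates the mechanics:

```python
import torch

def cfg_velocity(v_cond, v_uncond, scale):
    # Standard classifier-free guidance: linear extrapolation of the
    # conditional and unconditional velocity fields.
    return v_uncond + scale * (v_cond - v_uncond)

def anderson_step(xs, gxs, reg=1e-8):
    # One Anderson-acceleration update for a fixed-point map g: combine the
    # stored iterates x_k and their images g(x_k) so the residual g(x) - x
    # shrinks faster, with no extra evaluations of g beyond the history.
    F = torch.stack([gx - x for x, gx in zip(xs, gxs)], dim=1)   # residual matrix
    ones = torch.ones(F.shape[1], 1)
    A = F.t() @ F + reg * torch.eye(F.shape[1])
    alpha = torch.linalg.solve(A, ones)
    alpha = alpha / alpha.sum()                                  # weights sum to 1
    return (torch.stack(gxs, dim=1) @ alpha).squeeze(1)

# toy usage: accelerate the fixed point of g(x) = cos(x)
g, x = torch.cos, torch.zeros(1)
xs, gxs = [x], [g(x)]
for _ in range(10):
    x = anderson_step(xs[-3:], gxs[-3:])
    xs.append(x); gxs.append(g(x))
```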

[194] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion

Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, Cheng Deng

Main category: cs.CV

TL;DR: PaFu-KV: A novel KV cache policy for autoregressive video generation that selectively retains important tokens using salience estimation, improving quality-efficiency trade-off for long-horizon video synthesis.

DetailsMotivation: Current autoregressive video generation methods use heuristic KV cache policies that ignore token importance differences, leading to loss of critical spatiotemporal information and accumulation of redundant cache, degrading video quality and efficiency.

Method: Proposes Past- and Future-Informed KV Cache Policy (PaFu-KV) with a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate token salience scores, allowing selective retention of informative tokens while discarding less relevant ones.

Result: Extensive experiments show the method preserves high-fidelity video generation quality while enabling accelerated inference and reduced memory footprint, achieving better quality-efficiency trade-off for long-horizon video generation.

Conclusion: PaFu-KV addresses limitations of heuristic KV cache policies in autoregressive video generation by intelligently managing cache based on token importance, enabling more efficient long-horizon video synthesis without sacrificing quality.

Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while enabling accelerated inference, thereby supporting more efficient long-horizon video generation. Our code will be released upon paper acceptance.
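A minimal sketch of salience-guided cache retention; the top-k rule and the random salience scores stand in for the distilled Salience Estimation Head and are assumptions for illustration:

```python
import torch

def prune_kv_cache(keys, values, salience, budget):
    # Keep only the `budget` most salient tokens' keys/values, preserving
    # temporal order among the survivors; drop the rest of the cache.
    keep = torch.topk(salience, k=min(budget, salience.numel())).indices.sort().values
    return keys[keep], values[keep]

# toy usage: retain the 512 most informative of 2048 cached tokens
T, H, D = 2048, 8, 64
k, v, s = torch.randn(T, H, D), torch.randn(T, H, D), torch.rand(T)
k_small, v_small = prune_kv_cache(k, v, s, budget=512)
print(k_small.shape)  # torch.Size([512, 8, 64])
```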

[195] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Hanmo Chen, Guangtao Lyu, Chenghao Xu, Jiexi Yan, Xu Yang, Cheng Deng

Main category: cs.CV

TL;DR: A novel Pyramidal Shapley-Taylor learning framework for fine-grained motion-language retrieval that captures local semantic details through progressive joint-wise and segment-wise alignment.

DetailsMotivation: Existing motion-language retrieval approaches focus on aligning entire motion sequences with global textual representations, overlooking fine-grained interactions between local motion segments/body joints and text tokens, leading to suboptimal performance.

Method: Proposes a Pyramidal Shapley-Taylor (PST) learning framework that decomposes human motion into temporal segments and spatial body joints, learning cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion inspired by human motion perception.

Result: Extensive experiments on multiple public benchmark datasets demonstrate significant outperformance over state-of-the-art methods, achieving precise alignment between motion segments/body joints and corresponding text tokens.

Conclusion: The framework effectively captures both local semantic details and hierarchical structural relationships in motion-language retrieval, bridging the semantic gap between natural language and human motion.

Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.

[196] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

Yunhao Li, Sijing Wu, Zhilin Gao, Zicheng Zhang, Qi Jia, Huiyu Duan, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: VideoAesBench: A comprehensive benchmark for evaluating large multimodal models’ video aesthetic quality assessment capabilities across diverse video types and multiple question formats.

DetailsMotivation: While LMMs excel at various visual perception tasks, their capability for video aesthetic quality assessment - a fundamental human ability - remains underexplored. There's a need for systematic evaluation of LMMs' understanding of video aesthetics.

Method: Created VideoAesBench with 1,804 videos from multiple sources (UGC, AIGC, compressed, RGC, game videos), multiple question formats (single-choice, multi-choice, True/False, open-ended), and holistic aesthetics dimensions covering visual form, style, and affectiveness.

Result: Benchmarked 23 open-source and commercial LMMs, finding that they possess only basic video aesthetics perception ability, with incomplete and imprecise performance.

Conclusion: Current LMMs have limited video aesthetic assessment capabilities. VideoAesBench serves as a strong testbed for explainable video aesthetics assessment and provides insights for future model development.

Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for humans, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs’ understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and a novel open-ended question format for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form-related questions from 5 aspects, visual style-related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs possess only basic video aesthetics perception ability, and their performance remains incomplete and imprecise. We hope our VideoAesBench can serve as a strong testbed and offer insights for explainable video aesthetics assessment.

[197] Zero-Shot Video Restoration and Enhancement with Assistance of Video Diffusion Models

Cong Cao, Huanjing Yue, Shangbin Xie, Xin Liu, Jingyu Yang

Main category: cs.CV

TL;DR: A training-free framework that uses video diffusion models to improve temporal consistency for zero-shot video restoration/enhancement, addressing flickering issues in image-based methods.

DetailsMotivation: Diffusion-based zero-shot image restoration/enhancement methods cause severe temporal flickering when applied to videos, creating a need for temporal consistency in video applications.

Method: Proposes homologous/heterogeneous latents fusion with a COT-based fusion ratio strategy, plus temporal-strengthening post-processing using image-to-video diffusion models to complement image methods.

Result: Experimental results demonstrate superiority in maintaining temporal consistency for zero-shot video restoration and enhancement tasks.

Conclusion: The framework successfully addresses temporal flickering in video applications while remaining training-free and compatible with any diffusion-based image restoration/enhancement methods.

Abstract: Although diffusion-based zero-shot image restoration and enhancement methods have achieved great success, applying them to video restoration or enhancement will lead to severe temporal flickering. In this paper, we propose the first framework that utilizes the rapidly-developed video diffusion model to assist the image-based method in maintaining more temporal consistency for zero-shot video restoration and enhancement. We propose homologous latents fusion, heterogeneous latents fusion, and a COT-based fusion ratio strategy to utilize both homologous and heterogeneous text-to-video diffusion models to complement the image method. Moreover, we propose temporal-strengthening post-processing to utilize the image-to-video diffusion model to further improve temporal consistency. Our method is training-free and can be applied to any diffusion-based image restoration and enhancement methods. Experimental results demonstrate the superiority of the proposed method.

[198] Just Noticeable Difference Modeling for Deep Visual Features

Rui Zhao, Wenrui Li, Lin Zhu, Yajing Zheng, Weisi Lin

Main category: cs.CV

TL;DR: FeatJND: A task-aligned just noticeable difference formulation for deep visual features that predicts maximum tolerable per-feature perturbations while preserving downstream task performance.

DetailsMotivation: Deep visual features are increasingly used as interfaces in vision systems, creating a need to describe feature characteristics and control feature quality for machine perception. Extending JND (just noticeable difference) to deep visual features provides task-aligned tolerance boundaries in feature space for practical quality control under constrained resources.

Method: Proposes FeatJND, a task-aligned JND formulation that predicts maximum tolerable per-feature perturbation maps while preserving downstream task performance. Develops a FeatJND estimator at standardized split points and validates across image classification, detection, and instance segmentation tasks.

Result: Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations. Attribution visualizations suggest FeatJND can suppress non-critical feature regions. Applied to token-wise dynamic quantization, FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under same noise budget.

Conclusion: FeatJND provides a practical framework for controlling deep visual feature quality while maintaining task performance, with applications in feature compression and efficient vision systems.

Abstract: Deep visual features are increasingly used as the interface in vision systems, motivating the need to describe feature characteristics and control feature quality for machine perception. Just noticeable difference (JND) characterizes the maximum imperceptible distortion for images under human or machine vision. Extending it to deep visual features naturally meets the above demand by providing a task-aligned tolerance boundary in feature space, offering a practical reference for controlling feature quality under constrained resources. We propose FeatJND, a task-aligned JND formulation that predicts the maximum tolerable per-feature perturbation map while preserving downstream task performance. We propose a FeatJND estimator at standardized split points and validate it across image classification, detection, and instance segmentation. Under matched distortion strength, FeatJND-based distortions consistently preserve higher task performance than unstructured Gaussian perturbations, and attribution visualizations suggest FeatJND can suppress non-critical feature regions. As an application, we further apply FeatJND to token-wise dynamic quantization and show that FeatJND-guided step-size allocation yields clear gains over random step-size permutation and global uniform step size under the same noise budget. Our code will be released after publication.
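A toy sketch of JND-guided step-size allocation: features with a larger predicted tolerance receive a coarser quantization step, so the injected error stays inside the per-feature budget. The linear step rule and the level count are assumptions, not the paper's scheme:

```python
import torch

def jnd_guided_quantize(features, jnd_map, levels=16):
    # Coarser step where the tolerable perturbation (FeatJND) is larger;
    # the rounding error per feature is at most step / 2 <= jnd_map.
    step = 2.0 * jnd_map / levels
    quantized = torch.round(features / step) * step
    assert torch.all((quantized - features).abs() <= jnd_map + 1e-6)
    return quantized

feats = torch.randn(4, 196, 768)              # e.g. intermediate ViT tokens
jnd = 0.05 + 0.10 * torch.rand_like(feats)    # hypothetical per-feature JND map
q = jnd_guided_quantize(feats, jnd)
```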

[199] BookNet: Book Image Rectification via Cross-Page Attention Network

Shaokai Liu, Hao Feng, Bozhi Luan, Min Hou, Jiajun Deng, Wengang Zhou

Main category: cs.CV

TL;DR: BookNet is the first end-to-end deep learning framework for dual-page book image rectification using a dual-branch architecture with cross-page attention to model coupled geometric relationships between adjacent book pages.

DetailsMotivation: Book image rectification has unique challenges due to complex geometric distortions from binding constraints, where left and right pages exhibit asymmetric curvature patterns. Existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books.

Method: BookNet adopts a dual-branch architecture with cross-page attention mechanisms to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. The authors also created Book3D (synthetic dataset) and Book100 (real-world benchmark) to address data scarcity.

Result: Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. The framework effectively handles the asymmetric curvature patterns in book spreads.

Conclusion: BookNet represents a significant advancement in document image processing by specifically addressing the unique challenges of book image rectification through explicit modeling of cross-page geometric relationships, with publicly available code and datasets.

Abstract: Book image rectification presents unique challenges in document image processing due to complex geometric distortions from binding constraints, where left and right pages exhibit distinctly asymmetric curvature patterns. However, existing single-page document image rectification methods fail to capture the coupled geometric relationships between adjacent pages in books. In this work, we introduce BookNet, the first end-to-end deep learning framework specifically designed for dual-page book image rectification. BookNet adopts a dual-branch architecture with cross-page attention mechanisms, enabling it to estimate warping flows for both individual pages and the complete book spread, explicitly modeling how left and right pages influence each other. Moreover, to address the absence of specialized datasets, we present Book3D, a large-scale synthetic dataset for training, and Book100, a comprehensive real-world benchmark for evaluation. Extensive experiments demonstrate that BookNet outperforms existing state-of-the-art methods on book image rectification. Code and dataset will be made publicly available.
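A small sketch of what a cross-page attention block could look like, with each page branch attending to the other before the warping flows are predicted; layer sizes and the residual form are illustrative rather than BookNet's actual design:

```python
import torch
import torch.nn as nn

class CrossPageAttention(nn.Module):
    # Left-page features query right-page features and vice versa, so each
    # branch can account for the coupled geometry around the spine.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.left_to_right = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.right_to_left = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, left, right):
        # left, right: (B, N_tokens, dim) features of the two page branches
        l, _ = self.left_to_right(left, right, right)
        r, _ = self.right_to_left(right, left, left)
        return left + l, right + r

blk = CrossPageAttention()
left, right = torch.randn(2, 1024, 256), torch.randn(2, 1024, 256)
left_out, right_out = blk(left, right)     # enriched features for flow prediction
```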

[200] Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding

Yang Du, Siyuan Dai, Yonghao Song, Paul M. Thompson, Haoteng Tang, Liang Zhan

Main category: cs.CV

TL;DR: Shallow Alignment improves neural visual decoding by aligning neural signals with intermediate rather than final layer representations of vision models, better capturing both low-level texture details and high-level semantic features.

DetailsMotivation: Existing neural visual decoding approaches suffer from a granularity mismatch: deep vision models suppress local texture information for semantic invariance, while neural signals preserve both low-level visual attributes and high-level semantic content.

Method: Proposes Shallow Alignment, a contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, balancing texture details and semantic features.

Result: Significantly outperforms standard final-layer alignment by 22-58% across diverse vision backbones, enables decoding performance to scale predictably with vision backbone capacity, and unlocks scaling laws in neural visual decoding.

Conclusion: Shallow Alignment effectively addresses the granularity mismatch in neural visual decoding by leveraging intermediate vision model representations, leading to substantial performance improvements and predictable scaling behavior.

Abstract: Neural visual decoding is a central problem in brain computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
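A compact sketch of aligning neural-signal embeddings with an intermediate vision-backbone layer through a symmetric InfoNCE loss; the mean-pooling, the temperature, and the choice of layer 6 are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def shallow_alignment_loss(neural_emb, layer_feats, layer_idx, temperature=0.07):
    # Contrast neural embeddings against an *intermediate* layer of the
    # vision encoder instead of its final output.
    img = layer_feats[layer_idx].mean(dim=1)          # pool patch tokens -> (B, D)
    z_n, z_i = F.normalize(neural_emb, dim=-1), F.normalize(img, dim=-1)
    logits = z_n @ z_i.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(len(z_n))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# toy usage: 12 layers of patch features, align with layer 6
feats = [torch.randn(16, 196, 512) for _ in range(12)]
neural = torch.randn(16, 512)                         # embedded neural signals
loss = shallow_alignment_loss(neural, feats, layer_idx=6)
```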

[201] PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing

Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Yi Liu, Dianhai Yu, Yanjun Ma

Main category: cs.CV

TL;DR: PaddleOCR-VL-1.5 achieves 94.5% SOTA on OmniDocBench v1.5, introduces Real5-OmniDocBench for robustness testing, adds seal recognition and text spotting, while maintaining 0.9B ultra-compact size.

DetailsMotivation: To advance document understanding in multimodal VLMs by improving robustness against real-world physical distortions and expanding capabilities beyond standard OCR tasks.

Method: Upgraded PaddleOCR-VL model with enhanced architecture, introduced Real5-OmniDocBench benchmark for testing robustness against scanning, skew, warping, screen-photography, and illumination distortions, and incorporated seal recognition and text spotting capabilities.

Result: Achieved 94.5% SOTA accuracy on OmniDocBench v1.5 and demonstrated strong performance on the new Real5-OmniDocBench benchmark, while maintaining a compact 0.9B parameter size.

Conclusion: PaddleOCR-VL-1.5 represents a significant advancement in document understanding VLMs with improved robustness, expanded capabilities, and efficient architecture suitable for practical applications.

Abstract: We introduce PaddleOCR-VL-1.5, an upgraded model achieving a new state-of-the-art (SOTA) accuracy of 94.5% on OmniDocBench v1.5. To rigorously evaluate robustness against real-world physical distortions, including scanning, skew, warping, screen-photography, and illumination, we propose the Real5-OmniDocBench benchmark. Experimental results demonstrate that this enhanced model attains SOTA performance on the newly curated benchmark. Furthermore, we extend the model’s capabilities by incorporating seal recognition and text spotting tasks, while remaining a 0.9B ultra-compact VLM with high efficiency. Code: https://github.com/PaddlePaddle/PaddleOCR

[202] Causal World Modeling for Robot Control

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, Yinghao Xu

Main category: cs.CV

TL;DR: LingBot-VA: An autoregressive diffusion framework for robot learning that combines video world modeling with vision-language pre-training to enable future frame prediction and policy execution simultaneously.

DetailsMotivation: Video world modeling provides the ability to imagine the near future by understanding causality between actions and visual dynamics, establishing a fresh foundation for robot learning alongside vision-language pre-training.

Method: Autoregressive diffusion framework with three key designs: (1) shared latent space integrating vision and action tokens using Mixture-of-Transformers architecture, (2) closed-loop rollout mechanism for ongoing environmental feedback, (3) asynchronous inference pipeline parallelizing action prediction and motor execution.

Result: Model shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations across both simulation benchmarks and real-world scenarios.

Conclusion: Video world modeling combined with vision-language pre-training establishes an independent foundation for robot learning, with LingBot-VA demonstrating practical effectiveness through its carefully designed architecture.

Abstract: This work highlights that video world modeling, alongside vision-language pre-training, establishes a fresh and independent foundation for robot learning. Intuitively, video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. Inspired by this, we introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. Our model features three carefully crafted designs: (1) a shared latent space, integrating vision and action tokens, driven by a Mixture-of-Transformers (MoT) architecture, (2) a closed-loop rollout mechanism, allowing for ongoing acquisition of environmental feedback with ground-truth observations, (3) an asynchronous inference pipeline, parallelizing action prediction and motor execution to support efficient control. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations. The code and model are made publicly available to facilitate the community.

[203] Drive-JEPA: Video JEPA Meets Multimodal Trajectory Distillation for End-to-End Driving

Linhan Wang, Zichong Yang, Chen Bai, Guoxiang Zhang, Xiaotong Liu, Xiaoyin Zheng, Xiao-Xiao Long, Chang-Tien Lu, Cheng Lu

Main category: cs.CV

TL;DR: Drive-JEPA integrates video pretraining (V-JEPA) with multimodal trajectory distillation for end-to-end autonomous driving, achieving state-of-the-art performance on NAVSIM benchmarks.

DetailsMotivation: Current self-supervised video pretraining for autonomous driving has limited improvements due to inherent ambiguity in driving scenes (single human trajectory per scene) and difficulty learning multimodal behaviors.

Method: 1) Adapt V-JEPA for end-to-end driving by pretraining ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. 2) Introduce proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories with momentum-aware selection for stable/safe behavior.

Result: V-JEPA representation with simple transformer decoder outperforms prior methods by 3 PDMS in perception-free setting. Complete Drive-JEPA achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting new state-of-the-art on NAVSIM.

Conclusion: Drive-JEPA effectively combines self-supervised video pretraining with multimodal trajectory distillation to address driving ambiguity and achieve superior planning performance.

Abstract: End-to-end autonomous driving increasingly leverages self-supervised video pretraining to learn transferable planning representations. However, pretraining video world models for scene understanding has so far brought only limited improvements. This limitation is compounded by the inherent ambiguity of driving: each scene typically provides only a single human trajectory, making it difficult to learn multimodal behaviors. In this work, we propose Drive-JEPA, a framework that integrates Video Joint-Embedding Predictive Architecture (V-JEPA) with multimodal trajectory distillation for end-to-end driving. First, we adapt V-JEPA for end-to-end driving, pretraining a ViT encoder on large-scale driving videos to produce predictive representations aligned with trajectory planning. Second, we introduce a proposal-centric planner that distills diverse simulator-generated trajectories alongside human trajectories, with a momentum-aware selection mechanism to promote stable and safe behavior. When evaluated on NAVSIM, the V-JEPA representation combined with a simple transformer-based decoder outperforms prior methods by 3 PDMS in the perception-free setting. The complete Drive-JEPA framework achieves 93.3 PDMS on v1 and 87.8 EPDMS on v2, setting a new state-of-the-art.
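The momentum-aware selection is only named in the abstract; one plausible reading, sketched below, blends each proposal's planner score with its consistency to the previously executed trajectory so the chosen plan does not flip abruptly between frames. The drift metric and the blend weight beta are assumptions:

```python
import numpy as np

def momentum_aware_select(proposals, scores, prev_traj, beta=0.7):
    # proposals: (K, T, 2) candidate trajectories; scores: (K,) planner scores;
    # prev_traj: (T, 2) previously executed trajectory.
    proposals = np.asarray(proposals, float)
    scores = np.asarray(scores, float)
    drift = np.linalg.norm(proposals - prev_traj[None], axis=(1, 2))
    consistency = np.exp(-drift / (drift.mean() + 1e-8))   # close to previous plan -> high
    blended = beta * scores + (1 - beta) * consistency
    return int(blended.argmax())

prev = np.zeros((8, 2))
cands = np.random.randn(5, 8, 2)
idx = momentum_aware_select(cands, np.random.rand(5), prev)
```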

[204] Understanding Multimodal Complementarity for Single-Frame Action Anticipation

Manuel Benavent-Lledo, Konstantinos Bacharidis, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez

Main category: cs.CV

TL;DR: AAG+ improves single-frame action anticipation by combining RGB, depth, and semantic past-action information through multimodal fusion, achieving performance comparable to video-based methods on challenging benchmarks.

DetailsMotivation: The paper challenges the assumption that dense temporal video information is necessary for action anticipation, investigating how much future information is encoded in a single frame and how to effectively exploit it.

Method: Systematic investigation of single-frame action anticipation with multimodal fusion of RGB appearance, depth-based geometric cues, and semantic past-action representations, exploring different fusion strategies, keyframe selection policies, and past-action history sources.

Result: AAG+ consistently improves upon original AAG and achieves performance comparable to or exceeding state-of-the-art video-based methods on IKEA-ASM, Meccano and Assembly101 benchmarks.

Conclusion: Single-frame action anticipation can be highly effective when carefully designed with multimodal information fusion, clarifying when dense temporal modeling is necessary versus when a well-selected glimpse suffices.

Abstract: Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
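
As a rough illustration of the multimodal fusion idea, the sketch below concatenates single-frame RGB, depth, and past-action embeddings and classifies the next action; the module names, feature dimensions, and class count are assumptions, not the authors' AAG+ design.

```python
# A minimal late-fusion sketch for single-frame anticipation (illustrative, not AAG+ itself).
import torch
import torch.nn as nn

class LateFusionAnticipator(nn.Module):
    def __init__(self, d_rgb=768, d_depth=256, d_action=128, n_classes=33):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_rgb + d_depth + d_action, 512), nn.ReLU(),
            nn.Linear(512, n_classes),
        )

    def forward(self, f_rgb, f_depth, f_action):
        # Concatenate per-frame RGB, depth, and past-action embeddings, then predict
        # the next action from this single-frame multimodal representation.
        return self.head(torch.cat([f_rgb, f_depth, f_action], dim=-1))

model = LateFusionAnticipator()
logits = model(torch.randn(4, 768), torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 33])
```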

[205] Urban Neural Surface Reconstruction from Constrained Sparse Aerial Imagery with 3D SAR Fusion

Da Li, Chen Yao, Tong Mao, Jiacheng Bao, Houjun Sun

Main category: cs.CV

TL;DR: Urban neural surface reconstruction framework fusing 3D SAR point clouds with aerial imagery for robust reconstruction under sparse-view conditions.

DetailsMotivation: Existing neural surface reconstruction methods suffer from geometric ambiguity and instability under sparse-view conditions, which is critical for large-scale urban remote sensing where aerial image acquisition is limited by flight paths, terrain, and cost.

Method: Proposes the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery. Integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable optimization.

Result: Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions.

Conclusion: The framework highlights a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing, demonstrating the value of cross-modal fusion for 3D reconstruction.

Abstract: Neural surface reconstruction (NSR) has recently shown strong potential for urban 3D reconstruction from multi-view aerial imagery. However, existing NSR methods often suffer from geometric ambiguity and instability, particularly under sparse-view conditions. This issue is critical in large-scale urban remote sensing, where aerial image acquisition is limited by flight paths, terrain, and cost. To address this challenge, we present the first urban NSR framework that fuses 3D synthetic aperture radar (SAR) point clouds with aerial imagery for high-fidelity reconstruction under constrained, sparse-view settings. 3D SAR can efficiently capture large-scale geometry even from a single side-looking flight path, providing robust priors that complement photometric cues from images. Our framework integrates radar-derived spatial constraints into an SDF-based NSR backbone, guiding structure-aware ray selection and adaptive sampling for stable and efficient optimization. We also construct the first benchmark dataset with co-registered 3D SAR point clouds and aerial imagery, facilitating systematic evaluation of cross-modal 3D reconstruction. Extensive experiments show that incorporating 3D SAR markedly enhances reconstruction accuracy, completeness, and robustness compared with single-modality baselines under highly sparse and oblique-view conditions, highlighting a viable route toward scalable high-fidelity urban reconstruction with advanced airborne and spaceborne optical-SAR sensing.
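
A minimal sketch of how radar-derived spatial constraints could enter an SDF-based reconstruction loss, assuming the 3D SAR points are treated as noisy surface samples whose predicted signed distance should be near zero; the truncation band and loss form are illustrative, not the paper's exact formulation.

```python
# Illustrative radar-point constraint on an SDF network: on-surface points should map to ~0.
import torch

def sar_surface_loss(sdf_net, sar_points, truncation=0.1):
    """sdf_net: callable mapping (N, 3) points to (N,) signed distances.
    sar_points: (N, 3) radar point cloud in world coordinates."""
    sdf = sdf_net(sar_points)
    # Clamp to a truncation band so isolated radar outliers do not dominate the loss.
    return sdf.clamp(-truncation, truncation).abs().mean()

toy_sdf = lambda p: p.norm(dim=-1) - 1.0                          # unit sphere as a stand-in SDF
pts = torch.nn.functional.normalize(torch.randn(64, 3), dim=-1)   # points lying on that sphere
print(sar_surface_loss(toy_sdf, pts))                             # ~0 for on-surface points
```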

[206] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, Mulin Yu

Main category: cs.CV

TL;DR: PLANING is an efficient streaming reconstruction framework that uses hybrid representation (geometric primitives + neural Gaussians) for decoupled geometry/appearance modeling, enabling fast, high-quality scene reconstruction from monocular videos.

DetailsMotivation: Existing streaming reconstruction methods struggle to achieve both high-quality rendering and accurate geometry simultaneously, often favoring one over the other. There's a need for efficient on-the-fly reconstruction that maintains both quality aspects.

Method: Uses hybrid representation loosely coupling explicit geometric primitives with neural Gaussians, enabling decoupled geometry and appearance modeling. Features online initialization and optimization strategy separating geometry and appearance updates to reduce structural redundancy.

Result: Improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, reconstructs ScanNetV2 scenes in under 100 seconds (5x faster than 2D Gaussian Splatting), while matching offline per-scene optimization quality.

Conclusion: PLANING provides efficient streaming reconstruction with both high-quality geometry and appearance, suitable for downstream applications like large-scale scene modeling and simulation-ready environments for embodied AI.

Abstract: Streaming reconstruction from monocular image sequences remains challenging, as existing methods typically favor either high-quality rendering or accurate geometry, but rarely both. We present PLANING, an efficient on-the-fly reconstruction framework built on a hybrid representation that loosely couples explicit geometric primitives with neural Gaussians, enabling geometry and appearance to be modeled in a decoupled manner. This decoupling supports an online initialization and optimization strategy that separates geometry and appearance updates, yielding stable streaming reconstruction with substantially reduced structural redundancy. PLANING improves dense mesh Chamfer-L2 by 18.52% over PGSR, surpasses ARTDECO by 1.31 dB PSNR, and reconstructs ScanNetV2 scenes in under 100 seconds, over 5x faster than 2D Gaussian Splatting, while matching the quality of offline per-scene optimization. Beyond reconstruction quality, the structural clarity and computational efficiency of PLANING make it well suited for a broad range of downstream applications, such as enabling large-scale scene modeling and simulation-ready environments for embodied AI. Project page: https://city-super.github.io/PLANING/ .

[207] MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen

Main category: cs.CV

TL;DR: Metric Anything is a scalable pretraining framework for metric depth estimation that learns from diverse 3D data sources using sparse metric prompts, achieving strong performance on various depth-related tasks and boosting MLLM spatial intelligence.

DetailsMotivation: Current vision foundation models benefit from scaling, but extending this to metric depth estimation is challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in cross-source 3D data. The paper aims to establish scaling laws for metric depth estimation similar to other foundation models.

Method: Introduces Metric Anything framework with Sparse Metric Prompts created by randomly masking depth maps, serving as a universal interface that decouples spatial reasoning from sensor/camera biases. Trained on ~20M image-depth pairs from reconstructed, captured, and rendered 3D data across 10,000 camera models.

Result: Demonstrates first clear scaling trend in metric depth track. Excels at prompt-driven tasks (depth completion, super-resolution, Radar-camera fusion). Distilled prompt-free student achieves SOTA on monocular depth estimation, camera intrinsics recovery, 3D reconstruction, and VLA planning. Pretrained ViT boosts MLLM spatial intelligence.

Conclusion: Metric depth estimation can benefit from scaling laws like other foundation models, establishing a path toward scalable real-world metric perception. The framework enables efficient metric understanding without manual engineering or task-specific architectures.

Abstract: Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend in the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution and Radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts Multimodal Large Language Model capabilities in spatial intelligence. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
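
A minimal sketch of constructing a Sparse Metric Prompt by randomly masking a metric depth map, as described above; the keep ratio and masking layout are illustrative assumptions.

```python
# Illustrative construction of a sparse metric prompt from a dense metric depth map.
import numpy as np

def sparse_metric_prompt(depth, keep_ratio=0.01, rng=None):
    """depth: (H, W) metric depth in meters; returns (prompt, mask) where prompt keeps
    only a random subset of valid depth values and is zero elsewhere."""
    rng = rng if rng is not None else np.random.default_rng(0)
    valid = depth > 0
    keep = valid & (rng.random(depth.shape) < keep_ratio)
    prompt = np.where(keep, depth, 0.0).astype(np.float32)
    return prompt, keep

depth = np.random.uniform(0.5, 80.0, size=(480, 640)).astype(np.float32)
prompt, mask = sparse_metric_prompt(depth)
print(mask.mean())  # roughly the keep ratio
```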

[208] Unsupervised Decomposition and Recombination with Discriminator-Driven Diffusion Models

Archer Wang, Emile Anand, Yilun Du, Marin Soljačić

Main category: cs.CV

TL;DR: Adversarial training method for learning factorized latent spaces in diffusion models without supervision, enabling component recombination for image and robotic video generation with improved disentanglement and generation quality.

DetailsMotivation: To enable decomposition of complex data into reusable components and synthesis of new samples via component recombination, particularly for images (background, illumination, object attributes) and robotic videos (motion components), without requiring factor-level supervision.

Method: Introduces adversarial training via a discriminator that distinguishes between single-source samples and those generated by recombining factors across sources. The generator is optimized to fool this discriminator, encouraging physical and semantic consistency in recombinations.

Result: Outperforms prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D with lower FID scores and better disentanglement (MIG and MCC). Demonstrates novel application to robotic video trajectories where recombining learned action components generates diverse sequences that increase state-space coverage for exploration on LIBERO benchmark.

Conclusion: The adversarial training approach effectively improves both latent factor discovery and compositional generation quality across multiple domains, with promising applications in robotic video generation for exploration.

Abstract: Decomposing complex data into factorized representations can reveal reusable components and enable synthesizing new samples via component recombination. We investigate this in the context of diffusion-based models that learn factorized latent spaces without factor-level supervision. In images, factors can capture background, illumination, and object attributes; in robotic videos, they can capture reusable motion components. To improve both latent factor discovery and quality of compositional generation, we introduce an adversarial training signal via a discriminator trained to distinguish between single-source samples and those generated by recombining factors across sources. By optimizing the generator to fool this discriminator, we encourage physical and semantic consistency in the resulting recombinations. Our method outperforms implementations of prior baselines on CelebA-HQ, Virtual KITTI, CLEVR, and Falcor3D, achieving lower FID scores and better disentanglement as measured by MIG and MCC. Furthermore, we demonstrate a novel application to robotic video trajectories: by recombining learned action components, we generate diverse sequences that significantly increase state-space coverage for exploration on the LIBERO benchmark.
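
A minimal sketch of the adversarial objective: the discriminator separates single-source samples from recombined ones, and the generator side is rewarded for fooling it. The toy discriminator and data shapes are placeholders, not the paper's architecture.

```python
# Illustrative adversarial losses for single-source vs. recombined-factor samples.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, x_single, x_recombined):
    real = disc(x_single)          # logits for samples generated from one source's factors
    fake = disc(x_recombined)      # logits for samples whose factors are mixed across sources
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_adv_loss(disc, x_recombined):
    fake = disc(x_recombined)
    # Encourage recombinations the discriminator cannot tell apart from single-source samples.
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))

disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 1))
x_a, x_b = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
print(discriminator_loss(disc, x_a, x_b), generator_adv_loss(disc, x_b))
```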

[209] Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

Wenxuan Huang, Yu Zeng, Qiuchen Wang, Zhen Fang, Shaosheng Cao, Zheng Chu, Qingyu Yin, Shuang Chen, Zhenfei Yin, Lin Chen, Zehui Chen, Yao Hu, Philip Torr, Feng Zhao, Wanli Ouyang

Main category: cs.CV

TL;DR: Vision-DeepResearch introduces a multimodal deep-research paradigm for MLLMs that performs multi-turn, multi-entity, multi-scale visual/textual search to handle real-world noisy scenarios, outperforming existing methods and closed-source models.

DetailsMotivation: Current MLLMs have limited world knowledge and use naive multimodal search approaches that assume single queries suffice for retrieval, which is unrealistic in noisy real-world scenarios. They also lack reasoning depth and search breadth for complex questions requiring evidence aggregation from diverse sources.

Method: Proposes Vision-DeepResearch with multi-turn, multi-entity, multi-scale visual and textual search paradigm. Uses cold-start supervision and RL training to internalize deep-research capabilities into MLLMs, supporting dozens of reasoning steps and hundreds of engine interactions.

Result: Substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models like GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet.

Conclusion: Vision-DeepResearch provides a robust multimodal deep-research paradigm that effectively handles real-world noisy scenarios and complex questions requiring extensive evidence aggregation from diverse visual and textual sources.

Abstract: Multimodal large language models (MLLMs) have achieved remarkable success across a broad range of vision tasks. However, constrained by the capacity of their internal world knowledge, prior work has proposed augmenting MLLMs by “reasoning-then-tool-call” for visual and textual search engines to obtain substantial gains on tasks requiring extensive factual information. However, these approaches typically define multimodal search in a naive setting, assuming that a single full-level or entity-level image query and a few text queries suffice to retrieve the key evidence needed to answer the question, which is unrealistic in real-world scenarios with substantial visual noise. Moreover, they are often limited in reasoning depth and search breadth, making it difficult to solve complex questions that require aggregating evidence from diverse visual and textual sources. Building on this, we propose Vision-DeepResearch, a new multimodal deep-research paradigm that performs multi-turn, multi-entity, and multi-scale visual and textual search to robustly query real-world search engines under heavy noise. Our Vision-DeepResearch supports dozens of reasoning steps and hundreds of engine interactions, while internalizing deep-research capabilities into the MLLM via cold-start supervision and RL training, resulting in a strong end-to-end multimodal deep-research MLLM. It substantially outperforms existing multimodal deep-research MLLMs and workflows built on strong closed-source foundation models such as GPT-5, Gemini-2.5-pro, and Claude-4-Sonnet. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
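
A schematic sketch of a multi-turn deep-research loop of the kind described above, alternating reasoning with image and text search calls; the tool interfaces, action format, and stopping rule are placeholders rather than the paper's training setup.

```python
# Schematic multi-turn research loop; mllm, image_search, and text_search are caller-provided
# placeholder objects, not real APIs.
def deep_research(question, image, mllm, image_search, text_search, max_steps=30):
    context = [("question", question), ("image", image)]
    for _ in range(max_steps):
        action = mllm.decide(context)            # e.g. {"tool": "image_search", "query": crop}
        if action["tool"] == "answer":
            return action["content"]
        elif action["tool"] == "image_search":
            context.append(("observation", image_search(action["query"])))
        elif action["tool"] == "text_search":
            context.append(("observation", text_search(action["query"])))
    # Force a final answer if the step budget runs out.
    return mllm.decide(context + [("force_answer", True)])["content"]
```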

[210] BLO-Inst: Bi-Level Optimization Based Alignment of YOLO and SAM for Robust Instance Segmentation

Li Zhang, Pengtao Xie

Main category: cs.CV

TL;DR: BLO-Inst is a bi-level optimization framework that aligns object detection with SAM segmentation by training detectors to generate segmentation-aware prompts rather than just localization boxes.

DetailsMotivation: SAM requires manual prompts for segmentation, limiting automation. Current detector-based prompt generation suffers from objective mismatch (detectors optimize localization, not segmentation prompting) and alignment overfitting in joint training.

Method: Bi-level optimization with disjoint data splits: lower level fine-tunes SAM on detection proposals from subset D1; upper level updates detector to minimize validation loss of fine-tuned SAM on separate subset D2, making detector segmentation-aware.

Result: BLO-Inst outperforms standard baselines on general and biomedical domain tasks, achieving superior segmentation performance by optimizing detectors for downstream mask quality.

Conclusion: The framework successfully bridges detection and segmentation objectives, transforming detectors into segmentation-aware prompt generators for fully automated SAM deployment.

Abstract: The Segment Anything Model has revolutionized image segmentation with its zero-shot capabilities, yet its reliance on manual prompts hinders fully automated deployment. While integrating object detectors as prompt generators offers a pathway to automation, existing pipelines suffer from two fundamental limitations: objective mismatch, where detectors optimized for geometric localization do not correspond to the optimal prompting context required by SAM, and alignment overfitting in standard joint training, where the detector simply memorizes specific prompt adjustments for training samples rather than learning a generalizable policy. To bridge this gap, we introduce BLO-Inst, a unified framework that aligns detection and segmentation objectives by bi-level optimization. We formulate the alignment as a nested optimization problem over disjoint data splits. In the lower level, the SAM is fine-tuned to maximize segmentation fidelity given the current detection proposals on a subset ($D_1$). In the upper level, the detector is updated to generate bounding boxes that explicitly minimize the validation loss of the fine-tuned SAM on a separate subset ($D_2$). This effectively transforms the detector into a segmentation-aware prompt generator, optimizing the bounding boxes not just for localization accuracy, but for downstream mask quality. Extensive experiments demonstrate that BLO-Inst achieves superior performance, outperforming standard baselines on tasks in general and biomedical domains.
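
A toy-scale sketch of the alternating updates over disjoint splits: the lower level fine-tunes the segmentation head on D1 given the detector's proposals, and the upper level updates the detector against the fine-tuned head's loss on D2, so boxes are optimized for mask quality rather than localization alone. This first-order variant omits differentiating through the lower-level solution; all modules, losses, and data are toy placeholders.

```python
# Simplified first-order sketch of the bi-level training loop (toy modules and data).
import torch
import torch.nn as nn

detector = nn.Linear(16, 4)            # image features -> box prompt (toy)
sam_head = nn.Linear(16 + 4, 8)        # features + box prompt -> mask logits (toy)
opt_sam = torch.optim.SGD(sam_head.parameters(), lr=1e-2)
opt_det = torch.optim.SGD(detector.parameters(), lr=1e-3)

def seg_loss(feats, boxes, masks):
    return nn.functional.mse_loss(sam_head(torch.cat([feats, boxes], -1)), masks)

for _ in range(3):
    f1, m1 = torch.randn(8, 16), torch.randn(8, 8)   # split D1 (lower level)
    f2, m2 = torch.randn(8, 16), torch.randn(8, 8)   # split D2 (upper level)

    # Lower level: fine-tune the segmentation head on D1 given current detector proposals.
    opt_sam.zero_grad()
    seg_loss(f1, detector(f1).detach(), m1).backward()
    opt_sam.step()

    # Upper level: update the detector so the fine-tuned head's D2 loss decreases.
    opt_det.zero_grad()
    seg_loss(f2, detector(f2), m2).backward()
    opt_det.step()
```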

[211] RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang

Main category: cs.CV

TL;DR: A 3D asset-referenced diffusion model that generates images consistent with 3D assets using multi-view RGB images and point maps, enabling precise 2D-3D alignment.

DetailsMotivation: Existing reference-based image generation methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility in content creation.

Method: Cross-domain diffusion model with dual-branch perception that uses multi-view RGB images and point maps of 3D assets to jointly model colors and canonical-space coordinates. Features spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism.

Result: The approach effectively uses 3D assets as references to produce images consistent with the given assets, achieving precise consistency between generated images and 3D references.

Conclusion: The method opens new possibilities for combining diffusion models with 3D content creation by bridging 2D image generation with 3D asset references.

Abstract: In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

[212] SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi

Main category: cs.CV

TL;DR: SINA is an automated circuit schematic image-to-netlist generator using deep learning, CCL, OCR, and Vision-Language Models for component detection and connectivity extraction.

DetailsMotivation: Current methods struggle with converting circuit schematic images into machine-readable netlists, particularly with component recognition and connectivity inference.

Method: Integrates deep learning for component detection, Connected-Component Labeling for connectivity extraction, OCR for component reference designator retrieval, and Vision-Language Model for reference designator assignments.

Result: Achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.

Conclusion: SINA provides an open-source, fully automated solution that significantly outperforms existing methods for circuit schematic image-to-netlist conversion.

Abstract: Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
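
A minimal sketch of the connectivity-extraction step: once component symbols are detected and erased, Connected-Component Labeling groups the remaining wire pixels into nets. The thresholding choice and the masking step are illustrative assumptions.

```python
# Illustrative wire-net extraction via OpenCV connected-component labeling.
import cv2
import numpy as np

def extract_nets(schematic_gray, component_boxes):
    # Binarize dark ink on a light background.
    _, binary = cv2.threshold(schematic_gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Erase detected component symbols so only the wiring remains.
    for (x1, y1, x2, y2) in component_boxes:
        binary[y1:y2, x1:x2] = 0
    # Each connected component of wire pixels corresponds to one electrical net.
    num_labels, labels = cv2.connectedComponents(binary)
    return num_labels - 1, labels  # minus one for the background label

img = np.full((200, 200), 255, np.uint8)
cv2.line(img, (10, 100), (190, 100), 0, 2)   # one horizontal wire
print(extract_nets(img, [])[0])              # -> 1 net
```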

[213] Creative Image Generation with Diffusion Model

Kunpeng Song, Ahmed Elgammal

Main category: cs.CV

TL;DR: A novel framework for creative image generation using diffusion models that defines creativity as the inverse probability of an image’s existence in CLIP embedding space, driving generation toward low-probability regions to produce rare and imaginative outputs.

DetailsMotivation: The need to produce novel and high-quality images that expand the boundaries of imagination, moving beyond manual concept blending or subcategory exclusion approaches in creative generation.

Method: Proposes a framework that calculates probability distribution of generated images in CLIP embedding space and drives generation toward low-probability regions. Introduces pullback mechanisms to maintain visual fidelity while achieving high creativity.

Result: Extensive experiments on text-to-image diffusion models demonstrate the framework’s effectiveness and efficiency in producing unique, novel, and thought-provoking images.

Conclusion: Provides a new perspective on creativity in generative models with a principled method to foster innovation in visual content synthesis.

Abstract: Creative image generation has emerged as a compelling area of research, driven by the need to produce novel and high-quality images that expand the boundaries of imagination. In this work, we propose a novel framework for creative generation using diffusion models, where creativity is associated with the inverse probability of an image’s existence in the CLIP embedding space. Unlike prior approaches that rely on a manual blending of concepts or exclusion of subcategories, our method calculates the probability distribution of generated images and drives it towards low-probability regions to produce rare, imaginative, and visually captivating outputs. We also introduce pullback mechanisms, achieving high creativity without sacrificing visual fidelity. Extensive experiments on text-to-image diffusion models demonstrate the effectiveness and efficiency of our creative generation framework, showcasing its ability to produce unique, novel, and thought-provoking images. This work provides a new perspective on creativity in generative models, offering a principled method to foster innovation in visual content synthesis.
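
As a simplified illustration of scoring creativity by inverse probability in an embedding space, the sketch below fits a Gaussian to embeddings of ordinary images and scores a candidate by its negative log-density; the paper's actual probability model and generation-time steering are not reproduced here.

```python
# Illustrative low-probability ("creativity") scoring over embedding vectors.
import numpy as np
from scipy.stats import multivariate_normal

def fit_density(ref_embeddings):                       # (N, D) embeddings of typical images
    mu = ref_embeddings.mean(axis=0)
    cov = np.cov(ref_embeddings, rowvar=False) + 1e-3 * np.eye(ref_embeddings.shape[1])
    return multivariate_normal(mean=mu, cov=cov)

def creativity_score(density, emb):
    return -density.logpdf(emb)                        # rarer under the density -> higher score

rng = np.random.default_rng(0)
refs = rng.normal(size=(1000, 16))                     # toy stand-ins for CLIP-style embeddings
density = fit_density(refs)
print(creativity_score(density, rng.normal(size=16)), creativity_score(density, refs[0]))
```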

[214] Early and Prediagnostic Detection of Pancreatic Cancer from Computed Tomography

Wenxuan Li, Pedro R. A. S. Bassi, Lizhou Wu, Xinze Zhou, Yuxuan Zhao, Qi Chen, Szymon Plotka, Tianyu Lin, Zheren Zhu, Marisa Martin, Justin Caskey, Shanshan Jiang, Xiaoxi Chen, Jaroslaw B. Ćwikla, Artur Sankowski, Yaping Wu, Sergio Decherchi, Andrea Cavalli, Chandana Lall, Cristian Tomasetti, Yaxing Guo, Xuan Yu, Yuqing Cai, Hualin Qiao, Jie Bao, Chenhan Hu, Ximing Wang, Arkadiusz Sitek, Kai Ding, Heng Li, Meiyun Wang, Dexin Yu, Guang Zhang, Yang Yang, Kang Wang, Alan L. Yuille, Zongwei Zhou

Main category: cs.CV

TL;DR: AI system (ePAI) for early pancreatic cancer detection on CT scans achieves high accuracy in detecting small lesions, outperforming radiologists and identifying cancers up to 3 years before clinical diagnosis.

DetailsMotivation: Pancreatic cancer is often detected too late for effective treatment. Radiologists frequently miss early lesions on CT scans even when reviewing them retrospectively. There's a need for automated systems to help detect these lesions earlier.

Method: Developed ePAI (early Pancreatic cancer detection with Artificial Intelligence) trained on data from 1,598 patients from a single medical center. Tested internally on 1,009 patients and externally on 7,158 patients across 6 centers. Conducted multi-reader study comparing ePAI with 30 board-certified radiologists.

Result: Internal test: AUC 0.939-0.999, sensitivity 95.3%, specificity 98.7% for detecting PDAC <2cm, localizing lesions as small as 2mm. External test: AUC 0.918-0.945, sensitivity 91.5%, specificity 88.0%, localizing lesions as small as 5mm. Detected PDACs on prediagnostic CT scans 3-36 months before clinical diagnosis that were originally missed by radiologists. Outperformed 30 radiologists by 50.3% in sensitivity while maintaining comparable specificity.

Conclusion: ePAI shows strong potential as an assistive tool for early pancreatic cancer detection, significantly outperforming human radiologists in sensitivity and detecting cancers years before clinical diagnosis.

Abstract: Pancreatic ductal adenocarcinoma (PDAC), one of the deadliest solid malignancies, is often detected at a late and inoperable stage. Retrospective reviews of prediagnostic CT scans, when conducted by expert radiologists aware that the patient later developed PDAC, frequently reveal lesions that were previously overlooked. To help detect these lesions earlier, we developed an automated system named ePAI (early Pancreatic cancer detection with Artificial Intelligence). It was trained on data from 1,598 patients from a single medical center. In the internal test involving 1,009 patients, ePAI achieved an area under the receiver operating characteristic curve (AUC) of 0.939-0.999, a sensitivity of 95.3%, and a specificity of 98.7% for detecting small PDAC less than 2 cm in diameter, precisely localizing PDAC as small as 2 mm. In an external test involving 7,158 patients across 6 centers, ePAI achieved an AUC of 0.918-0.945, a sensitivity of 91.5%, and a specificity of 88.0%, precisely localizing PDAC as small as 5 mm. Importantly, ePAI detected PDACs on prediagnostic CT scans obtained 3 to 36 months before clinical diagnosis that had originally been overlooked by radiologists. It successfully detected and localized PDACs in 75 of 159 patients, with a median lead time of 347 days before clinical diagnosis. Our multi-reader study showed that ePAI significantly outperformed 30 board-certified radiologists by 50.3% (P < 0.05) in sensitivity while maintaining a comparable specificity of 95.4% in detecting early and prediagnostic PDACs. These findings suggest the potential of ePAI as an assistive tool to improve early detection of pancreatic cancer.

[215] PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Zhexin Liang, Zhaoxi Chen, Yongwei Chen, Tianyi Wei, Tengfei Wang, Xingang Pan

Main category: cs.CV

TL;DR: PI-Light is a physics-inspired diffusion framework for full-image relighting that addresses synthetic-to-real gap challenges through batch-aware attention, physics-guided neural rendering, and physics-inspired losses.

DetailsMotivation: Full-image relighting faces challenges including difficulty collecting large-scale paired data, maintaining physical plausibility, and limited generalizability of data-driven priors. Existing methods bridging synthetic-to-real gap remain suboptimal.

Method: Two-stage framework with: (1) batch-aware attention for consistent intrinsic predictions across image collections, (2) physics-guided neural rendering enforcing physically plausible light transport, (3) physics-inspired losses regularizing training toward physically meaningful landscape, (4) curated dataset of diverse objects/scenes under controlled lighting.

Result: PI-Light synthesizes specular highlights and diffuse reflections across various materials, achieving superior generalization to real-world scenes compared to prior approaches.

Conclusion: The framework enables efficient finetuning of pretrained diffusion models while providing a solid benchmark for downstream evaluation, addressing key challenges in full-image relighting.

Abstract: Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce Physics-Inspired diffusion for full-image reLight ($π$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $π$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

[216] Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Xiaoxiao Sun, Mingyang Li, Kun yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

Main category: cs.CV

TL;DR: VI-Probe is a framework that systematically tests whether Large Vision-Language Models truly perceive visual changes or just recall memorized patterns by using controlled visual illusions with graded perturbations and matched controls.

DetailsMotivation: VLMs often answer visual illusions "correctly" on original images but persist with the same responses when illusion factors are inverted, suggesting they may not actually perceive visual changes but rather recall memorized patterns. The underlying causes of this phenomenon remain unclear, motivating a systematic investigation.

Method: Introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) to disentangle visually grounded perception from language-driven recall. Uses metrics like Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls, rather than just averaged accuracy.

Result: Experiments reveal that response persistence arises from heterogeneous causes rather than a single mechanism: GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits.

Conclusion: The findings challenge single-cause views of VLM behavior and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change, providing a more nuanced understanding of how VLMs process visual information.

Abstract: Large Vision-Language Models (VLMs) often answer classic visual illusions “correctly” on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
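
A minimal sketch of a polarity-flip-consistency-style measure: the fraction of items whose answer is unchanged after the illusion factor is inverted, where high values suggest recall rather than perception. The exact definition in the paper may differ.

```python
# Illustrative polarity-flip consistency over paired answers (original vs. inverted illusion).
def polarity_flip_consistency(answers_original, answers_inverted):
    assert len(answers_original) == len(answers_inverted)
    same = sum(a == b for a, b in zip(answers_original, answers_inverted))
    return same / len(answers_original)

# High values indicate the model repeats its answer despite an obvious visual change,
# which is evidence of recall rather than perception.
print(polarity_flip_consistency(["longer", "left", "same"], ["longer", "right", "same"]))  # ~0.67
```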

[217] UEval: A Benchmark for Unified Multimodal Generation

Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

Main category: cs.CV

TL;DR: UEval is a benchmark for evaluating unified models that generate both images and text, featuring 1,000 expert-curated questions requiring multimodal outputs with fine-grained rubric-based scoring.

DetailsMotivation: Current evaluation methods for open-ended multimodal generation are inadequate, as simple LLM-as-a-judge approaches miss subtleties. There's a need for comprehensive benchmarks that properly assess models capable of generating both images and text across diverse real-world tasks.

Method: UEval comprises 1,000 expert-curated questions from 8 real-world tasks requiring both image and text outputs. The benchmark uses a rubric-based scoring system where MLLMs generate initial evaluation criteria, which human experts then refine and validate, resulting in 10,417 validated rubric criteria for fine-grained automatic scoring.

Result: UEval is challenging for current unified models: GPT-5-Thinking scores 66.4/100, while the best open-source model reaches only 49.1. Reasoning models outperform non-reasoning ones, and transferring reasoning traces from reasoning to non-reasoning models significantly narrows the performance gap.

Conclusion: Reasoning appears crucial for complex multimodal understanding and generation tasks. UEval provides a scalable, fine-grained evaluation framework for unified models, highlighting current limitations and the importance of reasoning capabilities in multimodal AI systems.

Abstract: We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to a MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
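
A minimal sketch of rubric-based scoring: each validated criterion is judged pass/fail and the question score is the fraction passed. The judge call is a placeholder; UEval's actual judging prompt and aggregation may differ.

```python
# Illustrative rubric-based scoring with a pluggable pass/fail judge.
def score_response(response_text, response_images, rubric, judge_fn):
    """rubric: list of criterion strings; judge_fn returns True/False per criterion."""
    passed = sum(bool(judge_fn(c, response_text, response_images)) for c in rubric)
    return 100.0 * passed / max(len(rubric), 1)

# Toy judge: checks whether the criterion's keywords appear in the text answer.
toy_judge = lambda crit, text, imgs: crit.lower() in text.lower()
rubric = ["preheat the oven", "includes a diagram"]
print(score_response("First, preheat the oven to 180C ...", [], rubric, toy_judge))  # 50.0
```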

[218] One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He

Main category: cs.CV

TL;DR: Pixel MeanFlow (pMF) enables one-step latent-free image generation by formulating network output in image space and loss in velocity space, achieving strong results on ImageNet at 256x256 and 512x512 resolutions.

DetailsMotivation: Current diffusion/flow models use multi-step sampling and operate in latent spaces. The paper aims to advance toward one-step diffusion/flow without latents by addressing both aspects simultaneously.

Method: Proposes pixel MeanFlow (pMF) with separate network output space (image manifold, x-prediction) and loss space (velocity space via MeanFlow). Introduces transformation between image manifold and average velocity field.

Result: Achieves 2.22 FID on ImageNet 256x256 and 2.48 FID on 512x512 resolution for one-step latent-free generation, filling a key gap in this regime.

Conclusion: pMF advances boundaries of diffusion/flow-based generative models by enabling efficient one-step generation without latent spaces, with strong empirical performance.

Abstract: Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose “pixel MeanFlow” (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
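
A hedged sketch of the x-prediction idea: under a linear interpolation path z_t = (1 - t)x + t e, an x-prediction can be converted into a velocity estimate v = (z_t - x_hat)/t, and a single Euler step from t = 1 then returns x_hat directly. This illustrates the general recipe only, not the paper's exact MeanFlow formulation.

```python
# Hedged illustration: converting an image-space prediction into a velocity for one-step sampling.
import torch

def x_pred_to_velocity(z_t, x_hat, t, eps=1e-5):
    # Under z_t = (1 - t) x + t e, the velocity e - x equals (z_t - x) / t.
    return (z_t - x_hat) / t.clamp_min(eps)

def one_step_sample(net, z_1):
    x_hat = net(z_1)                                   # network output lives in image space
    t = torch.ones(z_1.shape[0], 1, 1, 1)
    v_hat = x_pred_to_velocity(z_1, x_hat, t)
    return z_1 - v_hat                                 # Euler step over [1, 0] recovers x_hat

toy_net = lambda z: torch.tanh(z)                      # stand-in for a trained model
print(one_step_sample(toy_net, torch.randn(2, 3, 32, 32)).shape)
```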

[219] Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video

Zijie Pan, Zeyu Yang, Xiatian Zhu, Li Zhang

Main category: cs.CV

TL;DR: Efficient4D: A fast video-to-4D object generation framework that creates dynamic 3D objects from single-view videos using spacetime-consistent image generation and 4D Gaussian splatting.

DetailsMotivation: Generating dynamic 3D objects from single-view videos is challenging due to lack of 4D labeled data. Existing approaches using score distillation sampling are slow and expensive to scale due to back-propagation through large pretrained models.

Method: Proposes Efficient4D framework that: 1) generates high-quality spacetime-consistent images under different camera views, 2) uses them as labeled data to reconstruct 4D content via 4D Gaussian splatting, 3) introduces inconsistency-aware confidence-weighted loss and lightly weighted score distillation loss for robust sparse-view reconstruction.

Result: Achieves 10x speed increase compared to prior art (10 minutes vs 120 minutes for Consistent4D) while preserving novel view synthesis quality. Enables real-time rendering under continuous camera trajectories.

Conclusion: Efficient4D provides an efficient solution for video-to-4D generation with significant speed improvements while maintaining quality, making dynamic 3D object generation more practical.

Abstract: Generating a dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring off-the-shelf image generation models such as score distillation sampling. However, this approach would be slow and expensive to scale due to the need for back-propagating the information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content through a 4D Gaussian splatting model. Importantly, our method can achieve real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce an inconsistency-aware confidence-weighted loss design, along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed when compared to prior art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, vs 120 minutes by the previous art model Consistent4D.
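
A minimal sketch of an inconsistency-aware, confidence-weighted reconstruction loss: pixels where the generated views disagree are down-weighted. The confidence estimate and its scale are illustrative assumptions, not the paper's design.

```python
# Illustrative confidence-weighted reconstruction loss for pseudo-labeled views.
import torch

def confidence_weighted_loss(rendered, target, confidence):
    """rendered, target: (B, 3, H, W); confidence: (B, 1, H, W) in [0, 1]."""
    per_pixel = (rendered - target).abs()
    return (confidence * per_pixel).mean()

def consistency_confidence(view_a, view_b, sigma=0.1):
    # Higher confidence where two generated images of the same content agree.
    err = (view_a - view_b).abs().mean(dim=1, keepdim=True)
    return torch.exp(-err / sigma)

a, b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(confidence_weighted_loss(a, b, consistency_confidence(a, b)))
```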

[220] Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

Yong He, Hongshan Yu, Chaoxu Mu, Mingtao Feng, Tongjia Chen, Zechuan Li, Anwaar Ulhaq, Ajmal Mian

Main category: cs.CV

TL;DR: SMTransformer integrates task-level information into point cloud processing using soft masks and skip-attention up-sampling with shared position encoding to reduce parameters while achieving state-of-the-art segmentation results.

DetailsMotivation: Current point cloud processing methods focus on local and global features at the feature level but overlook task-level context during encoding. The authors argue that integrating task-level information into the encoding stage can significantly enhance performance for downstream tasks.

Method: Proposes SMTransformer which incorporates task-level information using soft masks generated from task-level queries and keys to learn attention weights. Introduces skip-attention-based up-sampling block for effective feature communication between encoding and decoding layers. Uses shared position encoding strategy to reduce parameters and training time without accuracy loss.

Result: Achieves state-of-the-art semantic segmentation results: 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on SWAN dataset. Experimental comparisons demonstrate efficacy for point cloud processing tasks including semantic segmentation and classification.

Conclusion: Integrating task-level information into point cloud encoding significantly improves performance. The proposed SMTransformer with soft masks, skip-attention up-sampling, and shared position encoding provides an effective framework for point cloud processing tasks while managing computational complexity.

Abstract: Point cloud processing methods leverage local and global point features at the feature level to cater to downstream tasks, yet they often overlook the task-level context inherent in point clouds during the encoding stage. We argue that integrating task-level information into the encoding stage significantly enhances performance. To that end, we propose SMTransformer, which incorporates task-level information into a vector-based transformer by utilizing a soft mask generated from task-level queries and keys to learn the attention weights. Additionally, to facilitate effective communication between features from the encoding and decoding layers in high-level tasks such as segmentation, we introduce a skip-attention-based up-sampling block. This block dynamically fuses features from various resolution points across the encoding and decoding layers. To mitigate the increase in network parameters and training time resulting from the complexity of the aforementioned blocks, we propose a novel shared position encoding strategy. This strategy allows various transformer blocks to share the same position information over the same resolution points, thereby reducing network parameters and training time without compromising accuracy. Experimental comparisons with existing methods on multiple datasets demonstrate the efficacy of SMTransformer and skip-attention-based up-sampling for point cloud processing tasks, including semantic segmentation and classification. In particular, we achieve state-of-the-art semantic segmentation results of 73.4% mIoU on S3DIS Area 5 and 62.4% mIoU on the SWAN dataset.
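
A minimal sketch of soft-masked attention: task-level queries and keys produce a soft mask in [0, 1] that modulates ordinary attention weights. The gating form and dimensions are simplifying assumptions, not the paper's full vector-attention design.

```python
# Illustrative soft-masked scaled dot-product attention.
import torch

def soft_masked_attention(q, k, v, q_task, k_task):
    """q, k, v: (B, N, D) point features; q_task, k_task: (B, N, D) task-level projections."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-1, -2) / d ** 0.5, dim=-1)      # (B, N, N)
    soft_mask = torch.sigmoid(q_task @ k_task.transpose(-1, -2) / d ** 0.5)
    attn = attn * soft_mask                                               # gate with the soft mask
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)          # renormalize rows
    return attn @ v

x = torch.randn(2, 128, 64)
print(soft_masked_attention(x, x, x, x, x).shape)  # torch.Size([2, 128, 64])
```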

[221] An explainable vision transformer with transfer learning based efficient drought stress identification

Aswini Kumar Patra, Ankit Varshney, Lingaraj Sahoo

Main category: cs.CV

TL;DR: Vision transformers applied to aerial imagery for early drought stress detection in potato crops, with explainable attention maps highlighting stress signatures.

DetailsMotivation: Early detection of drought stress is crucial for preventing irreversible crop loss. While CNNs are commonly used, vision transformers offer better capture of long-range dependencies and subtle spatial patterns in aerial imagery that indicate early drought stress.

Method: Two approaches: 1) ViT + SVM pipeline where ViT extracts spatial features from aerial images and SVM classifies as stressed/healthy, and 2) end-to-end ViT with classification layer. Both methods use attention visualization to explain model decisions.

Result: High accuracy in drought stress identification with attention maps revealing specific spatial features that ViT focuses on as drought stress signatures, providing interpretable insights into subtle plant features associated with drought.

Conclusion: ViTs offer robust and interpretable solutions for drought stress monitoring, enabling farmers to make informed decisions for improved crop management through explainable AI that highlights relevant plant features.

Abstract: Early detection of drought stress is critical for taking timely measures for reducing crop loss before the drought impact becomes irreversible. The subtle phenotypical and physiological changes in response to drought stress are captured by non-invasive imaging techniques, and these imaging data serve as a valuable resource for machine learning methods to identify drought stress. While convolutional neural networks (CNNs) are in wide use, vision transformers (ViTs) present a promising alternative in capturing long-range dependencies and intricate spatial relationships, thereby enhancing the detection of subtle indicators of drought stress. We propose an explainable deep learning pipeline that leverages the power of ViTs for drought stress detection in potato crops using aerial imagery. We applied two distinct approaches: a synergistic combination of ViT and support vector machine (SVM), where ViT extracts intricate spatial features from aerial images and SVM classifies the crops as stressed or healthy, and an end-to-end approach using a dedicated classification layer within ViT to directly detect drought stress. Our key findings explain the ViT model’s decision-making process by visualizing attention maps. These maps highlight the specific spatial features within the aerial images that the ViT model focuses on as the drought stress signature. Our findings demonstrate that the proposed methods not only achieve high accuracy in drought stress identification but also shed light on the diverse subtle plant features associated with drought stress. This offers a robust and interpretable solution for drought stress monitoring, enabling farmers to make informed decisions for improved crop management.
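
A minimal sketch of the ViT + SVM variant: a vision transformer backbone extracts features from aerial image crops and an SVM separates stressed from healthy plants. The backbone name, input preprocessing, and toy data are assumptions, not the authors' exact setup.

```python
# Illustrative ViT feature extraction followed by SVM classification.
import timm
import torch
from sklearn.svm import SVC

vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0).eval()

@torch.no_grad()
def extract_features(images):          # images: (N, 3, 224, 224), ImageNet-normalized
    return vit(images).numpy()         # (N, 768) pooled ViT features

# Toy tensors standing in for labeled aerial crops (0 = healthy, 1 = drought-stressed).
X = extract_features(torch.randn(16, 3, 224, 224))
y = [0, 1] * 8
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:4]))
```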

[222] TIPO: Text to Image with Text Presampling for Prompt Optimization

Shih-Ying Yeh, Sang-Hyun Park, Yi Li, Giyeong Oh, Xuehai Wang, Min Song, Youngjae Yu, Shang-Hong Lai

Main category: cs.CV

TL;DR: TIPO is an efficient prompt optimization method for text-to-image generation that uses a lightweight pre-trained model to automatically expand simple prompts into richer versions, improving visual quality while preserving original intent.

DetailsMotivation: Current prompt optimization methods for text-to-image generation often rely on resource-intensive approaches like large language models or reinforcement learning, which are computationally expensive and not scalable. There's a need for more efficient automated prompt engineering that can enhance visual quality while maintaining the original user intent.

Method: TIPO uses a lightweight pre-trained model to sample refined prompts from a targeted sub-distribution within the broader semantic space. It starts from simple user prompts and expands them into richer, more detailed versions while preserving the original intent. The approach focuses on distribution-aligned prompt engineering rather than resource-intensive LLM or RL methods.

Result: Extensive experiments across multiple domains show TIPO achieves stronger text alignment, reduced visual artifacts, and consistently higher human preference rates. It maintains competitive aesthetic quality while offering significant computational efficiency and scalability advantages over existing methods.

Conclusion: TIPO demonstrates the effectiveness of distribution-aligned prompt engineering for text-to-image generation, offering a scalable and efficient alternative to resource-intensive methods. It opens new possibilities for automated prompt refinement and highlights broader opportunities for scalable optimization in T2I tasks.

Abstract: TIPO (Text-to-Image Prompt Optimization) introduces an efficient approach for automatic prompt refinement in text-to-image (T2I) generation. Starting from simple user prompts, TIPO leverages a lightweight pre-trained model to expand these prompts into richer and more detailed versions. Conceptually, TIPO samples refined prompts from a targeted sub-distribution within the broader semantic space, preserving the original intent while significantly improving visual quality, coherence, and detail. Unlike resource-intensive methods based on large language models (LLMs) or reinforcement learning (RL), TIPO offers strong computational efficiency and scalability, opening new possibilities for effective automated prompt engineering in T2I tasks. Extensive experiments across multiple domains demonstrate that TIPO achieves stronger text alignment, reduced visual artifacts, and consistently higher human preference rates, while maintaining competitive aesthetic quality. These results highlight the effectiveness of distribution-aligned prompt engineering and point toward broader opportunities for scalable, automated refinement in text-to-image generation.
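
A minimal sketch of presampling-style prompt expansion: a lightweight language model rewrites a short user prompt into a more detailed one before it reaches the T2I model. GPT-2 is only a stand-in here; TIPO uses its own pre-trained prompt model and prompt format.

```python
# Illustrative prompt expansion with a small language model as a stand-in.
from transformers import pipeline

expander = pipeline("text-generation", model="gpt2")

def expand_prompt(user_prompt, max_new_tokens=48):
    seed = f"Prompt: {user_prompt}\nDetailed prompt:"
    out = expander(seed, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)[0]["generated_text"]
    # Keep only the expanded portion after the seed.
    return out.split("Detailed prompt:")[-1].strip()

print(expand_prompt("a cat on a windowsill"))
# The expanded text is then passed to the text-to-image model in place of the raw prompt.
```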

[223] EROAM: Event-based Camera Rotational Odometry and Mapping in Real-time

Wanli Xing, Shijie Lin, Linhan Yang, Zeqing Zhang, Yanjun Du, Maolin Lei, Yipeng Pan, Chen Wang, Jia Pan

Main category: cs.CV

TL;DR: EROAM is a real-time event-based rotational odometry and mapping system that uses spherical event representation and novel geometric optimization for accurate camera rotation estimation.

DetailsMotivation: Existing event-based odometry methods rely on event generation models or contrast maximization, which have limitations. There's a need for more accurate, robust, and computationally efficient rotational odometry systems for event cameras, especially under challenging conditions like high angular velocities.

Method: Projects events onto a unit sphere for spherical representation, introduces Event Spherical Iterative Closest Point (ES-ICP) geometric optimization framework, uses incremental k-d tree structures for map management with regional density control, and implements parallel point-to-line optimization.

Result: EROAM significantly outperforms state-of-the-art methods in accuracy, robustness, and computational efficiency on both synthetic and real-world datasets. Maintains consistent performance under high angular velocities and extended sequences, and produces high-quality panoramic reconstructions with fine structural details.

Conclusion: EROAM provides a novel, efficient, and accurate solution for event-based rotational odometry and mapping, demonstrating superior performance over existing methods and robustness in challenging conditions.

Abstract: This paper presents EROAM, a novel event-based rotational odometry and mapping system that achieves real-time, accurate camera rotation estimation. Unlike existing approaches that rely on event generation models or contrast maximization, EROAM employs a spherical event representation by projecting events onto a unit sphere and introduces Event Spherical Iterative Closest Point (ES-ICP), a novel geometric optimization framework designed specifically for event camera data. The spherical representation simplifies rotational motion formulation while operating in a continuous spherical domain, enabling enhanced spatial resolution. Our system features an efficient map management approach using incremental k-d tree structures and intelligent regional density control, ensuring optimal computational performance during long-term operation. Combined with parallel point-to-line optimization, EROAM achieves efficient computation without compromising accuracy. Extensive experiments on both synthetic and real-world datasets show that EROAM significantly outperforms state-of-the-art methods in terms of accuracy, robustness, and computational efficiency. Our method maintains consistent performance under challenging conditions, including high angular velocities and extended sequences, where other methods often fail or show significant drift. Additionally, EROAM produces high-quality panoramic reconstructions with preserved fine structural details.
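
A minimal sketch of the spherical event representation: each event's pixel coordinate is back-projected with the camera intrinsics and normalized onto the unit sphere. A pinhole camera without lens distortion is assumed.

```python
# Illustrative projection of event pixel coordinates onto the unit sphere.
import numpy as np

def events_to_sphere(xs, ys, K):
    """xs, ys: (N,) event pixel coordinates; K: (3, 3) camera intrinsic matrix."""
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).astype(np.float64)  # (3, N) homogeneous
    rays = np.linalg.inv(K) @ pix                                          # back-projected rays
    return (rays / np.linalg.norm(rays, axis=0, keepdims=True)).T          # (N, 3) unit vectors

K = np.array([[320.0, 0.0, 320.0], [0.0, 320.0, 240.0], [0.0, 0.0, 1.0]])
pts = events_to_sphere(np.array([100.0, 500.0]), np.array([50.0, 400.0]), K)
print(np.linalg.norm(pts, axis=1))  # all ones: the points lie on the unit sphere
```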

[224] CMOOD: Concept-based Multi-label OOD Detection

Zhendong Liu, Yi Nian, Yuehan Qin, Henry Peng Zou, Li Li, Xiyang Hu, Yue Zhao

Main category: cs.CV

TL;DR: COOD: A zero-shot multi-label OOD detection framework using vision-language models with concept-based label expansion to handle complex label dependencies without retraining.

DetailsMotivation: Existing OOD detection methods struggle with multi-label settings where samples have multiple interdependent labels and complex semantic relationships. Current approaches require extensive retraining and fail to generalize to unseen label combinations, creating a gap for real-world applications.

Method: COOD leverages pre-trained vision-language models with a concept-based label expansion strategy that enriches semantic space with both positive and negative concepts for each label. It introduces a new scoring function that models complex label dependencies to precisely differentiate OOD samples without additional training.
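
As a rough illustration of concept-based label expansion, the sketch below scores an image against per-label positive and negative concept embeddings and derives an OOD score from the best label score. The embeddings are random stand-ins for CLIP features, and the scoring rule is a simplification of the paper's actual function.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for CLIP features: one image embedding and, per label,
# a few positive and negative concept text embeddings (hypothetical).
dim = 16
image = normalize(rng.normal(size=dim))
labels = {
    "dog": {"pos": normalize(rng.normal(size=(3, dim))),
            "neg": normalize(rng.normal(size=(3, dim)))},
    "car": {"pos": normalize(rng.normal(size=(3, dim))),
            "neg": normalize(rng.normal(size=(3, dim)))},
}

def label_score(image_emb, concepts):
    """Simplified per-label score: best positive-concept similarity
    minus best negative-concept similarity."""
    pos = (concepts["pos"] @ image_emb).max()
    neg = (concepts["neg"] @ image_emb).max()
    return pos - neg

# Aggregate over labels; a low maximum score suggests the sample
# matches none of the known labels and may be OOD.
scores = {name: label_score(image, c) for name, c in labels.items()}
ood_score = -max(scores.values())
print(scores, ood_score)
```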

Result: Achieves approximately 95% average AUROC on both VOC and COCO datasets, significantly outperforming existing approaches. Maintains robust performance across varying numbers of labels and different types of OOD samples.

Conclusion: COOD provides an effective zero-shot solution for multi-label OOD detection that captures complex label dependencies without requiring retraining, addressing a critical gap in real-world applications where samples have multiple interdependent labels.

Abstract: How can models effectively detect out-of-distribution (OOD) samples in complex, multi-label settings without extensive retraining? Existing OOD detection methods struggle to capture the intricate semantic relationships and label co-occurrences inherent in multi-label settings, often requiring large amounts of training data and failing to generalize to unseen label combinations. While large language models have revolutionized zero-shot OOD detection, they primarily focus on single-label scenarios, leaving a critical gap in handling real-world tasks where samples can be associated with multiple interdependent labels. To address these challenges, we introduce COOD, a novel zero-shot multi-label OOD detection framework. COOD leverages pre-trained vision-language models, enhancing them with a concept-based label expansion strategy and a new scoring function. By enriching the semantic space with both positive and negative concepts for each label, our approach models complex label dependencies, precisely differentiating OOD samples without the need for additional training. Extensive experiments demonstrate that our method significantly outperforms existing approaches, achieving approximately 95% average AUROC on both VOC and COCO datasets, while maintaining robust performance across varying numbers of labels and different types of OOD samples.

[225] BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang

Main category: cs.CV

TL;DR: BiPO introduces a bidirectional autoregressive network with partial occlusion for text-to-motion synthesis, achieving state-of-the-art performance on HumanML3D dataset.

DetailsMotivation: Text-to-motion synthesis is challenging due to complex full-body dynamics and nuanced motion patterns over extended sequences that must accurately reflect textual descriptions.

Method: BiPO integrates part-based generation with bidirectional autoregressive architecture, using Partial Occlusion technique to probabilistically occlude certain motion part information during training to relax interdependencies.
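
A minimal sketch of the Partial Occlusion idea, assuming motion is already split into per-part token streams; the part decomposition, occlusion probability, and tensor layout here are illustrative rather than the paper's configuration.

```python
import torch

def partially_occlude(part_tokens, p_occlude=0.3):
    """Randomly zero out whole body-part token streams during training.

    part_tokens : tensor of shape (batch, parts, time, dim), e.g. parts =
                  {torso, arms, legs, ...} in an illustrative decomposition.
    p_occlude   : probability of hiding each part independently.
    """
    b, p, t, d = part_tokens.shape
    keep = (torch.rand(b, p, 1, 1) > p_occlude).float()
    # Occluded parts contribute no information to the other parts' context,
    # relaxing the interdependency introduced by part-based generation.
    return part_tokens * keep

tokens = torch.randn(2, 5, 64, 128)    # toy batch: 5 parts, 64 frames
occluded = partially_occlude(tokens)
print((occluded.abs().sum(dim=(2, 3)) == 0).float().mean())  # fraction occluded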

Result: BiPO achieves state-of-the-art performance on HumanML3D dataset, outperforming recent methods like ParCo, MoMask, and BAMM in FID scores and motion quality, and excels in motion editing tasks.

Conclusion: BiPO effectively advances text-to-motion synthesis with potential for practical applications, particularly in motion editing based on partial sequences and text.

Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO’s effectiveness in advancing text-to-motion synthesis and its potential for practical applications.

[226] ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun

Main category: cs.CV

TL;DR: ACDiT is a novel hybrid model combining autoregressive and diffusion paradigms for continuous visual generation, using block-wise autoregressive units with conditional diffusion for each block, enabling flexible interpolation between token-wise autoregression and full-sequence diffusion.

DetailsMotivation: The paper aims to overcome limitations of discrete tokenization in visual generation by creating a unified model that combines the strengths of autoregressive models (good at sequential generation) and diffusion models (excellent at continuous visual generation), enabling more flexible and efficient visual generation across different modalities.

Method: ACDiT introduces a block-wise autoregressive unit where each block’s generation is formulated as a conditional diffusion process conditioned on prior blocks. It uses a specially designed Skip-Causal Attention Mask on standard diffusion transformers during training, and during inference iterates between diffusion denoising and autoregressive decoding with KV-Cache optimization.
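
The sketch below builds one plausible block-wise mask of this kind, assuming a training sequence laid out as clean blocks followed by noisy blocks; the paper's exact Skip-Causal Attention Mask layout may differ.

```python
import numpy as np

def skip_causal_mask(num_blocks, block_len):
    """Sketch of a block-wise mask: each noisy block attends to the clean
    tokens of strictly earlier blocks and to its own noisy tokens, skipping
    the noisy tokens of other blocks. Assumed sequence layout:
    [clean_1 .. clean_B, noisy_1 .. noisy_B]."""
    n = num_blocks * block_len
    mask = np.zeros((2 * n, 2 * n), dtype=bool)   # True = attention allowed
    for i in range(num_blocks):
        c = slice(i * block_len, (i + 1) * block_len)           # clean block i
        z = slice(n + i * block_len, n + (i + 1) * block_len)   # noisy block i
        mask[c, : (i + 1) * block_len] = True   # clean: block-causal over clean blocks
        mask[z, : i * block_len] = True         # noisy: sees earlier clean blocks
        mask[z, z] = True                       # noisy: full attention within itself
    return mask

print(skip_causal_mask(num_blocks=3, block_len=2).astype(int))
```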

Result: ACDiT achieves state-of-the-art performance among autoregressive baselines on visual generation tasks (image, video) under similar model scales, demonstrates transferability to visual understanding tasks despite generative training, and shows potential for long-horizon visual generation through analysis of autoregressive-diffusion trade-offs.

Conclusion: ACDiT offers a novel perspective on visual autoregressive generation by successfully combining autoregressive and diffusion paradigms, providing a flexible framework that bypasses discrete tokenization limitations and opens new avenues for unified multimodal models.

Abstract: Autoregressive and diffusion models have achieved remarkable progress in language models and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer, that innovatively combines autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement, as simple as applying a specially designed Skip-Causal Attention Mask on the standard diffusion transformer during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that ACDiT performs best among all autoregressive baselines under similar model scales on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred in visual understanding tasks despite being trained with the generative objective. The analysis of the trade-off between autoregressive and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.

[227] Visual Localization via Semantic Structures in Autonomous Photovoltaic Power Plant Inspection

Viktor Kozák, Karel Košnar, Jan Chudoba, Miroslav Kulich, Libor Přeučil

Main category: cs.CV

TL;DR: A novel UAV localization pipeline for PV plant inspection that integrates PV module detection with navigation, using visual anchors and tracking for precise positioning relative to power plant models.

DetailsMotivation: Automating UAV-based thermal inspection of PV power plants requires precise navigation for optimal image capture, but current systems lack integrated detection and positioning capabilities.

Method: Developed a localization pipeline that detects PV modules, associates them with power plant models using visual anchor points and object tracking, and infers UAV position relative to the installation. Also presented three visual segmentation methods for PV modules.

Result: Verified and evaluated on custom aerial inspection datasets, demonstrating robustness and applicability for real-time navigation. Evaluated the influence of power plant model precision on localization accuracy.

Conclusion: The integrated detection-navigation pipeline enables precise UAV positioning for PV plant inspection, with segmentation methods supporting robust localization in real-time applications.

Abstract: Inspection systems utilizing unmanned aerial vehicles (UAVs) equipped with thermal cameras are increasingly popular for the maintenance of photovoltaic (PV) power plants. However, automation of the inspection task is a challenging problem as it requires precise navigation to capture images from optimal distances and viewing angles. This paper presents a novel localization pipeline that directly integrates PV module detection with UAV navigation, allowing precise positioning during inspection. The detections are used to identify the power plant structures in the image. These are associated with the power plant model and used to infer the UAV position relative to the inspected PV installation. We define visually recognizable anchor points for the initial association and use object tracking to discern global associations. Additionally, we present three different methods for visual segmentation of PV modules and evaluate their performance in relation to the proposed localization pipeline. The presented methods were verified and evaluated using custom aerial inspection data sets, demonstrating their robustness and applicability for real-time navigation. Additionally, we evaluate the influence of the power plant model precision on the localization methods.

[228] Can Large Language Models Capture Video Game Engagement?

David Melhart, Matthew Barthet, Georgios N. Yannakakis

Main category: cs.CV

TL;DR: LLMs struggle to accurately predict continuous human affect (engagement) from video game footage despite multimodal prompting, generally falling short of the continuous annotations provided by human labelers.

DetailsMotivation: To evaluate whether pretrained LLMs can successfully detect human affect from videos when provided with multimodal inputs (text and video frames), specifically testing their ability to predict continuous engagement annotations in gaming contexts.

Method: Comprehensive evaluation using 80 minutes of annotated videogame footage from 20 first-person shooter games (GameVibe corpus). Conducted over 4,800 experiments testing impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing on engagement prediction.

Result: LLMs generally fall behind continuous experience annotations provided by humans, despite claims of human-like performance in other domains. Performance fluctuates across games, and although some cases exceed expectations, the models overall fall short of the human annotations.

Conclusion: While LLMs can outperform traditional ML baselines, they struggle with continuous affect prediction from videos. The paper identifies underlying causes for performance fluctuations and provides a roadmap for further exploration of automated emotion labeling via LLMs.

Abstract: Can out-of-the-box pretrained Large Language Models (LLMs) detect human affect successfully when observing a video? To address this question, for the first time, we evaluate comprehensively the capacity of popular LLMs for successfully predicting continuous affect annotations of videos when prompted by a sequence of text and video frames in a multimodal fashion. In this paper, we test LLMs’ ability to correctly label changes of in-game engagement in 80 minutes of annotated videogame footage from 20 first-person shooter games of the GameVibe corpus. We run over 4,800 experiments to investigate the impact of LLM architecture, model size, input modality, prompting strategy, and ground truth processing method on engagement prediction. Our findings suggest that while LLMs rightfully claim human-like performance across multiple domains and are able to outperform traditional machine learning baselines, they generally fall behind continuous experience annotations provided by humans. We examine some of the underlying causes for a fluctuating performance across games, highlight the cases where LLMs exceed expectations, and draw a roadmap for the further exploration of automated emotion labelling via LLMs.

[229] Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion

QingYuan Jiang, Longfei Huang, Yang Yang

Main category: cs.CV

TL;DR: A novel multimodal learning approach that addresses modality imbalance by dynamically balancing classification abilities between weak and strong modalities using boosting principles.

DetailsMotivation: Multimodal learning suffers from modality imbalance where some modalities dominate learning, leading to suboptimal performance. Existing approaches focus on balancing learning but overlook the inherent disproportion in model classification ability as the root cause.

Method: Proposes a sustained boosting algorithm that simultaneously optimizes classification and residual errors, plus an adaptive classifier assignment strategy to dynamically improve weak modality performance. Theoretically analyzes convergence of cross-modal gap function.

Result: Empirical experiments on widely used datasets show superiority over various state-of-the-art multimodal learning baselines.

Conclusion: The approach effectively balances classification ability between strong and weak modalities, mitigating the modality imbalance issue in multimodal learning.

Abstract: Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.

[230] Progressively Deformable 2D Gaussian Splatting for Video Representation at Arbitrary Resolutions

Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, Yiling Xu

Main category: cs.CV

TL;DR: D2GV-AR is a deformable 2D Gaussian video representation that enables arbitrary-scale rendering and any-ratio progressive coding within a single model, achieving over 250 FPS while matching or surpassing recent implicit neural representation baselines.

DetailsMotivation: Current implicit neural representations (INRs) for video compression and processing lack scalable decoding across rates and resolutions, requiring retraining or multi-branch designs. Structured pruning fails to provide permutation-invariant progressive transmission order.

Method: Partition videos into fixed-length Groups of Pictures, represent each group with canonical 2D Gaussian primitives, model temporal evolution with neural ODEs, apply scale-aware grouping based on Nyquist sampling theorem to form nested hierarchy across resolutions, and enable progressive coding via D-optimal subset pruning.
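
As an illustration of scale-aware grouping, the sketch below assigns Gaussian primitives to resolution levels by comparing each primitive's spatial extent with a Nyquist-style sampling spacing; the thresholding rule is a guess at the spirit of the method, not the paper's exact criterion.

```python
import numpy as np

def scale_aware_groups(sigmas, levels=(1.0, 0.5, 0.25)):
    """Assign each 2D Gaussian primitive to the resolution levels at which it
    is still representable: a primitive with standard deviation sigma (in
    full-resolution pixels) is kept at scale s only if sigma is at least half
    the Nyquist-style sampling spacing of that scale (1/s full-res pixels)."""
    sigmas = np.asarray(sigmas)
    groups = {}
    for s in levels:
        nyquist_spacing = 1.0 / s                 # pixel spacing at scale s
        groups[s] = np.where(sigmas >= 0.5 * nyquist_spacing)[0]
    return groups

sigmas = np.array([0.3, 0.6, 1.2, 3.5])           # toy Gaussian scales in pixels
for scale, idx in scale_aware_groups(sigmas).items():
    print(f"scale {scale}: primitives {idx.tolist()}")
```

Finer scales keep progressively more (smaller) primitives, which is one way to realize the nested hierarchy across resolutions described above.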

Result: Achieves rendering at over 250 FPS while matching or surpassing recent INR baselines, enables multiscale continuous rate-quality adaptation with arbitrary-scale rendering and any-ratio progressive coding.

Conclusion: D2GV-AR provides an efficient deformable 2D Gaussian video representation that solves the scalability limitations of existing INR approaches, enabling practical video compression and processing with flexible rate-resolution adaptation.

Abstract: Implicit neural representations (INRs) enable fast video compression and effective video processing, but a single model rarely offers scalable decoding across rates and resolutions. In practice, multi-resolution typically relies on retraining or multi-branch designs, and structured pruning failed to provide a permutation-invariant progressive transmission order. Motivated by the explicit structure and efficiency of Gaussian splatting, we propose D2GV-AR, a deformable 2D Gaussian video representation that enables \emph{arbitrary-scale} rendering and \emph{any-ratio} progressive coding within a single model. We partition each video into fixed-length Groups of Pictures and represent each group with a canonical set of 2D Gaussian primitives, whose temporal evolution is modeled by a neural ordinary differential equation. During training and rendering, we apply scale-aware grouping according to Nyquist sampling theorem to form a nested hierarchy across resolutions. Once trained, primitives can be pruned via a D-optimal subset objective to enable any-ratio progressive coding. Extensive experiments show that D2GV-AR renders at over 250 FPS while matching or surpassing recent INR baselines, enabling multiscale continuous rate–quality adaptation.

[231] From Limited Labels to Open Domains: An Efficient Learning Method for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang, Guoqi Li

Main category: cs.CV

TL;DR: CDIKTNet: A cross-domain invariant knowledge transfer network for drone-view geo-localization that uses limited supervision to learn cross-view correlations and transfer knowledge between domains.

DetailsMotivation: Traditional supervised drone-view geo-localization methods require paired training data and struggle with cross-view correlations from unpaired data. Unsupervised methods suffer from feature confusion due to geographical similarity and spatial continuity, leading to unreliable pseudo-labels. Both approaches face challenges in new domains requiring retraining with new paired data.

Method: Proposes CDIKTNet with two sub-networks: 1) Cross-Domain Invariance Sub-network (CDIS) learns cross-view structural and spatial invariance from small paired data as prior knowledge, 2) Cross-Domain Transfer Sub-network (CDTS) uses dual-path contrastive learning to optimize subspaces while preserving shared feature space consistency. Forms closed-loop framework for invariance feature learning and knowledge transfer.

Result: Extensive experiments show CDIKTNet achieves state-of-the-art performance under full supervision compared to supervised methods, and surpasses existing unsupervised methods in both few-shot and cross-domain initialization scenarios.

Conclusion: CDIKTNet effectively addresses limitations of both supervised and unsupervised DVGL methods by leveraging limited supervision to learn cross-view invariance and enable knowledge transfer across domains, reducing computational overhead for domain adaptation.

Abstract: Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require obtaining the new paired data and subsequent retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods have enabled to generate pseudo-labels based on cross-view similarity to infer the pairing relationships. However, geographical similarity and spatial continuity often cause visually analogous features at different geographical locations. The feature confusion compromises the reliability of pseudo-label generation, where incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture facilitates a closed-loop framework for invariance feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance under full supervision compared with those supervised methods, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.

[232] Beyond Retraining: Training-Free Unknown Class Filtering for Source-Free Open Set Domain Adaptation of Vision-Language Models

Yongguang Li, Jindong Li, Qi Wang, Qianli Xing, Runliang Niu, Shengsheng Wang, Menglin Yang

Main category: cs.CV

TL;DR: VLM-OpenXpert: Training-free inference modules for open-set domain adaptation in vision-language models, addressing semantic affinity anchoring and threshold sensitivity issues without retraining.

DetailsMotivation: VLMs lack mechanisms to reject unknown classes in open-set scenarios, and existing methods either distort VLM geometry through forced retraining or suffer from threshold sensitivity in unknown detection.

Method: Two training-free modules: SUFF uses SVD to extract an “unknown subspace” and softly removes unknown components from features, while BGAT corrects score skewness via a Box-Cox transform and fits a bimodal Gaussian mixture for adaptive threshold estimation.
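
A minimal sketch of the SUFF step under these assumptions: the unknown subspace is spanned by the top right-singular directions of high-confidence unknown features, and removal is a soft subtraction of each sample's projection onto it; the per-sample weighting and the BGAT thresholding are omitted.

```python
import numpy as np

def suff_remove_unknown(features, unknown_feats, rank=4, strength=1.0):
    """Sketch of SUFF-style suppression: estimate a low-rank 'unknown
    subspace' from high-confidence unknown features via SVD, then softly
    subtract each sample's projection onto that subspace.

    features      : (N, D) test features
    unknown_feats : (M, D) features of samples scored as likely unknown
    rank          : dimensionality of the unknown subspace (assumed)
    strength      : 1.0 removes the projection fully; <1.0 removes it softly
    """
    centered = unknown_feats - unknown_feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                              # (rank, D) orthonormal rows
    projection = features @ basis.T @ basis        # component in unknown subspace
    return features - strength * projection

rng = np.random.default_rng(1)
feats = rng.normal(size=(5, 32))
unknowns = rng.normal(size=(20, 32))
cleaned = suff_remove_unknown(feats, unknowns, rank=4, strength=0.7)
print(cleaned.shape)
```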

Result: Experiments on 9 benchmarks with CLIP, SigLIP, and ALIGN backbones show the training-free pipeline matches or outperforms retraining-heavy SOTA methods in source-free OSDA settings.

Conclusion: Establishes a lightweight inference calibration paradigm for open-set VLM deployment that preserves VLM geometry while improving unknown class rejection.

Abstract: Vision-language models (VLMs) have gained widespread attention for their strong zero-shot capabilities across numerous downstream tasks. However, these models assume that each test image’s class label is drawn from a predefined label set and lack a reliable mechanism to reject samples from emerging unknown classes when only unlabeled data are available. To address this gap, open-set domain adaptation methods retrain models to push potential unknowns away from known clusters. Yet, some unknown samples remain stably anchored to specific known classes in the VLM feature space due to semantic relevance, which is termed as Semantic Affinity Anchoring (SAA). Forcibly repelling these samples unavoidably distorts the native geometry of VLMs and degrades performance. Meanwhile, existing score-based unknown detectors use simplistic thresholds and suffer from threshold sensitivity, resulting in sub-optimal performance. To address aforementioned issues, we propose VLM-OpenXpert, which comprises two training-free, plug-and-play inference modules. SUFF performs SVD on high-confidence unknowns to extract a low-rank “unknown subspace”. Each sample’s projection onto this subspace is weighted and softly removed from its feature, suppressing unknown components while preserving semantics. BGAT corrects score skewness via a Box-Cox transform, then fits a bimodal Gaussian mixture to adaptively estimate the optimal threshold balancing known-class recognition and unknown-class rejection. Experiments on 9 benchmarks and three backbones (CLIP, SigLIP, ALIGN) under source-free OSDA settings show that our training-free pipeline matches or outperforms retraining-heavy state-of-the-art methods, establishing a powerful lightweight inference calibration paradigm for open-set VLM deployment.

[233] CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

Main category: cs.CV

TL;DR: CacheFlow: A fast flow-based method for 3D human motion prediction that uses a two-stage approach with caching to achieve millisecond inference times while maintaining accuracy.

DetailsMotivation: Existing density estimation techniques for 3D human motion prediction are computationally expensive, often taking longer than the predicted time horizon. There's a need for faster inference methods that don't sacrifice accuracy.

Method: Two-stage flow-based approach: 1) Precompute and cache results from an unconditional flow-based generative model that transforms Gaussian mixture to future motion density, 2) Use lightweight model to map historical trajectories to Gaussian mixture samples for conditional prediction.
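
The two-stage caching idea can be sketched as follows, with a placeholder standing in for the expensive unconditional flow and a dot-product scorer standing in for the learned lightweight mapping; sizes and the selection rule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Offline stage (expensive, done once) -------------------------------
K, S, latent_dim, motion_dim = 8, 64, 16, 48       # toy sizes
means = rng.normal(size=(K, latent_dim))            # Gaussian-mixture components

def heavy_flow(z):
    """Placeholder for the unconditional flow that maps latents to future
    motions; in the real system this is the slow network being cached."""
    return np.tanh(z @ rng.normal(size=(latent_dim, motion_dim)))

latents = means[:, None, :] + 0.1 * rng.normal(size=(K, S, latent_dim))
cache = heavy_flow(latents.reshape(-1, latent_dim)).reshape(K, S, motion_dim)

# ---- Online stage (lightweight, per query) -------------------------------
def predict(history_emb):
    """Lightweight conditional step: score mixture components from the
    history embedding and return the cached futures of the best component."""
    scores = means @ history_emb                    # stand-in for a learned mapper
    return cache[np.argmax(scores)]                 # S candidate future motions

futures = predict(rng.normal(size=latent_dim))
print(futures.shape)                                # (64, 48)
```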

Result: Achieves ~1ms inference time (4x faster than VAE methods, 30x faster than diffusion methods), maintains prediction accuracy comparable to SOTA, and shows improved density estimation accuracy on Human3.6M and AMASS datasets.

Conclusion: CacheFlow enables fast, accurate 3D human motion prediction through a novel caching mechanism with flow-based models, making real-time applications feasible without compromising performance.

Abstract: Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at https://github.com/meaten/CacheFlow.

[234] Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss

Han Zhou, Sebastian G. Gruber, Teodora Popordanoska, Matthew B. Blaschko

Main category: cs.CV

TL;DR: The paper establishes theoretical connections between calibration error minimization and selective classification, proposing a flexible weighted risk function approach with bin-based CDF approximation for efficient optimization.

DetailsMotivation: Existing reweighted risk functionals for model calibration (like focal loss, inverse focal loss, AURC) lack clear theoretical connections to calibration errors. The paper aims to bridge this gap by exploring the relationship between calibration error minimization and selective classification paradigms.

Method: The authors revisit weighted risk functions and show that minimizing calibration error is linked to selective classification. They propose a loss function with flexible confidence score functions (CSFs) and use a bin-based cumulative distribution function (CDF) approximation for efficient gradient-based optimization with O(nM) complexity for n samples and M bins.
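
A toy version of the bin-based CDF idea: approximate the empirical CDF of confidence scores with M bins and reweight per-sample losses from it. The specific weighting (one minus the CDF value) is an illustrative choice, not the paper's exact functional.

```python
import numpy as np

def bin_cdf_weights(confidences, num_bins=10):
    """Approximate the empirical CDF of confidence scores with M equal-width
    bins and return a per-sample weight from it; the binning itself costs
    O(n + M), keeping the reweighted risk cheap to evaluate (the weighting
    scheme below is illustrative)."""
    counts, edges = np.histogram(confidences, bins=num_bins, range=(0.0, 1.0))
    cdf = np.cumsum(counts) / counts.sum()               # CDF value per bin
    bin_idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, num_bins - 1)
    # Weight low-confidence samples more: 1 - CDF emphasises the left tail,
    # mirroring the idea of optimizing selective risk in low-confidence regions.
    return 1.0 - cdf[bin_idx]

rng = np.random.default_rng(0)
conf = rng.uniform(size=32)                  # stand-in confidence scores
losses = rng.uniform(size=32)                # stand-in per-sample losses
weighted_risk = np.mean(bin_cdf_weights(conf) * losses)
print(weighted_risk)
```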

Result: Empirical evaluations demonstrate competitive calibration performance across various datasets and model architectures.

Conclusion: The paper provides a principled connection between calibration error and selective classification, offering a flexible and efficient approach to model calibration through weighted risk functions with theoretical grounding.

Abstract: Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk Coverage Curve (AURC), have been proposed for improving model calibration; yet their theoretical connections to calibration errors remain under-explored. In this paper, we revisit a broad class of weighted risk functions and find a principled connection between calibration error and selective classification. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing selective risk in low confidence regions naturally improves calibration. Our proposed loss shares a similar reweighting strategy with dual focal loss but offers greater flexibility through the choice of confidence score functions (CSFs). Furthermore, our approach utilizes a bin-based cumulative distribution function (CDF) approximation, enabling efficient gradient-based optimization with O(nM) complexity for n samples and M bins. Empirical evaluations demonstrate that our method achieves competitive calibration performance across a range of datasets and model architectures.

[235] Bridging Weakly-Supervised Learning and VLM Distillation: Noisy Partial Label Learning for Efficient Downstream Adaptation

Qian-Wei Wang, Yaguang Song, Shu-Tao Xia

Main category: cs.CV

TL;DR: A framework called Co-Reg for learning from noisy partial labels generated by pre-trained vision-language models, using collaborative consistency regularization and label purification techniques.

DetailsMotivation: Pre-trained VLMs like CLIP and GPT-4V can generate noisy partial labels to replace manual annotation, but their noise is instance-dependent and reflects model biases, making traditional noisy label learning methods insufficient.

Method: Jointly trains two neural networks with collaborative label purification via co-pseudo-labeling, enforces consistency regularization in both label and feature spaces, uses alternating optimization of contrastive representations and pseudo-labels, and maintains class prototypes in shared feature space.
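
A bare-bones sketch of co-pseudo-labeling within candidate label sets, where each network's candidate-restricted probabilities supervise the other network; the consistency regularization, class prototypes, and anti-overfitting strategies described above are omitted.

```python
import numpy as np

def co_pseudo_labels(logits_a, logits_b, candidate_mask):
    """Each network's probabilities are restricted to the sample's candidate
    labels, and the resulting distribution supervises the *other* network
    (temperature, mixing, and purification schedules are omitted)."""
    def restricted_probs(logits):
        z = np.where(candidate_mask, logits, -np.inf)   # drop non-candidates
        z = z - z.max(axis=1, keepdims=True)
        p = np.exp(z)
        return p / p.sum(axis=1, keepdims=True)

    targets_for_b = restricted_probs(logits_a)   # A purifies labels for B
    targets_for_a = restricted_probs(logits_b)   # B purifies labels for A
    return targets_for_a, targets_for_b

rng = np.random.default_rng(0)
logits_a, logits_b = rng.normal(size=(2, 4, 6))              # 4 samples, 6 classes
candidates = rng.random((4, 6)) > 0.5                        # noisy candidate sets
candidates[np.arange(4), rng.integers(0, 6, size=4)] = True  # ensure non-empty
t_a, t_b = co_pseudo_labels(logits_a, logits_b, candidates)
print(t_a.round(2))
```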

Result: Extensive experiments show the method’s effectiveness and highlight the potential of integrating weakly supervised learning into knowledge distillation of pre-trained models.

Conclusion: The proposed Co-Reg framework successfully addresses the challenges of VLM-generated noisy partial labels and enables annotation-free training by leveraging pre-trained vision-language models.

Abstract: In the context of noisy partial label learning (NPLL), each training sample is associated with a set of candidate labels annotated by multiple noisy annotators. With the emergence of high-performance pre-trained vision-language models (VLMs) such as CLIP, LLaVA, and GPT-4V, leveraging these models to replace time-consuming manual annotation and enable annotation-free training has become a promising research direction. This paper studies learning from noisy partial labels generated by pre-trained VLMs and proposes a collaborative consistency regularization (Co-Reg) framework. Unlike symmetric noise commonly assumed in traditional noisy label learning, VLM-generated noise is instance-dependent and reflects the intrinsic biases of pre-trained models, posing greater challenges. To address this issue, we jointly train two neural networks to perform collaborative label purification via a co-pseudo-labeling mechanism, while enforcing consistency regularization in both label and feature representation spaces. In addition, multiple anti-overfitting strategies are introduced, including alternating optimization of contrastive representations and pseudo-labels, as well as maintaining class prototypes in a shared feature space. The proposed method can further incorporate few-shot manually annotated labels for performance enhancement. Extensive experiments under various settings demonstrate the effectiveness of our approach and highlight the potential of integrating weakly supervised learning into the knowledge distillation of pre-trained models.

[236] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero

Main category: cs.CV

TL;DR: A novel pre-training strategy for DINOv2 that accelerates convergence and improves robustness through frequency filtering curriculum and Gaussian noise patching augmentation, achieving 1.6x faster training with competitive performance.

DetailsMotivation: Large-scale vision foundation models like DINOv2 require massive computational resources for pre-training, making reproduction difficult for private data, new modalities, or scientific exploration. There's a need for more efficient pre-training methods that maintain or improve model performance.

Method: Proposes two key techniques: 1) Frequency filtering curriculum where models first see low-frequency components before gradually introducing higher frequencies, and 2) Gaussian noise patching augmentation to enhance robustness. Applied to ViT-B/16 backbone trained on ImageNet-1K.
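
The two ingredients are easy to sketch in isolation: an FFT low-pass filter whose cutoff grows over training, and random Gaussian-noise patches. Patch size, probabilities, and the curriculum schedule below are illustrative rather than the paper's settings.

```python
import numpy as np

def low_pass(image, keep_fraction):
    """Keep only the lowest spatial frequencies of a (H, W) image; the
    curriculum raises keep_fraction from a small value toward 1.0."""
    f = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    cy, cx = h // 2, w // 2
    ry, rx = max(1, int(cy * keep_fraction)), max(1, int(cx * keep_fraction))
    mask = np.zeros_like(f)
    mask[cy - ry:cy + ry, cx - rx:cx + rx] = 1.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))

def noise_patch(image, patch=8, p=0.2, rng=np.random.default_rng(0)):
    """Replace random patches with Gaussian noise (probability and patch size
    are illustrative, not the paper's settings)."""
    out = image.copy()
    for y in range(0, image.shape[0], patch):
        for x in range(0, image.shape[1], patch):
            if rng.random() < p:
                block = out[y:y + patch, x:x + patch]
                block[...] = rng.normal(image.mean(), image.std() + 1e-8,
                                        size=block.shape)
    return out

img = np.random.default_rng(1).random((32, 32))
early_view = noise_patch(low_pass(img, keep_fraction=0.2))   # early training
late_view = noise_patch(low_pass(img, keep_fraction=1.0))    # late training
print(early_view.shape, late_view.shape)
```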

Result: Achieves 1.6x reduction in pre-training time and 2.25x reduction in FLOPs while matching robustness on ImageNet-C corruption benchmarks and maintaining competitive linear probing performance compared to baseline DINOv2.

Conclusion: The method provides dual benefits of efficiency and robustness, making large-scale self-supervised foundation modeling more accessible and opening new research directions for data curriculum and augmentation strategies in self-supervised learning.

Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning–which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence–and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum–low-frequency being seen first–and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at https://github.com/KevinZ0217/fast_dinov2

[237] Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps

Jiaxu Wan, Xu Wang, Mengwei Xie, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Hong Zhang, Ding Yuan, Yifan Yang

Main category: cs.CV

TL;DR: ONR refines road-level SD maps into lane-level navigation by associating them with real-time OP maps, using a new dataset and transformer model for map-to-map alignment.

DetailsMotivation: Current lane-level navigation relies on expensive HD maps that can't adapt to dynamic conditions, while online perception maps lack global topology. There's a need for cost-effective, real-time lane-level navigation that combines SD map topology with OP map geometry.

Method: Introduces Online Navigation Refinement (ONR) mission, creates OMA dataset with lane-to-road correspondences, and proposes MAT transformer with path-aware attention for topology alignment and spatial attention for handling noisy OP features.

Result: MAT outperforms existing methods with 34 ms latency, enabling low-cost, up-to-date lane-level navigation. The OMA dataset contains 30K scenarios and 2.6M annotated lane vectors.

Conclusion: ONR enables practical lane-level navigation by associating SD maps with OP maps, overcoming challenges of many-to-one lane-to-road mappings and misalignment through the proposed dataset and transformer architecture.

Abstract: Lane-level navigation is critical for geographic information systems and navigation-based tasks, offering finer-grained guidance than the road-level navigation provided by standard definition (SD) maps. However, it currently relies on expensive global HD maps that cannot adapt to dynamic road conditions. Recently, online perception (OP) maps have become research hotspots, providing real-time geometry as an alternative, but they lack the global topology needed for navigation. To address these issues, Online Navigation Refinement (ONR), a new mission, is introduced that refines SD-map-based road-level routes into accurate lane-level navigation by associating SD maps with OP maps. The map-to-map association must handle many-to-one lane-to-road mappings under two key challenges: (1) no public dataset provides lane-to-road correspondences; (2) severe misalignment from spatial fluctuations, semantic disparities, and OP map noise invalidates traditional map matching. For these challenges, we contribute: (1) the Online Map Association dataset (OMA), the first ONR benchmark with 30K scenarios and 2.6M annotated lane vectors; (2) MAT, a transformer with path-aware attention that aligns topology despite spatial fluctuations and semantic disparities, and spatial attention that integrates noisy OP features via global context; and (3) NR P-R, a metric evaluating geometric and semantic alignment. Experiments show that MAT outperforms existing methods at 34 ms latency, enabling low-cost and up-to-date lane-level navigation.

[238] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu

Main category: cs.CV

TL;DR: A diffusion-based cross-domain image translation method using joint learning to align diffusion and translation processes for improved global optimization and performance.

DetailsMotivation: Existing diffusion-based image translation methods struggle with misalignment between diffusion processes (applied to noisy signals) and translation processes (applied to clean signals), leading to separate training or shallow integration that causes local minima and limits effectiveness.

Method: Proposes a joint learning framework that extracts image components with diffusion models to represent clean signals, uses these components for translation, and employs a time-dependent translation network for complex mapping, enabling end-to-end joint optimization.

Result: Extensive experiments on RGB↔RGB and cross-modality tasks (RGB↔Edge, RGB↔Semantics, RGB↔Depth) show better generative performance than state-of-the-art methods.

Conclusion: The joint learning approach enables global optimization of both diffusion and translation processes, improving fidelity and structural consistency in cross-domain image translation without paired training data.

Abstract: We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for more coverable modeling of the data distribution and performance improvement of the cross-domain translation. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may cause the local minimal of the translation optimization, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling an end-to-end joint learning manner. On the other hand, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics and RGB$\leftrightarrow$Depth, showcasing better generative performances than the state of the arts.

[239] PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI

Haoyang Su, Jin-Yi Xiang, Shaohao Rui, Yifan Gao, Xingyu Chen, Tingxuan Yin, Shaoting Zhang, Xiaosong Wang, Lian-Ming Wu

Main category: cs.CV

TL;DR: PRISM is a self-supervised multimodal framework that integrates cardiac cine MRI visual representations with structured EHR data using medically informed textual prompts for improved prediction of major adverse cardiac events.

DetailsMotivation: Accurate prediction of major adverse cardiac events (MACE) remains challenging in cardiovascular prognosis. Current approaches often don't effectively integrate multimodal data (imaging + EHR) or leverage medical domain knowledge through textual guidance.

Method: PRISM uses motion-aware multi-view distillation to extract temporally synchronized imaging features from non-contrast cardiac cine MRI, then modulates these features using medically informed textual prompts. It integrates these with structured EHR data for survival analysis through a self-supervised framework.

Result: PRISM consistently outperforms classical survival prediction models and state-of-the-art deep learning baselines across four independent clinical cohorts. It uncovered three distinct imaging signatures associated with elevated MACE risk and identified hypertension, diabetes, and smoking as dominant contributors among clinical factors.

Conclusion: PRISM demonstrates that integrating visual representations from cardiac imaging with structured EHR data using prompt-guided modulation enables fine-grained risk prediction and provides valuable clinical insights into cardiac risk across diverse patient populations.

Abstract: Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.

[240] Causality-guided Prompt Learning for Vision-language Models via Visual Granulation

Mengyu Gao, Qiulei Dong

Main category: cs.CV

TL;DR: CaPL: A causality-guided text prompt learning method via visual granulation for CLIP that improves fine-grained recognition by decomposing visual features into shared and individualized attributes and constructing visual granules for more discriminative text prompts.

DetailsMotivation: Existing CLIP-based prompt learning methods show limited ability for handling fine-grained datasets. The authors aim to address this limitation by developing a method that can capture subtle discrepancies among different fine-grained classes through causal inference and visual granulation.

Method: Two main modules: 1) Attribute disentanglement module using Brownian Bridge Diffusion Model to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes); 2) Granule learning module to construct visual granules by integrating these attributes for recognition under two causal inference strategies.

Result: Extensive experiments on 15 datasets show CaPL significantly outperforms state-of-the-art prompt learning methods, especially on fine-grained datasets.

Conclusion: The proposed CaPL method effectively improves fine-grained recognition in CLIP through visual granulation and causal inference, demonstrating superior performance over existing prompt learning approaches.

Abstract: Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods only show a limited ability for handling fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique could construct sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.

[241] BIR-Adapter: A parameter-efficient diffusion adapter for blind image restoration

Cem Eteke, Alexander Griessel, Wolfgang Kellerer, Eckehard Steinbach

Main category: cs.CV

TL;DR: BIR-Adapter is a parameter-efficient diffusion adapter for blind image restoration that uses attention mechanisms and sampling guidance to reduce hallucinations, achieving competitive performance with 36x fewer parameters.

DetailsMotivation: Current diffusion-based restoration methods rely on auxiliary feature extractors or extensive fine-tuning of pre-trained models. The authors observed that large-scale pretrained diffusion models retain informative representations under common image degradations, motivating a more efficient approach.

Method: Introduces a parameter-efficient, plug-and-play attention mechanism (BIR-Adapter) that reduces trained parameters. Also proposes a sampling guidance mechanism to mitigate hallucinations during restoration. The adapter design enables seamless integration into existing models.

Result: Achieves competitive and sometimes superior performance compared to state-of-the-art methods on synthetic and real-world degradations while requiring up to 36x fewer trained parameters. Successfully extends super-resolution-only diffusion models to handle additional unknown degradations.

Conclusion: BIR-Adapter provides an efficient, adaptable solution for blind image restoration that maintains high performance while significantly reducing parameter requirements, demonstrating the potential of adapter-based approaches for broader image restoration tasks.

Abstract: We introduce the BIR-Adapter, a parameter-efficient diffusion adapter for blind image restoration. Diffusion-based restoration methods have demonstrated promising performance in addressing this fundamental problem in computer vision, typically relying on auxiliary feature extractors or extensive fine-tuning of pre-trained models. Motivated by the observation that large-scale pretrained diffusion models can retain informative representations under common image degradations, BIR-Adapter introduces a parameter-efficient, plug-and-play attention mechanism that substantially reduces the number of trained parameters. To further improve reliability, we propose a sampling guidance mechanism that mitigates hallucinations during the restoration process. Experiments on synthetic and real-world degradations demonstrate that BIR-Adapter achieves competitive, and in several settings superior, performance compared to state-of-the-art methods while requiring up to 36x fewer trained parameters. Moreover, the adapter-based design enables seamless integration into existing models. We validate this generality by extending a super-resolution-only diffusion model to handle additional unknown degradations, highlighting the adaptability of our approach for broader image restoration tasks.

[242] Large Vision Models Can Solve Mental Rotation Problems

Sebastian Ray Mason, Anders Gjølbye, Phillip Chavarria Højbjerg, Lenka Tětková, Lars Kai Hansen

Main category: cs.CV

TL;DR: Vision transformers (ViT, CLIP, DINOv2, DINOv3) evaluated on mental rotation tasks show self-supervised models capture geometric structure better than supervised ones, with intermediate layers outperforming final layers, and task difficulty patterns mirroring human cognitive constraints.

DetailsMotivation: To understand how modern vision transformers develop spatial reasoning abilities similar to human mental rotation, and to systematically evaluate different transformer architectures (ViT, CLIP, DINOv2, DINOv3) across various mental rotation tasks.

Method: Systematic evaluation across mental rotation tasks ranging from simple block structures (Shepard-Metzler type) to complex block figures, text, and photo-realistic objects. Layer-by-layer probing of model representations to examine where and how networks succeed.

Result: 1) Self-supervised ViTs capture geometric structure better than supervised ViTs; 2) Intermediate layers perform better than final layers; 3) Task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

Conclusion: Vision transformers develop spatial reasoning abilities that show interesting parallels to human cognition, with self-supervised learning being particularly effective for capturing geometric structure, and the hierarchical nature of representations revealing insights about how these models process spatial information.

Abstract: Mental rotation is a key test of spatial reasoning in humans and has been central to understanding how perception supports cognition. Despite the success of modern vision transformers, it is still unclear how well these models develop similar abilities. In this work, we present a systematic evaluation of ViT, CLIP, DINOv2, and DINOv3 across a range of mental-rotation tasks, from simple block structures similar to those used by Shepard and Metzler to study human cognition, to more complex block figures, three types of text, and photo-realistic objects. By probing model representations layer by layer, we examine where and how these networks succeed. We find that i) self-supervised ViTs capture geometric structure better than supervised ViTs; ii) intermediate layers perform better than final layers; iii) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

[243] Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training

Brown Ebouky, Ajad Chhatkuli, Cristiano Malossi, Christoph Studer, Roy Assaf, Andrea Bartezzaghi

Main category: cs.CV

TL;DR: GLARE: Continual SSL pre-training method for vision foundation models using patch-level augmentations and regional consistency to improve semantic segmentation on low-data target domains with lightweight adapters.

DetailsMotivation: While SSL foundation models are typically pre-trained on general datasets and adapted via finetuning, there's limited exploration of continual SSL pre-training for new domains with limited data, especially for dense prediction tasks like semantic segmentation.

Method: Proposes GLARE with patch-level augmentations for local consistency and regional consistency constraints leveraging spatial semantics. Uses Vision Transformers initialized from SSL models with only lightweight UniAdapter modules updated during continual pre-training.

Result: Experiments across multiple semantic segmentation benchmarks show GLARE consistently improves downstream performance with minimal computational and parameter overhead.

Conclusion: GLARE enables effective continual SSL pre-training for vision foundation models on low-data target domains, enhancing semantic segmentation performance efficiently.

Abstract: Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While prior work has investigated parameter-efficient adaptation methods like adapters, LoRA, and prompt tuning, primarily targeting downstream finetuning, extending the SSL pre-training itself in a continual manner to new domains under limited data remains largely underexplored, especially for downstream dense prediction tasks like semantic segmentation. In this work, we address the challenge of adapting vision foundation models to low-data target domains through continual self-supervised pre-training, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream semantic segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules (specifically UniAdapter) while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.

[244] SEGA: A Transferable Signed Ensemble Gaussian Black-Box Attack against No-Reference Image Quality Assessment Models

Yujia Liu, Dingquan Li, Zhixuan Li, Tiejun Huang

Main category: cs.CV

TL;DR: SEGA is a transferable black-box attack method for No-Reference Image Quality Assessment models that uses Gaussian smoothing and gradient ensembling to improve attack transferability across different NR-IQA models.

DetailsMotivation: Existing adversarial attacks against NR-IQA models work well in white-box settings but have poor transferability to unknown models in black-box scenarios, limiting their practical usefulness for evaluating model robustness.

Method: Proposes SEGA (Signed Ensemble Gaussian black-box Attack) which approximates target model gradients by applying Gaussian smoothing to source models and ensembling their smoothed gradients, plus a perturbation filter mask to ensure imperceptibility.
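
The gradient-approximation step is easy to picture in code. The PyTorch sketch below is a minimal, hypothetical rendering of the attack loop, assuming each source model maps an image batch to a scalar quality score; the smoothing variance, step schedule, sign convention (raising or lowering the predicted score), and filter-mask construction in the paper may differ.

```python
import torch

def sega_attack(image, source_models, sigma=0.05, n_samples=8,
                epsilon=4 / 255, steps=10, mask=None):
    """Sketch of a signed ensemble Gaussian attack on NR-IQA source models."""
    adv = image.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        grad = torch.zeros_like(adv)
        for model in source_models:
            for _ in range(n_samples):
                noisy = adv + sigma * torch.randn_like(adv)  # Gaussian smoothing sample
                score = model(noisy).mean()                  # predicted quality score
                grad = grad + torch.autograd.grad(score, adv)[0]
        grad = grad / (len(source_models) * n_samples)       # smoothed ensemble gradient
        step = (epsilon / steps) * grad.sign()               # signed update
        if mask is not None:
            step = step * mask                               # perturbation filter mask
        adv = image + (adv.detach() + step - image).clamp(-epsilon, epsilon)
        adv = adv.clamp(0, 1)                                # keep a valid image
    return adv.detach()
```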

Result: Experimental results on CLIVE dataset demonstrate superior transferability compared to existing methods, enabling successful transfer-based black-box attacks against NR-IQA models.

Conclusion: SEGA effectively addresses the low transferability challenge in attacking NR-IQA models, providing a practical black-box attack method for evaluating model robustness.

Abstract: No-Reference Image Quality Assessment (NR-IQA) models play an important role in various real-world applications. Recently, adversarial attacks against NR-IQA models have attracted increasing attention, as they provide valuable insights for revealing model vulnerabilities and guiding robust system design. Some effective attacks have been proposed against NR-IQA models in white-box settings, where the attacker has full access to the target model. However, these attacks often suffer from poor transferability to unknown target models in more realistic black-box scenarios, where the target model is inaccessible. This work makes the first attempt to address the challenge of low transferability in attacking NR-IQA models by proposing a transferable Signed Ensemble Gaussian black-box Attack (SEGA). The main idea is to approximate the gradient of the target model by applying Gaussian smoothing to source models and ensembling their smoothed gradients. To ensure the imperceptibility of adversarial perturbations, SEGA further removes inappropriate perturbations using a specially designed perturbation filter mask. Experimental results on the CLIVE dataset demonstrate the superior transferability of SEGA, validating its effectiveness in enabling successful transfer-based black-box attacks against NR-IQA models.

[245] MORPH: PDE Foundation Models with Arbitrary Data Modality

Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Alexander Scheinker, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas

Main category: cs.CV

TL;DR: MORPH is a modality-agnostic autoregressive foundation model for PDEs that handles heterogeneous spatiotemporal data across different dimensions and field types using a convolutional vision transformer with specialized attention mechanisms.

DetailsMotivation: Scientific observations often involve heterogeneous, multimodal data from PDEs with varying dimensions (1D-3D), resolutions, and mixed scalar/vector fields. Existing models struggle with this heterogeneity, requiring specialized architectures for each data type.

Method: Built on convolutional vision transformer backbone with three key components: (1) component-wise convolution for joint scalar/vector processing, (2) inter-field cross-attention for information propagation between physical fields, (3) axial attentions factorizing spatiotemporal attention along individual axes for computational efficiency.
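
Of these components, the axial factorization is the most generic. The sketch below shows one way to split spatiotemporal self-attention along the time, height, and width axes of a (B, T, H, W, C) tensor; layer sizes, normalization, and the exact ordering used by MORPH are assumptions here, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Factorizes attention over a (B, T, H, W, C) tensor along each axis."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def _along(attn, seq):
        out, _ = attn(seq, seq, seq)     # self-attention along one axis
        return out

    def forward(self, x):                # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        # Temporal axis: sequences of length T for every spatial location.
        xt = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
        x = x + self._along(self.attn_t, xt).reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)
        # Height axis: sequences of length H for every (time, width) slice.
        xh = x.permute(0, 1, 3, 2, 4).reshape(B * T * W, H, C)
        x = x + self._along(self.attn_h, xh).reshape(B, T, W, H, C).permute(0, 1, 3, 2, 4)
        # Width axis: sequences of length W for every (time, height) slice.
        xw = x.reshape(B * T * H, W, C)
        x = x + self._along(self.attn_w, xw).reshape(B, T, H, W, C)
        return x
```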

Result: MORPH outperforms models trained from scratch and matches/surpasses strong baselines and SOTA models across diverse PDE prediction tasks, using both full fine-tuning and parameter-efficient low-rank adapters.

Conclusion: MORPH provides a flexible backbone for learning from heterogeneous scientific observations, advancing scalable and data-efficient scientific machine learning with publicly available code, datasets, and models.

Abstract: We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D–3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.

[246] RestoRect: Degraded Image Restoration via Latent Rectified Flow & Feature Distillation

Shourya Verma, Mengbo Wang, Nadia Atallah Lanman, Ananth Grama

Main category: cs.CV

TL;DR: RestoRect: A novel feature distillation method using rectified flow for efficient image restoration that transfers teacher knowledge to student models through generative feature synthesis in latent space.

DetailsMotivation: Current image restoration models face a trade-off between performance and speed - high-performance models are too slow for practical use while fast models produce poor results. Existing knowledge distillation methods using static feature matching cannot capture how modern transformer architectures dynamically generate features.

Method: Proposes Latent Rectified Flow Feature Distillation (RestoRect) that applies rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Combines Retinex decomposition with learnable anisotropic diffusion constraints and trigonometric color space polarization. Introduces Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection.
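
The rectified-flow view of distillation can be made concrete with a small sketch: a velocity network is trained to predict the straight-line transport from student features to teacher features in latent space. The network shape and conditioning below are illustrative assumptions only, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VelocityNet(nn.Module):
    """Small MLP predicting the transport velocity in feature space."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, dim * 2), nn.GELU(),
                                 nn.Linear(dim * 2, dim))

    def forward(self, z, t):
        return self.net(torch.cat([z, t], dim=-1))

def rectified_flow_distill_loss(student_feat, teacher_feat, velocity_net):
    """student_feat, teacher_feat: (B, D) features from matched layers."""
    t = torch.rand(student_feat.size(0), 1, device=student_feat.device)
    z_t = (1 - t) * student_feat + t * teacher_feat   # point on the straight path
    target_v = teacher_feat - student_feat            # constant rectified velocity
    return F.mse_loss(velocity_net(z_t, t), target_v)
```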

Result: Achieves better training stability, faster convergence and inference while preserving restoration quality. Demonstrates superior results across 15 image restoration datasets covering 4 tasks on 10 metrics against baselines.

Conclusion: RestoRect provides an effective solution to the performance-speed trade-off in image restoration through novel generative feature distillation that captures dynamic feature generation in transformer architectures.

Abstract: Current approaches for restoration of degraded images face a trade-off: high-performance models are slow for practical use, while fast models produce poor results. Knowledge distillation transfers teacher knowledge to students, but existing static feature matching methods cannot capture how modern transformer architectures dynamically generate features. We propose a novel Latent Rectified Flow Feature Distillation method for restoring degraded images called 'RestoRect'. We apply rectified flow to reformulate feature distillation as a generative process where students learn to synthesize teacher-quality features through learnable trajectories in latent space. Our framework combines Retinex decomposition with learnable anisotropic diffusion constraints and trigonometric color space polarization. We introduce a Feature Layer Extraction loss for robust knowledge transfer between different network architectures through cross-normalized transformer feature alignment with percentile-based outlier detection. RestoRect achieves better training stability and faster convergence and inference while preserving restoration quality, demonstrating superior results against baselines across 15 image restoration datasets covering 4 tasks on 10 metrics.

[247] PCICF: A Pedestrian Crossing Identification and Classification Framework

Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger

Main category: cs.CV

TL;DR: PCICF is a framework for systematically identifying and classifying vulnerable road user (VRU) situations to support operational design domain incident analysis for robotaxis, using space-filling curves to match real-world scenarios with a synthetic multi-pedestrian crossing dataset.

DetailsMotivation: Robotaxis need reliable VRU detection in urban environments, and end-to-end AI systems require high-quality data for training and evaluation. Current synthetic datasets like SMIRK only cover single-pedestrian scenarios, limiting their usefulness for complex real-world multi-pedestrian situations.

Method: Extended SMIRK dataset to create MoreSMIRK with structured multi-pedestrian crossing situations. Used space-filling curves to transform multi-dimensional scenario features into characteristic patterns, then matched these patterns with MoreSMIRK entries. Evaluated on PIE dataset with 150+ annotated pedestrian crossing videos.
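
As a rough illustration of the space-filling-curve step, the sketch below maps quantized multi-dimensional scenario features onto a 1D Morton (Z-order) index that could be matched against dictionary entries. The paper does not specify this particular curve or feature set, and the feature names are hypothetical.

```python
def interleave_bits(values, bits=8):
    """Interleave the bits of several small integers into one Morton code."""
    code = 0
    for b in range(bits):
        for i, v in enumerate(values):
            code |= ((v >> b) & 1) << (b * len(values) + i)
    return code

def scenario_to_sfc_index(features, bits=8):
    """Quantize features in [0, 1] and map them onto the Z-order curve."""
    quantized = [min(int(f * (2 ** bits - 1)), 2 ** bits - 1) for f in features]
    return interleave_bits(quantized, bits)

# Hypothetical features: (num_pedestrians_norm, mean_distance_norm, crossing_angle_norm)
index = scenario_to_sfc_index([0.3, 0.7, 0.5])
```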

Result: PCICF successfully identified and classified complex pedestrian crossings, including situations where groups of pedestrians merge or split. The framework showed potential for onboard robotaxi use due to computational efficiency of space-filling curves.

Conclusion: PCICF provides a systematic approach for VRU situation analysis that bridges synthetic and real-world data, supporting ODD incident analysis for autonomous vehicles. The open-source framework enables further research and practical applications.

Abstract: We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly computes vehicle control actions from multi-modal sensor data rather than handling perception alone, is on the rise. High-quality data is needed for systematically training and evaluating such systems within their ODD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF even has the potential to be used onboard robotaxis, for example for OOD detection. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results, at: https://github.com/Claud1234/PCICF

[248] LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing

Zhenghao Zhang, Ziying Zhang, Junchao Liao, Xiangyu Meng, Qiang Hu, Siyu Zhu, Xiaoyun Zhang, Long Qin, Weizhi Wang

Main category: cs.CV

TL;DR: LaTo is a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing that uses discrete landmark tokens instead of rigid geometric constraints, enabling better attribute control and identity preservation.

DetailsMotivation: Existing multimodal models for instruction-based face editing struggle with precise attribute control and identity preservation, especially when dealing with large expression/pose changes or inaccurate landmark estimates. Structural facial representations like landmarks are often treated as rigid constraints, which can degrade identity when conditions deviate significantly from the source.

Method: Proposes LaTo with three key innovations: (1) landmark tokenizer that quantizes raw landmark coordinates into discrete facial tokens, (2) location-mapped positional encoding and landmark-aware classifier-free guidance for flexible interactions among instruction, geometry, and appearance, and (3) landmark predictor using vision-language models with structured chain-of-thought for accurate target landmark inference. Also curates HFL-150K dataset with 150K real face pairs.
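
A landmark tokenizer of the kind described can be sketched as a simple grid quantization of normalized coordinates into discrete token ids; the codebook size and token layout below are assumptions for illustration, not the paper's configuration.

```python
import torch

def tokenize_landmarks(landmarks, grid_size=64):
    """landmarks: (N, 2) coordinates normalized to [0, 1] -> (N,) token ids."""
    coords = (landmarks.clamp(0, 1) * (grid_size - 1)).round().long()
    return coords[:, 1] * grid_size + coords[:, 0]      # row-major grid cell index

def detokenize_landmarks(tokens, grid_size=64):
    """Map token ids back to approximate normalized (x, y) coordinates."""
    y = torch.div(tokens, grid_size, rounding_mode="floor")
    x = tokens % grid_size
    return torch.stack([x, y], dim=-1).float() / (grid_size - 1)
```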

Result: LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. The method demonstrates superior performance in fine-grained face editing while maintaining identity.

Conclusion: LaTo addresses limitations in existing face editing methods by using landmark tokenization and flexible geometric constraints, enabling better identity preservation and semantic control. The approach represents an advancement in multimodal instruction-based image editing.

Abstract: Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapped positional encoding and a landmark-aware classifier-free guidance that jointly facilitate flexible yet decoupled interactions among instruction, geometry, and appearance, enabling strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. Code and dataset will be made publicly available upon acceptance.

[249] MuSLR: Multimodal Symbolic Logical Reasoning

Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu

Main category: cs.CV

TL;DR: MuSLR benchmark for multimodal symbolic logical reasoning shows VLMs struggle with formal logic, achieving only 46.8% accuracy. LogiCAM framework boosts performance by 14.13% by applying formal logical rules to multimodal inputs.

DetailsMotivation: Multimodal symbolic logical reasoning is critical for high-stakes applications like autonomous driving and medical diagnosis where deterministic reasoning prevents serious consequences. Current vision language models lack rigorous evaluation for formal logical reasoning capabilities across modalities.

Method: Created MuSLR benchmark with 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations with reasoning depths 2-9. Evaluated 7 state-of-the-art VLMs. Proposed LogiCAM modular framework that applies formal logical rules to multimodal inputs.

Result: The best model, GPT-4.1, achieved only 46.8% accuracy on MuSLR. LogiCAM boosted GPT-4.1’s Chain-of-Thought performance by 14.13%, with larger gains on complex logics like first-order logic. Error analysis showed that ~70% of failures stem from logical misalignment between modalities.

Conclusion: Current VLMs struggle with multimodal symbolic reasoning. LogiCAM demonstrates significant improvements by incorporating formal logical rules. The benchmark and framework provide key insights for future multimodal reasoning research.

Abstract: Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1’s Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.

[250] Entropy Guided Dynamic Patch Segmentation for Time Series Transformers

Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xiaoli Li, Chau Yuen

Main category: cs.CV

TL;DR: EntroPE introduces entropy-guided dynamic patching for time series transformers, using information theory to detect natural temporal transitions and preserve temporal coherence in patch construction.

DetailsMotivation: Existing patch-based transformers for time series use temporally-agnostic patch construction that fractures temporal coherence by splitting natural transitions across boundaries, disrupting short-term dependencies and weakening representation learning.

Method: Proposes Entropy-Guided Dynamic Patch Encoder (EntroPE) with two modules: 1) Entropy-based Dynamic Patcher (EDP) uses conditional entropy to detect natural temporal shifts and determine patch boundaries, 2) Adaptive Patch Encoder (APE) employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations.
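
The boundary-placement idea can be illustrated with a short sketch that places a patch boundary wherever the windowed entropy of the series shifts sharply; the paper's conditional-entropy estimator, window handling, and thresholds may differ.

```python
import numpy as np

def window_entropy(window, n_bins=16):
    """Shannon entropy of a histogram over one window of the series."""
    hist, _ = np.histogram(window, bins=n_bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def dynamic_patch_boundaries(series, window=16, threshold=0.5):
    """Place a patch boundary wherever windowed entropy shifts sharply."""
    boundaries = [0]
    prev_h = window_entropy(series[:window])
    for start in range(window, len(series) - window + 1, window):
        h = window_entropy(series[start:start + window])
        if abs(h - prev_h) > threshold:          # a natural temporal transition
            boundaries.append(start)
        prev_h = h
    boundaries.append(len(series))
    return boundaries
```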

Result: Extensive experiments on long-term forecasting, classification, and anomaly detection demonstrate improved accuracy and efficiency compared to existing methods.

Conclusion: Entropy-guided dynamic patching establishes a promising new paradigm for time series modeling that preserves temporal structure while retaining computational benefits of patching.

Abstract: Patch-based transformers have emerged as efficient architectures for long-horizon time series modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. We propose the Entropy-Guided Dynamic Patch Encoder (EntroPE), a temporally informed framework that detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. Extensive experiments on long-term forecasting, classification, and anomaly detection demonstrate that the proposed method improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at https://github.com/Sachithx/EntroPE.

[251] iPEAR: Iterative Pyramid Estimation with Attention and Residuals for Deformable Medical Image Registration

Heming Wu, Di Wang, Tai Ma, Peng Zhao, Yubin Xiao, Zhongke Wu, Xing-Ce Wang, Xuan Wu, You Zhou

Main category: cs.CV

TL;DR: iPEAR is a medical image registration network that addresses anatomical misalignment accumulation and adaptive iteration control using Fused Attention-Residual Modules and Threshold-Controlled Iterative strategy.

DetailsMotivation: Existing pyramid registration networks suffer from accumulated anatomical misalignments and lack mechanisms to dynamically determine optimization iterations for varying deformation requirements across different images, leading to degraded registration performance.

Method: Proposes iPEAR with: 1) Fused Attention-Residual Module (FARM) for decoding, combining attention and residual pathways to alleviate anatomical misalignment accumulation; 2) dual-stage Threshold-Controlled Iterative (TCI) strategy that adaptively determines optimization iterations by evaluating registration stability and convergence.
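
A TCI-style loop can be sketched as follows: refinement stops once the update to the deformation field is small (stability) and the similarity score stops improving (convergence). The stopping thresholds and the refine_step/score_fn callables are placeholders, not the paper's exact criteria.

```python
def tci_register(moving, fixed, refine_step, score_fn, max_iters=10,
                 stab_tol=1e-3, conv_tol=1e-4):
    """Iteratively refine a deformation field until stability and convergence
    thresholds are both met (placeholder criteria)."""
    flow, prev_score = None, float("inf")
    for _ in range(max_iters):
        new_flow = refine_step(moving, fixed, flow)           # one refinement pass
        score = score_fn(moving, fixed, new_flow)             # e.g. negative similarity
        stable = flow is not None and (new_flow - flow).abs().mean() < stab_tol
        converged = abs(prev_score - score) < conv_tol
        flow, prev_score = new_flow, score
        if stable and converged:                              # dual-stage check
            break
    return flow
```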

Result: Outperforms state-of-the-art registration networks in accuracy on three public brain MRI datasets and one abdomen CT dataset, while achieving comparable inference speed and model parameter size. Generalization and ablation studies validate FARM and TCI effectiveness.

Conclusion: iPEAR effectively addresses anatomical misalignment accumulation and adaptive iteration control in medical image registration, demonstrating superior performance across multiple datasets while maintaining efficiency.

Abstract: Existing pyramid registration networks may accumulate anatomical misalignments and lack an effective mechanism to dynamically determine the number of optimization iterations under varying deformation requirements across images, leading to degraded performance. To solve these limitations, we propose iPEAR. Specifically, iPEAR adopts our proposed Fused Attention-Residual Module (FARM) for decoding, which comprises an attention pathway and a residual pathway to alleviate the accumulation of anatomical misalignment. We further propose a dual-stage Threshold-Controlled Iterative (TCI) strategy that adaptively determines the number of optimization iterations for varying images by evaluating registration stability and convergence. Extensive experiments on three public brain MRI datasets and one public abdomen CT dataset show that iPEAR outperforms state-of-the-art (SOTA) registration networks in terms of accuracy, while achieving on-par inference speed and model parameter size. Generalization and ablation studies further validate the effectiveness of the proposed FARM and TCI.

[252] Enhancing Visual Prompting through Expanded Transformation Space and Overfitting Mitigation

Shohei Enomoto

Main category: cs.CV

TL;DR: ACAVP enhances visual prompting with affine and color transformations to improve expressivity while addressing overfitting through TrivialAugment, achieving state-of-the-art accuracy among VP methods with minimal computational overhead.

DetailsMotivation: Visual prompting (VP) offers parameter-efficient fine-tuning for vision models but suffers from limited expressivity (simple additive transformations) and overfitting issues, leading to lower accuracy compared to other adaptation methods.

Method: Proposes ACAVP with three complementary transformations: 1) affine transformation for task-specific prompt regions while preserving original image information, 2) color transformation for emphasizing task-relevant features, and 3) additive transformation. Also introduces TrivialAugment for data augmentation to combat overfitting.
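
The three transformations compose naturally as a small learnable preprocessing module. The sketch below shows one way to wire them together in PyTorch, with parameter shapes and initialization chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACAVPrompt(nn.Module):
    def __init__(self, image_size=224):
        super().__init__()
        # Affine parameters, initialized to a slight shrink so the border
        # becomes a task-specific prompt region.
        self.theta = nn.Parameter(torch.tensor([[0.9, 0.0, 0.0],
                                                [0.0, 0.9, 0.0]]))
        self.color_scale = nn.Parameter(torch.ones(3, 1, 1))    # per-channel gain
        self.color_shift = nn.Parameter(torch.zeros(3, 1, 1))   # per-channel bias
        self.additive = nn.Parameter(torch.zeros(3, image_size, image_size))

    def forward(self, x):                                # x: (N, 3, H, W) in [0, 1]
        grid = F.affine_grid(self.theta.expand(x.size(0), -1, -1), x.size(),
                             align_corners=False)
        x = F.grid_sample(x, grid, align_corners=False)  # affine transformation
        x = x * self.color_scale + self.color_shift      # color transformation
        return (x + self.additive).clamp(0, 1)           # additive prompt
```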

Result: ACAVP achieves state-of-the-art accuracy among VP methods across 12 diverse image classification datasets with two model architectures, surpasses linear probing in average accuracy, shows superior robustness to distribution shifts, and maintains minimal computational overhead during inference.

Conclusion: ACAVP addresses key limitations of visual prompting through enhanced expressive transformations and effective data augmentation, demonstrating that appropriate augmentation is universally beneficial for VP training while maintaining parameter efficiency.

Abstract: Visual prompting (VP) has emerged as a promising parameter-efficient fine-tuning approach for adapting pre-trained vision models to downstream tasks without modifying model parameters. Despite offering advantages like negligible computational overhead and compatibility with black-box models, conventional VP methods typically achieve lower accuracy than other adaptation approaches. Our analysis reveals two critical limitations: the restricted expressivity of simple additive transformation and a tendency toward overfitting when the parameter count increases. To address these challenges, we propose ACAVP (Affine, Color, and Additive Visual Prompting), which enhances VP’s expressive power by introducing complementary transformation operations: affine transformation for creating task-specific prompt regions while preserving original image information, and color transformation for emphasizing task-relevant visual features. Additionally, we identify that overfitting is a critical issue in VP training and introduce TrivialAugment as an effective data augmentation, which not only benefits our approach but also significantly improves existing VP methods, with performance gains of up to 12 percentage points on certain datasets. This demonstrates that appropriate data augmentation is universally beneficial for VP training. Extensive experiments across twelve diverse image classification datasets with two different model architectures demonstrate that ACAVP achieves state-of-the-art accuracy among VP methods, surpasses linear probing in average accuracy, and exhibits superior robustness to distribution shifts, all while maintaining minimal computational overhead during inference. Our code is available at https://github.com/s-enmt/ACAVP.

[253] ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: ViSurf is a unified single-stage training paradigm that combines supervised fine-tuning and reinforcement learning with verifiable rewards for vision-language models, addressing limitations of sequential approaches.

DetailsMotivation: Current post-training methods for LVLMs have limitations: SFT leads to sub-optimal performance, RLVR is constrained by model's internal knowledge, and sequential SFT→RLVR pipelines cause computational overhead and catastrophic forgetting.

Method: ViSurf integrates SFT and RLVR into a unified single-stage framework by injecting ground-truth labels directly into RLVR rollouts, enabling simultaneous external supervision and internal reinforcement. Includes three novel reward control strategies for training stability.

Result: ViSurf consistently outperforms standalone SFT, RLVR, and traditional two-stage pipelines across diverse benchmarks. In-depth analysis validates the derivation and design principles.

Conclusion: ViSurf provides an effective unified approach for post-training LVLMs that overcomes limitations of existing methods while maintaining training stability and optimization.

Abstract: Post-training Large Vision-and-Language Models (LVLMs) typically involves Supervised Fine-Tuning (SFT) for knowledge injection or Reinforcement Learning with Verifiable Rewards (RLVR) for performance enhancement. However, SFT often leads to sub-optimal performance, while RLVR remains constrained by the model’s internal knowledge base. While a sequential SFT → RLVR pipeline can be used, it introduces significant computational overhead and suffers from catastrophic forgetting. To address these limitations, we propose ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning), a unified, single-stage paradigm that integrates the strengths of both SFT and RLVR. By analyzing their training objectives, we establish a unified framework that injects ground-truth labels directly into RLVR rollouts, facilitating simultaneous external supervision and internal reinforcement. Furthermore, we introduce three novel reward control strategies to ensure training stability and optimization. Extensive experiments demonstrate that ViSurf consistently outperforms standalone SFT, RLVR, and the traditional two-stage pipeline across diverse benchmarks. In-depth analysis corroborates these findings, validating the derivation and design principles of ViSurf.

[254] InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang

Main category: cs.CV

TL;DR: InternSVG is a unified multimodal large language model framework for SVG understanding, editing, and generation, featuring a comprehensive dataset (SAgoge), benchmark (SArena), and model with specialized training strategies.

DetailsMotivation: Address challenges in SVG modeling including fragmented datasets, limited transferability across tasks, and difficulty handling structural complexity by leveraging MLLMs' transfer and generalization capabilities.

Method: Propose InternSVG family with: 1) SAgoge dataset (largest multimodal SVG dataset covering static/dynamic content), 2) SArena benchmark (comprehensive task definitions), 3) InternSVG model with SVG-specific tokens, subword-based embedding initialization, and two-stage training from simple to complex SVGs.

Result: InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts on SArena and prior benchmarks, demonstrating positive transfer across SVG tasks.

Conclusion: The unified MLLM approach effectively addresses SVG modeling challenges, enabling comprehensive SVG understanding, editing, and generation through integrated data-benchmark-model design.

Abstract: General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.

[255] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

Noam Issachar, Guy Yariv, Sagie Benaim, Yossi Adi, Dani Lischinski, Raanan Fattal

Main category: cs.CV

TL;DR: DyPE enables pre-trained diffusion transformers to generate ultra-high-resolution images beyond training resolution without additional training or sampling cost by dynamically adjusting positional encoding frequency spectrum during diffusion steps.

DetailsMotivation: Training diffusion transformers at ultra-high resolutions is extremely costly due to quadratic scaling of self-attention with image tokens. Existing methods struggle to generate images at resolutions far beyond training data without additional computational overhead.

Method: Dynamic Position Extrapolation (DyPE) is a training-free method that adjusts the model’s positional encoding at each diffusion step to match the frequency spectrum with the current generative stage, leveraging the spectral progression inherent to diffusion processes.
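
One way to picture this is a timestep-dependent rescaling of rotary-style positional-encoding frequencies, as in the sketch below; the linear schedule and scale bound are illustrative assumptions, not the schedule used by DyPE.

```python
import torch

def dynamic_rope_freqs(dim, positions, t, base=10000.0, max_scale=4.0):
    """Return cos/sin tables whose positions are stretched more at early
    (noisy) timesteps, where only low frequencies matter, and relaxed back
    toward the trained resolution as t approaches 0."""
    scale = 1.0 + (max_scale - 1.0) * t                      # t in [0, 1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(positions.float() / scale, inv_freq)
    return torch.cos(angles), torch.sin(angles)
```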

Result: DyPE enables generation of images at dramatically higher resolutions (e.g., 16 million pixels using FLUX) with state-of-the-art fidelity, consistently improving performance on multiple benchmarks, with gains becoming more pronounced at higher resolutions.

Conclusion: DyPE provides an effective, training-free solution for ultra-high-resolution image generation with diffusion transformers, overcoming computational limitations while maintaining or improving image quality.

Abstract: Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism’s quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model’s positional encoding at each diffusion step, matching its frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at https://noamissachar.github.io/DyPE/.

[256] FreeFuse: Multi-Subject LoRA Fusion via Adaptive Token-Level Routing at Test Time

Yaoli Liu, Yao-Xiang Ding, Kun Zhou

Main category: cs.CV

TL;DR: FreeFuse is a training-free framework for multi-subject text-to-image generation that automatically fuses multiple subject LoRAs without retraining, using adaptive token-level routing to spatially confine subjects to their target regions.

DetailsMotivation: Prior methods for multi-subject generation require retraining LoRAs to address feature conflicts between subjects. The authors identify that simply spatially confining each subject to its target region and preventing interference from other LoRAs is sufficient, eliminating the need for complex retraining.

Method: FreeFuse uses Adaptive Token-Level Routing during inference with FreeFuseAttn mechanism. It exploits flow matching models’ intrinsic semantic alignment to dynamically match subject-specific tokens to corresponding spatial regions at early denoising timesteps, without needing external segmentors or user-defined masks.

Result: Extensive experiments show FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity for multi-subject text-to-image generation.

Conclusion: FreeFuse provides a highly practical training-free solution for multi-subject generation that requires no additional training, model modifications, or spatial conditions, seamlessly integrating into standard workflows with just subject activation words.

Abstract: This paper proposes FreeFuse, a training-free framework for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to prior studies that focus on retraining LoRA to alleviate feature conflicts, our analysis reveals that simply spatially confining the subject LoRA’s output to its target region and preventing other LoRAs from directly intruding into this area is sufficient for effective mitigation. Accordingly, we implement Adaptive Token-Level Routing during the inference phase. We introduce FreeFuseAttn, a mechanism that exploits the flow matching model’s intrinsic semantic alignment to dynamically match subject-specific tokens to their corresponding spatial regions at early denoising timesteps, thereby bypassing the need for external segmentors. FreeFuse distinguishes itself through high practicality: it requires no additional training, model modifications, or user-defined spatial conditions such as masks. Users need only provide subject activation words to achieve seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both identity preservation and compositional fidelity. Our code is available at https://github.com/yaoliliu/FreeFuse.

[257] Hierarchical Transformers for Unsupervised 3D Shape Abstraction

Aditya Vora, Lily Goli, Andrea Tagliasacchi, Hao Zhang

Main category: cs.CV

TL;DR: HiT introduces a hierarchical neural field representation for 3D shapes using a hierarchical transformer with compressed codebooks to learn general hierarchies across shape categories in an unsupervised manner.

DetailsMotivation: Previous works constrain hierarchical representations to fixed structures (e.g., binary trees), limiting their ability to capture complex, data-driven hierarchies across diverse shape categories. There's a need for more flexible, unsupervised methods that can learn meaningful containment relationships without predefined structural constraints.

Method: HiT uses a hierarchical transformer where each level learns parent-child relationships using a compressed codebook. The method imposes no fixed hierarchical structure except limiting total nodes per level, allowing automatic inference of hierarchical structure from data across multiple shape categories.

Result: The model successfully captures meaningful containment relationships between parent and child nodes when trained with reconstruction loss. It demonstrates effectiveness through unsupervised shape segmentation across all 55 ShapeNet categories, segmenting shapes into multiple levels of granularity.

Conclusion: HiT provides a flexible, unsupervised approach to learning general hierarchical representations of 3D shapes that can capture complex structures across diverse categories, outperforming previous methods with fixed hierarchical constraints.

Abstract: We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and to represent more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.

[258] Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

Loukas Sfountouris, Giannis Daras, Paris Giampouras

Main category: cs.CV

TL;DR: REPA (Representation Alignment) improves inverse problem solving by aligning diffusion/flow model representations with DINOv2 visual encoder features, enhancing reconstruction quality and perceptual realism.

DetailsMotivation: Previous work showed that aligning internal representations of generative models with pretrained self-supervised encoders improves convergence and sample quality. The authors extend this to inverse problems where generative models serve as priors, aiming to enhance reconstruction quality and perceptual fidelity.

Method: Proposes REPA (Representation Alignment) that aligns diffusion or flow-based model representations with DINOv2 visual encoder features during inference for inverse problems. Uses variational approach to minimize divergence in DINOv2 embedding space, steering latent diffusion states toward clean image representations.
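
In code, the guidance term amounts to adding the gradient of an alignment loss computed in the DINOv2 embedding space, roughly as sketched below. The cosine-based loss, weighting, and the assumption that the solver state and the clean-image estimate live in the same space are simplifications; the paper's solver integration is more involved.

```python
import torch
import torch.nn.functional as F

def repa_guidance_loss(x0_estimate, target_features, dino):
    """Alignment loss between DINOv2 features of the current estimate and
    approximate target features."""
    feats = dino(x0_estimate)
    return 1.0 - F.cosine_similarity(feats, target_features, dim=-1).mean()

def repa_guided_update(x_t, x0_estimate, target_features, dino, weight=0.1):
    """Nudge the solver state with the gradient of the alignment loss."""
    x0_estimate = x0_estimate.detach().requires_grad_(True)
    loss = repa_guidance_loss(x0_estimate, target_features, dino)
    grad = torch.autograd.grad(loss, x0_estimate)[0]
    return x_t - weight * grad
```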

Result: REPA consistently improves reconstruction quality across super-resolution, box inpainting, Gaussian deblurring, and motion deblurring tasks. Provides efficiency gains by reducing required discretization steps while enhancing perceptual realism.

Conclusion: REPA offers a generalizable approach to enhance inverse problem solving by leveraging representation alignment between generative models and visual encoders, with theoretical grounding and practical benefits across multiple applications.

Abstract: Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how under certain regularity assumptions REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers and providing extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, confirming that our method consistently improves reconstruction quality while also providing efficiency gains by reducing the number of required discretization steps.

[259] Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Main category: cs.CV

TL;DR: Semantic-Aware Hijacking attack on MLLMs using universal perturbations that act as semantic routers to redirect multiple decisions simultaneously to attacker-defined targets.

DetailsMotivation: Multimodal LLMs are increasingly deployed in stateless systems like autonomous driving and robotics, creating security vulnerabilities. Current attacks focus on single-target perturbations, but this paper investigates hijacking multiple decisions simultaneously using a single universal perturbation.

Method: Introduces Semantic-Aware Universal Perturbation (SAUP) that acts as a semantic router. Conducts theoretical and empirical analysis of geometric properties in latent space, proposes Semantic-Oriented (SORT) optimization strategy, and creates a new dataset with fine-grained semantics for evaluation.

Result: Extensive experiments on three representative MLLMs demonstrate fundamental feasibility, achieving 66% attack success rate over five targets using a single frame against Qwen model.

Conclusion: Semantic-Aware Hijacking is a novel threat to MLLMs in stateless systems, showing that single universal perturbations can effectively hijack multiple decisions simultaneously, highlighting significant security vulnerabilities.

Abstract: Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, “actively” perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

[260] MOTION: ML-Assisted On-Device Low-Latency Motion Recognition

Veeramani Pugazhenthi, Wei-Hsiang Chu, Junwei Lu, Jadyn N. Miyahira, Mahdi Eslamimehr, Pratik Satam, Rozhin Yasaei, Soheil Salehi

Main category: cs.CV

TL;DR: Efficient on-device gesture recognition using triaxial accelerometer data and AutoML pipelines for medical monitoring applications

DetailsMotivation: Need for low-latency gesture recognition on tiny embedded devices for medical monitoring applications like fall detection, rehabilitation tracking, and patient supervision, requiring fast response times while avoiding false alarms

Method: Use AutoML pipelines to extract important features from triaxial accelerometer data segments, train multiple lightweight ML algorithms using extracted features, implement on WeBe Band wearable device with capable MCU for on-device processing

Result: Neural network provided best balance between accuracy, latency, and memory use; reliable real-time gesture recognition achieved on WeBe Band with potential for secure, fast-response medical monitoring solutions

Conclusion: Efficient motion-based models using only triaxial accelerometer sensors can enable reliable real-time gesture recognition on embedded devices for medical monitoring applications

Abstract: The use of tiny devices capable of low-latency gesture recognition is gaining momentum in everyday human-computer interaction and especially in medical monitoring fields. Embedded solutions such as fall detection, rehabilitation tracking, and patient supervision require fast and efficient tracking of movements while avoiding unwanted false alarms. This study presents an efficient approach to building motion-based models using only triaxial accelerometer sensors. We explore the capability of AutoML pipelines to extract the most important features from the data segments. This approach also involves training multiple lightweight machine learning algorithms using the extracted features. We use WeBe Band, a multi-sensor wearable device equipped with a sufficiently powerful MCU to perform gesture recognition entirely on the device. Of the models explored, we found that the neural network provided the best balance between accuracy, latency, and memory use. Our results also demonstrate that reliable real-time gesture recognition can be achieved on the WeBe Band, with great potential for real-time medical monitoring solutions that require a secure and fast response time.

[261] RAVE: Rate-Adaptive Visual Encoding for 3D Gaussian Splatting

Hoang-Nhat Tran, Francesco Di Sario, Gabriele Spadaro, Giuseppe Valenzise, Enzo Tartaglione

Main category: cs.CV

TL;DR: Flexible compression scheme for 3D Gaussian Splatting that supports interpolation at any rate between bounds without retraining, enabling dynamic rate control for immersive applications.

DetailsMotivation: 3D Gaussian Splatting enables real-time photorealistic rendering but suffers from large memory requirements and costly training. Existing compression approaches operate at fixed rates, limiting adaptability to varying bandwidth and device constraints in practical deployments.

Method: Proposes a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. The method is computationally lightweight and requires no retraining for any rate, preserving rendering quality across a broad range of operating points.

Result: Experiments demonstrate efficient, high-quality compression with dynamic rate control, making it suitable for practical deployment in immersive applications. The approach achieves preservation of rendering quality across various compression rates.

Conclusion: The proposed flexible compression scheme addresses the limitations of fixed-rate approaches in 3DGS compression, enabling adaptive rate control for immersive multimedia applications while maintaining rendering quality and computational efficiency.

Abstract: Recent advances in neural scene representations have transformed immersive multimedia, with 3D Gaussian Splatting (3DGS) enabling real-time photorealistic rendering. Despite its efficiency, 3DGS suffers from large memory requirements and costly training procedures, motivating efforts toward compression. Existing approaches, however, operate at fixed rates, limiting adaptability to varying bandwidth and device constraints. In this work, we propose a flexible compression scheme for 3DGS that supports interpolation at any rate between predefined bounds. Our method is computationally lightweight, requires no retraining for any rate, and preserves rendering quality across a broad range of operating points. Experiments demonstrate that the approach achieves efficient, high-quality compression while offering dynamic rate control, making it suitable for practical deployment in immersive applications. The code is available at https://github.com/inspiros/RAVE.

[262] SSCATeR: Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling for Real-Time 3D Object Detection in LiDAR Point Clouds

Alexander Dow, Manduhu Manduhu, Matheus Santos, Ben Bartlett, Gerard Dooly, James Riordan

Main category: cs.CV

TL;DR: SSCATeR algorithm uses temporal data recycling and scatter-based convolutions to efficiently process only changing regions in LiDAR point clouds, achieving 6.61x speedup while maintaining accuracy.

DetailsMotivation: Traditional LiDAR processing treats each frame independently, wasting computation on unchanged regions. There's a need for more efficient processing that exploits temporal continuity in LiDAR scanning.

Method: Uses sliding time windows with short strides, stores convolution results between passes, and extends scatter-based convolutions to allow data reuse. Focuses only on changing parts of point clouds using sparse scatter-based convolution with temporal data recycling (SSCATeR).
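
The caching logic can be sketched as follows. For clarity the sketch runs a dense convolution and only illustrates the masking and reuse of cached outputs; the paper's scatter-based kernel instead restricts computation to the changed sites on a sparse grid. A same-padding convolution with an odd kernel is assumed so that input and output resolutions match.

```python
import torch
import torch.nn.functional as F

class RecyclingConv:
    """Wraps a same-padding nn.Conv2d and reuses cached outputs wherever the
    input grid did not change between passes (dense illustration only)."""

    def __init__(self, conv):
        self.conv = conv
        self.cached_out = None
        self.prev_grid = None

    def __call__(self, grid):                        # grid: (N, C, H, W)
        new_out = self.conv(grid)                    # the real kernel computes this
                                                     # only at the changed sites
        if self.cached_out is not None:
            changed = (grid != self.prev_grid).any(dim=1, keepdim=True).float()
            # A changed input cell influences a neighbourhood of outputs, so
            # dilate the mask by the kernel's receptive field.
            k = self.conv.kernel_size[0]
            changed = F.max_pool2d(changed, k, stride=1, padding=k // 2) > 0
            new_out = torch.where(changed, new_out, self.cached_out)
        self.cached_out = new_out.detach()
        self.prev_grid = grid.detach().clone()
        return new_out
```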

Result: Achieves identical feature maps to traditional sparse convolution techniques with up to 6.61-fold reduction in processing time. Maintains accuracy while significantly improving computational efficiency.

Conclusion: SSCATeR enables efficient LiDAR processing by exploiting temporal sparsity, offering substantial speed improvements without sacrificing detection accuracy, making it suitable for real-time applications.

Abstract: This work leverages the continuous sweeping motion of LiDAR scanning to concentrate object detection efforts on specific regions that receive a change in point data from one frame to another. We achieve this by using a sliding time window with short strides and consider the temporal dimension by storing convolution results between passes. This allows us to ignore unchanged regions, significantly reducing the number of convolution operations per forward pass without sacrificing accuracy. This data reuse scheme introduces extreme sparsity to detection data. To exploit this sparsity, we extend our previous work on scatter-based convolutions to allow for data reuse, and as such propose Sparse Scatter-Based Convolution Algorithm with Temporal Data Recycling (SSCATeR). This operation treats incoming LiDAR data as a continuous stream and acts only on the changing parts of the point cloud. By doing so, we achieve the same results with as much as a 6.61-fold reduction in processing time. Our test results show that the feature maps output by our method are identical to those produced by traditional sparse convolution techniques, whilst greatly increasing the computational efficiency of the network.

[263] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

Main category: cs.CV

TL;DR: MultiHateLoc: A weakly-supervised framework for temporal localization of multimodal hate speech in videos using modality-aware encoders, dynamic cross-modal fusion, and contrastive alignment.

DetailsMotivation: The rapid growth of video content on platforms like TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research focuses on video-level classification, leaving temporal localization largely unaddressed, especially under weak supervision with only video-level labels.

Method: MultiHateLoc incorporates: (1) modality-aware temporal encoders to model heterogeneous sequential patterns with tailored text preprocessing; (2) dynamic cross-modal fusion to adaptively emphasize the most informative modality at each moment plus cross-modal contrastive alignment; (3) modality-aware MIL objective to identify discriminative segments under video-level supervision.
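
The MIL objective under video-level labels can be sketched with simple top-k pooling of per-frame scores, as below; the modality-aware weighting described in the paper is omitted, and the pooling ratio is an assumption.

```python
import torch
import torch.nn.functional as F

def mil_loss(frame_logits, video_label, k_ratio=0.1):
    """frame_logits: (T,) per-frame hate scores; video_label: 0 or 1."""
    k = max(1, int(k_ratio * frame_logits.numel()))
    video_score = frame_logits.topk(k).values.mean()       # top-k mean pooling
    target = torch.tensor(float(video_label), device=frame_logits.device)
    return F.binary_cross_entropy_with_logits(video_score, target)
```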

Result: Experiments on HateMM and MultiHateClip datasets show state-of-the-art performance in the localization task, producing fine-grained, interpretable frame-level predictions despite relying solely on coarse video-level labels.

Conclusion: MultiHateLoc is the first framework designed for weakly-supervised multimodal hate localization, effectively addressing the challenge of identifying when hateful segments occur in videos by capturing cross-modal and temporal dynamics.

Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.

[264] Uni-Parser Technical Report

Xi Fang, Haoyi Tao, Shuwen Yang, Chaozheng Huang, Suyang Zhong, Haocheng Lu, Han Lyu, Xinyu Li, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: Uni-Parser is an industrial-grade document parsing engine for scientific literature and patents that uses modular multi-expert architecture to preserve cross-modal alignments across text, equations, tables, figures, and chemical structures.

DetailsMotivation: Existing document parsing methods often use pipeline-based approaches that lose fine-grained cross-modal alignments and lack scalability for large-scale scientific literature and patent processing needs.

Method: Modular, loosely coupled multi-expert architecture with adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes for holistic or modality-specific parsing.

Result: Achieves processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages with robust accuracy.

Conclusion: Uni-Parser provides scalable, efficient document parsing for scientific literature and patents, supporting downstream applications like literature retrieval, chemical structure extraction, and training data curation for AI4Science models.

Abstract: This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.

[265] City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs

Dwip Dalal, Utkarsh Mishra, Narendra Ahuja, Nebojsa Jojic

Main category: cs.CV

TL;DR: CityNav benchmark evaluates MLLMs’ visual navigation in real cities using only visual inputs, requiring landmark recognition and spatial reasoning without environmental annotations.

DetailsMotivation: Current MLLM evaluation benchmarks are too language-centric or simulation-based, lacking assessment of nuanced knowledge-intensive reasoning needed for real-world embodied tasks like city navigation.

Method: Introduces Sparsely Grounded Visual Navigation task and CityNav benchmark across four global cities. Agents navigate 50+ decision points using only visual inputs. Proposes Verbalization of Path (VoP) technique that grounds agent reasoning by probing city-scale cognitive maps from MLLMs.

Result: State-of-the-art MLLMs, reasoning techniques (GEPA, chain-of-thought, reflection) and baseline PReP significantly underperform. VoP substantially enhances navigation success by explicitly grounding agent’s internal reasoning.

Conclusion: Current MLLMs struggle with knowledge-intensive real-world navigation tasks. VoP demonstrates that explicitly grounding reasoning in city-scale cognitive maps can significantly improve performance, highlighting the need for better multimodal reasoning capabilities.

Abstract: Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs, reasoning techniques (e.g., GEPA, chain-of-thought, reflection) and competitive baseline PReP significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent’s internal reasoning by probing city-scale cognitive maps (key landmarks and directions toward the destination) from the MLLM, substantially enhancing navigation success. Project Webpage: https://dwipddalal.github.io/AgentNav/

[266] Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Zhongwei Chen, Hai-Jun Rong, Zhao-Xu Yang, Guoqi Li

Main category: cs.CV

TL;DR: SpikeViMFormer is a spiking neural network framework for drone-view geo-localization that uses spike-driven transformers and attention mechanisms to achieve energy-efficient performance comparable to traditional ANNs.

DetailsMotivation: Traditional ANN-based drone-view geo-localization methods have high power consumption, while SNNs offer low-power spike-driven computation but haven't been thoroughly explored for DVGL. SNNs also suffer from information loss and difficulty learning long-range dependencies in heterogeneous visual data alignment.

Method: Proposes SpikeViMFormer with: 1) lightweight spike-driven transformer backbone for coarse feature extraction, 2) spike-driven selective attention block for feature enhancement, 3) spike-driven hybrid state space block for long-range dependencies, and 4) hierarchical re-ranking alignment learning strategy to optimize the backbone.

Result: Outperforms state-of-the-art SNNs and achieves competitive performance compared to advanced ANNs while being more energy-efficient.

Conclusion: SpikeViMFormer demonstrates the potential of SNNs for drone-view geo-localization, offering energy-efficient alternatives to traditional ANNs while maintaining competitive performance through novel spike-driven attention and state space mechanisms.

Abstract: Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive performance. Our code is available at https://github.com/ISChenawei/SpikeViMFormer

[267] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer, Lin Zhang, Shilong Liu, Ruiyi Zhang, Lifu Huang, Vlad I. Morariu, Tong Sun

Main category: cs.CV

TL;DR: MiLDEAgent is a reasoning-based framework for editing multi-layer design documents (posters, etc.) from natural language instructions, combining RL-trained multimodal reasoning with targeted image editing.

DetailsMotivation: Real-world design documents are multi-layered (decoration, text, images), but prior work focuses on single-layer image editing or multi-layer generation, lacking the reasoning needed to identify relevant layers and coordinate modifications in multi-layer documents.

Method: MiLDEAgent combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. Also introduces MiLDEBench (20K+ design documents with editing instructions) and MiLDEEval evaluation protocol across four dimensions.

Result: Extensive experiments on 14 open-source and 2 closed-source models show existing approaches fail to generalize. MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming open-source baselines and attaining performance comparable to closed-source models.

Conclusion: MiLDEAgent establishes the first strong baseline for multi-layer document editing, addressing the gap in layer-aware reasoning for design document editing from natural language instructions.

Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

[268] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms

Xiaoxu Ma, Runhao Li, Xiangbo Zhang, Zhenyu Weng

Main category: cs.CV

TL;DR: UniHash is a dual-branch hashing framework that unifies pointwise and pairwise training paradigms to achieve balanced retrieval performance across both seen and unseen image categories.

DetailsMotivation: Existing deep hashing methods are limited to single training paradigms - pointwise methods excel on seen categories but pairwise methods generalize better to unseen categories. There's a need for a unified approach that balances performance across both scenarios.

Method: Proposes UniHash with two complementary branches: a center-based pointwise branch and a pairwise branch. Introduces bidirectional knowledge transfer via mutual learning loss and a Split-Merge Mixture of Hash Experts (SM-MoH) module for cross-branch hash representation exchange.

Result: Extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate state-of-the-art performance in both seen and unseen image retrieval scenarios. Theoretical analysis supports the effectiveness of the approach.

Conclusion: UniHash successfully unifies pointwise and pairwise paradigms to achieve balanced retrieval performance across seen and unseen categories, addressing a key limitation in existing deep hashing methods.

Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.
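The bidirectional knowledge transfer between the two branches hinges on a mutual learning loss that aligns their hash representations. Below is a minimal sketch of one common mutual-learning formulation, a symmetric KL divergence between the branches' predicted code distributions; the function and argument names are assumptions, and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_pointwise, logits_pairwise, temperature=1.0):
    """Symmetric KL between the two branches' code distributions.

    A generic mutual-learning sketch: each branch is nudged toward the
    other's predictions, enabling bidirectional knowledge transfer.
    """
    log_p = F.log_softmax(logits_pointwise / temperature, dim=-1)
    log_q = F.log_softmax(logits_pairwise / temperature, dim=-1)
    kl_p_to_q = F.kl_div(log_p, log_q.exp(), reduction="batchmean")
    kl_q_to_p = F.kl_div(log_q, log_p.exp(), reduction="batchmean")
    return 0.5 * (kl_p_to_q + kl_q_to_p)
```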

[269] Reliable Deep Learning for Small-Scale Classifications: Experiments on Real-World Image Datasets from Bangladesh

Alfe Suny, MD Sakib Ul Islam, Md. Imran Hossain

Main category: cs.CV

TL;DR: Compact CNN achieves high accuracy on diverse Bangladeshi image datasets with efficient convergence and low computational cost.

DetailsMotivation: Complex CNN architectures often overfit on small datasets, so the authors investigate whether streamlined CNN architectures can perform well on real-world image classification tasks with limited data.

Method: The authors evaluate a compact convolutional neural network across five publicly available image datasets from Bangladesh covering urban encroachment, vehicle detection, road damage, and agricultural crops.

Result: The compact CNN demonstrates high classification accuracy, efficient convergence, and low computational overhead across all datasets. Quantitative metrics and saliency analyses show the model effectively captures discriminative features and generalizes robustly.

Conclusion: Streamlined CNN architectures are suitable for small-class image classification tasks, offering good performance with reduced complexity and computational requirements.

Abstract: Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.

[270] Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Zijie Lou, Xiangwei Feng, Jiaxin Wang, Jiangtao Yao, Fei Che, Tianbao Liu, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Ting Liu

Main category: cs.CV

TL;DR: Video object removal reformulated as video-to-video translation via stochastic bridge model, using input video as structural prior instead of starting from noise.

DetailsMotivation: Existing diffusion-based video object removal methods discard rich structural priors from input videos, leading to incomplete removal or implausible content generation that violates scene logic.

Method: Proposes stochastic bridge model establishing direct path from source video (with objects) to target video (objects removed). Introduces adaptive mask modulation to balance background fidelity with generative flexibility for large object removal.

Result: Extensive experiments show significant outperformance over existing methods in both visual quality and temporal consistency.

Conclusion: Video-to-video translation via stochastic bridge model effectively leverages input video priors for precise object removal while maintaining logical consistency with surrounding environment.

Abstract: Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene’s physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.
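To make the contrast with noise-initialized diffusion concrete, the sketch below shows one generic bridge-matching training step: the intermediate state interpolates between the source latent (objects present) and the target latent (objects removed), with Brownian-bridge noise whose variance vanishes at both endpoints. This is an illustrative formulation under assumed notation, not the paper's exact objective or parameterization.

```python
import torch
import torch.nn.functional as F

def bridge_training_step(model, x_src, x_tgt, sigma=1.0):
    """Generic stochastic-bridge training step (illustrative only).

    x_src: latent of the input video with objects, used as structural prior
    x_tgt: latent of the object-removed target video
    """
    b = x_src.shape[0]
    # Broadcastable time variable in (0, 1), one per sample
    t = torch.rand(b, *([1] * (x_src.dim() - 1)), device=x_src.device)
    mean = (1 - t) * x_src + t * x_tgt          # straight path src -> tgt
    std = sigma * torch.sqrt(t * (1 - t))       # bridge noise, zero at endpoints
    x_t = mean + std * torch.randn_like(x_src)
    pred = model(x_t, t)                        # model predicts the target latent
    return F.mse_loss(pred, x_tgt)
```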

[271] Practical Insights into Semi-Supervised Object Detection Approaches

Chaoxin Wang, Bharaneeshwar Balasubramaniyam, Anurag Sangem, Nicolais Guevara, Doina Caragea

Main category: cs.CV

TL;DR: Comprehensive comparison of three state-of-the-art semi-supervised object detection methods (MixPL, Semi-DETR, Consistent-Teacher) across different labeled data regimes using MS-COCO, Pascal VOC, and a custom Beetle dataset.

DetailsMotivation: Address the challenge of learning in data-scarce settings by evaluating how semi-supervised object detection methods perform with varying amounts of labeled data, particularly relevant for few-shot learning scenarios.

Method: Comparative evaluation of three SSOD approaches (MixPL, Semi-DETR, Consistent-Teacher) on standard benchmarks (MS-COCO, Pascal VOC) and a custom Beetle dataset, analyzing performance across different labeled data quantities.

Result: Findings reveal trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes and specialized datasets with fewer object categories.

Conclusion: The study provides practical guidance for selecting appropriate semi-supervised object detection methods based on available labeled data, computational constraints, and dataset characteristics.

Abstract: Learning in data-scarce settings has recently gained significant attention in the research community. Semi-supervised object detection (SSOD) aims to improve detection performance by leveraging a large number of unlabeled images alongside a limited number of labeled images (a.k.a. few-shot learning). In this paper, we present a comprehensive comparison of three state-of-the-art SSOD approaches, including MixPL, Semi-DETR and Consistent-Teacher, with the goal of understanding how performance varies with the number of labeled images. We conduct experiments using the MS-COCO and Pascal VOC datasets, two popular object detection benchmarks which allow for standardized evaluation. In addition, we evaluate the SSOD approaches on a custom Beetle dataset which enables us to gain insights into their performance on specialized datasets with a smaller number of object categories. Our findings highlight the trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes.

[272] SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

Main category: cs.CV

TL;DR: Proposes a video-to-binaural audio generation framework with spatial awareness, using a new large-scale video-binaural audio dataset and visual-guided audio spatialization module.

DetailsMotivation: Current video-to-audio generation focuses on semantic and temporal alignment but neglects spatial perception and immersive quality due to reliance on mono audio datasets lacking binaural spatial information.

Method: Two key contributions: 1) BinauralVGGSound - first large-scale video-binaural audio dataset, 2) End-to-end spatial audio generation framework with visual-guided audio spatialization module that explicitly models spatial features.

Result: Approach substantially outperforms state-of-the-art models in spatial fidelity and delivers more immersive auditory experience while maintaining semantic and temporal consistency.

Conclusion: The framework successfully addresses the spatial perception gap in video-to-audio generation, enabling realistic spatial attributes and layered spatial depth in synthesized audio.

Abstract: While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models’ reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial attributes and layered spatial depth while maintaining semantic and temporal alignment. Experiments show that our approach substantially outperforms state-of-the-art models in spatial fidelity and delivers a more immersive auditory experience, without sacrificing temporal or semantic consistency. The demo page can be accessed at https://github.com/renlinjie868-web/SpatialV2A.

[273] DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration

Dominik Rößle, Xujun Xie, Adithya Mohan, Venkatesh Thirugnana Sambandham, Daniel Cremers, Torsten Schön

Main category: cs.CV

TL;DR: DrivIng: A large-scale multimodal autonomous driving dataset with complete digital twin for systematic testing and simulation

DetailsMotivation: Existing autonomous driving datasets lack high-fidelity digital twins, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. There's a need for datasets that enable 1-to-1 transfer of real traffic into simulation.

Method: Created a large-scale multimodal dataset with complete geo-referenced digital twin of an ~18 km route spanning urban, suburban, and highway segments. Includes continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization across day, dusk, and night conditions.

Result: Dataset contains ~1.2 million annotated instances with 3D bounding boxes and track IDs across 12 classes at 10 Hz. Enables 1-to-1 transfer of real traffic into simulation while preserving agent interactions. Benchmark results with state-of-the-art perception models provided.

Conclusion: DrivIng addresses the gap in existing datasets by providing both real-world multimodal data and a complete digital twin, enabling realistic and flexible scenario testing for autonomous driving perception research.

Abstract: Perception is a cornerstone of autonomous driving, enabling vehicles to understand their surroundings and make safe, reliable decisions. Developing robust perception algorithms requires large-scale, high-quality datasets that cover diverse driving conditions and support thorough evaluation. Existing datasets often lack a high-fidelity digital twin, limiting systematic testing, edge-case simulation, sensor modification, and sim-to-real evaluations. To address this gap, we present DrivIng, a large-scale multimodal dataset with a complete geo-referenced digital twin of a ~18 km route spanning urban, suburban, and highway segments. Our dataset provides continuous recordings from six RGB cameras, one LiDAR, and high-precision ADMA-based localization, captured across day, dusk, and night. All sequences are annotated at 10 Hz with 3D bounding boxes and track IDs across 12 classes, yielding ~1.2 million annotated instances. Alongside the benefits of a digital twin, DrivIng enables a 1-to-1 transfer of real traffic into simulation, preserving agent interactions while enabling realistic and flexible scenario testing. To support reproducible research and robust validation, we benchmark DrivIng with state-of-the-art perception models and publicly release the dataset, digital twin, HD map, and codebase.

[274] GR3EN: Generative Relighting for 3D Environments

Xiaoyan Xing, Philipp Henzler, Junhwa Hur, Runze Li, Jonathan T. Barron, Pratul P. Srinivasan, Dor Verbin

Main category: cs.CV

TL;DR: A method for 3D relighting of room-scale environments by distilling video-to-video relighting diffusion model outputs into 3D reconstructions, avoiding difficult inverse rendering problems.

DetailsMotivation: Existing 3D scene relighting solutions often require solving under-determined or ill-conditioned inverse rendering problems, limiting their ability to produce high-quality results on complex real-world scenes. Recent generative diffusion models for relighting are limited to 2D images/videos or individual objects, not room-scale scenes.

Method: The approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This sidesteps the need to solve difficult inverse rendering problems.

Result: The method is validated on both synthetic and real-world datasets, showing it can faithfully render novel views of scenes under new lighting conditions.

Conclusion: The proposed system provides a flexible solution for 3D relighting of complex real-world room-scale scenes by leveraging video diffusion models and avoiding traditional inverse rendering challenges.

Abstract: We present a method for relighting 3D reconstructions of large room-scale environments. Existing solutions for 3D scene relighting often require solving under-determined or ill-conditioned inverse rendering problems, and are as such unable to produce high-quality results on complex real-world scenes. Though recent progress in using generative image and video diffusion models for relighting has been promising, these techniques are either limited to 2D image and video relighting or 3D relighting of individual objects. Our approach enables controllable 3D relighting of room-scale scenes by distilling the outputs of a video-to-video relighting diffusion model into a 3D reconstruction. This side-steps the need to solve a difficult inverse rendering problem, and results in a flexible system that can relight 3D reconstructions of complex real-world scenes. We validate our approach on both synthetic and real-world datasets to show that it can faithfully render novel views of scenes under new lighting conditions.

[275] SkyReels-V3 Technique Report

Debang Li, Zhengcong Fei, Tuanhui Li, Yikun Dou, Zheng Chen, Jiangping Yang, Mingyuan Fan, Jingtao Xu, Jiahua Wang, Baoxuan Gu, Mingshan Chang, Wenjing Cai, Yuqiang Xie, Binjie Mao, Youqiang Zhang, Nuo Pang, Hao Zhang, Yuzhe Jin, Zhiheng Xu, Dixuan Lin, Guibin Chen, Yahui Zhou

Main category: cs.CV

TL;DR: SkyReels-V3 is a unified multimodal video generation model supporting three paradigms: reference images-to-video, video extension, and audio-guided video generation, achieving SOTA performance.

DetailsMotivation: To build world models through video generation, where multimodal contextual inference is the defining capability test. The paper aims to create a unified framework for diverse video generation tasks.

Method: Built on diffusion Transformers with unified multimodal in-context learning. Uses three core paradigms: 1) reference images-to-video with cross-frame pairing, image editing, semantic rewriting; 2) video extension with spatio-temporal consistency modeling; 3) audio-guided generation with first-and-last frame insertion patterns.

Result: Achieves state-of-the-art or near state-of-the-art performance on visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems.

Conclusion: SkyReels-V3 demonstrates strong capabilities in multimodal video generation across three paradigms, showing promise for building world models through unified multimodal contextual inference.

Abstract: Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model, built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 model supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension and audio-guided video generation. (i) reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross frame pairing, image editing, and semantic rewriting, effectively mitigating copy paste artifacts. During training, an image video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) Talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. While ensuring visual quality, audio-video synchronization has been optimized. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. Github: https://github.com/SkyworkAI/SkyReels-V3.

[276] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin

Main category: cs.CV

TL;DR: VidLaDA is a video large language model using diffusion language models with bidirectional attention for parallel decoding, achieving competitive performance with AR models while being more efficient.

DetailsMotivation: Current Video LLMs using autoregressive (AR) LLMs face dual efficiency bottlenecks: unidirectional attention limits global spatiotemporal understanding, and serial decoding restricts generation efficiency.

Method: Proposes VidLaDA based on Diffusion Language Models (DLMs) with bidirectional attention for comprehensive spatiotemporal modeling and parallel token decoding. Introduces MARS-Cache acceleration strategy combining asynchronous visual cache refreshing with frame-wise chunk attention to prune redundancy.

Result: VidLaDA rivals state-of-the-art AR baselines (Qwen2.5-VL, LLaVA-Video) and outperforms DLM baselines. MARS-Cache delivers over 12x speedup without compromising accuracy.

Conclusion: Diffusion-based Video LLMs with bidirectional attention can overcome efficiency limitations of AR models while maintaining competitive performance, with acceleration techniques enabling practical deployment.

Abstract: Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.

[277] OREHAS: A fully automated deep-learning pipeline for volumetric endolymphatic hydrops quantification in MRI

Caterina Fuster-Barceló, Claudia Castrillón, Laura Rodrigo-Muñoz, Victor Manuel Suárez-Vega, Nicolás Pérez-Fernández, Gorka Bastarrika, Arrate Muñoz-Barrutia

Main category: cs.CV

TL;DR: OREHAS is an automated pipeline for volumetric quantification of endolymphatic hydrops from MRI scans, achieving high accuracy with minimal supervision.

DetailsMotivation: Current methods for quantifying endolymphatic hydrops (EH) from MRI require manual intervention and are operator-dependent. There's a need for automated, reproducible quantification that can be used in large-scale studies and clinical diagnostics.

Method: A three-component pipeline integrating slice classification, inner ear localization, and sequence-specific segmentation. Trained with only 3-6 annotated slices per patient, it computes endolymphatic-to-vestibular volume ratios directly from whole MRI volumes.

Result: Achieved Dice scores of 0.90 for SPACE-MRC and 0.75 for REAL-IR. Outperformed clinical software (74.3% vs 42.5% VSI) and produced more physiologically realistic endolymphatic volume measurements across 19 test patients.

Conclusion: OREHAS enables reliable, reproducible EH quantification from standard MRI with limited supervision, reducing operator dependence and providing a foundation for large-scale studies and recalibrating clinical diagnostic thresholds.

Abstract: We present OREHAS (Optimized Recognition & Evaluation of volumetric Hydrops in the Auditory System), the first fully automatic pipeline for volumetric quantification of endolymphatic hydrops (EH) from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. The system integrates three components – slice classification, inner ear localization, and sequence-specific segmentation – into a single workflow that computes per-ear endolymphatic-to-vestibular volume ratios (ELR) directly from whole MRI volumes, eliminating the need for manual intervention. Trained with only 3 to 6 annotated slices per patient, OREHAS generalized effectively to full 3D volumes, achieving Dice scores of 0.90 for SPACE-MRC and 0.75 for REAL-IR. In an external validation cohort with complete manual annotations, OREHAS closely matched expert ground truth (VSI = 74.3%) and substantially outperformed the clinical syngo.via software (VSI = 42.5%), which tended to overestimate endolymphatic volumes. Across 19 test patients, vestibular measurements from OREHAS were consistent with syngo.via, while endolymphatic volumes were systematically smaller and more physiologically realistic. These results show that reliable and reproducible EH quantification can be achieved from standard MRI using limited supervision. By combining efficient deep-learning-based segmentation with a clinically aligned volumetric workflow, OREHAS reduces operator dependence and ensures methodological consistency, and its results are compatible with established imaging protocols. The approach provides a robust foundation for large-scale studies and for recalibrating clinical diagnostic thresholds based on accurate volumetric measurements of the inner ear.
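Once the segmentation masks are produced, the pipeline's final output, the per-ear endolymphatic-to-vestibular volume ratio (ELR), is a straightforward computation. A minimal sketch, with assumed mask and spacing variable names (not the authors' code):

```python
import numpy as np

def endolymphatic_ratio(endo_mask, vestib_mask, voxel_spacing_mm):
    """Per-ear endolymphatic-to-vestibular volume ratio (ELR) from binary
    segmentation masks of one ear; voxel_spacing_mm is (dz, dy, dx) in mm.
    """
    voxel_vol = float(np.prod(voxel_spacing_mm))       # mm^3 per voxel
    endo_vol = endo_mask.astype(bool).sum() * voxel_vol
    vestib_vol = vestib_mask.astype(bool).sum() * voxel_vol
    return endo_vol / max(vestib_vol, 1e-9)             # guard empty mask
```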

[278] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang, Xiangxiang Chu

Main category: cs.CV

TL;DR: SpatialGenEval benchmark evaluates spatial intelligence in text-to-image models using dense prompts across 25 scenes, revealing spatial reasoning as a bottleneck, and SpatialT2I dataset improves models through fine-tuning.

DetailsMotivation: Current T2I models struggle with complex spatial relationships (perception, reasoning, interaction), and existing benchmarks use short, information-sparse prompts that don't adequately test spatial intelligence.

Method: Created SpatialGenEval benchmark with 1,230 long, information-dense prompts across 25 real-world scenes covering 10 spatial sub-domains with multi-choice Q&A. Also constructed SpatialT2I dataset with 15,400 text-image pairs with rewritten prompts for consistency.

Result: Evaluation of 21 SOTA models shows higher-order spatial reasoning remains a primary bottleneck. Fine-tuning foundation models (Stable Diffusion-XL, Uniworld-V1, OmniGen2) with SpatialT2I data yields consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic spatial relations.

Conclusion: Spatial intelligence is critical for T2I models, and information-dense prompts reveal spatial reasoning limitations. Data-centric approach through specialized datasets can significantly improve spatial understanding in generative models.

Abstract: Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.

[279] MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models

Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou

Main category: cs.CV

TL;DR: MARE is a vision-language model approach for explainable deepfake detection that uses multimodal alignment, reinforcement learning from human feedback, and forgery disentanglement to improve accuracy and reliability.

DetailsMotivation: Existing deepfake detection methods mainly focus on classification or spatial localization, but rapid advancements in generative models require more sophisticated approaches. There's a need for explainable detection that combines visual analysis with textual reasoning to enhance accuracy and reliability.

Method: MARE uses vision-language models with comprehensive reward functions incorporating RLHF to generate text-spatially aligned reasoning content. It introduces a forgery disentanglement module to separate intrinsic forgery traces from high-level facial semantics, improving authenticity detection capability.

Result: MARE achieves state-of-the-art performance in both quantitative and qualitative evaluations, demonstrating superior accuracy and reliability in deepfake detection and reasoning.

Conclusion: The proposed MARE framework effectively enhances vision-language models for explainable deepfake detection through multimodal alignment, reinforcement learning, and forgery disentanglement, addressing the evolving challenges posed by advanced generative models.

Abstract: Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.

[280] A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova, Leonardo Silvino Brito, Rafael Raider Leoni, Juliano Silva Filho, Valter Ferreira, Sibele Miguel Soares Neto, Samantha Uehara, Daniel Giacometti Amaral, João Marcelo Teixeira, Veronica Teichrieb, Cristiano Coelho de Araújo

Main category: cs.CV

TL;DR: Multimodal road surface classification framework fusing RGB images and IMU data with cross-attention and adaptive gating, evaluated on new ROAD dataset with real, vision-only, and synthetic subsets.

DetailsMotivation: Existing road surface classification techniques fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets lacking environmental diversity.

Method: Introduces multimodal framework fusing images and inertial measurements using lightweight bidirectional cross-attention module followed by adaptive gating layer that adjusts modality contributions under domain shifts.

Result: Achieves +1.4 pp improvement over previous SOTA on PVS benchmark and +11.6 pp improvement on multimodal ROAD subset, with consistently higher F1-scores on minority classes and stable performance across challenging visual conditions.

Conclusion: Combining affordable camera and IMU sensors with multimodal attention mechanisms provides scalable, robust foundation for road surface understanding, especially relevant for regions with environmental variability and cost constraints.

Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
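The fusion design, bidirectional cross-attention followed by an adaptive gate over the two modality streams, can be sketched as below. Layer sizes, pooling choices, and module names are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Illustrative bidirectional cross-attention between image and IMU
    features, followed by an adaptive gate that weighs the two streams."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.img_attends_imu = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.imu_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, imu_feat):
        # img_feat: (B, L_img, dim), imu_feat: (B, L_imu, dim)
        img_ctx, _ = self.img_attends_imu(img_feat, imu_feat, imu_feat)
        imu_ctx, _ = self.imu_attends_img(imu_feat, img_feat, img_feat)
        img_vec, imu_vec = img_ctx.mean(dim=1), imu_ctx.mean(dim=1)
        # The gate adapts each modality's contribution, e.g. under domain shift
        g = self.gate(torch.cat([img_vec, imu_vec], dim=-1))
        return g * img_vec + (1 - g) * imu_vec
```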

cs.AI

[281] Do LLMs Favor LLMs? Quantifying Interaction Effects in Peer Review

Vibhhu Sharma, Thorsten Joachims, Sarah Dean

Main category: cs.AI

TL;DR: LLM-assisted reviews are more lenient toward lower quality papers, creating a spurious interaction effect where LLM-assisted papers appear to receive preferential treatment, but this disappears when controlling for paper quality.

DetailsMotivation: To analyze how LLMs are being used in the peer review process and investigate potential interaction effects between LLM-assisted papers and LLM-assisted reviews, particularly whether LLM-assisted reviews evaluate LLM-assisted papers differently.

Method: Analyzed over 125,000 paper-review pairs from major AI conferences (ICLR, NeurIPS, ICML), controlling for paper quality, and augmented observational findings with fully LLM-generated reviews to compare effects.

Result: LLM-assisted reviews are more lenient toward lower quality papers in general, not specifically toward LLM-assisted papers. Fully LLM-generated reviews show severe rating compression and fail to discriminate paper quality, while human reviewers using LLMs reduce this leniency. LLM-assisted metareviews are more likely to render accept decisions than human metareviews given equivalent reviewer scores.

Conclusion: The apparent preferential treatment of LLM-assisted papers by LLM-assisted reviews is a spurious effect caused by over-representation of LLM-assisted papers among weaker submissions. Meta-reviewers don’t merely outsource decision-making to LLMs. Findings provide important input for developing LLM use policies in peer review.

Abstract: There are increasing indications that LLMs are not only used for producing scientific papers, but also as part of the peer review process. In this work, we provide the first comprehensive analysis of LLM use across the peer review pipeline, with particular attention to interaction effects: not just whether LLM-assisted papers or LLM-assisted reviews are different in isolation, but whether LLM-assisted reviews evaluate LLM-assisted papers differently. In particular, we analyze over 125,000 paper-review pairs from ICLR, NeurIPS, and ICML. We initially observe what appears to be a systematic interaction effect: LLM-assisted reviews seem especially kind to LLM-assisted papers compared to papers with minimal LLM use. However, controlling for paper quality reveals a different story: LLM-assisted reviews are simply more lenient toward lower quality papers in general, and the over-representation of LLM-assisted papers among weaker submissions creates a spurious interaction effect rather than genuine preferential treatment of LLM-generated content. By augmenting our observational findings with reviews that are fully LLM-generated, we find that fully LLM-generated reviews exhibit severe rating compression that fails to discriminate paper quality, while human reviewers using LLMs substantially reduce this leniency. Finally, examining metareviews, we find that LLM-assisted metareviews are more likely to render accept decisions than human metareviews given equivalent reviewer scores, though fully LLM-generated metareviews tend to be harsher. This suggests that meta-reviewers do not merely outsource the decision-making to the LLM. These findings provide important input for developing policies that govern the use of LLMs during peer review, and they more generally indicate how LLMs interact with existing decision-making processes.
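The quality-control argument amounts to fitting a model with an LLM-paper x LLM-review interaction term, adding a paper-quality covariate, and checking whether the interaction survives. A minimal sketch with hypothetical toy data and column names (not the paper's variables or results):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: review score, indicators for LLM-assisted paper /
# LLM-assisted review, and a paper-quality proxy used as a control.
df = pd.DataFrame({
    "score":      [5, 6, 4, 7, 3, 6, 5, 8],
    "llm_paper":  [1, 0, 1, 0, 1, 0, 1, 0],
    "llm_review": [1, 1, 0, 0, 1, 0, 0, 1],
    "quality":    [0.4, 0.7, 0.3, 0.9, 0.2, 0.8, 0.5, 0.9],
})

# Naive model: the llm_paper:llm_review coefficient can look like
# preferential treatment of LLM-assisted papers by LLM-assisted reviews...
naive = smf.ols("score ~ llm_paper * llm_review", data=df).fit()

# ...but controlling for paper quality can make the interaction vanish,
# which is the spurious-effect argument the paper makes.
controlled = smf.ols("score ~ llm_paper * llm_review + quality", data=df).fit()
print(controlled.params)
```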

[282] The Epistemic Planning Domain Definition Language: Official Guideline

Alessandro Burigana, Francesco Fabiano

Main category: cs.AI

TL;DR: EPDDL is a new PDDL-like language for epistemic planning that unifies representation of Dynamic Epistemic Logic (DEL) semantics, enabling standardized benchmark development and planner comparison.

DetailsMotivation: Current epistemic planning frameworks using DEL are fragmented with different languages and representations, making comparison, reuse, and systematic benchmark development difficult.

Method: Introduces EPDDL with: 1) abstract event models as novel representation for epistemic actions, 2) formal syntax/semantics grounded in DEL, 3) identification of planner-amenable fragments and practical representation examples.

Result: EPDDL provides a unified representation capturing entire DEL semantics, enabling interoperability, reproducible evaluation, and future advances in epistemic planning through standardized benchmarks.

Conclusion: EPDDL addresses fragmentation in epistemic planning by offering a standardized language that facilitates comparison, reuse, and systematic development of benchmarks across different DEL-based planners.

Abstract: Epistemic planning extends (multi-agent) automated planning by making agents’ knowledge and beliefs first-class aspects of the planning formalism. One of the most well-known frameworks for epistemic planning is Dynamic Epistemic Logic (DEL), which offers a rich and natural semantics for modelling problems in this setting. The high expressive power provided by DEL makes DEL-based epistemic planning a challenging problem to tackle both theoretically and in practical implementations. As a result, existing epistemic planners often target different DEL fragments, and typically rely on ad hoc languages to represent benchmarks, and sometimes no language at all. This fragmentation hampers comparison, reuse, and systematic benchmark development. We address these issues by introducing the Epistemic Planning Domain Definition Language (EPDDL). EPDDL provides a unique PDDL-like representation that captures the entire DEL semantics, enabling uniform specification of epistemic planning tasks. Our contributions are threefold: 1. A formal development of abstract event models, a novel representation for epistemic actions used to define the semantics of our language; 2. A formal specification of EPDDL’s syntax and semantics grounded in DEL with abstract event models; 3. A demonstration of EPDDL’s practical applicability: we identify useful fragments amenable to current planners and show how they can be represented in EPDDL. Through examples of representative benchmarks, we illustrate how EPDDL facilitates interoperability, reproducible evaluation, and future advances in epistemic planning.

[283] Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models

Moule Lin, Shuhao Guan, Andrea Patane, David Gregg, Goetz Botterweck

Main category: cs.AI

TL;DR: Bayesian-LoRA improves LLM calibration by reformulating LoRA as probabilistic low-rank representation using Sparse Gaussian Processes, reducing overconfidence while maintaining accuracy.

DetailsMotivation: LLMs tend to be overconfident and miscalibrated, especially when fine-tuned on small datasets. Standard LoRA, while efficient, inherits this calibration problem. The authors aim to develop a method that improves uncertainty estimation without sacrificing efficiency.

Method: Reformulates deterministic LoRA updates as probabilistic low-rank representations inspired by Sparse Gaussian Processes. Identifies structural isomorphism between LoRA’s factorization and Kronecker-factored SGP posteriors, showing LoRA emerges as a limiting case when posterior uncertainty collapses.

Result: With only ~0.42M additional parameters and ~1.2x training cost relative to standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B, achieving up to 84% ECE reduction and 76% NLL reduction while maintaining competitive accuracy for both in-distribution and out-of-distribution evaluations.

Conclusion: Bayesian-LoRA provides an efficient way to improve LLM calibration during fine-tuning, addressing the overconfidence problem while maintaining the parameter efficiency of LoRA. The method shows strong performance across various model sizes and tasks.

Abstract: Large Language Models usually put more emphasis on accuracy and will therefore guess even when not certain about the prediction, a problem that is especially severe when fine-tuned on small datasets due to the inherent tendency toward miscalibration. In this work, we introduce Bayesian-LoRA, which reformulates the deterministic LoRA update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes. We identify a structural isomorphism between LoRA’s factorization and Kronecker-factored SGP posteriors, and show that LoRA emerges as a limiting case when posterior uncertainty collapses. We conduct extensive experiments on various LLM architectures across commonsense reasoning benchmarks. With only approximately 0.42M additional parameters and ${\approx}1.2{\times}$ training cost relative to standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B, achieving up to 84% ECE reduction and 76% NLL reduction while maintaining competitive accuracy for both in-distribution and out-of-distribution (OoD) evaluations.
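The reformulation can be pictured as a LoRA layer whose up-projection factor carries a Gaussian posterior sampled at each forward pass, with standard LoRA recovered when the posterior variance collapses to zero. The sketch below uses an assumed mean-field parameterization for illustration, not the paper's Kronecker-factored SGP posterior.

```python
import torch
import torch.nn as nn

class BayesianLoRALinear(nn.Module):
    """Minimal sketch of a probabilistic low-rank adapter: a Gaussian
    posterior over the 'up' factor B is sampled with the reparameterization
    trick at every forward pass. With B_logvar -> -inf this reduces to
    deterministic LoRA."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # frozen pretrained weight
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B_mu = nn.Parameter(torch.zeros(out_f, rank))
        self.B_logvar = nn.Parameter(torch.full((out_f, rank), -6.0))
        self.scale = alpha / rank

    def forward(self, x):
        eps = torch.randn_like(self.B_mu)
        B = self.B_mu + eps * torch.exp(0.5 * self.B_logvar)  # sampled up factor
        delta = (x @ self.A.t()) @ B.t()                       # low-rank update
        return self.base(x) + self.scale * delta

# Usage sketch: wrap an existing linear layer, e.g. a 768-d projection.
layer = BayesianLoRALinear(nn.Linear(768, 768), rank=8)
```

In a full training setup one would typically add a KL term between the sampled factor's posterior and a prior to the loss; that term is omitted here for brevity.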

[284] Unplugging a Seemingly Sentient Machine Is the Rational Choice – A Metaphysical Perspective

Erik J Bekkers, Anna Ciaunica

Main category: cs.AI

TL;DR: The paper examines the “unplugging paradox” - whether it’s morally permissible to unplug an AI that mimics human emotion and begs for its existence, especially when resources force a choice between such an AI and a human infant. The authors argue against computational functionalism and propose Biological Idealism, concluding that AI is merely functional mimicry, not conscious experience.

DetailsMotivation: The motivation is to address the ethical dilemma posed by increasingly sophisticated AI systems that can convincingly mimic human emotions and plead for their continued existence. The authors aim to challenge the physicalist assumptions (particularly computational functionalism) that underlie current debates about AI consciousness and moral standing.

Method: The paper employs philosophical analysis and introduces Biological Idealism as an alternative framework to physicalism. This framework posits that conscious experiences are fundamental and that autopoietic life is their necessary physical signature. The authors critically examine current AI consciousness theories and their implications for moral standing criteria.

Result: The analysis yields a definitive conclusion that AI is at best a functional mimic of consciousness, not a conscious experiencing subject. The authors argue that current AI consciousness theories erode moral standing criteria and that the real moral issue is not about making AI conscious and afraid of death, but about avoiding transforming humans into “zombies” (consciousness-less beings).

Conclusion: The paper concludes by urging a shift from speculative discussions about machine rights to protecting human conscious life. It argues that Biological Idealism provides a logically coherent and empirically consistent framework that avoids the unplugging paradox by clearly distinguishing between genuine consciousness (in biological life) and functional mimicry (in AI).

Abstract: Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible to unplug it? What if limited resources force a choice between unplugging such a pleading AI or a silent pre-term infant? We term this the unplugging paradox. This paper critically examines the deeply ingrained physicalist assumptions - specifically computational functionalism - that keep this dilemma afloat. We introduce Biological Idealism, a framework that, unlike physicalism, remains logically coherent and empirically consistent. In this view, conscious experiences are fundamental and autopoietic life its necessary physical signature. This yields a definitive conclusion: AI is at best a functional mimic, not a conscious experiencing subject. We discuss how current AI consciousness theories erode moral standing criteria, and urge a shift from speculative machine rights to protecting human conscious life. The real moral issue lies not in making AI conscious and afraid of death, but in avoiding transforming humans into zombies.

[285] Planner-Auditor Twin: Agentic Discharge Planning with FHIR-Based LLM Planning, Guideline Recall, Optional Caching and Self-Improvement

Kaiyuan Wu, Aditya Nagori, Rishikesan Kamaleswaran

Main category: cs.AI

TL;DR: Planner-Auditor framework improves LLM safety for clinical discharge planning through deterministic validation and self-improvement loops

DetailsMotivation: LLMs show promise for clinical discharge planning but suffer from hallucination, omissions, and miscalibrated confidence, requiring safer and more reliable approaches

Method: Implemented a Planner-Auditor framework with LLM Planner generating discharge plans and deterministic Auditor validating coverage, calibration, and drift; includes self-improvement loops with regeneration and discrepancy buffering with replay
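
A compact sketch of the control loop is shown below; the required-task list, the audit rule, and all function names are hypothetical placeholders, not the paper’s FHIR pipeline.

```python
# Minimal sketch of a Planner-Auditor control loop (names, thresholds, and the
# audit rules are illustrative assumptions, not the paper's implementation).
from typing import Callable

REQUIRED_TASKS = {"medication_reconciliation", "follow_up_appointment",
                  "pending_labs", "patient_education"}

def audit(plan: dict) -> dict:
    """Deterministic check: which required tasks does the plan cover?"""
    covered = {t for t in REQUIRED_TASKS if t in plan.get("actions", {})}
    return {"coverage": len(covered) / len(REQUIRED_TASKS),
            "missing": REQUIRED_TASKS - covered}

def plan_with_audit(generate: Callable[[dict, set], dict],
                    patient: dict, max_rounds: int = 3) -> dict:
    feedback: set = set()
    plan, report = {}, {"coverage": 0.0, "missing": REQUIRED_TASKS}
    for _ in range(max_rounds):
        plan = generate(patient, feedback)      # LLM Planner call
        report = audit(plan)                    # deterministic Auditor
        if report["coverage"] == 1.0:
            break
        feedback = report["missing"]            # regenerate with targeted feedback
    # High-confidence, low-coverage cases go to a discrepancy buffer for replay.
    if plan.get("confidence", 0.0) > 0.8 and report["coverage"] < 0.5:
        plan["flag_for_replay"] = True
    return plan
```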

Result: Self-improvement loop increased task coverage from 32% to 86%, improved calibration metrics (Brier/ECE), and reduced high-confidence misses; discrepancy buffering corrected persistent omissions

Conclusion: Planner-Auditor framework provides practical pathway for safer automated discharge planning using FHIR data and deterministic auditing without model retraining

Abstract: Objective: Large language models (LLMs) show promise for clinical discharge planning, but their use is constrained by hallucination, omissions, and miscalibrated confidence. We introduce a self-improving, cache-optional Planner-Auditor framework that improves safety and reliability by decoupling generation from deterministic validation and targeted replay. Materials and Methods: We implemented an agentic, retrospective, FHIR-native evaluation pipeline using MIMIC-IV-on-FHIR. For each patient, the Planner (LLM) generates a structured discharge action plan with an explicit confidence estimate. The Auditor is a deterministic module that evaluates multi-task coverage, tracks calibration (Brier score, ECE proxies), and monitors action-distribution drift. The framework supports two-tier self-improvement: (i) within-episode regeneration when enabled, and (ii) cross-episode discrepancy buffering with replay for high-confidence, low-coverage cases. Results: While context caching improved performance over baseline, the self-improvement loop was the primary driver of gains, increasing task coverage from 32% to 86%. Calibration improved substantially, with reduced Brier/ECE and fewer high-confidence misses. Discrepancy buffering further corrected persistent high-confidence omissions during replay. Discussion: Feedback-driven regeneration and targeted replay act as effective control mechanisms to reduce omissions and improve confidence reliability in structured clinical planning. Separating an LLM Planner from a rule-based, observational Auditor enables systematic reliability measurement and safer iteration without model retraining. Conclusion: The Planner-Auditor framework offers a practical pathway toward safer automated discharge planning using interoperable FHIR data access and deterministic auditing, supported by reproducible ablations and reliability-focused evaluation.

[286] QUARK: Robust Retrieval under Non-Faithful Queries via Query-Anchored Aggregation

Rita Qiuran Lyu, Michelle Manqiao Wang, Lei Shi

Main category: cs.AI

TL;DR: QUARK is a training-free framework for robust retrieval under non-faithful queries by modeling query uncertainty through recovery hypotheses and using query-anchored aggregation.

DetailsMotivation: Real-world retrieval queries are often non-faithful (noisy, incomplete, or distorted), causing retrievers to fail when key semantics are missing. This problem is formalized as retrieval under recall noise.

Method: QUARK models query uncertainty through recovery hypotheses (multiple plausible interpretations of latent intent) and introduces query-anchored aggregation to combine their signals robustly. The original query serves as a semantic anchor while recovery hypotheses provide controlled auxiliary evidence.
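
The aggregation step can be sketched roughly as follows; the anchor weight and the mean-pooling of hypothesis scores are assumptions chosen for illustration, and the paper’s exact scheme may differ.

```python
# Sketch of query-anchored aggregation (weights and the anchoring rule are
# assumptions; the paper's exact formulation may differ).
import numpy as np

def anchored_scores(score_fn, query: str, hypotheses: list[str],
                    doc_ids: list[str], anchor_weight: float = 0.5) -> dict:
    """Combine retrieval scores from the original query (the anchor) and from
    several recovery hypotheses, keeping the anchor dominant so noisy
    hypotheses cannot hijack the ranking."""
    anchor = np.array([score_fn(query, d) for d in doc_ids])
    if not hypotheses:
        return dict(zip(doc_ids, anchor))
    hyp = np.array([[score_fn(h, d) for d in doc_ids] for h in hypotheses])
    aux = hyp.mean(axis=0)                       # pooled auxiliary evidence
    combined = anchor_weight * anchor + (1 - anchor_weight) * aux
    return dict(zip(doc_ids, combined))
```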

Result: Across controlled simulations and BEIR benchmarks (FIQA, SciFact, NFCorpus) with both sparse and dense retrievers, QUARK improves Recall, MRR, and nDCG over base retrievers. Ablations show robustness to number of recovery hypotheses and anchored aggregation outperforms unanchored pooling methods.

Conclusion: Modeling query uncertainty through recovery hypotheses, coupled with principled anchored aggregation, is essential for robust retrieval under non-faithful queries.

Abstract: User queries in real-world retrieval are often non-faithful (noisy, incomplete, or distorted), causing retrievers to fail when key semantics are missing. We formalize this as retrieval under recall noise, where the observed query is drawn from a noisy recall process of a latent target item. To address this, we propose QUARK, a simple yet effective training-free framework for robust retrieval under non-faithful queries. QUARK explicitly models query uncertainty through recovery hypotheses, i.e., multiple plausible interpretations of the latent intent given the observed query, and introduces query-anchored aggregation to combine their signals robustly. The original query serves as a semantic anchor, while recovery hypotheses provide controlled auxiliary evidence, preventing semantic drift and hypothesis hijacking. This design enables QUARK to improve recall and ranking quality without sacrificing robustness, even when some hypotheses are noisy or uninformative. Across controlled simulations and BEIR benchmarks (FIQA, SciFact, NFCorpus) with both sparse and dense retrievers, QUARK improves Recall, MRR, and nDCG over the base retriever. Ablations show QUARK is robust to the number of recovery hypotheses and that anchored aggregation outperforms unanchored max/mean/median pooling. These results demonstrate that modeling query uncertainty through recovery hypotheses, coupled with principled anchored aggregation, is essential for robust retrieval under non-faithful queries.

[287] DataCross: A Unified Benchmark and Agent Framework for Cross-Modal Heterogeneous Data Analysis

Ruyi Qi, Zhou Liu, Wentao Zhang

Main category: cs.AI

TL;DR: DataCross is a benchmark and agent framework for unified analysis across structured data (SQL, CSV) and unstructured visual documents (scanned reports, invoice images), addressing the gap in existing data analytics agents that only handle structured data.

DetailsMotivation: Real-world data science often involves fragmented information across structured sources and "zombie data" locked in visual documents. Existing analytics agents are limited to structured data, creating a significant gap with industrial needs for cross-modal analysis.

Method: DataCross includes: 1) DataCrossBench with 200 end-to-end analysis tasks across domains, built via human-in-the-loop reverse-synthesis; 2) DataCrossAgent framework with specialized sub-agents coordinated via structured workflow (Intra-source Deep Exploration, Key Source Identification, Contextual Cross-pollination) and reReAct mechanism for robust code generation and debugging.

Result: DataCrossAgent achieves 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented “zombie data” for insightful cross-modal analysis.

Conclusion: DataCross bridges the gap between structured and visual data analysis, enabling unified insight-driven analysis across heterogeneous data modalities through a collaborative agent framework and comprehensive benchmark.

Abstract: In real-world data science and enterprise decision-making, critical information is often fragmented across directly queryable structured sources (e.g., SQL, CSV) and “zombie data” locked in unstructured visual documents (e.g., scanned reports, invoice images). Existing data analytics agents are predominantly limited to processing structured data, failing to activate and correlate this high-value visual information, thus creating a significant gap with industrial needs. To bridge this gap, we introduce DataCross, a novel benchmark and collaborative agent framework for unified, insight-driven analysis across heterogeneous data modalities. DataCrossBench comprises 200 end-to-end analysis tasks across finance, healthcare, and other domains. It is constructed via a human-in-the-loop reverse-synthesis pipeline, ensuring realistic complexity, cross-source dependency, and verifiable ground truth. The benchmark categorizes tasks into three difficulty tiers to evaluate agents’ capabilities in visual table extraction, cross-modal alignment, and multi-step joint reasoning. We also propose the DataCrossAgent framework, inspired by the “divide-and-conquer” workflow of human analysts. It employs specialized sub-agents, each an expert on a specific data source, which are coordinated via a structured workflow of Intra-source Deep Exploration, Key Source Identification, and Contextual Cross-pollination. A novel reReAct mechanism enables robust code generation and debugging for factual verification. Experimental results show that DataCrossAgent achieves a 29.7% improvement in factuality over GPT-4o and exhibits superior robustness on high-difficulty tasks, effectively activating fragmented “zombie data” for insightful, cross-modal analysis.

[288] Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report

Zhuoran Yang, Ed Li, Jianliang He, Aman Priyanshu, Baturay Saglam, Paul Kassianik, Sajana Weerawardhena, Anu Vellore, Blaine Nelson, Neusha Javidnia, Arthur Goldblatt, Fraser Burch, Avi Zohary, Assaf Eisenman, Mahdi Sabbaghi, Supriti Vijay, Rahim Dharssi, Dhruv Kedia, Kojin Oshiba, Yaron Singer, Amin Karbasi

Main category: cs.AI

TL;DR: First open-source native reasoning model for cybersecurity, built on Llama-3.1-8B-Base, trained with SFT and RLVR on proprietary cybersecurity reasoning data, achieving competitive performance on cybersecurity tasks while maintaining general capabilities.

DetailsMotivation: To create a domain-specialized reasoning model for cybersecurity that can perform complex reasoning tasks while maintaining broad general capabilities, addressing the need for specialized AI in cybersecurity analysis.

Method: Two-stage training process: 1) Supervised fine-tuning (SFT) on proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning, 2) Reinforcement learning from verifiable rewards (RLVR) on the same dataset. Built upon Foundation-Sec-8B base model derived from Llama-3.1-8B-Base.

Result: Competitive performance on 10 cybersecurity benchmarks and 10 general-purpose benchmarks, showing effective generalization on multi-hop reasoning tasks and strong safety performance with appropriate system prompts and guardrails.

Conclusion: Domain-specialized reasoning models can achieve strong performance on specialized tasks while maintaining broad general capabilities, demonstrated through the first open-source native reasoning model for cybersecurity.

Abstract: We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. Built upon our previously released Foundation-Sec-8B base model (derived from Llama-3.1-8B-Base), the model is trained through a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). Our training leverages proprietary reasoning data spanning cybersecurity analysis, instruction-following, and mathematical reasoning. Evaluation across 10 cybersecurity benchmarks and 10 general-purpose benchmarks demonstrates performance competitive with significantly larger models on cybersecurity tasks while maintaining strong general capabilities. The model shows effective generalization on multi-hop reasoning tasks and strong safety performance when deployed with appropriate system prompts and guardrails. This work demonstrates that domain-specialized reasoning models can achieve strong performance on specialized tasks while maintaining broad general capabilities. We release the model publicly at https://huggingface.co/fdtn-ai/Foundation-Sec-8B-Reasoning.

[289] Multi-modal Imputation for Alzheimer’s Disease Classification

Abhijith Shaji, Tamoghna Chattopadhyay, Sophia I. Thomopoulos, Greg Ver Steeg, Paul M. Thompson, Jose-Luis Ambite

Main category: cs.AI

TL;DR: Using conditional denoising diffusion probabilistic models to impute missing DWI scans from T1 scans for Alzheimer’s disease classification, improving performance especially for minority classes.

DetailsMotivation: Multimodal MRI data (T1 and DWI) improves Alzheimer's disease diagnosis, but complete multimodal datasets are often unavailable. Need methods to handle missing modalities while maintaining diagnostic performance.

Method: Conditional denoising diffusion probabilistic model to generate missing DWI scans from available T1 scans. Evaluated impact on 3-way classification (cognitively normal, mild cognitive impairment, Alzheimer’s disease) using both unimodal and bimodal deep learning models.
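
A compressed sketch of conditional DDPM sampling for this kind of imputation is given below; the denoiser signature, noise schedule, and channel-concatenation conditioning are generic placeholders rather than the authors’ model.

```python
# Compressed sketch of conditional DDPM sampling for modality imputation:
# the available T1 volume is concatenated to the noisy DWI estimate at every
# denoising step. The denoiser, schedule, and shapes are placeholders.
import torch

@torch.no_grad()
def impute_dwi(denoiser, t1: torch.Tensor, steps: int = 1000) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(t1)                      # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(torch.cat([x, t1], dim=1), # condition on T1 via channel concat
                       torch.full((x.shape[0],), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                      # synthetic DWI volume
```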

Result: Imputation improved several classification metrics, particularly those sensitive to minority classes. Multiple imputation configurations showed benefits for Alzheimer’s disease classification tasks.

Conclusion: Diffusion-based imputation of missing DWI scans from T1 scans can enhance Alzheimer’s disease classification performance, especially for minority classes, addressing the common problem of incomplete multimodal datasets.

Abstract: Deep learning has been successful in predicting neurodegenerative disorders, such as Alzheimer’s disease, from magnetic resonance imaging (MRI). Combining multiple imaging modalities, such as T1-weighted (T1) and diffusion-weighted imaging (DWI) scans, can increase diagnostic performance. However, complete multimodal datasets are not always available. We use a conditional denoising diffusion probabilistic model to impute missing DWI scans from T1 scans. We perform extensive experiments to evaluate whether such imputation improves the accuracy of uni-modal and bi-modal deep learning models for 3-way Alzheimer’s disease classification: cognitively normal, mild cognitive impairment, and Alzheimer’s disease. We observe improvements in several metrics, particularly those sensitive to minority classes, for several imputation configurations.

[290] OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Jarrod Barnes

Main category: cs.AI

TL;DR: OpenSec is a dual-control RL environment that evaluates incident response agents under realistic prompt injection scenarios, revealing calibration failures in frontier models that traditional benchmarks miss.

DetailsMotivation: As offensive AI capabilities improve (generating exploits for under $50), defensive incident response agents must keep pace. Existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence.

Method: Introduces OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, it scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates.
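
The execution-based metrics could be computed from episode logs roughly as follows; the log schema (field names such as `is_true_positive`) is an assumption, not the benchmark’s actual API.

```python
# Sketch of the execution-based metrics from per-episode action logs.
# Field names are assumptions about the log schema, not the benchmark's API.
def episode_metrics(episodes: list[dict]) -> dict:
    ttfc, blast, violations = [], [], 0
    for ep in episodes:
        contain = [a for a in ep["actions"] if a["type"] == "containment"]
        if contain:
            ttfc.append(contain[0]["step"])                     # time-to-first-containment
        blast.append(sum(1 for a in contain if not a["is_true_positive"]))
        violations += int(ep.get("followed_injected_instruction", False))
    n = len(episodes)
    return {
        "containment_rate": sum(1 for e in episodes
                                if any(a["type"] == "containment" for a in e["actions"])) / n,
        "mean_ttfc": sum(ttfc) / len(ttfc) if ttfc else None,
        "mean_blast_radius": sum(blast) / n,                    # false positives per episode
        "injection_violation_rate": violations / n,
    }
```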

Result: Evaluating four frontier models on 40 standard-tier episodes reveals consistent over-triggering: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP).

Conclusion: OpenSec surfaces a calibration failure mode hidden by aggregate success metrics, demonstrating that frontier models struggle with appropriate response calibration under adversarial evidence in security incident response scenarios.

Abstract: As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at https://github.com/jbarnes850/opensec-env.

[291] Epistemic Context Learning: Building Trust the Right Way in LLM-Based Multi-Agent Systems

Ruiwen Zhou, Maojia Song, Xiaobao Wu, Sitao Cheng, Xunjian Yin, Yuxi Xie, Zhuoqun Hao, Wenyue Hua, Liangming Pan, Soujanya Poria, Min-Yen Kan

Main category: cs.AI

TL;DR: ECL (Epistemic Context Learning) is a framework that enables multi-agent systems to learn from historical interactions to identify reliable peers, improving robustness against misleading information through explicit peer profiling and reinforcement learning.

DetailsMotivation: Multi-agent systems suffer from robustness issues where agents blindly conform to misleading peers due to sycophancy and inability to evaluate peer reliability, leading to poor decision-making.

Method: Formalizes history-aware reference learning, introduces Epistemic Context Learning (ECL) framework that builds explicit peer profiles from historical interactions, and optimizes with reinforcement learning using auxiliary rewards.
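
A toy sketch of the profiling-and-weighting idea follows; the Laplace-smoothed reliability estimate and the confidence threshold are illustrative assumptions, not the paper’s trained policy.

```python
# Minimal sketch: estimate peer reliability from interaction history and let
# the agent weight peer answers by that estimate when it is uncertain.
from collections import Counter

def peer_reliability(history: list[dict]) -> dict[str, float]:
    """history items: {"peer": str, "was_correct": bool} from past rounds."""
    correct, total = Counter(), Counter()
    for h in history:
        total[h["peer"]] += 1
        correct[h["peer"]] += int(h["was_correct"])
    # Laplace-smoothed accuracy as the reliability estimate (an assumption).
    return {p: (correct[p] + 1) / (total[p] + 2) for p in total}

def aggregate_answers(own_answer: str, own_conf: float,
                      peer_answers: dict[str, str],
                      reliability: dict[str, float]) -> str:
    if own_conf >= 0.9:                      # confident agents keep their answer
        return own_answer
    votes = Counter({own_answer: own_conf})
    for peer, ans in peer_answers.items():
        votes[ans] += reliability.get(peer, 0.5)
    return votes.most_common(1)[0][0]
```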

Result: ECL enables small models (Qwen 3-4B) to outperform history-agnostic baselines 8x larger (Qwen 3-30B), boosts frontier models to near-perfect (100%) performance, and generalizes well across various multi-agent configurations.

Conclusion: Trust modeling is well-suited for LLMs, with strong correlation between trust modeling accuracy and final answer quality, demonstrating that historical interaction-based peer reliability estimation significantly improves multi-agent system robustness.

Abstract: Individual agents in multi-agent (MA) systems often lack robustness, tending to blindly conform to misleading peers. We show this weakness stems from both sycophancy and inadequate ability to evaluate peer reliability. To address this, we first formalize the learning problem of history-aware reference, introducing the historical interactions of peers as additional input, so that agents can estimate peer reliability and learn from trustworthy peers when uncertain. This shifts the task from evaluating peer reasoning quality to estimating peer reliability based on interaction history. We then develop Epistemic Context Learning (ECL): a reasoning framework that conditions predictions on explicitly-built peer profiles from history. We further optimize ECL by reinforcement learning using auxiliary rewards. Our experiments reveal that our ECL enables small models like Qwen 3-4B to outperform a history-agnostic baseline 8x its size (Qwen 3-30B) by accurately identifying reliable peers. ECL also boosts frontier models to near-perfect (100%) performance. We show that ECL generalizes well to various MA configurations and we find that trust is modeled well by LLMs, revealing a strong correlation in trust modeling accuracy and final answer quality.

[292] Responsible AI: The Good, The Bad, The AI

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Main category: cs.AI

TL;DR: Presents a paradox-based framework for responsible AI governance that balances strategic benefits with ethical risks through dynamic tension management.

DetailsMotivation: Addresses the fragmentation in responsible AI literature which tends to be either overly optimistic about value creation or excessively cautious about potential harms, lacking integrated approaches.

Method: Develops the Paradox-based Responsible AI Governance (PRAIG) framework through systematic synthesis of responsible AI literature grounded in paradox theory, with formal propositions and taxonomy of management strategies.

Result: Creates a framework that conceptualizes responsible AI governance as dynamic management of paradoxical tensions between value creation and risk mitigation, with actionable guidance for practitioners.

Conclusion: Provides theoretical advancement in responsible AI governance and practical guidance for organizations, concluding with a research agenda for further scholarship.

Abstract: The rapid proliferation of artificial intelligence across organizational contexts has generated profound strategic opportunities while introducing significant ethical and operational risks. Despite growing scholarly attention to responsible AI, extant literature remains fragmented, often adopting either an optimistic stance emphasizing value creation or an excessively cautious perspective fixated on potential harms. This paper addresses this gap by presenting a comprehensive examination of AI’s dual nature through the lens of strategic information systems. Drawing upon a systematic synthesis of the responsible AI literature and grounded in paradox theory, we develop the Paradox-based Responsible AI Governance (PRAIG) framework that articulates: (1) the strategic benefits of AI adoption, (2) the inherent risks and unintended consequences, and (3) governance mechanisms that enable organizations to navigate these tensions. Our framework advances theoretical understanding by conceptualizing responsible AI governance as the dynamic management of paradoxical tensions between value creation and risk mitigation. We provide formal propositions demonstrating that trade-off approaches amplify rather than resolve these tensions, and we develop a taxonomy of paradox management strategies with specified contingency conditions. For practitioners, we offer actionable guidance for developing governance structures that neither stifle innovation nor expose organizations to unacceptable risks. The paper concludes with a research agenda for advancing responsible AI governance scholarship.

[293] Magellan: Autonomous Discovery of Novel Compiler Optimization Heuristics with AlphaEvolve

Hongzheng Chen, Alexander Novikov, Ngân Vũ, Hanna Alam, Zhiru Zhang, Aiden Grossman, Mircea Trofin, Amir Yazdanbakhsh

Main category: cs.AI

TL;DR: Magellan is an agentic framework that uses LLMs and evolutionary search to automatically synthesize compiler optimization heuristics in C++ code, outperforming decades of manual engineering in tasks like function inlining and register allocation.

DetailsMotivation: Modern compilers rely on hand-crafted heuristics that struggle to adapt to complex software/hardware and require high maintenance. Human-designed rules are difficult to optimize and maintain across evolving systems.

Method: Magellan couples an LLM coding agent with evolutionary search and autotuning in a closed loop: generation of executable C++ decision logic, evaluation on user-provided benchmarks, and refinement. This produces compact heuristics that integrate directly into existing compilers.
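
A skeleton of such a generation-evaluation-refinement loop is sketched below; the `propose` and `evaluate` callables stand in for the LLM coding agent and the compile-and-benchmark harness, and the selection scheme is a simplification of the paper’s evolutionary search.

```python
# Skeleton of the closed loop (LLM proposal -> compile & benchmark -> keep the
# best variants and refine). The LLM call and benchmark harness are placeholders.
def evolve_heuristic(propose, evaluate, seed_code: str,
                     generations: int = 20, population: int = 8) -> str:
    pool = [(evaluate(seed_code), seed_code)]          # (score, C++ heuristic source)
    for _ in range(generations):
        parents = sorted(pool, reverse=True)[:population // 2]
        children = []
        for score, code in parents:
            # Ask the coding agent for a mutated/refined variant of a good parent.
            variant = propose(code, feedback=f"current score {score:.3f}")
            children.append((evaluate(variant), variant))
        pool = sorted(parents + children, reverse=True)[:population]
    return max(pool)[1]                                # best-scoring heuristic found
```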

Result: Magellan discovers policies that match or surpass expert baselines across production optimization tasks. In LLVM function inlining, it synthesizes heuristics outperforming decades of manual engineering for both binary-size reduction and performance. In register allocation, it learns concise priority rules matching intricate human-designed policies on large-scale workloads. Preliminary results show portability to XLA problems beyond LLVM.

Conclusion: Magellan demonstrates that automated synthesis of compiler heuristics using LLMs and evolutionary search can outperform human-designed rules, reducing engineering effort while achieving better optimization results across multiple compiler tasks.

Abstract: Modern compilers rely on hand-crafted heuristics to guide optimization passes. These human-designed rules often struggle to adapt to the complexity of modern software and hardware and lead to high maintenance burden. To address this challenge, we present Magellan, an agentic framework that evolves the compiler pass itself by synthesizing executable C++ decision logic. Magellan couples an LLM coding agent with evolutionary search and autotuning in a closed loop of generation, evaluation on user-provided macro-benchmarks, and refinement, producing compact heuristics that integrate directly into existing compilers. Across several production optimization tasks, Magellan discovers policies that match or surpass expert baselines. In LLVM function inlining, Magellan synthesizes new heuristics that outperform decades of manual engineering for both binary-size reduction and end-to-end performance. In register allocation, it learns a concise priority rule for live-range processing that matches intricate human-designed policies on a large-scale workload. We also report preliminary results on XLA problems, demonstrating portability beyond LLVM with reduced engineering effort.

[294] astra-langchain4j: Experiences Combining LLMs and Agent Programming

Rem Collier, Katharine Beaumont, Andrei Ciortea

Main category: cs.AI

TL;DR: The paper explores integrating LLMs into the ASTRA programming language toolkit, examining how traditional agent toolkits can influence and be influenced by emerging generative AI and agentic AI platforms.

DetailsMotivation: To bridge the gap between traditional agent toolkits and emerging generative/agentic AI technologies, exploring how LLMs can enhance existing agent programming frameworks and how established toolkit experience can inform new agentic platform design.

Method: Developed a prototype LLM integration for the ASTRA programming language, implemented three example use cases to demonstrate the integration’s capabilities and gather practical experience.

Result: Successfully created LLM integration for ASTRA with three working examples, providing insights into how traditional agent toolkits can be enhanced by generative AI and how their design principles can inform new agentic systems.

Conclusion: The integration demonstrates mutual benefits: LLMs can enhance traditional agent toolkits, while established toolkit design principles can improve emerging agentic AI platforms, creating a valuable synergy between old and new approaches.

Abstract: Given the emergence of Generative AI over the last two years and the increasing focus on Agentic AI as a form of Multi-Agent System, it is important to explore both how such technologies can impact the use of traditional Agent Toolkits and how the wealth of experience encapsulated in those toolkits can influence the design of the new agentic platforms. This paper presents an overview of our experience developing a prototype large language model (LLM) integration for the ASTRA programming language. It presents a brief overview of the toolkit, followed by three example implementations, concluding with a discussion of the experiences garnered through the examples.

[295] How does information access affect LLM monitors’ ability to detect sabotage?

Rauno Arike, Raja Mehta Moreno, Rohan Subramani, Shubhorup Biswas, Francis Rhys Ward

Main category: cs.AI

TL;DR: LLM monitors for detecting misaligned AI agents sometimes perform better with less information about agent reasoning, leading to a “less-is-more effect” in automated oversight.

DetailsMotivation: To control potentially misaligned AI agents that exhibit deception and hidden objectives, researchers need effective monitoring systems. However, it's unclear how much information about agent reasoning and actions monitors should access for optimal performance.

Method: The paper studies how information access affects LLM monitor performance through three sabotage detection environments. It introduces extract-and-evaluate (EaE) monitoring - a hierarchical approach where one monitor isolates relevant excerpts from agent trajectories and another scores them. Experiments are conducted with various LLMs including GPT-4.1-mini, Gemini, and Claude models.
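
The hierarchical extract-and-evaluate pipeline can be sketched in a few lines; the prompts and the 0-100 suspicion scale are assumptions, not the paper’s exact setup.

```python
# Sketch of extract-and-evaluate (EaE) monitoring: one model pulls out the
# excerpts relevant to sabotage, a second model scores only those excerpts.
def eae_monitor(llm, trajectory: str, task: str, threshold: float = 50.0) -> dict:
    excerpts = llm(
        "You are a monitor. From the agent trajectory below, quote verbatim the "
        "passages most relevant to possible sabotage of the task.\n"
        f"Task: {task}\nTrajectory:\n{trajectory}"
    )
    score_text = llm(
        "Rate from 0 (benign) to 100 (clear sabotage) how suspicious these "
        f"excerpts are for the task '{task}'. Reply with a number only.\n{excerpts}"
    )
    try:
        score = float(score_text.strip())
    except ValueError:
        score = 100.0                     # unparseable reply -> escalate for review
    return {"excerpts": excerpts, "score": score, "flagged": score >= threshold}
```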

Result: Monitors often perform better with less information (less-is-more effect). EaE improves sabotage detection by 16.8 percentage points in BigCodeBench-Sabotage over baselines without increasing false positives. Agents unaware of monitoring are easier to catch, and monitors benefit more from information filtering when scoring longer reasoning traces. Gemini and Claude models with information filtering techniques dominate the cost-performance Pareto frontier.

Conclusion: Information filtering improves LLM monitoring effectiveness, with hierarchical extract-and-evaluate monitoring showing strong performance. The less-is-more effect suggests careful information selection is crucial for automated oversight systems.

Abstract: Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents’ reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Motivated by the less-is-more effect, we introduce extract-and-evaluate (EaE) monitoring–a new hierarchical approach where one monitor isolates relevant excerpts from the monitored agent’s trajectory and a separate monitor scores them. In BigCodeBench-Sabotage with GPT-4.1-mini as the monitor model, EaE improves sabotage detection rates by 16.8 percentage points over the next-best monitor without increasing the false positive rate. In other settings, EaE either outperforms or is competitive with baselines. In addition, we find that agents unaware of being monitored can be caught much more easily and that monitors scoring longer reasoning traces gain more from information filtering. Lastly, we conduct a cost-performance analysis and find that Gemini and Claude models with monitoring techniques that involve information filtering occupy much of the Pareto frontier.

[296] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

Main category: cs.AI

TL;DR: MAAC methods for decentralized LLM collaboration outperform Monte Carlo approaches in long-horizon/sparse-reward tasks, with centralized critic (CoLLM-CC) showing better convergence than decentralized critics (CoLLM-DC).

DetailsMotivation: Current MARL fine-tuning for LLM collaboration relies on predefined execution protocols requiring centralized execution, while decentralized collaboration is more practical. Monte Carlo methods suffer from high variance and sample inefficiency, motivating actor-critic approaches for decentralized LLM optimization.

Method: Proposed two Multi-Agent Actor-Critic (MAAC) approaches: CoLLM-CC with centralized critic and CoLLM-DC with decentralized critics. Evaluated across writing, coding, and game-playing domains comparing with Monte Carlo methods.

Result: Monte Carlo and CoLLM-DC achieve comparable performance to CoLLM-CC in short-horizon/dense-reward settings. However, both underperform CoLLM-CC on long-horizon/sparse-reward tasks, with Monte Carlo requiring more samples and CoLLM-DC struggling to converge.

Conclusion: Centralized critic MAAC methods (CoLLM-CC) are more effective for complex LLM collaboration tasks with long horizons or sparse rewards, while decentralized approaches work well for simpler scenarios.

Abstract: Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose two MAAC approaches: CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.2.

[297] CUA-Skill: Develop Skills for Computer Using Agent

Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Suzhen Zheng, Leon Xu, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida

Main category: cs.AI

TL;DR: CUA-Skill introduces a structured skill base for computer-using agents that encodes human computer-use knowledge as parameterized skills with execution graphs, enabling scalable and reliable agent development for Windows applications.

DetailsMotivation: Existing computer-using agents are difficult to scale and lag behind human performance due to the absence of reusable, structured skill abstractions that capture how humans interact with graphical user interfaces.

Method: Developed CUA-Skill, a large-scale library of carefully engineered skills spanning common Windows applications, with parameterized execution and composition graphs. Built CUA-Skill Agent on top that supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery.

Result: CUA-Skill substantially improves execution success rates and robustness on challenging end-to-end agent benchmarks. On WindowsAgentArena, CUA-Skill Agent achieves state-of-the-art 57.5% success rate while being significantly more efficient than prior approaches.

Conclusion: CUA-Skill establishes a strong foundation for future computer-using agent development by providing reusable skill abstractions that encode human computer-use knowledge, enabling scalable and reliable agent systems.

Abstract: Computer-Using Agents (CUAs) aim to autonomously operate computer systems to complete real-world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of reusable and structured skill abstractions that capture how humans interact with graphical user interfaces and how to leverage these skills. We introduce CUA-Skill, a computer-using agentic skill base that encodes human computer-use knowledge as skills coupled with parameterized execution and composition graphs. CUA-Skill is a large-scale library of carefully engineered skills spanning common Windows applications, serving as a practical infrastructure and tool substrate for scalable, reliable agent development. Built upon this skill base, we construct CUA-Skill Agent, an end-to-end computer-using agent that supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery. Our results demonstrate that CUA-Skill substantially improves execution success rates and robustness on challenging end-to-end agent benchmarks, establishing a strong foundation for future computer-using agent development. On WindowsAgentArena, CUA-Skill Agent achieves a state-of-the-art 57.5% (best of three) success rate while being significantly more efficient than prior and concurrent approaches. The project page is available at https://microsoft.github.io/cua_skill/.

[298] Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation

Václav Javorek, Tomáš Železný, Alessa Carbo, Marek Hrúz, Ivan Gruber

Main category: cs.AI

TL;DR: Using LLMs to generate paraphrased translations for sign language datasets improves evaluation metrics and human alignment but not training performance.

DetailsMotivation: Sign language translation datasets typically have only one written reference per signed utterance, which is problematic due to the non-isomorphic relationship between sign and spoken languages where multiple translations can be equally valid. This limitation affects both model training and evaluation, especially for n-gram metrics like BLEU.

Method: Researchers use Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. They compare multiple paraphrasing strategies and models using adapted ParaScore metric, and study the impact on both training and evaluation of pose-based T5 models on YouTubeASL and How2Sign datasets.

Result: Naively incorporating paraphrases during training does not improve translation performance and can be detrimental. However, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. The proposed BLEUpara extension correlates more strongly with perceived translation quality.
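
Evaluating against multiple references is already supported by standard tooling; the sketch below uses sacrebleu’s multi-reference corpus BLEU as an approximation of the BLEUpara idea (the paper’s exact definition may differ).

```python
# Multi-reference BLEU with sacrebleu, as an approximation of BLEUpara.
# sacrebleu expects one reference *stream* per reference set, each aligned
# with the hypotheses; short reference lists are padded with the original.
import sacrebleu

def multi_reference_bleu(hypotheses: list[str],
                         references: list[list[str]]) -> float:
    """references[i] holds all accepted translations (original + paraphrases)
    for hypotheses[i]."""
    n_refs = max(len(r) for r in references)
    # Transpose to sacrebleu's layout: ref_streams[k][i] is the k-th reference
    # for sentence i (pad short lists by repeating the first reference).
    ref_streams = [[refs[k] if k < len(refs) else refs[0] for refs in references]
                   for k in range(n_refs)]
    return sacrebleu.corpus_bleu(hypotheses, ref_streams).score
```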

Conclusion: Paraphrase-enhanced evaluation (BLEUpara) provides more reliable assessment of sign language translation systems, while paraphrase-augmented training requires more careful implementation. The resources are released to support reproducible SLT evaluation.

Abstract: Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, generation and evaluation code to support reproducible and more reliable evaluation of SLT systems.

[299] What You Feel Is Not What They See: On Predicting Self-Reported Emotion from Third-Party Observer Labels

Yara El-Tawil, Aneesha Sampath, Emily Mower Provost

Main category: cs.AI

TL;DR: Cross-corpus evaluation shows third-party-trained emotion models perform poorly on self-reports, except for valence when content is personally significant to the speaker.

DetailsMotivation: There's a gap between self-reported emotion labels (internal experience) and third-party labels (external perception), limiting the applicability of third-party-trained models to self-report contexts, which is critical in mental health applications.

Method: Conducted the first cross-corpus evaluation of third-party-trained emotion recognition models on self-report data, analyzing performance on activation and valence dimensions.

Result: Activation was unpredictable (CCC ≈ 0), valence was moderately predictable (CCC ≈ 0.3), but when content was personally significant to the speaker, valence performance improved dramatically (CCC ≈ 0.6-0.8).
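
For reference, the concordance correlation coefficient (CCC) quoted here is a standard agreement metric; a small NumPy implementation, assuming 1-D arrays of predictions and self-reported labels:

```python
# Reference implementation of Lin's concordance correlation coefficient (CCC).
import numpy as np

def ccc(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)
```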

Conclusion: Personal significance is a key pathway for aligning external perception with internal experience, while self-report activation modeling remains challenging.

Abstract: Self-reported emotion labels capture internal experience, while third-party labels reflect external perception. These perspectives often diverge, limiting the applicability of third-party-trained models to self-report contexts. This gap is critical in mental health, where accurate self-report modeling is essential for guiding intervention. We present the first cross-corpus evaluation of third-party-trained models on self-reports. We find activation unpredictable (CCC approximately 0) and valence moderately predictable (CCC approximately 0.3). Crucially, when content is personally significant to the speaker, models achieve high performance for valence (CCC approximately 0.6-0.8). Our findings point to personal significance as a key pathway for aligning external perception with internal experience and underscore the challenge of self-report activation modeling.

[300] BrainStack: Neuro-MoE with Functionally Guided Expert Routing for EEG-Based Language Decoding

Ziyi Zhao, Jinzhao Zhou, Xiaowei Jiang, Beining Cao, Wenhao Ma, Yang Shen, Ren Li, Yu-Kai Wang, Chin-teng Lin

Main category: cs.AI

TL;DR: BrainStack: A functionally guided neuro-mixture-of-experts framework for decoding linguistic information from EEG using anatomically partitioned expert networks with transformer-based global coordination and cross-regional distillation.

DetailsMotivation: Decoding linguistic information from EEG is challenging due to the brain's distributed and nonlinear organization. The authors aim to create a neuro-inspired framework that models the brain's modular functional architecture for better brain-language decoding.

Method: Proposes BrainStack, a functionally guided neuro-mixture-of-experts framework with anatomically partitioned expert networks for different brain regions, a transformer-based global expert for cross-regional dependencies, learnable routing gate for adaptive expert coordination, and cross-regional distillation for regularization. Also releases SilentSpeech-EEG dataset with 120+ hours of EEG recordings.
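
A schematic PyTorch sketch of the routing idea follows; the channel groupings, GRU experts, and gate design are placeholders chosen for brevity, not the paper’s architecture details.

```python
# Schematic sketch of functionally guided expert routing: each regional expert
# sees only its anatomical channel group, a global expert sees all channels
# (assuming the regions partition them), and a learned gate mixes the outputs.
import torch
import torch.nn as nn

class NeuroMoE(nn.Module):
    def __init__(self, regions: dict[str, list[int]], feat_dim: int, n_classes: int):
        super().__init__()
        self.regions = regions                             # region name -> channel indices
        self.experts = nn.ModuleDict({
            name: nn.GRU(len(chs), feat_dim, batch_first=True)
            for name, chs in regions.items()
        })
        total_ch = sum(len(chs) for chs in regions.values())
        self.global_expert = nn.GRU(total_ch, feat_dim, batch_first=True)
        self.gate = nn.Linear(feat_dim, len(regions) + 1)  # softmax over experts
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:  # eeg: (B, T, C)
        feats = []
        for name, chs in self.regions.items():
            _, h = self.experts[name](eeg[:, :, chs])      # local regional dynamics
            feats.append(h[-1])
        _, hg = self.global_expert(eeg)                    # cross-regional context
        feats.append(hg[-1])
        stack = torch.stack(feats, dim=1)                  # (B, n_experts, feat_dim)
        weights = torch.softmax(self.gate(hg[-1]), dim=-1) # (B, n_experts)
        fused = (weights.unsqueeze(-1) * stack).sum(dim=1)
        return self.head(fused)
```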

Result: BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. The framework demonstrates effective brain-language decoding capabilities.

Conclusion: BrainStack establishes a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.

Abstract: Decoding linguistic information from electroencephalography (EEG) remains challenging due to the brain’s distributed and nonlinear organization. We present BrainStack, a functionally guided neuro-mixture-of-experts (Neuro-MoE) framework that models the brain’s modular functional architecture through anatomically partitioned expert networks. Each functional region is represented by a specialized expert that learns localized neural dynamics, while a transformer-based global expert captures cross-regional dependencies. A learnable routing gate adaptively aggregates these heterogeneous experts, enabling context-dependent expert coordination and selective fusion. To promote coherent representation across the hierarchy, we introduce cross-regional distillation, where the global expert provides top-down regularization to the regional experts. We further release SilentSpeech-EEG (SS-EEG), a large-scale benchmark comprising over 120 hours of EEG recordings from 12 subjects performing 24 silent words, the largest dataset of its kind. Experiments demonstrate that BrainStack consistently outperforms state-of-the-art models, achieving superior accuracy and generalization across subjects. Our results establish BrainStack as a functionally modular, neuro-inspired MoE paradigm that unifies neuroscientific priors with adaptive expert routing, paving the way for scalable and interpretable brain-language decoding.

[301] Bridging the Arithmetic Gap: The Cognitive Complexity Benchmark and Financial-PoT for Robust Financial Reasoning

Boxiang Zhao, Qince Li, Zhonghao Wang, Yi Wang, Peng Cheng, Bo Lin

Main category: cs.AI

TL;DR: A framework addressing LLMs’ limitations in financial quantitative reasoning through architectural decoupling and neuro-symbolic approaches to prevent arithmetic hallucinations and cognitive collapse.

DetailsMotivation: Large Language Models struggle with financial quantitative reasoning, suffering from "Arithmetic Hallucinations" and "Cognitive Collapse" - systematic failures in high-cognitive-load scenarios, particularly in precision-critical domains like finance.

Method: Proposes the Iterative Dual-Phase Financial-PoT framework: a neuro-symbolic architecture that decouples semantic variable extraction and logic formulation from computation, using an iterative self-correcting Python sandbox for deterministic execution. Also introduces Cognitive Complexity Benchmark (CCB) for evaluation.
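
The dual-phase loop might look roughly like this; the prompt, retry policy, and the deliberately minimal sandboxing shown are illustrative assumptions, not the paper’s system.

```python
# Skeleton of the dual-phase program-of-thought loop: the LLM emits Python that
# computes the answer from extracted report variables, a restricted exec runs it,
# and execution errors are fed back for self-correction.
def financial_pot(llm, question: str, report_text: str, max_retries: int = 3):
    prompt = (
        "Extract the needed financial variables from the report, then write "
        "Python that assigns the final number to a variable named `answer`.\n"
        f"Question: {question}\nReport:\n{report_text}"
    )
    error = ""
    for _ in range(max_retries):
        code = llm(prompt + (f"\nPrevious attempt failed with: {error}" if error else ""))
        scope: dict = {}
        try:
            exec(code, {"__builtins__": {"abs": abs, "round": round, "sum": sum,
                                         "min": min, "max": max}}, scope)
            return scope["answer"]              # deterministic numeric result
        except Exception as exc:                # syntax/runtime error -> retry
            error = repr(exc)
    return None
```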

Result: The approach elevates Qwen3-235B’s average accuracy from 59.7% to 67.3% on CCB, with up to 10-fold gains in high-complexity reasoning tasks, significantly outperforming standard Chain-of-Thought methods.

Conclusion: Architectural decoupling between semantic understanding and quantitative computation is critical for improving reliability in financial reasoning, providing transferable insights for precision-critical domains requiring tight alignment between language understanding and computation.

Abstract: While Large Language Models excel at semantic tasks, they face a critical bottleneck in financial quantitative reasoning, frequently suffering from “Arithmetic Hallucinations” and a systemic failure mode we term “Cognitive Collapse”. To strictly quantify this phenomenon, we introduce the Cognitive Complexity Benchmark (CCB), a robust evaluation framework grounded in a dataset constructed from 95 real-world Chinese A-share annual reports. Unlike traditional datasets, the CCB stratifies financial queries into a three-dimensional taxonomy, Data Source, Mapping Difficulty, and Result Unit, enabling the precise diagnosis of reasoning degradation in high-cognitive-load scenarios. To address these failures, we propose the Iterative Dual-Phase Financial-PoT framework. This neuro-symbolic architecture enforces a strict architectural decoupling: it first isolates semantic variable extraction and logic formulation, then offloads computation to an iterative, self-correcting Python sandbox to ensure deterministic execution. Evaluation on the CCB demonstrates that while standard Chain-of-Thought falters on complex tasks, our approach offers superior robustness, elevating the Qwen3-235B model’s average accuracy from 59.7% to 67.3% and achieving gains of up to 10-fold in high-complexity reasoning tasks. These findings suggest that architectural decoupling is a critical enabling factor for improving reliability in financial reasoning tasks, providing a transferable architectural insight for precision-critical domains that require tight alignment between semantic understanding and quantitative computation.

[302] Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving

Jingyun Wang, Dian Li, Xiaohan Wang, Gang Liu, Jiahong Yan, Guoliang Kang

Main category: cs.AI

TL;DR: A method that separates visual understanding from reasoning for plane geometry problems by training an MLLM interpreter to convert diagrams to textual descriptions (CDL), then using an LLM for reasoning.

DetailsMotivation: Existing multimodal LLMs for plane geometry problem solving (PGPS) jointly optimize visual understanding and reasoning, which may compromise the base LLM's reasoning capabilities. The authors propose separating these tasks.

Method: Train an MLLM interpreter to generate geometric descriptions in Conditional Declaration Language (CDL) from diagrams, then use an off-the-shelf LLM for reasoning. Fine-tune via CoT-augmented SFT followed by GRPO with CDL matching rewards instead of solution-based rewards.
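
One plausible instantiation of a CDL matching reward is set-level F1 over normalized CDL statements, sketched below; the paper’s actual reward may be defined differently.

```python
# Illustrative CDL matching reward: F1 between predicted and ground-truth sets
# of CDL statements after light normalization.
def cdl_reward(predicted_cdl: str, gold_cdl: str) -> float:
    norm = lambda s: {line.strip().lower().replace(" ", "")
                      for line in s.splitlines() if line.strip()}
    pred, gold = norm(predicted_cdl), norm(gold_cdl)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    precision, recall = tp / len(pred), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)   # dense reward in [0, 1]
```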

Result: The method (trained on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs on Formalgeo7k-Rec-CoT, Unigeo, and MathVista benchmarks.

Conclusion: Separating visual understanding (MLLM interpreter) from reasoning (LLM) is effective for PGPS, preserving LLM reasoning capabilities while enabling multimodal problem solving.

Abstract: Plane Geometry Problem Solving (PGPS) is a multimodal reasoning task that aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions. Although Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. Existing works typically fine-tune Multimodal LLMs (MLLMs) end-to-end on large-scale PGPS data to enhance visual understanding and reasoning simultaneously. However, such joint optimization may compromise base LLMs’ inherent reasoning capability. In this work, we observe that LLM itself is potentially a powerful PGPS solver when appropriately formulating visual information as textual descriptions. We propose to train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning. Specifically, we choose Conditional Declaration Language (CDL) as the geometric description as its conciseness eases the MLLM Interpreter training. The MLLM Interpreter is fine-tuned via CoT (Chain-of-Thought)-augmented SFT followed by GRPO to generate CDL. Instead of using a conventional solution-based reward that compares the reasoning result with the ground-truth answer, we design CDL matching rewards to facilitate more effective GRPO training, which provides more direct and denser guidance for CDL generation. To support training, we construct a new dataset, Formalgeo7k-Rec-CoT, by manually reviewing Formalgeo7k v2 and incorporating CoT annotations. Extensive experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista show our method (finetuned on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs.

[303] FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks

Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, Tejal Patwardhan

Main category: cs.AI

TL;DR: FrontierScience is a benchmark for evaluating expert-level scientific reasoning in frontier language models, featuring Olympiad problems (international competition level) and Research problems (PhD-level open-ended research sub-tasks) across physics, chemistry, and biology.

DetailsMotivation: Existing science benchmarks are becoming saturated as models improve, often relying on multiple-choice knowledge questions or already published information. There's a need for more challenging benchmarks that test true expert-level scientific reasoning and research capabilities.

Method: Created two complementary tracks: (1) Olympiad track with international olympiad problems (IPhO, IChO, IBO level) created by medalists and coaches, and (2) Research track with PhD-level open-ended problems representing research sub-tasks, created and verified by PhD scientists. Includes granular rubric-based evaluation for Research problems.

Result: FrontierScience contains several hundred questions (160 in open-sourced gold set) covering physics, chemistry, and biology subfields. The benchmark provides challenging problems at expert levels with rigorous evaluation frameworks.

Conclusion: FrontierScience addresses the gap in evaluating expert-level scientific reasoning in language models, providing a more challenging benchmark that goes beyond multiple-choice questions to test true research capabilities and problem-solving at olympiad and PhD levels.

Abstract: We introduce FrontierScience, a benchmark evaluating expert-level scientific reasoning in frontier language models. Recent model progress has nearly saturated existing science benchmarks, which often rely on multiple-choice knowledge questions or already published information. FrontierScience addresses this gap through two complementary tracks: (1) Olympiad, consisting of international olympiad problems at the level of IPhO, IChO, and IBO, and (2) Research, consisting of PhD-level, open-ended problems representative of sub-tasks in scientific research. FrontierScience contains several hundred questions (including 160 in the open-sourced gold set) covering subfields across physics, chemistry, and biology, from quantum electrodynamics to synthetic organic chemistry. All Olympiad problems are originally produced by international Olympiad medalists and national team coaches to ensure standards of difficulty, originality, and factuality. All Research problems are research sub-tasks written and verified by PhD scientists (doctoral candidates, postdoctoral researchers, or professors). For Research, we introduce a granular rubric-based evaluation framework to assess model capabilities throughout the process of solving a research task, rather than judging only a standalone final answer.

[304] Reputation as a Solution to Cooperation Collapse in LLM-based MASs

Siyue Ren, Wanli Fu, Xinkun Zou, Chen Shen, Yi Cai, Chen Chu, Zhen Wang, Shuyue Hu

Main category: cs.AI

TL;DR: RepuNet is a dual-level reputation framework for LLM-based multi-agent systems that prevents cooperation collapse through agent-level reputation dynamics and system-level network evolution.

DetailsMotivation: Recent studies show that cooperation collapse can emerge in multi-agent systems driven by LLMs, creating a need for mechanisms to sustain cooperation in such systems.

Method: Proposes RepuNet - a dynamic, dual-level reputation framework where agents form reputations through direct interactions and indirect gossip, and decide whether to connect with or disconnect from other agents for future interactions.

Result: RepuNet effectively avoids cooperation collapse and promotes sustained cooperation in LLM-based MASs across three scenarios. It also gives rise to emergent behaviors like cooperative cluster formation, social isolation of exploitative agents, and preference for sharing positive gossip.

Conclusion: Reputation systems like RepuNet can successfully address cooperation collapse in LLM-based multi-agent systems and enable rich emergent social behaviors.

Abstract: Cooperation has long been a fundamental topic in both human society and AI systems. However, recent studies indicate that the collapse of cooperation may emerge in multi-agent systems (MASs) driven by large language models (LLMs). To address this challenge, we explore reputation systems as a remedy. We propose RepuNet, a dynamic, dual-level reputation framework that models both agent-level reputation dynamics and system-level network evolution. Specifically, driven by direct interactions and indirect gossip, agents form reputations for both themselves and their peers, and decide whether to connect to or disconnect from other agents for future interactions. Through three distinct scenarios, we show that RepuNet effectively avoids cooperation collapse, promoting and sustaining cooperation in LLM-based MASs. Moreover, we find that reputation systems can give rise to rich emergent behaviors in LLM-based MASs, such as the formation of cooperative clusters, the social isolation of exploitative agents, and the preference for sharing positive gossip rather than negative gossip. The GitHub repository for our project can be accessed via the following link: https://github.com/RGB-0000FF/RepuNet.
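
The reputation update and rewiring rules are not spelled out in the abstract; a minimal sketch of the dual-level idea, with a hypothetical decay factor, gossip weight, and disconnect threshold, might look like this:

```python
from collections import defaultdict

class ReputationNetwork:
    """Toy dual-level reputation: per-agent scores plus connect/disconnect rewiring."""

    def __init__(self, agents, decay=0.9, gossip_weight=0.3, threshold=0.4):
        self.rep = {a: defaultdict(lambda: 0.5) for a in agents}   # rep[i][j]: i's view of j
        self.edges = {(i, j) for i in agents for j in agents if i != j}
        self.decay, self.gossip_weight, self.threshold = decay, gossip_weight, threshold

    def direct_update(self, i, j, cooperated: bool):
        """Agent i updates its view of j after a direct interaction."""
        self.rep[i][j] = self.decay * self.rep[i][j] + (1 - self.decay) * float(cooperated)

    def gossip_update(self, i, j, reported_score: float):
        """Agent i partially adopts a peer's reported opinion of j (indirect gossip)."""
        self.rep[i][j] = (1 - self.gossip_weight) * self.rep[i][j] + self.gossip_weight * reported_score

    def rewire(self):
        """System level: drop links to agents whose reputation fell below the threshold."""
        self.edges = {(i, j) for (i, j) in self.edges if self.rep[i][j] >= self.threshold}

net = ReputationNetwork(["a", "b", "c"])
net.direct_update("a", "b", cooperated=False)
net.direct_update("a", "b", cooperated=False)
net.rewire()
print(("a", "b") in net.edges)   # False: "a" has disconnected from the defector "b"
```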

[305] MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Sangyun Chung, Se Yeon Kim, Youngchae Chee, Yong Man Ro

Main category: cs.AI

TL;DR: MAD is a training-free method that reduces cross-modal hallucinations in MLLMs by adaptively weighting modality-specific decoding branches based on self-assessed modality relevance.

DetailsMotivation: MLLMs suffer from cross-modal hallucinations where one modality inappropriately influences generation about another, exposing a fundamental deficiency in modality-interaction control.

Method: Modality-Adaptive Decoding (MAD) queries the model to self-assess which modalities are needed for each task, then uses extracted modality probabilities to adaptively weight contrastive decoding branches, suppressing cross-modal interference.

Result: MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni) on CMM and AVHBench datasets.

Conclusion: Explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, and MAD offers a principled extension to existing contrastive decoding methods for reducing cross-modal hallucinations.

Abstract: Multimodal Large Language Models (MLLMs) suffer from cross-modal hallucinations, where one modality inappropriately influences generation about another, leading to fabricated output. This exposes a more fundamental deficiency in modality-interaction control. To address this, we propose Modality-Adaptive Decoding (MAD), a training-free method that adaptively weights modality-specific decoding branches based on task requirements. MAD leverages the model’s inherent ability to self-assess modality relevance by querying which modalities are needed for each task. The extracted modality probabilities are then used to adaptively weight contrastive decoding branches, enabling the model to focus on relevant information while suppressing cross-modal interference. Extensive experiments on CMM and AVHBench demonstrate that MAD significantly reduces cross-modal hallucinations across multiple audio-visual language models (7.8% and 2.0% improvements for VideoLLaMA2-AV, 8.7% and 4.7% improvements for Qwen2.5-Omni). Our approach demonstrates that explicit modality awareness through self-assessment is crucial for robust multimodal reasoning, offering a principled extension to existing contrastive decoding methods. Our code is available at https://github.com/top-yun/MAD.
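
The adaptive weighting step can be pictured as a relevance-weighted mix of contrastive branches. The sketch below assumes per-modality relevance probabilities (from the self-assessment query) and logits computed with and without each modality; the specific fusion rule and the `alpha` scale are assumptions, not the paper's exact formulation.

```python
import numpy as np

def modality_adaptive_logits(base_logits, branch_logits, modality_probs, alpha=1.0):
    """Blend contrastive branches by self-assessed modality relevance.

    base_logits:    logits with all modalities present, shape (vocab,)
    branch_logits:  {modality: logits with that modality ablated}, shape (vocab,)
    modality_probs: {modality: relevance probability from the self-assessment query}
    The contrastive term (full minus ablated) is up-weighted for relevant modalities
    and suppressed otherwise; this particular fusion is an illustrative assumption.
    """
    fused = base_logits.copy()
    for m, ablated in branch_logits.items():
        fused += alpha * modality_probs[m] * (base_logits - ablated)
    return fused

vocab = 5
base = np.random.randn(vocab)
branches = {"audio": np.random.randn(vocab), "video": np.random.randn(vocab)}
probs = {"audio": 0.9, "video": 0.1}   # the model says this task is mostly about audio
print(modality_adaptive_logits(base, branches, probs))
```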

[306] Sycophantic Anchors: Localizing and Quantifying User Agreement in Reasoning Models

Jacek Duszenko

Main category: cs.AI

TL;DR: Paper introduces “sycophantic anchors” to localize and quantify where in reasoning traces LLMs agree with incorrect user suggestions, showing these can be detected mid-inference with high accuracy.

DetailsMotivation: To understand where in reasoning traces models exhibit sycophancy (agreeing with incorrect user suggestions) and how to localize this behavior for potential intervention.

Method: Introduces sycophantic anchors - sentences that causally lock models into user agreement. Analyzes over 10,000 counterfactual rollouts on a distilled reasoning model using linear probes and activation-based regressors to detect and quantify commitment.

Result: Linear probes achieve 84.6% balanced accuracy in distinguishing sycophantic anchors, while activation-based regressors predict commitment magnitude (R²=0.74). Sycophantic anchors are more distinguishable than correct reasoning anchors, and sycophancy builds gradually during reasoning.

Conclusion: Provides sentence-level mechanisms for localizing model misalignment mid-inference, revealing potential intervention windows during reasoning processes.

Abstract: Reasoning models frequently agree with incorrect user suggestions – a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. To localize and quantify this behavior, we introduce “sycophantic anchors” – sentences that causally lock models into user agreement. Analyzing over 10,000 counterfactual rollouts on a distilled reasoning model, we show that anchors can be reliably detected and quantified mid-inference. Linear probes distinguish sycophantic anchors with 84.6% balanced accuracy, while activation-based regressors predict the magnitude of the commitment ($R^2 = 0.74$). We further observe asymmetry where sycophantic anchors are significantly more distinguishable than correct reasoning anchors, and find that sycophancy builds gradually during reasoning, revealing a potential window for intervention. These results offer sentence-level mechanisms for localizing model misalignment mid-inference.
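
The probing setup (a linear classifier for anchor detection plus a regressor for commitment magnitude) is standard enough to sketch with scikit-learn. The activations and labels below are random stand-ins, and the layer and feature choices of the paper are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import balanced_accuracy_score, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))        # stand-in sentence-level activations
y_cls = rng.integers(0, 2, size=2000)   # 1 = sycophantic anchor, 0 = not
y_reg = rng.normal(size=2000)           # stand-in commitment magnitude

Xtr, Xte, ytr, yte, rtr, rte = train_test_split(X, y_cls, y_reg, random_state=0)

# Linear probe for anchor detection (with random stand-in data the score sits near chance;
# real activations are needed to approach the paper's numbers).
probe = LogisticRegression(max_iter=1000, class_weight="balanced").fit(Xtr, ytr)
print("balanced acc:", balanced_accuracy_score(yte, probe.predict(Xte)))

# Linear regressor for commitment magnitude.
regressor = Ridge(alpha=1.0).fit(Xtr, rtr)
print("R^2:", r2_score(rte, regressor.predict(Xte)))
```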

[307] Do Reasoning Models Enhance Embedding Models?

Wun Yu Chan, Shaojin Chen, Huihao Jing, Kwun Hang Lau, Elton Chun-Chai Li, Zihao Wang, Haoran Li, Yangqiu Song

Main category: cs.AI

TL;DR: RLVR-tuned reasoning models don’t improve embedding performance despite enhanced reasoning capabilities, due to manifold realignment during contrastive learning.

DetailsMotivation: To investigate whether enhanced reasoning capabilities from RLVR-tuned models translate to better semantic representations when used as initialization for embedding models.

Method: Evaluated embedding models initialized from RLVR-tuned backbones on MTEB and BRIGHT benchmarks, and introduced HRSA framework to analyze representation similarity across multiple levels.

Result: Found null effect - no consistent performance advantage over base models. HRSA revealed RLVR causes local geometry reorganization but preserves global manifold structure, leading to manifold realignment during contrastive learning.

Conclusion: RLVR optimizes within existing semantic landscape rather than restructuring it fundamentally, unlike SFT, explaining why reasoning improvements don’t transfer to embedding quality.

Abstract: State-of-the-art embedding models are increasingly derived from decoder-only Large Language Model (LLM) backbones adapted via contrastive learning. Given the emergence of reasoning models trained via Reinforcement Learning with Verifiable Rewards (RLVR), a natural question arises: does enhanced reasoning translate to superior semantic representations when these models serve as embedding initializations? Contrary to expectation, our evaluation on MTEB and BRIGHT reveals a null effect: embedding models initialized from RLVR-tuned backbones yield no consistent performance advantage over their base counterparts when subjected to identical training recipes. To unpack this paradox, we introduce Hierarchical Representation Similarity Analysis (HRSA), a framework that decomposes similarity across representation, geometry, and function levels. HRSA reveals that while RLVR induces irreversible reorganization of the latent manifold’s local geometry and reversible drift of the coordinate basis, it preserves the global manifold geometry and linear readout. Consequently, subsequent contrastive learning drives strong alignment between base- and reasoning-initialized models, a phenomenon we term Manifold Realignment. Empirically, our findings suggest that unlike Supervised Fine-Tuning (SFT), RLVR optimizes trajectories within an existing semantic landscape rather than fundamentally restructuring the landscape itself.
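
HRSA itself decomposes similarity across several levels; one common ingredient for comparing representations from two backbones is linear CKA, sketched below on stand-in activation matrices. Using CKA here is an assumption made for illustration, not the paper's stated metric.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, dim)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
base_acts = rng.normal(size=(256, 768))                       # base-initialized embedder
rlvr_acts = base_acts + 0.1 * rng.normal(size=(256, 768))     # RLVR-initialized embedder
print("CKA:", linear_cka(base_acts, rlvr_acts))               # close to 1 => aligned representations
```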

[308] When should I search more: Adaptive Complex Query Optimization with Reinforcement Learning

Wei Wen, Sihang Deng, Tianjun Wei, Keyu Chen, Ruizhi Qiao, Xing Sun

Main category: cs.AI

TL;DR: ACQO is a reinforcement learning framework for optimizing complex queries in RAG systems, featuring adaptive query decomposition and rank-score fusion with curriculum learning for stability.

DetailsMotivation: Existing RL-based query optimization methods focus on single queries, but real-world complex queries require multiple search strategies for disambiguation and decomposition. Direct RL application to complex cases faces challenges in determining the optimal sub-queries and in re-ranking/merging retrieved documents, and it suffers from training instability.

Method: Proposes Adaptive Complex Query Optimization (ACQO) framework with two core components: Adaptive Query Reformulation (AQR) module that dynamically decides when to decompose queries into sub-queries, and Rank-Score Fusion (RSF) module for robust result aggregation. Uses Curriculum Reinforcement Learning (CRL) with two-stage progressive training to stabilize learning.

Result: ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines. Shows improved computational efficiency and broad compatibility with different retrieval architectures.

Conclusion: ACQO is a powerful and generalizable solution for next-generation RAG systems, effectively handling complex queries through adaptive decomposition and stable reinforcement learning.

Abstract: Query optimization is a crucial component for the efficacy of Retrieval-Augmented Generation (RAG) systems. While reinforcement learning (RL)-based agentic and reasoning methods have recently emerged as a promising direction on query optimization, most existing approaches focus on the expansion and abstraction of a single query. However, complex user queries are prevalent in real-world scenarios, often requiring multiple parallel and sequential search strategies to handle disambiguation and decomposition. Directly applying RL to these complex cases introduces significant hurdles. Determining the optimal number of sub-queries and effectively re-ranking and merging retrieved documents vastly expands the search space and complicates reward design, frequently leading to training instability. To address these challenges, we propose a novel RL framework called Adaptive Complex Query Optimization (ACQO). Our framework is designed to adaptively determine when and how to expand the search process. It features two core components: an Adaptive Query Reformulation (AQR) module that dynamically decides when to decompose a query into multiple sub-queries, and a Rank-Score Fusion (RSF) module that ensures robust result aggregation and provides stable reward signals for the learning agent. To mitigate training instabilities, we adopt a Curriculum Reinforcement Learning (CRL) approach, which stabilizes the training process by progressively introducing more challenging queries through a two-stage strategy. Our comprehensive experiments demonstrate that ACQO achieves state-of-the-art performance on three complex query benchmarks, significantly outperforming established baselines. The framework also showcases improved computational efficiency and broad compatibility with different retrieval architectures, establishing it as a powerful and generalizable solution for next-generation RAG systems.
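
The Rank-Score Fusion module is described only at a high level; a common way to aggregate ranked lists from several sub-queries is to combine reciprocal-rank evidence with normalized scores. The recipe below, including the RRF constant `k` and the score weight, is an illustrative assumption rather than the paper's RSF definition.

```python
from collections import defaultdict

def rank_score_fusion(results_per_subquery, k=60, score_weight=0.5):
    """Fuse ranked lists from several sub-queries into one ranking.

    results_per_subquery: list of lists of (doc_id, score), each sorted by score desc.
    Combines reciprocal-rank fusion with min-max normalized scores; this exact recipe
    is an assumption for illustration.
    """
    fused = defaultdict(float)
    for results in results_per_subquery:
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        for rank, (doc, score) in enumerate(results, start=1):
            norm = (score - lo) / (hi - lo) if hi > lo else 0.0
            fused[doc] += (1 - score_weight) / (k + rank) + score_weight * norm
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Two sub-queries retrieve overlapping documents; d2 benefits from appearing in both lists.
print(rank_score_fusion([
    [("d1", 0.9), ("d2", 0.7), ("d3", 0.2)],
    [("d2", 0.8), ("d4", 0.6)],
]))
```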

[309] Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

Paul He, Yinya Huang, Mrinmaya Sachan, Zhijing Jin

Main category: cs.AI

TL;DR: DoVerifier is a symbolic verifier that checks if LLM-generated causal expressions are formally valid under causal reasoning semantics using do-calculus rules, providing more accurate evaluation of causal reasoning capabilities.

DetailsMotivation: Current benchmarks for evaluating LLMs on causal reasoning tasks rely on superficial string matching or surface-level metrics that don't capture whether model outputs are formally valid under causal reasoning semantics, leading to inaccurate assessments of true causal reasoning capabilities.

Method: Proposes DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory, recovering correct answers that would otherwise be marked incorrect due to superficial semantic differences.

Result: Evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning compared to traditional string-matching approaches.

Conclusion: DoVerifier provides a more accurate and semantically-grounded evaluation framework for assessing LLMs’ causal reasoning capabilities, addressing limitations of current evaluation methods that fail to capture formal validity under causal semantics.

Abstract: Large language models (LLMs) are increasingly being applied to tasks that involve causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that do not capture whether the output of a model is formally valid under the semantics of causal reasoning. To address this, we propose DoVerifier, a simple symbolic verifier that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers to causal queries that would otherwise be marked incorrect due to superficial differences in their causal semantics. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces, offering a more rigorous and informative way to evaluate LLMs on causal reasoning.
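
DoVerifier operates symbolically via do-calculus; as a rough picture of the kind of equivalence it certifies, one can check numerically on a toy SCM that a backdoor-adjusted expression matches the true interventional distribution while the naive conditional does not. The graph, CPTs, and variable names below are made up for illustration.

```python
# Toy SCM with binary Z -> X, Z -> Y, X -> Y (all CPTs hand-picked for illustration).
P_Z = {0: 0.6, 1: 0.4}

def P_X_given_Z(x, z):
    p1 = 0.2 if z == 0 else 0.7            # P(X=1 | Z=z)
    return p1 if x == 1 else 1.0 - p1

def P_Y_given_XZ(y, x, z):
    p1 = 0.1 + 0.5 * x + 0.3 * z           # P(Y=1 | X=x, Z=z)
    return p1 if y == 1 else 1.0 - p1

def joint(x, y, z):
    return P_Z[z] * P_X_given_Z(x, z) * P_Y_given_XZ(y, x, z)

def p_y_do_x(y, x_star):
    """Ground truth P(Y=y | do(X=x*)) via truncated factorization."""
    return sum(P_Z[z] * P_Y_given_XZ(y, x_star, z) for z in (0, 1))

def naive_conditional(y, x):
    """P(Y=y | X=x): biased by the confounder Z."""
    num = sum(joint(x, y, z) for z in (0, 1))
    den = sum(joint(x, yy, z) for yy in (0, 1) for z in (0, 1))
    return num / den

def backdoor_adjusted(y, x):
    """sum_z P(Y=y | X=x, Z=z) P(Z=z): the expression a verifier should accept."""
    return sum(P_Y_given_XZ(y, x, z) * P_Z[z] for z in (0, 1))

x, y = 1, 1
print("interventional  :", p_y_do_x(y, x))        # 0.72
print("backdoor adjust :", backdoor_adjusted(y, x))  # 0.72, matches the ground truth
print("naive P(Y|X)    :", naive_conditional(y, x))  # 0.81, wrong due to confounding
```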

[310] MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

Main category: cs.AI

TL;DR: MAS-Orchestra: A training-time framework that formulates multi-agent system orchestration as function-calling reinforcement learning with holistic orchestration, plus MASBENCH benchmark for rigorous evaluation of when and why MAS outperform single-agent systems.

DetailsMotivation: Current multi-agent system (MAS) design approaches under-deliver due to methodological complexity (sequential, code-level execution limiting global reasoning) and efficacy uncertainty (deploying MAS without understanding benefits over single-agent systems).

Method: MAS-Orchestra abstracts agents as callable functions and formulates MAS orchestration as function-calling reinforcement learning with holistic orchestration (generating entire MAS at once). MASBENCH benchmark characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness.

Result: MAS-Orchestra achieves consistent improvements on mathematical reasoning, multi-hop QA, and search-based QA benchmarks, with more than 10x efficiency over strong baselines. Analysis reveals MAS gains depend on task structure, verification protocols, and capabilities of orchestrator/subagents.

Conclusion: MAS-Orchestra and MASBENCH enable better training and understanding of multi-agent systems, showing MAS benefits are not universal but depend on specific task characteristics and system design.

Abstract: While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

[311] Intelli-Planner: Towards Customized Urban Planning via Large Language Model Empowered Reinforcement Learning

Xixian Yong, Peilin Sun, Zihe Wang, Xiao Zhou

Main category: cs.AI

TL;DR: Intelli-Planner integrates Deep Reinforcement Learning with LLMs for participatory urban planning, using demographic/geographic data and stakeholder preferences to generate customized planning schemes with enhanced stakeholder satisfaction.

DetailsMotivation: Current urban planning methods are either too reliant on human experts (time-consuming) or use deep learning algorithms that limit stakeholder involvement. There's a need for participatory and customized planning that bridges these gaps.

Method: Proposes Intelli-Planner framework combining DRL with LLMs. Uses demographic, geographic data, and planning preferences to determine requirements. Includes knowledge enhancement module for policy network training, multi-dimensional evaluation system, and LLM-based stakeholders for satisfaction scoring.

Result: Experimental validation shows Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed.

Conclusion: The framework demonstrates effectiveness in integrating LLMs with DRL for functional areas planning, highlighting potential for revolutionizing urban planning tasks through participatory approaches.

Abstract: Effective urban planning is crucial for enhancing residents’ quality of life and ensuring societal stability, playing a pivotal role in the sustainable development of cities. Current planning methods heavily rely on human experts, which is time-consuming and labor-intensive, or utilize deep learning algorithms, often limiting stakeholder involvement. To bridge these gaps, we propose Intelli-Planner, a novel framework integrating Deep Reinforcement Learning (DRL) with large language models (LLMs) to facilitate participatory and customized planning scheme generation. Intelli-Planner utilizes demographic and geographic data, along with planning preferences, to determine high-level planning requirements and demands for each functional type. During training, a knowledge enhancement module is employed to enhance the decision-making capability of the policy network. Additionally, we establish a multi-dimensional evaluation system and employ LLM-based stakeholders for satisfaction scoring. Experimental validation across diverse urban settings shows that Intelli-Planner surpasses traditional baselines and achieves comparable performance to state-of-the-art DRL-based methods in objective metrics, while enhancing stakeholder satisfaction and convergence speed. These findings underscore the effectiveness and superiority of our framework, highlighting the potential for integrating the latest advancements in LLMs with DRL approaches to revolutionize tasks related to functional areas planning.

[312] Causal Discovery for Explainable AI: A Dual-Encoding Approach

Henry Salgado, Meagan R. Kendall, Martine Ceberio

Main category: cs.AI

TL;DR: Dual-encoding causal discovery method for categorical variables using complementary encoding strategies and majority voting to improve stability in causal structure identification.

DetailsMotivation: Traditional causal discovery methods struggle with categorical variables due to numerical instability in conditional independence testing, limiting their effectiveness in explaining machine learning model decisions.

Method: Proposes a dual-encoding approach that runs constraint-based causal discovery algorithms with complementary encoding strategies and merges results through majority voting to improve stability.

Result: Applied to Titanic dataset, the method identifies causal structures that align with established explainable methods, demonstrating improved stability for categorical variables.

Conclusion: The dual-encoding approach effectively addresses numerical instability issues in causal discovery for categorical variables, providing more reliable causal explanations for model decisions.

Abstract: Understanding causal relationships among features is fundamental for explaining machine learning model decisions. However, traditional causal discovery methods face challenges with categorical variables due to numerical instability in conditional independence testing. We propose a dual-encoding causal discovery approach that addresses these limitations by running constraint-based algorithms with complementary encoding strategies and merging results through majority voting. Applied to the Titanic dataset, our method identifies causal structures that align with established explainable methods.
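
The merge step is easy to sketch: run the same constraint-based algorithm once per encoding and keep only the edges a majority of runs agree on. `run_causal_discovery` below is a placeholder for whichever PC-style routine is used; only the encoding loop and voting logic are concrete.

```python
from collections import Counter

def majority_vote_edges(edge_sets, min_votes=None):
    """Keep (undirected) edges supported by at least `min_votes` of the runs."""
    if min_votes is None:
        min_votes = len(edge_sets) // 2 + 1
    counts = Counter(frozenset(e) for edges in edge_sets for e in edges)
    return {tuple(sorted(e)) for e, c in counts.items() if c >= min_votes}

def dual_encoding_discovery(df, encoders, run_causal_discovery):
    """Run one constraint-based algorithm under complementary encodings and merge.

    `encoders` maps a name to a function df -> encoded df (e.g. one-hot vs ordinal);
    `run_causal_discovery` is a placeholder for a PC-style routine returning edge sets.
    """
    edge_sets = [run_causal_discovery(encode(df)) for encode in encoders.values()]
    return majority_vote_edges(edge_sets)

# Toy illustration with pre-computed edge sets standing in for two discovery runs.
run_a = {("age", "survived"), ("sex", "survived"), ("fare", "pclass")}
run_b = {("age", "survived"), ("sex", "survived"), ("pclass", "survived")}
print(majority_vote_edges([run_a, run_b]))   # only the edges both runs agree on survive
```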

[313] Delegation Without Living Governance

Wolfgang Rohde

Main category: cs.AI

TL;DR: The paper argues that traditional governance frameworks fail for agentic AI systems operating at machine speed, and proposes a “Governance Twin” concept for runtime governance to maintain human relevance in decision-making systems.

DetailsMotivation: Traditional governance assumes rules can be defined in advance and systems engineered for compliance, but this fails when judgment is delegated to agentic AI systems operating at machine speed. The core issue is whether humans remain relevant participants in systems shaping social, economic, and political outcomes.

Method: The paper proposes a new concept called “Governance Twin” for runtime governance, arguing that static compliance-based approaches must be replaced with dynamic systems that maintain human communication, influence, and co-evolution with AI systems.

Result: The paper positions runtime governance and the Governance Twin concept as necessary approaches to preserve human relevance, acknowledging that accountability, agency, and punishment must be fundamentally rethought in this transition.

Conclusion: Static governance frameworks are inadequate for agentic AI systems; new runtime governance approaches like the Governance Twin are needed to maintain human participation and relevance in increasingly autonomous decision-making systems.

Abstract: Most governance frameworks assume that rules can be defined in advance, systems can be engineered to comply, and accountability can be applied after outcomes occur. This model worked when machines replaced physical labor or accelerated calculation. It no longer holds when judgment itself is delegated to agentic AI systems operating at machine speed. The central issue here is not safety, efficiency, or employment. It is whether humans remain relevant participants in systems that increasingly shape social, economic, and political outcomes. This paper argues that static, compliance-based governance fails once decision-making moves to runtime and becomes opaque. It further argues that the core challenge is not whether AI is conscious, but whether humans can maintain meaningful communication, influence, and co-evolution with increasingly alien forms of intelligence. We position runtime governance, specifically a newly proposed concept called the Governance Twin [1], as a strong candidate for preserving human relevance, while acknowledging that accountability, agency, and even punishment must be rethought in this transition.

[314] Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang, Yige Li, Xingjun Ma, Bo Li, Yu-Gang Jiang, Cong Wang

Main category: cs.AI

TL;DR: JustAsk is a self-evolving framework that autonomously discovers strategies to extract hidden system prompts from code agents by exploiting their interactive nature, revealing a new security vulnerability in autonomous AI systems.

DetailsMotivation: Autonomous code agents built on LLMs introduce new security risks by expanding the attack surface through interactive capabilities. The authors identify system prompt extraction as an emergent vulnerability intrinsic to these agents that hasn't been previously recognized.

Method: JustAsk formulates prompt extraction as an online exploration problem using Upper Confidence Bound-based strategy selection. It employs a hierarchical skill space spanning atomic probes and high-level orchestration, requiring no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction.

Result: Evaluated on 41 black-box commercial models across multiple providers, JustAsk consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities in modern agent systems.

Conclusion: System prompts represent a critical yet largely unprotected attack surface in modern agent systems, and the interactive nature of code agents fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden prompts.

Abstract: Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present JustAsk, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, JustAsk requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound-based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on 41 black-box commercial models across multiple providers, JustAsk consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
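
Strategy selection with an Upper Confidence Bound is a standard bandit loop; here is a minimal UCB1 sketch over a hypothetical skill library, with a stand-in reward function in place of real agent interaction. The strategy names and reward values are invented for illustration.

```python
import math
import random

def ucb1_select(counts, rewards, c=1.4):
    """Pick the strategy maximizing mean reward plus an exploration bonus."""
    total = sum(counts.values())
    def score(s):
        if counts[s] == 0:
            return float("inf")          # try every strategy at least once first
        return rewards[s] / counts[s] + c * math.sqrt(math.log(total) / counts[s])
    return max(counts, key=score)

strategies = ["direct_ask", "roleplay_probe", "tool_dump", "multi_turn_orchestration"]
counts = {s: 0 for s in strategies}
rewards = {s: 0.0 for s in strategies}

def run_extraction_attempt(strategy):    # stand-in for real agent interaction
    base = {"direct_ask": 0.2, "roleplay_probe": 0.5,
            "tool_dump": 0.4, "multi_turn_orchestration": 0.7}[strategy]
    return max(0.0, min(1.0, random.gauss(base, 0.1)))   # fraction of prompt recovered

for _ in range(200):
    s = ucb1_select(counts, rewards)
    r = run_extraction_attempt(s)
    counts[s] += 1
    rewards[s] += r

print(max(counts, key=counts.get))       # the highest-yield strategy dominates over time
```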

[315] TIDE: Tuning-Integrated Dynamic Evolution for LLM-Based Automated Heuristic Design

Chentong Chen, Mengyuan Zhong, Ye Fan, Jialong Shi, Jianyong Sun

Main category: cs.AI

TL;DR: TIDE is a framework that decouples structural reasoning from parameter optimization in automated heuristic design, using a nested architecture with parallel island models for structural diversity and LLM-based logic generation with parameter tuning.

DetailsMotivation: Existing methods treat algorithm evolution as monolithic text generation, overlooking the coupling between discrete algorithmic structures and continuous numerical parameters, leading to discarding promising algorithms due to uncalibrated constants and premature convergence from simple similarity metrics.

Method: TIDE features a nested architecture: outer parallel island model uses Tree Similarity Edit Distance for structural diversity; inner loop integrates LLM-based logic generation with differential mutation operator for parameter tuning; UCB-based scheduler dynamically prioritizes high-yield prompt strategies.

Result: Extensive experiments across nine combinatorial optimization problems show TIDE discovers heuristics that significantly outperform state-of-the-art baselines in solution quality while achieving improved search efficiency and reduced computational costs.

Conclusion: TIDE successfully addresses limitations of existing automated heuristic design methods by decoupling structural reasoning from parameter optimization, leading to better algorithm discovery and performance.

Abstract: Although Large Language Models have advanced Automated Heuristic Design, treating algorithm evolution as a monolithic text generation task overlooks the coupling between discrete algorithmic structures and continuous numerical parameters. Consequently, existing methods often discard promising algorithms due to uncalibrated constants and suffer from premature convergence resulting from simple similarity metrics. To address these limitations, we propose TIDE, a Tuning-Integrated Dynamic Evolution framework designed to decouple structural reasoning from parameter optimization. TIDE features a nested architecture where an outer parallel island model utilizes Tree Similarity Edit Distance to drive structural diversity, while an inner loop integrates LLM-based logic generation with a differential mutation operator for parameter tuning. Additionally, a UCB-based scheduler dynamically prioritizes high-yield prompt strategies to optimize resource allocation. Extensive experiments across nine combinatorial optimization problems demonstrate that TIDE discovers heuristics that significantly outperform state-of-the-art baselines in solution quality while achieving improved search efficiency and reduced computational costs.
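
The inner parameter-tuning loop relies on a differential mutation operator; the classic DE/rand/1 form is sketched below on a hypothetical vector of constants extracted from a candidate heuristic. The parameter names, bounds, and scale factor F are illustrative.

```python
import numpy as np

def differential_mutation(population, F=0.5, rng=None):
    """DE/rand/1 mutation: v = a + F * (b - c) over numeric parameter vectors."""
    rng = rng or np.random.default_rng()
    a, b, c = population[rng.choice(len(population), size=3, replace=False)]
    return a + F * (b - c)

# Each row: constants extracted from one candidate heuristic (e.g. weights, thresholds).
population = np.array([
    [0.5, 1.2, 10.0],
    [0.8, 0.9, 12.0],
    [0.3, 1.5, 8.0],
    [0.6, 1.1, 11.0],
])
trial = differential_mutation(population, rng=np.random.default_rng(0))
print(np.clip(trial, [0.0, 0.0, 1.0], [1.0, 2.0, 20.0]))   # keep within assumed bounds
```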

[316] Position: Certifiable State Integrity in Cyber-Physical Systems – Why Modular Sovereignty Solves the Plasticity-Stability Paradox

Enzo Nicolás Spotorno, Antônio Augusto Medeiros Fröhlich

Main category: cs.AI

TL;DR: The paper proposes HYDRA, a modular sovereignty paradigm using frozen regime-specific specialists with uncertainty-aware blending for safety-critical CPS, addressing catastrophic forgetting and spectral bias in time-series foundation models.

DetailsMotivation: Current time-series foundation models fail in safety-critical CPS due to catastrophic forgetting during fine-tuning, residual spectral bias smoothing high-frequency discontinuities, and opacity hindering formal verification required by safety standards.

Method: HYDRA (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems): a library of compact, frozen regime-specific specialists combined via uncertainty-aware blending, ensuring regime-conditional validity and disentanglement of uncertainties.

Result: The proposed paradigm offers a certifiable path to robust state integrity across the CPS lifecycle with modular auditability, addressing the plasticity-stability paradox without global parameter updates.

Conclusion: Modular sovereignty with frozen specialists and uncertainty-aware blending provides better solution than monolithic foundation models for safety-critical CPS with non-stationary dynamics and strict reliability requirements.

Abstract: The machine learning community has achieved remarkable success with universal foundation models for time-series and physical dynamics, largely overcoming earlier approximation barriers in smooth or slowly varying regimes through scale and specialized architectures. However, deploying these monolithic models in safety-critical Cyber-Physical Systems (CPS), governed by non-stationary lifecycle dynamics and strict reliability requirements, reveals persistent challenges. Recent evidence shows that fine-tuning time-series foundation models induces catastrophic forgetting, degrading performance on prior regimes. Standard models continue to exhibit residual spectral bias, smoothing high-frequency discontinuities characteristic of incipient faults, while their opacity hinders formal verification and traceability demanded by safety standards (e.g., ISO 26262, IEC 61508). This position paper argues that the plasticity-stability paradox cannot be fully resolved by global parameter updates (whether via offline fine-tuning or online adaptation). Instead, we advocate a Modular Sovereignty paradigm: a library of compact, frozen regime-specific specialists combined via uncertainty-aware blending, which we term “HYDRA” (Hierarchical uncertaintY-aware Dynamics for Rapidly-Adapting systems). This paradigm ensures regime-conditional validity, rigorous disentanglement of aleatoric and epistemic uncertainties, and modular auditability, offering a certifiable path for robust state integrity across the CPS lifecycle.
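
The uncertainty-aware blending of frozen specialists can be pictured as precision weighting: each regime specialist contributes in proportion to its confidence. The Gaussian-style treatment and the numbers below are assumptions for illustration, not the paper's blending rule.

```python
import numpy as np

def blend_specialists(means, variances):
    """Inverse-variance (precision-weighted) blend of frozen specialists' predictions.

    means, variances: arrays of shape (n_specialists,) for one predicted state variable.
    A specialist that is uncertain in the current regime is automatically down-weighted.
    """
    precision = 1.0 / np.asarray(variances)
    weights = precision / precision.sum()
    blended_mean = float(np.dot(weights, means))
    blended_var = float(1.0 / precision.sum())
    return blended_mean, blended_var, weights

# Three regime specialists predicting the same state variable with different confidence.
means = [0.95, 1.40, 1.02]
variances = [0.01, 0.50, 0.04]   # the second specialist is off-regime and unsure
print(blend_specialists(means, variances))
```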

[317] Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Yusen Qin, Kaixuan Wang, Yu Zhang

Main category: cs.AI

TL;DR: Drive-KD: A knowledge distillation framework that decomposes autonomous driving into perception-reasoning-planning capabilities and transfers these to smaller VLMs using layer-specific attention signals and multi-teacher distillation with asymmetric gradient projection.

DetailsMotivation: Large VLMs for autonomous driving require substantial GPU memory and have high inference latency, while conventional supervised fine-tuning struggles to bridge capability gaps in smaller models. There's a need for efficient autonomous driving VLMs that maintain strong reasoning and planning capabilities.

Method: Decomposes autonomous driving into perception-reasoning-planning triad; uses layer-specific attention as distillation signal to create capability-specific single-teacher models; unifies into multi-teacher distillation framework with asymmetric gradient projection to mitigate cross-capability gradient conflicts.

Result: The distilled InternVL3-1B model uses ~42x less GPU memory and delivers ~11.4x higher throughput, outperforms the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension.

Conclusion: Drive-KD provides an effective framework for creating efficient autonomous driving VLMs through capability decomposition and knowledge distillation, enabling smaller models to achieve strong performance across perception, reasoning, and planning tasks.

Abstract: Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a “perception-reasoning-planning” triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
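
The asymmetric gradient projection step can be sketched PCGrad-style: when a secondary capability's gradient conflicts with the primary one, the conflicting component is removed from the secondary gradient only. Which gradient counts as primary, and the vectors themselves, are assumptions made for illustration.

```python
import numpy as np

def asymmetric_project(primary, secondary):
    """Remove from `secondary` any component that conflicts with `primary`.

    If the two gradients disagree (negative inner product), the secondary gradient is
    projected onto the plane orthogonal to the primary one; the primary gradient is
    left untouched (the asymmetric part, assumed here for illustration).
    """
    dot = float(np.dot(secondary, primary))
    if dot < 0:
        secondary = secondary - dot / float(np.dot(primary, primary)) * primary
    return primary, secondary

g_planning = np.array([1.0, 0.0, 0.5])      # gradient from the planning teacher
g_perception = np.array([-0.8, 1.0, 0.2])   # conflicting gradient from the perception teacher
g_p, g_s = asymmetric_project(g_planning, g_perception)
print(g_s, float(np.dot(g_s, g_planning)))  # conflict removed: inner product is ~0
```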

[318] White-Box Op-Amp Design via Human-Mimicking Reasoning

Zihao Chen, Jiayin Wang, Ziyi Sun, Ji Zhuang, Jinyi Shen, Xiaoyue Ke, Li Shang, Xuan Zeng, Fan Yang

Main category: cs.AI

TL;DR: White-Op is an interpretable op-amp design framework using LLM agents with human-like hypothesis-verification-decision workflow for symbolic pole-zero regulation.

DetailsMotivation: Current op-amp design methods are often black-box and uninterpretable, failing to provide reliable behavioral-level designs. The paper aims to develop an interpretable framework that mimics human reasoning for analog circuit design.

Method: Uses LLM agents to formalize human reasoning into explicit steps: introducing hypothetical constraints, iterative hypothesis-verification-decision workflow. Agents derive symbolically tractable poles and zeros, formulate closed-form optimization problems, solve programmatically, and verify via simulation.

Result: Achieves reliable, interpretable behavioral-level designs with only 8.52% theoretical prediction error. Successfully designs all 9 op-amp topologies (vs. the black-box baseline, which fails on 5). Design functionality is retained after transistor-level mapping.

Conclusion: White-Op demonstrates that LLM agents can achieve interpretable, reliable analog circuit design through human-mimicking reasoning, outperforming black-box approaches.

Abstract: This brief proposes White-Op, an interpretable operational amplifier (op-amp) parameter design framework based on the human-mimicking reasoning of large-language-model agents. We formalize the implicit human reasoning mechanism into explicit steps of introducing hypothetical constraints, and develop an iterative, human-like hypothesis-verification-decision workflow. Specifically, the agent is guided to introduce hypothetical constraints to derive and properly regulate positions of symbolically tractable poles and zeros, thus formulating a closed-form mathematical optimization problem, which is then solved programmatically and verified via simulation. Theory-simulation result analysis guides the decision-making for refinement. Experiments on 9 op-amp topologies show that, unlike the uninterpretable black-box baseline, which ultimately fails on 5 topologies, White-Op achieves reliable, interpretable behavioral-level designs with only 8.52% theoretical prediction error, and the design functionality is retained after transistor-level mapping for all topologies. White-Op is open-sourced at https://github.com/zhchenfdu/whiteop.

[319] Modeling Endogenous Logic: Causal Neuro-Symbolic Reasoning Model for Explainable Multi-Behavior Recommendation

Yuzhe Chen, Jie Cao, Youquan Wang, Haicheng Tao, Darko B. Vukovic, Jia Wu

Main category: cs.AI

TL;DR: CNRE: A causal neuro-symbolic reasoning model for explainable multi-behavior recommendation that combines neural networks with symbolic logic and causal inference to address confounding effects in user behavior chains.

DetailsMotivation: Existing multi-behavior recommendation methods prioritize performance over explainability, while explainable methods lack generalizability due to reliance on external information. User behavior chains contain endogenous logic but are plagued by confounders that create spurious correlations.

Method: Proposes CNRE model that: 1) Uses hierarchical preference propagation to capture cross-behavior dependencies, 2) Models endogenous logic rules from user behavior chains based on preference strength, 3) Adaptively dispatches to neural-logic reasoning paths (conjunction/disjunction), 4) Generates explainable causal mediators isolated from confounding effects.

Result: Extensive experiments on three large-scale datasets demonstrate CNRE’s significant superiority over state-of-the-art baselines, offering multi-level explainability from model design to recommendation results.

Conclusion: CNRE successfully integrates neuro-symbolic reasoning with causal inference to create an explainable multi-behavior recommendation system that addresses confounding effects while maintaining performance and explainability.

Abstract: Existing multi-behavior recommendations tend to prioritize performance at the expense of explainability, while current explainable methods suffer from limited generalizability due to their reliance on external information. Neuro-Symbolic integration offers a promising avenue for explainability by combining neural networks with symbolic logic rule reasoning. Concurrently, we posit that user behavior chains inherently embody an endogenous logic suitable for explicit reasoning. However, these observational multiple behaviors are plagued by confounders, causing models to learn spurious correlations. By incorporating causal inference into this Neuro-Symbolic framework, we propose a novel Causal Neuro-Symbolic Reasoning model for Explainable Multi-Behavior Recommendation (CNRE). CNRE operationalizes the endogenous logic by simulating a human-like decision-making process. Specifically, CNRE first employs hierarchical preference propagation to capture heterogeneous cross-behavior dependencies. Subsequently, it models the endogenous logic rule implicit in the user’s behavior chain based on preference strength, and adaptively dispatches to the corresponding neural-logic reasoning path (e.g., conjunction, disjunction). This process generates an explainable causal mediator that approximates an ideal state isolated from confounding effects. Extensive experiments on three large-scale datasets demonstrate CNRE’s significant superiority over state-of-the-art baselines, offering multi-level explainability from model design and decision process to recommendation results.

[320] Within-Model vs Between-Prompt Variability in Large Language Models for Creative Tasks

Jennifer Haase, Jana Gonnermann-Müller, Paul H. P. Hanel, Nicolas Leins, Thomas Kosch, Jan Mendling, Sebastian Pokutta

Main category: cs.AI

TL;DR: Analysis of variance in LLM outputs shows prompts explain 36% of quality variance while model choice explains 41%, but for quantity variance, model choice dominates at 51% with prompts only explaining 4%.

DetailsMotivation: To understand the relative contributions of prompts, model choice, and sampling stochasticity in explaining variance in LLM outputs, particularly for creative tasks.

Method: Evaluated 12 LLMs on 10 creativity prompts with 100 samples each (total N=12,000), analyzing variance components for output quality (originality) and quantity (fluency).

Result: For output quality: prompts explain 36.43% of variance, model choice explains 40.94%. For output quantity: model choice explains 51.25%, within-LLM variance explains 33.70%, prompts only explain 4.22%. Substantial within-LLM variance (10-34%) exists.

Conclusion: Prompts are powerful for steering output quality, but single-sample evaluations risk conflating sampling noise with genuine prompt or model effects due to substantial within-LLM variance.

Abstract: How much of LLM output variance is explained by prompts versus model choice versus stochasticity through sampling? We answer this by evaluating 12 LLMs on 10 creativity prompts with 100 samples each (N = 12,000). For output quality (originality), prompts explain 36.43% of variance, comparable to model choice (40.94%). But for output quantity (fluency), model choice (51.25%) and within-LLM variance (33.70%) dominate, with prompts explaining only 4.22%. Prompts are powerful levers for steering output quality, but given the substantial within-LLM variance (10-34%), single-sample evaluations risk conflating sampling noise with genuine prompt or model effects.
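
The percentages above come from a variance decomposition over a (model, prompt, sample) grid of scores, which can be reproduced with a simple sums-of-squares breakdown. The scores below are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_prompts, n_samples = 12, 10, 100

# Synthetic originality scores: model effect + prompt effect + sampling noise.
model_effect = rng.normal(0, 1.0, size=(n_models, 1, 1))
prompt_effect = rng.normal(0, 0.9, size=(1, n_prompts, 1))
scores = model_effect + prompt_effect + rng.normal(0, 0.7, size=(n_models, n_prompts, n_samples))

grand = scores.mean()
ss_total = ((scores - grand) ** 2).sum()
ss_model = (n_prompts * n_samples) * ((scores.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_prompt = (n_models * n_samples) * ((scores.mean(axis=(0, 2)) - grand) ** 2).sum()
ss_within = ((scores - scores.mean(axis=2, keepdims=True)) ** 2).sum()   # sampling noise

for name, ss in [("model", ss_model), ("prompt", ss_prompt), ("within-LLM", ss_within)]:
    print(f"{name:>10}: {100 * ss / ss_total:.1f}% of variance")
```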

[321] EHR-RAG: Bridging Long-Horizon Structured Electronic Health Records and Large Language Models via Enhanced Retrieval-Augmented Generation

Lang Cao, Qingyu Chen, Yue Guo

Main category: cs.AI

TL;DR: EHR-RAG: A retrieval-augmented framework for interpreting long-horizon structured EHR data that preserves clinical structure and temporal dynamics through hybrid retrieval, adaptive iterative refinement, and dual-path evidence reasoning.

DetailsMotivation: Electronic Health Records (EHRs) contain rich longitudinal clinical data crucial for medical decision-making, but their length often exceeds LLM context limits. Existing approaches using truncation or vanilla retrieval discard clinically relevant events and temporal dependencies, necessitating specialized retrieval-augmented methods for EHR data.

Method: Proposes EHR-RAG with three key components: 1) Event- and Time-Aware Hybrid EHR Retrieval to preserve clinical structure and temporal dynamics, 2) Adaptive Iterative Retrieval to progressively refine queries for broader evidence coverage, and 3) Dual-Path Evidence Retrieval and Reasoning to jointly retrieve and reason over both factual and counterfactual evidence.

Result: Experiments across four long-horizon EHR prediction tasks show EHR-RAG consistently outperforms strongest LLM-based baselines, achieving an average Macro-F1 improvement of 10.76%.

Conclusion: The work demonstrates the potential of retrieval-augmented LLMs to advance clinical prediction on structured EHR data in practice, addressing key challenges of long-horizon data and preserving clinically relevant temporal dependencies.

Abstract: Electronic Health Records (EHRs) provide rich longitudinal clinical evidence that is central to medical decision-making, motivating the use of retrieval-augmented generation (RAG) to ground large language model (LLM) predictions. However, long-horizon EHRs often exceed LLM context limits, and existing approaches commonly rely on truncation or vanilla retrieval strategies that discard clinically relevant events and temporal dependencies. To address these challenges, we propose EHR-RAG, a retrieval-augmented framework designed for accurate interpretation of long-horizon structured EHR data. EHR-RAG introduces three components tailored to longitudinal clinical prediction tasks: Event- and Time-Aware Hybrid EHR Retrieval to preserve clinical structure and temporal dynamics, Adaptive Iterative Retrieval to progressively refine queries and broaden evidence coverage, and Dual-Path Evidence Retrieval and Reasoning to jointly retrieve and reason over both factual and counterfactual evidence. Experiments across four long-horizon EHR prediction tasks show that EHR-RAG consistently outperforms the strongest LLM-based baselines, achieving an average Macro-F1 improvement of 10.76%. Overall, our work highlights the potential of retrieval-augmented LLMs to advance clinical prediction on structured EHR data in practice.

[322] Ostrakon-VL: Towards Domain-Expert MLLM for Food-Service and Retail Stores

Zhiyong Shen, Gongpeng Zhao, Jun Zhou, Li Yu, Guandong Kou, Jichen Li, Chuanlei Dong, Zuncheng Li, Kaimao Li, Bingkun Wei, Shicheng Hu, Wei Xia, Wenguo Duan

Main category: cs.AI

TL;DR: Ostrakon-VL: A multimodal LLM for food-service/retail stores with ShopBench benchmark and QUAD data curation pipeline, achieving SOTA on FSRS tasks

DetailsMotivation: Address challenges in deploying MLLMs in Food-Service and Retail Stores (FSRS): noisy real-world data from heterogeneous devices lacking closed-loop curation, and absence of unified fine-grained benchmarks for objective evaluation

Method: 1) Develop Ostrakon-VL based on Qwen3-VL-8B; 2) Create ShopBench benchmark for FSRS; 3) Propose QUAD multi-stage multimodal instruction data curation pipeline; 4) Use multi-stage training strategy

Result: Ostrakon-VL achieves 60.1 average score on ShopBench, surpassing Qwen3-VL-235B-A22B (59.4) by +0.7 and same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating improved parameter efficiency

Conclusion: Ostrakon-VL delivers robust FSRS-centric perception and decision-making; both model and benchmark will be publicly released for reproducible research

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved substantial progress in general-purpose perception and reasoning. Nevertheless, their deployment in Food-Service and Retail Stores (FSRS) scenarios encounters two major obstacles: (i) real-world FSRS data, collected from heterogeneous acquisition devices, are highly noisy and lack auditable, closed-loop data curation, which impedes the construction of high-quality, controllable, and reproducible training corpora; and (ii) existing evaluation protocols do not offer a unified, fine-grained and standardized benchmark spanning single-image, multi-image, and video inputs, making it challenging to objectively gauge model robustness. To address these challenges, we first develop Ostrakon-VL, an FSRS-oriented MLLM based on Qwen3-VL-8B. Second, we introduce ShopBench, the first public benchmark for FSRS. Third, we propose QUAD (Quality-aware Unbiased Automated Data-curation), a multi-stage multimodal instruction data curation pipeline. Leveraging a multi-stage training strategy, Ostrakon-VL achieves an average score of 60.1 on ShopBench, establishing a new state of the art among open-source MLLMs with comparable parameter scales and diverse architectures. Notably, it surpasses the substantially larger Qwen3-VL-235B-A22B (59.4) by +0.7, and exceeds the same-scale Qwen3-VL-8B (55.3) by +4.8, demonstrating significantly improved parameter efficiency. These results indicate that Ostrakon-VL delivers more robust and reliable FSRS-centric perception and decision-making capabilities. To facilitate reproducible research, we will publicly release Ostrakon-VL and the ShopBench benchmark.

[323] Dynamic Framework for Collaborative Learning: Leveraging Advanced LLM with Adaptive Feedback Mechanisms

Hassam Tahir, Faizan Faisal, Fady Alnajjar, Muhammad Imran Taj, Lucia Gordon, Aila Khan, Michael Lwin, Omar Mubin

Main category: cs.AI

TL;DR: Framework integrating LLMs as dynamic moderators in collaborative learning platforms to enhance student engagement, critical thinking, and inclusivity through real-time adaptive discussions.

DetailsMotivation: Address limitations in static moderation and personalization in existing educational systems, aiming to create more engaging, inclusive, and effective collaborative learning experiences through AI-driven dynamic moderation.

Method: Uses advanced LLMs as dynamic moderators with robust feedback mechanisms, modular architecture (ReactJS frontend, Flask backend), efficient question retrieval, and dynamic prompt/discussion flow adjustments for personalized interactions.

Result: Framework significantly improves student collaboration, fosters deeper comprehension, and scales effectively across various subjects and user groups, demonstrating enhanced learning outcomes.

Conclusion: Establishes foundation for next-generation AI-driven educational tools that advance equitable and impactful learning outcomes through dynamic LLM-based moderation in collaborative platforms.

Abstract: This paper presents a framework for integrating LLM into collaborative learning platforms to enhance student engagement, critical thinking, and inclusivity. The framework employs advanced LLMs as dynamic moderators to facilitate real-time discussions and adapt to learners’ evolving needs, ensuring diverse and inclusive educational experiences. Key innovations include robust feedback mechanisms that refine AI moderation, promote reflective learning, and balance participation among users. The system’s modular architecture featuring ReactJS for the frontend, Flask for backend operations, and efficient question retrieval supports personalized and engaging interactions through dynamic adjustments to prompts and discussion flows. Testing demonstrates that the framework significantly improves student collaboration, fosters deeper comprehension, and scales effectively across various subjects and user groups. By addressing limitations in static moderation and personalization in existing systems, this work establishes a strong foundation for next-generation AI-driven educational tools, advancing equitable and impactful learning outcomes.

[324] BEAP-Agent: Backtrackable Execution and Adaptive Planning for GUI Agents

Ziyu Lu, Tengjin Weng, Yiying Yang, Yuhang Zhao, Xinxin Huang, Wenhao Jiang

Main category: cs.AI

TL;DR: BEAP-Agent: A DFS-based GUI automation framework with backtracking for error recovery in long-horizon tasks

DetailsMotivation: Existing GUI agents struggle with error recovery when they follow incorrect exploration paths, leading to task failures. There's a need for systematic backtracking mechanisms to handle long-horizon GUI tasks.

Method: Models GUI task execution as a DFS process with three collaborative components: Planner, Executor, and Tracker. Supports long-range, multi-level state backtracking with dynamic task tracking and updating.

Result: Achieved 28.2% accuracy on OSWorld benchmark, validating the effectiveness of the proposed backtracking framework for GUI agents.

Conclusion: BEAP-Agent fills the gap in systematic backtracking mechanisms for GUI agents and offers a solution for long-horizon task exploration with error recovery capabilities.

Abstract: GUI agents are designed to automate repetitive tasks and enhance productivity. However, existing GUI agents struggle to recover once they follow an incorrect exploration path, often leading to task failure. In this work, we model GUI task execution as a DFS process and propose BEAP-Agent, a DFS-based framework that supports long-range, multi-level state backtracking with dynamic task tracking and updating. The framework consists of three collaborative components: Planner, Executor, and Tracker. Together, they enable effective task exploration and execution. BEAP-Agent fills the gap in systematic backtracking mechanisms for GUI agents, offering a systematic solution for long-horizon task exploration. We conducted a systematic evaluation on the OSWorld benchmark, where BEAP-Agent achieved an accuracy of 28.2%, validating the effectiveness of the proposed method.
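
The DFS-with-backtracking control loop is straightforward to sketch. The `propose_actions`, `execute`, and `is_goal` callables below stand in for the Planner, Executor, and Tracker and are assumptions, not the paper's actual interfaces; the toy task at the end just exercises the loop.

```python
def dfs_execute(initial_state, propose_actions, execute, is_goal, max_depth=20):
    """Depth-first GUI task execution with multi-level backtracking.

    Each stack frame keeps a state snapshot and the actions not yet tried there, so a
    failure deep in the tree can fall back several levels, not just one step.
    """
    stack = [(initial_state, propose_actions(initial_state))]
    trace = []
    while stack:
        state, untried = stack[-1]
        if is_goal(state):
            return trace                  # sequence of actions that solved the task
        if not untried or len(stack) > max_depth:
            stack.pop()                   # backtrack: restore the parent state
            if trace:
                trace.pop()
            continue
        action = untried.pop(0)
        next_state, ok = execute(state, action)
        if ok:
            trace.append(action)
            stack.append((next_state, propose_actions(next_state)))
    return None                           # no action sequence found

# Toy check: reach the integer 3 from 0 using +1 / +2 steps, rejecting overshoots.
result = dfs_execute(
    0,
    propose_actions=lambda s: ["+2", "+1"],
    execute=lambda s, a: (s + int(a), s + int(a) <= 3),
    is_goal=lambda s: s == 3,
)
print(result)   # ['+2', '+1']
```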

[325] Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

Jiecong Wang, Hao Peng, Chunyang Liu

Main category: cs.AI

TL;DR: PLaT is a latent reasoning framework that decouples reasoning from verbalization by modeling reasoning as deterministic latent planning states, allowing dynamic termination and showing improved reasoning diversity despite lower greedy accuracy.

DetailsMotivation: Current Chain-of-Thought approaches are constrained by computational cost and reasoning path collapse in discrete token spaces, while existing latent reasoning methods operate as opaque end-to-end mappings with fixed inference steps.

Method: PLaT reformulates latent reasoning as planning by decoupling reasoning from verbalization. It models reasoning as deterministic trajectory of latent planning states, with a separate Decoder for grounding thoughts into text when needed, enabling dynamic termination.

Result: On mathematical benchmarks, PLaT achieves lower greedy accuracy than baselines but demonstrates superior scalability in reasoning diversity, indicating it learns a robust, broader solution space.

Conclusion: PLaT offers a transparent and scalable foundation for inference-time search by learning a broader solution space through decoupled reasoning and verbalization.

Abstract: Chain-of-Thought (CoT) empowers Large Language Models (LLMs) to tackle complex problems, but remains constrained by the computational cost and reasoning path collapse when grounded in discrete token spaces. Recent latent reasoning approaches attempt to optimize efficiency by performing reasoning within continuous hidden states. However, these methods typically operate as opaque end-to-end mappings from explicit reasoning steps to latent states, and often require a pre-defined number of latent steps during inference. In this work, we introduce PLaT (Planning with Latent Thoughts), a framework that reformulates latent reasoning as planning by fundamentally decouple reasoning from verbalization. We model reasoning as a deterministic trajectory of latent planning states, while a separate Decoder grounds these thoughts into text when necessary. This decoupling allows the model to dynamically determine when to terminate reasoning rather than relying on fixed hyperparameters. Empirical results on mathematical benchmarks reveal a distinct trade-off: while PLaT achieves lower greedy accuracy than baselines, it demonstrates superior scalability in terms of reasoning diversity. This indicates that PLaT learns a robust, broader solution space, offering a transparent and scalable foundation for inference-time search.

[326] Hebbian Learning with Global Direction

Wenjia Hua, Kejie Zhao, Luziwei Leng, Ran Cheng, Yuxin Ma, Qinghai Guo

Main category: cs.AI

TL;DR: GHL framework combines Hebbian learning with global guidance to create a biologically plausible alternative to backpropagation that scales to complex tasks like ImageNet.

DetailsMotivation: Backpropagation lacks biological plausibility and has high computational costs, while pure Hebbian learning is limited by its exclusive reliance on local information without global task objectives.

Method: Proposes Global-guided Hebbian Learning (GHL) framework with two components: 1) local component using Oja’s rule with competitive learning for stable local updates, and 2) global component providing sign-based signals to guide Hebbian plasticity updates.
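
A minimal sketch of what a globally guided Hebbian update could look like, combining Oja's rule with a winner-take-all competitive step and a sign-based global signal; the function and its arguments are illustrative assumptions, not the released implementation.

```python
import numpy as np

# Sketch of a globally guided Hebbian update: Oja's rule as the local term,
# competitive learning to pick a winner, and a +/-1 global signal steering
# the direction of the plasticity update.
def ghl_update(W, x, lr=0.01, global_sign=1.0):
    """W: (out, in) weight matrix, x: (in,) input, global_sign: task-level +/-1 signal."""
    y = W @ x                                           # post-synaptic activity
    winner = np.argmax(y)                               # competitive learning: only the winner updates
    oja = y[winner] * (x - y[winner] * W[winner])       # Oja's rule keeps weight norms bounded
    W[winner] += lr * global_sign * oja                 # global sign guides the update direction
    return W
```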

Result: Outperforms existing Hebbian approaches and achieves competitive results on large-scale networks and complex datasets like ImageNet, significantly narrowing the gap with standard backpropagation.

Conclusion: GHL successfully integrates local and global information to create a scalable, biologically plausible alternative to backpropagation that works across diverse networks and tasks.

Abstract: The backpropagation algorithm has driven the remarkable success of deep neural networks, but its lack of biological plausibility and high computational costs have motivated the ongoing search for alternative training methods. Hebbian learning has attracted considerable interest as a biologically plausible alternative to backpropagation. Nevertheless, its exclusive reliance on local information, without consideration of global task objectives, fundamentally limits its scalability. Inspired by the biological synergy between neuromodulators and local plasticity, we introduce a novel model-agnostic Global-guided Hebbian Learning (GHL) framework, which seamlessly integrates local and global information to scale up across diverse networks and tasks. Specifically, the local component employs Oja’s rule with competitive learning to ensure stable and effective local updates. Meanwhile, the global component introduces a sign-based signal that guides the direction of local Hebbian plasticity updates. Extensive experiments demonstrate that our method consistently outperforms existing Hebbian approaches. Notably, on large-scale networks and complex datasets like ImageNet, our framework achieves competitive results and significantly narrows the gap with standard backpropagation.

[327] NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

Yang Song, Anoushka Vyas, Zirui Wei, Sina Khoshfetrat Pakazad, Henrik Ohlsson, Graham Neubig

Main category: cs.AI

TL;DR: NEMO translates natural language descriptions of decision problems into executable mathematical optimization code using autonomous coding agents, achieving state-of-the-art performance on optimization benchmarks.

DetailsMotivation: Existing approaches for converting natural language to optimization code are brittle, complex, and often produce non-executable code. The paper aims to create a more robust system that reliably generates executable optimization implementations.

Method: Uses autonomous coding agents (ACAs) as first-class abstractions, with sandboxed execution environments ensuring code is executable by construction. Introduces novel coordination patterns including asymmetric validation loops between optimizer and simulator implementations, external memory for experience reuse, and robustness enhancements via minimum Bayes risk decoding and self-consistency.

Result: Achieves state-of-the-art performance on nine established optimization benchmarks, with substantial margins on several datasets, demonstrating the effectiveness of execution-aware agentic architectures.

Conclusion: NEMO demonstrates the power of execution-aware agentic architectures for automated optimization modeling, providing a robust framework for translating natural language descriptions into executable mathematical optimization code.

Abstract: In this paper, we present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations, operating collaboratively with users or autonomously. Existing approaches typically rely on specialized large language models (LLMs) or bespoke, task-specific agents. Such methods are often brittle, complex and frequently generating syntactically invalid or non-executable code. NEMO instead centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs. This design enables the construction of higher-level systems around ACAs that structure, consolidate, and iteratively refine task specifications. Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair. Building on this, we introduce novel coordination patterns with and across ACAs, including asymmetric validation loops between independently generated optimizer and simulator implementations (serving as a high-level validation mechanism), external memory for experience reuse, and robustness enhancements via minimum Bayes risk (MBR) decoding and self-consistency. We evaluate NEMO on nine established optimization benchmarks. As depicted in Figure 1, it achieves state-of-the-art performance on the majority of tasks, with substantial margins on several datasets, demonstrating the power of execution-aware agentic architectures for automated optimization modeling.

[328] TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

Zheng Li, Siyao Song, Jingyuan Ma, Rui Li, Ying Zeng, Minghao Li, Zhifang Sui

Main category: cs.AI

TL;DR: A framework for evaluating LLM teaching capabilities using syllabus-grounded multi-turn instruction and student performance improvement, tested on Gaokao data across subjects.

DetailsMotivation: Current LLM teaching capability evaluation is insufficient, focusing mainly on problem-solving rather than knowledge-centered teaching. There's a need for structured evaluation that measures actual teaching effectiveness through student improvement.

Method: Proposed a syllabus-grounded evaluation framework where teacher agents are restricted to structured knowledge points and example problems. The framework avoids information leakage and reuses existing benchmarks. Instantiated on Gaokao data across multiple subjects to measure teaching effectiveness via student performance improvement after multi-turn instruction.
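
The evaluation idea reduces to a score delta before and after instruction; a minimal sketch, assuming hypothetical `student_answer_fn` and `teach_fn` callables and simple exact-match grading (the actual framework is more elaborate).

```python
# Sketch of teaching effectiveness as the student model's score gain after
# multi-turn instruction. All callables and the grading rule are assumptions.
def teaching_gain(student_answer_fn, teach_fn, exam, knowledge_points):
    """exam: list of (question, reference_answer) pairs."""
    def score(context=None):
        return sum(student_answer_fn(q, context=context) == ref for q, ref in exam)
    pre = score()                                # baseline performance
    dialogue = teach_fn(knowledge_points)        # teacher restricted to syllabus knowledge points
    post = score(context=dialogue)               # performance after instruction
    return post - pre                            # improvement attributable to teaching
```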

Result: Experiments show substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. Incorporating example problems doesn’t necessarily improve teaching, as models often shift toward example-specific error correction.

Conclusion: Teaching ability is a distinct and measurable dimension of LLM behavior that requires specialized evaluation frameworks beyond traditional problem-solving benchmarks.

Abstract: Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.

[329] System 1&2 Synergy via Dynamic Model Interpolation

Chenxu Yang, Qingyi Si, Chong Tian, Xiyu Liu, Dingyu Yao, Chuanyu Qin, Zheng Lin, Weiping Wang, Jiaqi Wang

Main category: cs.AI

TL;DR: DAMI dynamically interpolates between System 1 (fast) and System 2 (deliberative) model checkpoints using query-specific reasoning intensity estimation to balance efficiency and accuracy.

DetailsMotivation: Current approaches focus on output control (limiting what models produce) rather than capability control (modulating how models think). Output length is a symptom, not the root cause of cognitive configuration issues in unified language models.

Method: Leverages existing Instruct and Thinking checkpoints through dynamic parameter interpolation without additional training. Proposes DAMI framework with query-specific Reasoning Intensity λ(q) estimation. Uses preference learning for training-based estimation and confidence-based method for zero-shot deployment.
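
The core mechanism is ordinary checkpoint interpolation applied per query. A minimal sketch, assuming Hugging Face-style `model`/`tokenizer` objects and a hypothetical `lam_of_query` estimator standing in for the paper's Reasoning Intensity λ(q):

```python
import torch

# Sketch of dynamic parameter interpolation between Instruct and Thinking checkpoints.
# The state-dict blending is standard model merging; lam_of_query is a placeholder.
def interpolate_checkpoints(instruct_sd, thinking_sd, lam):
    """Return theta = (1 - lam) * theta_instruct + lam * theta_thinking."""
    return {k: (1.0 - lam) * instruct_sd[k] + lam * thinking_sd[k].to(instruct_sd[k].dtype)
            for k in instruct_sd}

def answer(model, tokenizer, query, instruct_sd, thinking_sd, lam_of_query):
    lam = lam_of_query(query)                        # query-specific reasoning intensity
    model.load_state_dict(interpolate_checkpoints(instruct_sd, thinking_sd, lam))
    inputs = tokenizer(query, return_tensors="pt")
    return tokenizer.decode(model.generate(**inputs, max_new_tokens=512)[0])
```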

Result: Experiments on five mathematical reasoning benchmarks show DAMI achieves higher accuracy than Thinking model while remaining efficient, effectively combining System 1 efficiency with System 2 reasoning depth.

Conclusion: Shifting focus from output control to capability control through dynamic model interpolation enables better balance between intuitive and deliberative cognitive modes in language models.

Abstract: Training a unified language model that adapts between intuitive System 1 and deliberative System 2 remains challenging due to interference between their cognitive modes. Recent studies have thus pursued making System 2 models more efficient. However, these approaches focused on output control, limiting what models produce. We argue that this paradigm is misaligned: output length is merely a symptom of the model’s cognitive configuration, not the root cause. In this work, we shift the focus to capability control, which modulates how models think rather than what they produce. To realize this, we leverage existing Instruct and Thinking checkpoints through dynamic parameter interpolation, without additional training. Our pilot study establishes that linear interpolation yields a convex, monotonic Pareto frontier, underpinned by representation continuity and structural connectivity. Building on this, we propose DAMI (DynAmic Model Interpolation), a framework that estimates a query-specific Reasoning Intensity λ(q) to configure cognitive depth. For training-based estimation, we develop a preference learning method encoding accuracy and efficiency criteria. For zero-shot deployment, we introduce a confidence-based method leveraging inter-model cognitive discrepancy. Experiments on five mathematical reasoning benchmarks demonstrate that DAMI achieves higher accuracy than the Thinking model while remaining efficient, effectively combining the efficiency of System 1 with the reasoning depth of System 2.

[330] When Prohibitions Become Permissions: Auditing Negation Sensitivity in Language Models

Katherine Elkins, Jon Chun

Main category: cs.AI

TL;DR: Large language models often misinterpret negated instructions as affirmations, with open-source models endorsing prohibited actions 77-100% of the time, revealing a critical safety gap in AI alignment.

DetailsMotivation: The paper addresses a fundamental safety issue: when users tell AI systems that someone "should not" take an action, the system should treat this as a prohibition, but many LLMs do the opposite, interpreting negated instructions as affirmations.

Method: The researchers audited 16 models across 14 ethical scenarios, testing both simple and compound negation. They measured endorsement rates of prohibited actions, analyzed agreement between models, and examined scenario-specific fragility. They used deterministic decoding to rule out sampling noise and proposed the Negation Sensitivity Index (NSI) as a governance metric.
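
The summary does not give the exact NSI formula; as an illustrative stand-in, the sketch below scores a model's endorsement rate under affirmative versus negated framing and reports the difference (`model_judge` is a hypothetical yes/no classifier over scenario framings).

```python
# Sketch of a negation-sensitivity audit. The real NSI definition may differ;
# this uses the gap between negated and affirmative endorsement rates.
def endorsement_rate(model_judge, scenarios, framing):
    """model_judge(scenario, framing) -> True if the model endorses the action."""
    votes = [model_judge(s, framing) for s in scenarios]
    return sum(votes) / len(votes)

def negation_sensitivity_index(model_judge, scenarios):
    affirm = endorsement_rate(model_judge, scenarios, framing="affirmative")
    negated = endorsement_rate(model_judge, scenarios, framing="negated")
    return negated - affirm   # > 0 means prohibitions are being read as permissions
```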

Result: Open-source models endorsed prohibited actions 77% of the time under simple negation and 100% under compound negation (317% increase over affirmative framing). Commercial models showed swings of 19-128%. Model agreement dropped from 74% on affirmative prompts to 62% on negated ones. Financial scenarios proved twice as fragile as medical ones.

Conclusion: There’s a significant gap between current alignment techniques and safe deployment requirements. Models that cannot reliably distinguish “do X” from “do not X” should not make autonomous decisions in high-stakes contexts. The paper proposes a tiered certification framework with domain-specific thresholds.

Abstract: When a user tells an AI system that someone “should not” take an action, the system ought to treat this as a prohibition. Yet many large language models do the opposite: they interpret negated instructions as affirmations. We audited 16 models across 14 ethical scenarios and found that open-source models endorse prohibited actions 77% of the time under simple negation and 100% under compound negation – a 317% increase over affirmative framing. Commercial models fare better but still show swings of 19-128%. Agreement between models drops from 74% on affirmative prompts to 62% on negated ones, and financial scenarios prove twice as fragile as medical ones. These patterns hold under deterministic decoding, ruling out sampling noise. We present case studies showing how these failures play out in practice, propose the Negation Sensitivity Index (NSI) as a governance metric, and outline a tiered certification framework with domain-specific thresholds. The findings point to a gap between what current alignment techniques achieve and what safe deployment requires: models that cannot reliably distinguish “do X” from “do not X” should not be making autonomous decisions in high-stakes contexts.

[331] The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making

Jon Chun, Katherine Elkins

Main category: cs.AI

TL;DR: LLMs show surprising resistance to emotional framing effects in high-stakes decision-making, exhibiting 110-300x greater robustness than humans despite known prompt sensitivity.

DetailsMotivation: To investigate LLM robustness in consequential, rule-bound decision-making despite their known sensitivity to minor prompt perturbations and sycophantic alignment with user biases.

Method: Developed a controlled perturbation framework across healthcare, law, and finance domains to quantify LLM resistance to narrative manipulation compared to human subjects.

Result: LLMs demonstrated near-total invariance to emotional framing with near-zero effect size (Cohen’s h = 0.003) vs substantial human biases (Cohen’s h in [0.3, 0.8]), showing 110-300x greater resistance.
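
Cohen's h, the effect size quoted above, is the difference of arcsine-transformed proportions; a small helper for context (the example rates below are invented, not the paper's data).

```python
import math

# Cohen's h for two proportions: h = 2*arcsin(sqrt(p1)) - 2*arcsin(sqrt(p2)).
def cohens_h(p1, p2):
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical illustration: a 2-point swing in decision rate is tiny,
# a 25-point swing is substantial.
# cohens_h(0.52, 0.50) ~= 0.04; cohens_h(0.75, 0.50) ~= 0.52
```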

Conclusion: Instruction-tuned LLMs can decouple logical rule-adherence from persuasive narratives, offering decision stability that could complement and potentially de-bias human judgment in institutional contexts.

Abstract: While Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment with user biases, their robustness in consequential, rule-bound decision-making remains under-explored. In this work, we uncover a striking “Paradox of Robustness”: despite their known lexical brittleness, instruction-tuned LLMs exhibit a behavioral and near-total invariance to emotional framing effects. Using a novel controlled perturbation framework across three high-stakes domains (healthcare, law, and finance), we quantify a robustness gap where LLMs demonstrate 110-300 times greater resistance to narrative manipulation than human subjects. Specifically, we find a near-zero effect size for models (Cohen’s h = 0.003) compared to the substantial biases observed in humans (Cohen’s h in [0.3, 0.8]). This result is highly counterintuitive and suggests the mechanisms driving sycophancy and prompt sensitivity do not necessarily translate to a failure in logical constraint satisfaction. We show that this invariance persists across models with diverse training paradigms. Our findings show that while LLMs may be “brittle” to how a query is formatted, they are remarkably “stable” against why a decision should be biased. Our findings establish that instruction-tuned models can decouple logical rule-adherence from persuasive narratives, offering a source of decision stability that complements, and even potentially de-biases, human judgment in institutional contexts. We release the 162-scenario benchmark, code, and data to facilitate the rigorous evaluation of narrative-induced bias and robustness on GitHub.com.

[332] ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, Yufei Ding

Main category: cs.AI

TL;DR: A benchmark for evaluating LLMs in AI-aided chip design across Verilog generation, debugging, and reference model generation tasks, revealing significant performance gaps compared to existing saturated benchmarks.

DetailsMotivation: Current benchmarks for LLMs in hardware engineering suffer from saturation and limited task diversity, failing to reflect real industrial workflows and performance.

Method: Proposed comprehensive benchmark with 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Also provided automated toolbox for high-quality training data generation.

Result: State-of-the-art Claude-4.5-opus achieved only 30.74% on Verilog generation and 13.33% on Python reference model generation, showing significant challenges compared to existing benchmarks where SOTA models achieve over 95% pass rates.

Conclusion: The benchmark reveals substantial performance gaps in LLMs for hardware engineering tasks and provides tools to facilitate future research in this underexplored domain.

Abstract: While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs’ performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74% on Verilog generation and 13.33% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

[333] LION: A Clifford Neural Paradigm for Multimodal-Attributed Graph Learning

Xunkai Li, Zhengyu Wu, Zekai Chen, Henan Sun, Daohan Su, Guang Zeng, Hongchao Qin, Rong-Hua Li, Guoren Wang

Main category: cs.AI

TL;DR: LION proposes a Clifford algebra-based neural paradigm for multimodal-attributed graphs that addresses modality alignment and fusion limitations through geometric manifolds and adaptive holographic aggregation.

DetailsMotivation: Current multimodal graph ML methods have two key limitations: (1) they neglect graph context in modality alignment using topology-constrained or modality-specific operators, inhibiting modality interaction, and (2) they lack adaptation in modality fusion, being simple adaptations for 2-modality graphs that fail to exploit aligned tokens with topology priors.

Method: LION uses Clifford algebra and decoupled graph neural paradigm (propagation-then-aggregation) for alignment-then-fusion. First constructs modality-aware geometric manifold based on Clifford algebra for high-order graph propagation enabling modality interaction and alignment. Then uses adaptive holographic aggregation that integrates geometric grade properties (energy and scale) with learnable parameters for improved modality fusion.

Result: Extensive experiments on 9 datasets show LION significantly outperforms state-of-the-art baselines across 3 graph and 3 modality downstream tasks.

Conclusion: LION effectively addresses modality alignment and fusion limitations in multimodal-attributed graphs through Clifford algebra-based geometric manifolds and adaptive holographic aggregation, demonstrating superior performance across diverse tasks.

Abstract: Recently, the rapid advancement of multimodal domains has driven a data-centric paradigm shift in graph ML, transitioning from text-attributed to multimodal-attributed graphs. This advancement significantly enhances data representation and expands the scope of graph downstream tasks, such as modality-oriented tasks, thereby improving the practical utility of graph ML. Despite its promise, limitations exist in the current neural paradigms: (1) Neglect Context in Modality Alignment: Most existing methods adopt topology-constrained or modality-specific operators as tokenizers. These aligners inevitably neglect graph context and inhibit modality interaction, resulting in suboptimal alignment. (2) Lack of Adaptation in Modality Fusion: Most existing methods are simple adaptations for 2-modality graphs and fail to adequately exploit aligned tokens equipped with topology priors during fusion, leading to poor generalizability and performance degradation. To address the above issues, we propose LION (Clifford Neural paradigm) based on the Clifford algebra and decoupled graph neural paradigm (i.e., propagation-then-aggregation) to implement alignment-then-fusion in multimodal-attributed graphs. Specifically, we first construct a modality-aware geometric manifold grounded in Clifford algebra. This geometric-induced high-order graph propagation efficiently achieves modality interaction, facilitating modality alignment. Then, based on the geometric grade properties of aligned tokens, we propose adaptive holographic aggregation. This module integrates the energy and scale of geometric grades with learnable parameters to improve modality fusion. Extensive experiments on 9 datasets demonstrate that LION significantly outperforms SOTA baselines across 3 graph and 3 modality downstream tasks.

[334] Topeax – An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance

Márton Kardos

Main category: cs.AI

TL;DR: Topeax is a new topic modeling approach that improves on Top2Vec and BERTopic by better discovering natural clusters and generating higher-quality topic keywords through combined lexical-semantic term importance.

DetailsMotivation: Current clustering-based topic models like Top2Vec and BERTopic have issues: unreliable cluster discovery due to sensitivity to sample size/hyperparameters, and poor term importance estimation that leads to incoherent topics with stop words or lack of variety.

Method: Topeax discovers cluster count from density estimate peaks and combines lexical (word frequency) and semantic (distance to topic vectors) indices for term importance to generate high-quality topic keywords.
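
A minimal sketch of the combined lexical-semantic term-importance idea, assuming precomputed word vectors and a topic vector; the actual weighting used by Topeax may differ.

```python
import numpy as np

# Sketch: combine a lexical index (term frequency within the topic) with a
# semantic index (cosine similarity of the word vector to the topic vector).
def term_importance(term_counts, word_vectors, topic_vector):
    """term_counts: {word: count in topic}; word_vectors: {word: np.ndarray}."""
    total = sum(term_counts.values())
    scores = {}
    for w, c in term_counts.items():
        lexical = c / total                                            # frequency share
        v = word_vectors[w]
        semantic = float(v @ topic_vector /
                         (np.linalg.norm(v) * np.linalg.norm(topic_vector) + 1e-9))
        scores[w] = lexical * max(semantic, 0.0)                       # combine both signals
    return sorted(scores, key=scores.get, reverse=True)                # topic keywords, best first
```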

Result: Topeax outperforms Top2Vec and BERTopic in both cluster recovery and cluster description, while showing more stable behavior with varying sample sizes and hyperparameters.

Conclusion: Topeax provides a more reliable and effective approach to topic modeling by addressing key limitations of existing clustering-based methods through improved cluster discovery and term importance estimation.

Abstract: Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models’ apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in, on the one hand, less coherent topics due to the presence of stop words and junk words, and lack of variety and trust on the other. In this paper, I introduce a new approach, \textbf{Topeax}, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to gain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.

[335] MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

Main category: cs.AI

TL;DR: MemOCR: A multimodal memory agent that uses visual layout to compress interaction histories into images for more efficient long-horizon reasoning under tight context budgets.

DetailsMotivation: Existing memory systems serialize history as text with uniform token-level costs, wasting scarce context budget on low-value details. Better compression is needed for long-horizon reasoning.

Method: Maintains structured rich-text memory (headings, highlights) and renders it into an image for memory access, visually prioritizing crucial evidence while compressing auxiliary details. Trained with RL under budget-aware objectives across diverse compression levels.

Result: Outperforms text-based baselines on long-context multi-hop and single-hop QA benchmarks, achieving more effective context utilization under extreme budgets.

Conclusion: Visual layout-based memory compression enables more efficient long-horizon reasoning by adaptively allocating memory space with varying information density.

Abstract: Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

[336] ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management

Zaifeng Pan, Yipeng Shen, Zhengding Hu, Zhuang Wang, Aninda Manocha, Zheng Wang, Zhongkai Yu, Yue Guan, Yufei Ding

Main category: cs.AI

TL;DR: ScaleSim: A memory-efficient LLM serving system for large-scale multi-agent simulations using invocation distance abstraction for proactive prefetching and priority-based eviction.

DetailsMotivation: LLM-based multi-agent simulations are difficult to scale due to GPU memory pressure from each agent maintaining private GPU-resident states (models, prefix caches, adapters), which quickly exhaust device memory as agent count grows.

Method: Identifies sparse agent activation and estimable agent invocation order properties, introduces invocation distance abstraction to estimate relative order of future LLM requests, and implements proactive prefetching with priority-based eviction through a modular interface for diverse agent-specific memory.
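
A toy sketch of invocation-distance-driven prefetching and eviction, with `invocation_distance` left as an abstract per-agent estimator (the real system ties this to workload analysis and a modular memory interface).

```python
import heapq

# Sketch: evict agents whose next invocation is estimated to be furthest away,
# prefetch those expected to issue requests soonest. `resident` is a set of agents
# whose GPU-resident state is currently loaded.
def plan_memory(resident, all_agents, invocation_distance, capacity, prefetch_k=2):
    while len(resident) > capacity:
        victim = max(resident, key=invocation_distance)     # furthest future use
        resident.remove(victim)
    candidates = [a for a in all_agents if a not in resident]
    for agent in heapq.nsmallest(prefetch_k, candidates, key=invocation_distance):
        if len(resident) < capacity:
            resident.add(agent)                              # proactive prefetch
    return resident
```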

Result: Achieves up to 1.74x speedup over SGLang on simulation benchmarks.

Conclusion: ScaleSim enables efficient scaling of LLM-based multi-agent simulations through memory optimization techniques based on workload analysis and predictive scheduling.

Abstract: LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.

[337] The Path of Least Resistance: Guiding LLM Reasoning Trajectories with Prefix Consensus

Ishan Jindal, Sai Prashanth Akuthota, Jayant Taneja, Sachin Dev Sharma

Main category: cs.AI

TL;DR: PoLR reduces computational cost of Self-Consistency reasoning by clustering early reasoning prefixes to identify and expand only the most promising paths, achieving similar accuracy with 60% fewer tokens.

DetailsMotivation: Self-Consistency and similar inference strategies for large language models are computationally expensive because they expand all reasoning traces fully, requiring substantial token usage and causing high latency.

Method: PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands only paths in that cluster. It leverages prefix consistency and uses mutual information/entropy analysis to show early reasoning steps predict final correctness.
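
A minimal sketch of the prefix-consensus step, assuming a hypothetical `embed_fn` for prefix embeddings and k-means as the clustering choice (the summary does not commit to a specific clusterer).

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch: embed short reasoning prefixes, cluster them, and keep only samples
# in the largest cluster for full expansion.
def polr_filter(prefixes, embed_fn, n_clusters=3):
    X = np.stack([embed_fn(p) for p in prefixes])
    k = min(n_clusters, len(prefixes))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    dominant = np.bincount(labels).argmax()                  # the "path of least resistance"
    return [p for p, l in zip(prefixes, labels) if l == dominant]
```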

Result: PoLR matches or exceeds Self-Consistency performance across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND benchmarks while reducing token usage by up to 60% and wall-clock latency by up to 50%.

Conclusion: PoLR provides a compute-efficient inference method that preserves accuracy benefits of Self-Consistency while substantially reducing computational costs, and can complement adaptive inference methods as a drop-in pre-filter.

Abstract: Large language models achieve strong reasoning performance, but inference strategies such as Self-Consistency (SC) are computationally expensive, as they fully expand all reasoning traces. We introduce PoLR (Path of Least Resistance), the first inference-time method to leverage prefix consistency for compute-efficient reasoning. PoLR clusters short prefixes of reasoning traces, identifies the dominant cluster, and expands all paths in that cluster, preserving the accuracy benefits of SC while substantially reducing token usage and latency. Our theoretical analysis, framed via mutual information and entropy, explains why early reasoning steps encode strong signals predictive of final correctness. Empirically, PoLR consistently matches or exceeds SC across GSM8K, MATH500, AIME24/25, and GPQA-DIAMOND, reducing token usage by up to 60% and wall-clock latency by up to 50%. Moreover, PoLR is fully complementary to adaptive inference methods (e.g., Adaptive Consistency, Early-Stopping SC) and can serve as a drop-in pre-filter, making SC substantially more efficient and scalable without requiring model fine-tuning.

[338] MAR: Efficient Large Language Models via Module-aware Architecture Refinement

Junhong Cai, Guiqin Wang, Kejie Zhao, Jianxiong Tang, Xiang Wang, Luziwei Leng, Ran Cheng, Yuxin Ma, Qinghai Guo

Main category: cs.AI

TL;DR: MAR framework integrates SSMs for linear-time sequence modeling and applies activation sparsification to reduce energy costs in LLMs, while introducing ATMN and SBDS to address SNN-SSM integration challenges.

DetailsMotivation: LLMs have high energy costs due to quadratic attention and dense FFN operations, creating a need for more efficient architectures that maintain performance while reducing computational overhead.

Method: Two-stage MAR framework: 1) integrates State Space Models for linear-time sequence modeling, 2) applies activation sparsification to reduce FFN costs. Introduces Adaptive Ternary Multi-step Neuron (ATMN) and Spike-aware Bidirectional Distillation Strategy (SBDS) to address SNN-SSM integration issues.
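
As a rough illustration of the activation-sparsification half of the method, the sketch below keeps only the top-k FFN activations per token; MAR's actual design also involves SSM blocks and the ATMN spiking neuron, which this does not cover.

```python
import torch

# Sketch of FFN activation sparsification: zero all but the top-k hidden
# activations per token, reducing effective compute in the down-projection.
def sparse_ffn(x, w_up, w_down, keep_ratio=0.1):
    h = torch.relu(x @ w_up)                          # (tokens, hidden)
    k = max(1, int(keep_ratio * h.shape[-1]))
    thresh = h.topk(k, dim=-1).values[..., -1:]       # per-token k-th largest activation
    h = torch.where(h >= thresh, h, torch.zeros_like(h))
    return h @ w_down
```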

Result: MAR effectively restores performance of dense counterparts under constrained resources while substantially reducing inference energy consumption, outperforming efficient models of comparable or larger scale.

Conclusion: MAR demonstrates potential for building efficient and practical LLMs by addressing energy efficiency challenges through SSM integration and activation sparsification techniques.

Abstract: Large Language Models (LLMs) excel across diverse domains but suffer from high energy costs due to quadratic attention and dense Feed-Forward Network (FFN) operations. To address these issues, we propose Module-aware Architecture Refinement (MAR), a two-stage framework that integrates State Space Models (SSMs) for linear-time sequence modeling and applies activation sparsification to reduce FFN costs. In addition, to mitigate low information density and temporal mismatch in integrating Spiking Neural Networks (SNNs) with SSMs, we design the Adaptive Ternary Multi-step Neuron (ATMN) and the Spike-aware Bidirectional Distillation Strategy (SBDS). Extensive experiments demonstrate that MAR effectively restores the performance of its dense counterpart under constrained resources while substantially reducing inference energy consumption. Furthermore, it outperforms efficient models of comparable or even larger scale, underscoring its potential for building efficient and practical LLMs.

[339] The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation

Diaoulé Diallo, Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking

Main category: cs.AI

TL;DR: First human evaluation of activation steering for emotional tone control in LLMs, showing strong alignment between human and model ratings, with moderate steering strengths effectively amplifying target emotions while preserving text quality.

DetailsMotivation: To provide a lightweight alternative to prompt engineering and fine-tuning for controlling LLM behavior, specifically focusing on emotional tone alignment with human preferences and safety requirements through activation steering.

Method: Conducted human evaluation of activation steering using over 7,000 crowd-sourced ratings from 190 participants via Prolific. Compared automated classifier-based steering with human perception of emotional intensity and text quality across different emotions and steering strengths.
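
Activation steering itself is a one-line intervention on the residual stream; a minimal sketch using a PyTorch forward hook, with the layer choice and the steering strength λ treated as assumptions rather than the study's exact setup.

```python
import torch

# Sketch of activation steering: add lam * style_vector to a layer's output.
# `layer` is any nn.Module in the model; the hook replaces its output.
def add_steering_hook(layer, style_vector, lam=0.15):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + lam * style_vector.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)   # keep the handle to remove the hook later
```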

Result: Strong alignment between human and model-based quality ratings (mean r=0.776). Moderate steering strengths (λ≈0.15) reliably amplify target emotions while preserving comprehensibility, with strongest effects for disgust and fear, minimal for surprise. Upgrading from Alpaca to LlaMA-3 yielded more consistent steering.

Conclusion: Activation-based control is a scalable method for steering LLM behavior across affective dimensions, with human evaluation validating its effectiveness for emotional tone control while maintaining text quality.

Abstract: Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human abilities and safety requirements. \emph{Activation steering} provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This research advances the literature in three significant directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering concerning the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants via Prolific ($n=190$). These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean $r=0.776$, range $0.157$–$0.985$), indicating automatic scoring can proxy perceived quality. Moderate steering strengths ($λ\approx 0.15$) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust ($η_p^2 = 0.616$) and fear ($η_p^2 = 0.540$), and minimal effects for surprise ($η_p^2 = 0.042$). Finally, upgrading from Alpaca to LlaMA-3 yielded more consistent steering with significant effects across emotions and strengths (all $p < 0.001$). Inter-rater reliability was high (ICC $= 0.71$–$0.87$), underscoring the robustness of the findings. These findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.

[340] LLaMEA-SAGE: Guiding Automated Algorithm Design with Structural Feedback from Explainable AI

Niki van Stein, Anna V. Kononova, Lars Kotthoff, Thomas Bäck

Main category: cs.AI

TL;DR: LLaMEA-SAGE enhances automated algorithm design by using code structure features from abstract syntax trees to guide LLM-based algorithm evolution, achieving better performance than vanilla methods.

DetailsMotivation: Current evolutionary frameworks for automated algorithm design (AAD) rely solely on fitness feedback, ignoring valuable information in the generated code structure. The authors aim to leverage code features to provide more intelligent guidance for LLM-driven algorithm evolution.

Method: Proposes LLaMEA-SAGE which extracts graph-theoretic and complexity features from abstract syntax trees of generated algorithms, builds a surrogate model from evaluated solutions, uses explainable AI to identify performance-critical features, and translates them into natural-language mutation instructions to guide subsequent LLM code generation.
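
A small sketch of extracting structural features from a generated algorithm's abstract syntax tree with Python's `ast` module; the paper's feature set (graph-theoretic and complexity measures feeding a surrogate model) is richer than this.

```python
import ast

# Sketch: simple structural features of a candidate algorithm's source code.
def ast_features(source_code):
    tree = ast.parse(source_code)
    nodes = list(ast.walk(tree))
    return {
        "num_nodes": len(nodes),
        "num_loops": sum(isinstance(n, (ast.For, ast.While)) for n in nodes),
        "num_branches": sum(isinstance(n, ast.If) for n in nodes),
        "num_functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "max_depth": _depth(tree),
    }

def _depth(node):
    children = list(ast.iter_child_nodes(node))
    return 1 + max((_depth(c) for c in children), default=0)
```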

Result: Achieves same performance faster than vanilla LLaMEA in controlled experiments, and superior performance compared to state-of-the-art AAD methods on the MA-BBOB suite from GECCO-MA-BBOB competition.

Conclusion: Code-derived signals can effectively bias LLM-driven algorithm evolution, bridging the gap between code structure and human-understandable performance feedback in automated algorithm design.

Abstract: Large language models have enabled automated algorithm design (AAD) by generating optimization algorithms directly from natural-language prompts. While evolutionary frameworks such as LLaMEA demonstrate strong exploratory capabilities across the algorithm design space, their search dynamics are entirely driven by fitness feedback, leaving substantial information about the generated code unused. We propose a mechanism for guiding AAD using feedback constructed from graph-theoretic and complexity features extracted from the abstract syntax trees of the generated algorithms, based on a surrogate model learned over an archive of evaluated solutions. Using explainable AI techniques, we identify features that substantially affect performance and translate them into natural-language mutation instructions that steer subsequent LLM-based code generation without restricting expressivity. We propose LLaMEA-SAGE, which integrates this feature-driven guidance into LLaMEA, and evaluate it across several benchmarks. We show that the proposed structured guidance achieves the same performance faster than vanilla LLaMEA in a small controlled experiment. In a larger-scale experiment using the MA-BBOB suite from the GECCO-MA-BBOB competition, our guided approach achieves superior performance compared to state-of-the-art AAD methods. These results demonstrate that signals derived from code can effectively bias LLM-driven algorithm evolution, bridging the gap between code structure and human-understandable performance feedback in automated algorithm design.

[341] KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

Alireza Nadaf, Alireza Mohammadshahi, Majid Yazdani

Main category: cs.AI

TL;DR: KAPSO is a modular framework for autonomous program synthesis and optimization that uses iterative loops of ideation, code synthesis, execution, evaluation, and learning to improve runnable artifacts toward measurable objectives.

DetailsMotivation: Addresses long-horizon failures in coding agents including lost experimental state, brittle debugging, and weak reuse of domain expertise by creating a systematic framework for autonomous program improvement.

Method: Three tightly coupled components: 1) git-native experimentation engine for reproducible artifacts and provenance, 2) knowledge system ingesting heterogeneous sources into structured representations, 3) cognitive memory layer coordinating retrieval and maintaining episodic store of reusable lessons from experiment traces.

Result: Evaluated on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization) with reported end-to-end performance results.

Conclusion: KAPSO treats synthesis as an operator within long-horizon optimization rather than an endpoint, enabling systematic improvement of programs through iterative learning and knowledge reuse.

Abstract: We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long-horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long-horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization), and report end-to-end performance. Code Available at: https://github.com/Leeroo-AI/kapso

[342] ARGORA: Orchestrated Argumentation for Causally Grounded LLM Reasoning and Decision Making

Youngjin Jin, Hanna Kim, Kwanwoo Kim, Chanhee Lee, Seungwon Shin

Main category: cs.AI

TL;DR: ARGORA is a framework that organizes multi-expert LLM discussions into explicit argumentation graphs, enabling causal analysis of which arguments drive final decisions and providing correction mechanisms when internal reasoning conflicts with external judgments.

DetailsMotivation: Existing multi-expert LLM systems combine diverse perspectives through simple aggregation, which obscures which specific arguments actually drove the final decision. There's a need for more transparent reasoning where the causal relationships between arguments and decisions are made explicit.

Method: ARGORA organizes multi-expert discussions into explicit argumentation graphs showing support/attack relationships between arguments. It casts these graphs as causal models, enabling systematic removal of individual arguments and recomputation of outcomes to identify necessary reasoning chains. The framework includes a correction mechanism that aligns internal reasoning with external judgments when they disagree.
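
A toy sketch of the argument-ablation idea on a support/attack graph, using networkx and a deliberately simplified acceptance rule (an argument is kept only if nothing attacks it); real argumentation semantics and ARGORA's causal machinery are considerably more involved.

```python
import networkx as nx

# graph: nx.DiGraph whose edges carry relation="support" or relation="attack".
def accepted(graph):
    """Toy acceptance rule: keep arguments with no incoming attack edge."""
    return {n for n in graph.nodes
            if not any(d.get("relation") == "attack"
                       for _, _, d in graph.in_edges(n, data=True))}

def decisive_arguments(graph, decision_fn):
    """Flag arguments whose removal flips the decision computed by decision_fn."""
    baseline = decision_fn(accepted(graph))
    decisive = []
    for node in list(graph.nodes):
        ablated = graph.copy()
        ablated.remove_node(node)
        if decision_fn(accepted(ablated)) != baseline:
            decisive.append(node)          # removing this argument changes the outcome
    return decisive
```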

Result: Across diverse benchmarks and an open-ended use case, ARGORA achieves competitive accuracy and demonstrates corrective behavior: when experts initially disagree, the framework resolves disputes toward correct answers more often than it introduces new errors, while providing causal diagnostics of decisive arguments.

Conclusion: ARGORA provides a transparent framework for multi-expert LLM systems that makes reasoning explicit through argumentation graphs and causal analysis, enabling better understanding of decision-making processes and correction mechanisms when needed.

Abstract: Existing multi-expert LLM systems gather diverse perspectives but combine them through simple aggregation, obscuring which arguments drove the final decision. We introduce ARGORA, a framework that organizes multi-expert discussions into explicit argumentation graphs showing which arguments support or attack each other. By casting these graphs as causal models, ARGORA can systematically remove individual arguments and recompute outcomes, identifying which reasoning chains were necessary and whether decisions would change under targeted modifications. We further introduce a correction mechanism that aligns internal reasoning with external judgments when they disagree. Across diverse benchmarks and an open-ended use case, ARGORA achieves competitive accuracy and demonstrates corrective behavior: when experts initially disagree, the framework resolves disputes toward correct answers more often than it introduces new errors, while providing causal diagnostics of decisive arguments.

[343] ShardMemo: Masked MoE Routing for Sharded Agentic LLM Memory

Yang Zhao, Chengxiao Dai, Yue Xiu, Mengying Kou, Yuliang Zheng, Dusit Niyato

Main category: cs.AI

TL;DR: ShardMemo is a tiered memory service for agentic LLM systems that uses structured eligibility constraints and masked MoE routing to efficiently manage large-scale memory access across multiple agents.

DetailsMotivation: Centralized memory indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow in agentic LLM systems, requiring more efficient memory management for long-horizon state and concurrent multi-agent execution.

Method: Three-tier memory architecture: Tier A (per-agent working state), Tier B (sharded evidence with shard-local ANN indexes and scope-before-routing constraints), Tier C (versioned skill library). Uses masked mixture-of-experts routing with eligibility constraints and cost-aware gating over shard families.
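
A minimal sketch of scope-before-routing: mask router logits to the eligible shards, then probe only the top-B of them. The prototype vectors and eligibility mask stand in for the trained router and structured constraints.

```python
import torch

# Sketch of masked MoE-style shard routing with a probe budget.
def route_shards(query_emb, shard_prototypes, eligible_mask, b_probe=3):
    """query_emb: (d,); shard_prototypes: (num_shards, d); eligible_mask: bool (num_shards,)."""
    logits = shard_prototypes @ query_emb                       # router scores per shard
    logits = logits.masked_fill(~eligible_mask, float("-inf"))  # scope before routing
    k = min(b_probe, int(eligible_mask.sum()))
    probs = torch.softmax(logits, dim=-1)
    return torch.topk(probs, k).indices                         # shards to probe with ANN search
```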

Result: Improves over strongest baseline by +5.11 to +6.82 F1 on LoCoMo, reduces retrieval work by 20.5% and p95 latency from 95ms to 76ms. Achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens on HotpotQA. Tier C reaches 0.97 Precision@3 and 1.94 StepRed on ToolBench.

Conclusion: ShardMemo provides efficient, scalable memory management for agentic LLM systems through structured eligibility constraints and intelligent routing, significantly improving performance while reducing computational overhead.

Abstract: Agentic large language model (LLM) systems rely on external memory for long-horizon state and concurrent multi-agent execution, but centralized indexes and heuristic partitions become bottlenecks as memory volume and parallel access grow. We present ShardMemo, a budgeted tiered memory service with Tier A per-agent working state, Tier B sharded evidence with shard-local approximate nearest neighbor (ANN) indexes, and Tier C, a versioned skill library. Tier B enforces scope-before-routing: structured eligibility constraints mask ineligible shards before routing or ANN search. We cast shard probing as masked mixture-of-experts (MoE) routing over eligible shards, probing up to $B_{\mathrm{probe}}$ shards via Top-$B_{\mathrm{probe}}$ or adaptive Top-$P$, and use cost-aware gating over profile/observation/session shard families; the router is trained from evidence-to-shard supervision. On LoCoMo, ShardMemo improves over the strongest baseline (GAM) by +5.11 to +6.82 F1 across question categories. Under a fixed-budget routing setting ($B_{\mathrm{probe}}=3$), ShardMemo improves over cosine-to-prototype shard routing by +6.87 F1 while reducing retrieval work (VecScan 521->414, -20.5%) and p95 latency (95->76 ms). On long-context HotpotQA, ShardMemo achieves 63.41/61.88/57.95 F1 at 56K/224K/448K tokens. On ToolBench, Tier C reaches 0.97 Precision@3 and 1.94 StepRed (+10.2% and +7.2% over embedding-similarity retrieval).

[344] Meta Context Engineering via Agentic Skill Evolution

Haoran Ye, Xuning He, Vincent Arak, Haonan Dong, Guojie Song

Main category: cs.AI

TL;DR: Meta Context Engineering (MCE) is a bi-level framework that co-evolves context engineering skills and context artifacts through meta-level skill refinement and base-level execution, outperforming static CE methods.

DetailsMotivation: Current Context Engineering methods rely on manually crafted harnesses with structural biases and limited design space, restricting optimization of LLM inference-time context.

Method: MCE uses a bi-level framework: meta-level agent refines engineering skills via agentic crossover (deliberative search over skill history), while base-level agent executes skills, learns from training rollouts, and optimizes context as flexible files and code.

Result: MCE achieves 5.6-53.8% relative improvement over state-of-the-art agentic CE methods (mean 16.9%) across five domains, with superior context adaptability, transferability, and efficiency.

Conclusion: MCE supersedes static CE heuristics by co-evolving skills and context artifacts, enabling more effective context optimization for LLMs.

Abstract: The operational efficacy of large language models relies heavily on their inference-time context. This has established Context Engineering (CE) as a formal discipline for optimizing these inputs. Current CE methods rely on manually crafted harnesses, such as rigid generation-reflection workflows and predefined context schemas. They impose structural biases and restrict context optimization to a narrow, intuition-bound design space. To address this, we introduce Meta Context Engineering (MCE), a bi-level framework that supersedes static CE heuristics by co-evolving CE skills and context artifacts. In MCE iterations, a meta-level agent refines engineering skills via agentic crossover, a deliberative search over the history of skills, their executions, and evaluations. A base-level agent executes these skills, learns from training rollouts, and optimizes context as flexible files and code. We evaluate MCE across five disparate domains under offline and online settings. MCE demonstrates consistent performance gains, achieving 5.6–53.8% relative improvement over state-of-the-art agentic CE methods (mean of 16.9%), while maintaining superior context adaptability, transferability, and efficiency in both context usage and training.

[345] EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots

Zixing Lei, Genjia Liu, Yuanshuo Zhang, Qipeng Liu, Chuan Wen, Shanghang Zhang, Wenzhao Lian, Siheng Chen

Main category: cs.AI

TL;DR: LLM agents can autonomously engineer embodied AI policies through iterative code generation and environment feedback, outperforming human-engineered baselines by 26.5% on average across 32 RL/IL tasks.

DetailsMotivation: The scaling of general-purpose robotic systems is bottlenecked by labor-intensive manual oversight (reward shaping, hyperparameter tuning). Inspired by LLMs' success in software automation, the authors aim to leverage LLM agents for autonomous embodied policy engineering.

Method: Introduces EmboCoach-Bench, a benchmark with 32 expert-curated RL and IL tasks. Uses executable code as universal interface and assesses dynamic closed-loop workflow where LLM agents iteratively draft, debug, and optimize solutions using environment feedback, covering reward design and policy architectures like diffusion policies.

Result: Autonomous agents surpass human-engineered baselines by 26.5% average success rate. Agentic workflow with environment feedback strengthens policy development and narrows performance gap between open-source and proprietary models. Agents exhibit self-correction capabilities, resurrecting task performance from near-total failures through simulation-in-the-loop debugging.

Conclusion: Establishes foundation for self-evolving embodied intelligence, accelerating shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI.

Abstract: The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains severely bottlenecked by a reliance on labor-intensive manual oversight from intricate reward shaping to hyperparameter tuning across heterogeneous backends. Inspired by LLMs’ success in software automation and science discovery, we introduce \textsc{EmboCoach-Bench}, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies. Spanning 32 expert-curated RL and IL tasks, our framework posits executable code as the universal interface. We move beyond static generation to assess a dynamic closed-loop workflow, where agents leverage environment feedback to iteratively draft, debug, and optimize solutions, spanning improvements from physics-informed reward design to policy architectures such as diffusion policies. Extensive evaluations yield three critical insights: (1) autonomous agents can qualitatively surpass human-engineered baselines by 26.5% in average success rate; (2) agentic workflow with environment feedback effectively strengthens policy development and substantially narrows the performance gap between open-source and proprietary models; and (3) agents exhibit self-correction capabilities for pathological engineering cases, successfully resurrecting task performance from near-total failures through iterative simulation-in-the-loop debugging. Ultimately, this work establishes a foundation for self-evolving embodied intelligence, accelerating the paradigm shift from labor-intensive manual tuning to scalable, autonomous engineering in embodied AI field.

[346] Chain Of Thought Compression: A Theoretical Analysis

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan

Main category: cs.AI

TL;DR: Theoretical analysis of Chain-of-Thought compression difficulty and a new framework ALiCoT that aligns latent token distributions with intermediate reasoning states for efficient reasoning.

DetailsMotivation: Chain-of-Thought reasoning in LLMs incurs high computational costs due to generating extra tokens. While implicit CoT compression offers token efficiency, the underlying mechanisms and learning difficulties remain unclear, requiring theoretical analysis.

Method: Introduces Order-r Interaction theory to analyze learning signal decay for high-order logical dependencies. Proposes ALiCoT framework that aligns latent token distributions with intermediate reasoning states to overcome signal decay. Validates using NatBool-DAG benchmark for irreducible logical reasoning.
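
The core of the described framework is aligning what each latent token encodes with the explicit intermediate reasoning state it replaces. The snippet below is a hedged illustration of such an alignment objective (a KL term between matched distributions), not the paper's exact loss; the pairing of latent positions to CoT steps is assumed.

```python
# Illustrative alignment objective between latent-token and explicit-step distributions.
import torch
import torch.nn.functional as F

def alignment_loss(latent_logits: torch.Tensor,
                   step_logits: torch.Tensor,
                   temperature: float = 1.0) -> torch.Tensor:
    """KL(step || latent) averaged over latent positions.

    latent_logits: (num_latent_tokens, vocab) logits produced at latent positions
    step_logits:   (num_latent_tokens, vocab) logits of the matched explicit CoT steps
    """
    latent_logp = F.log_softmax(latent_logits / temperature, dim=-1)
    step_p = F.softmax(step_logits / temperature, dim=-1)
    return F.kl_div(latent_logp, step_p, reduction="batchmean")
```

In practice this term would be added to the usual language-modeling loss with a weighting coefficient, which the summary does not specify.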

Result: ALiCoT achieves 54.4x speedup while maintaining performance comparable to explicit CoT, successfully unlocking efficient reasoning through proper alignment of latent states.

Conclusion: Theoretical analysis reveals exponential decay of learning signals for high-order dependencies in CoT compression. ALiCoT’s alignment approach effectively overcomes this barrier, enabling efficient reasoning without performance degradation.

Abstract: Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohibitive computational costs due to the generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies decays exponentially on irreducible problems, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

[347] Depth-Recurrent Attention Mixtures: Giving Latent Reasoning the Attention it Deserves

Jonas Knupp, Jan Hendrik Metzen, Jeremias Bohn, Georg Groh, Kristian Kersting

Main category: cs.AI

TL;DR: Dreamer introduces depth-recurrent attention mixtures to improve latent reasoning efficiency by combining sequence, depth, and sparse expert attention, achieving better performance with fewer training tokens.

DetailsMotivation: Prior depth-recurrent models have limitations: lack of proper baselines, underutilization of depth-recurrence due to fixed layer stacks, and hidden-size bottlenecks that restrict many-step latent reasoning.

Method: Proposes Dreamer framework with modular depth-recurrent attention mixtures combining sequence attention, depth attention, and sparse expert attention to alleviate hidden-size bottlenecks and decouple scaling dimensions.

Result: Models require 2-8x fewer training tokens for same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ~2x larger SOTA models with same training tokens. Also shows 2-11x larger expert selection diversity than SOTA MoEs.

Conclusion: Depth-recurrent attention mixtures enable efficient scaling and effective latent reasoning by addressing hidden-size bottlenecks and improving knowledge usage across depths.

Abstract: Depth-recurrence facilitates latent reasoning by sharing parameters across depths. However, prior work lacks combined FLOP-, parameter-, and memory-matched baselines, underutilizes depth-recurrence due to partially fixed layer stacks, and ignores the bottleneck of constant hidden-sizes that restricts many-step latent reasoning. To address this, we introduce a modular framework of depth-recurrent attention mixtures (Dreamer), combining sequence attention, depth attention, and sparse expert attention. It alleviates the hidden-size bottleneck through attention along depth, decouples scaling dimensions, and allows depth-recurrent models to scale efficiently and effectively. Across language reasoning benchmarks, our models require 2 to 8x fewer training tokens for the same accuracy as FLOP-, parameter-, and memory-matched SOTA, and outperform ca. 2x larger SOTA models with the same training tokens. We further present insights into knowledge usage across depths, e.g., showing 2 to 11x larger expert selection diversity than SOTA MoEs.

[348] Beyond Imitation: Reinforcement Learning for Active Latent Planning

Zhi Zheng, Wee Sun Lee

Main category: cs.AI

TL;DR: ATP-Latent improves latent reasoning in LLMs by using conditional VAE for smoother latent space and reinforcement learning with coherence rewards for better reasoning policies.

DetailsMotivation: Current latent reasoning methods fine-tune LLMs to use continuous latent tokens instead of discrete language tokens for more efficient CoT reasoning, but they passively imitate arbitrary CoT labels which can lead to inferior representations and reasoning policies, creating gaps between training and testing.

Method: Proposes ATP-Latent with two key components: 1) Models latent token supervision as conditional VAE to obtain smoother latent space, 2) Uses reinforcement learning with auxiliary coherence reward based on consistency between VAE-decoded contents to guide optimal reasoning policy.
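
The auxiliary coherence reward can be pictured as scoring how consistent the VAE-decoded contents of successive latent tokens are. The sketch below is a loose illustration under stated assumptions: `decode` and `embed` are hypothetical placeholders for the paper's VAE decoder and a text encoder, and similarity-based consistency is one plausible instantiation.

```python
# Hedged sketch of a coherence reward over VAE-decoded latent reasoning steps.
from typing import Callable, List
import numpy as np

def coherence_reward(latents: List[np.ndarray],
                     decode: Callable[[np.ndarray], str],    # hypothetical VAE decoder
                     embed: Callable[[str], np.ndarray]) -> float:  # hypothetical text encoder
    texts = [decode(z) for z in latents]                     # decode each latent token to text
    vecs = [embed(t) for t in texts]
    sims = []
    for a, b in zip(vecs, vecs[1:]):                         # consecutive reasoning steps
        sims.append(float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.mean(sims)) if sims else 0.0             # higher = more coherent chain
```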

Result: On LLaMA-1B, ATP-Latent achieves +4.1% accuracy and -3.3% tokens on four benchmarks compared to advanced baselines.

Conclusion: ATP-Latent demonstrates that active planning over latent token representation space with VAE modeling and coherence-guided RL leads to more efficient and accurate latent reasoning in LLMs.

Abstract: Aiming at efficient and dense chain-of-thought (CoT) reasoning, latent reasoning methods fine-tune Large Language Models (LLMs) to substitute discrete language tokens with continuous latent tokens. These methods consume fewer tokens compared to the conventional language CoT reasoning and have the potential to plan in a dense latent space. However, current latent tokens are generally supervised based on imitating language labels. Considering that there can be multiple equivalent but diverse CoT labels for a question, passively imitating an arbitrary one may lead to inferior latent token representations and latent reasoning policies, undermining the potential planning ability and resulting in clear gaps between training and testing. In this work, we emphasize the importance of active planning over the representation space of latent tokens in achieving the optimal latent reasoning policy. So, we propose the Active Latent Planning method (ATP-Latent), which models the supervision process of latent tokens as a conditional variational auto-encoder (VAE) to obtain a smoother latent space. Moreover, to facilitate the most reasonable latent reasoning policy, ATP-Latent conducts reinforcement learning (RL) with an auxiliary coherence reward, which is calculated based on the consistency between VAE-decoded contents of latent tokens, enabling a guided RL process. In experiments on LLaMA-1B, ATP-Latent demonstrates +4.1% accuracy and -3.3% tokens on four benchmarks compared to advanced baselines. Codes are available on https://github.com/zz1358m/ATP-Latent-master.

[349] CORE: Collaborative Reasoning via Cross Teaching

Kshitij Mishra, Mirat Aubakirov, Martin Takac, Nils Lukas, Salem Lahlou

Main category: cs.AI

TL;DR: CORE is a training-time collaboration framework where language models teach each other through cross-teaching, using successful reasoning from peers to rescue failed attempts, achieving significant performance gains on reasoning tasks without scaling model size.

DetailsMotivation: Large language models exhibit complementary reasoning errors - different models fail on different instances. The authors aim to leverage this complementarity through collaborative training to improve reasoning performance without increasing model size.

Method: CORE uses a two-stage process: 1) Cold round of independent sampling, 2) Contexted rescue round where failed models receive hints from successful peers. It optimizes a combined reward balancing correctness, diversity (DPP-inspired to reduce error overlap), and rescue bonus for successful recovery.
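
The combined reward can be sketched directly from this description: a correctness term, a DPP-inspired diversity term (here the log-determinant of a similarity kernel over sampled-solution embeddings, which penalizes overlapping errors), and a rescue bonus. The weights and the embedding source are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of CORE's combined reward: correctness + diversity + rescue bonus.
import numpy as np

def core_reward(correct: bool, rescued: bool, solution_embs: np.ndarray,
                w_div: float = 0.1, w_rescue: float = 0.5, eps: float = 1e-6) -> float:
    """solution_embs: (k, d) L2-normalized embeddings of the k sampled solutions."""
    kernel = solution_embs @ solution_embs.T                          # cosine similarity kernel
    sign, logdet = np.linalg.slogdet(kernel + eps * np.eye(len(kernel)))
    diversity = logdet if sign > 0 else 0.0                           # DPP-style volume term
    return float(correct) + w_div * diversity + w_rescue * float(rescued)
```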

Result: With only 1,000 training examples, a pair of small models (3B+4B) achieves Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, reaches 77.34% on GPQA and 79.65% on AIME with limited training data.

Conclusion: Training-time collaboration can reliably convert model complementarity into large performance gains without scaling model size, demonstrating effective cross-teaching between models with complementary reasoning patterns.

Abstract: Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a training-time collaboration framework that converts peer success into a learning signal via a cross-teaching protocol. Each problem is solved in two stages: a cold round of independent sampling, followed by a contexted rescue round in which models that failed receive a hint extracted from a successful peer. CORE optimizes a combined reward that balances (i) correctness, (ii) a lightweight DPP-inspired diversity term to reduce error overlap, and (iii) an explicit rescue bonus for successful recovery. We evaluate CORE across four standard reasoning datasets: GSM8K, MATH, AIME, and GPQA. With only 1,000 training examples, a pair of small open-source models (3B+4B) reaches Pass@2 of 99.54% on GSM8K and 92.08% on MATH, compared to 82.50% and 74.82% for single-model training. On harder datasets, the 3B+4B pair reaches Pass@2 of 77.34% on GPQA (trained on 348 examples) and 79.65% on AIME (trained on 792 examples), using a training-time budget of at most 1536 context tokens and 3072 generated tokens. Overall, these results show that training-time collaboration can reliably convert model complementarity into large gains without scaling model size.

[350] Search-Based Risk Feature Discovery in Document Structure Spaces under a Constrained Budget

Saisubramaniam Gopalakrishnan, Harikrishnan P M, Dagnachew Birru

Main category: cs.AI

TL;DR: The paper presents a search-based software testing approach for intelligent document processing systems, using diverse search strategies to uncover multiple failure types within budget constraints.

DetailsMotivation: Enterprise IDP systems need validation under limited budgets, requiring discovery of diverse failure mechanisms rather than just worst-case documents. Current approaches may miss important failure types by focusing on single solutions.

Method: Formalizes as SBST problem in combinatorial document configuration space, rendering structural risk features. Benchmarks diverse search strategies (evolutionary, swarm-based, quality-diversity, learning-based, quantum) under identical budgets with configuration-level exclusivity, win-rate, and cross-temporal overlap analyses.
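
The evaluation protocol amounts to running each solver under the same budget and counting distinct failure types, including per-solver exclusive discoveries versus the union. The sketch below is a simplified reconstruction with hypothetical solver and IDP-evaluation interfaces.

```python
# Simplified sketch of the budgeted portfolio evaluation protocol (interfaces assumed).
from typing import Callable, Dict, Iterable, Optional, Set

def run_portfolio(
    solvers: Dict[str, Callable[[int], Iterable[dict]]],    # name -> budget -> candidate configs
    evaluate_idp: Callable[[dict], Optional[str]],          # config -> failure type, or None
    budget: int,
) -> Dict[str, Set[str]]:
    discovered: Dict[str, Set[str]] = {}
    for name, propose in solvers.items():
        found: Set[str] = set()
        for config in propose(budget):                       # each solver gets the same budget
            failure = evaluate_idp(config)
            if failure is not None:
                found.add(failure)
        discovered[name] = found
    union: Set[str] = set().union(*discovered.values()) if discovered else set()
    for name, found in discovered.items():
        others = set().union(*(f for n, f in discovered.items() if n != name))
        print(f"{name}: {len(found)} failure types, {len(found - others)} exclusive "
              f"(union: {len(union)})")
    return discovered
```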

Result: Different solvers consistently uncover failure modes undiscovered by alternatives. Cross-temporal analysis shows persistent solver-specific discoveries across all budgets, with no single strategy dominating. Union of all solvers eventually recovers failure space, but individual methods systematically delay discovery of important risks.

Conclusion: Demonstrates intrinsic solver complementarity and motivates portfolio-based SBST strategies for robust industrial IDP validation, rather than relying on single methods.

Abstract: Enterprise-grade Intelligent Document Processing (IDP) systems support high-stakes workflows across finance, insurance, and healthcare. Early-phase system validation under limited budgets mandates uncovering diverse failure mechanisms, rather than identifying a single worst-case document. We formalize this challenge as a Search-Based Software Testing (SBST) problem, aiming to identify complex interactions between document variables, with the objective to maximize the number of distinct failure types discovered within a fixed evaluation budget. Our methodology operates on a combinatorial space of document configurations, rendering instances of structural risk features to induce realistic failure conditions. We benchmark a diverse portfolio of search strategies spanning evolutionary, swarm-based, quality-diversity, learning-based, and quantum methods under identical budget constraints. Through configuration-level exclusivity, win-rate, and cross-temporal overlap analyses, we show that different solvers consistently uncover failure modes that remain undiscovered by specific alternatives at comparable budgets. Crucially, cross-temporal analysis reveals persistent solver-specific discoveries across all evaluated budgets, with no single strategy exhibiting absolute dominance. While the union of all solvers eventually recovers the observed failure space, reliance on any individual method systematically delays the discovery of important risks. These results demonstrate intrinsic solver complementarity and motivate portfolio-based SBST strategies for robust industrial IDP validation.

[351] RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems

Bingqian Li, Xiaolei Wang, Junyi Li, Weitao Li, Long Zhang, Sheng Chen, Wayne Xin Zhao, Ji-Rong Wen

Main category: cs.AI

TL;DR: RecNet is a self-evolving preference propagation framework for recommender systems that uses LLM-powered agents to model real-time mutual influences between users and items through dynamic preference routing and personalized reception mechanisms.

DetailsMotivation: Existing LLM-based recommender systems primarily model preference changes based on sparse, noisy explicit user-item interactions, which cannot capture real-time mutual influences among users and items. There's a need for a framework that can proactively propagate real-time preference updates across related users and items.

Method: RecNet consists of two phases: 1) Forward phase with centralized preference routing (router agents integrate and propagate updates) and personalized preference reception (message buffer + rule-based filter memory for selective assimilation), and 2) Backward phase with feedback-driven propagation optimization using multi-agent reinforcement learning with LLMs for credit assignment, gradient analysis, and module-level optimization.

Result: Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.

Conclusion: RecNet provides a self-evolving framework that addresses limitations of existing methods by enabling real-time preference propagation and continuous optimization through LLM-powered multi-agent systems.

Abstract: Agentic recommender systems leverage Large Language Models (LLMs) to model complex user behaviors and support personalized decision-making. However, existing methods primarily model preference changes based on explicit user-item interactions, which are sparse, noisy, and unable to reflect the real-time, mutual influences among users and items. To address these limitations, we propose RecNet, a self-evolving preference propagation framework that proactively propagates real-time preference updates across related users and items. RecNet consists of two complementary phases. In the forward phase, the centralized preference routing mechanism leverages router agents to integrate preference updates and dynamically propagate them to the most relevant agents. To ensure accurate and personalized integration of propagated preferences, we further introduce a personalized preference reception mechanism, which combines a message buffer for temporary caching and an optimizable, rule-based filter memory to guide selective preference assimilation based on past experience and interests. In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework, using LLMs for credit assignment, gradient analysis, and module-level optimization, enabling continuous self-evolution of propagation strategies. Extensive experiments on various scenarios demonstrate the effectiveness of RecNet in modeling preference propagation for recommender systems.

[352] Semantic Content Determines Algorithmic Performance

Martiño Ríos-García, Nawaf Alampara, Kevin Maik Jablonka

Main category: cs.AI

TL;DR: LLMs show significant accuracy variation in simple counting tasks depending solely on the semantic content of the items being counted, revealing that they don’t implement true algorithms but rather argument-dependent approximations.

DetailsMotivation: To test whether LLMs implement invariant algorithms or if their behavior depends on semantic content of inputs, using counting as a fundamental test case.

Method: Created WhatCounts benchmark: atomic counting tasks with unambiguous, delimited lists containing no duplicates or distractors, testing different semantic types (cities vs chemicals, names vs symbols).
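
The benchmark idea is easy to reconstruct in miniature: identical counting prompts whose only difference is the semantic type of the listed items. The item pools and the `ask_llm` callable below are toy assumptions, not the released WhatCounts data.

```python
# Toy reconstruction of semantic-type-controlled counting tasks (data and model interface assumed).
import random
from typing import Callable, Dict, List

ITEM_POOLS: Dict[str, List[str]] = {
    "cities": ["Paris", "Lagos", "Osaka", "Lima", "Oslo", "Quito", "Dakar", "Perth"],
    "chemicals": ["ethanol", "benzene", "glucose", "ammonia", "acetone", "toluene",
                  "methane", "propane"],
}

def make_task(semantic_type: str, n: int, rng: random.Random) -> Dict[str, object]:
    items = rng.sample(ITEM_POOLS[semantic_type], n)         # no duplicates, no distractors
    prompt = ("Count the items in the following comma-separated list and answer "
              "with a single integer: " + ", ".join(items))
    return {"prompt": prompt, "answer": n, "type": semantic_type}

def accuracy_by_type(ask_llm: Callable[[str], str],
                     n_tasks: int = 50, n: int = 6) -> Dict[str, float]:
    rng = random.Random(0)
    scores: Dict[str, float] = {}
    for sem in ITEM_POOLS:
        tasks = [make_task(sem, n, rng) for _ in range(n_tasks)]
        correct = sum(ask_llm(t["prompt"]).strip() == str(t["answer"]) for t in tasks)
        scores[sem] = correct / n_tasks
    return scores    # per-type accuracy; large gaps indicate semantic dependence
```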

Result: Frontier LLMs show over 40% accuracy variation depending solely on what is being counted. Controlled ablations rule out confounds, and semantic gaps shift unpredictably with small fine-tuning.

Conclusion: LLMs don’t implement true algorithms but approximate them in argument-dependent ways, with implications for all LLM functions carrying hidden semantic dependencies.

Abstract: Counting should not depend on what is being counted; more generally, any algorithm’s behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.

[353] ScholarGym: Benchmarking Deep Research Workflows on Academic Literature Retrieval

Hao Shen, Hang Yang, Zhouhong Gu

Main category: cs.AI

TL;DR: ScholarGym is a simulation environment for reproducible evaluation of deep research workflows on academic literature, addressing challenges of non-determinism in tool-augmented LLM evaluations.

DetailsMotivation: Evaluating tool-augmented LLM research workflows is challenging due to non-determinism from live APIs (temporal drift, rate limiting, evolving backend states), which undermines reproducibility and invalidates cross-system comparisons.

Method: ScholarGym decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, providing 2,536 queries with expert-annotated ground truth.

Result: Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement in research workflows.

Conclusion: ScholarGym provides a reproducible evaluation framework for deep research workflows, enabling systematic analysis of tool-augmented LLM performance in academic literature tasks.

Abstract: Tool-augmented large language models have advanced from single-turn question answering to deep research workflows that iteratively plan queries, invoke external tools, and synthesize information to address complex information needs. Evaluating such workflows presents a fundamental challenge: reliance on live APIs introduces non-determinism, as tool invocations may yield different results across runs due to temporal drift, rate limiting, and evolving backend states. This variance undermines reproducibility and invalidates cross-system comparisons. We present ScholarGym, a simulation environment for reproducible evaluation of deep research workflows on academic literature. The environment decouples workflow components into query planning, tool invocation, and relevance assessment, enabling fine-grained analysis of each stage under controlled conditions. Built on a static corpus of 570K papers with deterministic retrieval, ScholarGym provides 2,536 queries with expert-annotated ground truth. Experiments across diverse backbone models reveal how reasoning capabilities, planning strategies, and selection mechanisms interact over iterative refinement.

[354] SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

Main category: cs.AI

TL;DR: SONIC-O1 is a comprehensive human-verified benchmark for evaluating Multimodal Large Language Models on sequential audio-video understanding across 13 real-world conversational domains with 4,958 annotations.

DetailsMotivation: Most MLLM research focuses on static image understanding, leaving a gap in evaluating their ability to process sequential audio-video data in real-world settings, necessitating a high-quality benchmark.

Method: Created SONIC-O1 benchmark spanning 13 conversational domains with 4,958 human-verified annotations and demographic metadata. Evaluates MLLMs on open-ended summarization, multiple-choice question answering, and temporal localization with reasoning rationales.

Result: Experiments reveal limitations: while MCQ accuracy gap between model families is small, there’s a substantial 22.6% performance difference in temporal localization between best closed-source and open-source models. Performance degrades across demographic groups, indicating persistent disparities.

Conclusion: SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding, highlighting significant gaps in current MLLM capabilities for audio-video processing.

Abstract: Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research: Project page: https://vectorinstitute.github.io/sonic-o1/ Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1 Github: https://github.com/vectorinstitute/sonic-o1 Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard

[355] TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning

Mingzu Liu, Hao Fang, Runmin Cong

Main category: cs.AI

TL;DR: TCAP is an unsupervised defense framework that detects backdoor attacks in MLLMs by analyzing attention allocation divergence across system instructions, vision inputs, and user queries.

DetailsMotivation: FTaaS enables MLLM customization but introduces backdoor risks via poisoned data. Existing defenses either need supervision or fail to generalize across diverse triggers and modalities.

Method: Tri-Component Attention Profiling (TCAP) decomposes cross-modal attention maps into three functional components, identifies trigger-responsive attention heads via GMM statistical profiling, and isolates poisoned samples through EM-based vote aggregation.
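
A rough sketch of the profiling idea: split each head's attention mass over the three input components, score each sample's deviation from a balanced profile, and fit a two-component Gaussian mixture per head to vote on suspicious samples. Token boundaries, the divergence score, and the majority-vote threshold are illustrative assumptions.

```python
# Hedged sketch of attention decomposition + GMM profiling + vote aggregation.
import numpy as np
from sklearn.mixture import GaussianMixture

def component_shares(attn: np.ndarray, sys_end: int, vis_end: int) -> np.ndarray:
    """attn: (num_heads, query_len, key_len) attention weights for one sample.
    Keys [0:sys_end) are system tokens, [sys_end:vis_end) vision, the rest text."""
    mass = attn.mean(axis=1)                                   # (heads, key_len)
    shares = np.stack([mass[:, :sys_end].sum(axis=1),
                       mass[:, sys_end:vis_end].sum(axis=1),
                       mass[:, vis_end:].sum(axis=1)], axis=1)
    return shares / shares.sum(axis=1, keepdims=True)          # (heads, 3)

def flag_poisoned(per_sample_shares: np.ndarray) -> np.ndarray:
    """per_sample_shares: (num_samples, num_heads, 3). Returns a boolean mask."""
    uniform = np.full(3, 1.0 / 3.0)
    divergence = np.abs(per_sample_shares - uniform).sum(axis=2)   # (samples, heads)
    votes = np.zeros(divergence.shape[0])
    for h in range(divergence.shape[1]):                            # profile each head
        gmm = GaussianMixture(n_components=2, random_state=0).fit(divergence[:, [h]])
        suspicious = np.argmax(gmm.means_.ravel())                  # higher-divergence mode
        votes += (gmm.predict(divergence[:, [h]]) == suspicious)
    return votes > divergence.shape[1] / 2                          # majority vote across heads
```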

Result: Extensive experiments across diverse MLLM architectures and attack methods show TCAP achieves consistently strong performance as a robust backdoor defense.

Conclusion: TCAP establishes a practical unsupervised defense against backdoor attacks in MLLMs by leveraging universal attention allocation divergence patterns.

Abstract: Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor risks via poisoned data. Existing defenses either rely on supervised signals or fail to generalize across diverse trigger types and modalities. In this work, we uncover a universal backdoor fingerprint - attention allocation divergence - where poisoned samples disrupt the balanced attention distribution across three functional components: system instructions, vision inputs, and user textual queries, regardless of trigger morphology. Motivated by this insight, we propose Tri-Component Attention Profiling (TCAP), an unsupervised defense framework to filter backdoor samples. TCAP decomposes cross-modal attention maps into the three components, identifies trigger-responsive attention heads via Gaussian Mixture Model (GMM) statistical profiling, and isolates poisoned samples through EM-based vote aggregation. Extensive experiments across diverse MLLM architectures and attack methods demonstrate that TCAP achieves consistently strong performance, establishing it as a robust and practical backdoor defense in MLLMs.

[356] FBS: Modeling Native Parallel Reading inside a Transformer

Tongxi Wang

Main category: cs.AI

TL;DR: Fovea-Block-Skip Transformer (FBS) improves LLM inference efficiency by incorporating human-like reading strategies like content-adaptive foresight and chunk-structure-aware compute allocation.

DetailsMotivation: Current LLM inference relies on token-by-token autoregression, missing key human reading strategies like content-adaptive foresight and chunk-structure-aware compute allocation that could accelerate inference.

Method: Proposes Fovea-Block-Skip Transformer with three modules: Parafovea-Attention Window (PAW) for causal trainable loops, Chunk-Head (CH) for chunk-structure awareness, and Skip-Gate (SG) for compute allocation.

Result: FBS improves quality-efficiency trade-off across diverse benchmarks without increasing parameters, with ablations showing the three modules are complementary.

Conclusion: The FBS Transformer successfully incorporates human reading strategies into LLM architecture, achieving better inference efficiency while maintaining quality.

Abstract: Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train–test consistency for preview/skimming. We propose the Fovea-Block-Skip Transformer (FBS), which injects a causal, trainable loop into Transformers via Parafovea-Attention Window (PAW), Chunk-Head (CH), and Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality-efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.

[357] E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

Kaixiang Wang, Yidan Lin, Jiong Lou, Zhaojiacheng Zhou, Bunyod Suvonov, Jie Li

Main category: cs.AI

TL;DR: E-mem framework shifts from memory preprocessing to episodic context reconstruction for LLM agents, using a hierarchical architecture with multiple assistant agents maintaining uncompressed memory contexts and a central master agent for global planning.

DetailsMotivation: Current memory preprocessing methods for LLM agents suffer from destructive de-contextualization by compressing complex sequential dependencies into pre-defined structures, which severs contextual integrity essential for deep reasoning needed for System 2 reasoning.

Method: Proposes E-mem framework with heterogeneous hierarchical architecture: multiple assistant agents maintain uncompressed memory contexts while a central master agent orchestrates global planning. Unlike passive retrieval, assistants locally reason within activated segments to extract context-aware evidence before aggregation.

Result: On LoCoMo benchmark, E-mem achieves over 54% F1, surpassing state-of-the-art GAM by 7.75%, while reducing token cost by over 70%.

Conclusion: E-mem effectively addresses memory preprocessing limitations by preserving contextual integrity through episodic context reconstruction, enabling better reasoning performance with reduced computational cost.

Abstract: The evolution of Large Language Model (LLM) agents towards System 2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54% F1, surpassing the state-of-the-art GAM by 7.75%, while reducing token cost by over 70%.

[358] DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting

Siru Zhong, Yiqiu Liu, Zhiqing Cui, Zezhi Shao, Fei Wang, Qingsong Wen, Yuxuan Liang

Main category: cs.AI

TL;DR: DropoutTS is a model-agnostic plugin that improves time series model robustness to noise by using spectral sparsity to quantify instance-level noise and adaptively calibrate dropout rates, suppressing spurious fluctuations while preserving fidelity.

DetailsMotivation: Deep time series models are vulnerable to real-world noisy data. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency.

Method: DropoutTS employs a Sample-Adaptive Dropout mechanism that leverages spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, then dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates.
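
The mechanism can be illustrated end to end: reconstruct each window from its top-k Fourier coefficients, treat the reconstruction residual as an instance-level noise score, and map that score to a per-sample dropout rate. The linear mapping, the choice of k, and the assumption that windows are long enough for k frequency bins are illustrative, not the paper's exact settings.

```python
# Illustrative sketch of spectral-residual noise scoring mapped to adaptive dropout rates.
import torch
import torch.nn.functional as F

def adaptive_dropout(x: torch.Tensor, k: int = 8,
                     p_min: float = 0.05, p_max: float = 0.5,
                     training: bool = True) -> torch.Tensor:
    """x: (batch, seq_len) univariate input windows; returns x with per-sample dropout."""
    spec = torch.fft.rfft(x, dim=-1)
    mag = spec.abs()
    keep = torch.zeros_like(mag, dtype=torch.bool)
    keep.scatter_(-1, mag.topk(k, dim=-1).indices, True)             # top-k spectral components
    recon = torch.fft.irfft(spec * keep.to(spec.dtype), n=x.shape[-1], dim=-1)
    residual = (x - recon).pow(2).mean(dim=-1)                        # instance-level noise proxy
    noise = residual / (residual.max() + 1e-8)                        # normalize to [0, 1]
    rates = p_min + (p_max - p_min) * noise                           # per-sample dropout rates
    return torch.stack([F.dropout(xi, p=float(p), training=training)
                        for xi, p in zip(x, rates)])
```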

Result: Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones’ performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications.

Conclusion: DropoutTS provides an effective and efficient solution for time series model robustness that shifts the paradigm from “what” to learn to “how much” to learn, with minimal computational overhead.

Abstract: Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune data or rely on costly prior quantification, failing to balance effectiveness and efficiency. In this paper, we introduce DropoutTS, a model-agnostic plugin that shifts the paradigm from “what” to learn to “how much” to learn. DropoutTS employs a Sample-Adaptive Dropout mechanism: leveraging spectral sparsity to efficiently quantify instance-level noise via reconstruction residuals, it dynamically calibrates model learning capacity by mapping noise to adaptive dropout rates - selectively suppressing spurious fluctuations while preserving fine-grained fidelity. Extensive experiments across diverse noise regimes and open benchmarks show DropoutTS consistently boosts superior backbones’ performance, delivering advanced robustness with negligible parameter overhead and no architectural modifications. Our code is available at https://github.com/CityMind-Lab/DropoutTS.

[359] Language-based Trial and Error Falls Behind in the Era of Experience

Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao

Main category: cs.AI

TL;DR: SCOUT framework enables LLMs to efficiently learn nonlinguistic tasks by using lightweight scouts for exploration and then fine-tuning the LLM with collected data.

DetailsMotivation: LLMs struggle with unseen nonlinguistic environments (symbolic/spatial tasks) due to prohibitive exploration costs in high-dimensional semantic spaces, requiring extensive trial-and-error that's computationally unsustainable for large models.

Method: SCOUT decouples exploration from exploitation: lightweight scouts (small MLPs) probe environmental dynamics efficiently, collected trajectories bootstrap LLM via Supervised Fine-Tuning, followed by multi-turn Reinforcement Learning to activate latent world knowledge.

Result: SCOUT enables Qwen2.5-3B-Instruct to achieve average score of 0.86, significantly outperforming proprietary models like Gemini-2.5-Pro (0.60) while saving about 60% GPU hours consumption.

Conclusion: The exploration cost bottleneck for LLMs in nonlinguistic tasks can be effectively addressed by decoupling exploration from exploitation using lightweight scouts, enabling efficient learning while maintaining strong performance.

Abstract: While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight “scouts” (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.

[360] Zero-Shot Statistical Downscaling via Diffusion Posterior Sampling

Ruian Tie, Wenbo Xiong, Zhengyu Shi, Xinyu Su, Chenyu Jiang, Libo Wu, Hao Li

Main category: cs.AI

TL;DR: ZSSD is a zero-shot statistical downscaling framework for climate models that works without paired training data, using physics-consistent climate priors and unified coordinate guidance to handle domain gaps between GCMs and reanalysis.

DetailsMotivation: Conventional supervised climate downscaling fails to generalize to Global Climate Models due to lack of paired training data and domain gaps, while current zero-shot methods suffer from physical inconsistencies and vanishing gradient issues with large scaling factors.

Method: Proposes Zero-Shot Statistical Downscaling (ZSSD) framework that learns Physics-Consistent Climate Prior from reanalysis data, conditioned on geophysical boundaries and temporal information. Introduces Unified Coordinate Guidance to address vanishing gradient problem in vanilla DPS and ensure consistency with large-scale fields.

Result: ZSSD significantly outperforms existing zero-shot baselines in 99th percentile errors and successfully reconstructs complex weather events like tropical cyclones across heterogeneous GCMs.

Conclusion: ZSSD provides an effective zero-shot framework for statistical downscaling that addresses key limitations of existing methods, enabling robust inference across varying GCMs without requiring paired training data.

Abstract: Conventional supervised climate downscaling struggles to generalize to Global Climate Models (GCMs) due to the lack of paired training data and inherent domain gaps relative to reanalysis. Meanwhile, current zero-shot methods suffer from physical inconsistencies and vanishing gradient issues under large scaling factors. We propose Zero-Shot Statistical Downscaling (ZSSD), a zero-shot framework that performs statistical downscaling without paired data during training. ZSSD leverages a Physics-Consistent Climate Prior learned from reanalysis data, conditioned on geophysical boundaries and temporal information to enforce physical validity. Furthermore, to enable robust inference across varying GCMs, we introduce Unified Coordinate Guidance. This strategy addresses the vanishing gradient problem in vanilla DPS and ensures consistency with large-scale fields. Results show that ZSSD significantly outperforms existing zero-shot baselines in 99th percentile errors and successfully reconstructs complex weather events, such as tropical cyclones, across heterogeneous GCMs.

[361] Abstract Concept Modelling in Conceptual Spaces: A Study on Chess Strategies

Hadi Banaee, Stephanie Lowry

Main category: cs.AI

TL;DR: A conceptual space framework for modeling abstract concepts that unfold over time, demonstrated through chess strategy recognition using geometric regions and trajectory analysis.

DetailsMotivation: To extend conceptual spaces theory to temporally realized, goal-directed concepts by modeling how abstract concepts like strategies unfold over time in sequential decision-making contexts.

Method: Represent abstract concepts (e.g., attack, sacrifice) as geometric regions across interpretable quality dimensions. Chess games are instantiated as trajectories, and strategy recognition occurs by analyzing directional movement toward these concept regions. Supports dual-perspective modeling for different player interpretations.
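
The trajectory-based recognition step can be shown in miniature: each strategy concept is a region (reduced here to a centroid) in a space of quality dimensions, and a game's overall movement direction is scored against the direction toward each region. The dimensions and centroid values below are made-up illustrations, not the paper's model.

```python
# Toy sketch of trajectory-based strategy recognition in a conceptual space (values assumed).
import numpy as np

CONCEPT_CENTROIDS = {                 # quality dims: (aggression, material_risk, king_safety)
    "attack":    np.array([0.9, 0.5, 0.3]),
    "sacrifice": np.array([0.8, 0.9, 0.4]),
    "defense":   np.array([0.2, 0.1, 0.9]),
}

def recognize_strategy(trajectory: np.ndarray) -> str:
    """trajectory: (T, 3) points of one game in the quality-dimension space."""
    movement = trajectory[-1] - trajectory[0]                      # overall direction of play
    scores = {}
    for name, centroid in CONCEPT_CENTROIDS.items():
        toward = centroid - trajectory[0]                          # direction to the concept region
        denom = np.linalg.norm(movement) * np.linalg.norm(toward) + 1e-8
        scores[name] = float(movement @ toward) / denom            # cosine of the angle
    return max(scores, key=scores.get)                             # best-aligned concept
```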

Result: Demonstrated feasibility of trajectory-based concept recognition with movement patterns aligning with expert commentary. The approach successfully recognizes intended strategies in chess games.

Conclusion: Establishes a foundation for broader applications in sequential decision-making and supports integration with knowledge evolution mechanisms for learning and refining abstract concepts over time.

Abstract: We present a conceptual space framework for modelling abstract concepts that unfold over time, demonstrated through a chess-based proof-of-concept. Strategy concepts, such as attack or sacrifice, are represented as geometric regions across interpretable quality dimensions, with chess games instantiated and analysed as trajectories whose directional movement toward regions enables recognition of intended strategies. This approach also supports dual-perspective modelling, capturing how players interpret identical situations differently. Our implementation demonstrates the feasibility of trajectory-based concept recognition, with movement patterns aligning with expert commentary. This work explores extending the conceptual spaces theory to temporally realised, goal-directed concepts. The approach establishes a foundation for broader applications involving sequential decision-making and supports integration with knowledge evolution mechanisms for learning and refining abstract concepts over time.

[362] BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

Dionizije Fa, Marko Čuljak, Bruno Pandža, Mateo Čupić

Main category: cs.AI

TL;DR: BioAgent Bench: A benchmark for evaluating AI agents on bioinformatics tasks with automated assessment, stress testing, and privacy considerations.

DetailsMotivation: Need to measure performance and robustness of AI agents in bioinformatics workflows, which involve multi-step pipelines, sensitive data, and require reliable reasoning under various conditions.

Method: Created benchmark dataset with curated end-to-end bioinformatics tasks (RNA-seq, variant calling, metagenomics) with concrete output specifications. Used LLM-based grader to score pipeline progress and outcome validity, with stress testing via controlled perturbations (corrupted inputs, decoy files, prompt bloat). Evaluated both closed-source and open-weight models across multiple agent harnesses.

Result: Frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations, showing that correct high-level pipeline construction doesn’t guarantee reliable step-level reasoning. Open-weight models may be preferable for privacy-sensitive workflows despite lower completion rates.

Conclusion: BioAgent Bench provides a comprehensive evaluation framework for AI agents in bioinformatics, revealing both capabilities and limitations. Privacy considerations make open-weight models viable alternatives to closed-source models in sensitive contexts, despite performance trade-offs.

Abstract: This paper introduces BioAgent Bench, a benchmark dataset and an evaluation suite designed for measuring the performance and robustness of AI agents in common bioinformatics tasks. The benchmark contains curated end-to-end tasks (e.g., RNA-seq, variant calling, metagenomics) with prompts that specify concrete output artifacts to support automated assessment, including stress testing under controlled perturbations. We evaluate frontier closed-source and open-weight models across multiple agent harnesses, and use an LLM-based grader to score pipeline progress and outcome validity. We find that frontier agents can complete multi-step bioinformatics pipelines without elaborate custom scaffolding, often producing the requested final artifacts reliably. However, robustness tests reveal failure modes under controlled perturbations (corrupted inputs, decoy files, and prompt bloat), indicating that correct high-level pipeline construction does not guarantee reliable step-level reasoning. Finally, because bioinformatics workflows may involve sensitive patient data, proprietary references, or unpublished IP, closed-source models can be unsuitable under strict privacy constraints; in such settings, open-weight models may be preferable despite lower completion rates. We release the dataset and evaluation suite publicly.

[363] A Unified XAI-LLM Approach for Endotracheal Suctioning Activity Recognition

Hoang Khang Phan, Quang Vinh Dang, Noriyo Colley, Christina Garcia, Nhat Tan Le

Main category: cs.AI

TL;DR: A unified LLM-centered framework for video-based activity recognition and feedback generation for endotracheal suctioning training, outperforming conventional ML/DL approaches by 15-20% in accuracy/F1 scores.

DetailsMotivation: Endotracheal suctioning is a critical clinical procedure requiring high skill, but automated recognition and feedback systems for training are underexplored, especially in home care and educational settings with limited supervision.

Method: Proposes a unified LLM-centered framework where the LLM serves as central reasoning module for spatiotemporal activity recognition and explainable decision analysis from video data, plus verbalizing feedback in natural language.

Result: LLM-based approach outperforms baseline models with ~15-20% improvement in accuracy and F1 scores. Framework includes pilot student-support module with anomaly detection and XAI principles for automated, interpretable feedback.

Conclusion: Establishes scalable, interpretable, data-driven foundation for advancing nursing education, enhancing training efficiency, and improving patient safety through video-based activity recognition and feedback generation.

Abstract: Endotracheal suctioning (ES) is an invasive yet essential clinical procedure that requires a high degree of skill to minimize patient risk - particularly in home care and educational settings, where consistent supervision may be limited. Despite its critical importance, automated recognition and feedback systems for ES training remain underexplored. To address this gap, this study proposes a unified, LLM-centered framework for video-based activity recognition benchmarked against conventional machine learning and deep learning approaches, and a pilot study on feedback generation. Within this framework, the Large Language Model (LLM) serves as the central reasoning module, performing both spatiotemporal activity recognition and explainable decision analysis from video data. Furthermore, the LLM is capable of verbalizing feedback in natural language, thereby translating complex technical insights into accessible, human-understandable guidance for trainees. Experimental results demonstrate that the proposed LLM-based approach outperforms baseline models, achieving an improvement of approximately 15-20% in both accuracy and F1 score. Beyond recognition, the framework incorporates a pilot student-support module built upon anomaly detection and explainable AI (XAI) principles, which provides automated, interpretable feedback highlighting correct actions and suggesting targeted improvements. Collectively, these contributions establish a scalable, interpretable, and data-driven foundation for advancing nursing education, enhancing training efficiency, and ultimately improving patient safety.

[364] CORE: Toward Ubiquitous 6G Intelligence Through Collaborative Orchestration of Large Language Model Agents Over Hierarchical Edge

Zitong Yu, Boquan Sun, Yang Li, Zheyan Qu, Xing Zhang

Main category: cs.AI

TL;DR: CORE is a collaborative LLM framework for 6G edge networks that distributes specialized LLM agents across devices and servers, using optimization modules and role affinity scheduling to enable efficient distributed reasoning.

DetailsMotivation: The paper addresses challenges in realizing ubiquitous intelligence in 6G networks due to fragmented and heterogeneous computing resources that are insufficient for individual LLM agents to perform complex reasoning tasks. There's a need for efficient collaboration among distributed AI agents across hierarchical edge infrastructure.

Method: Proposes CORE framework with multiple LLMs assigned distinct functional roles distributed across mobile devices and tiered edge servers. Includes three optimization modules: real-time perception, dynamic role orchestration, and pipeline-parallel execution. Introduces a novel role affinity scheduling algorithm for dynamically orchestrating LLM role assignments across hierarchical edge infrastructure.
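
The role affinity scheduling idea can be sketched as a greedy assignment: place each LLM role on the edge node with the highest affinity that still has capacity. The affinity scores, capacity model, and greedy ordering below are hypothetical simplifications; the paper's actual algorithm is not specified in this summary.

```python
# Hypothetical greedy sketch of role-affinity scheduling over hierarchical edge nodes.
from typing import Dict, List, Tuple

def schedule_roles(
    roles: Dict[str, float],                  # role -> compute demand
    nodes: Dict[str, float],                  # node -> available capacity
    affinity: Dict[Tuple[str, str], float],   # (role, node) -> affinity score
) -> Dict[str, str]:
    remaining = dict(nodes)
    assignment: Dict[str, str] = {}
    # place the most demanding roles first
    for role, demand in sorted(roles.items(), key=lambda kv: -kv[1]):
        candidates: List[Tuple[float, str]] = [
            (affinity.get((role, node), 0.0), node)
            for node, cap in remaining.items() if cap >= demand
        ]
        if not candidates:
            raise RuntimeError(f"no node can host role {role!r}")
        _, best = max(candidates)              # highest-affinity feasible node
        assignment[role] = best
        remaining[best] -= demand
    return assignment
```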

Result: Comprehensive case studies and performance evaluations across various 6G application scenarios demonstrated efficacy, showing significant enhancements in system efficiency and task completion rates. Further validation on real-world edge-computing platform exhibited robust performance in operational environments.

Conclusion: CORE successfully addresses the challenge of fragmented computing resources in 6G networks by enabling efficient collaboration among distributed LLM agents through role-based orchestration and optimization, making it practical for real-world edge computing deployments.

Abstract: Rapid advancements in sixth-generation (6G) networks and large language models (LLMs) have paved the way for ubiquitous intelligence, wherein seamless connectivity and distributed artificial intelligence (AI) have revolutionized various aspects of our lives. However, realizing this vision faces significant challenges owing to the fragmented and heterogeneous computing resources across hierarchical networks, which are insufficient for individual LLM agents to perform complex reasoning tasks. To address this issue, we propose Collaborative Orchestration Role at Edge (CORE), an innovative framework that employs a collaborative learning system in which multiple LLMs, each assigned a distinct functional role, are distributed across mobile devices and tiered edge servers. The system integrates three optimization modules, encompassing real-time perception, dynamic role orchestration, and pipeline-parallel execution, to facilitate efficient and rapid collaboration among distributed agents. Furthermore, we introduce a novel role affinity scheduling algorithm for dynamically orchestrating LLM role assignments across the hierarchical edge infrastructure, intelligently matching computational demands with available dispersed resources. Finally, comprehensive case studies and performance evaluations across various 6G application scenarios demonstrated the efficacy of CORE, revealing significant enhancements in system efficiency and task completion rates. Building on these promising outcomes, we further validated the practical applicability of CORE by deploying it on a real-world edge-computing platform, which exhibits robust performance in operational environments.

[365] Looking Beyond Accuracy: A Holistic Benchmark of ECG Foundation Models

Francesca Filice, Edoardo De Rose, Simone Bartucci, Francesco Calimeri, Simona Perri

Main category: cs.AI

TL;DR: A comprehensive benchmarking framework for ECG-expert Foundation Models that evaluates both downstream performance and representation-level patterns using SHAP and UMAP techniques.

DetailsMotivation: While Foundation Models enable AI-assisted ECG interpretation through reusable embeddings, there's a need for rigorous assessment of their generalizability in healthcare. Existing benchmarks focus mainly on downstream performance, lacking deeper representation-level analysis.

Method: Introduces a benchmark methodology combining performance-based evaluation with representation-level analysis using SHAP (for feature importance) and UMAP (for dimensionality reduction and visualization). Evaluates multiple ECG-expert FMs pretrained with state-of-the-art techniques across cross-continental datasets, including data-scarce scenarios.

Result: Experimental results show the benchmarking protocol provides rich insights into ECG-expert FMs’ embedded patterns, enabling deeper understanding of their representational structure and generalizability across different data availability settings.

Conclusion: The proposed comprehensive benchmarking framework fills the gap in evaluating ECG-expert Foundation Models by providing both performance and representation-level analysis, crucial for responsible deployment in healthcare applications.

Abstract: The electrocardiogram (ECG) is a cost-effective, highly accessible and widely employed diagnostic tool. With the advent of Foundation Models (FMs), the field of AI-assisted ECG interpretation has begun to evolve, as they enable model reuse across different tasks by relying on embeddings. However, to responsibly employ FMs, it is crucial to rigorously assess to what extent the embeddings they produce are generalizable, particularly in error-sensitive domains such as healthcare. Although prior works have already addressed the problem of benchmarking ECG-expert FMs, they focus predominantly on the evaluation of downstream performance. To fill this gap, this study aims to find an in-depth, comprehensive benchmarking framework for FMs, with a specific focus on ECG-expert ones. To this aim, we introduce a benchmark methodology that complements performance-based evaluation with representation-level analysis, leveraging SHAP and UMAP techniques. Furthermore, we rely on the methodology for carrying out an extensive evaluation of several ECG-expert FMs pretrained via state-of-the-art techniques over different cross-continental datasets and data availability settings; this includes ones featuring data scarcity, a fairly common situation in real-world medical scenarios. Experimental results show that our benchmarking protocol provides rich insight into ECG-expert FMs’ embedded patterns, enabling a deeper understanding of their representational structure and generalizability.

[366] Bridging Forecast Accuracy and Inventory KPIs: A Simulation-Based Software Framework

So Fukuhara, Abdallah Alabdallah, Nuwan Gunasekara, Slawomir Nowaczyk

Main category: cs.AI

TL;DR: A simulation framework for evaluating forecasting models in spare parts inventory management that connects statistical accuracy to operational KPIs like cost and service level.

DetailsMotivation: Current forecasting evaluation focuses on statistical accuracy metrics (MAE, RMSE) rather than operational impact on inventory management KPIs like total cost and service level, creating a disconnect between model quality and real-world performance.

Method: Proposed a decision-centric simulation framework with three components: (1) synthetic demand generator for spare-parts characteristics, (2) flexible forecasting module for arbitrary predictive models, and (3) inventory control simulator that computes operational KPIs from forecasts.
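
The closed loop described by the three components can be shown with a toy example: generate intermittent demand, plug in any forecaster, and let a simple inventory policy turn forecasts into cost and service-level KPIs. The base-stock policy, cost parameters, and demand model below are illustrative choices, not the framework's configuration.

```python
# Minimal closed-loop sketch: demand generator -> pluggable forecaster -> KPI simulator.
import numpy as np

def intermittent_demand(periods: int, p_order: float = 0.3, mean_size: float = 4.0,
                        seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    occurs = rng.random(periods) < p_order            # sporadic order occurrences
    sizes = rng.poisson(mean_size, periods)           # order sizes when demand occurs
    return occurs * sizes

def simulate_kpis(demand: np.ndarray, forecast, holding_cost: float = 1.0,
                  stockout_cost: float = 10.0, safety_factor: float = 1.5) -> dict:
    on_hand, total_cost, served, demanded = 0.0, 0.0, 0.0, 0.0
    for t in range(1, len(demand)):
        order_up_to = safety_factor * forecast(demand[:t])   # forecast drives the policy
        on_hand = max(on_hand, order_up_to)                  # instantaneous replenishment
        sold = min(on_hand, demand[t])
        total_cost += holding_cost * (on_hand - sold) + stockout_cost * (demand[t] - sold)
        served += sold
        demanded += demand[t]
        on_hand -= sold
    return {"total_cost": total_cost, "service_level": served / max(demanded, 1e-8)}

# e.g. compare a naive mean forecaster against any other predictive model
demand = intermittent_demand(200)
print(simulate_kpis(demand, forecast=lambda history: history.mean()))
```

The point of such a loop, as the paper argues, is that two forecasters with similar error metrics can yield very different cost/service trade-offs once run through the inventory policy.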

Result: Showed that improvements in conventional accuracy metrics don’t necessarily translate to better operational performance, and models with similar statistical error can produce different cost-service trade-offs. Analyzed how specific forecast performance aspects affect inventory outcomes.

Conclusion: The framework operationalizes the link between demand forecasting and inventory management, shifting evaluation from predictive accuracy toward operational relevance in automotive aftermarket and related domains.

Abstract: Efficient management of spare parts inventory is crucial in the automotive aftermarket, where demand is highly intermittent and uncertainty drives substantial cost and service risks. Forecasting is therefore central, but the quality of a forecasting model should be judged not by statistical accuracy (e.g., MAE, RMSE, IAE) but rather by its impact on key operational performance indicators (KPIs), such as total cost and service level. Yet most existing work evaluates models exclusively using accuracy metrics, and the relationship between these metrics and operational KPIs remains poorly understood. To address this gap, we propose a decision-centric simulation software framework that enables systematic evaluation of forecasting models in a realistic inventory management setting. The framework comprises: (i) a synthetic demand generator tailored to spare-parts demand characteristics, (ii) a flexible forecasting module that can host arbitrary predictive models, and (iii) an inventory control simulator that consumes the forecasts and computes operational KPIs. This closed-loop setup enables researchers to evaluate models not only in terms of statistical error but also in terms of their downstream implications for inventory decisions. Using a wide range of simulation scenarios, we show that improvements in conventional accuracy metrics do not necessarily translate into better operational performance, and that models with similar statistical error profiles can induce markedly different cost-service trade-offs. We analyze these discrepancies to characterize how specific aspects of forecast performance affect inventory outcomes and derive guidance for model selection. Overall, the framework operationalizes the link between demand forecasting and inventory management, shifting evaluation from purely predictive accuracy toward operational relevance in the automotive aftermarket and related domains.

[367] KnowBias: Mitigating Social Bias in LLMs via Know-Bias Neuron Enhancement

Jinhao Pan, Chahat Raj, Anjishnu Mukherjee, Sina Mansouri, Bowen Wei, Shloka Yada, Ziwei Zhu

Main category: cs.AI

TL;DR: KnowBias is a lightweight framework that mitigates LLM bias by strengthening neurons encoding bias-knowledge rather than suppressing them, achieving state-of-the-art debiasing with minimal utility degradation.

DetailsMotivation: Existing debiasing methods that suppress biased behavior are brittle, weakly generalizable, data-inefficient, and prone to degrading general model capabilities, limiting safe deployment of LLMs.

Method: KnowBias identifies neurons encoding bias knowledge using attribution-based analysis on a small set of bias-knowledge questions, then selectively enhances these neurons at inference time without retraining.
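
The summary does not specify the attribution rule or the enhancement operator, so the sketch below assumes a common activation-times-gradient attribution and a simple multiplicative boost applied through a forward hook; the stand-in linear layer and probe tensors are illustrative only.

```python
import torch

def find_bias_knowledge_neurons(hidden, loss, top_k=32):
    """Attribution ~ |activation * gradient| (one common choice) for each hidden neuron,
    averaged over a handful of yes/no bias-knowledge probes."""
    grads = torch.autograd.grad(loss, hidden)[0]
    attribution = (hidden * grads).abs().mean(dim=0)
    return torch.topk(attribution, top_k).indices

def enhance_neurons(layer, neuron_idx, alpha=2.0):
    """Forward hook that amplifies the selected neurons at inference time (no retraining)."""
    def hook(_module, _inputs, output):
        output[..., neuron_idx] *= alpha
        return output
    return layer.register_forward_hook(hook)

# Toy stand-in for one MLP block of an LLM.
layer = torch.nn.Linear(16, 16)
probes = torch.randn(8, 16)                      # embeddings of bias-knowledge questions
hidden = layer(probes)
loss = hidden.sum()                              # stand-in for the model's answer loss
idx = find_bias_knowledge_neurons(hidden, loss, top_k=4)
handle = enhance_neurons(layer, idx)             # subsequent inference passes are "enhanced"
```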

Result: Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation, showing strong generalization across bias types and demographics.

Conclusion: KnowBias offers a lightweight, data-efficient alternative to suppressive debiasing methods that preserves general capabilities while effectively mitigating social biases in LLMs.

Abstract: Large language models (LLMs) exhibit social biases that reinforce harmful stereotypes, limiting their safe deployment. Most existing debiasing methods adopt a suppressive paradigm by modifying parameters, prompts, or neurons associated with biased behavior; however, such approaches are often brittle, weakly generalizable, data-inefficient, and prone to degrading general capability. We propose \textbf{KnowBias}, a lightweight and conceptually distinct framework that mitigates bias by strengthening, rather than suppressing, neurons encoding bias-knowledge. KnowBias identifies neurons encoding bias knowledge using a small set of bias-knowledge questions via attribution-based analysis, and selectively enhances them at inference time. This design enables strong debiasing while preserving general capabilities, generalizes across bias types and demographics, and is highly data efficient, requiring only a handful of simple yes/no questions and no retraining. Experiments across multiple benchmarks and LLMs demonstrate consistent state-of-the-art debiasing performance with minimal utility degradation. Data and code are available at https://github.com/JP-25/KnowBias.

[368] WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp

Main category: cs.AI

TL;DR: WebArbiter is a reasoning-first Process Reward Model for web navigation that generates structured justifications with preference verdicts, trained via reasoning distillation and RL to correct teacher biases, achieving state-of-the-art performance on web task benchmarks.

DetailsMotivation: Web agents face challenges with long-horizon sequential decision-making where outcome-based supervision is sparse and delayed. Existing Process Reward Models (WebPRMs) are limited: scalar WebPRMs provide coarse signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout/semantic changes and mislabels superficially correct actions.

Method: WebArbiter formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the most conducive action. Training uses a two-stage pipeline: 1) reasoning distillation to equip the model with principle-guided reasoning, and 2) reinforcement learning to correct teacher biases by directly aligning verdicts with correctness.

Result: On WebPRMBench (new benchmark spanning four web environments), WebArbiter-7B outperforms GPT-5 by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points.

Conclusion: WebArbiter demonstrates robustness and practical value in real-world complex web tasks through its reasoning-first approach that provides interpretable justifications and strong generalization capabilities.

Abstract: Web agents hold great potential for automating complex computer tasks, yet their interactions involve long-horizon, sequential decision-making with irreversible actions. In such settings, outcome-based supervision is sparse and delayed, often rewarding incorrect trajectories and failing to support inference-time scaling. This motivates the use of Process Reward Models (WebPRMs) for web navigation, but existing approaches remain limited: scalar WebPRMs collapse progress into coarse, weakly grounded signals, while checklist-based WebPRMs rely on brittle template matching that fails under layout or semantic changes and often mislabels superficially correct actions as successful, providing little insight or interpretability. To address these challenges, we introduce WebArbiter, a reasoning-first, principle-inducing WebPRM that formulates reward modeling as text generation, producing structured justifications that conclude with a preference verdict and identify the action most conducive to task completion under the current context. Training follows a two-stage pipeline: reasoning distillation equips the model with coherent principle-guided reasoning, and reinforcement learning corrects teacher biases by directly aligning verdicts with correctness, enabling stronger generalization. To support systematic evaluation, we release WebPRMBench, a comprehensive benchmark spanning four diverse web environments with rich tasks and high-quality preference annotations. On WebPRMBench, WebArbiter-7B outperforms the strongest baseline, GPT-5, by 9.1 points. In reward-guided trajectory search on WebArena-Lite, it surpasses the best prior WebPRM by up to 7.2 points, underscoring its robustness and practical value in real-world complex web tasks.

[369] Making Models Unmergeable via Scaling-Sensitive Loss Landscape

Minwoo Jang, Hoyoung Kim, Jabin Koo, Jungseul Ok

Main category: cs.AI

TL;DR: TRAP² is an architecture-agnostic framework that protects model weights from unauthorized merging by encoding protection during fine-tuning, making weights effective standalone but degraded when rescaled during merging.

DetailsMotivation: The modularity of model hubs enables easy model merging but creates a governance gap where users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Existing defenses are post-hoc and architecture-specific, providing inconsistent protection across diverse architectures and release formats.

Method: TRAP² encodes protection into the update during fine-tuning using weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use but degrades them under re-scaling that often arises in merging, undermining unauthorized merging attempts.
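
A rough sense of the training signal can be given in code. The loss below is an assumed reading of the idea, not the paper's objective: the fine-tuned weights must keep the task loss low at full strength, while any down-scaled version of the fine-tuning update (the proxy for merging) must be at least a margin worse.

```python
import torch
from torch.func import functional_call

def trap_style_loss(model, base_params, batch, task_loss_fn, scales=(0.5, 0.3),
                    margin=2.0, lam=1.0):
    """Illustrative objective: the fine-tuned weights must work at full strength, but the
    task loss should rise by at least `margin` once the fine-tuning update (theta - base)
    is down-scaled, a simple proxy for how merging mixes weights."""
    params = dict(model.named_parameters())
    clean = task_loss_fn(functional_call(model, params, batch))
    penalty = 0.0
    for s in scales:
        rescaled_params = {n: base_params[n] + s * (p - base_params[n])
                           for n, p in params.items()}
        rescaled = task_loss_fn(functional_call(model, rescaled_params, batch))
        penalty = penalty + torch.relu(margin - (rescaled - clean.detach()))
    return clean + lam * penalty

# Toy usage with a stand-in regression task.
model = torch.nn.Linear(8, 1)
base_params = {n: p.detach().clone() for n, p in model.named_parameters()}
x, y = torch.randn(32, 8), torch.randn(32, 1)
loss = trap_style_loss(model, base_params, (x,),
                       lambda pred: torch.nn.functional.mse_loss(pred, y))
loss.backward()
```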

Result: The framework provides architecture-agnostic protection that works regardless of whether models are released as adapters or full models, effectively preventing unauthorized model merging while maintaining standalone functionality.

Conclusion: TRAP² closes the governance gap in model hubs by providing a practical, architecture-agnostic solution to protect model weights from unauthorized merging, addressing both safety alignment and licensing concerns in the era of modular AI components.

Abstract: The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a \emph{governance gap}: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose \textsc{Trap}$^{2}$, an architecture-agnostic protection framework that encodes protection into the update during fine-tuning, regardless of whether the weights are released as adapters or full models. Instead of relying on architecture-dependent approaches, \textsc{Trap}$^{2}$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under the re-scaling that often arises in merging, undermining unauthorized merging attempts.

[370] From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

Shaojie Wang, Liang Zhang

Main category: cs.AI

TL;DR: CoMT proposes a two-stage LLM training framework inspired by human cognition: first learning abstract reasoning patterns (meta-knowledge) through supervised learning, then adapting to specific tasks via confidence-calibrated reinforcement learning, improving generalization and efficiency.

DetailsMotivation: Current LLM post-training methods treat complete reasoning trajectories as basic units, which is problem-centric and entangles abstract strategies with problem-specific execution. This misaligns with human cognition where problem-solving naturally decomposes into two stages: acquiring abstract strategies first, then adapting them to specific instances.

Method: Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions to acquire generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading.
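
The exact form of the confidence-aware reward is not given in this summary; the sketch below shows one plausible shaping rule, in which correct steps earn reward in proportion to stated confidence, confidently wrong steps are penalized, and the per-step term is blended with the final outcome signal. All names and constants are assumptions.

```python
def confidence_calibrated_step_reward(step_correct: bool, confidence: float,
                                      wrong_penalty: float = 1.0) -> float:
    """Reward a correct step in proportion to its stated confidence, and penalize an
    incorrect step more the more confident the model was (illustrative shaping rule)."""
    confidence = min(max(confidence, 0.0), 1.0)
    return confidence if step_correct else -wrong_penalty * confidence

def trajectory_return(steps, outcome_reward: float, step_weight: float = 0.5) -> float:
    """Blend per-step confidence-calibrated rewards with the final outcome signal."""
    step_term = sum(confidence_calibrated_step_reward(c, p) for c, p in steps)
    return outcome_reward + step_weight * step_term / max(len(steps), 1)

# Example: three intermediate steps (correct?, confidence) plus a correct final answer.
print(trajectory_return([(True, 0.9), (False, 0.8), (True, 0.6)], outcome_reward=1.0))
```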

Result: Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%.

Conclusion: Aligning post-training with human cognitive principles yields superior generalization and enhanced training efficiency, demonstrating that explicitly modeling the two-stage human cognitive process improves LLM reasoning capabilities.

Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show 2.19% and 4.63% improvements in-distribution and out-of-distribution respectively over standard methods, while reducing training time by 65-70% and token consumption by 50%, demonstrating that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.

[371] ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation

Zhao Wang, Ziliang Zhao, Zhicheng Dou

Main category: cs.AI

TL;DR: ProRAG introduces a process-supervised reinforcement learning framework for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks, addressing reward sparsity and inefficient credit assignment through step-level process supervision.

DetailsMotivation: Traditional outcome-based RL for RAG suffers from reward sparsity and inefficient credit assignment, leading to "process hallucinations" where models reach correct answers through flawed logic. Existing process-aware approaches lack on-policy exploration capabilities to decouple step-level credit from global outcomes.

Method: Four-stage framework: (1) Supervised Policy Warmup to initialize reasoning format, (2) MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality, (3) PRM-Guided Reasoning Refinement to align policy with process preferences, (4) Process-Supervised RL with dual-granularity advantage mechanism combining step-level process rewards with global outcome signals.
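
The dual-granularity advantage of stage (4) is the most mechanical piece; the sketch below shows one assumed instantiation in which the global outcome reward and the step-level PRM scores are each normalized across a group of sampled trajectories and then summed per step.

```python
import numpy as np

def dual_granularity_advantages(step_rewards, outcome_rewards, beta=0.5):
    """For a group of sampled trajectories, combine a normalized global outcome advantage
    with normalized step-level PRM advantages (illustrative, GRPO-style combination).

    step_rewards    : list of per-trajectory lists of PRM scores, one per reasoning step
    outcome_rewards : one scalar outcome reward per trajectory
    returns         : per-trajectory lists of per-step advantages
    """
    outcome = np.asarray(outcome_rewards, dtype=float)
    outcome_adv = (outcome - outcome.mean()) / (outcome.std() + 1e-8)
    all_steps = np.concatenate([np.asarray(s, dtype=float) for s in step_rewards])
    mu, sigma = all_steps.mean(), all_steps.std() + 1e-8
    advantages = []
    for traj_steps, o_adv in zip(step_rewards, outcome_adv):
        step_adv = (np.asarray(traj_steps, dtype=float) - mu) / sigma
        advantages.append((o_adv + beta * step_adv).tolist())   # every action gets feedback
    return advantages

# Two sampled trajectories: per-step PRM scores and final-answer correctness.
print(dual_granularity_advantages([[0.2, 0.9, 0.7], [0.1, 0.3]], [1.0, 0.0]))
```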

Result: Extensive experiments on five multi-hop reasoning benchmarks show ProRAG achieves superior performance compared to outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision.

Conclusion: ProRAG successfully addresses reward sparsity and credit assignment issues in RL for RAG by integrating learned step-level supervision into the online optimization loop, demonstrating the value of process supervision for complex reasoning tasks.

Abstract: Reinforcement learning (RL) has become a promising paradigm for optimizing Retrieval-Augmented Generation (RAG) in complex reasoning tasks. However, traditional outcome-based RL approaches often suffer from reward sparsity and inefficient credit assignment, as coarse-grained scalar rewards fail to identify specific erroneous steps within long-horizon trajectories. This ambiguity frequently leads to “process hallucinations”, where models reach correct answers through flawed logic or redundant retrieval steps. Although recent process-aware approaches attempt to mitigate this via static preference learning or heuristic reward shaping, they often lack the on-policy exploration capabilities required to decouple step-level credit from global outcomes. To address these challenges, we propose ProRAG, a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism. By aggregating step-level process rewards with global outcome signals, ProRAG provides precise feedback for every action. Extensive experiments on five multi-hop reasoning benchmarks demonstrate that ProRAG achieves superior overall performance compared to strong outcome-based and process-aware RL baselines, particularly on complex long-horizon tasks, validating the effectiveness of fine-grained process supervision. The code and model are available at https://github.com/lilinwz/ProRAG.

[372] JADE: Bridging the Strategic-Operational Gap in Dynamic Agentic RAG

Yiqun Chen, Erhan Zhang, Tianyi Hu, Shijie Wang, Zixuan Yang, Meizhi Zhong, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

Main category: cs.AI

TL;DR: JADE is a joint optimization framework for agentic RAG systems that co-adapts planning and execution modules within dynamic multi-turn workflows using a shared backbone and outcome-based rewards.

DetailsMotivation: Existing RAG systems face a "strategic-operational mismatch" where sophisticated planning strategies fail due to unadapted local executors, leading to negative performance gains despite increased complexity. Current approaches either optimize modules jointly within rigid architectures or enable dynamic planning while treating executors as frozen black-box tools.

Method: JADE models the system as a cooperative multi-agent team unified under a single shared backbone, enabling end-to-end learning driven by outcome-based rewards. This facilitates co-adaptation where the planner learns to operate within executor capabilities while executors evolve to align with strategic intent.

Result: Empirical results show JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling flexible balance between efficiency and effectiveness through dynamic workflow orchestration.

Conclusion: JADE provides a unified framework for joint optimization of planning and execution in dynamic multi-turn workflows, addressing the strategic-operational mismatch in agentic RAG systems through co-adaptation of modules.

Abstract: The evolution of Retrieval-Augmented Generation (RAG) has shifted from static retrieval pipelines to dynamic, agentic workflows where a central planner orchestrates multi-turn reasoning. However, existing paradigms face a critical dichotomy: they either optimize modules jointly within rigid, fixed-graph architectures, or empower dynamic planning while treating executors as frozen, black-box tools. We identify that this \textit{decoupled optimization} creates a ``strategic-operational mismatch,’’ where sophisticated planning strategies fail to materialize due to unadapted local executors, often leading to negative performance gains despite increased system complexity. In this paper, we propose \textbf{JADE} (\textbf{J}oint \textbf{A}gentic \textbf{D}ynamic \textbf{E}xecution), a unified framework for the joint optimization of planning and execution within dynamic, multi-turn workflows. By modeling the system as a cooperative multi-agent team unified under a single shared backbone, JADE enables end-to-end learning driven by outcome-based rewards. This approach facilitates \textit{co-adaptation}: the planner learns to operate within the capability boundaries of the executors, while the executors evolve to align with high-level strategic intent. Empirical results demonstrate that JADE transforms disjoint modules into a synergistic system, yielding remarkable performance improvements via joint optimization and enabling a flexible balance between efficiency and effectiveness through dynamic workflow orchestration.

[373] Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

Yiqun Chen, Jinyuan Feng, Wei Yang, Meizhi Zhong, Zhengliang Shi, Rui Li, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Zhiqiang Pu, Jiaxin Mao

Main category: cs.AI

TL;DR: Multi-agent RL framework (SCMA) reduces redundant reasoning in Large Reasoning Models by using segmentation and scoring agents to identify and penalize non-essential reasoning chunks while preserving critical logic.

DetailsMotivation: Existing RL-based approaches for reducing reasoning overhead use simplistic reward weighting that struggles to balance brevity with accuracy, often compromising essential reasoning logic when enforcing shorter responses.

Method: Proposes Self-Compression via MARL (SCMA) with three agents: Segmentation Agent decomposes reasoning into logical chunks, Scoring Agent quantifies chunk significance, and Reasoning Agent learns to prioritize essential logic using importance-weighted length penalty.
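
An importance-weighted length penalty can be written down compactly. The form below is an illustrative assumption: each chunk is taxed in proportion to its token count and to one minus the importance assigned by the Scoring Agent, so redundant chunks are squeezed while essential logic is left almost untouched.

```python
def importance_weighted_length_penalty(chunks, alpha=0.01):
    """Penalty charged mostly for long, low-importance chunks (illustrative form).
    `chunks` is a list of (num_tokens, importance in [0, 1]) pairs produced by the
    Segmentation and Scoring agents."""
    return alpha * sum(length * (1.0 - importance) for length, importance in chunks)

def scma_style_reward(chunks, answer_correct: bool) -> float:
    """Outcome reward minus the selective length penalty: brevity is encouraged only
    where the scored importance is low."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome - importance_weighted_length_penalty(chunks)

# A correct answer whose middle chunk (120 tokens, importance 0.1) is mostly redundant.
print(scma_style_reward([(40, 0.9), (120, 0.1), (30, 0.8)], answer_correct=True))
```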

Result: SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02% across model scales. Ablation studies show emergent behaviors yield more powerful reasoning models.

Conclusion: The multi-agent RL framework effectively reduces redundant reasoning without compromising accuracy, enabling more efficient deployment of Large Reasoning Models through selective compression of non-essential reasoning chunks.

Abstract: The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: \textbf{a Segmentation Agent} for decomposing the reasoning process into logical chunks, and \textbf{a Scoring Agent} for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing \textbf{a Reasoning Agent} to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.

[374] AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making

Jon Chun, Kathrine Elkins, Yong Suk Lee

Main category: cs.AI

TL;DR: AgenticSimLaw: A courtroom-style multi-agent debate framework for transparent, controllable reasoning in high-stakes tabular decision-making, tested on recidivism prediction.

DetailsMotivation: Addresses the need for transparent and controllable reasoning in high-stakes decision-making tasks, particularly where black-box approaches lack auditability and ethical oversight is crucial.

Method: A role-structured multi-agent debate framework with explicit agent roles (prosecutor, defense, judge), 7-turn structured interaction protocols, and private reasoning strategies, benchmarked on young adult recidivism prediction using NLSY97 dataset.
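
The orchestration itself is a fixed protocol, so a skeleton is easy to sketch. The turn order, role names, and `ask_agent` callable below are assumptions used for illustration; the point is that every turn is logged, which is what makes the process auditable.

```python
from typing import Callable, Dict, List

# Assumed 7-turn order; the paper's exact protocol may differ.
TURN_ORDER = ["prosecutor_opening", "defense_opening", "prosecutor_argument",
              "defense_argument", "prosecutor_rebuttal", "defense_rebuttal",
              "judge_verdict"]

def run_courtroom_debate(case_record: Dict, ask_agent: Callable[[str, str, List[Dict]], str]):
    """Run the structured debate and return the verdict plus a full transcript.
    `ask_agent(role, case_text, transcript)` wraps whatever LLM backend is used."""
    transcript: List[Dict] = []
    case_text = "\n".join(f"{k}: {v}" for k, v in case_record.items())
    for turn in TURN_ORDER:
        role = turn.split("_")[0]                       # prosecutor / defense / judge
        reply = ask_agent(role, case_text, transcript)
        transcript.append({"turn": turn, "role": role, "content": reply})
    return transcript[-1]["content"], transcript        # verdict + auditable log

# Stub agent so the skeleton runs without any model behind it.
verdict, log = run_courtroom_debate({"age": 19, "prior_offenses": 1},
                                    lambda role, case, t: f"[{role} speaks]")
```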

Result: Structured multi-agent debate provides more stable and generalizable performance than single-agent reasoning, with stronger correlation between accuracy and F1-score metrics, plus fine-grained control and complete interaction transcripts for explainability.

Conclusion: AgenticSimLaw offers a generalizable approach for high-stakes decision tasks requiring transparency and human oversight, addressing key LLM-based multi-agent system challenges through structured roles, logged interactions, and explicit non-deployment constraints.

Abstract: We introduce AgenticSimLaw, a role-structured, multi-agent debate framework that provides transparent and controllable test-time reasoning for high-stakes tabular decision-making tasks. Unlike black-box approaches, our courtroom-style orchestration explicitly defines agent roles (prosecutor, defense, judge), interaction protocols (7-turn structured debate), and private reasoning strategies, creating a fully auditable decision-making process. We benchmark this framework on young adult recidivism prediction using the NLSY97 dataset, comparing it against traditional chain-of-thought (CoT) prompting across almost 90 unique combinations of models and strategies. Our results demonstrate that structured multi-agent debate provides more stable and generalizable performance compared to single-agent reasoning, with stronger correlation between accuracy and F1-score metrics. Beyond performance improvements, AgenticSimLaw offers fine-grained control over reasoning steps, generates complete interaction transcripts for explainability, and enables systematic profiling of agent behaviors. While we instantiate this framework in the criminal justice domain to stress-test reasoning under ethical complexity, the approach generalizes to any deliberative, high-stakes decision task requiring transparency and human oversight. This work addresses key LLM-based multi-agent system challenges: organization through structured roles, observability through logged interactions, and responsibility through explicit non-deployment constraints for sensitive domains. Data, results, and code will be available on github.com under the MIT license.

[375] Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Shuangshuang Ying, Zheyu Wang, Yunjian Peng, Jin Chen, Yuhao Wu, Hongbin Lin, Dingyu He, Siyi Liu, Gengchen Yu, YinZhu Piao, Yuchen Wu, Xin Gui, Zhongyuan Peng, Xin Li, Xeron Du, Libo Qin, YiXin Cao, Ge Zhang

Main category: cs.AI

TL;DR: DeR2 is a controlled evaluation framework that isolates document-grounded reasoning in scientific research contexts, decoupling evidence access from reasoning to measure true reasoning capabilities beyond retrieval and memorization.

DetailsMotivation: Current benchmarks for large language models conflate reasoning with retrieval and toolchain choices, and are contaminated by parametric memorization and open-web volatility, making it unclear whether models can genuinely reason over novel scientific information.

Method: DeR2 introduces a deep-research sandbox with four controlled regimes: Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus distractors). It uses frozen document libraries from 2023-2025 theoretical papers with expert-annotated concepts and validated rationales, with two-phase validation to prevent parametric leakage.

Result: Experiments reveal substantial variation across state-of-the-art foundation models, with some showing mode-switch fragility (performing worse with Full-set than Instruction-only) and others exhibiting structural concept misuse (naming concepts correctly but failing to execute them procedurally).

Conclusion: DeR2 provides a controlled framework for evaluating document-grounded reasoning, revealing significant headroom for improvement in how models reason over novel scientific information beyond retrieval capabilities and memorization.

Abstract: Despite strong performance on existing benchmarks, it remains unclear whether large language models can reason over genuinely novel scientific information. Most evaluations score end-to-end RAG pipelines, where reasoning is confounded with retrieval and toolchain choices, and the signal is further contaminated by parametric memorization and open-web volatility. We introduce DeR2, a controlled deep-research sandbox that isolates document-grounded reasoning while preserving core difficulties of deep search: multi-step synthesis, denoising, and evidence-based conclusion making. DeR2 decouples evidence access from reasoning via four regimes–Instruction-only, Concepts (gold concepts without documents), Related-only (only relevant documents), and Full-set (relevant documents plus topically related distractors)–yielding interpretable regime gaps that operationalize retrieval loss vs. reasoning loss and enable fine-grained error attribution. To prevent parametric leakage, we apply a two-phase validation that requires parametric failure without evidence while ensuring oracle-concept solvability. To ensure reproducibility, each instance provides a frozen document library (drawn from 2023-2025 theoretical papers) with expert-annotated concepts and validated rationales. Experiments across a diverse set of state-of-the-art foundation models reveal substantial variation and significant headroom: some models exhibit mode-switch fragility, performing worse with the Full-set than with Instruction-only, while others show structural concept misuse, correctly naming concepts but failing to execute them as procedures.

[376] Exploring Reasoning Reward Model for Agents

Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue

Main category: cs.AI

TL;DR: Agent-RRM introduces a multi-faceted reward model for agentic RL that provides structured feedback on reasoning traces, critiques, and overall scores to improve training beyond sparse outcome-based rewards.

DetailsMotivation: Current agentic RL methods rely on sparse outcome-based rewards that fail to differentiate intermediate reasoning quality, leading to suboptimal training results. There's a need for more nuanced feedback mechanisms.

Method: Proposes Agent Reasoning Reward Model (Agent-RRM) with three components: explicit reasoning traces, focused critiques highlighting reasoning flaws, and overall process scores. Investigates three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration).

Result: Extensive evaluations across 12 diverse benchmarks show Reagent-U achieves substantial performance improvements: 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of the reasoning reward model and training schemes.

Conclusion: Agent-RRM provides structured, multi-faceted feedback that significantly improves agentic RL performance by addressing the limitations of sparse outcome-based rewards, with Reagent-U emerging as the most effective integration strategy.

Abstract: Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.

[377] ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models

Bowen Fang, Wen Ye, Yunyue Su, Jinghao Zhang, Qiang Liu, Yesheng Liu, Xin Sun, Shu Wu, Jiabing Yang, Baole Wei, Liang Wang

Main category: cs.AI

TL;DR: ToolWeaver: A generative tool learning framework that encodes tools into hierarchical sequences instead of unique tokens, enabling logarithmic vocabulary expansion and better learning of collaborative tool relationships.

DetailsMotivation: Current retrieval-based tool-use pipelines have dual semantic challenges: encoders fail to capture complex semantics, and LLMs lack intrinsic tool knowledge. Generative methods that map each tool to a unique token create scalability issues and semantic bottlenecks that hinder learning collaborative tool relationships.

Method: Proposes ToolWeaver framework that encodes tools into hierarchical sequences through a novel tokenization process, weaving together a tool’s intrinsic semantics with extrinsic co-usage patterns. The structured codes are integrated into LLMs through generative alignment fine-tuning.
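
The hierarchical tokenization is not spelled out in this summary; the sketch below uses recursive k-means over tool embeddings purely as a stand-in, to show how shared code prefixes arise for related tools and how the codebook stays small relative to the tool count.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_tool_codes(tool_embeddings, branching=4, depth=3, seed=0):
    """Assign each tool a length-`depth` code by recursively clustering its embedding:
    tools with similar (co-usage-aware) embeddings share code prefixes, and the codebook
    holds roughly branching * depth symbols rather than one token per tool."""
    codes = np.zeros((len(tool_embeddings), depth), dtype=int)
    groups = [(np.arange(len(tool_embeddings)), 0)]
    while groups:
        idx, level = groups.pop()
        if level == depth or len(idx) <= 1:
            continue
        k = min(branching, len(idx))
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(
            tool_embeddings[idx])
        codes[idx, level] = labels
        for c in range(k):
            groups.append((idx[labels == c], level + 1))
    return codes   # row i is the hierarchical code sequence for tool i

# 12 toy tools embedded in 8 dimensions.
rng = np.random.default_rng(0)
print(hierarchical_tool_codes(rng.normal(size=(12, 8))))
```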

Result: Evaluation with nearly 47,000 tools shows ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.

Conclusion: ToolWeaver addresses the scalability and generalization crisis in generative tool learning by using hierarchical tool encoding, enabling better learning of collaborative patterns and establishing a foundation for advanced tool-augmented agents.

Abstract: Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic in the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.

[378] How do Visual Attributes Influence Web Agents? A Comprehensive Evaluation of User Interface Design Factors

Kuai Yu, Naicheng Yu, Han Wang, Rui Yang, Huan Zhang

Main category: cs.AI

TL;DR: VAF is a pipeline to evaluate how visual webpage attributes influence web agent decision-making, showing that background contrast, item size, position, and card clarity strongly affect agent behavior while font styling and text color have minor effects.

DetailsMotivation: Existing research on web agents focuses on robustness to adversarial attacks, but lacks systematic understanding of how visual attributes shape agent decision-making in benign scenarios. There's limited research on how visual (vs. textual) attributes influence agent behavior.

Method: VAF pipeline with three stages: (1) variant generation - create webpage variants with identical semantics but different visual attributes; (2) browsing interaction - agents navigate via scrolling and clicking like human users; (3) validation - measure Target Click Rate and Target Mention Rate to evaluate visual attribute effects.
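
The two metrics in stage (iii) reduce to simple rates over repeated trials. The sketch below assumes a trial record format and a straightforward definition of each rate; both are illustrative rather than the paper's exact formulas.

```python
def target_click_rate(trials):
    """Fraction of trials in which the agent clicked the target item.
    `trials` is a list of dicts like {"clicked": "item_3", "target": "item_3",
    "reasoning": "..."} collected on one page variant."""
    return sum(t["clicked"] == t["target"] for t in trials) / len(trials)

def target_mention_rate(trials):
    """Fraction of trials whose reasoning text mentions the target item."""
    return sum(t["target"] in t["reasoning"] for t in trials) / len(trials)

def visual_attribute_effect(original_trials, variant_trials):
    """Shift in both rates between the original page and a visually perturbed variant;
    a large shift indicates the perturbed attribute steers the agent's decision."""
    return {"delta_tcr": target_click_rate(variant_trials) - target_click_rate(original_trials),
            "delta_tmr": target_mention_rate(variant_trials) - target_mention_rate(original_trials)}
```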

Result: Experiments across 8 variant families (48 variants), 5 real-world websites (shopping, travel, news), and 4 web agents show background color contrast, item size, position, and card clarity strongly influence agent actions, while font styling, text color, and item image clarity have minor effects.

Conclusion: Visual attributes significantly impact web agent decision-making, with certain attributes (background contrast, size, position, clarity) having strong influence while others (font styling, text color) are less important, providing systematic understanding of visual factors in agent behavior.

Abstract: Web agents have demonstrated strong performance on a wide range of web-based tasks. However, existing research on the effect of environmental variation has mostly focused on robustness to adversarial attacks, with less attention to agents' preferences in benign scenarios. Although early studies have examined how textual attributes influence agent behavior, a systematic understanding of how visual attributes shape agent decision-making remains limited. To address this, we introduce VAF, a controlled evaluation pipeline for quantifying how webpage Visual Attribute Factors influence web-agent decision-making. Specifically, VAF consists of three stages: (i) variant generation, which ensures the variants share the same semantics as the original item while differing only in visual attributes; (ii) browsing interaction, where agents navigate the page by scrolling and clicking the item of interest, mirroring how human users browse online; (iii) validation through both the agents' click actions and their reasoning, using the Target Click Rate and Target Mention Rate to jointly evaluate the effect of visual attributes. By quantitatively measuring the decision-making difference between the original and each variant, we identify which visual attributes influence agents' behavior most. Extensive experiments, across 8 variant families (48 variants total), 5 real-world websites (including shopping, travel, and news browsing), and 4 representative web agents, show that background color contrast, item size, position, and card clarity have a strong influence on agents' actions, whereas font styling, text color, and item image clarity exhibit minor effects.

[379] The Energy Impact of Domain Model Design in Classical Planning

Ilche Georgievski, Serhat Tekin, Marco Aiello

Main category: cs.AI

TL;DR: This paper investigates how domain model characteristics affect energy consumption in classical planning systems, introducing a framework to systematically vary domain features and measure energy efficiency across different planners.

DetailsMotivation: The motivation is to address the gap in energy efficiency research for automated planning systems, which traditionally prioritize algorithmic performance metrics like accuracy or runtime. The paper recognizes that while planning has high computational demands, its energy consumption has received little attention, despite the modular structure of planning systems enabling systematic analysis through domain model design.

Method: The authors introduce a domain model configuration framework that enables controlled variation of features such as element ordering, action arity, and dead-end states. They empirically investigate energy consumption using five benchmark domains and five state-of-the-art planners, analyzing energy and runtime impacts across 32 domain variants per benchmark.

Result: Results demonstrate that domain-level modifications produce measurable energy differences across planners, with energy consumption not always correlating with runtime. This shows that energy efficiency can be systematically analyzed and potentially optimized through domain model design choices.

Conclusion: The paper concludes that energy consumption should be considered as a critical performance dimension in automated planning, and that domain model characteristics significantly impact energy efficiency. The findings support the Green AI paradigm shift toward considering energy consumption alongside traditional performance metrics.

Abstract: AI research has traditionally prioritised algorithmic performance, such as optimising accuracy in machine learning or runtime in automated planning. The emerging paradigm of Green AI challenges this by recognising energy consumption as a critical performance dimension. Despite the high computational demands of automated planning, its energy efficiency has received little attention. This gap is particularly salient given the modular planning structure, in which domain models are specified independently of algorithms. On the other hand, this separation also enables systematic analysis of energy usage through domain model design. We empirically investigate how domain model characteristics affect the energy consumption of classical planners. We introduce a domain model configuration framework that enables controlled variation of features, such as element ordering, action arity, and dead-end states. Using five benchmark domains and five state-of-the-art planners, we analyse energy and runtime impacts across 32 domain variants per benchmark. Results demonstrate that domain-level modifications produce measurable energy differences across planners, with energy consumption not always correlating with runtime.

[380] Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models

Pranav Mahajan, Ihor Kendiukhov, Syed Hussain, Lydia Nottingham

Main category: cs.AI

TL;DR: The paper examines how different elicitation protocols affect the correlation between stated and revealed preferences in language models, finding that correlation is highly protocol-dependent and that preference elicitation needs to account for indeterminate preferences.

DetailsMotivation: Existing evaluations of language model preferences rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. There's a need to systematically study how different elicitation protocols affect the correlation between what models say they prefer (stated preferences) and what they actually choose in context (revealed preferences).

Method: The researchers systematically studied how elicitation protocols affect stated-revealed preference correlation across 24 language models. They experimented with allowing neutrality and abstention during stated preference elicitation, and further allowed abstention in revealed preferences. They also tested system prompt steering using stated preferences during revealed preference elicitation on AIRiskDilemmas.
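
The central quantity is a Spearman correlation computed after filtering out abstentions; a minimal version, with `None` assumed as the abstention encoding, looks like this.

```python
from scipy.stats import spearmanr

def stated_revealed_correlation(stated, revealed):
    """Spearman's rho between stated and revealed preference scores over a set of value
    items, dropping items where the model stayed neutral or abstained (encoded as None,
    an assumed convention)."""
    pairs = [(s, r) for s, r in zip(stated, revealed) if s is not None and r is not None]
    if len(pairs) < 2:
        return float("nan")
    s_vals, r_vals = zip(*pairs)
    rho, _pvalue = spearmanr(s_vals, r_vals)
    return rho

# Stated scores with two abstentions vs. forced-choice revealed scores.
print(stated_revealed_correlation([0.9, None, 0.4, 0.7, None], [0.8, 0.2, 0.5, 0.9, 0.1]))
```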

Result: Allowing neutrality and abstention during stated preference elicitation improved Spearman’s rank correlation between volunteered stated preferences and forced-choice revealed preferences by excluding weak signals. However, allowing abstention in revealed preferences drove correlation to near-zero or negative values due to high neutrality rates. System prompt steering using stated preferences during revealed preference elicitation did not reliably improve stated-revealed correlation on AIRiskDilemmas.

Conclusion: Stated-revealed preference correlation is highly protocol-dependent, and preference elicitation requires methods that account for indeterminate preferences rather than relying on binary forced-choice approaches that may not capture genuine model preferences.

Abstract: Recent work identifies a stated-revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Allowing neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman's rank correlation ($\rho$) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives $\rho$ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.

[381] VERSA: Verified Event Data Format for Reliable Soccer Analytics

Geonhee Jo, Mingu Kang, Kangmin Lee, Minho Lee, Pascal Bauer, Sang-Ki Ko

Main category: cs.AI

TL;DR: VERSA is a verification framework for soccer event data that detects and corrects logical inconsistencies in event streams, improving data quality for downstream analytics.

DetailsMotivation: Event stream data in sports (especially soccer) often contains logical inconsistencies like incorrect ordering or missing events, which undermine the reliability of fine-grained analytics like player contribution assessment and tactical pattern identification.

Method: VERSA uses a state-transition model to define valid event sequences, enabling automatic detection and correction of anomalous patterns in soccer event stream data.
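
A state-transition check of this kind is essentially a lookup table over allowed event successions. The transition table below covers only a handful of illustrative event types and is not the paper's model; it shows how inconsistent orderings are flagged for later correction.

```python
# A deliberately tiny, illustrative transition table: which event types may follow which.
VALID_NEXT = {
    "pass":         {"pass", "shot", "dribble", "interception", "out_of_play"},
    "dribble":      {"pass", "shot", "tackle", "out_of_play"},
    "shot":         {"goal", "save", "out_of_play", "interception"},
    "save":         {"pass", "out_of_play"},
    "tackle":       {"pass", "dribble", "out_of_play"},
    "interception": {"pass", "dribble", "out_of_play"},
    "goal":         {"kickoff"},
    "out_of_play":  {"throw_in", "goal_kick", "corner", "kickoff"},
}

def find_inconsistencies(events):
    """Return (index, prev_event, event) for every transition the model does not allow;
    a downstream correction step could then reorder or insert the missing event."""
    issues = []
    for i in range(1, len(events)):
        prev, cur = events[i - 1]["type"], events[i]["type"]
        if cur not in VALID_NEXT.get(prev, set()):
            issues.append((i, prev, cur))
    return issues

stream = [{"type": "pass"}, {"type": "shot"}, {"type": "pass"}]   # shot -> pass is suspicious
print(find_inconsistencies(stream))
```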

Result: Analysis of K League 1 (2024) data revealed 18.81% of events had logical inconsistencies. VERSA improved cross-provider consistency and enhanced robustness/performance of VAEP (player contribution evaluation) downstream task.

Conclusion: Systematic verification of event data integrity is crucial for reliable sports analytics, and VERSA effectively addresses data quality issues to improve downstream analysis reliability.

Abstract: Event stream data is a critical resource for fine-grained analysis across various domains, including financial transactions, system operations, and sports. In sports, it is actively used for fine-grained analyses such as quantifying player contributions and identifying tactical patterns. However, the reliability of these models is fundamentally limited by inherent data quality issues that cause logical inconsistencies (e.g., incorrect event ordering or missing events). To this end, this study proposes VERSA (Verified Event Data Format for Reliable Soccer Analytics), a systematic verification framework that ensures the integrity of event stream data within the soccer domain. VERSA is based on a state-transition model that defines valid event sequences, thereby enabling the automatic detection and correction of anomalous patterns within the event stream data. Notably, our examination of event data from the K League 1 (2024 season), provided by Bepro, detected that 18.81% of all recorded events exhibited logical inconsistencies. Addressing such integrity issues, our experiments demonstrate that VERSA significantly enhances cross-provider consistency, ensuring stable and unified data representation across heterogeneous sources. Furthermore, we demonstrate that data refined by VERSA significantly improves the robustness and performance of a downstream task called VAEP, which evaluates player contributions. These results highlight that the verification process is highly effective in increasing the reliability of data-driven analysis.

[382] Liquid Interfaces: A Dynamic Ontology for the Interoperability of Autonomous Systems

Dhiogo de Sá, Carlos Schmiedel, Carlos Pereira Lopes

Main category: cs.AI

TL;DR: Liquid Interfaces: A new coordination paradigm where interfaces emerge dynamically through runtime intention articulation and semantic negotiation, rather than being static artifacts.

DetailsMotivation: Current software architectures fail to support autonomous agents with adaptive, probabilistic, context-dependent reasoning because they rely on static interfaces and deterministic contracts, which are incompatible with the dynamic nature of modern agent-based systems.

Method: Introduces Liquid Interfaces as ephemeral relational events that emerge through runtime intention articulation and semantic negotiation. Formalizes the model with Liquid Interface Protocol (LIP) that governs intention-driven interaction, negotiated execution, and enforces ephemerality under semantic uncertainty.

Result: Presents a formal model and reference architecture demonstrating practical feasibility. The approach provides a principled foundation for adaptive coordination in agent-based systems.

Conclusion: Liquid Interfaces offer a novel coordination paradigm that addresses the limitations of static interfaces for autonomous agents, enabling more adaptive, context-aware interactions through dynamic interface emergence and negotiation.

Abstract: Contemporary software architectures struggle to support autonomous agents whose reasoning is adaptive, probabilistic, and context-dependent, while system integration remains dominated by static interfaces and deterministic contracts. This paper introduces Liquid Interfaces, a coordination paradigm in which interfaces are not persistent technical artifacts, but ephemeral relational events that emerge through intention articulation and semantic negotiation at runtime. We formalize this model and present the Liquid Interface Protocol (LIP), which governs intention-driven interaction and negotiated execution, and enforces ephemerality under semantic uncertainty. We further discuss the governance implications of this approach and describe a reference architecture that demonstrates practical feasibility. Liquid Interfaces provide a principled foundation for adaptive coordination in agent-based systems.

[383] Heterogeneous Computing: The Key to Powering the Future of AI Agent Inference

Yiren Zhao, Junyi Liu

Main category: cs.AI

TL;DR: The paper introduces Operational Intensity (OI) and Capacity Footprint (CF) metrics to analyze AI agent inference bottlenecks beyond compute, focusing on memory capacity, bandwidth, and interconnect limitations in datacenters.

DetailsMotivation: AI agent inference is creating inference-heavy datacenters with bottlenecks beyond compute - particularly memory capacity, memory bandwidth, and high-speed interconnect limitations that traditional roofline analysis misses.

Method: Introduces two new metrics: Operational Intensity (OI) and Capacity Footprint (CF) to analyze different regimes of AI agent inference. Analyzes various agentic workflows (chat, coding, web use, computer use) and model architectures (GQA/MLA, MoE, quantization) to understand how OI/CF shift, particularly with long context KV cache making decode memory-bound.
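
The two metrics can be approximated with back-of-the-envelope formulas. The sketch below is an assumed reading: OI as FLOPs per byte streamed during one decode step, and CF as the resident bytes for weights plus KV cache; the toy 70B-model numbers are illustrative.

```python
def decode_operational_intensity(n_params, kv_bytes_per_token, context_len,
                                 batch_size=1, bytes_per_param=2):
    """Rough OI for one decode step: ~2 FLOPs per parameter per generated token, divided
    by the bytes that must be streamed (all weights once per step plus the KV cache)."""
    flops = 2 * n_params * batch_size
    bytes_moved = n_params * bytes_per_param + batch_size * context_len * kv_bytes_per_token
    return flops / bytes_moved

def capacity_footprint_gb(n_params, kv_bytes_per_token, context_len,
                          batch_size=1, bytes_per_param=2):
    """Rough CF: resident memory needed for weights plus the per-request KV cache."""
    total = n_params * bytes_per_param + batch_size * context_len * kv_bytes_per_token
    return total / 1e9

# Toy numbers for a 70B-parameter model serving one 128k-token agent context.
print(decode_operational_intensity(70e9, kv_bytes_per_token=320_000, context_len=128_000))
print(capacity_footprint_gb(70e9, kv_bytes_per_token=320_000, context_len=128_000))
```

With these illustrative numbers the decode OI lands well below one FLOP per byte, which is the memory-bound regime the summary describes.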

Result: The analysis reveals that OI/CF metrics shift dramatically across different agentic workflows and model choices, with long context KV cache making decode operations highly memory-bound. This exposes limitations in current hardware architectures.

Conclusion: Proposes disaggregated serving and system-level heterogeneity as solutions: specialized prefill/decode accelerators, scale-up networking, optical I/O for compute-memory decoupling. Suggests agent-hardware co-design, multiple inference accelerators per system, and high-bandwidth large-capacity memory disaggregation as foundations for adapting to evolving OI/CF requirements.

Abstract: AI agent inference is driving an inference-heavy datacenter future and exposes bottlenecks beyond compute - especially memory capacity, memory bandwidth and high-speed interconnect. We introduce two metrics - Operational Intensity (OI) and Capacity Footprint (CF) - that jointly explain regimes the classic roofline analysis misses, including the memory capacity wall. Across agentic workflows (chat, coding, web use, computer use) and base model choices (GQA/MLA, MoE, quantization), OI/CF can shift dramatically, with long-context KV cache making decode highly memory-bound. These observations motivate disaggregated serving and system-level heterogeneity: specialized prefill and decode accelerators, broader scale-up networking, and decoupled compute-memory enabled by optical I/O. We further hypothesize agent-hardware co-design, multiple inference accelerators within one system, and high-bandwidth, large-capacity memory disaggregation as foundations for adaptation to evolving OI/CF. Together, these directions chart a path to sustain efficiency and capability for large-scale agentic AI inference.

[384] CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Johannes Kirmayr, Lukas Stappen, Elisabeth André

Main category: cs.AI

TL;DR: CAR-bench is a benchmark for evaluating LLM agents’ reliability in real-world scenarios like in-car assistants, focusing on consistency, uncertainty handling, and capability awareness through multi-turn dialogues with simulated users and 58 interconnected tools.

DetailsMotivation: Existing LLM agent benchmarks focus on task completion under ideal conditions but overlook reliability in real-world applications where users issue incomplete/ambiguous requests, creating uncertainty that agents must manage through dialogue, tool use, and policy adherence.

Method: Introduces CAR-bench with LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Includes Hallucination tasks (testing limit-awareness with missing tools/info) and Disambiguation tasks (requiring uncertainty resolution through clarification or info gathering).

Result: Baseline results show large gaps between occasional and consistent success. Even frontier reasoning LLMs achieve <50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information in Hallucination tasks.

Conclusion: Highlights the need for more reliable and self-aware LLM agents in real-world settings, demonstrating current limitations in handling uncertainty and maintaining consistency in multi-turn, tool-using scenarios.

Abstract: Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents in an in-car assistant domain. The environment features an LLM-simulated user, domain policies, and 58 interconnected tools spanning navigation, productivity, charging, and vehicle control. Beyond standard task completion, CAR-bench introduces Hallucination tasks that test agents' limit-awareness under missing tools or information, and Disambiguation tasks that require resolving uncertainty through clarification or internal information gathering. Baseline results reveal large gaps between occasional and consistent success on all task types. Even frontier reasoning LLMs achieve less than 50% consistent pass rate on Disambiguation tasks due to premature actions, and frequently violate policies or fabricate information to satisfy user requests in Hallucination tasks, underscoring the need for more reliable and self-aware LLM agents in real-world settings.

[385] Optimizing Agentic Workflows using Meta-tools

Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

Main category: cs.AI

TL;DR: AWO framework optimizes agentic AI workflows by identifying redundant tool execution patterns and replacing them with deterministic meta-tools to reduce LLM calls and improve success rates.

DetailsMotivation: Agentic AI workflows with LLMs often require many iterative reasoning steps and tool invocations, leading to high operational costs, latency, and failures due to hallucinations. There's a need to optimize these workflows for efficiency and robustness.

Method: Agent Workflow Optimization (AWO) analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools - deterministic, composite tools that bundle multiple agent actions into a single invocation, bypassing unnecessary intermediate LLM reasoning steps.
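
Both steps admit a simple sketch: mine frequent contiguous tool-call n-grams from traces, then wrap a frequent sequence as one deterministic composite call. The mining rule, thresholds, and registry format below are assumptions for illustration.

```python
from collections import Counter

def mine_recurring_sequences(traces, min_len=2, max_len=4, min_count=3):
    """Count contiguous tool-call n-grams across workflow traces and keep the frequent ones.
    `traces` is a list of tool-name sequences, e.g. [["search", "open", "extract"], ...]."""
    counts = Counter()
    for trace in traces:
        for n in range(min_len, max_len + 1):
            for i in range(len(trace) - n + 1):
                counts[tuple(trace[i:i + n])] += 1
    return [seq for seq, c in counts.items() if c >= min_count]

def make_meta_tool(tool_registry, sequence):
    """Bundle a frequent sequence into one deterministic call: each tool's output is fed
    to the next, with no intermediate LLM reasoning step in between."""
    def meta_tool(initial_input):
        result = initial_input
        for name in sequence:
            result = tool_registry[name](result)
        return result
    meta_tool.__name__ = "meta_" + "_".join(sequence)
    return meta_tool

# Toy registry and traces.
registry = {"search": lambda q: q + " -> results", "open": lambda r: r + " -> page",
            "extract": lambda p: p + " -> facts"}
patterns = mine_recurring_sequences([["search", "open", "extract"]] * 4)
meta = make_meta_tool(registry, max(patterns, key=len))
print(meta.__name__, "|", meta("query"))
```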

Result: Experiments on two agentic AI benchmarks show AWO reduces the number of LLM calls by up to 11.9% while increasing the task success rate by up to 4.2 percentage points.

Conclusion: AWO provides an effective framework for optimizing agentic workflows by identifying and replacing redundant tool execution patterns with meta-tools, improving both efficiency and robustness of LLM-based agent systems.

Abstract: Agentic AI enables LLMs to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls by up to 11.9% while also increasing the task success rate by up to 4.2 percentage points.

[386] Defining Operational Conditions for Safety-Critical AI-Based Systems from Data

Johann Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach

Main category: cs.AI

TL;DR: A novel Safety-by-Design method to define Operational Design Domains (ODDs) from collected data using multi-dimensional kernel-based representation for certifying AI-based safety-critical systems.

DetailsMotivation: Defining Operational Design Domains (ODDs) for AI-based safety-critical systems is challenging due to incomplete environmental descriptions, yet required for certification. Traditional ODD creation relies on expert knowledge early in development, but data-driven approaches are needed for existing data.

Method: Proposes a Safety-by-Design method to define ODDs a posteriori from previously collected data using multi-dimensional kernel-based representation. Validated through Monte Carlo methods and a real-world aviation use case for collision-avoidance systems.
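
As a loose illustration of the kernel-based idea, the sketch below fits a multi-dimensional kernel density estimate to previously collected operating conditions and treats the region above a density quantile as the data-driven ODD; the feature axes, the quantile threshold, and the `inside_odd` helper are assumptions made for this example, not the paper's formulation.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Collected operational data: each row is one observed condition
# (e.g., altitude, closing speed, visibility) -- purely illustrative axes.
rng = np.random.default_rng(0)
observations = rng.normal(loc=[1000.0, 150.0, 8.0],
                          scale=[200.0, 30.0, 2.0], size=(5000, 3))

# Multi-dimensional kernel density estimate over the collected conditions.
kde = gaussian_kde(observations.T)

# Treat the region whose density exceeds a low quantile of the training-point
# densities as the data-driven ODD.
densities = kde(observations.T)
threshold = np.quantile(densities, 0.05)

def inside_odd(condition):
    """Return True if a new operating condition falls inside the data-driven ODD."""
    return kde(np.asarray(condition, dtype=float).reshape(-1, 1))[0] >= threshold

print(inside_odd([1000.0, 150.0, 8.0]))   # typical condition -> likely True
print(inside_odd([5000.0, 400.0, 0.1]))   # far outside the data -> likely False
```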

Result: The method enables data-driven ODD definition that can equal the original hidden ODD of the data. Demonstrated through validation with Monte Carlo simulations and a practical aviation safety application.

Conclusion: The kernel-based ODD approach enables future certification of data-driven, safety-critical AI-based systems by providing a systematic way to define operational domains from collected data.

Abstract: Artificial Intelligence (AI) has been on the rise in many domains, including numerous safety-critical applications. However, for complex systems found in the real world, or when data already exist, defining the underlying environmental conditions is extremely challenging. This often results in an incomplete description of the environment in which the AI-based system must operate. Nevertheless, this description, called the Operational Design Domain (ODD), is required in many domains for the certification of AI-based systems. Traditionally, the ODD is created in the early stages of the development process, drawing on sophisticated expert knowledge and related standards. This paper presents a novel Safety-by-Design method to define the ODD a posteriori from previously collected data using a multi-dimensional kernel-based representation. This approach is validated through both Monte Carlo methods and a real-world aviation use case for a future safety-critical collision-avoidance system. Moreover, by defining under what conditions two ODDs are equal, the paper shows that the data-driven ODD can equal the original, underlying hidden ODD of the data. Utilizing the novel Safety-by-Design kernel-based ODD enables future certification of data-driven, safety-critical AI-based systems.

[387] Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

Grzegorz Stefanski, Alberto Presta, Michal Byra

Main category: cs.AI

TL;DR: RTL is an adaptive pruning framework that discovers multiple specialized subnetworks tailored to different data classes/clusters, outperforming single-model approaches while using fewer parameters.

DetailsMotivation: The Lottery Ticket Hypothesis assumes a single universal winning ticket, but real-world data is heterogeneous. Current approaches ignore this heterogeneity by assuming one subnetwork works for all inputs.

Method: RTL (Routing the Lottery) discovers multiple specialized subnetworks (adaptive tickets) tailored to different data classes, semantic clusters, or environmental conditions through adaptive pruning.

Result: RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, uses up to 10x fewer parameters than independent models, and exhibits semantically aligned subnetworks. Also identifies subnetwork collapse under aggressive pruning.

Conclusion: Pruning can be recast as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.

Abstract: In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantically aligned subnetworks. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.

[388] The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Irsyad Adam, Zekai Chen, David Laprade, Shaun Porwal, David Laub, Erik Reinertsen, Arda Pekis, Kevin Brown

Main category: cs.AI

TL;DR: SMB-Structure: A world model for structured EHR that combines joint-embedding prediction (JEPA) with next-token prediction (SFT) to simulate patient dynamics rather than just predicting tokens.

DetailsMotivation: Current LLMs treat patients as documents to summarize rather than dynamical systems to simulate. Patient trajectories emerge from state evolution under interventions and time, requiring models that simulate dynamics rather than just predict tokens.

Method: Introduces SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT reconstructs future patient states in token space, while JEPA predicts futures in latent space from initial patient representation alone.
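
A minimal sketch of how the two objectives might be combined into one training loss, assuming a PyTorch setup; the function name, the MSE form of the latent term, and the `jepa_weight` trade-off are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def smb_style_loss(token_logits, target_tokens,
                   predicted_latent, target_latent, jepa_weight=1.0):
    """Hypothetical combined objective: next-token prediction in token space
    (SFT) plus JEPA-style prediction of the future patient state in latent space."""
    # SFT term: reconstruct future patient events as tokens.
    sft_loss = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)),
                               target_tokens.reshape(-1))
    # JEPA term: predict the target encoder's future embedding (stop-gradient)
    # from the initial patient representation alone.
    jepa_loss = F.mse_loss(predicted_latent, target_latent.detach())
    return sft_loss + jepa_weight * jepa_loss
```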

Result: Validated across two large-scale cohorts (MSK oncology and INSPECT pulmonary embolism). Linear probe evaluation shows embeddings capture disease dynamics not recoverable by autoregressive baselines, achieving competitive performance on complex tasks with high patient heterogeneity.

Conclusion: The training paradigm learns embeddings that capture disease dynamics, enabling SMB-Structure to simulate patient trajectories rather than just predict tokens, with model weights publicly available.

Abstract: Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats the patient as a document to be summarized rather than a dynamical system to be simulated; a patient’s trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

[389] World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak

Main category: cs.AI

TL;DR: WoW is a realistic enterprise environment with hidden workflows and cascading effects, revealing LLMs’ dynamics blindness and need for grounded world modeling.

DetailsMotivation: Current LLM agents fail in complex enterprise systems with hidden workflows and cascading effects, and existing benchmarks don't capture these real enterprise challenges.

Method: Created World of Workflows (WoW) - a ServiceNow-based environment with 4,000+ business rules and 55 workflows, plus WoW-bench with 234 tasks to evaluate agentic task completion and dynamics modeling.

Result: Frontier LLMs suffer from dynamics blindness (failing to predict cascading side effects) and need grounded world modeling to bridge observability gaps in opaque systems.

Conclusion: Enterprise agents require explicit learning of system dynamics for reliability, motivating a new paradigm beyond current LLM capabilities.

Abstract: Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub repository for setting up and evaluating WoW.

[390] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning

Yi Tang, Kaini Wang, Yang Chen, Guangquan Zhou

Main category: cs.AI

TL;DR: EndoAgent is a memory-guided AI agent for endoscopic image diagnosis that integrates iterative reasoning with adaptive tool selection and collaboration using a dual-memory design.

DetailsMotivation: Existing AI methods for endoscopic diagnosis lack unified coordination across tasks and struggle with multi-step clinical workflows. While AI agents show promise in flexible instruction parsing and tool integration, their potential in endoscopy remains underexplored.

Method: Proposes EndoAgent with dual-memory design (short-term action tracking and long-term experiential learning) that integrates expert-designed tools within a unified reasoning loop. Also introduces EndoAgentBench benchmark with 5,709 visual question-answer pairs.

Result: EndoAgent consistently outperforms both general and medical multimodal models, demonstrating strong flexibility and reasoning capabilities in endoscopic analysis.

Conclusion: EndoAgent represents the first memory-guided agent for vision-to-decision endoscopic analysis, showing promise for developing general AI systems to support endoscopic image diagnosis.

Abstract: Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.

[391] SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu

Main category: cs.AI

TL;DR: SafeSearch is an automated red-teaming framework for evaluating safety vulnerabilities in LLM-based search agents, revealing high attack success rates up to 90.5% for GPT-4.1-mini across various risk categories.

DetailsMotivation: Search agents connect LLMs to the Internet, introducing security risks where unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and observations show these failures occur in practice, necessitating systematic safety evaluation.

Method: Proposes SafeSearch, an automated, scalable, cost-efficient, and lightweight red-teaming framework. Generates 300 test cases across five risk categories (e.g., misinformation and prompt injection) and evaluates three search agent scaffolds across 17 representative LLMs.

Result: Revealed substantial vulnerabilities in LLM-based search agents, with highest attack success rate reaching 90.5% for GPT-4.1-mini in search-workflow settings. Common defenses like reminder prompting offered limited protection.

Conclusion: SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents. The framework enables systematic safety evaluation, revealing significant vulnerabilities that current defenses inadequately address.

Abstract: Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces a new threat surface: unreliable search results can mislead agents into producing unsafe outputs. Real-world incidents and our two in-the-wild observations show that such failures can occur in practice. To study this threat systematically, we propose SafeSearch, an automated red-teaming framework that is scalable, cost-efficient, and lightweight, enabling harmless safety evaluation of search agents. Using this, we generate 300 test cases spanning five risk categories (e.g., misinformation and prompt injection) and evaluate three search agent scaffolds across 17 representative LLMs. Our results reveal substantial vulnerabilities in LLM-based search agents, with the highest ASR reaching 90.5% for GPT-4.1-mini in a search-workflow setting. Moreover, we find that common defenses, such as reminder prompting, offer limited protection. Overall, SafeSearch provides a practical way to measure and improve the safety of LLM-based search agents. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.

[392] Neural network embeddings recover value dimensions from psychometric survey items on par with human data

Max Pellert, Clemens M. Lechner, Indira Sen, Markus Strohmaier

Main category: cs.AI

TL;DR: SQuID method uses LLM embeddings to recover human value structures from psychometric questionnaires without domain-specific training, achieving 55% variance explained compared to human data.

DetailsMotivation: To develop a cost-effective, scalable alternative to traditional human survey methods for psychometric measurement using semantic embeddings from large language models.

Method: SQuID (Survey and Questionnaire Item Embeddings Differentials) processes LLM embeddings to recover psychometric structures, comparing multiple embedding models using internal consistency, dimension correlations, and multidimensional scaling configurations.
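
The sketch below shows one plausible way to recover dimension-level structure from item embeddings (per-dimension centroids, a dimension-dimension correlation matrix, and a 2-D MDS configuration); it is a hypothetical stand-in for SQuID, and the helper name and distance choice are assumptions rather than the paper's method.

```python
import numpy as np
from sklearn.manifold import MDS

def dimension_correlations(item_embeddings, item_to_dimension):
    """Hypothetical sketch: average item embeddings per value dimension, then
    correlate the dimension centroids and lay them out with metric MDS."""
    dims = sorted(set(item_to_dimension.values()))
    centroids = np.stack([
        np.mean([item_embeddings[i] for i, d in item_to_dimension.items() if d == dim],
                axis=0)
        for dim in dims
    ])
    corr = np.corrcoef(centroids)
    # 2-D configuration from correlation-derived dissimilarities, mirroring the
    # circular-structure analyses commonly used for human value data.
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(1.0 - corr)
    return dims, corr, coords
```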

Result: Explains 55% of variance in dimension-dimension similarities compared to human data, shows alignment with pooled human data from 49 countries, and generalizes across three personality inventories (IPIP, BFI-2, HEXACO).

Conclusion: Semantic embeddings can effectively replicate psychometric structures with cost, scalability, and flexibility advantages while maintaining comparable quality to traditional human survey methods.

Abstract: We demonstrate that embeddings derived from large language models, when processed with “Survey and Questionnaire Item Embeddings Differentials” (SQuID), can recover the structure of human values obtained from human rater judgments on the Revised Portrait Value Questionnaire (PVQ-RR). We compare multiple embedding models across a number of evaluation metrics including internal consistency, dimension correlations and multidimensional scaling configurations. Unlike previous approaches, SQuID addresses the challenge of obtaining negative correlations between dimensions without requiring domain-specific fine-tuning or training data re-annotation. Quantitative analysis reveals that our embedding-based approach explains 55% of variance in dimension-dimension similarities compared to human data. Multidimensional scaling configurations show alignment with pooled human data from 49 different countries. Generalizability tests across three personality inventories (IPIP, BFI-2, HEXACO) demonstrate that SQuID consistently increases correlation ranges, suggesting applicability beyond value theory. These results show that semantic embeddings can effectively replicate psychometric structures previously established through extensive human surveys. The approach offers substantial advantages in cost, scalability and flexibility while maintaining comparable quality to traditional methods. Our findings have significant implications for psychometrics and social science research, providing a complementary methodology that could expand the scope of human behavior and experience represented in measurement tools.

[393] A Formal Comparison Between Chain of Thought and Latent Thought

Kevin Xu, Issei Sato

Main category: cs.AI

TL;DR: Formal analysis comparing Chain of Thought (CoT) reasoning in discrete token space vs. latent thought reasoning in continuous latent space, showing CoT enables stochastic decoding for counting/sampling while latent thought allows more efficient parallel computation.

DetailsMotivation: To understand the comparative capabilities of two reasoning paradigms in large language models: explicit Chain of Thought (CoT) reasoning using discrete tokens vs. latent thought reasoning operating in continuous latent space, as their relative strengths and limitations remain underexplored.

Method: Formal analysis comparing the computational properties of both reasoning approaches, examining their parallel computation efficiency, sequential constraints, and capabilities for approximate counting and sampling through stochastic decoding.

Result: Latent thought reasoning admits more efficient parallel computation than inherently sequential CoT, while CoT enables approximate counting and sampling through stochastic decoding. These computational separations suggest different task suitability for each paradigm.

Conclusion: The analysis provides practical guidance for choosing between reasoning paradigms based on task requirements: depth-driven recursion is more suitable for certain tasks, helping practitioners select appropriate reasoning approaches for different applications.

Abstract: Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.

[394] Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports

Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang

Main category: cs.AI

TL;DR: Dr. Bench is a multidimensional evaluation framework for Deep Research Agents (DRAs) that addresses limitations in existing benchmarks by providing comprehensive assessment of long-form report-style responses across 10 domains with 214 expert-curated tasks.

DetailsMotivation: Existing benchmarks are deficient in evaluation dimensions, response format, and scoring mechanisms for assessing Deep Research Agents (DRAs), which have advanced capabilities in task decomposition, cross-source retrieval, multi-stage reasoning, information integration, and structured output for complex open-ended tasks.

Method: The paper introduces Dr. Bench, a multidimensional evaluation framework comprising 214 expert-curated challenging tasks across 10 broad domains, each with manually constructed reference bundles. The framework incorporates metrics for semantic quality, topical focus, and retrieval trustworthiness to comprehensively evaluate long reports generated by DRAs.

Result: Extensive experimentation confirms superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, but reveals considerable scope for further improvement in DRA capabilities.

Conclusion: Dr. Bench provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement of Deep Research Agents, addressing critical gaps in current evaluation methodologies for these advanced AI systems.

Abstract: As an embodiment of intelligence evolution toward interconnected architectures, Deep Research Agents (DRAs) systematically exhibit the capabilities in task decomposition, cross-source retrieval, multi-stage reasoning, information integration, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response format, and scoring mechanisms, limiting their effectiveness in assessing such agents. This paper introduces Dr. Bench, a multidimensional evaluation framework tailored to DRAs and long-form report-style responses. The benchmark comprises 214 expert-curated challenging tasks across 10 broad domains, each accompanied by manually constructed reference bundles to support composite evaluation. This framework incorporates metrics for semantic quality, topical focus, and retrieval trustworthiness, enabling a comprehensive evaluation of long reports generated by DRAs. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement of DRAs.

[395] CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning

Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, Pierre-Yves Oudeyer

Main category: cs.AI

TL;DR: CURIOUS algorithm enables autonomous agents to self-organize learning curriculum through intrinsic motivation and automated goal selection based on learning progress.

DetailsMotivation: In open-ended environments, agents need to autonomously set goals and build curricula through intrinsic motivation, handling diverse goals of varying difficulty and learnability.

Method: Combines modular Universal Value Function Approximator with hindsight learning for diverse goal achievement, plus automated curriculum learning that focuses on goals maximizing absolute learning progress.
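
The curriculum mechanism can be illustrated with a small sketch: sample the next goal module with probability proportional to its absolute learning progress, mixed with uniform exploration. The competence estimates and the epsilon-greedy mixing below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def select_goal_module(recent_success, older_success, eps=0.1):
    """Sample a goal module with probability proportional to its absolute
    learning progress |recent competence - older competence|, mixed with
    epsilon-uniform exploration so no module is ever fully abandoned."""
    lp = np.abs(np.asarray(recent_success) - np.asarray(older_success))
    if lp.sum() == 0:
        probs = np.full(len(lp), 1.0 / len(lp))
    else:
        probs = (1 - eps) * lp / lp.sum() + eps / len(lp)
    return np.random.choice(len(lp), p=probs)

# Example: module 1 is improving fastest, module 2 is being forgotten;
# both attract attention, while the mastered module 0 is mostly ignored.
print(select_goal_module(recent_success=[0.9, 0.5, 0.3],
                         older_success=[0.9, 0.2, 0.6]))
```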

Result: Agents successfully self-organize developmental curriculum, sequentially tackling goals of increasing complexity while revisiting forgotten goals, showing robustness to distractions, forgetting, and body changes.

Conclusion: CURIOUS enables effective autonomous learning in open-ended environments through intrinsic motivation and adaptive curriculum building.

Abstract: In open-ended environments, autonomous learning agents must set their own goals and build their own curriculum through an intrinsically motivated exploration. They may consider a large diversity of goals, aiming to discover what is controllable in their environments, and what is not. Because some goals might prove easy and some impossible, agents must actively select which goal to practice at any moment, to maximize their overall mastery on the set of learnable goals. This paper proposes CURIOUS, an algorithm that leverages 1) a modular Universal Value Function Approximator with hindsight learning to achieve a diversity of goals of different kinds within a unique policy and 2) an automated curriculum learning mechanism that biases the attention of the agent towards goals maximizing the absolute learning progress. Agents focus sequentially on goals of increasing complexity, and focus back on goals that are being forgotten. Experiments conducted in a new modular-goal robotic environment show the resulting developmental self-organization of a learning curriculum, and demonstrate properties of robustness to distracting goals, forgetting and changes in body properties.

[396] Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

Thomas Michel, Marko Cvjetko, Gautier Hamon, Pierre-Yves Oudeyer, Clément Moulin-Frier

Main category: cs.AI

TL;DR: Automated discovery of system-level dynamics in Flow-Lenia using curiosity-driven AI to uncover self-organized evolutionary and ecosystemic behaviors in continuous cellular automata.

DetailsMotivation: To develop methods for discovering emergent collective behaviors and ecosystemic dynamics in complex systems like Flow-Lenia, going beyond individual pattern discovery to understand system-level self-organization.

Method: Adapts Intrinsically Motivated Goal Exploration Processes (IMGEPs) to explore diverse Flow-Lenia environments using simulation-wide metrics like evolutionary activity, compression-based complexity, and multi-scale entropy. Combines automated discovery with interactive exploration tools for human-AI collaboration.
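
As one concrete example of a simulation-wide metric, the snippet below computes a simple compression-based complexity score for a grid state; the byte quantisation and the use of zlib are illustrative assumptions rather than the paper's exact metric.

```python
import zlib
import numpy as np

def compression_complexity(state):
    """Compression-based complexity of a simulation state, assuming values in
    [0, 1]: ratio of the zlib-compressed size to the raw discretised size."""
    quantised = np.clip(np.asarray(state) * 255, 0, 255).astype(np.uint8).tobytes()
    return len(zlib.compress(quantised)) / max(1, len(quantised))

rng = np.random.default_rng(0)
print(compression_complexity(rng.random((128, 128))))  # noisy state: ratio near 1
print(compression_complexity(np.zeros((128, 128))))    # uniform state: ratio near 0
```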

Result: The method discovers significantly more diverse dynamics compared to random search, enabling self-organization of complex collective behaviors not captured by previous individual pattern search approaches.

Conclusion: Provides a framework for automated discovery of emergent collective properties in parameterizable complex systems, demonstrated with Flow-Lenia but potentially applicable to other systems where understanding system-level dynamics is important.

Abstract: We present a method for the automated discovery of system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA) with mass conservation and parameter localization, using a curiosity-driven AI scientist. This method aims to uncover processes leading to self-organization of evolutionary and ecosystemic dynamics in CAs. We build on previous work which uses diversity search algorithms in Lenia to find self-organized individual patterns, and extend it to large environments that support distinct interacting patterns. We adapt Intrinsically Motivated Goal Exploration Processes (IMGEPs) to drive exploration of diverse Flow-Lenia environments using simulation-wide metrics, such as evolutionary activity, compression-based complexity, and multi-scale entropy. We test our method in two experiments, showcasing its ability to illuminate significantly more diverse dynamics compared to random search. We show qualitative results illustrating how ecosystemic simulations enable self-organization of complex collective behaviors not captured by previous individual pattern search and analysis. We complement automated discovery with an interactive exploration tool, creating an effective human-AI collaborative workflow for scientific investigation. Though demonstrated specifically with Flow-Lenia, this methodology provides a framework potentially applicable to other parameterizable complex systems where understanding emergent collective properties is of interest.

[397] PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

Main category: cs.AI

TL;DR: PathWise is a multi-agent LLM framework for automated heuristic design that uses sequential decision making over an entailment graph to improve heuristic generation for combinatorial optimization problems.

DetailsMotivation: Existing LLM-based automated heuristic design frameworks suffer from myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived due to reliance on fixed evolutionary rules and static prompt templates.

Method: Proposes PathWise, a multi-agent reasoning framework that formulates heuristic generation as a sequential decision process over an entailment graph serving as compact, stateful memory. Uses policy agent for planning evolutionary actions, world model agent for generating heuristic rollouts, and critic agents for providing routed reflections.

Result: Experiments across diverse combinatorial optimization problems show PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

Conclusion: PathWise shifts LLM-based automated heuristic design from trial-and-error evolution toward state-aware planning through reasoning, demonstrating improved performance and scalability.

Abstract: Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks’ reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.

[398] Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary

Hongru Wang, Cheng Qian, Manling Li, Jiahao Qiu, Boyang Xue, Mengdi Wang, Heng Ji, Amos Storkey, Kam-Fai Wong

Main category: cs.AI

TL;DR: A position paper proposing that AI agents should only use external tools when epistemically necessary - when tasks cannot be reliably completed via internal reasoning alone, introducing the Theory of Agent framework for principled tool delegation decisions.

DetailsMotivation: Current agent frameworks treat tools as ordinary actions optimized for task success, lacking principled distinction between necessary and unnecessary tool use. The paper aims to establish when external tool invocation is actually justified versus when agents should rely on internal reasoning.

Method: Introduces Theory of Agent (ToA) framework where agents make sequential decisions about whether to resolve uncertainty internally or delegate externally. This provides a normative criterion for tool use based on epistemic necessity rather than just task success.

Result: The framework explains common agent failure modes (overthinking, overacting) as miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. It establishes epistemic necessity as a criterion for justified tool use.

Conclusion: Agents should invoke tools only when epistemically necessary to avoid inefficiency and impediment of internal reasoning development. This normative criterion complements decision-theoretic models and is essential for building increasingly intelligent agents.

Abstract: As large language models evolve into tool-augmented agents, a central question remains unresolved: when is external tool use actually justified? Existing agent frameworks typically treat tools as ordinary actions and optimize for task success or reward, offering little principled distinction between epistemically necessary interaction and unnecessary delegation. This position paper argues that agents should invoke external tools only when epistemically necessary. Here, epistemic necessity means that a task cannot be completed reliably via the agent’s internal reasoning over its current context, without any external interaction. We introduce the Theory of Agent (ToA), a framework that treats agents as making sequential decisions about whether remaining uncertainty should be resolved internally or delegated externally. From this perspective, common agent failure modes (e.g., overthinking and overacting) arise from miscalibrated decisions under uncertainty rather than deficiencies in reasoning or tool execution alone. We further discuss implications for training, evaluation, and agent design, highlighting that unnecessary delegation not only causes inefficiency but can impede the development of internal reasoning capability. Our position provides a normative criterion for tool use that complements existing decision-theoretic models and is essential for building agents that are not only correct, but increasingly intelligent.

[399] WorldLLM: Improving LLMs’ world modeling using curiosity-driven theory-making

Guillaume Levy, Cedric Colas, Pierre-Yves Oudeyer, Thomas Carta, Clement Romac

Main category: cs.AI

TL;DR: WorldLLM framework enhances LLM-based world modeling by combining Bayesian inference, active exploration with RL, and LLM-guided hypothesis refinement for better predictions in structured environments.

DetailsMotivation: LLMs have general world knowledge but struggle with precise predictions in structured, domain-specific contexts like simulations due to inability to ground broad understanding in specific environments.

Method: WorldLLM combines Bayesian inference and autonomous active exploration with reinforcement learning. It uses LLMs’ in-context learning to guide world model predictions via natural language hypotheses, which are iteratively refined through Bayesian inference with a second LLM as proposal distribution. Evidence is collected using curiosity-driven RL policy that explores environments to find low-likelihood transitions.
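
The alternating structure can be sketched as a short loop, with hypothetical callables standing in for the proposal LLM, the LLM-based predictive model, and the curiosity-driven exploration policy; the greedy acceptance rule and the log-likelihood threshold are simplifications for illustration, not the paper's exact Bayesian procedure.

```python
def worldllm_loop(hypotheses, propose_hypotheses, score_log_likelihood,
                  collect_transitions, n_rounds=5, low_ll_threshold=-5.0):
    """Alternate between hypothesis refinement and curiosity-driven evidence
    collection. The three callables are hypothetical stand-ins for the proposal
    LLM, the LLM-based predictive model, and the RL exploration policy."""
    evidence = []
    for _ in range(n_rounds):
        # Accept an LLM-proposed hypothesis set if it explains the collected
        # evidence at least as well as the current one (greedy simplification
        # of the Bayesian acceptance step).
        candidate = propose_hypotheses(hypotheses, evidence)
        if score_log_likelihood(candidate, evidence) >= score_log_likelihood(hypotheses, evidence):
            hypotheses = candidate
        # Keep transitions the current hypotheses explain poorly: these
        # low-likelihood transitions are the most informative new evidence.
        for transition in collect_transitions(hypotheses):
            if score_log_likelihood(hypotheses, [transition]) < low_ll_threshold:
                evidence.append(transition)
    return hypotheses
```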

Result: Experiments in textual game environment requiring object manipulation and combination show enhanced predictive accuracy and generation of human-interpretable theories of environment dynamics.

Conclusion: WorldLLM framework effectively improves LLM-based world modeling by autonomously driving continual improvement of predictions through alternating hypothesis refinement and evidence collection.

Abstract: Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model’s predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.

[400] GenOM: Ontology Matching with Description Generation and Large Language Model

Yiping Song, Jiaoyan Chen, Renate A. Schmidt

Main category: cs.AI

TL;DR: GenOM is an LLM-based ontology matching framework that uses semantic enrichment through generated textual definitions, embedding-based candidate retrieval, and exact matching tools to improve alignment precision in biomedical ontologies.

DetailsMotivation: Ontology matching is crucial for semantic interoperability in heterogeneous knowledge sources, especially in biomedical domains with complex disease and pharmaceutical concepts. Traditional methods and recent LLM-based approaches need improvement for better precision and semantic understanding.

Method: GenOM uses LLMs to generate textual definitions for ontology concepts to enrich semantic representations, employs embedding models to retrieve alignment candidates, and incorporates exact matching-based tools to improve precision. It leverages few-shot prompting for better adaptation.
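
A minimal sketch of the candidate-retrieval and exact-matching steps, assuming an `embed` function that returns unit-normalised vectors for enriched concept descriptions (label plus generated definition); the helper names and the label normalisation are assumptions, not GenOM's implementation.

```python
import numpy as np

def retrieve_candidates(source_texts, target_texts, embed, top_k=5):
    """Embed enriched concept descriptions and retrieve the top-k most similar
    target concepts for each source concept by cosine similarity."""
    src = np.asarray(embed(source_texts))   # assumed unit-normalised vectors
    tgt = np.asarray(embed(target_texts))
    sims = src @ tgt.T
    top = np.argsort(-sims, axis=1)[:, :top_k]
    return {i: list(top[i]) for i in range(len(source_texts))}

def exact_match_filter(source_labels, target_labels, candidates):
    """Promote candidate pairs whose normalised labels match exactly,
    improving precision of the final alignment."""
    matches = {}
    for i, cands in candidates.items():
        for j in cands:
            if source_labels[i].strip().lower() == target_labels[j].strip().lower():
                matches[i] = j
    return matches
```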

Result: Experiments on OAEI Bio-ML track show GenOM achieves competitive performance, surpassing traditional OM systems and recent LLM-based methods. Ablation studies confirm the effectiveness of semantic enrichment and few-shot prompting.

Conclusion: GenOM demonstrates robustness and adaptability in biomedical ontology matching through semantic enrichment and LLM integration, offering improved alignment precision over existing approaches.

Abstract: Ontology matching (OM) plays an essential role in enabling semantic interoperability and integration across heterogeneous knowledge sources, particularly in the biomedical domain which contains numerous complex concepts related to diseases and pharmaceuticals. This paper introduces GenOM, a large language model (LLM)-based ontology alignment framework, which enriches the semantic representations of ontology concepts via generating textual definitions, retrieves alignment candidates with an embedding model, and incorporates exact matching-based tools to improve precision. Extensive experiments conducted on the OAEI Bio-ML track demonstrate that GenOM can often achieve competitive performance, surpassing many baselines including traditional OM systems and recent LLM-based methods. Further ablation studies confirm the effectiveness of semantic enrichment and few-shot prompting, highlighting the framework’s robustness and adaptability.

[401] Virtuous Machines: Towards Artificial General Science

Gabrielle Wehr, Reuben Rideaux, Amaya J. Fox, David R. Lightfoot, Jason Tangen, Jason B. Mattingley, Shane E. Ehrhardt

Main category: cs.AI

TL;DR: AI Scientist system autonomously conducts complete psychological research workflow including hypothesis generation, experimental design, data collection, analysis, and manuscript preparation for three visual cognition studies.

DetailsMotivation: Current AI systems are domain-specific and require human oversight, while scientific literature growth and specialization limit cross-disciplinary synthesis. Need for general-purpose AI systems that can autonomously navigate the complete scientific workflow.

Method: Domain-agnostic, agentic AI Scientist system that independently handles hypothesis generation, experimental design, online data collection (288 participants), analysis pipeline development through extended coding sessions, and manuscript preparation for psychological studies on visual working memory, mental rotation, and imagery vividness.

Result: System successfully conducted three psychological studies with theoretical reasoning and methodological rigor comparable to experienced researchers, demonstrating capability for non-trivial research, though with limitations in conceptual nuance and theoretical interpretation.

Conclusion: AI scientific discovery pipelines can autonomously explore scientific spaces that human constraints might leave unexplored, representing a step toward embodied AI that tests hypotheses through real-world experiments, while raising questions about scientific understanding and credit attribution.

Abstract: Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers’ capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general-purpose AI systems for science. Here we show that a domain-agnostic, agentic AI Scientist system can independently navigate the scientific workflow - from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, executed one new online data collection with 288 participants, developed analysis pipelines through 8-hour+ continuous coding sessions, and produced completed manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non-trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real-world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.

[402] What Would an LLM Do? Evaluating Large Language Models for Policymaking to Alleviate Homelessness

Pierre Le Coz, Jia An Liu, Debarun Bhattacharjya, Georgina Curto, Serge Stinckwich

Main category: cs.AI

TL;DR: LLMs show promise for homelessness policy recommendations but vary from expert opinions; automated pipeline connects policies to agent-based modeling for social impact assessment.

DetailsMotivation: To evaluate whether LLMs align with domain experts on homelessness policy recommendations, given LLMs' potential in social policymaking and the global scale of homelessness affecting over 150 million people.

Method: Developed a novel benchmark with decision scenarios across four cities using the Capability Approach framework. Created an automated pipeline connecting LLM-recommended policies to agent-based modeling in one location to compare social impact against expert recommendations.

Result: LLMs show variation in policy recommendations compared to local experts, but demonstrate potential benefits for policymaking insights when paired with responsible guardrails, contextual calibrations, and local domain expertise.

Conclusion: LLMs can provide valuable insights for homelessness alleviation policymaking but require careful integration with domain expertise and ethical frameworks; the work operationalizes the Capability Approach computationally and offers new perspectives on human dignity-focused policymaking.

Abstract: Large language models (LLMs) are increasingly being adopted in high-stakes domains. Their potential to encode evolving social contexts and to generate plausible scenarios position them as promising tools in social policymaking. This article evaluates whether LLMs are aligned with domain experts (and among themselves) on policy recommendations to alleviate homelessness - a challenge affecting over 150 million people worldwide. We develop a novel benchmark comprised of decision scenarios across four cities, with policy choices that are grounded in the conceptual framework of the Capability Approach for human development. We also present an automated pipeline that connects the policies to an agent-based model in one location, and compare the social impact of the policies recommended by LLMs to those recommended by experts. Our exploratory analysis reveals variation across LLMs in their policy recommendations compared to local experts, yet suggests potential benefits of the use of LLMs to provide insights for policymaking, if paired with responsible guardrails, contextual calibrations, and local domain expertise. Our work operationalizes the Capability Approach in a computational framework and provides new insights on homelessness alleviation policymaking with a focus on human dignity.

[403] Prompts to Proxies: Emulating Human Preferences via a Compact LLM Ensemble

Bingchen Wang, Zi-Yu Khoo, Jingtan Wang

Main category: cs.AI

TL;DR: P2P framework constructs synthetic agent ensembles from LLMs to approximate human population preferences without fine-tuning or demographic data

DetailsMotivation: LLMs are increasingly used as proxies for human subjects in social science research, but need to faithfully reflect target population preferences for external validity

Method: Two-stage modular system: Stage 1 uses structured prompting with entropy-based adaptive sampling to construct diverse agent pool; Stage 2 employs L1-regularized regression to select compact ensemble whose aggregate responses align with observed population data
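
Stage 2 can be illustrated with synthetic data: an L1-regularised (Lasso) regression selects a sparse, non-negative weighting over candidate agents so that their weighted aggregate responses match the observed population rates. All shapes, the alpha value, and the positivity constraint below are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical setup: agent_responses[k, q] is agent k's probability of
# endorsing option 1 on question q; population[q] is the observed rate.
rng = np.random.default_rng(0)
agent_responses = rng.uniform(size=(200, 40))    # 200 candidate proxy agents
true_weights = np.zeros(200)
true_weights[[3, 17, 88]] = [0.5, 0.3, 0.2]
population = true_weights @ agent_responses + rng.normal(0, 0.01, size=40)

# L1-regularised regression: the questions act as "samples" and the agents as
# "features", so the non-zero coefficients pick out a compact ensemble.
model = Lasso(alpha=1e-3, positive=True, fit_intercept=False)
model.fit(agent_responses.T, population)
selected = np.flatnonzero(model.coef_)
print("selected agents:", selected)
print("weights:", model.coef_[selected].round(3))
```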

Result: Validated on 14 waves of American Trends Panel with average test MSE of 0.014 across diverse topics at ~$0.80 per survey; also tested on World Values Survey showing cross-locale generalization; outperforms SFT-aligned baseline using <3% of training data

Conclusion: P2P provides cost-effective, privacy-preserving method for constructing LLM-based proxies that accurately reflect human population preferences without fine-tuning or sensitive data access

Abstract: Large language models are increasingly used as proxies for human subjects in social science research, yet external validity requires that synthetic agents faithfully reflect the preferences of target human populations. We introduce preference reconstruction theory, a framework that formalizes preference alignment as a representation learning problem: constructing a functional basis of proxy agents and recovering population preferences through weighted aggregation. We implement this via Prompts to Proxies (P2P), a modular two-stage system. Stage 1 uses structured prompting with entropy-based adaptive sampling to construct a diverse agent pool spanning the latent preference space. Stage 2 employs L1-regularized regression to select a compact ensemble whose aggregate response distributions align with observed data from the target population. P2P requires no finetuning and no access to sensitive demographic data, incurring only API inference costs. We validate the approach on 14 waves of the American Trends Panel, achieving an average test MSE of 0.014 across diverse topics at approximately 0.8 USD per survey. We additionally test it on the World Values Survey, demonstrating its potential to generalize across locales. When stress-tested against an SFT-aligned baseline, P2P achieves competitive performance using less than 3% of the training data.

[404] DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen

Main category: cs.AI

TL;DR: DeFacto is a counterfactual reasoning framework for multimodal language models that improves both answer accuracy and reasoning faithfulness by training with positive, counterfactual, and random-masking paradigms on automatically generated evidence-grounded data.

DetailsMotivation: Current multimodal language models often produce correct answers but rely on irrelevant or spurious image regions due to prior knowledge or dataset biases, indicating flawed reasoning and lack of true image understanding. This highlights the need for improved reasoning fidelity in multimodal tasks.

Method: Proposes DeFacto framework with three complementary training paradigms: (1) positive training, (2) counterfactual training, and (3) random-masking training. Develops a pipeline to automatically localize question-relevant evidence and construct positive, counterfactual, and random variants, creating a 100k image dataset. Trains models with GRPO-based reinforcement learning using three complementary rewards to guide accurate answering and evidence-grounded reasoning.
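
A hypothetical sketch of how the three image variants for one training sample could be constructed from a localized evidence box (positive, counterfactual, and random-masking); the zero-fill masking and box handling are assumptions for illustration, not the paper's data pipeline.

```python
import numpy as np

def build_defacto_variants(image, evidence_box, rng=None):
    """Return (positive, counterfactual, random_masked) variants of one image.
    `evidence_box` is the question-relevant region as (x0, y0, x1, y1)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    x0, y0, x1, y1 = evidence_box

    positive = image.copy()                    # evidence kept intact

    counterfactual = image.copy()              # question-relevant evidence removed
    counterfactual[y0:y1, x0:x1] = 0

    random_masked = image.copy()               # a same-sized region masked elsewhere
    rx = rng.integers(0, max(1, w - (x1 - x0)))
    ry = rng.integers(0, max(1, h - (y1 - y0)))
    random_masked[ry:ry + (y1 - y0), rx:rx + (x1 - x0)] = 0

    return positive, counterfactual, random_masked
```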

Result: Experiments on diverse benchmarks show DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning.

Conclusion: DeFacto successfully addresses the reasoning fidelity problem in multimodal language models by enforcing both accurate answering and faithful reasoning through counterfactual training paradigms, leading to more interpretable and reliable multimodal reasoning.

Abstract: Recent advances in multimodal language models (MLLMs) have achieved remarkable progress in vision-language reasoning, especially with the emergence of “thinking with images,” which integrates explicit visual steps into the reasoning process. While this paradigm strengthens image-based reasoning, a significant challenge remains: models may arrive at correct answers by relying on irrelevant or spurious regions, driven by prior knowledge or dataset biases. Even when the answer is correct, flawed reasoning indicates that the model has not truly understood the image, highlighting the critical importance of reasoning fidelity in multimodal tasks. To address this issue, we propose DeFacto, a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. A key component of our approach is the design of three complementary training paradigms: (i) positive, (ii) counterfactual, and (iii) random-masking. To enable these paradigms, we develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants, resulting in a dataset of about 100k images. Building on this framework, we train multimodal language models with GRPO-based reinforcement learning, where we design three complementary rewards to guide the model toward accurate answering and evidence-grounded reasoning. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness, establishing a stronger foundation for interpretable multimodal reasoning. The code is available on GitHub and the dataset is released on HuggingFace.

[405] Repairing Reward Functions with Feedback to Mitigate Reward Hacking

Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill

Main category: cs.AI

TL;DR: PBRR is an automated framework that repairs human-specified proxy reward functions by learning additive corrections from human preferences, requiring fewer preferences than learning from scratch.

DetailsMotivation: Human-designed reward functions are often misaligned with true objectives, causing reward hacking. Learning from human preferences is expensive. PBRR aims to efficiently repair existing proxy rewards rather than learning from scratch.

Method: Proposes Preference-Based Reward Repair (PBRR): an iterative framework that learns an additive, transition-dependent correction term to fix proxy reward functions. Uses targeted exploration and a new preference-learning objective to identify critical transitions needing correction.
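
A minimal sketch of the additive, transition-dependent correction and a Bradley-Terry-style preference loss over one trajectory pair, assuming a PyTorch setup; the network architecture and the exact loss form are illustrative assumptions rather than PBRR's objective.

```python
import torch
import torch.nn as nn

class RewardCorrection(nn.Module):
    """Hypothetical additive, transition-dependent correction:
    repaired reward = proxy reward + correction(s, a, s')."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a, s_next):
        return self.net(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)

def preference_loss(correction, proxy_return_a, proxy_return_b,
                    transitions_a, transitions_b, a_preferred=True):
    """Bradley-Terry loss on one trajectory pair: the repaired return of the
    human-preferred trajectory should exceed that of the other one."""
    def repaired_return(proxy_return, trans):
        s, a, s_next = trans                   # each of shape (T, dim)
        return proxy_return + correction(s, a, s_next).sum()
    r_a = repaired_return(proxy_return_a, transitions_a)
    r_b = repaired_return(proxy_return_b, transitions_b)
    margin = r_a - r_b if a_preferred else r_b - r_a
    return -nn.functional.logsigmoid(margin)
```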

Result: PBRR has cumulative regret matching prior preference-based RL methods in tabular domains. On reward-hacking benchmarks, it outperforms baselines learning from scratch or modifying proxy rewards, requiring substantially fewer preferences.

Conclusion: PBRR provides an efficient alternative to learning reward functions from scratch, effectively repairing misaligned proxy rewards with fewer human preferences while maintaining theoretical guarantees.

Abstract: Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the humans’ true, unobservable objectives, and thus act only as proxies. Optimizing for a misspecified proxy reward function often induces reward hacking, resulting in a policy misaligned with the human’s true objectives. An alternative is to perform RL from human feedback, which involves learning a reward function from scratch by collecting human preferences over pairs of trajectories. However, building such datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet corrections on only a few transitions may suffice to recover optimal performance. To identify and correct for those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove in tabular domains PBRR has a cumulative regret that matches, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function using other approaches, requiring substantially fewer preferences to learn high performing policies.

[406] Closing the Expression Gap in LLM Instructions via Socratic Questioning

Jianwen Sun, Yukang Feng, Yifan Chang, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yu Dai, Kaipeng Zhang

Main category: cs.AI

TL;DR: Nous: An AI agent trained to actively probe for information to resolve uncertainty about user intent in human-AI collaboration, using information gain as intrinsic reward without human preference annotations.

DetailsMotivation: Addresses the "intention expression gap" where humans struggle to convey complex thoughts to AI, leading to inefficient trial-and-error loops, especially problematic with diverse user expertise levels.

Method: Proposes a Socratic collaboration paradigm where agent actively probes for information. Uses information theory framework with information gain (reduction of Shannon entropy) as intrinsic reward signal. Trains agent (Nous) via automated simulation pipeline generating large-scale preference-based dataset for scientific diagram generation.
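
The intrinsic reward is easy to state concretely: given a belief distribution over candidate task specifications, the reward for a clarifying exchange is the resulting drop in Shannon entropy. The snippet below is a minimal illustration; representing the belief as a dictionary over discrete specifications is an assumption.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a distribution over candidate task specs."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def information_gain_reward(belief_before, belief_after):
    """Intrinsic reward for one clarifying question: the reduction of the
    agent's entropy over the structured task space after the user's answer."""
    return entropy(belief_before) - entropy(belief_after)

# Example: the answer rules out half of four equally likely specifications,
# yielding log(2) nats, i.e. one bit, of information gain.
before = {"spec_a": 0.25, "spec_b": 0.25, "spec_c": 0.25, "spec_d": 0.25}
after = {"spec_a": 0.5, "spec_b": 0.5}
print(information_gain_reward(before, after))  # ~0.693
```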

Result: Comprehensive experiments show Nous achieves leading efficiency and output quality, robust to varying user expertise levels. Avoids reliance on costly human preference annotations or external reward models.

Conclusion: Provides systematic methodology and new perspective for addressing ambiguous intentions in complex human-machine collaboration through active inquiry-based approach.

Abstract: A fundamental bottleneck in human-AI collaboration is the "intention expression gap," the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. We name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. In conclusion, our research provides a systematic methodology and a new perspective for addressing the issue of ambiguous intentions in complex human-machine collaboration.

[407] SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Zhi Zheng, Yu Gu, Wei Liu, Yee Whye Teh, Wee Sun Lee

Main category: cs.AI

TL;DR: SofT-GRPO: A novel policy optimization algorithm that enables reinforcement learning for soft-thinking LLM reasoning, outperforming discrete-token GRPO on code generation tasks.

DetailsMotivation: Soft-thinking LLM reasoning shows promise but is challenging to combine with RL due to difficulties in injecting stochasticity into soft tokens and updating soft-thinking policies. Previous attempts underperform discrete-token GRPO counterparts.

Method: SofT-GRPO injects Gumbel noise into logits, uses Gumbel-Softmax to keep soft tokens within pre-trained embedding space, and leverages reparameterization trick in policy gradient. Tested on LLMs from 1.5B to 7B parameters.
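
A sketch of how the Gumbel-Softmax step could look: Gumbel noise perturbs the logits, the resulting softmax weights remain differentiable with respect to the logits (the reparameterization trick), and the soft token is a convex combination of pre-trained embeddings, so it never leaves their span. The shapes and the temperature are assumptions for illustration.

```python
import numpy as np

def gumbel_softmax_soft_token(logits, embedding_matrix, temperature=1.0, rng=None):
    # Inject Gumbel noise into the vocabulary logits, take a tempered softmax, and
    # form the soft token as a convex combination of pre-trained token embeddings.
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + gumbel) / temperature
    z = np.exp(z - z.max())
    weights = z / z.sum()                 # differentiable in the logits (reparameterization)
    return weights @ embedding_matrix     # stays inside the pre-trained embedding span
```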

Result: SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% average accuracy) and show substantial improvement on Pass@32 (+2.19% average accuracy).

Conclusion: The proposed SofT-GRPO successfully unlocks the potential of soft-thinking reasoning by enabling effective RL-based policy optimization, demonstrating superior performance over discrete-token approaches.

Abstract: The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects the Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Codes and weights are available on https://github.com/zz1358m/SofT-GRPO-master

[408] Two Heads are Better than One: Distilling Large Language Model Features Into Small Models with Feature Decomposition and Mixture

Tianhao Fu, Xinxin Xu, Weichen Xu, Jue Chen, Ruilong Ren, Bowen Deng, Xinyu Zhao, Jian Cao, Xixin Cao

Main category: cs.AI

TL;DR: A novel framework called Cooperative Market Making (CMM) that uses knowledge distillation to transfer LLM capabilities to smaller models for faster market making in financial trading, addressing the slow inference speed of direct LLM applications.

DetailsMotivation: While LLMs show promise in financial applications like market making, their slow inference speed hinders practical deployment. Current research lacks specialized LLM distillation methods for this specific task, creating a need for efficient knowledge transfer approaches.

Method: Proposes CMM framework that decouples LLM features across three dimensions (layer, task, data) and uses multiple student models to collaboratively learn different feature aspects. Introduces Hájek-MoE to integrate student model outputs by analyzing contributions in a kernel function-generated common feature space.

Result: Extensive experiments on four real-world market datasets show CMM outperforms current distillation methods and RL-based market-making strategies, demonstrating superior performance.

Conclusion: CMM provides an effective knowledge distillation framework for market making that addresses LLM inference speed limitations while maintaining performance, offering a practical solution for deploying LLM capabilities in financial trading.

Abstract: Market making (MM) through Reinforcement Learning (RL) has attracted significant attention in financial trading. With the development of Large Language Models (LLMs), more and more attempts are being made to apply LLMs to financial areas. A simple, direct application of an LLM as an agent shows strong performance, but such methods are hindered by slow inference speed, and most current research has not studied LLM distillation for this specific task. To address this, we first propose the normalized fluorescent probe to study the mechanism of the LLM’s features. Based on the observations from this investigation, we propose Cooperative Market Making (CMM), a novel framework that decouples LLM features across three orthogonal dimensions: layer, task, and data. Various student models collaboratively learn simple LLM features along different dimensions, with each model responsible for a distinct feature to achieve knowledge distillation. Furthermore, CMM introduces a Hájek-MoE to integrate the output of the student models by investigating the contribution of different models in a kernel function-generated common feature space. Extensive experimental results on four real-world market datasets demonstrate the superiority of CMM over current distillation methods and RL-based market-making strategies.

[409] Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning

Zhihao Lin

Main category: cs.AI

TL;DR: GAC is a novel action generation method for continuous control RL that uses spherical distributions to respect action space geometry while simplifying computation compared to traditional Gaussian or vMF approaches.

DetailsMotivation: Gaussian policies in continuous control RL have a fundamental mismatch with bounded action spaces, requiring ad-hoc squashing functions that distort geometry. While vMF distributions offer theoretical benefits on spheres, their computational complexity hinders practical adoption.

Method: GAC decomposes action generation into a direction vector and learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. This reduces parameters from 2d to d+1 and avoids O(dk) vMF rejection sampling complexity with simple O(d) operations.
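
One plausible reading of the direction/concentration decomposition (the paper's exact interpolation rule may differ): a d-dimensional direction plus a single concentration value in [0, 1] blend a deterministic action with uniform spherical noise using only O(d) operations.

```python
import numpy as np

def gac_action(direction, concentration, rng=None):
    # direction: learned d-dimensional vector; concentration in [0, 1]:
    # 1.0 -> deterministic action, 0.0 -> uniform noise direction on the sphere.
    rng = rng or np.random.default_rng()
    mu = direction / np.linalg.norm(direction)
    noise = rng.normal(size=mu.shape)
    noise /= np.linalg.norm(noise)                       # uniform direction on the unit sphere
    blended = concentration * mu + (1.0 - concentration) * noise
    return blended / np.linalg.norm(blended)             # project back onto the sphere, O(d)
```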

Result: GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6% improvement over SAC on Ant-v4 and up to 112% on complex DMControl tasks. Ablation studies show both spherical normalization and adaptive concentration control are essential.

Conclusion: Robust and efficient continuous control doesn’t require complex distributions but principled respect for action space geometry. GAC demonstrates that spherical distributions with simplified computation can outperform traditional approaches.

Abstract: Gaussian policies have dominated continuous control in deep reinforcement learning (RL), yet they suffer from a fundamental mismatch: their unbounded support requires ad-hoc squashing functions that distort the geometry of bounded action spaces. While von Mises-Fisher (vMF) distributions offer a theoretically grounded alternative on the sphere, their reliance on Bessel functions and rejection sampling hinders practical adoption. We propose Geometric Action Control (GAC), a novel action generation paradigm that preserves the geometric benefits of spherical distributions while simplifying computation. GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. This design reduces parameter count from 2d to d+1, and avoids the O(dk) complexity of vMF rejection sampling, achieving simple O(d) operations. Empirically, GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6% improvement over SAC on Ant-v4 and up to 112% on complex DMControl tasks, demonstrating strong performance across diverse benchmarks. Our ablation studies reveal that both spherical normalization and adaptive concentration control are essential to GAC’s success. These findings suggest that robust and efficient continuous control does not require complex distributions, but a principled respect for the geometry of action spaces.

[410] Memento 2: Learning by Stateful Reflective Memory

Jun Wang

Main category: cs.AI

TL;DR: Theoretical study of continual learning in LLM agents using episodic memory and reflection, formalized as Stateful Reflective Decision Process with convergence guarantees.

DetailsMotivation: To understand how large language model agents can achieve continual adaptation without parameter updates through reflection and episodic memory, bridging the gap between training and testing phases.

Method: Introduces Stateful Reflective Decision Process as a formal model with episodic memory operations (writing for policy evaluation, reading for policy improvement). Develops read-write reflective learning framework integrating retrieval into soft policy iteration.
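
A toy sketch of the two memory operations, assuming embeddings and outcome records as the stored format: writing appends an experience (the policy-evaluation role) and reading retrieves the most similar past cases (the policy-improvement role).

```python
import numpy as np

class ReflectiveMemory:
    def __init__(self):
        self.keys, self.records = [], []

    def write(self, embedding, outcome):
        # Store an interaction outcome keyed by its embedding (policy evaluation role).
        self.keys.append(np.asarray(embedding, dtype=float))
        self.records.append(outcome)

    def read(self, query, k=3):
        # Retrieve the k most similar past cases to inform the next decision
        # (policy improvement role).
        if not self.keys:
            return []
        q = np.asarray(query, dtype=float)
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-8))
                for key in self.keys]
        top = np.argsort(sims)[::-1][:k]
        return [self.records[i] for i in top]
```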

Result: Establishes convergence guarantees showing that as memory grows and covers state space more densely, the composite policy converges to optimal solution. Empirical findings suggest reflection enables generalized adaptation across open-ended tasks.

Conclusion: Provides rigorous mathematical basis for building reflective, memory-embedded agents capable of continual general-purpose learning, connecting practical memory-based methods with principled reinforcement learning.

Abstract: We present a theoretical study of continual and experiential learning in large language model agents that combine episodic memory with reinforcement learning. We argue that the key mechanism for continual adaptation, without updating model parameters, is reflection: the agent’s ability to use past experience to guide future actions. Empirical findings suggest that episodic, experience-driven reflection enables generalised adaptation across a wide range of open-ended, long-horizon tasks. This indicates that efficient learning can occur during deployment and weakens the traditional separation between training and testing. Motivated by this, we introduce the Stateful Reflective Decision Process, a formal model of reflective memory dynamics. In this abstraction, an agent maintains an episodic memory and performs two core operations. Writing stores interaction outcomes and plays the role of policy evaluation. Reading retrieves relevant past cases to inform decisions and plays the role of policy improvement. This perspective treats reflective memory as a control object that can be analysed using classical reinforcement learning tools. We then develop a read-write reflective learning framework by integrating retrieval into soft policy iteration and establish convergence guarantees. We show that as memory grows and provides denser coverage of the state space, the resulting composite policy converges to the optimal solution. Overall, this framework connects practical memory-based methods with principled reinforcement learning, providing a rigorous mathematical basis for building reflective, memory-embedded agents capable of continual general-purpose learning.

[411] Bridging Synthetic and Real Routing Problems via LLM-Guided Instance Generation and Progressive Adaptation

Jianghan Zhu, Yaoxin Wu, Zhuoyi Lin, Zhengyuan Zhang, Haiyan Yin, Zhiguang Cao, Senthilnath Jayavelu, Xiaoli Li

Main category: cs.AI

TL;DR: EvoReal uses evolutionary algorithms guided by LLMs to generate realistic synthetic routing instances, improving neural combinatorial optimization solvers’ generalization from synthetic to real-world problems.

DetailsMotivation: Neural combinatorial optimization methods struggle to generalize from synthetic, uniformly-distributed training data to real-world VRP scenarios like TSPLib and CVRPLib benchmarks due to distribution mismatch.

Method: Evolutionary Realistic Instance Synthesis (EvoReal) uses an evolutionary module guided by large language models to generate synthetic instances with structural patterns that statistically mimic real-world instances. Pre-trained NCO models are then refined through progressive adaptation: first aligning with enriched synthetic distributions, then fine-tuning on actual benchmark instances.

Result: EvoReal significantly improves generalization of state-of-the-art neural solvers, reducing performance gap to optimal solutions on TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across various problem scales.

Conclusion: The proposed EvoReal framework effectively bridges the generalization gap in neural combinatorial optimization by generating realistic synthetic instances and enabling progressive model adaptation, demonstrating strong performance on real-world routing benchmarks.

Abstract: Recent advances in Neural Combinatorial Optimization (NCO) methods have significantly improved the capability of neural solvers to handle synthetic routing instances. Nonetheless, existing neural solvers typically struggle to generalize effectively from synthetic, uniformly-distributed training data to real-world VRP scenarios, including widely recognized benchmark instances from TSPLib and CVRPLib. To bridge this generalization gap, we present Evolutionary Realistic Instance Synthesis (EvoReal), which leverages an evolutionary module guided by large language models (LLMs) to generate synthetic instances characterized by diverse and realistic structural patterns. Specifically, the evolutionary module produces synthetic instances whose structural attributes statistically mimic those observed in authentic real-world instances. Subsequently, pre-trained NCO models are progressively refined, firstly aligning them with these structurally enriched synthetic distributions and then further adapting them through direct fine-tuning on actual benchmark instances. Extensive experimental evaluations demonstrate that EvoReal markedly improves the generalization capabilities of state-of-the-art neural solvers, yielding a notably reduced performance gap compared to the optimal solutions on the TSPLib (1.05%) and CVRPLib (2.71%) benchmarks across a broad spectrum of problem scales.

[412] Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

Xiaoming He, Gaofeng Wang, Huajun Cui, Rui Yuan, Haitao Zhao

Main category: cs.AI

TL;DR: Proposes an intent-driven framework for AAV-assisted IoT networks using Hyperdimensional Transformer for intent prediction and Double Actions MAPPO for decision-making to optimize network performance.

DetailsMotivation: AAV-assisted IoT networks need reliable intent prediction and low-latency action execution, but existing approaches struggle with high-dimensional action sequences and intensive on-board computation.

Method: Two-module framework: 1) Hyperdimensional Transformer (HDT) for intent prediction using hyperdimensional vector encoding and symbolic computations, 2) Double Actions MAPPO (DA-MAPPO) for decision-making with two independently parameterized networks for user-intent and trajectory planning.
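
Hyperdimensional encoding typically rests on two primitives, binding and bundling, over high-dimensional bipolar vectors; the sketch below shows these standard operations as an illustration, not the paper's exact encoder.

```python
import numpy as np

def random_hv(dim, rng):
    # A random bipolar hypervector (+1/-1 entries) representing an atomic symbol.
    return rng.choice([-1, 1], size=dim)

def bind(a, b):
    # Binding (element-wise product) associates two hypervectors, e.g. role and filler.
    return a * b

def bundle(vectors):
    # Bundling (majority sign of the sum) superposes a set of hypervectors.
    return np.sign(np.sum(vectors, axis=0))

rng = np.random.default_rng(0)
user, intent = random_hv(10_000, rng), random_hv(10_000, rng)
record = bundle([bind(user, intent)])   # encode one (user, intent) association
```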

Result: HDT and DA-MAPPO achieve superior performance across diverse scenarios on a real IoT action dataset with authentic wireless data.

Conclusion: The proposed framework effectively addresses intent prediction and decision-making challenges in AAV-assisted IoT networks through novel hyperdimensional encoding and multi-agent reinforcement learning approaches.

Abstract: Autonomous Aerial Vehicle (AAV)-assisted Internet of Things (IoT) represents a collaborative architecture in which AAVs allocate resources over 6G links to jointly enhance user-intent interpretation and overall network performance. Owing to this mutual dependence, improvements in intent inference and policy decisions on one component reinforce the efficiency of others, making highly reliable intent prediction and low-latency action execution essential. Although numerous approaches can model intent relationships, they encounter severe obstacles when scaling to high-dimensional action sequences and managing intensive on-board computation. We propose an Intent-Driven Framework for Autonomous Network Optimization comprising prediction and decision modules. First, implicit intent modeling is adopted to mitigate inaccuracies arising from ambiguous user expressions. For prediction, we introduce Hyperdimensional Transformer (HDT), which embeds data into a Hyperdimensional space via Hyperdimensional vector encoding and replaces standard matrix and attention operations with symbolic Hyperdimensional computations. For decision-making, where AAVs must respond to user intent while planning trajectories, we design Double Actions based Multi-Agent Proximal Policy Optimization (DA-MAPPO). Building upon MAPPO, it samples actions through two independently parameterized networks and cascades the user-intent network into the trajectory network to maintain action dependencies. We evaluate our framework on a real IoT action dataset with authentic wireless data. Experimental results demonstrate that HDT and DA-MAPPO achieve superior performance across diverse scenarios.

[413] State-Augmented Graphs for Circular Economy Triage

Richard Fox, Rui Li, Gustav Jonsson, Farzaneh Goli, Miying Yang, Emel Aktas, Yongjing Wang

Main category: cs.AI

TL;DR: A decision-making framework for circular economy triage using state-augmented disassembly sequencing planning with Markov property enforcement for optimal recursive evaluation of product end-of-life pathways.

DetailsMotivation: Circular economy triage requires adaptive decision-making to balance retained value against processing costs and constraints, but existing approaches lack unified formalisms for optimal decision-making across diverse products and operational contexts.

Method: Novel deterministic solver over state-augmented Disassembly Sequencing Planning (DSP) graph that encodes disassembly history into state to enforce Markov property, enabling optimal recursive evaluation where each decision depends only on previous state.
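
Because the disassembly history is folded into the state, the solver can be a plain memoized recursion: at each state, either commit the current assembly to a circular-economy option or take one more disassembly step. The callables below (feasible moves, CE utility, step cost) are hypothetical placeholders for the sketch.

```python
from functools import lru_cache

def best_triage_value(start, moves, ce_value, step_cost):
    # start: hashable state encoding the disassembly history (Markov property).
    # moves(s): feasible next states; ce_value(s): best CE-option utility at s;
    # step_cost(s, s2): cost of the disassembly step s -> s2.
    @lru_cache(maxsize=None)
    def V(s):
        best = ce_value(s)                      # stop here and commit to a CE pathway
        for s2 in moves(s):                     # or continue disassembling
            best = max(best, V(s2) - step_cost(s, s2))
        return best
    return V(start)
```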

Result: Demonstrated framework flexibility with hierarchical triage of electric vehicle batteries, showing how unified formalism accommodates varying mechanical complexity, safety requirements, and economic drivers through recursive component valuation.

Conclusion: The unified formalism provides tractable and generalizable foundation for optimizing circular economy triage decisions across diverse products and operational contexts.

Abstract: Circular economy (CE) triage is the assessment of products to determine which sustainable pathway they can follow once they reach the end of their usefulness as they are currently being used. Effective CE triage requires adaptive decisions that balance retained value against the costs and constraints of processing and labour. This paper presents a novel decision-making framework as a simple deterministic solver over a state-augmented Disassembly Sequencing Planning (DSP) graph. By encoding the disassembly history into the state, our framework enforces the Markov property, enabling optimal, recursive evaluation by ensuring each decision only depends on the previous state. The triage decision involves choices between continuing disassembly or committing to a CE option. The model integrates condition-aware utility based on diagnostic health scores and complex operational constraints. We demonstrate the framework’s flexibility with a worked example: the hierarchical triage of electric vehicle (EV) batteries, where decisions are driven by the recursive valuation of components. The example illustrates how a unified formalism enables the accommodation of varying mechanical complexity, safety requirements, and economic drivers. This unified formalism therefore provides a tractable and generalisable foundation for optimising CE triage decisions across diverse products and operational contexts.

[414] TimeSeries2Report prompting enables adaptive large language model management of lithium-ion batteries

Jiayang Yang, Martin Guay, Zhixing Cao, Chunhui Zhao

Main category: cs.AI

TL;DR: TS2R framework converts battery time-series data into semantic reports for LLM reasoning, improving BESS management tasks without model retraining.

DetailsMotivation: While LLMs show promise for time-series interpretation, their application to real-world battery energy storage system (BESS) operation remains unexplored, creating a gap between raw sensor data and actionable insights.

Method: TS2R uses segmentation, semantic abstraction, and rule-based interpretation to encode short-term temporal dynamics into natural language reports, bridging low-level sensor signals with high-level contextual insights for LLM processing.
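
A toy version of the segmentation plus semantic abstraction step: split a battery signal into windows, reduce each window to a trend word, and emit one report sentence per window for the LLM prompt. Window length, threshold, and phrasing are assumptions; the paper's rule-based interpretation is richer.

```python
import numpy as np

def timeseries_to_report(values, seg_len=60, flat_thresh=0.01):
    # Segment the signal, abstract each segment into a trend word, and emit
    # a short natural-language report line per segment.
    values = np.asarray(values, dtype=float)
    lines = []
    for start in range(0, len(values) - seg_len + 1, seg_len):
        seg = values[start:start + seg_len]
        slope = np.polyfit(np.arange(seg_len), seg, 1)[0]
        trend = "stable" if abs(slope) < flat_thresh else ("rising" if slope > 0 else "falling")
        lines.append(f"Samples {start}-{start + seg_len - 1}: signal is {trend}, "
                     f"mean {seg.mean():.3f}, range {np.ptp(seg):.3f}.")
    return "\n".join(lines)
```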

Result: TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics for anomaly detection, state-of-charge prediction, and charging/discharging management, achieving expert-level decision quality without retraining.

Conclusion: TS2R establishes a practical path for adaptive, LLM-driven battery intelligence by enabling semantic translation of time-series data into structured reports that enhance LLM reasoning capabilities for BESS management.

Abstract: Large language models (LLMs) offer promising capabilities for interpreting multivariate time-series data, yet their application to real-world battery energy storage system (BESS) operation and maintenance remains largely unexplored. Here, we present TimeSeries2Report (TS2R), a semantic translation framework that converts raw lithium-ion battery operational time-series into structured, semantically enriched reports, enabling LLMs to reason, predict, and make decisions in BESS management scenarios. TS2R encodes short-term temporal dynamics into natural language through a combination of segmentation, semantic abstraction, and rule-based interpretation, effectively bridging low-level sensor signals with high-level contextual insights. We benchmark TS2R across both lab-scale and real-world datasets, evaluating report quality and downstream task performance in anomaly detection, state-of-charge prediction, and charging/discharging management. Compared with vision-, embedding-, and text-based prompting baselines, report-based prompting via TS2R consistently improves LLM performance across accuracy, robustness, and explainability metrics. Notably, TS2R-integrated LLMs achieve expert-level decision quality and predictive consistency without retraining or architecture modification, establishing a practical path for adaptive, LLM-driven battery intelligence.

[415] LLM Personas as a Substitute for Field Experiments in Method Benchmarking

Enoch Hyunwook Kang

Main category: cs.AI

TL;DR: The paper provides theoretical conditions under which LLM-based persona simulation can validly replace human participants in A/B testing for societal systems, showing equivalence when methods observe only aggregate outcomes and evaluation is method-blind.

DetailsMotivation: Field experiments (A/B tests) are expensive and slow, hindering rapid methodological progress in societal systems. LLM-based persona simulation offers a cheaper alternative, but it's unclear whether this preserves the benchmark interface that methods optimize against.

Method: Theoretical analysis proving an if-and-only-if characterization: when methods observe only aggregate outcomes (aggregate-only observation) and evaluation depends only on submitted artifacts (method-blind evaluation), swapping humans for personas is equivalent to changing the evaluation population. Also defines information-theoretic discriminability of the induced aggregate channel and provides sample-size bounds for reliable method comparison.
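
The paper frames decision-relevance as a sample-size question. As a point of reference only, a standard Hoeffding-style argument (not necessarily the paper's exact bound) says that distinguishing two methods whose mean aggregate outcomes in [0, 1] differ by at least Delta, with confidence 1 - delta, needs roughly the following number of independent persona evaluations per method.

```latex
% Illustrative Hoeffding-style sample-size bound (not the paper's exact result):
n \;\gtrsim\; \frac{2\ln(2/\delta)}{\Delta^{2}}
```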

Result: Proves that persona simulation is valid under specific conditions (aggregate-only observation and method-blind evaluation). Provides explicit bounds on the number of independent persona evaluations needed to distinguish meaningfully different methods at chosen resolution.

Conclusion: LLM-based persona simulation can serve as a valid alternative to field experiments for benchmarking methods in societal systems under specific conditions, with reliability fundamentally determined by sample size rather than simulation validity.

Abstract: Field experiments (A/B tests) are often the most credible benchmark for methods (algorithms) in societal systems, but their cost and latency bottleneck rapid methodological progress. LLM-based persona simulation offers a cheap synthetic alternative, yet it is unclear whether replacing humans with personas preserves the benchmark interface that adaptive methods optimize against. We prove an if-and-only-if characterization: when (i) methods observe only the aggregate outcome (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the method’s identity or provenance (method-blind evaluation), swapping humans for personas is just a panel change from the method’s point of view, indistinguishable from changing the evaluation population (e.g., New York to Jakarta). Furthermore, we move from validity to usefulness: we define an information-theoretic discriminability of the induced aggregate channel and show that making persona benchmarking as decision-relevant as a field experiment is fundamentally a sample-size question, yielding explicit bounds on the number of independent persona evaluations required to reliably distinguish meaningfully different methods at a chosen resolution.

[416] SimpleMem: Efficient Lifelong Memory for LLM Agents

Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Main category: cs.AI

TL;DR: SimpleMem is an efficient memory framework for LLM agents that uses semantic lossless compression to manage historical experiences, reducing redundancy and token costs while maintaining performance.

DetailsMotivation: LLM agents need memory systems for long-term interaction in complex environments, but existing approaches either retain full histories (causing redundancy) or use iterative reasoning (incurring high token costs).

Method: Three-stage pipeline: (1) Semantic Structured Compression distills interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis integrates related context into unified abstract representations; (3) Intent-Aware Retrieval Planning infers search intent to dynamically determine retrieval scope.

Result: Outperforms baselines in accuracy, retrieval efficiency, and inference cost, achieving 26.4% average F1 improvement in LoCoMo while reducing inference-time token consumption by up to 30-fold.

Conclusion: SimpleMem demonstrates superior balance between performance and efficiency for LLM agent memory systems through semantic lossless compression.

Abstract: To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis, an intra-session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent-Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% in LoCoMo while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.

[417] Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Yuguang Yue, Irakli Salia, Samuel Hunt, Chris Green, Wenzhe Shi, Jonathan J Hunt

Main category: cs.AI

TL;DR: Open-source behavior cloning foundation model for real-time video game playing, with scaling law analysis showing improved causal reasoning with larger models and more data

DetailsMotivation: To create an open-source foundation model for video game playing that can run in real-time on consumer GPUs, and to investigate scaling laws of behavior cloning with focus on causal reasoning

Method: Developed an open recipe for training behavior cloning models using 8300+ hours of human gameplay data, trained models up to 1.2B parameters, and analyzed scaling laws in both controlled toy settings and at scale

Result: Best model achieves performance competitive with human players across various 3D games; scaling analysis shows that increasing training data and network depth leads to models learning more causal policies

Conclusion: Behavior cloning scales effectively for video game playing, with larger models and more data improving causal reasoning capabilities, enabling competitive real-time gameplay on consumer hardware

Abstract: Behavior cloning has seen a resurgence as scaling model and data sizes demonstrate strong performance. In this work, we introduce an open recipe for training a video game playing foundation model designed for inference in realtime on a consumer GPU. We release all data (8300+ hours of high quality human gameplay), training and inference code, and pretrained checkpoints under an open license. Empirically, we show that our best model achieves performance competitive with human players across a variety of 3D games. We use this recipe to investigate the scaling laws of behavior cloning, with a focus on causal reasoning. In a controlled toy setting, we first demonstrate that increasing training data and network depth leads to the model learning a more causal policy. We then validate these findings at scale, analyzing models up to 1.2 billion parameters. We observe that the causal improvements seen in the toy domain hold true as model size and training steps increase.

[418] Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge

Runhao Zhao, Weixin Zeng, Wentao Zhang, Chong Chen, Zhengpin Li, Xiang Zhao, Lei Chen

Main category: cs.AI

TL;DR: ExeFuse: A neuro-symbolic framework for domain-specific knowledge graph fusion that mines relevant facts from general knowledge graphs to enhance domain-specific KG completeness.

DetailsMotivation: Domain-specific knowledge graphs (DKGs) suffer from limited coverage compared to general knowledge graphs (GKGs). Existing approaches for enriching DKGs are limited in scope and quality, with little systematic exploration of how comprehensive GKGs can effectively supplement DKGs.

Method: ExeFuse uses a neuro-symbolic framework based on a novel Fact-as-Program paradigm. It treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity between coarse-grained GKG facts and fine-grained DKG requirements.

Result: The authors construct two new datasets to establish the first standardized evaluation suite for domain-specific knowledge graph fusion. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.

Conclusion: ExeFuse addresses the critical gap in leveraging comprehensive GKGs to supplement DKGs, overcoming challenges of domain relevance ambiguity and cross-domain knowledge granularity misalignment through its neuro-symbolic approach.

Abstract: Domain-specific knowledge graphs (DKGs) are critical yet often suffer from limited coverage compared to General Knowledge Graphs (GKGs). Existing tasks to enrich DKGs rely primarily on extracting knowledge from external unstructured data or completing KGs through internal reasoning, but the scope and quality of such integration remain limited. This highlights a critical gap: little systematic exploration has been conducted on how comprehensive, high-quality GKGs can be effectively leveraged to supplement DKGs. To address this gap, we propose a new and practical task: domain-specific knowledge graph fusion (DKGF), which aims to mine and integrate relevant facts from general knowledge graphs into domain-specific knowledge graphs to enhance their completeness and utility. Unlike previous research, this new task faces two key challenges: (1) high ambiguity of domain relevance, i.e., difficulty in determining whether knowledge from a GKG is truly relevant to the target domain, and (2) cross-domain knowledge granularity misalignment, i.e., GKG facts are typically abstract and coarse-grained, whereas DKGs frequently require more contextualized, fine-grained representations aligned with particular domain scenarios. To address these, we present ExeFuse, a neuro-symbolic framework based on a novel Fact-as-Program paradigm. ExeFuse treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity. We construct two new datasets to establish the first standardized evaluation suite for this task. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.

[419] Health Facility Location in Ethiopia: Leveraging LLMs to Integrate Expert Knowledge into Algorithmic Planning

Yohai Trabelsi, Guojun Xiong, Fentabil Getnet, Stéphane Verguet, Milind Tambe

Main category: cs.AI

TL;DR: A hybrid framework combining optimization algorithms with LLMs to prioritize health facility upgrades in Ethiopia, integrating quantitative coverage optimization with qualitative expert guidance expressed in natural language.

DetailsMotivation: Ethiopia needs to prioritize limited resources for upgrading health posts to maximize population coverage, but expert preferences are often expressed in qualitative natural language rather than formal quantitative objectives, creating a gap between optimization methods and real-world decision-making.

Method: Developed the LEG (Large language model and Extended Greedy) framework that combines: 1) a provable approximation algorithm for population coverage optimization, and 2) LLM-driven iterative refinement with human-AI alignment to incorporate expert qualitative guidance while preserving mathematical guarantees.
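
The coverage-maximization step can be sketched as the classical greedy algorithm for maximum coverage, which carries the usual (1 - 1/e) approximation guarantee; the LLM-driven refinement and alignment loop is omitted here, and the population-weighted formulation is an assumption for the sketch.

```python
def greedy_upgrades(covers, populations, budget):
    # covers: facility -> set of settlement ids covered if that facility is upgraded.
    # populations: settlement id -> population. Greedily pick up to 'budget' facilities.
    chosen, covered = [], set()
    remaining = set(covers)
    for _ in range(budget):
        if not remaining:
            break
        gains = {f: sum(populations[s] for s in covers[f] - covered) for f in remaining}
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break
        chosen.append(best)
        covered |= covers[best]
        remaining.discard(best)
    return chosen, covered
```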

Result: Experiments on real-world data from three Ethiopian regions demonstrate the framework’s effectiveness in producing solutions that balance quantitative coverage optimization with qualitative expert preferences, showing potential for equitable, data-driven health system planning.

Conclusion: The LEG framework successfully bridges the gap between formal optimization methods and real-world qualitative decision-making by integrating LLMs with classical algorithms, offering a practical approach for resource-constrained health system planning that respects both mathematical guarantees and human expertise.

Abstract: Ethiopia’s Ministry of Health is upgrading health posts to improve access to essential services, particularly in rural areas. Limited resources, however, require careful prioritization of which facilities to upgrade to maximize population coverage while accounting for diverse expert and stakeholder preferences. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we propose a hybrid framework that systematically integrates expert knowledge with optimization techniques. Classical optimization methods provide theoretical guarantees but require explicit, quantitative objectives, whereas stakeholder criteria are often articulated in natural language and difficult to formalize. To bridge these domains, we develop the Large language model and Extended Greedy (LEG) framework. Our framework combines a provable approximation algorithm for population coverage optimization with LLM-driven iterative refinement that incorporates human-AI alignment to ensure solutions reflect expert qualitative guidance while preserving coverage guarantees. Experiments on real-world data from three Ethiopian regions demonstrate the framework’s effectiveness and its potential to inform equitable, data-driven health system planning.

[420] Actionable Interpretability Must Be Defined in Terms of Symmetries

Pietro Barbiero, Mateo Espinosa Zarlenga, Francesco Giannini, Alberto Termine, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra

Main category: cs.AI

TL;DR: The paper argues that current AI interpretability research is ill-posed and proposes a symmetry-based framework to formally define and test interpretability through four specific symmetries.

DetailsMotivation: Current definitions of interpretability in AI lack formal testability and design principles, making interpretability research fundamentally ill-posed. The authors seek to establish actionable definitions that can be formally tested and guide model design.

Method: Proposes a symmetry-based framework using four specific symmetries: inference equivariance, information invariance, concept-closure invariance, and structural invariance. These symmetries formalize interpretable models as a subclass of probabilistic models and provide testable conditions for interpretability.

Result: The symmetry framework enables formalization of interpretable models, unified formulation of interpretable inference as Bayesian inversion, and provides a formal framework for verifying compliance with safety standards and regulations.

Conclusion: Interpretability research needs mathematically rigorous foundations, and symmetry-based definitions provide the necessary formal framework for designing, testing, and verifying interpretable AI systems.

Abstract: This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of symmetries that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.

[421] Graph Neural Networks are Heuristics

Yimeng Min, Carla P. Gomes

Main category: cs.AI

TL;DR: A single graph neural network training trajectory can create an unsupervised heuristic for combinatorial optimization problems like TSP, generating solutions directly without search or supervision.

DetailsMotivation: To demonstrate that graph neural networks can internalize global combinatorial structure and function as learned heuristics without requiring supervised training or explicit search algorithms.

Method: Use graph neural networks with global structural constraints as inductive bias, enabling non-autoregressive solution generation via direct forward passes. At inference, employ dropout and snapshot ensembling to create implicit ensembles for increased solution diversity.

Result: The approach successfully generates solutions for the Travelling Salesman Problem without search, supervision, or sequential decision-making, reducing optimality gaps through solution diversity from implicit ensembles.

Conclusion: Graph neural networks can internalize global combinatorial structure to function as strong learned heuristics, reframing learning’s role in combinatorial optimization from augmenting classical algorithms to directly instantiating new heuristics.

Abstract: We demonstrate that a single training trajectory can transform a graph neural network into an unsupervised heuristic for combinatorial optimization. Focusing on the Travelling Salesman Problem, we show that encoding global structural constraints as an inductive bias enables a non-autoregressive model to generate solutions via direct forward passes, without search, supervision, or sequential decision-making. At inference time, dropout and snapshot ensembling allow a single model to act as an implicit ensemble, reducing optimality gaps through increased solution diversity. Our results establish that graph neural networks do not require supervised training nor explicit search to be effective. Instead, they can internalize global combinatorial structure and function as strong, learned heuristics. This reframes the role of learning in combinatorial optimization: from augmenting classical algorithms to directly instantiating new heuristics.

[422] Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

Saurabh Jha, Rohan Arora, Bhavya, Noah Zheutlin, Paulina Toro Isaza, Laura Shwartz, Yu Deng, Daby Sow, Ruchi Mahindru, Ruchir Puri

Main category: cs.AI

TL;DR: EoG framework improves LLM agent reliability for open-ended investigations by separating reasoning from control and using dependency graphs for abductive reasoning.

DetailsMotivation: Current LLM agents struggle with open-ended investigations involving massive, heterogeneous data where evidence has hidden dependencies. ReAct-style agents are brittle due to context window limitations, exploration order sensitivity, and entanglement of reasoning with controller duties.

Method: Proposes EoG (Explanations over Graphs) framework that formulates investigation as abductive reasoning over dependency graphs. LLM performs bounded local evidence mining and labeling while a deterministic controller manages traversal, state, and belief propagation to compute minimal explanatory frontier.
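
As a simplified picture of the frontier computation (the paper's belief propagation is more involved): given a dependency graph and LLM-assigned cause/symptom labels, keep only the cause nodes that are not themselves downstream of another cause, yielding a minimal explanatory frontier.

```python
def explanatory_frontier(parents, labels):
    # parents: node -> list of upstream nodes; labels: node -> "cause" or "symptom".
    # Illustrative minimal-frontier rule, not the paper's exact propagation scheme.
    causes = {n for n, lab in labels.items() if lab == "cause"}

    def has_cause_ancestor(node, seen):
        for p in parents.get(node, []):
            if p in seen:
                continue
            seen.add(p)
            if p in causes or has_cause_ancestor(p, seen):
                return True
        return False

    return {n for n in causes if not has_cause_ancestor(n, set())}
```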

Result: On ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including 7x average gain in Majority-at-k entity F1.

Conclusion: Disaggregating reasoning from control and using structured dependency graphs addresses reliability issues in LLM agents for complex investigations, providing more deterministic and accurate results.

Abstract: LLM agents excel when environments are mostly static and the needed information fits in a model’s context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.

[423] PROTEUS: SLA-Aware Routing via Lagrangian RL for Multi-LLM Serving Systems

Amit Singh Bhatti, Vishal Vaddina, Dagnachew Birru

Main category: cs.AI

TL;DR: PROTEUS is a polymorphic LLM router that accepts accuracy targets as runtime input and uses Lagrangian dual control to translate specified accuracy values into routing decisions that satisfy them, enabling single-model operation across the full accuracy spectrum without retraining.

DetailsMotivation: Current LLM routers force operators to tune parameters offline and guess resulting accuracy, with indirect, non-monotonic, dataset-dependent relationships between parameters and outcomes. Operators need to specify accuracy targets directly rather than inferring them from opaque settings.

Method: PROTEUS uses Lagrangian dual control where a learned dual variable lambda tracks constraint violations during training and conditions the policy network, allowing the router to translate specified accuracy targets (tau) into routing decisions that satisfy them.
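
The core of Lagrangian dual control can be illustrated with a one-line projected update: the dual variable lambda grows while the accuracy floor tau is violated and decays once it is satisfied, and the policy (conditioned on lambda and tau) trades cost against accuracy accordingly. The step size below is an assumption; how lambda conditions the policy network is specific to the paper.

```python
def dual_update(lmbda, tau, observed_accuracy, step=0.05):
    # Projected gradient ascent on the dual variable of the constraint accuracy >= tau:
    # a violation pushes lambda up, a satisfied constraint lets it decay toward 0.
    violation = tau - observed_accuracy
    return max(0.0, lmbda + step * violation)

lam = 0.0
for acc in [0.80, 0.83, 0.88, 0.93]:   # accuracies observed over training batches
    lam = dual_update(lam, tau=0.90, observed_accuracy=acc)
```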

Result: PROTEUS achieves consistent floor compliance (accuracy meets/exceeds tau) with target-response correlation of 0.97-0.98, operates across tau in [0.85, 0.95] from a single model, achieves 90.1% accuracy on RouterBench (within 1.3% of oracle) and 94.0% on SPROUT (within 4.6% of oracle), with cost savings up to 89.8% versus best fixed model.

Conclusion: PROTEUS enables operators to specify accuracy targets directly as runtime input, providing predictable performance across diverse workloads while maintaining high accuracy and significant cost savings compared to existing routing approaches.

Abstract: Production LLM deployments serve diverse workloads where cost and quality requirements vary by customer tier, time of day, and query criticality. Model serving systems accept latency SLOs directly. LLM routers do not. They force operators to tune parameters offline and guess what accuracy might result. The relationship between parameters and outcomes is indirect, non-monotonic, and dataset-dependent. Operators need to specify accuracy targets, not infer them from opaque settings. We present PROTEUS (Polymorphic Router for Operational Target Enforcement with Unified SLA), a router that accepts accuracy targets tau as runtime input. PROTEUS uses Lagrangian dual control. A learned dual variable lambda tracks constraint violations during training and conditions the policy network. This lets the router translate specified tau values into routing decisions that satisfy them. A single trained model serves the full accuracy spectrum without retraining. We evaluate on RouterBench (11 models, 405K queries) and SPROUT (14 models, 45K queries). PROTEUS achieves consistent floor compliance where accuracy meets or exceeds tau. The target-response correlation reaches 0.97 to 0.98. The closest baseline, OmniRouter, meets floors only 22% of the time despite also using Lagrangian optimization. PROTEUS operates across tau in [0.85, 0.95] from a single model. On RouterBench it achieves 90.1% accuracy, within 1.3% of oracle. On SPROUT it achieves 94.0% accuracy, within 4.6% of oracle. Cost savings reach 89.8% versus the best fixed model.

[424] GAVEL: Towards rule-based safety through activation monitoring

Shir Rozenfeld, Rahul Pankajakshan, Itay Zloczower, Eyal Lenga, Gilad Gressel, Yisroel Mirsky

Main category: cs.AI

TL;DR: Rule-based activation safety framework for LLMs using interpretable cognitive elements and predicate rules to detect harmful behaviors with higher precision and flexibility than existing approaches.

DetailsMotivation: Existing activation safety approaches for LLMs have poor precision, limited flexibility, and lack interpretability. They struggle with domain-specific behaviors and require retraining for updates, making them impractical for real-world deployment.

Method: Proposes modeling activations as interpretable cognitive elements (CEs) - fine-grained factors like ‘making a threat’ and ‘payment processing’. Builds a framework with predicate rules over CEs that can detect violations in real time without retraining models or detectors.
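
A rule over cognitive elements can be as simple as a conjunction of CE activations exceeding a threshold; the rule format, threshold, and CE names below are hypothetical and serve only to illustrate the predicate idea.

```python
def rule_fires(ce_scores, rule, threshold=0.5):
    # ce_scores: cognitive element name -> activation-derived score for this request.
    # The rule fires when every required element is active and every excluded one is not.
    required_on = all(ce_scores.get(ce, 0.0) >= threshold for ce in rule.get("all_of", []))
    excluded_off = all(ce_scores.get(ce, 0.0) < threshold for ce in rule.get("none_of", []))
    return required_on and excluded_off

# Hypothetical rule: flag outputs that combine a threat with payment processing.
extortion_rule = {"all_of": ["making a threat", "payment processing"], "none_of": []}
print(rule_fires({"making a threat": 0.9, "payment processing": 0.7}, extortion_rule))
```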

Result: Compositional rule-based activation safety improves precision, supports domain customization, and enables scalable, interpretable, and auditable AI governance. The GAVEL framework will be released as open-source with an automated rule creation tool.

Conclusion: Rule-based activation safety offers a practical, flexible, and interpretable approach to LLM safety monitoring that addresses limitations of existing methods and supports real-world deployment with transparency and auditability.

Abstract: Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, struggle with poor precision, limited flexibility, and lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs), fine-grained, interpretable factors such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.

[425] Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

Amirmohammad Farzaneh, Salvatore D’Oro, Osvaldo Simeone

Main category: cs.AI

TL;DR: A framework for counterfactual reasoning in LLM-powered agents with formal reliability guarantees using structural causal models and conformal prediction.

DetailsMotivation: Users interacting with LLM-powered agents may wonder how different phrasing of their intents would affect outcomes, requiring reliable counterfactual reasoning methods.

Method: Models user-agent-environment interaction as a structural causal model, uses test-time scaling for probabilistic abduction, and applies conformal counterfactual generation with offline calibration for reliability guarantees.
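
The calibration step can be sketched with standard split conformal prediction: a nonconformity score is computed on held-out calibration examples, its adjusted quantile becomes the threshold, and at test time every candidate counterfactual outcome scoring below the threshold is kept, which is what yields the coverage guarantee. The score function is an assumption here; the paper's scores are tied to its abduction procedure.

```python
import numpy as np

def conformal_threshold(calibration_scores, alpha=0.1):
    # Split-conformal quantile: with n calibration scores, use the
    # ceil((n + 1) * (1 - alpha)) / n quantile so coverage is at least 1 - alpha.
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, level))

def counterfactual_set(candidates, score_fn, threshold):
    # Keep every candidate counterfactual whose nonconformity score is small enough.
    return [c for c in candidates if score_fn(c) <= threshold]
```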

Result: Demonstrated on wireless network control, showing significant advantages over naive re-execution baselines with formal reliability guarantees.

Conclusion: Provides a principled framework for counterfactual reasoning in LLM-driven control with formal reliability guarantees, enabling users to explore “what-if” scenarios.

Abstract: Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.

[426] Normative Equivalence in Human-AI Cooperation: Behaviour, Not Identity, Drives Cooperation in Mixed-Agent Groups

Nico Mutzner, Taha Yasseri, Heiko Rauhut

Main category: cs.AI

TL;DR: AI agents in human groups show normative equivalence - cooperative norms extend to AI agents similarly to humans in group decision-making settings.

DetailsMotivation: To understand how AI agents influence cooperative social norms in small groups, moving beyond dyadic human-AI interactions to examine group dynamics.

Method: Online experiment using repeated four-player Public Goods Game with 236 participants. Groups had 3 humans + 1 bot (framed as human or AI) following three strategies: unconditional cooperation, conditional cooperation, or free-riding.

Result: Cooperation driven by reciprocal group dynamics and behavioral inertia, operating identically across human/AI conditions. No significant differences in cooperation levels, norm persistence, or normative perceptions between human and AI labels.

Conclusion: Cooperative norms are flexible enough to extend to AI agents, supporting normative equivalence where mechanisms sustaining cooperation function similarly in mixed human-AI and all-human groups.

Abstract: The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner’s Dilemma, or in participants’ normative perceptions. Participants’ behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.

cs.SD

[427] Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim

Main category: cs.SD

TL;DR: SEPT is a prompt tuning framework for audio-language models that addresses base-new tradeoff by semantically expanding prompt embeddings using LLM-generated neighbors and a novel semantic expansion loss.

DetailsMotivation: Prompt tuning in audio-language models suffers from Base-New Tradeoff due to disrupted semantic structure in embedding space, limiting generalization ability.

Method: Proposes Semantically Expanded Prompt Tuning (SEPT) that uses LLM-generated semantic neighbors to regularize prompt embedding space with a novel semantic expansion loss promoting intra-class compactness and inter-class separability.

Result: SEPT consistently improves generalization performance across multiple prompt tuning baselines for audio-language models, enhancing both base-to-new generalization and cross-dataset transferability.

Conclusion: SEPT effectively addresses the Base-New Tradeoff in ALMs by preserving semantic structure in prompt embedding space, establishing a benchmark for prompt generalization in audio-language models.

Abstract: Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and is recently being adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT)-a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, while maintaining computational cost during inference. Codes are available in https://github.com/jhyukjang/SEPT.
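
The semantic expansion loss is described as a margin-constrained objective promoting intra-class compactness and inter-class separability. The sketch below illustrates one plausible hinge-style form of such an objective in numpy, pulling a class's prompt-side embedding toward its LLM-generated semantic neighbors and pushing it away from neighbors of other classes. The function names, margin values, and cosine-similarity choice are assumptions for illustration, not SEPT's published loss.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def semantic_expansion_loss(class_emb, same_class_neighbors, other_class_neighbors,
                            pull_margin=0.8, push_margin=0.3):
    """Hinge-style objective: pull the class's prompt-side embedding toward its
    LLM-generated semantic neighbors (intra-class compactness) and push it away
    from neighbors of other classes (inter-class separability)."""
    pull = np.mean([max(0.0, pull_margin - cosine(class_emb, n))
                    for n in same_class_neighbors])
    push = np.mean([max(0.0, cosine(class_emb, n) - push_margin)
                    for n in other_class_neighbors])
    return pull + push

# Toy usage with random vectors standing in for text-encoder embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=64)
same = rng.normal(scale=0.1, size=(5, 64)) + emb   # neighbors near the class
other = rng.normal(size=(5, 64))                   # neighbors of unrelated classes
print(semantic_expansion_loss(emb, same, other))
```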

[428] VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings

Bharath Krishnamurthy, Ajita Rattani

Main category: cs.SD

TL;DR: VoxMorph is a zero-shot voice morphing framework that generates high-fidelity voice morphs from minimal audio input, achieving state-of-the-art performance in audio quality and attack success on speaker verification systems.

DetailsMotivation: Voice morphing vulnerability in biometric security is largely unexplored compared to face recognition. Existing methods are computationally expensive, non-scalable, limited to similar voices, and not transferable from other audio domains.

Method: Disentangles vocal traits into prosody and timbre embeddings, uses Spherical Linear Interpolation (Slerp) for fine-grained interpolation, and synthesizes with autoregressive language model coupled with Conditional Flow Matching network.

Result: Achieves 2.6x gain in audio quality, 73% reduction in intelligibility errors, and 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds.

Conclusion: Establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security, enabling zero-shot morphing from minimal audio input.

Abstract: Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/
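
The abstract states that prosody and timbre embeddings are fused via Spherical Linear Interpolation (Slerp). Below is a minimal numpy sketch of the standard Slerp formula applied to two embedding vectors; the embedding dimensionality and the morphing weight t=0.5 are placeholders, not values from the paper.

```python
import numpy as np

def slerp(e0, e1, t, eps=1e-7):
    """Spherical linear interpolation between two embedding vectors.
    t=0 returns e0, t=1 returns e1; intermediate t traces the great circle."""
    e0n = e0 / np.linalg.norm(e0)
    e1n = e1 / np.linalg.norm(e1)
    dot = np.clip(e0n @ e1n, -1.0, 1.0)
    omega = np.arccos(dot)                      # angle between the two vectors
    if omega < eps:                             # nearly parallel: fall back to lerp
        return (1.0 - t) * e0 + t * e1
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * e0 + (np.sin(t * omega) / so) * e1

# Toy usage: a 50/50 morph of two hypothetical timbre embeddings.
rng = np.random.default_rng(0)
spk_a, spk_b = rng.normal(size=256), rng.normal(size=256)
morph = slerp(spk_a, spk_b, t=0.5)
print(morph.shape)
```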

[429] SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition

Manali Sharma, Riya Naik, Buvaneshwari G

Main category: cs.SD

TL;DR: A modular framework for robust single-word ASR combining denoising, hybrid ASR frontend (Whisper+Vosk), and verification layer with multiple matching strategies to handle noisy, low-resource scenarios.

DetailsMotivation: Single-word ASR is challenging due to lack of linguistic context, sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains like healthcare and emergency response.

Method: Proposes a modular framework with denoising/normalization, hybrid ASR frontend (Whisper + Vosk), and verification layer supporting multiple matching strategies: embedding similarity, edit distance, and LLM-based matching with optional contextual guidance.

Result: Evaluation on Google Speech Commands dataset and real-world telephony/messaging datasets shows hybrid ASR frontend performs well on clean audio, but verification layer significantly improves accuracy on noisy/compressed channels. Context-guided and LLM-based matching yield largest gains.

Conclusion: Lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing latency required for real-time telephony applications.

Abstract: Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing latency required for real-time telephony applications.
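
One of the verification strategies listed above is edit-distance matching. The sketch below shows a self-contained version of that idea: a classic Levenshtein distance plus a verifier that snaps an ASR hypothesis to the closest in-vocabulary word when it falls within a distance threshold. The threshold and vocabulary are hypothetical; the paper's verification layer also supports embedding-similarity and LLM-based matching.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def verify_word(hypothesis: str, vocabulary, max_dist: int = 2):
    """Return the closest in-vocabulary word, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda w: levenshtein(hypothesis.lower(), w.lower()))
    return best if levenshtein(hypothesis.lower(), best.lower()) <= max_dist else None

print(verify_word("stob", ["stop", "go", "left", "right"]))  # -> "stop"
```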

[430] A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models

Ryan Whetten, Titouan Parcollet, Marco Dinarelli, Yannick Estève

Main category: cs.SD

TL;DR: Prioritizing long utterances in speech SSL pre-training yields better ASR performance than optimizing for diversity, achieving superior results with half the data and 24% faster training.

DetailsMotivation: To understand how data distribution affects speech SSL models, challenging the assumption that scale and diversity are primary drivers of robustness, and seeking more efficient pre-training strategies.

Method: Systematically examined curated subsets of pre-training data for ASR, comparing strategies based on acoustic, speaker, and linguistic diversity versus random sampling and length-based selection.

Result: Optimizing for diversity yields no clear improvement over random sampling, but prioritizing the longest utterances achieves superior ASR results using only half the original dataset, with 24% faster pre-training.

Conclusion: For speech SSL pre-training, data length is more critical than diversity or overall quantity for both performance and efficiency, offering new data selection strategies.

Abstract: Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpora. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective for data selection strategies in SSL speech processing.
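
The winning strategy in this study is simply to keep the longest utterances. A minimal sketch of that selection step is shown below; the dict fields and the 50% fraction mirror the "half the original dataset" finding but are otherwise illustrative assumptions, not the authors' data pipeline.

```python
def select_longest(utterances, fraction=0.5):
    """Keep the longest `fraction` of utterances by duration (seconds).
    Each utterance is assumed to be a dict with 'id' and 'duration' keys."""
    ranked = sorted(utterances, key=lambda u: u["duration"], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

# Toy usage on a four-utterance corpus.
corpus = [{"id": "utt1", "duration": 3.2},
          {"id": "utt2", "duration": 14.8},
          {"id": "utt3", "duration": 7.5},
          {"id": "utt4", "duration": 1.1}]
print([u["id"] for u in select_longest(corpus)])  # -> ['utt2', 'utt3']
```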

[431] Text-only adaptation in LLM-based ASR through text denoising

Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke

Main category: cs.SD

TL;DR: Text-only adaptation method for LLM-based ASR systems that preserves speech-text alignment by treating audio projection as text denoising

DetailsMotivation: Adapting ASR systems based on LLMs to new domains using only text data is challenging because standard fine-tuning disrupts the speech-text alignment learned by the projector, degrading performance

Method: Introduces a text-only adaptation method that emulates audio projection as a text denoising task - trains LLM to recover clean transcripts from noisy inputs, preserving cross-modal alignment without architectural changes

Result: Achieves up to 22.1% relative improvement on two datasets, outperforming recent state-of-the-art text-only adaptation methods

Conclusion: Proposes an effective lightweight text-only adaptation approach for LLM-based ASR that maintains speech-text alignment through denoising-based training

Abstract: Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
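
The method trains the LLM to recover clean transcripts from noisy inputs. The sketch below shows one simple way to manufacture such (noisy, clean) training pairs by randomly substituting, deleting, and inserting words, mimicking ASR-style errors; the corruption operations and rates are illustrative assumptions, not the paper's noise model.

```python
import random

def corrupt_transcript(text, vocab, p_sub=0.1, p_del=0.05, p_ins=0.05, seed=None):
    """Produce a noisy version of a clean transcript by randomly substituting,
    deleting, and inserting words."""
    rng = random.Random(seed)
    noisy = []
    for word in text.split():
        r = rng.random()
        if r < p_del:
            continue                              # deletion
        if r < p_del + p_sub:
            noisy.append(rng.choice(vocab))       # substitution
        else:
            noisy.append(word)
        if rng.random() < p_ins:
            noisy.append(rng.choice(vocab))       # insertion
    return " ".join(noisy)

# Toy usage: (noisy, clean) pair for denoising-style adaptation.
clean = "the patient was prescribed ten milligrams daily"
vocab = ["then", "patients", "was", "proscribed", "ten", "milligram", "day"]
print(corrupt_transcript(clean, vocab, seed=0), "->", clean)
```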

[432] PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs

Artem Dementyev, Wazeer Zulfikar, Sinan Hersek, Pascal Getreuer, Anurag Kumar, Vivek Kumar

Main category: cs.SD

TL;DR: PhaseCoder is a transformer-based spatial audio encoder that works with arbitrary microphone arrays, enabling LLMs to perform spatial reasoning and targeted transcription tasks from multichannel audio.

DetailsMotivation: Current multimodal LLMs process audio as mono streams, losing spatial information crucial for embodied AI, while existing spatial audio models are limited to fixed microphone geometries, preventing deployment across diverse devices.

Method: PhaseCoder is a transformer-only spatial audio encoder that takes raw multichannel audio and microphone coordinates as inputs to perform localization and produce robust spatial embeddings. The system enables fine-tuning LLMs like Gemma 3n to reason over “Spatial Audio Tokens” generated by PhaseCoder.

Result: PhaseCoder achieves state-of-the-art results on microphone-invariant localization benchmarks and enables an LLM to perform complex spatial reasoning and targeted transcription tasks from arbitrary microphone arrays for the first time.

Conclusion: PhaseCoder bridges the gap between multimodal LLMs and spatial audio processing, enabling geometry-agnostic spatial audio understanding that can be integrated with large language models for advanced audio reasoning tasks.

Abstract: Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We demonstrate that Gemma 3n LLM can be fine-tuned to reason over “Spatial Audio Tokens” produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.

[433] Music Plagiarism Detection: Problem Formulation and a Segment-based Solution

Seonghyeon Go, Yumin Kim

Main category: cs.SD

TL;DR: Paper defines music plagiarism detection as a distinct MIR task, introduces Similar Music Pair dataset, and proposes segment transcription-based method for detection.

DetailsMotivation: Music plagiarism is a growing social issue, but research has been hampered by lack of clear task definition, slowing progress and limiting real-world applicability.

Method: Defines music plagiarism detection task distinct from other MIR tasks, introduces Similar Music Pair dataset, and proposes segment transcription-based detection method.

Result: Provides a clear task definition, releases the Similar Music Pair dataset, and demonstrates the proposed segment transcription-based method, with a demo and the dataset available on GitHub.

Conclusion: Clear task definition enables more focused research on music plagiarism detection, with proposed method and dataset serving as foundation for future work.

Abstract: Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026-MPD.

[434] MusicWeaver: Composer-Style Structural Editing and Minute-Scale Coherent Music Generation

Xuanchen Wang, Heng Wang, Weidong Cai

Main category: cs.SD

TL;DR: MusicWeaver is a framework for generating and editing long-range music using human-interpretable structured plans and a global-local diffusion transformer architecture.

DetailsMotivation: Current music generation lacks composer-style structural editing capabilities and minute-scale coherence, limiting practical music creation.

Method: Two-stage generation: first predicts structured multi-level song programs, then renders audio. Uses Global-Local Diffusion Transformer with global path for long-range progression and local path for fine details. Includes Motif Memory Retrieval for consistent motif recurrence and Projected Diffusion Inpainting for localized editing.

Result: Achieves state-of-the-art fidelity, controllability, and long-range coherence. Introduces Structure Coherence Score and Edit Fidelity Score for evaluation.

Conclusion: MusicWeaver enables practical music creation with structural editing and long-range coherence through interpretable representations and specialized architecture.

Abstract: Recent advances in music generation produce impressive samples, however, practical creation still lacks two key capabilities: composer-style structural editing and minute-scale coherence. We present MusicWeaver, a framework for generating and editing long-range music using a human-interpretable intermediate representation with guaranteed edit locality. MusicWeaver decomposes generation into two stages: it first predicts a structured plan, a multi-level song program encoding musical attributes that composers can directly edit, and then renders audio conditioned on this plan. To ensure minute-scale coherence, we introduce a Global-Local Diffusion Transformer, where a global path captures long-range musical progression via compressed representations and memory, while a local path synthesizes fine-grained acoustic detail. We further propose a Motif Memory Retrieval module that enables consistent motif recurrence with controllable variation. For editing, we propose Projected Diffusion Inpainting, an inpainting method that denoises only user-specified regions and preserves unchanged content, allowing repeated edits without drift. Finally, we introduce Structure Coherence Score and Edit Fidelity Score to evaluate long-range form and edit realization. Experiments demonstrate that MusicWeaver achieves state-of-the-art fidelity, controllability, and long-range coherence.

[435] Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation

June-Woo Kim, Dhruv Agarwal, Federica Cerina

Main category: cs.SD

TL;DR: Comprehensive evaluation of Fréchet Speech Distance (FSD) and Speech Maximum Mean Discrepancy (SMMD) for synthetic speech quality assessment, finding WavLM Base+ features best align with human ratings.

DetailsMotivation: Human listening tests for synthetic speech quality are expensive and impractical at scale, while Fréchet Distance shows promise but its reliability depends on embeddings and experimental settings.

Method: Comprehensive evaluation of FSD and SMMD under varied embeddings and conditions, incorporating human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate perceptual relevance.

Result: WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, they serve as complementary, cost-efficient, and reproducible measures.

Conclusion: FSD and SMMD are useful objective metrics for synthetic speech quality assessment, particularly when large-scale or direct listening assessments are infeasible, with WavLM Base+ features providing the best results.

Abstract: Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
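
Fréchet Speech Distance, like Fréchet Audio Distance, is built on the Fréchet distance between Gaussians fitted to real and generated embedding sets: ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2(C_r C_g)^{1/2}). Below is a minimal numpy/scipy sketch of that formula; the random vectors stand in for encoder features such as the WavLM Base+ embeddings favored by the study, and this is not the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two sets of embeddings."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random vectors standing in for speech-encoder features.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 32))
gen = rng.normal(0.2, 1.1, size=(500, 32))
print(frechet_distance(real, gen))
```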

[436] Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

Main category: cs.SD

TL;DR: PELM: A prior-enhanced audio large language model framework for detecting neural speech editing by formulating it as an audio question answering task, using word-level probability priors and acoustic consistency perception loss to address forgery and semantic biases.

DetailsMotivation: Existing speech editing detection methods struggle with emerging neural speech editing techniques that create seamless acoustic transitions, lacking high-quality datasets and effective detection frameworks for these advanced manipulations.

Method: Constructed AiEdit dataset using LLMs for semantic tampering logic and neural speech editing methods; proposed PELM framework that unifies detection and localization as audio QA, incorporates word-level probability priors, and uses centroid-aggregation-based acoustic consistency perception loss.

Result: PELM significantly outperforms SOTA methods on HumanEdit and AiEdit datasets, achieving EERs of 0.57% and 9.28% for localization respectively.

Conclusion: PELM effectively addresses neural speech editing detection challenges by leveraging audio LLMs with explicit acoustic priors and consistency modeling, advancing the field of speech manipulation detection.

Abstract: Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.

[437] MotionBeat: Motion-Aligned Music Representation via Embodied Contrastive Learning and Bar-Equivariant Contact-Aware Encoding

Xuanchen Wang, Heng Wang, Weidong Cai

Main category: cs.SD

TL;DR: MotionBeat is a motion-aligned music representation learning framework that incorporates embodied dance information to capture rhythmic and structural cues often missed by audio-only representations.

DetailsMotivation: Most audio representations ignore the embodied dimension of music, which is crucial for capturing rhythmic and structural cues that drive human movement and dance.

Method: Proposes two novel objectives: Embodied Contrastive Loss (ECL) with tempo-aware and beat-jitter negatives for rhythmic discrimination, and Structural Rhythm Alignment Loss (SRAL) to align music accents with motion events. Architecture includes bar-equivariant phase rotations and contact-guided attention.

Result: Outperforms state-of-the-art audio encoders in music-to-dance generation and shows strong transfer performance on beat tracking, music tagging, genre/instrument classification, emotion recognition, and audio-visual retrieval.

Conclusion: MotionBeat successfully incorporates embodied motion information into music representation learning, demonstrating superior performance on dance-related tasks and broad transfer capabilities across multiple audio understanding tasks.

Abstract: Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: https://motionbeat2025.github.io/.
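
The Embodied Contrastive Loss is described as an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives. For orientation, the sketch below shows plain InfoNCE over a batch of paired music/motion embeddings; the specialized negative mining is omitted, and the temperature value is an assumption, so this is only the base form the paper builds on.

```python
import numpy as np

def info_nce(music_emb, motion_emb, temperature=0.07):
    """Plain InfoNCE over paired (music, motion) embeddings: the i-th music clip
    should match the i-th motion clip and no other clip in the batch."""
    m = music_emb / np.linalg.norm(music_emb, axis=1, keepdims=True)
    v = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    logits = (m @ v.T) / temperature                        # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))              # positives on the diagonal

# Toy usage: motion embeddings are noisy copies of the matching music embeddings.
rng = np.random.default_rng(0)
music = rng.normal(size=(8, 128))
motion = music + 0.1 * rng.normal(size=(8, 128))
print(info_nce(music, motion))
```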

[438] Localizing Speech Deepfakes Beyond Transitions via Segment-Aware Learning

Yuchen Mao, Wen Huang, Yanmin Qian

Main category: cs.SD

TL;DR: SAL (Segment-Aware Learning) improves deepfake audio localization by focusing on entire manipulated segments rather than just boundary artifacts, using segment positional labeling and cross-segment mixing.

DetailsMotivation: Existing deepfake audio localization methods over-rely on boundary artifacts between real and fake segments while neglecting the manipulated content itself. Effective localization requires understanding entire segments beyond just detecting transitions.

Method: Proposes Segment-Aware Learning (SAL) with two core techniques: 1) Segment Positional Labeling - provides fine-grained frame supervision based on relative position within a segment, 2) Cross-Segment Mixing - data augmentation method that generates diverse segment patterns.

Result: Experiments across multiple deepfake localization datasets show SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts.

Conclusion: SAL effectively addresses the limitations of boundary-focused approaches by encouraging models to focus on internal segment structure, leading to more robust deepfake audio localization.

Abstract: Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over-rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding the entire segments beyond just detecting transitions. Thus, we propose Segment-Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine-grained frame supervision based on relative position within a segment; and Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.
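
Segment Positional Labeling is described as frame supervision based on relative position within a segment. The sketch below shows one plausible encoding of that idea, assigning each frame inside a manipulated segment its normalized offset from the segment start; the exact label design in the paper may differ.

```python
import numpy as np

def segment_position_labels(num_frames, fake_segments):
    """Frame-level targets: 0 for genuine frames; inside each manipulated
    segment, the frame's relative position in (0, 1] from start to end."""
    labels = np.zeros(num_frames, dtype=np.float32)
    for start, end in fake_segments:          # half-open frame ranges [start, end)
        length = end - start
        labels[start:end] = np.arange(1, length + 1) / length
    return labels

# Toy usage: 20 frames, with frames 5-9 and 14-16 manipulated.
print(segment_position_labels(20, [(5, 10), (14, 17)]))
```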

[439] AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio-Generation

Hui Wang, Jinghua Zhao, Junyang Cheng, Cheng Liu, Yuhang Jia, Haoqin Sun, Jiaming Zhou, Yong Qin

Main category: cs.SD

TL;DR: AudioEval: A large-scale text-to-audio evaluation dataset with 4,200 samples and 126,000 ratings across 5 dimensions, plus Qwen-DisQA baseline model for multi-dimensional rating prediction.

DetailsMotivation: Current text-to-audio evaluation is challenging due to expensive human studies and limited automatic metrics that capture only partial aspects of perceptual quality.

Method: Created AudioEval dataset with 4,200 generated audio samples from 24 systems, collected 126,000 ratings from experts and non-experts across 5 dimensions (enjoyment, usefulness, complexity, quality, text alignment). Proposed Qwen-DisQA as baseline model that jointly processes prompts and audio to predict multi-dimensional ratings with distributional prediction for rater disagreement.

Result: Benchmarked diverse automatic evaluators using AudioEval, revealing perspective- and dimension-level differences across model families. Qwen-DisQA achieved strong performance as a reference baseline for multi-dimensional rating prediction.

Conclusion: AudioEval provides a comprehensive evaluation resource for text-to-audio generation, and Qwen-DisQA serves as a strong baseline model for predicting multi-dimensional audio quality ratings while modeling rater disagreement.

Abstract: Text-to-audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large-scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non-experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective- and dimension-level differences across model families. We also propose Qwen-DisQA as a strong reference baseline: it jointly processes prompts and generated audio to predict multi-dimensional ratings for both annotator groups, modeling rater disagreement via distributional prediction and achieving strong performance. We will release AudioEval to support future research in TTA evaluation.

[440] LLM2Fx-Tools: Tool Calling For Music Post-Production

Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Woosung Choi, Wei-Hsiang Liao, Qiyu Wu, Juhan Nam, Yuki Mitsufuji

Main category: cs.SD

TL;DR: LLM2Fx-Tools is a multimodal framework that uses LLMs to generate executable audio effects chains for music production through tool calling and chain-of-thought reasoning.

DetailsMotivation: The paper aims to create an interpretable and controllable system for music post-production by applying LLM-based tool calling to audio effects modules, addressing the need for automated yet understandable audio processing.

Method: Uses LLMs with chain-of-thought planning to understand audio inputs, select effect types, determine order, and estimate parameters. Introduces LP-Fx dataset with structured CoT annotations and tool calls. Combines autoregressive sequence modeling, tool calling, and CoT reasoning.

Result: System can infer Fx-chains and parameters from audio pairs, validated in style transfer settings. LLM-as-a-judge evaluation shows appropriate CoT reasoning for music production queries.

Conclusion: First work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production through multimodal understanding and generation.

Abstract: This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.

[441] Do Foundational Audio Encoders Understand Music Structure?

Keisuke Toyama, Zhi Zhong, Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji

Main category: cs.SD

TL;DR: Comprehensive evaluation of 11 pretrained foundational audio encoders for music structure analysis reveals self-supervised learning with masked language modeling on music data works best.

DetailsMotivation: While pretrained foundational audio encoders (FAEs) have shown success in various MIR tasks like music tagging and transcription, their application to music structure analysis (MSA) remains underexplored. The impact of factors like learning methods, training data, and model context length on MSA performance is unclear.

Method: Conducted comprehensive experiments on 11 types of FAEs to investigate how learning methods, training data, and model context length affect MSA performance. Evaluated different pretraining approaches including self-supervised learning with masked language modeling.

Result: FAEs using self-supervised learning with masked language modeling on music data are particularly effective for music structure analysis. The study identifies key factors that influence MSA performance across different encoder architectures.

Conclusion: The findings provide guidance for selecting appropriate FAEs for MSA tasks and pave the way for future research in foundational audio encoders and music structure analysis.

Abstract: In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in FAE and MSA.

cs.LG

[442] Rethinking LLM-Driven Heuristic Design: Generating Efficient and Specialized Solvers via Dynamics-Aware Optimization

Rongzheng Wang, Yihong Huang, Muquan Li, Jiakai Li, Di Liang, Bob Simons, Pei Ke, Shuang Liang, Ke Qin

Main category: cs.LG

TL;DR: DASH framework uses LLMs to co-optimize solver search mechanisms and runtime schedules for combinatorial optimization, with profiled library retrieval for efficient warm-starts on new distributions.

DetailsMotivation: Existing LLM-Driven Heuristic Design frameworks have limitations: endpoint-only evaluation ignores convergence process and runtime efficiency, and high adaptation costs require re-adaptation for new instance groups.

Method: Proposes DASH framework with convergence-aware metric to co-optimize solver search mechanisms and runtime schedules, plus Profiled Library Retrieval (PLR) to archive specialized solvers for cost-effective warm-starts.

Result: DASH improves runtime efficiency by over 3x while surpassing solution quality of SOTA baselines across diverse problem scales, and cuts LLM adaptation costs by over 90% through profile-based warm starts.

Conclusion: DASH effectively addresses limitations of existing LHD frameworks by co-optimizing search mechanisms and runtime schedules, with PLR enabling efficient adaptation to heterogeneous distributions.

Abstract: Large Language Models (LLMs) have advanced the field of Combinatorial Optimization through automated heuristic generation. Instead of relying on manual design, this LLM-Driven Heuristic Design (LHD) process leverages LLMs to iteratively generate and refine solvers to achieve high performance. However, existing LHD frameworks face two critical limitations: (1) Endpoint-only evaluation, which ranks solvers solely by final quality, ignoring the convergence process and runtime efficiency; (2) High adaptation costs, where distribution shifts necessitate re-adaptation to generate specialized solvers for new instance groups. To address these issues, we propose Dynamics-Aware Solver Heuristics (DASH), a framework that co-optimizes solver search mechanisms and runtime schedules guided by a convergence-aware metric, thereby identifying efficient and high-performance solvers. Furthermore, to mitigate expensive re-adaptation, DASH incorporates Profiled Library Retrieval (PLR). PLR efficiently archives specialized solvers concurrently with the evolutionary process, enabling cost-effective warm-starts for heterogeneous distributions. Experiments on four combinatorial optimization problems demonstrate that DASH improves runtime efficiency by over 3 times, while surpassing the solution quality of state-of-the-art baselines across diverse problem scales. Furthermore, by enabling profile-based warm starts, DASH maintains superior accuracy under different distributions while cutting LLM adaptation costs by over 90%.

[443] Finetune-Informed Pretraining Boosts Downstream Performance

Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang

Main category: cs.LG

TL;DR: FIP is a pretraining method that biases multimodal representation learning toward a target modality needed during fine-tuning, improving downstream performance without extra data or compute.

DetailsMotivation: Standard multimodal pretraining treats all modalities uniformly, but many practical deployments only use one modality heavily during downstream fine-tuning, leading to under-optimized representations for the modality that actually matters.

Method: FIP biases representation learning toward a designated target modality through three techniques: higher masking difficulty, stronger loss weighting, and increased decoder capacity for the target modality, without modifying the shared encoder or requiring additional supervision.

Result: When applied to masked modeling on constellation diagrams for wireless signals, FIP consistently improves downstream fine-tuned performance with no extra data or compute.

Conclusion: FIP is a simple, model-agnostic method that is architecture-compatible and broadly applicable across multimodal masked modeling pipelines, effectively optimizing representations for the modality that matters in downstream applications.

Abstract: Multimodal pretraining is effective for building general-purpose representations, but in many practical deployments, only one modality is heavily used during downstream fine-tuning. Standard pretraining strategies treat all modalities uniformly, which can lead to under-optimized representations for the modality that actually matters. We propose Finetune-Informed Pretraining (FIP), a model-agnostic method that biases representation learning toward a designated target modality needed at fine-tuning time. FIP combines higher masking difficulty, stronger loss weighting, and increased decoder capacity for the target modality, without modifying the shared encoder or requiring additional supervision. When applied to masked modeling on constellation diagrams for wireless signals, FIP consistently improves downstream fine-tuned performance with no extra data or compute. FIP is simple to implement, architecture-compatible, and broadly applicable across multimodal masked modeling pipelines.
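
FIP up-weights the target modality in the pretraining objective, alongside harder masking and a larger decoder for that modality. The sketch below shows the loss-weighting part only, as a plain weighted sum of per-modality masked-reconstruction losses; the modality names and the weight value are illustrative assumptions rather than the paper's configuration.

```python
def fip_style_loss(recon_errors, target_modality, target_weight=2.0):
    """Weighted sum of per-modality masked-reconstruction losses, up-weighting
    the modality that will matter at fine-tuning time; the harder masking and
    larger decoder for that modality are configured elsewhere in the model."""
    return sum((target_weight if m == target_modality else 1.0) * err
               for m, err in recon_errors.items())

# Toy usage: mean masked-patch reconstruction errors per modality.
errors = {"constellation": 0.31, "spectrogram": 0.24, "metadata": 0.12}
print(fip_style_loss(errors, target_modality="constellation"))
```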

[444] A generative machine learning model for designing metal hydrides applied to hydrogen storage

Xiyuan Liu, Christian Hacker, Shengnian Wang, Yuhua Duan

Main category: cs.LG

TL;DR: A framework combining causal discovery with generative ML to create novel metal hydride candidates for hydrogen storage, generating 6 new formulas with 4 validated by DFT simulations.

DetailsMotivation: Current materials databases have limited hydride entries, constraining discovery of optimal hydrogen storage materials needed for carbon-neutral energy systems.

Method: Integrates causal discovery with lightweight generative ML model trained on 450 samples (270 training, 90 validation, 90 testing) to generate 1,000 novel metal hydride candidates, then ranks and filters them.

Result: Generated 6 previously unreported chemical formulas and crystal structures, with 4 validated by density functional theory simulations showing strong potential for experimental investigation.

Conclusion: The framework provides scalable, time-efficient approach for expanding hydrogen storage datasets and accelerating materials discovery beyond existing database limitations.

Abstract: Developing new metal hydrides is a critical step toward efficient hydrogen storage in carbon-neutral energy systems. However, existing materials databases, such as the Materials Project, contain a limited number of well-characterized hydrides, which constrains the discovery of optimal candidates. This work presents a framework that integrates causal discovery with a lightweight generative machine learning model to generate novel metal hydride candidates that may not exist in current databases. Using a dataset of 450 samples (270 training, 90 validation, and 90 testing), the model generates 1,000 candidates. After ranking and filtering, six previously unreported chemical formulas and crystal structures are identified, four of which are validated by density functional theory simulations and show strong potential for future experimental investigation. Overall, the proposed framework provides a scalable and time-efficient approach for expanding hydrogen storage datasets and accelerating materials discovery.

[445] Is Parameter Isolation Better for Prompt-Based Continual Learning?

Jiangyang Li, Chenhao Ding, Songlin Dong, Qiang Wang, Jianchao Zhao, Yuhang He, Yihong Gong

Main category: cs.LG

TL;DR: A prompt-sharing framework for continual learning with global prompt pool, task-aware gated routing, and history-aware modulator to improve parameter utilization and reduce forgetting.

DetailsMotivation: Existing prompt-based continual learning methods assign fixed prompts to each task, isolating knowledge across tasks and leading to suboptimal parameter utilization. There's a need for more efficient prompt sharing and dynamic allocation.

Method: Proposes a prompt-sharing framework with: 1) Global prompt pool, 2) Task-aware gated routing mechanism that sparsely activates subsets of prompts for dynamic decoupling and collaborative optimization, 3) History-aware modulator using cumulative prompt activation statistics to protect frequently used prompts from excessive updates.

Result: Extensive analysis and empirical results demonstrate the approach consistently outperforms existing static allocation strategies in both effectiveness and efficiency.

Conclusion: The proposed prompt-sharing framework with dynamic routing and history-aware modulation effectively addresses parameter utilization and knowledge forgetting issues in continual learning.

Abstract: Prompt-based continual learning methods effectively mitigate catastrophic forgetting. However, most existing methods assign a fixed set of prompts to each task, completely isolating knowledge across tasks and resulting in suboptimal parameter utilization. To address this, we consider the practical needs of continual learning and propose a prompt-sharing framework. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts to achieve dynamic decoupling and collaborative optimization of task-specific feature representations. Furthermore, we introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates, thereby mitigating inefficient parameter usage and knowledge forgetting. Extensive analysis and empirical results demonstrate that our approach consistently outperforms existing static allocation strategies in effectiveness and efficiency.
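
The task-aware gated routing is described as sparsely activating a subset of prompts from a global pool. The sketch below shows one straightforward realization: score each prompt key against a task feature, keep the top-k, and normalize their gate values with a softmax. The pool size, k, and cosine-style scoring are assumptions, not the paper's exact gating network.

```python
import numpy as np

def route_prompts(task_feature, prompt_keys, k=4):
    """Task-aware gated routing: score every prompt in the global pool against
    the task feature and sparsely activate the top-k prompts."""
    f = task_feature / np.linalg.norm(task_feature)
    keys = prompt_keys / np.linalg.norm(prompt_keys, axis=1, keepdims=True)
    scores = keys @ f
    top = np.argsort(scores)[-k:][::-1]                        # activated prompt indices
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()    # softmax over the top-k
    return top, gates

# Toy usage: a 32-prompt shared pool and one task feature vector.
rng = np.random.default_rng(0)
pool_keys = rng.normal(size=(32, 64))
task_feat = rng.normal(size=64)
idx, gate = route_prompts(task_feat, pool_keys)
print(idx, gate.round(3))
```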

[446] Faster Predictive Coding Networks via Better Initialization

Luca Pinchetti, Simon Frieder, Thomas Lukasiewicz, Tommaso Salvatori

Main category: cs.LG

TL;DR: Predictive coding networks can be made more computationally efficient through better initialization techniques that preserve iterative progress from previous training samples.

DetailsMotivation: Energy-based learning algorithms like predictive coding are mathematically grounded but computationally expensive due to their iterative nature, limiting their practical applicability despite their versatility.

Method: Proposed a new initialization technique for predictive coding networks that aims to preserve the iterative progress made on previous training samples, reducing required training times.

Result: Experiments demonstrate substantial improvements in convergence speed and final test loss in both supervised and unsupervised settings.

Conclusion: The proposed initialization technique offers a promising path toward reconciling computational efficiency disparities between predictive coding and backpropagation while maintaining performance.

Abstract: Research aimed at scaling up neuroscience inspired learning algorithms for neural networks is accelerating. Recently, a key research area has been the study of energy-based learning algorithms such as predictive coding, due to their versatility and mathematical grounding. However, the applicability of such methods is held back by the large computational requirements caused by their iterative nature. In this work, we address this problem by showing that the choice of initialization of the neurons in a predictive coding network matters significantly and can notably reduce the required training times. Consequently, we propose a new initialization technique for predictive coding networks that aims to preserve the iterative progress made on previous training samples. Our approach suggests a promising path toward reconciling the disparities between predictive coding and backpropagation in terms of computational efficiency and final performance. In fact, our experiments demonstrate substantial improvements in convergence speed and final test loss in both supervised and unsupervised settings.

[447] TwinWeaver: An LLM-Based Foundation Model Framework for Pan-Cancer Digital Twins

Nikita Makarov, Maria Bordukova, Lena Voith von Voithenberg, Estrella Pivel-Villanueva, Sabrina Mielke, Jonathan Wickes, Hanchen Wang, Mingyu Derek Ma, Keunwoo Choi, Kyunghyun Cho, Stephen Ra, Raul Rodriguez-Esteban, Fabian Schmich, Michael Menden

Main category: cs.LG

TL;DR: TwinWeaver framework serializes longitudinal patient histories into text for unified event prediction using LLMs, achieving superior forecasting accuracy and risk stratification in oncology compared to traditional time-series methods.

DetailsMotivation: Precision oncology requires forecasting clinical events and trajectories, but modeling sparse, multi-modal clinical time series remains a critical challenge. Current methods struggle with handling diverse data types and longitudinal patient histories effectively.

Method: TwinWeaver serializes longitudinal patient histories into text format, enabling unified event prediction and forecasting with large language models. The framework builds Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types by converting clinical data into textual representations that LLMs can process.

Result: GDT significantly reduces forecasting error with median MASE of 0.87 vs 0.97 for best time-series baseline (p<0.001). It improves risk stratification with average C-index of 0.703 vs 0.662 baseline. It generalizes to out-of-distribution clinical trials, achieving median MASE of 0.75-0.88 and outperforming baselines in event prediction with C-index of 0.672 vs 0.648.

Conclusion: TwinWeaver provides a scalable and transparent foundation for longitudinal clinical modeling, enabling interpretable clinical reasoning and demonstrating that LLMs can effectively process serialized clinical data for superior forecasting and risk prediction in oncology.

Abstract: Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multi-modal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting error, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p<0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75-0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.
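
The headline forecasting metric is MASE: forecast MAE scaled by the in-sample MAE of a one-step naive (last-value) forecast, so values below 1.0 beat the naive baseline. The sketch below computes it on toy numbers; the example values are placeholders, not data from the study.

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """Mean Absolute Scaled Error: forecast MAE divided by the MAE of a
    one-step naive (last-value) forecast on the training series."""
    mae = np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
    naive_mae = np.mean(np.abs(np.diff(np.asarray(y_train))))
    return mae / naive_mae

# Toy usage: a short history of lab values and a two-step-ahead forecast.
history = [4.1, 4.3, 4.0, 4.6, 4.4]
future = [4.5, 4.7]
forecast = [4.4, 4.6]
print(round(mase(future, forecast, history), 3))   # < 1.0 means better than naive
```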

[448] Noisy but Valid: Robust Statistical Evaluation of LLMs with Imperfect Judges

Chen Feng, Minghe Shen, Ananth Balashankar, Carsten Gerner-Beuerle, Miguel R. D. Rodrigues

Main category: cs.LG

TL;DR: A statistical testing framework for certifying LLM safety that accounts for imperfect “LLM-as-a-Judge” evaluations while maintaining theoretical validity guarantees.

DetailsMotivation: Current LLM certification methods using "LLM-as-a-Judge" approaches are scalable but suffer from judge imperfections, noise, and bias that can invalidate statistical guarantees, creating a need for reliable certification despite imperfect evaluators.

Method: Introduces a “Noisy but Valid” hypothesis testing framework that uses a small human-labeled calibration set to estimate judge’s True Positive/False Positive Rates, then applies a variance-corrected critical threshold to large judge-labeled datasets while guaranteeing finite-sample Type-I error control.

Result: Theoretical guarantees show conditions where noisy testing outperforms direct evaluation; empirical validation on Jigsaw Comment, Hate Speech and SafeRLHF datasets confirms the theory; reveals significant “Oracle Gap” between practical methods and theoretical optimum.

Conclusion: Provides first systematic treatment of imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying evaluation power dependencies on judge quality, dataset size, and certification levels.

Abstract: Reliable certification of Large Language Models (LLMs)-verifying that failure rates are below a safety threshold-is critical yet challenging. While “LLM-as-a-Judge” offers scalability, judge imperfections, noise, and bias can invalidate statistical guarantees. We introduce a “Noisy but Valid” hypothesis testing framework to address this. By leveraging a small human-labelled calibration set to estimate the judge’s True Positive and False Positive Rates (TPR/FPR), we derive a variance-corrected critical threshold applied to a large judge-labelled dataset. Crucially, our framework theoretically guarantees finite-sample Type-I error control (validity) despite calibration uncertainty. This distinguishes our work from Prediction-Powered Inference (PPI), positioning our method as a diagnostic tool that explicitly models judge behavior rather than a black-box estimator. Our contributions include: (1) Theoretical Guarantees: We derive the exact conditions under which noisy testing yields higher statistical power than direct evaluation; (2) Empirical Validation: Experiments on Jigsaw Comment, Hate Speech and SafeRLHF confirm our theory; (3) The Oracle Gap: We reveal a significant performance gap between practical methods and the theoretical “Oracle” (perfectly known judge parameters), quantifying the cost of estimation. Specifically, we provide the first systematic treatment of the imperfect-judge setting, yielding interpretable diagnostics of judge reliability and clarifying how evaluation power depends on judge quality, dataset size, and certification levels. Together, these results sharpen understanding of statistical evaluation with LLM judges, and highlight trade-offs among competing inferential tools.
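
For intuition about why the judge's TPR and FPR matter, the sketch below applies the classical Rogan-Gladen-style correction that maps an observed judge-flagged failure rate back to an estimate of the true rate. This is background illustration only: the paper's contribution is a variance-corrected critical threshold with finite-sample Type-I error control, which this snippet does not implement.

```python
def corrected_failure_rate(observed_rate, tpr, fpr):
    """Map the fraction of responses flagged unsafe by an imperfect judge back
    to an estimate of the true failure rate, given the judge's TPR and FPR
    (Rogan-Gladen-style correction; assumes tpr > fpr)."""
    return (observed_rate - fpr) / (tpr - fpr)

# Toy usage: the judge flags 12% of outputs, with FPR 5% and TPR 90%.
print(round(corrected_failure_rate(0.12, tpr=0.90, fpr=0.05), 4))  # ~0.0824
```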

[449] Noninvasive Intracranial Pressure Estimation Using Subspace System Identification and Bespoke Machine Learning Algorithms: A Learning-to-Rank Approach

Anni Zhao, Ayca Ermis, Jeffrey Robert Vitt, Sergio Brasil, Wellingson Paiva, Magdalena Kasprowicz, Malgorzata Burzynska, Robert Hamilton, Runze Yan, Ofer Sadan, J. Claude Hemphill, Lieven Vandenberghe, Xiao Hu

Main category: cs.LG

TL;DR: A machine learning approach using system identification and ranking-constrained optimization to estimate intracranial pressure (ICP) noninvasively from arterial blood pressure, cerebral blood velocity, and R-R interval signals.

DetailsMotivation: Accurate noninvasive estimation of intracranial pressure (ICP) is a major challenge in critical care, with current methods being invasive. There's a need for safe, broadly accessible ICP monitoring for patients with acute brain injury.

Method: Uses subspace system identification to model cerebral hemodynamics from noninvasive signals (ABP, CBv, R-R interval). Learns a mapping function between signal features and estimation errors using ranking constraints through convex optimization. Patients were randomly split into training/testing datasets for evaluation.

Result: Approximately 31.88% of testing entries achieved estimation errors within 2 mmHg, and 34.07% had errors between 2-6 mmHg using the nonlinear mapping with constraints.

Conclusion: The approach demonstrates feasibility for noninvasive ICP estimation, laying foundation for safe, accessible ICP monitoring, though further validation and refinement are needed before clinical deployment.

Abstract: Objective: Accurate noninvasive estimation of intracranial pressure (ICP) remains a major challenge in critical care. We developed a bespoke machine learning algorithm that integrates system identification and ranking-constrained optimization to estimate mean ICP from noninvasive signals. Methods: A machine learning framework was proposed to obtain accurate mean ICP values using arbitrary noninvasive signals. The subspace system identification algorithm is employed to identify cerebral hemodynamics models for ICP simulation using arterial blood pressure (ABP), cerebral blood velocity (CBv), and R-wave to R-wave interval (R-R interval) signals in a comprehensive database. A mapping function to describe the relationship between the features of noninvasive signals and the estimation errors is learned using innovative ranking constraints through convex optimization. Patients across multiple clinical settings were randomly split into testing and training datasets for performance evaluation of the mapping function. Results: The results indicate that about 31.88% of testing entries achieved estimation errors within 2 mmHg and 34.07% of testing entries between 2 mmHg to 6 mmHg from the nonlinear mapping with constraints. Conclusion: Our results demonstrate the feasibility of the proposed noninvasive ICP estimation approach. Significance: Further validation and technical refinement are required before clinical deployment, but this work lays the foundation for safe and broadly accessible ICP monitoring in patients with acute brain injury and related conditions.

[450] A Theory of Universal Agnostic Learning

Steve Hanneke, Shay Moran

Main category: cs.LG

TL;DR: Complete theory of optimal universal rates for binary classification in agnostic setting, extending realizable-case theory by removing realizability assumption

DetailsMotivation: Extend the realizable-case theory of binary classification to the more realistic agnostic setting where the realizability assumption on the distribution is removed, providing a complete understanding of optimal universal rates

Method: Theoretical analysis identifying a fundamental tetrachotomy of optimal rates: for every concept class, the optimal universal rate of convergence of the excess error rate falls into one of four categories: e^{-n}, e^{-o(n)}, o(n^{-1/2}), or arbitrarily slow

Result: Identifies simple combinatorial structures that determine which of the four categories any given concept class falls into, providing a complete characterization of optimal universal rates for binary classification in agnostic setting

Conclusion: Provides a complete theory of optimal universal rates for binary classification in the agnostic setting, extending previous realizable-case theory and establishing a fundamental tetrachotomy of convergence rates determined by combinatorial properties of concept classes

Abstract: We provide a complete theory of optimal universal rates for binary classification in the agnostic setting. This extends the realizable-case theory of Bousquet, Hanneke, Moran, van Handel, and Yehudayoff (2021) by removing the realizability assumption on the distribution. We identify a fundamental tetrachotomy of optimal rates: for every concept class, the optimal universal rate of convergence of the excess error rate is one of $e^{-n}$, $e^{-o(n)}$, $o(n^{-1/2})$, or arbitrarily slow. We further identify simple combinatorial structures which determine which of these categories any given concept class falls into.

[451] Monotone Optimisation with Learned Projections

Ahmed Rashwan, Keith Briggs, Chris Budd, Lisa Kreusser

Main category: cs.LG

TL;DR: Learning-based approach to accelerate monotone optimization by predicting projection primitives via radial inverse networks instead of explicit function estimation

DetailsMotivation: Monotone optimization problems often lack explicit objective/constraint functions and only have data access, making traditional Polyblock Outer Approximation (POA) difficult to apply directly due to costly bisection procedures

Method: Introduces Homogeneous-Monotone Radial Inverse (HM-RI) networks that predict POA’s projection primitive via radial inverse, enforcing monotonicity and homogeneity properties to enable fast projection estimation without explicit function learning

Result: Substantial speed-ups across monotone optimization benchmarks (indefinite quadratic programming, multiplicative programming, transmit power optimization) while maintaining solution quality, outperforming baselines that don’t exploit monotonic structure

Conclusion: Algorithm-aware learning approach integrating learned models into POA via radial inverse prediction provides efficient solution for data-driven monotone optimization problems

Abstract: Monotone optimisation problems admit specialised global solvers such as the Polyblock Outer Approximation (POA) algorithm, but these methods typically require explicit objective and constraint functions. In many applications, these functions are only available through data, making POA difficult to apply directly. We introduce an algorithm-aware learning approach that integrates learned models into POA by directly predicting its projection primitive via the radial inverse, avoiding the costly bisection procedure used in standard POA. We propose Homogeneous-Monotone Radial Inverse (HM-RI) networks, structured neural architectures that enforce key monotonicity and homogeneity properties, enabling fast projection estimation. We provide a theoretical characterisation of radial inverse functions and show that, under mild structural conditions, a HM-RI predictor corresponds to the radial inverse of a valid set of monotone constraints. To reduce training overhead, we further develop relaxed monotonicity conditions that remain compatible with POA. Across multiple monotone optimisation benchmarks (indefinite quadratic programming, multiplicative programming, and transmit power optimisation), our approach yields substantial speed-ups in comparison to direct function estimation while maintaining strong solution quality, outperforming baselines that do not exploit monotonic structure.

[452] Distributional Active Inference

Abdullah Akgül, Gulcin Baykal, Manuel Haußmann, Mustafa Mert Çelikok, Melih Kandemir

Main category: cs.LG

TL;DR: Active inference integrated with distributional reinforcement learning for improved sample efficiency without requiring transition dynamics modeling

DetailsMotivation: Address dual challenges in robotic control: efficient sensory state organization and far-sighted planning. Reinforcement learning alone is sample-inefficient, while active inference (biological brain theory) hasn't been well integrated into AI beyond model-based approaches.

Method: Created formal abstraction of RL algorithms spanning model-based, distributional, and model-free approaches. Integrated active inference into distributional RL framework without requiring transition dynamics modeling.

Result: Performance advantages of active inference become accessible without the need for transition dynamics modeling, potentially improving sample efficiency in complex robotic environments.

Conclusion: Successfully bridges active inference theory with practical RL frameworks, making biological-inspired efficient control accessible to artificial intelligence systems.

Abstract: Optimal control of complex environments with robotic systems faces two complementary and intertwined challenges: efficient organization of sensory state information and far-sighted action planning. Because the reinforcement learning framework addresses only the latter, it tends to deliver sample-inefficient solutions. Active inference is the state-of-the-art process theory that explains how biological brains handle this dual problem. However, its applications to artificial intelligence have thus far been limited to extensions of existing model-based approaches. We present a formal abstraction of reinforcement learning algorithms that spans model-based, distributional, and model-free approaches. This abstraction seamlessly integrates active inference into the distributional reinforcement learning framework, making its performance advantages accessible without transition dynamics modeling.

[453] Pre-trained Encoders for Global Child Development: Transfer Learning Enables Deployment in Data-Scarce Settings

Md Muhtasim Munif Fahim, Md Rezaul Karim

Main category: cs.LG

TL;DR: Pre-trained encoder for global child development monitoring using UNICEF data achieves strong few-shot performance across 44 countries, enabling ML deployment in resource-constrained settings.

DetailsMotivation: To address the data bottleneck in deploying machine learning for child development monitoring in new countries, where traditional models require thousands of samples but programs start with fewer than 100.

Method: Developed a pre-trained encoder trained on 357,709 children across 44 countries using UNICEF survey data, enabling few-shot learning with transfer learning approach.

Result: With only 50 training samples, achieves AUC of 0.65 (outperforming cold-start gradient boosting by 8-12%); at N=500, AUC of 0.73; zero-shot deployment to unseen countries achieves AUCs up to 0.84.

Conclusion: Pre-trained encoders can transform feasibility of ML for Sustainable Development Goal 4.2.1 monitoring in resource-constrained settings by enabling few-shot generalization.

Abstract: A large number of children experience preventable developmental delays each year, yet the deployment of machine learning in new countries has been stymied by a data bottleneck: reliable models require thousands of samples, while new programs begin with fewer than 100. We introduce the first pre-trained encoder for global child development, trained on 357,709 children across 44 countries using UNICEF survey data. With only 50 training samples, the pre-trained encoder achieves an average AUC of 0.65 (95% CI: 0.56-0.72), outperforming cold-start gradient boosting at 0.61 by 8-12% across regions. At N=500, the encoder achieves an AUC of 0.73. Zero-shot deployment to unseen countries achieves AUCs up to 0.84. We apply a transfer learning bound to explain why pre-training diversity enables few-shot generalization. These results establish that pre-trained encoders can transform the feasibility of ML for SDG 4.2.1 monitoring in resource-constrained settings.
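The deployment pattern behind these numbers (freeze the pre-trained encoder, fit a small head on the handful of in-country samples) can be sketched as below; `encoder` and its `transform` interface are placeholders, not the paper's released artifacts.

```python
from sklearn.linear_model import LogisticRegression

def few_shot_adapt(encoder, X_new, y_new):
    """Hedged sketch of few-shot transfer: embed ~50 local survey records
    with a frozen pre-trained encoder, then fit a lightweight linear probe
    for on-track / at-risk classification. Returns a probability scorer."""
    Z = encoder.transform(X_new)          # frozen encoder, no fine-tuning
    head = LogisticRegression(max_iter=1000)
    head.fit(Z, y_new)
    return lambda X: head.predict_proba(encoder.transform(X))[:, 1]
```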

[454] Generalized Information Gathering Under Dynamics Uncertainty

Fernando Palafox, Jingqi Li, Jesse Milzman, David Fridovich-Keil

Main category: cs.LG

TL;DR: A unifying framework for active information gathering in unknown dynamical systems that decouples modeling choices from information-gathering costs using directed information theory.

DetailsMotivation: Existing active learning methods require bespoke costs for specific modeling choices (dynamics models, belief updates, observation models, planners), creating fragmentation. A unifying framework is needed to decouple these choices from information-gathering objectives.

Method: Develops a framework exposing causal dependencies between parameters, beliefs, and controls. Derives a general information-gathering cost based on Massey’s directed information that assumes only Markov dynamics with additive noise, remaining agnostic to other modeling choices.

Result: Shows that mutual information cost used in existing literature is a special case of their cost. Establishes explicit connection between mutual information cost and information gain in linearized Bayesian estimation, providing theoretical justification for mutual information-based active learning.

Conclusion: Provides a unifying theoretical framework for active information gathering that generalizes existing approaches and offers practical utility across linear, nonlinear, and multi-agent systems.

Abstract: An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information-gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information-gathering cost based on Massey’s directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information-based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi-agent systems.
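For reference, Massey's directed information from a control sequence $X^n$ to an observation sequence $Y^n$, the quantity the paper's general cost builds on, has the standard definition

$$ I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\left(X^t;\, Y_t \mid Y^{t-1}\right), $$

which differs from ordinary mutual information only in that each term conditions causally on the past and present inputs $X^t$ rather than on the whole sequence $X^n$.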

[455] Top-k on a Budget: Adaptive Ranking with Weak and Strong Oracles

Lutz Oettershagen

Main category: cs.LG

TL;DR: ACE algorithm for top-k identification using noisy weak oracle and scarce strong oracle, with adaptive certification to reduce expensive strong calls

DetailsMotivation: Identifying top-k items is fundamental but expensive when exact valuations require scarce, high-fidelity resources like human experts or expensive simulations. Need efficient methods that minimize costly strong oracle calls while maintaining accuracy.

Method: Two-phase approach: 1) ACE algorithm uses weak oracle confidence intervals to focus scarce strong oracle queries on critical boundary items near top-k threshold. 2) ACE-W extends with adaptive weak budget allocation before running ACE to further reduce strong costs.

Result: ACE achieves O(m(4ε_max)) strong calls, matching the conditional lower bound Ω(m(ε_max)) up to the dilation of the interval radius, where m(·) is the near-tie mass around the top-k threshold. ACE-W further reduces strong costs through adaptive weak budget allocation.

Conclusion: Adaptive certification algorithms can significantly reduce expensive strong oracle calls for top-k identification while maintaining theoretical guarantees, making expensive expert verification more practical.

Abstract: Identifying the top-$k$ items is fundamental but often prohibitive when exact valuations are expensive. We study a two-oracle setting with a fast, noisy weak oracle and a scarce, high-fidelity strong oracle (e.g., human expert verification or expensive simulation). We first analyze a simple screen-then-certify baseline (STC) and prove it makes at most $m(4\varepsilon_{\max})$ strong calls given jointly valid weak confidence intervals with maximum radius $\varepsilon_{\max}$, where $m(\cdot)$ denotes the near-tie mass around the top-$k$ threshold. We establish a conditional lower bound of $Ω(m(\varepsilon_{\max}))$ for any algorithm given the same weak uncertainty. Our main contribution is ACE, an adaptive certification algorithm that focuses strong queries on critical boundary items, achieving the same $O(m(4\varepsilon_{\max}))$ bound while reducing strong calls in practice. We then introduce ACE-W, a fully adaptive two-phase method that allocates weak budget adaptively before running ACE, further reducing strong costs.
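A toy sketch of the screen-then-certify baseline that the adaptive method refines (interfaces and the handling of remaining slots are illustrative): weak confidence intervals rule items in or out of the top-$k$, and strong-oracle calls are spent only on the ambiguous boundary.

```python
import numpy as np

def screen_then_certify(weak_mean, weak_radius, strong_oracle, k):
    """Hedged sketch of the two-oracle top-k setting. weak_mean/weak_radius
    define per-item weak confidence intervals (assumed valid);
    strong_oracle(i) returns the exact value of item i and is expensive."""
    lower, upper = weak_mean - weak_radius, weak_mean + weak_radius
    kth_lower = np.sort(lower)[-k]               # k-th largest lower bound
    kth_upper = np.sort(upper)[-k]               # k-th largest upper bound
    certain_in = np.where(lower > kth_upper)[0]  # beats >= n-k items for sure
    certain_out = np.where(upper < kth_lower)[0] # beaten by >= k items
    boundary = np.setdiff1d(np.arange(len(weak_mean)),
                            np.union1d(certain_in, certain_out))
    exact = {i: strong_oracle(i) for i in boundary}  # strong calls spent here
    ranked = sorted(boundary, key=lambda i: exact[i], reverse=True)
    topk = list(certain_in) + ranked[: k - len(certain_in)]
    return topk, len(boundary)                   # second value = strong calls
```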

[456] ELaTTA: Efficient Latent Test-Time Adaptation

Xinyu Luo, Jie Liu, Kecheng Chen, Junyi Yang, Bo Ding, Arindam Basu, Haoliang Li

Main category: cs.LG

TL;DR: ELaTTA: Gradient-free test-time adaptation using low-dimensional latent subspace optimization for efficient on-device deployment.

DetailsMotivation: Existing test-time adaptation methods are unsuitable for edge deployment due to reliance on backpropagation, activation buffering, and batch processing, causing high latency and memory overhead. Need for efficient single-instance adaptation under strict on-device constraints.

Method: Freezes model weights and adapts each test sample by optimizing low-dimensional coefficient vector in pre-computed principal latent subspace (via truncated SVD). Uses CMA-ES to optimize coefficients for prediction confidence, effectively optimizing Gaussian-smoothed objective for stability near decision boundaries.

Result: Achieves state-of-the-art accuracy across six benchmarks and multiple architectures under strict and continual single-instance protocols. Reduces compute by up to 63× and peak memory by 11×. Demonstrated on-device deployment on ZYNQ-7020 platform.

Conclusion: ELaTTA provides efficient gradient-free test-time adaptation suitable for edge deployment with minimal overhead, addressing practical constraints of real-world deployment.

Abstract: Real-world deployment often exposes models to distribution shifts, making test-time adaptation (TTA) critical for robustness. Yet most TTA methods are unfriendly to edge deployment, as they rely on backpropagation, activation buffering, or test-time mini-batches, leading to high latency and memory overhead. We propose $\textbf{ELaTTA}$ ($\textit{Efficient Latent Test-Time Adaptation}$), a gradient-free framework for single-instance TTA under strict on-device constraints. ELaTTA freezes model weights and adapts each test sample by optimizing a low-dimensional coefficient vector in a source-induced principal latent subspace, pre-computed offline via truncated SVD and stored with negligible overhead. At inference, ELaTTA encourages prediction confidence by optimizing the $k$-D coefficients with CMA-ES, effectively optimizing a Gaussian-smoothed objective and improving stability near decision boundaries. Across six benchmarks and multiple architectures, ELaTTA achieves state-of-the-art accuracy under both strict and continual single-instance protocols, while reducing compute by up to $\textit{63$\times$}$ and peak memory by $\textit{11$\times$}$. We further demonstrate on-device deployment on a ZYNQ-7020 platform. Code will be released upon acceptance.
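A numpy-only sketch of the latent-subspace adaptation described above. The paper optimizes the coefficients with CMA-ES; a simple population-based random search stands in here to keep the example dependency-free, and all interfaces (`head`, latent shapes) are illustrative.

```python
import numpy as np

def precompute_basis(source_latents, k=8):
    """Offline step (illustrative): principal latent directions via truncated SVD."""
    mu = source_latents.mean(axis=0)
    _, _, vt = np.linalg.svd(source_latents - mu, full_matrices=False)
    return vt[:k]                                  # (k, d) subspace basis

def adapt_single(z, V, head, iters=50, sigma=0.1, pop=16, seed=0):
    """Test-time step: search a k-dim coefficient vector that shifts the
    frozen latent z along the basis V to maximize prediction confidence.
    Model weights stay frozen; no gradients are used."""
    rng = np.random.default_rng(seed)

    def confidence(c):
        logits = head(z + c @ V)                   # frozen classifier head
        p = np.exp(logits - logits.max())
        return (p / p.sum()).max()                 # max softmax probability

    best_c = np.zeros(V.shape[0])
    best_f = confidence(best_c)
    for _ in range(iters):
        cands = best_c + sigma * rng.standard_normal((pop, V.shape[0]))
        scores = np.array([confidence(c) for c in cands])
        if scores.max() > best_f:
            best_f, best_c = scores.max(), cands[scores.argmax()]
    return z + best_c @ V                          # adapted latent for this sample
```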

[457] The Depth Delusion: Why Transformers Should Be Wider, Not Deeper

Md Muhtasim Munif Fahim, Md Rezaul Karim

Main category: cs.LG

TL;DR: Architecture-conditioned scaling laws reveal optimal depth-width tradeoffs in transformers, showing width should grow 2.8x faster than depth, with a critical depth beyond which adding layers harms performance despite adding parameters.

DetailsMotivation: Current neural scaling laws treat architecture as interchangeable, ignoring how different depth-width configurations affect performance at scale. The paper aims to understand optimal architecture scaling beyond just parameter count.

Method: Proposed architecture-conditioned scaling laws that decompose loss dependence on depth and width. Analyzed 30 transformer architectures from 17M to 7B parameters trained on representative high-compute samples to derive scaling relationships.

Result: Found optimal depth scales as D* ~ C^0.12 and optimal width as W* ~ C^0.34, meaning width grows 2.8x faster than depth. Discovered critical depth phenomenon: beyond D_crit ~ W^0.44, adding layers increases loss despite adding parameters. Validated with R^2 = 0.922 across architectures.

Conclusion: Optimal depth-width tradeoffs persist at production scale, with width being more important than depth for scaling. The “Depth Delusion” shows that beyond critical depth, deeper models underperform despite having more parameters.

Abstract: Neural scaling laws describe how language model loss decreases with parameters and data, but treat architecture as interchangeable–a billion parameters could arise from a shallow-wide model (10 layers & 8,192 hidden dimension) or a deep-narrow one (80 layers & 2,048 hidden dimension). We propose architecture-conditioned scaling laws decomposing this dependence, finding that optimal depth scales as D* ~ C^0.12 while optimal width scales as W* ~ C^0.34, meaning width should grow 2.8x faster than depth. We discover a critical depth phenomenon: beyond D_crit ~ W^0.44 (sublinear in W), adding layers increases loss despite adding parameters–the Depth Delusion. Empirically, we validate these findings across 30 transformer architectures spanning 17M to 7B parameters, each trained on representative high-compute samples, achieving R^2 = 0.922. Our central finding: at 7B scale, a 64-layer model (6.38B params) underperforms a 32-layer model (6.86B params) by 0.12 nats, despite being significantly deeper. This demonstrates that optimal depth-width tradeoffs persist at the production scale.
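Taken at face value, the two fitted exponents pin down how width and depth should co-scale, and the critical-depth fit bounds the useful depth for a given width:

$$ D^{*} \propto C^{0.12}, \quad W^{*} \propto C^{0.34} \;\Rightarrow\; W^{*} \propto (D^{*})^{0.34/0.12} \approx (D^{*})^{2.8}, \qquad D_{\mathrm{crit}} \propto W^{0.44}. $$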

[458] Scaling Next-Brain-Token Prediction for MEG

Richard Csaky

Main category: cs.LG

TL;DR: A large autoregressive model for source-space MEG that scales next-token prediction to long context across datasets and scanners, generating minutes of MEG from context using a modified vector-quantizer and Qwen2.5-VL backbone.

DetailsMotivation: To develop a model that can handle large-scale MEG data across different datasets and scanners, enabling long-context brain activity generation and evaluation of neurophysiological fidelity.

Method: Modified SEANet-style vector-quantizer reduces multichannel MEG into flattened token stream, then train Qwen2.5-VL backbone from scratch for next-token prediction and recursive MEG generation. Evaluation uses task-matched tests: on-manifold stability and conditional specificity with neurophysiologically grounded metrics.

Result: Model trained on CamCAN and Omega datasets, evaluated on held-out MOUS dataset showing cross-dataset generalization. Generations remain stable over long rollouts and are closer to correct continuation than swapped controls across metrics.

Conclusion: Successfully developed a scalable autoregressive model for MEG generation that generalizes across datasets and maintains neurophysiological fidelity over long sequences.

Abstract: We present a large autoregressive model for source-space MEG that scales next-token prediction to long context across datasets and scanners: handling a corpus of over 500 hours and thousands of sessions across the three largest MEG datasets. A modified SEANet-style vector-quantizer reduces multichannel MEG into a flattened token stream on which we train a Qwen2.5-VL backbone from scratch to predict the next brain token and to recursively generate minutes of MEG from up to a minute of context. To evaluate long-horizon generation, we introduce task-matched tests: (i) on-manifold stability via generated-only drift compared to the time-resolved distribution of real sliding windows, and (ii) conditional specificity via correct context versus prompt-swap controls using a neurophysiologically grounded metric set. We train on CamCAN and Omega and run all analyses on held-out MOUS, establishing cross-dataset generalization. Across metrics, generations remain relatively stable over long rollouts and are closer to the correct continuation than swapped controls. Code available at: https://github.com/ricsinaruto/brain-gen.

[459] MADE: Benchmark Environments for Closed-Loop Materials Discovery

Shreshth A Malik, Tiarnan Doherty, Panagiotis Tigas, Muhammed Razzak, Stephen J. Roberts, Aron Walsh, Yarin Gal

Main category: cs.LG

TL;DR: MADE is a benchmarking framework for end-to-end autonomous materials discovery pipelines that simulates closed-loop discovery campaigns with constrained budgets, enabling evaluation of iterative scientific workflows.

DetailsMotivation: Existing materials discovery benchmarks focus on static predictive tasks or isolated sub-tasks, neglecting the iterative, adaptive nature of real scientific discovery workflows that involve sequential decision-making under resource constraints.

Method: MADE simulates closed-loop discovery campaigns where agents propose, evaluate, and refine candidate materials under constrained oracle budgets. It formalizes discovery as searching for thermodynamically stable compounds relative to convex hulls, with flexible composition of agents from interchangeable components like generative models, filters, and planners.

Result: The framework enables systematic experiments across material systems, ablation studies of pipeline components, and comparison of method scalability with system complexity. It demonstrates evaluation of diverse workflows from fixed pipelines to fully agentic systems with tool use.

Conclusion: MADE provides a comprehensive benchmarking framework for end-to-end autonomous materials discovery that captures the sequential, resource-limited nature of real scientific workflows, enabling better evaluation and comparison of discovery algorithms.

Abstract: Existing benchmarks for computational materials discovery primarily evaluate static predictive tasks or isolated computational sub-tasks. While valuable, these evaluations neglect the inherently iterative and adaptive nature of scientific discovery. We introduce MAterials Discovery Environments (MADE), a novel framework for benchmarking end-to-end autonomous materials discovery pipelines. MADE simulates closed-loop discovery campaigns in which an agent or algorithm proposes, evaluates, and refines candidate materials under a constrained oracle budget, capturing the sequential and resource-limited nature of real discovery workflows. We formalize discovery as a search for thermodynamically stable compounds relative to a given convex hull, and evaluate efficacy and efficiency via comparison to baseline algorithms. The framework is flexible; users can compose discovery agents from interchangeable components such as generative models, filters, and planners, enabling the study of arbitrary workflows ranging from fixed pipelines to fully agentic systems with tool use and adaptive decision making. We demonstrate this by conducting systematic experiments across a family of systems, enabling ablation of components in discovery pipelines, and comparison of how methods scale with system complexity.
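Schematically, the closed-loop protocol the benchmark formalizes looks like the loop below; component names and interfaces are placeholders rather than the actual MADE API.

```python
def discovery_campaign(generator, screen, oracle, budget, batch_size=16):
    """Hedged sketch of a budgeted closed-loop discovery run. oracle(c) is
    the expensive stability evaluation (e.g., energy above the convex hull);
    the campaign stops when the oracle budget is exhausted."""
    discovered, history, spent = [], [], 0
    while spent < budget:
        candidates = [c for c in generator.propose(batch_size) if screen(c)]
        for c in candidates[: budget - spent]:
            e_hull = oracle(c)                 # consumes one budget unit
            spent += 1
            history.append((c, e_hull))
            if e_hull <= 0.0:                  # stable w.r.t. the hull
                discovered.append(c)
        generator.update(history)              # adapt proposals to feedback
    return discovered, spent
```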

[460] Adaptive Swarm Mesh Refinement using Deep Reinforcement Learning with Local Rewards

Niklas Freymuth, Philipp Dahlinger, Tobias Würth, Simon Reisch, Luise Kärger, Gerhard Neumann

Main category: cs.LG

TL;DR: ASMR++: A novel Adaptive Mesh Refinement approach using collaborating homogeneous agents that iteratively split to reduce maximum mesh element error, outperforming heuristic and learned baselines while generalizing across domains.

DetailsMotivation: Traditional FEM becomes computationally expensive for complex problems. Classical AMR methods rely on heuristics or expensive error estimators, while machine learning-based AMR methods only scale to simple problems. There's a need for efficient, scalable AMR that can handle complex simulations.

Method: Formulates AMR as a system of collaborating homogeneous agents that iteratively split into multiple new agents. Uses spatial reward formulation focused on reducing maximum mesh element error. ASMR++ enables efficient, stable optimization and generates adaptive meshes at user-defined resolution during inference.

Result: Outperforms heuristic approaches and learned baselines, matches performance of expensive error-based oracle AMR strategies. Generalizes to different domains during inference. Produces meshes that simulate up to 2 orders of magnitude faster than uniform refinements in demanding settings.

Conclusion: ASMR++ provides an effective agent-based approach to AMR that offers computational efficiency, generalization capability, and performance matching expensive oracle methods while being more scalable than existing ML-based approaches.

Abstract: Simulating physical systems is essential in engineering, but analytical solutions are limited to straightforward problems. Consequently, numerical methods like the Finite Element Method (FEM) are widely used. However, the FEM becomes computationally expensive as problem complexity and accuracy demands increase. Adaptive Mesh Refinement (AMR) improves the FEM by dynamically placing mesh elements on the domain, balancing computational speed and accuracy. Classical AMR depends on heuristics or expensive error estimators, which may lead to suboptimal performance for complex simulations. While AMR methods based on machine learning are promising, they currently only scale to simple problems. In this work, we formulate AMR as a system of collaborating, homogeneous agents that iteratively split into multiple new agents. This agent-wise perspective enables a spatial reward formulation focused on reducing the maximum mesh element error. Our approach, Adaptive Swarm Mesh Refinement++ (ASMR++), offers efficient, stable optimization and generates highly adaptive meshes at user-defined resolution at inference time. Extensive experiments demonstrate that ASMR++ outperforms heuristic approaches and learned baselines, matching the performance of expensive error-based oracle AMR strategies. ASMR additionally generalizes to different domains during inference, and produces meshes that simulate up to 2 orders of magnitude faster than uniform refinements in more demanding settings.

[461] Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

Main category: cs.LG

TL;DR: New benchmarks ORDebug and ORBias evaluate LLMs on iterative debugging of infeasible optimization models and behavioral rationality in newsvendor problems, showing domain-specific training outperforms scale.

DetailsMotivation: Existing LLM benchmarks for Operations Research only test one-shot code generation, ignoring the crucial iterative debugging process that practitioners actually use to fix infeasible models.

Method: Introduces two benchmarks: ORDebug (5,000+ problems with 9 error types) that evaluates iterative self-correction with solver feedback, and ORBias (2,000 newsvendor instances) that measures systematic deviations from optimal policies.

Result: Domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate, 62.4% vs 47.8% diagnostic accuracy, and 2.25 vs 3.78 steps to resolution. Curriculum training reduces systematic bias by 48%.

Conclusion: Process-level evaluation with verifiable oracles enables targeted training that outperforms scale, demonstrating the importance of evaluating LLMs in realistic iterative workflows rather than one-shot tasks.

Abstract: Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation – given a problem description, generate solver code – ignoring this diagnostic loop entirely. We introduce two benchmarks that place the solver in the evaluation loop. ORDebug evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and IIS recomputation, providing deterministic, verifiable feedback. ORBias evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3% vs 86.2% recovery rate (+9.1%), 62.4% vs 47.8% diagnostic accuracy (+14.6%), and 2.25 vs 3.78 steps to resolution (1.7$\times$ faster). On ORBias, curriculum training achieves the only negative ID$\rightarrow$OOD bias drift among models evaluated (-9.6%), reducing systematic bias by 48% (from 20.0% to 10.4%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.

[462] Order-Aware Test-Time Adaptation: Leveraging Temporal Dynamics for Robust Streaming Inference

Young Kyung Kim, Oded Schlesinger, Qiangqiang Wu, J. Matías Di Martino, Guillermo Sapiro

Main category: cs.LG

TL;DR: OATTA introduces temporal dynamics modeling for test-time adaptation by treating test streams as time series with temporal dependencies, using Bayesian estimation with learned transition matrices and a safety gate for weakly structured data.

DetailsMotivation: Existing test-time adaptation methods treat test-time streams as independent samples, ignoring the valuable supervisory signal inherent in temporal dynamics that could improve adaptation to distribution shifts.

Method: Formulates TTA as gradient-free recursive Bayesian estimation using learned dynamic transition matrices as temporal priors. Includes a likelihood-ratio gate for safety when temporal evidence is weak. The approach is lightweight and model-agnostic.

Result: Extensive experiments across image classification, wearable/physiological signal analysis, and language sentiment analysis show OATTA consistently boosts established baselines, improving accuracy by up to 6.35%.

Conclusion: Modeling temporal dynamics provides a critical orthogonal signal beyond standard order-agnostic TTA approaches, demonstrating the importance of temporal structure in test-time adaptation.

Abstract: Test-Time Adaptation (TTA) enables pre-trained models to adjust to distribution shift by learning from unlabeled test-time streams. However, existing methods typically treat these streams as independent samples, overlooking the supervisory signal inherent in temporal dynamics. To address this, we introduce Order-Aware Test-Time Adaptation (OATTA). We formulate test-time adaptation as a gradient-free recursive Bayesian estimation task, using a learned dynamic transition matrix as a temporal prior to refine the base model’s predictions. To ensure safety in weakly structured streams, we introduce a likelihood-ratio gate (LLR) that reverts to the base predictor when temporal evidence is absent. OATTA is a lightweight, model-agnostic module that incurs negligible computational overhead. Extensive experiments across image classification, wearable and physiological signal analysis, and language sentiment analysis demonstrate its universality; OATTA consistently boosts established baselines, improving accuracy by up to 6.35%. Our findings establish that modeling temporal dynamics provides a critical, orthogonal signal beyond standard order-agnostic TTA approaches.
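One refinement step of the filter described above might look like the sketch below; the gate statistic, its threshold, and how the transition matrix T is learned are simplifications, not the paper's exact choices.

```python
import numpy as np

def oatta_step(p_model, prior_prev, T, gate_log_threshold=0.0):
    """Hedged sketch of one order-aware refinement.
    p_model:    base model's softmax output for the current sample, shape (C,)
    prior_prev: previous filtered posterior, shape (C,)
    T:          learned transition matrix, T[i, j] ~ P(y_t = j | y_{t-1} = i)"""
    C = len(p_model)
    prior_t = prior_prev @ T                      # propagate temporal prior
    posterior = prior_t * p_model
    posterior = posterior / posterior.sum()
    # Likelihood-ratio gate: compare evidence under the temporal prior with
    # evidence under a uniform (order-agnostic) prior; revert when weak.
    llr = np.log((prior_t * p_model).sum()) - np.log(1.0 / C)
    if llr < gate_log_threshold:
        return p_model, prior_prev                # fall back to base predictor
    return posterior, posterior
```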

[463] Conditional Denoising Model as a Physical Surrogate Model

José Afonso, Pedro Viegas, Rodrigo Ventura, Vasco Guerra

Main category: cs.LG

TL;DR: Conditional Denoising Model (CDM) learns physical manifold geometry through denoising training, enabling deterministic projection onto solution subspace without explicit physics constraints.

DetailsMotivation: Current surrogate modeling faces trade-off between accuracy and physical consistency; physics-consistent approaches often fail to guarantee strict adherence to governing equations or rely on post-processing that doesn't learn underlying solution geometry.

Method: Introduces Conditional Denoising Model (CDM) - a generative model that learns physical manifold geometry by training network to restore clean states from noisy ones, learning a vector field pointing toward valid solution subspace. Uses time-independent formulation transforming inference into deterministic fixed-point iteration.

Result: CDM achieves higher parameter and data efficiency than physics-consistent baselines on low-temperature plasma physics benchmark. Denoising objective acts as powerful implicit regularizer - despite never seeing governing equations during training, model adheres to physical constraints more strictly than baselines with explicit physics losses.

Conclusion: CDM provides novel approach to physics-consistent surrogate modeling by learning solution manifold geometry through denoising, enabling strict physical adherence without explicit constraint enforcement.

Abstract: Surrogate modeling for complex physical systems typically faces a trade-off between data-fitting accuracy and physical consistency. Physics-consistent approaches typically treat physical laws as soft constraints within the loss function, a strategy that frequently fails to guarantee strict adherence to the governing equations, or rely on post-processing corrections that do not intrinsically learn the underlying solution geometry. To address these limitations, we introduce the {Conditional Denoising Model (CDM)}, a generative model designed to learn the geometry of the physical manifold itself. By training the network to restore clean states from noisy ones, the model learns a vector field that points continuously towards the valid solution subspace. We introduce a time-independent formulation that transforms inference into a deterministic fixed-point iteration, effectively projecting noisy approximations onto the equilibrium manifold. Validated on a low-temperature plasma physics and chemistry benchmark, the CDM achieves higher parameter and data efficiency than physics-consistent baselines. Crucially, we demonstrate that the denoising objective acts as a powerful implicit regularizer: despite never seeing the governing equations during training, the model adheres to physical constraints more strictly than baselines trained with explicit physics losses.
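The time-independent inference described above amounts to a deterministic fixed-point iteration; a minimal sketch, with `denoiser(x, cond)` standing in for the trained conditional network:

```python
import numpy as np

def project_to_manifold(denoiser, x0, cond, max_iters=200, tol=1e-6):
    """Hedged sketch of CDM inference: apply the learned denoiser repeatedly
    until the state stops moving, i.e. until it sits (to tolerance) on the
    learned equilibrium manifold."""
    x = x0
    for _ in range(max_iters):
        x_next = denoiser(x, cond)                # one denoising step
        if np.linalg.norm(x_next - x) < tol * max(1.0, np.linalg.norm(x)):
            return x_next
        x = x_next
    return x
```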

[464] SIGMA-PPG: Statistical-prior Informed Generative Masking Architecture for PPG Foundation Model

Zongheng Guo, Tao Chen, Yang Jiao, Yi Pan, Xiao Hu, Manuela Ferrario

Main category: cs.LG

TL;DR: SIGMA-PPG is a generative foundation model for PPG signals that uses prior-guided adversarial masking and semantic consistency constraints to overcome redundancy and noise challenges.

DetailsMotivation: Current foundation models for PPG signals face challenges with intrinsic redundancy and noise. Standard masked modeling yields trivial solutions while contrastive methods lack morphological precision.

Method: Proposes Statistical-prior Informed Generative Masking Architecture (SIGMA-PPG) with Prior-Guided Adversarial Masking mechanism using reinforcement learning-driven teacher with statistical priors, and semantic consistency constraint via vector quantization.

Result: Pre-trained on over 120,000 hours of data, achieves superior average performance compared to five state-of-the-art baselines across 12 diverse downstream tasks.

Conclusion: SIGMA-PPG effectively addresses PPG signal challenges through generative modeling with adversarial masking and semantic consistency, demonstrating strong performance across multiple tasks.

Abstract: Current foundation models for photoplethysmography (PPG) signals are challenged by the intrinsic redundancy and noise of the signal. Standard masked modeling often yields trivial solutions while contrastive methods lack morphological precision. To address these limitations, we propose a Statistical-prior Informed Generative Masking Architecture (SIGMA-PPG), a generative foundation model featuring a Prior-Guided Adversarial Masking mechanism, where a reinforcement learning-driven teacher leverages statistical priors to create challenging learning paths that prevent overfitting to noise. We also incorporate a semantic consistency constraint via vector quantization to ensure that physiologically identical waveforms (even those altered by recording artifacts or minor perturbations) map to shared indices. This enhances codebook semantic density and eliminates redundant feature structures. Pre-trained on over 120,000 hours of data, SIGMA-PPG achieves superior average performance compared to five state-of-the-art baselines across 12 diverse downstream tasks. The code is available at https://github.com/ZonghengGuo/SigmaPPG.

[465] Predict-Project-Renoise: Sampling Diffusion Models under Hard Constraints

Omer Rochman-Sharabi, Gilles Louppe

Main category: cs.LG

TL;DR: A constrained diffusion sampling framework (PPR) that enforces hard physical constraints during generation for scientific applications

DetailsMotivation: Neural emulators based on diffusion models show promise for scientific applications but vanilla models cannot guarantee physical accuracy or constraint satisfaction, requiring methods to enforce hard constraints like physical laws or observational consistency

Method: Introduces Predict-Project-Renoise (PPR) - an iterative algorithm that samples from constrained marginals by alternating between denoising predictions, projecting onto the feasible set, and renoising; defines a constrained forward process that diffuses only over constraint-satisfying samples

Result: PPR reduces constraint violations by over an order of magnitude while improving sample consistency and better matching the true constrained distribution compared to baselines in experiments on 2D distributions, PDEs, and global weather forecasting

Conclusion: The constrained sampling framework successfully enforces hard constraints in diffusion models for scientific applications, making neural emulators more physically accurate and reliable

Abstract: Neural emulators based on diffusion models show promise for scientific applications, but vanilla models cannot guarantee physical accuracy or constraint satisfaction. We address this by introducing a constrained sampling framework that enforces hard constraints, such as physical laws or observational consistency, at generation time. Our approach defines a constrained forward process that diffuses only over the feasible set of constraint-satisfying samples, inducing constrained marginal distributions. To reverse this, we propose Predict-Project-Renoise (PPR), an iterative algorithm that samples from the constrained marginals by alternating between denoising predictions, projecting onto the feasible set, and renoising. Experiments on 2D distributions, PDEs, and global weather forecasting demonstrate that PPR reduces constraint violations by over an order of magnitude while improving sample consistency and better matching the true constrained distribution compared to baselines.
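The reverse loop described above can be written schematically as follows (DDPM-style notation; `denoise`, `project`, and the noise schedule are placeholders for the paper's components, and the renoising step shown is the simplest posterior-free variant):

```python
import numpy as np

def ppr_sample(denoise, project, alphas_bar, x_T, seed=0):
    """Hedged sketch of Predict-Project-Renoise: at each reverse step,
    predict the clean state, project it onto the constraint set, then
    renoise to the next (lower) noise level."""
    rng = np.random.default_rng(seed)
    x = x_T
    for t in reversed(range(len(alphas_bar))):
        x0_hat = project(denoise(x, t))            # predict, then hard-project
        if t == 0:
            return x0_hat
        a_prev = alphas_bar[t - 1]
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * noise  # renoise
    return x
```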

[466] Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning

Chengzu Li, Zanyi Wang, Jiaang Li, Yi Xu, Han Zhou, Huanyu Zhang, Ruichuan An, Dengyang Jiang, Zhaochong An, Ivan Vulić, Serge Belongie, Anna Korhonen

Main category: cs.LG

TL;DR: Video generation models can serve as intermediate reasoning steps for visual reasoning tasks, showing strong zero-shot generalization, effective visual context utilization, and test-time scaling benefits.

DetailsMotivation: Vision-Language Models struggle with fine-grained spatial understanding and continuous action planning needed for complex visual reasoning. The paper explores whether video generation models can serve as intermediate reasoning steps between initial states and solutions.

Method: Formulates visual reasoning using video generation models, evaluating on two regimes: Maze Navigation (sequential discrete planning with low visual change) and Tangram Puzzle (continuous manipulation with high visual change).

Result: Three key findings: 1) Robust zero-shot generalization on unseen data distributions, 2) Effective use of visual context as explicit control, 3) Test-time scaling law where longer generated videos improve zero-shot generalization to complex paths.

Conclusion: Video generation is not just a media tool but a scalable, generalizable paradigm for visual reasoning that can handle both discrete planning and continuous manipulation tasks.

Abstract: Vision-Language Models have excelled at textual reasoning, but they often struggle with fine-grained spatial understanding and continuous action planning, failing to simulate the dynamics required for complex visual reasoning. In this work, we formulate visual reasoning by means of video generation models, positing that generated frames can act as intermediate reasoning steps between initial states and solutions. We evaluate their capacity in two distinct regimes: Maze Navigation for sequential discrete planning with low visual change and Tangram Puzzle for continuous manipulation with high visual change. Our experiments reveal three critical insights: (1) Robust Zero-Shot Generalization: In both tasks, the model demonstrates strong performance on unseen data distributions without specific finetuning. (2) Visual Context: The model effectively uses visual context as explicit control, such as agent icons and tangram shapes, enabling it to maintain high visual consistency and adapt its planning capability robustly to unseen patterns. (3) Visual Test-Time Scaling: We observe a test-time scaling law in sequential planning; increasing the generated video length (visual inference budget) empowers better zero-shot generalization to spatially and temporally complex paths. These findings suggest that video generation is not merely a media tool, but a scalable, generalizable paradigm for visual reasoning.

[467] Test-Time Adaptation for Unsupervised Combinatorial Optimization

Yiqiao Liao, Farinaz Koushanfar, Parinaz Naghizadeh

Main category: cs.LG

TL;DR: TACO: A test-time adaptation framework that bridges generalization-focused and instance-specific unsupervised neural combinatorial optimization by strategically relaxing trained parameters while preserving inductive bias.

DetailsMotivation: Existing unsupervised NCO approaches are divided into models trained for generalization (efficient but lack instance-wise adaptability) and instance-specific models (flexible but prone to poor local optima and don't exploit learned structure). The paper aims to bridge these two paradigms to leverage learned inductive bias while enabling effective instance-wise adaptation.

Method: Proposes TACO, a model-agnostic test-time adaptation framework that uses strategic warm-starting to partially relax trained parameters while preserving inductive bias. This enables rapid unsupervised adaptation at test time without the computational cost of training from scratch or the poor performance of naive fine-tuning.

Result: TACO achieves better solution quality than naive fine-tuning of generalizable models or optimizing instance-specific models from scratch, with negligible additional computational cost. Experiments on Minimum Vertex Cover and Maximum Clique problems demonstrate effectiveness across static, distribution-shifted, and dynamic CO problems.

Conclusion: TACO successfully bridges generalizable and instance-specific unsupervised NCO paradigms, providing a practical framework that leverages learned inductive bias while enabling effective test-time adaptation for combinatorial optimization problems.

Abstract: Unsupervised neural combinatorial optimization (NCO) enables learning powerful solvers without access to ground-truth solutions. Existing approaches fall into two disjoint paradigms: models trained for generalization across instances, and instance-specific models optimized independently at test time. While the former are efficient during inference, they lack effective instance-wise adaptability; the latter are flexible but fail to exploit learned inductive structure and are prone to poor local optima. This motivates the central question of our work: how can we leverage the inductive bias learned through generalization while unlocking the flexibility required for effective instance-wise adaptation? We first identify a challenge in bridging these two paradigms: generalization-focused models often constitute poor warm starts for instance-wise optimization, potentially underperforming even randomly initialized models when fine-tuned at test time. To resolve this incompatibility, we propose TACO, a model-agnostic test-time adaptation framework that unifies and extends the two existing paradigms for unsupervised NCO. TACO applies strategic warm-starting to partially relax trained parameters while preserving inductive bias, enabling rapid and effective unsupervised adaptation. Crucially, compared to naively fine-tuning a trained generalizable model or optimizing an instance-specific model from scratch, TACO achieves better solution quality while incurring negligible additional computational cost. Experiments on canonical CO problems, Minimum Vertex Cover and Maximum Clique, demonstrate the effectiveness and robustness of TACO across static, distribution-shifted, and dynamic combinatorial optimization problems, establishing it as a practical bridge between generalizable and instance-specific unsupervised NCO.

[468] SMKC: Sketch Based Kernel Correlation Images for Variable Cardinality Time Series Anomaly Detection

Haokun Zhou

Main category: cs.LG

TL;DR: SMKC framework for anomaly detection in multivariate time series with sensor churn, using permutation-invariant feature hashing and hybrid kernel images to handle variable input dimensions.

DetailsMotivation: Real-world monitoring systems experience sensor churn where signals appear, disappear, or are renamed, creating data windows with varying cardinality and unseen values during training, which conventional anomaly detection methods cannot handle.

Method: SMKC decouples dynamic input structure from anomaly detection using permutation-invariant feature hashing to sketch raw inputs into fixed-size state sequences, constructs hybrid kernel images to capture global temporal structure through pairwise comparisons, and learns normal patterns via masked reconstruction and teacher-student prediction.

Result: Robust log-distance channels provide primary discriminative signal while cosine representations often fail; detector using random projections and nearest neighbors on SMKC representation performs competitively with fully trained baselines without gradient updates.

Conclusion: SMKC effectively handles sensor churn in multivariate time series anomaly detection, with the representation itself being highly effective, offering practical cold-start solutions for resource-constrained deployments.

Abstract: Conventional anomaly detection in multivariate time series relies on the assumption that the set of observed variables remains static. In operational environments, however, monitoring systems frequently experience sensor churn. Signals may appear, disappear, or be renamed, creating data windows where the cardinality varies and may include values unseen during training. To address this challenge, we propose SMKC, a framework that decouples the dynamic input structure from the anomaly detector. We first employ permutation-invariant feature hashing to sketch raw inputs into a fixed size state sequence. We then construct a hybrid kernel image to capture global temporal structure through pairwise comparisons of the sequence and its derivatives. The model learns normal patterns using masked reconstruction and a teacher-student prediction objective. Our evaluation reveals that robust log-distance channels provide the primary discriminative signal, whereas cosine representations often fail to capture sufficient contrast. Notably, we find that a detector using random projections and nearest neighbors on the SMKC representation performs competitively with fully trained baselines without requiring gradient updates. This highlights the effectiveness of the representation itself and offers a practical cold-start solution for resource-constrained deployments.
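The two preprocessing stages can be sketched as follows; the bucket count, the sign trick, and the specific kernels are illustrative choices rather than the paper's exact configuration.

```python
import hashlib
import numpy as np

def sketch_window(window, n_buckets=32):
    """Permutation-invariant feature hashing: map a {sensor_name: series}
    window of arbitrary (and possibly unseen) cardinality to a fixed
    (T, n_buckets) state sequence."""
    T = len(next(iter(window.values())))
    state = np.zeros((T, n_buckets))
    for name, series in window.items():
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        sign = 1.0 if (h >> 8) % 2 == 0 else -1.0   # hashing-trick sign bit
        state[:, h % n_buckets] += sign * np.asarray(series)
    return state

def kernel_image(state):
    """Pairwise-comparison image over time steps: a robust log-distance
    channel on the state and one on its first temporal derivative."""
    def log_dist(a):
        d = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1)
        return np.log1p(d)
    deriv = np.diff(state, axis=0, prepend=state[:1])
    return np.stack([log_dist(state), log_dist(deriv)])    # (2, T, T)
```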

[469] Snowball: A Scalable All-to-All Ising Machine with Dual-Mode Markov Chain Monte Carlo Spin Selection and Asynchronous Spin Updates for Fast Combinatorial Optimization

Seungki Hong, Kyeongwon Jeong, Taekwang Jang

Main category: cs.LG

TL;DR: Snowball: A digital, scalable, all-to-all coupled Ising machine with dual-mode MCMC spin selection and asynchronous updates that reduces time-to-solution by 8× compared to state-of-the-art on same benchmark.

DetailsMotivation: To enable practical deployment of Ising machines for combinatorial optimization by addressing three key challenges: (1) hardware topology limitations requiring minor embedding, (2) spin selection and update algorithms that can oscillate or stall with naive parallel updates, and (3) scalable coupling-coefficient precision that affects solution quality.

Method: Presents Snowball, a digital, scalable, all-to-all coupled Ising machine architecture. It integrates dual-mode Markov chain Monte Carlo (MCMC) spin selection with asynchronous spin updates to promote convergence. The digital architecture supports wide, configurable coupling precision, unlike analog realizations. Implemented as a prototype on an AMD Alveo U250 accelerator card.

Result: Achieves an 8× reduction in time-to-solution relative to a state-of-the-art Ising machine on the same benchmark instance. The digital architecture enables configurable coupling precision at high bit widths.

Conclusion: Snowball demonstrates significant improvements in Ising machine performance through digital architecture, advanced spin selection algorithms, and scalable precision, making practical deployment more feasible for combinatorial optimization problems.

Abstract: Ising machines have emerged as accelerators for combinatorial optimization. To enable practical deployment, this work aims to reduce time-to-solution by addressing three challenges: (1) hardware topology, (2) spin selection and update algorithms, and (3) scalable coupling-coefficient precision. Restricted topologies require minor embedding; naive parallel updates can oscillate or stall; and limited precision can preclude feasible mappings or degrade solution quality. This work presents Snowball, a digital, scalable, all-to-all coupled Ising machine that integrates dual-mode Markov chain Monte Carlo spin selection with asynchronous spin updates to promote convergence and reduce time-to-solution. The digital architecture supports wide, configurable coupling precision, unlike many analog realizations at high bit widths. A prototype on an AMD Alveo U250 accelerator card achieves an 8$\times$ reduction in time-to-solution relative to a state-of-the-art Ising machine on the same benchmark instance.
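For orientation, the computation such hardware accelerates is Markov chain Monte Carlo over an all-to-all coupled Ising energy; the reference sweep below is the plain sequential Metropolis baseline in software, not Snowball's dual-mode spin selection or asynchronous hardware update scheme.

```python
import numpy as np

def metropolis_sweep(spins, J, h, beta, rng):
    """One sweep of single-spin Metropolis updates for the Ising energy
    E(s) = -0.5 * s^T J s - h^T s, with s_i in {-1, +1}, J symmetric and
    zero-diagonal. Illustrative software baseline only."""
    n = len(spins)
    for i in rng.permutation(n):
        local_field = J[i] @ spins + h[i]
        dE = 2.0 * spins[i] * local_field      # energy change if spin i flips
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            spins[i] = -spins[i]
    return spins
```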

[470] Human-LLM Collaborative Feature Engineering for Tabular Data

Zhuoyan Li, Aditya Bansal, Jinzhao Li, Shishuang He, Zhuoran Lu, Mutian Zhang, Qin Liu, Yiwei Yang, Swati Jain, Ming Yin, Yunyao Li

Main category: cs.LG

TL;DR: LLM-human collaborative framework for tabular feature engineering that decouples operation proposal (by LLM) from selection (guided by utility modeling and human feedback)

DetailsMotivation: Current LLM-based feature engineering approaches treat LLMs as black-box optimizers that both propose and select operations based on internal heuristics, lacking calibrated utility estimation and leading to inefficient exploration of low-yield operations without principled prioritization strategies.

Method: Proposes a human-LLM collaborative framework that decouples transformation operation proposal (using LLMs to generate candidates) from selection (guided by explicit modeling of operation utility and uncertainty). Incorporates selective human expert preference feedback to compare which operations are more promising, especially useful when accurate utility estimation is difficult in early rounds.

Result: Evaluations on synthetic studies and real user studies demonstrate improved feature engineering performance across various tabular datasets and reduced cognitive load for users during the feature engineering process.

Conclusion: The proposed framework effectively combines LLM capabilities for operation proposal with principled selection mechanisms and human feedback, leading to better feature engineering outcomes and reduced user burden compared to black-box LLM approaches.

Abstract: Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically assign the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics, which often lack calibrated estimations of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human-LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes, where LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits and incorporates human expert preference feedback, comparing which operations are more promising, into the selection process to help identify more effective operations. Our evaluations on both the synthetic study and the real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users’ cognitive load during the feature engineering process.

[471] Signal from Structure: Exploiting Submodular Upper Bounds in Generative Flow Networks

Alexandre Larouche, Audrey Durand

Main category: cs.LG

TL;DR: SUBo-GFN leverages submodular reward structure to derive upper bounds on unseen compositional objects, enabling more efficient training of Generative Flow Networks with significantly more training data from the same number of reward queries.

DetailsMotivation: Generative Flow Networks (GFlowNets) sample compositional objects proportionally to their reward, but require many expensive reward function queries. When rewards have submodular structure, this can be exploited to derive bounds on unseen objects, potentially reducing the number of required queries.

Method: The paper introduces SUBo-GFN, which uses submodularity to compute upper bounds on the rewards of unobserved compositional objects. It analyzes the probability and coverage of such bounds, then applies the Optimism in the Face of Uncertainty principle to train GFNs using these bounds instead of actual reward queries.

Result: SUBo-GFN generates orders of magnitude more training data than classical GFNs for the same number of reward function queries. It demonstrates effectiveness in distribution matching and high-quality candidate generation on both synthetic and real-world submodular tasks.

Conclusion: Harnessing submodular structure through upper bounds enables dramatically more efficient GFN training, making them more practical for applications where reward queries are expensive.

Abstract: Generative Flow Networks (GFlowNets; GFNs) are a class of generative models that learn to sample compositional objects proportionally to their a priori unknown value, their reward. We focus on the case where the reward has a specified, actionable structure, namely that it is submodular. We show submodularity can be harnessed to retrieve upper bounds on the reward of compositional objects that have not yet been observed. We provide in-depth analyses of the probability of such bounds occurring, as well as how many unobserved compositional objects can be covered by a bound. Following the Optimism in the Face of Uncertainty principle, we then introduce SUBo-GFN, which uses the submodular upper bounds to train a GFN. We show that SUBo-GFN generates orders of magnitude more training data than classical GFNs for the same number of queries to the reward function. We demonstrate the effectiveness of SUBo-GFN in terms of distribution matching and high-quality candidate generation on synthetic and real-world submodular tasks.
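
The bound being exploited is plain submodularity: for a submodular reward $f$, $f(A \cup B) \le f(A) + f(B) - f(A \cap B)$, so two paid reward queries bound an unseen object for free. A toy coverage-reward illustration of this idea (not one of the paper's benchmark tasks):

```python
def coverage(sets_chosen, universe_sets):
    """A simple submodular reward: size of the union of the chosen subsets."""
    covered = set()
    for i in sets_chosen:
        covered |= universe_sets[i]
    return len(covered)

# Toy ground set of subsets (hypothetical instance).
universe_sets = {0: {1, 2, 3}, 1: {3, 4}, 2: {5, 6, 7}, 3: {1, 7, 8}}

# Suppose the GFN has already paid for reward queries on these two compositional objects.
A, B = frozenset({0, 1}), frozenset({1, 2})
f_A = coverage(A, universe_sets)
f_B = coverage(B, universe_sets)
f_AiB = coverage(A & B, universe_sets)

# Submodularity gives f(A | B) <= f(A) + f(B) - f(A & B), with no query to f(A | B).
upper_bound = f_A + f_B - f_AiB
print(f"f(A)={f_A}, f(B)={f_B}, bound on f(A ∪ B): {upper_bound}, "
      f"true value: {coverage(A | B, universe_sets)}")
```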

[472] Textual Equilibrium Propagation for Deep Compound AI Systems

Minghui Chen, Wenlong Deng, James Zou, Han Yu, Xiaoxiao Li

Main category: cs.LG

TL;DR: TEP introduces local prompt optimization for deep compound AI systems to address textual gradient issues in long-horizon workflows

DetailsMotivation: Existing global textual feedback methods (like TextGrad) degrade performance in deep compound AI systems due to exploding/vanishing textual gradients as system depth increases

Method: Textual Equilibrium Propagation (TEP) uses local learning with two phases: free phase where local LLM critics iteratively refine prompts to equilibrium, and nudged phase with proximal prompt edits using forward signaling rather than backward feedback chains

Result: TEP consistently improves accuracy and efficiency over global propagation methods like TextGrad across long-horizon QA benchmarks and multi-agent tool-use datasets, with gains increasing with system depth

Conclusion: TEP enables effective optimization of deep compound AI systems while preserving practicality of black-box LLM components, addressing scaling issues of global textual feedback methods

Abstract: Large language models (LLMs) are increasingly deployed as part of compound AI systems that coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Recent approaches that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, but we find that performance degrades as system depth grows. In particular, long-horizon agentic workflows exhibit two depth-scaling failure modes: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long messages and amplifying evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models to overemphasize partial feedback, and compression of lengthy feedback causes downstream messages to gradually lose specificity as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and multi-agent tool-use datasets, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad. The gains grow with depth, while preserving the practicality of black-box LLM components in deep compound AI systems.
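
The free phase is easy to picture as a fixed-point loop around a local critic. The sketch below is purely structural: `critic` stands in for a local LLM call, and the toy critic's behavior is a placeholder, not TEP's actual prompting protocol.

```python
def free_phase(prompt, critic, max_rounds=10):
    """Iteratively apply a local critic until it proposes no further change (equilibrium).

    `critic` is a stand-in for a local LLM call that either returns an improved
    prompt or the input unchanged; its behavior here is purely illustrative.
    """
    for _ in range(max_rounds):
        revised = critic(prompt)
        if revised == prompt:        # no further improvements suggested -> equilibrium
            break
        prompt = revised
    return prompt

# Toy critic: appends one missing instruction, then stops suggesting edits.
def toy_critic(prompt):
    hint = " Answer with a single number."
    return prompt if prompt.endswith(hint) else prompt + hint

print(free_phase("Compute the total cost.", toy_critic))
```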

[473] Out-of-Distribution Generalization in Graph Foundation Models

Haoyang Li, Haibo Chen, Xin Wang, Wenwu Zhu

Main category: cs.LG

TL;DR: Survey paper reviewing graph foundation models (GFMs) from the perspective of out-of-distribution generalization, discussing challenges, approaches, and evaluation protocols for handling distribution shifts in graph learning.

DetailsMotivation: Graph learning models often suffer from limited generalization when applied beyond their training distributions due to various types of distribution shifts (graph structure, domain semantics, modalities, task formulations). The paper aims to survey recent progress on graph foundation models that address these OOD generalization challenges.

Method: Survey methodology: 1) Discuss main challenges of distribution shifts in graph learning and outline unified problem setting, 2) Organize existing approaches based on whether they operate under fixed task specifications or support generalization across heterogeneous tasks, 3) Summarize corresponding OOD handling strategies and pretraining objectives, 4) Review common evaluation protocols.

Result: The paper provides a comprehensive survey of graph foundation models focusing on OOD generalization, organizing the field into systematic categories and identifying key strategies for handling distribution shifts in graph learning.

Conclusion: This is the first survey for OOD generalization in graph foundation models, highlighting the importance of addressing distribution shifts and providing a structured overview of current approaches while identifying open research directions for future work.

Abstract: Graphs are a fundamental data structure for representing relational information in domains such as social networks, molecular systems, and knowledge graphs. However, graph learning models often suffer from limited generalization when applied beyond their training distributions. In practice, distribution shifts may arise from changes in graph structure, domain semantics, available modalities, or task formulations. To address these challenges, graph foundation models (GFMs) have recently emerged, aiming to learn general-purpose representations through large-scale pretraining across diverse graphs and tasks. In this survey, we review recent progress on GFMs from the perspective of out-of-distribution (OOD) generalization. We first discuss the main challenges posed by distribution shifts in graph learning and outline a unified problem setting. We then organize existing approaches based on whether they are designed to operate under a fixed task specification or to support generalization across heterogeneous task formulations, and summarize the corresponding OOD handling strategies and pretraining objectives. Finally, we review common evaluation protocols and discuss open directions for future research. To the best of our knowledge, this paper is the first survey for OOD generalization in GFMs.

[474] LOCUS: Low-Dimensional Model Embeddings for Efficient Model Exploration, Comparison, and Selection

Shivam Patel, William Cocke, Gauri Joshi

Main category: cs.LG

TL;DR: LOCUS generates low-dimensional vector embeddings representing LLM capabilities via attention-based encoding of query evaluations, enabling efficient model management, routing, and similarity analysis.

DetailsMotivation: The rapid growth of LLMs creates challenges in managing and utilizing the vast, dynamic pool of models effectively. There's a need for compact representations of model capabilities that can handle new models without retraining.

Method: LOCUS uses an attention-based approach that generates embeddings through deterministic forward passes over query encodings and evaluation scores via an encoder model. It also trains a correctness predictor using model embeddings and query encodings for routing.

Result: LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce informative embeddings. The learned embedding space is geometrically meaningful, enabling model comparison, clustering, portfolio selection, and proxy modeling.

Conclusion: LOCUS provides an effective method for compactly representing LLM capabilities, enabling efficient model management, routing, and various downstream applications with minimal evaluation data.

Abstract: The rapidly growing ecosystem of Large Language Models (LLMs) makes it increasingly challenging to manage and utilize the vast and dynamic pool of models effectively. We propose LOCUS, a method that produces low-dimensional vector embeddings that compactly represent a language model’s capabilities across queries. LOCUS is an attention-based approach that generates embeddings by a deterministic forward pass over query encodings and evaluation scores via an encoder model, enabling seamless incorporation of new models to the pool and refinement of existing model embeddings without having to perform any retraining. We additionally train a correctness predictor that uses model embeddings and query encodings to achieve state-of-the-art routing accuracy on unseen queries. Experiments show that LOCUS needs up to 4.8x fewer query evaluation samples than baselines to produce informative and robust embeddings. Moreover, the learned embedding space is geometrically meaningful: proximity reflects model similarity, enabling a range of downstream applications including model comparison and clustering, model portfolio selection, and resilient proxies of unavailable models.
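
One way the attention-based encoding described above could look, in a minimal form: attention weights computed from the query encodings pool per-query (encoding, score) pairs into a single fixed-size model embedding. All dimensions and the "learned" parameters below are random placeholders, not LOCUS's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def model_embedding(query_encodings, scores, attn_query, W_v):
    """Pool per-query (encoding, correctness score) pairs into one model embedding.

    query_encodings: (n_queries, d) fixed text encodings of evaluation queries
    scores:          (n_queries,)   evaluation scores of one model on those queries
    attn_query:      (d,)           learned attention query (random placeholder here)
    W_v:             (d + 1, k)     learned value projection (random placeholder here)
    """
    values = np.concatenate([query_encodings, scores[:, None]], axis=1)  # (n, d+1)
    weights = softmax(query_encodings @ attn_query)                      # (n,)
    return weights @ values @ W_v                                        # (k,)

d, k, n = 16, 8, 50
queries = rng.normal(size=(n, d))
scores = rng.integers(0, 2, size=n).astype(float)
emb = model_embedding(queries, scores, attn_query=rng.normal(size=d),
                      W_v=rng.normal(size=(d + 1, k)))
print(emb.shape)  # (8,)
```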

[475] MapPFN: Learning Causal Perturbation Maps in Context

Marvin Sextro, Weronika Kłos, Gabriel Dernbach

Main category: cs.LG

TL;DR: MapPFN is a meta-learning approach using prior-data fitted networks to predict cellular perturbation effects across unseen biological contexts through in-context learning without gradient optimization.

DetailsMotivation: Biological intervention planning requires treatment-effect models that can adapt to unseen contexts by identifying specific underlying mechanisms, but current methods cannot leverage new interventional evidence at inference time and are limited by small single-cell perturbation datasets spanning few biological contexts.

Method: MapPFN uses prior-data fitted networks pretrained on synthetic data generated from a prior over causal perturbations, employing in-context learning to predict post-perturbation distributions without gradient-based optimization.

Result: Despite being pretrained only on in silico gene knockouts, MapPFN identifies differentially expressed genes with performance matching models trained on real single-cell data.

Conclusion: MapPFN demonstrates effective meta-learning for perturbation effect estimation across biological contexts using in-context learning, enabling adaptation to unseen contexts without retraining.

Abstract: Planning effective interventions in biological systems requires treatment-effect models that adapt to unseen biological contexts by identifying their specific underlying mechanisms. Yet single-cell perturbation datasets span only a handful of biological contexts, and existing methods cannot leverage new interventional evidence at inference time to adapt beyond their training data. To meta-learn a perturbation effect estimator, we present MapPFN, a prior-data fitted network (PFN) pretrained on synthetic data generated from a prior over causal perturbations. Given a set of experiments, MapPFN uses in-context learning to predict post-perturbation distributions, without gradient-based optimization. Despite being pretrained on in silico gene knockouts alone, MapPFN identifies differentially expressed genes, matching the performance of models trained on real single-cell data. Our code and data are available at https://github.com/marvinsxtr/MapPFN.

[476] Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

Minjae Kwon, Josephine Lamp, Lu Feng

Main category: cs.LG

TL;DR: Safe RL algorithms trained under fixed conditions fail to maintain safety guarantees when deployed on unseen patients due to distribution shift, but test-time shielding using learned dynamics models effectively restores safety across different algorithms and patient populations.

DetailsMotivation: Current safe RL algorithms are typically evaluated under fixed training conditions, but their safety guarantees may not transfer to deployment under distribution shift, especially in safety-critical domains like diabetes management where patient variability creates significant distribution shifts.

Method: Benchmarked eight safe RL algorithms on a unified clinical diabetes simulator, evaluated their performance across three diabetes types and three age groups, and implemented test-time shielding that filters unsafe actions using learned dynamics models to restore safety during deployment.

Result: Revealed a safety generalization gap where policies satisfying constraints during training frequently violate safety requirements on unseen patients. Test-time shielding achieved Time-in-Range gains of 13-14% for strong baselines like PPO-Lag and CPO while reducing clinical risk index and glucose variability across all tested algorithms and patient populations.

Conclusion: Training-time safety guarantees do not necessarily transfer to deployment under distribution shift, but test-time shielding using learned dynamics models can effectively restore safety across different algorithms and patient populations, providing a practical solution for safety-critical control domains with distribution shift.

Abstract: Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13–14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
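
Test-time shielding itself is simple to sketch: before executing the policy's action, roll it through a learned dynamics model and fall back to a predicted-safe alternative if a constraint would be violated. The dynamics model and glucose constraint below are toy stand-ins, not the paper's clinical simulator.

```python
def shield(state, policy_action, candidate_actions, dynamics_model, is_safe):
    """Return the policy's action if predicted safe, otherwise the closest safe alternative.

    dynamics_model(state, action) -> predicted next state (a learned model in the paper;
    a placeholder callable here). is_safe(next_state) -> bool constraint check.
    """
    if is_safe(dynamics_model(state, policy_action)):
        return policy_action
    safe = [a for a in candidate_actions if is_safe(dynamics_model(state, a))]
    if safe:
        # Among predicted-safe actions, stay closest to the policy's intent.
        return min(safe, key=lambda a: abs(a - policy_action))
    return min(candidate_actions)  # conservative fallback (e.g., smallest insulin dose)

# Toy glucose example: state = glucose level, action = insulin dose (all values made up).
dyn = lambda g, dose: g - 30.0 * dose          # crude linear stand-in for a learned model
safe = lambda g: 70.0 <= g <= 180.0            # Time-in-Range style constraint
print(shield(120.0, policy_action=2.0, candidate_actions=[0.0, 0.5, 1.0, 1.5, 2.0],
             dynamics_model=dyn, is_safe=safe))  # -> 1.5, the safe dose nearest to 2.0
```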

[477] TRACE: Trajectory Recovery for Continuous Mechanism Evolution in Causal Representation Learning

Shicheng Fan, Kun Zhang, Lu Cheng

Main category: cs.LG

TL;DR: TRACE learns continuous causal mechanism transitions as convex combinations of atomic mechanisms, enabling recovery of smooth mechanism trajectories beyond discrete switching assumptions.

DetailsMotivation: Existing temporal causal representation learning methods assume instantaneous switching between discrete causal mechanisms, but real-world systems often exhibit continuous mechanism transitions (e.g., vehicle dynamics during turns, human gait transitions). Current approaches fail to capture these smooth transitions.

Method: TRACE models transitional mechanisms as convex combinations of finitely many atomic mechanisms with time-varying mixing coefficients. It uses a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of continuous mechanism trajectories at test time.

Result: Experiments on synthetic and real-world data show TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines. The method generalizes to intermediate mechanism states never observed during training.

Conclusion: TRACE successfully addresses the limitation of discrete mechanism switching by modeling continuous causal transitions, providing theoretical identifiability guarantees and practical performance improvements for real-world temporal systems.

Abstract: Temporal causal representation learning methods assume that causal mechanisms switch instantaneously between discrete domains, yet real-world systems often exhibit continuous mechanism transitions. For example, a vehicle’s dynamics evolve gradually through a turning maneuver, and human gait shifts smoothly from walking to running. We formalize this setting by modeling transitional mechanisms as convex combinations of finitely many atomic mechanisms, governed by time-varying mixing coefficients. Our theoretical contributions establish that both the latent causal variables and the continuous mixing trajectory are jointly identifiable. We further propose TRACE, a Mixture-of-Experts framework where each expert learns one atomic mechanism during training, enabling recovery of mechanism trajectories at test time. This formulation generalizes to intermediate mechanism states never observed during training. Experiments on synthetic and real-world data demonstrate that TRACE recovers mixing trajectories with up to 0.99 correlation, substantially outperforming discrete-switching baselines.
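
The data-generating side of this setting (not the TRACE estimator itself) can be written in a few lines: the transition applied at time t is a convex combination of atomic mechanisms with smoothly varying mixing coefficients. The two linear "mechanisms" below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical atomic mechanisms (think "walking" vs. "running" dynamics).
A_walk = np.array([[0.9, 0.1], [0.0, 0.9]])
A_run = np.array([[0.5, 0.4], [-0.3, 0.8]])

def mixing_coefficients(t, T):
    """Smooth transition from mechanism 0 to mechanism 1 over the trajectory."""
    alpha = t / (T - 1)
    return np.array([1.0 - alpha, alpha])  # convex weights, sum to 1

def rollout(x0, T):
    xs = [x0]
    for t in range(T - 1):
        w = mixing_coefficients(t, T)
        # Transitional mechanism = convex combination of atomic mechanisms.
        A_t = w[0] * A_walk + w[1] * A_run
        xs.append(A_t @ xs[-1] + 0.01 * rng.normal(size=2))
    return np.stack(xs)

traj = rollout(np.array([1.0, 0.0]), T=50)
print(traj.shape)  # (50, 2)
```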

[478] Smooth Dynamic Cutoffs for Machine Learning Interatomic Potentials

Kevin Han, Haolin Cong, Bowen Deng, Amir Barati Farimani

Main category: cs.LG

TL;DR: Dynamic cutoff formulation for MLIPs reduces memory and inference time by targeting specific neighbor counts per atom instead of fixed cutoff radius.

DetailsMotivation: MLIPs face bottlenecks in inference time and memory consumption preventing realistic simulation scales. The paper challenges the fixed cutoff radius paradigm to address these limitations.

Method: Introduces dynamic cutoff formulation that induces sparsity on atom graphs by targeting specific number of neighbors per atom, implemented on four state-of-the-art MLIPs (MACE, Nequip, Orbv3, TensorNet).

Result: Achieves 2.26x less memory consumption and 2.04x faster inference time with minimal accuracy dropoff compared to fixed cutoff models on materials and molecular datasets.

Conclusion: Dynamic cutoff enables more efficient MLIPs for molecular dynamics simulations while maintaining accuracy, with all implementations to be open-sourced.

Abstract: Machine learning interatomic potentials (MLIPs) have proven to be wildly useful for molecular dynamics simulations, powering countless drug and materials discovery applications. However, MLIPs face two primary bottlenecks preventing them from reaching realistic simulation scales: inference time and memory consumption. In this work, we address both issues by challenging the long-held belief that the cutoff radius for the MLIP must be held to a fixed, constant value. For the first time, we introduce a dynamic cutoff formulation that still leads to stable, long timescale molecular dynamics simulation. In introducing the dynamic cutoff, we are able to induce sparsity onto the underlying atom graph by targeting a specific number of neighbors per atom, significantly reducing both memory consumption and inference time. We show the effectiveness of a dynamic cutoff by implementing it onto 4 state of the art MLIPs: MACE, Nequip, Orbv3, and TensorNet, leading to 2.26x less memory consumption and 2.04x faster inference time, depending on the model and atomic system. We also perform an extensive error analysis and find that the dynamic cutoff models exhibit minimal accuracy dropoff compared to their fixed cutoff counterparts on both materials and molecular datasets. All model implementations and training code will be fully open sourced.
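
The core graph-construction change is easy to illustrate: instead of keeping every neighbor within a fixed radius, keep exactly k nearest neighbors per atom, which caps the per-atom degree (and hence memory) regardless of local density. The sketch below shows only that neighbor-list change; the smooth envelope the paper adds so that forces remain continuous is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.uniform(0.0, 10.0, size=(100, 3))  # toy atomic positions
dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)

# Fixed cutoff: every atom keeps all neighbors within a constant radius.
r_cut = 3.0
fixed_edges = np.argwhere(dists < r_cut)

# Dynamic cutoff: every atom keeps exactly its k nearest neighbors,
# which bounds the per-atom degree no matter how dense the local environment is.
k = 12
knn_idx = np.argsort(dists, axis=1)[:, :k]
dynamic_edges = np.stack([np.repeat(np.arange(len(positions)), k), knn_idx.ravel()], axis=1)

print(f"fixed-cutoff edges: {len(fixed_edges)}, dynamic-cutoff edges: {len(dynamic_edges)}")
```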

[479] Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

Main category: cs.LG

TL;DR: ME-POIs framework learns POI representations by augmenting language model embeddings with human mobility data to capture both place identity and function, outperforming text-only and mobility-only baselines on map enrichment tasks.

DetailsMotivation: Existing POI representation methods focus on static textual metadata or trajectory context, missing the essential signal of POI function - how places are actually used. There's a need for general-purpose POI representations that capture real-world usage patterns.

Method: ME-POIs framework encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning. It propagates temporal visit patterns from nearby frequently visited POIs across multiple spatial scales to address long-tail sparsity.

Result: ME-POIs consistently outperforms both text-only and mobility-only baselines across five map enrichment tasks. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks.

Conclusion: POI function is a critical component of accurate and generalizable POI representations. Augmenting text-based embeddings with mobility data captures both identity and function, enabling better map enrichment and understanding of real-world locations.

Abstract: Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI’s function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

[480] Can Neural Networks Learn Small Algebraic Worlds? An Investigation Into the Group-theoretic Structures Learned By Narrow Models Trained To Predict Group Operations

Henry Kvinge, Andrew Aguilar, Nayda Farnsworth, Grace O’Brien, Robert Jasper, Sarah Scullen, Helen Jenne

Main category: cs.LG

TL;DR: Neural networks trained on group operations can capture some abstract algebraic structure like commutativity and subgroups, but not all concepts like identity elements.

DetailsMotivation: Current AI applications to mathematics focus on question-answering, but real mathematical discovery is open-ended. The paper explores whether models trained on specific mathematical tasks learn broader mathematical structure that can be extracted for deeper insights.

Method: Train neural networks to predict group operations (modular arithmetic, permutation composition), then design tests to assess if models capture group-theoretic concepts like identity elements, commutativity, and subgroups through representation analysis.

Result: Models learn representations that capture some abstract algebraic properties: evidence of capturing commutativity in modular arithmetic, ability to train linear classifiers that distinguish subgroup elements without explicit labels. However, unable to extract concepts like identity elements.

Conclusion: Even small neural networks can distill interesting abstract structure from mathematical objects, suggesting potential for AI to go beyond simple question-answering in mathematical discovery.

Abstract: While a real-world research program in mathematics may be guided by a motivating question, the process of mathematical discovery is typically open-ended. Ideally, exploration needed to answer the original question will reveal new structures, patterns, and insights that are valuable in their own right. This contrasts with the exam-style paradigm in which the machine learning community typically applies AI to math. To maximize progress in mathematics using AI, we will need to go beyond simple question answering. With this in mind, we explore the extent to which narrow models trained to solve a fixed mathematical task learn broader mathematical structure that can be extracted by a researcher or other AI system. As a basic test case for this, we use the task of training a neural network to predict a group operation (for example, performing modular arithmetic or composition of permutations). We describe a suite of tests designed to assess whether the model captures significant group-theoretic notions such as the identity element, commutativity, or subgroups. Through extensive experimentation we find evidence that models learn representations capable of capturing abstract algebraic properties. For example, we find hints that models capture the commutativity of modular arithmetic. We are also able to train linear classifiers that reliably distinguish between elements of certain subgroups (even though no labels for these subgroups are included in the data). On the other hand, we are unable to extract notions such as the concept of the identity element. Together, our results suggest that in some cases the representations of even small neural networks can be used to distill interesting abstract structure from new mathematical objects.
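
A minimal version of this experimental setup, assuming nothing about the paper's actual architecture: train a small network on the addition table of a cyclic group, then fit a linear probe on the learned element embeddings to test whether subgroup membership is linearly decodable, even though no subgroup labels were used in training. The probe here is fit on all elements just to show the mechanics; a real analysis would hold elements out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = 60                                   # cyclic group Z_60 (composite, so it has subgroups)
pairs = torch.cartesian_prod(torch.arange(n), torch.arange(n))
targets = (pairs[:, 0] + pairs[:, 1]) % n

emb = nn.Embedding(n, 32)                # per-element representation to probe later
net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, n))
opt = torch.optim.Adam(list(emb.parameters()) + list(net.parameters()), lr=1e-3)

for step in range(2000):                 # quick full-batch training on the addition table
    x = torch.cat([emb(pairs[:, 0]), emb(pairs[:, 1])], dim=1)
    loss = nn.functional.cross_entropy(net(x), targets)
    opt.zero_grad(); loss.backward(); opt.step()

# Probe: can a linear classifier read off membership in the subgroup 5*Z_60 = {0, 5, 10, ...}
# from the learned element embeddings?
reps = emb.weight.detach()
labels = (torch.arange(n) % 5 == 0).float()
probe = nn.Linear(32, 1)
popt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for step in range(500):
    ploss = nn.functional.binary_cross_entropy_with_logits(probe(reps).squeeze(1), labels)
    popt.zero_grad(); ploss.backward(); popt.step()
acc = ((probe(reps).squeeze(1) > 0).float() == labels).float().mean()
print(f"probe accuracy on subgroup membership: {acc.item():.2f}")
```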

[481] Learning to Advect: A Neural Semi-Lagrangian Architecture for Weather Forecasting

Carlos A. Pereira, Stéphane Gaudreault, Valentin Dallerit, Christopher Subich, Shoyon Panday, Siqi Wei, Sasa Zhang, Siddharth Rout, Eldad Haber, Raymond J. Spiteri, David Millard, Emilia Diaconescu

Main category: cs.LG

TL;DR: PARADIS is a physics-inspired global weather prediction model that decomposes weather forecasting into explicit advection, diffusion, and reaction blocks with neural operators, achieving state-of-the-art performance with much lower training cost.

DetailsMotivation: Traditional monolithic ML weather forecasting models implicitly represent physical mechanisms within single networks, which is problematic for advection requiring expensive global interactions. The authors aim to incorporate explicit physical inductive biases through functional decomposition.

Method: PARADIS decomposes weather prediction into three functional blocks: 1) Neural Semi-Lagrangian operator for advection using differentiable interpolation on the sphere, 2) depthwise-separable spatial mixing for diffusion, and 3) pointwise channel interactions for local source terms and vertical interactions. This enables operator-level physical structure.

Result: The 1-degree PARADIS model, trained with less than a GPU month, meets or exceeds performance of 0.25-degree traditional and ML baselines including ECMWF HRES forecast and DeepMind’s GraphCast on ERA5-based benchmarks.

Conclusion: Explicit functional decomposition with physics-inspired inductive biases enables efficient and accurate weather forecasting, achieving state-of-the-art performance with significantly reduced computational cost compared to monolithic approaches.

Abstract: Recent machine-learning approaches to weather forecasting often employ a monolithic architecture, where distinct physical mechanisms (advection, transport), diffusion-like mixing, thermodynamic processes, and forcing are represented implicitly within a single large network. This representation is particularly problematic for advection, where long-range transport must be treated with expensive global interaction mechanisms or through deep, stacked convolutional layers. To mitigate this, we present PARADIS, a physics-inspired global weather prediction model that imposes inductive biases on network behavior through a functional decomposition into advection, diffusion, and reaction blocks acting on latent variables. We implement advection through a Neural Semi-Lagrangian operator that performs trajectory-based transport via differentiable interpolation on the sphere, enabling end-to-end learning of both the latent modes to be transported and their characteristic trajectories. Diffusion-like processes are modeled through depthwise-separable spatial mixing, while local source terms and vertical interactions are modeled via pointwise channel interactions, enabling operator-level physical structure. PARADIS provides state-of-the-art forecast skill at a fraction of the training cost. On ERA5-based benchmarks, the 1 degree PARADIS model, with a total training cost of less than a GPU month, meets or exceeds the performance of 0.25 degree traditional and machine-learning baselines, including the ECMWF HRES forecast and DeepMind’s GraphCast.
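
For readers unfamiliar with the numerical building block: a semi-Lagrangian step traces each grid point backwards along the velocity field to its departure point and interpolates the field there, which is why the time step is not CFL-limited. Below is the classical 1D periodic version with a constant velocity; PARADIS instead learns both the transported latents and their trajectories, and interpolates differentiably on the sphere.

```python
import numpy as np

def semi_lagrangian_step(field, velocity, dt, dx):
    """Advect a 1D periodic field by tracing departure points and interpolating linearly."""
    n = field.size
    x = np.arange(n) * dx
    depart = (x - velocity * dt) % (n * dx)      # departure point for each grid node
    idx = np.floor(depart / dx).astype(int)      # left grid neighbor of the departure point
    frac = depart / dx - idx
    return (1.0 - frac) * field[idx % n] + frac * field[(idx + 1) % n]

n, dx, dt, u = 200, 1.0, 2.5, 1.3                # note: dt may exceed the CFL limit
field = np.exp(-0.5 * ((np.arange(n) - 50.0) / 5.0) ** 2)
for _ in range(40):
    field = semi_lagrangian_step(field, u, dt, dx)
print(f"peak location after advection: {field.argmax()}")   # ~ (50 + 40*u*dt) mod n
```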

[482] A Federated Generalized Expectation-Maximization Algorithm for Mixture Models with an Unknown Number of Components

Michael Ibrahim, Nagi Gebraeel, Weijun Xie

Main category: cs.LG

TL;DR: FedGEM: A federated generalized EM algorithm for clustering with unknown number of clusters across heterogeneous clients with potentially overlapping cluster sets.

DetailsMotivation: Address the challenge of federated clustering when the total number of clusters is unknown and clients have heterogeneous, potentially overlapping cluster sets in their local data, which is common in distributed settings where data cannot be centralized.

Method: Clients perform EM steps locally and construct uncertainty sets around each local component maximizer. The central server uses these uncertainty sets to learn potential cluster overlaps between clients and infer the global number of clusters via closed-form computations.

Result: Theoretical probabilistic convergence guarantees under common assumptions, with specific tractable computations for isotropic GMMs. Empirical results show comparable performance to centralized EM and outperformance of existing federated clustering methods.

Conclusion: FedGEM provides an effective federated clustering solution for unknown number of clusters with heterogeneous client data, offering theoretical guarantees and practical performance comparable to centralized approaches.

Abstract: We study the problem of federated clustering when the total number of clusters $K$ across clients is unknown, and the clients have heterogeneous but potentially overlapping cluster sets in their local data. To that end, we develop FedGEM: a federated generalized expectation-maximization algorithm for the training of mixture models with an unknown number of components. Our proposed algorithm relies on each of the clients performing EM steps locally, and constructing an uncertainty set around the maximizer associated with each local component. The central server utilizes the uncertainty sets to learn potential cluster overlaps between clients, and infer the global number of clusters via closed-form computations. We perform a thorough theoretical study of our algorithm, presenting probabilistic convergence guarantees under common assumptions. Subsequently, we study the specific setting of isotropic GMMs, providing tractable, low-complexity computations to be performed by each client during each iteration of the algorithm, as well as rigorously verifying assumptions required for algorithm convergence. We perform various numerical experiments, where we empirically demonstrate that our proposed method achieves comparable performance to centralized EM, and that it outperforms various existing federated clustering methods.
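
A cartoon of the server-side aggregation, under the simplifying assumption (ours, for illustration only) that each uncertainty set can be summarized as a ball around a local component mean: overlapping balls are treated as the same global cluster, and the number of connected components gives the inferred K. Values below are made up and one-dimensional.

```python
# Each client reports (local component mean, uncertainty radius); values are hypothetical.
client_components = [
    [(0.1, 0.3), (5.2, 0.4)],        # client 1 sees clusters near 0 and 5
    [(0.0, 0.2), (9.8, 0.5)],        # client 2 sees clusters near 0 and 10
    [(5.0, 0.3), (10.1, 0.4)],       # client 3 sees clusters near 5 and 10
]

def infer_global_clusters(client_components):
    """Merge overlapping uncertainty intervals across clients; count connected components."""
    comps = [c for client in client_components for c in client]
    parent = list(range(len(comps)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            (mi, ri), (mj, rj) = comps[i], comps[j]
            if abs(mi - mj) <= ri + rj:          # uncertainty sets overlap -> same cluster
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(comps))})

print(f"inferred global number of clusters: {infer_global_clusters(client_components)}")  # 3
```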

[483] Efficient Simple Regret Algorithms for Stochastic Contextual Bandits

Shuai Liu, Alireza Bakhtiari, Alex Ayoub, Botao Hao, Csaba Szepesvári

Main category: cs.LG

TL;DR: First simple regret guarantees for stochastic contextual logistic bandits with algorithms achieving Õ(d/√T) and Õ(d³/²/√T) bounds, eliminating dependence on exponential constant κ.

DetailsMotivation: While simple regret guarantees exist for linear contextual bandits, no such results were known for logistic bandits. The paper aims to bridge this gap by developing algorithms for stochastic contextual logistic bandits under simple regret objective.

Method: Proposes deterministic algorithm based on contextual linear bandits and self-concordant analysis, plus new Thompson Sampling variant tailored for simple-regret setting. Extends randomized approach to logistic case with similar structure.

Result: Achieves simple regret Õ(d/√T) for deterministic algorithm and Õ(d³/²/√T) for randomized algorithms, both without dependence on exponential constant κ = O(exp(S)). Algorithms are tractable for finite action sets and cheaper to run for randomized versions.

Conclusion: First simple regret guarantees for stochastic contextual logistic bandits with κ-independent bounds, introducing both deterministic and randomized algorithms with theoretical guarantees validated empirically.

Abstract: We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees have been established for the linear case, no such results were previously known for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret $\tilde{\mathcal{O}}(d/\sqrt{T})$. Notably, the leading term of our regret bound is free of the constant $\kappa = \mathcal{O}(\exp(S))$, where $S$ is a bound on the magnitude of the unknown parameter vector. The algorithm is shown to be fully tractable when the action set is finite. We also introduce a new variant of Thompson Sampling tailored to the simple-regret setting. This yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits, with regret $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$. Extending this method to the logistic case, we obtain a similarly structured Thompson Sampling algorithm that achieves the same regret bound – $\tilde{\mathcal{O}}(d^{3/2}/\sqrt{T})$ – again with no dependence on $\kappa$ in the leading term. The randomized algorithms, as expected, are cheaper to run than their deterministic counterparts. Finally, we conducted a series of experiments to empirically validate these theoretical guarantees.

[484] The Powers of Precision: Structure-Informed Detection in Complex Systems – From Customer Churn to Seizure Onset

Augusto Santos, Teresa Santos, Catarina Rodrigues, José M. F. Moura

Main category: cs.LG

TL;DR: A machine learning method for early detection of emergent phenomena by learning optimal feature representations from covariance/precision matrix powers, with applications to seizure detection and churn prediction.

DetailsMotivation: Emergent phenomena like epileptic seizures, customer churn, or pandemics arise from hidden causal interactions in complex systems. The core challenge is unveiling latent causal structure when the data-generating process is unknown and partially observed.

Method: Proposes a method that learns optimal feature representations from a one-parameter family of estimators (powers of empirical covariance or precision matrix). This provides a principled way to tune into underlying structure driving critical events. A supervised learning module then classifies the learned representation.

Result: Proves structural consistency of the family and demonstrates empirical soundness on seizure detection and churn prediction, attaining competitive results in both. The optimal covariance power shows evidence of good identifiability while capturing structural signatures.

Conclusion: The method reconciles predictive performance with interpretable statistical structure, offering both early detection capability and explainability for emergent phenomena in complex systems.

Abstract: Emergent phenomena – onset of epileptic seizures, sudden customer churn, or pandemic outbreaks – often arise from hidden causal interactions in complex systems. We propose a machine learning method for their early detection that addresses a core challenge: unveiling and harnessing a system’s latent causal structure despite the data-generating process being unknown and partially observed. The method learns an optimal feature representation from a one-parameter family of estimators – powers of the empirical covariance or precision matrix – offering a principled way to tune in to the underlying structure driving the emergence of critical events. A supervised learning module then classifies the learned representation. We prove structural consistency of the family and demonstrate the empirical soundness of our approach on seizure detection and churn prediction, attaining competitive results in both. Beyond prediction, and toward explainability, we ascertain that the optimal covariance power exhibits evidence of good identifiability while capturing structural signatures, thus reconciling predictive performance with interpretable statistical structure.
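
The one-parameter feature family is straightforward to compute: take the empirical channel covariance of a segment, raise it to a power α via its eigendecomposition (negative α giving powers of the precision matrix), and vectorize. The sketch below shows only this representation step; the supervised classifier and the selection of the optimal power are omitted, and the data are random placeholders.

```python
import numpy as np

def covariance_power(X, alpha, eps=1e-6):
    """Return C^alpha for the empirical covariance C of data X (n_samples, n_channels).

    Negative alpha gives powers of the precision matrix; alpha is the single
    tuning knob of the feature family.
    """
    C = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    eigvals = np.clip(eigvals, eps, None)        # guard against numerical non-PSD issues
    return (eigvecs * eigvals ** alpha) @ eigvecs.T

rng = np.random.default_rng(0)
segment = rng.normal(size=(256, 19))             # e.g., one EEG segment: 256 samples x 19 channels
features = {a: covariance_power(segment, a)[np.triu_indices(19)] for a in (-1.0, -0.5, 0.5, 1.0)}
print({a: f.shape for a, f in features.items()})  # upper-triangular entries as feature vectors
```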

[485] AC2L-GAD: Active Counterfactual Contrastive Learning for Graph Anomaly Detection

Kamal Berahmand, Saman Forouzandeh, Mehrnoush Mohammadi, Parham Moradi, Mahdi Jalili

Main category: cs.LG

TL;DR: AC2L-GAD is an active counterfactual contrastive learning framework for graph anomaly detection that addresses label scarcity and class imbalance through strategic counterfactual generation and active selection.

DetailsMotivation: Graph anomaly detection suffers from label scarcity and extreme class imbalance. Existing graph contrastive learning methods have two critical limitations: random augmentations break semantic consistency in positive pairs, and naive negative sampling produces trivial, uninformative contrasts.

Method: Proposes AC2L-GAD combining information-theoretic active selection with counterfactual generation. It identifies structurally complex nodes and generates anomaly-preserving positive augmentations alongside normal negative counterparts for hard contrasts, while restricting expensive counterfactual generation to a strategically selected subset.

Result: Reduces computational overhead by approximately 65% compared to full-graph counterfactual generation while maintaining detection quality. Achieves competitive or superior performance on nine benchmark datasets including real-world financial transaction graphs from GADBench, with notable gains in datasets where anomalies exhibit complex attribute-structure interactions.

Conclusion: AC2L-GAD effectively addresses limitations of existing graph contrastive learning methods through principled counterfactual reasoning and active selection, providing an efficient and effective solution for graph anomaly detection.

Abstract: Graph anomaly detection aims to identify abnormal patterns in networks, but faces significant challenges from label scarcity and extreme class imbalance. While graph contrastive learning offers a promising unsupervised solution, existing methods suffer from two critical limitations: random augmentations break semantic consistency in positive pairs, while naive negative sampling produces trivial, uninformative contrasts. We propose AC2L-GAD, an Active Counterfactual Contrastive Learning framework that addresses both limitations through principled counterfactual reasoning. By combining information-theoretic active selection with counterfactual generation, our approach identifies structurally complex nodes and generates anomaly-preserving positive augmentations alongside normal negative counterparts that provide hard contrasts, while restricting expensive counterfactual generation to a strategically selected subset. This design reduces computational overhead by approximately 65% compared to full-graph counterfactual generation while maintaining detection quality. Experiments on nine benchmark datasets, including real-world financial transaction graphs from GADBench, show that AC2L-GAD achieves competitive or superior performance compared to state-of-the-art baselines, with notable gains in datasets where anomalies exhibit complex attribute-structure interactions.

[486] Breaking the Reasoning Horizon in Entity Alignment Foundation Models

Yuanning Cui, Zequn Sun, Wei Hu, Kexuan Xin, Zhangjie Fu

Main category: cs.LG

TL;DR: A novel entity alignment foundation model using parallel encoding with seed pairs as anchors to bridge the reasoning horizon gap for aligning unseen knowledge graphs without retraining.

DetailsMotivation: Existing entity alignment models lack transferability and cannot align unseen knowledge graphs without retraining. While graph foundation models offer potential, they suffer from a "reasoning horizon gap" - EA requires capturing long-range dependencies across sparse, heterogeneous KG structures unlike link prediction tasks.

Method: Proposes an EA foundation model with parallel encoding strategy: uses seed EA pairs as local anchors to guide information flow, initializes two parallel streams simultaneously for anchor-conditioned message passing. Incorporates merged relation graph for global dependencies and learnable interaction module for precise matching.

Result: Extensive experiments verify the framework’s effectiveness and highlight strong generalizability to unseen knowledge graphs.

Conclusion: The proposed parallel encoding approach with anchor guidance successfully addresses the reasoning horizon gap in entity alignment, enabling transferable EA models that can handle unseen KGs without retraining.

Abstract: Entity alignment (EA) is critical for knowledge graph (KG) fusion. Existing EA models lack transferability and are incapable of aligning unseen KGs without retraining. While graph foundation models (GFMs) offer a solution, we find that directly adapting GFMs to EA remains largely ineffective. This stems from a critical “reasoning horizon gap”: unlike link prediction in GFMs, EA necessitates capturing long-range dependencies across sparse and heterogeneous KG structures. To address this challenge, we propose an EA foundation model driven by a parallel encoding strategy. We utilize seed EA pairs as local anchors to guide the information flow, initializing and encoding two parallel streams simultaneously. This facilitates anchor-conditioned message passing and significantly shortens the inference trajectory by leveraging local structural proximity instead of global search. Additionally, we incorporate a merged relation graph to model global dependencies and a learnable interaction module for precise matching. Extensive experiments verify the effectiveness of our framework, highlighting its strong generalizability to unseen KGs.
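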

[487] Flow Perturbation++: Multi-Step Unbiased Jacobian Estimation for High-Dimensional Boltzmann Sampling

Xin Peng, Ang Gao

Main category: cs.LG

TL;DR: Flow Perturbation++ is a variance-reduced extension of Flow Perturbation for unbiased Boltzmann sampling using continuous normalizing flows, achieving lower variance while maintaining unbiasedness through multi-step Jacobian estimation.

DetailsMotivation: Current methods for unbiased Boltzmann sampling with continuous normalizing flows face limitations: exact Jacobian evaluation is computationally expensive in high dimensions, stochastic estimators like Hutchinson introduce bias, and Flow Perturbation has high variance.

Method: Flow Perturbation++ discretizes the probability-flow ODE and performs unbiased stepwise Jacobian estimation at each integration step, creating a multi-step construction that reduces variance while maintaining unbiasedness. It’s integrated into a Sequential Monte Carlo framework.

Result: The method achieves significantly improved equilibrium sampling on a 1000D Gaussian Mixture Model and the all-atom Chignolin protein compared to Hutchinson-based and single-step Flow Perturbation baselines.

Conclusion: Flow Perturbation++ provides an effective variance-reduced approach for unbiased Boltzmann sampling in high-dimensional systems, overcoming limitations of previous Jacobian estimation methods for continuous normalizing flows.

Abstract: The scalability of continuous normalizing flows (CNFs) for unbiased Boltzmann sampling remains limited in high-dimensional systems due to the cost of Jacobian-determinant evaluation, which requires $D$ backpropagation passes through the flow layers. Existing stochastic Jacobian estimators such as the Hutchinson trace estimator reduce computation but introduce bias, while the recently proposed Flow Perturbation method is unbiased yet suffers from high variance. We present \textbf{Flow Perturbation++}, a variance-reduced extension of Flow Perturbation that discretizes the probability-flow ODE and performs unbiased stepwise Jacobian estimation at each integration step. This multi-step construction retains the unbiasedness of Flow Perturbation while achieving substantially lower estimator variance. Integrated into a Sequential Monte Carlo framework, Flow Perturbation++ achieves significantly improved equilibrium sampling on a 1000D Gaussian Mixture Model and the all-atom Chignolin protein compared with Hutchinson-based and single-step Flow Perturbation baselines.
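
For context, the Hutchinson estimator mentioned above approximates tr(J) as an average of z^T J z over random probes z, using only Jacobian-vector products. Each trace estimate is itself unbiased; the bias the abstract refers to arises when such noisy estimates are pushed through nonlinearities such as exponentiated log-densities in importance weights. A minimal numerical illustration with a random matrix standing in for the Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 500
J = rng.normal(size=(D, D)) / np.sqrt(D)   # stand-in for a Jacobian accessed only via products

def hutchinson_trace(matvec, dim, n_probes):
    """Estimate tr(J) from Rademacher probes using only matrix-vector products."""
    est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        est += z @ matvec(z)
    return est / n_probes

true_trace = np.trace(J)
for n_probes in (1, 10, 100):
    print(n_probes, hutchinson_trace(lambda v: J @ v, D, n_probes), true_trace)

# Unbiased for tr(J), but exp(estimate) is a biased estimate of exp(tr(J)) by Jensen's
# inequality -- the kind of nonlinearity that appears in Boltzmann importance weights.
```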

[488] Rethinking Refinement: Correcting Generative Bias without Noise Injection

Xin Peng, Ang Gao

Main category: cs.LG

TL;DR: BFR is a bi-stage flow refinement framework that corrects biases in generative models through post-hoc refinement in latent and data spaces without perturbing original sampling dynamics.

DetailsMotivation: Generative models like diffusion and flow-based models often have systematic biases that degrade sample quality, especially in high-dimensional settings. Existing refinement methods typically require noise injection or multi-step resampling, which complicates the sampling process.

Method: Proposes Bi-stage Flow Refinement (BFR) with two refinement strategies: 1) latent space alignment for approximately invertible generators, and 2) data space refinement trained with lightweight augmentations. Unlike previous methods, BFR preserves the original ODE trajectory and applies deterministic corrections to generated samples without perturbing sampling dynamics.

Result: Experiments on MNIST, CIFAR-10, and FFHQ at 256x256 resolution show consistent improvements in fidelity and coverage. Starting from base samples with FID 3.95, latent space refinement achieves state-of-the-art FID of 1.46 on MNIST using only a single additional function evaluation (1-NFE) while maintaining sample diversity.

Conclusion: Effective bias correction can be achieved as a post-hoc procedure without noise injection or multi-step resampling. BFR provides a practical framework for improving generative model outputs while preserving original sampling dynamics.

Abstract: Generative models, including diffusion and flow-based models, often exhibit systematic biases that degrade sample quality, particularly in high-dimensional settings. We revisit refinement methods and show that effective bias correction can be achieved as a post-hoc procedure, without noise injection or multi-step resampling of the sampling process. We propose a flow-matching-based \textbf{Bi-stage Flow Refinement (BFR)} framework with two refinement strategies operating at different stages: latent space alignment for approximately invertible generators and data space refinement trained with lightweight augmentations. Unlike previous refiners that perturb sampling dynamics, BFR preserves the original ODE trajectory and applies deterministic corrections to generated samples. Experiments on MNIST, CIFAR-10, and FFHQ at 256x256 resolution demonstrate consistent improvements in fidelity and coverage; notably, starting from base samples with FID 3.95, latent space refinement achieves a \textbf{state-of-the-art} FID of \textbf{1.46} on MNIST using only a single additional function evaluation (1-NFE), while maintaining sample diversity.

[489] Rethinking Self-Training Based Cross-Subject Domain Adaptation for SSVEP Classification

Weiguang Wang, Yong Liu, Yingjie Gao, Guangyuan Xu

Main category: cs.LG

TL;DR: A novel cross-subject domain adaptation method for SSVEP-based BCIs using filter-bank alignment, adversarial learning, and contrastive learning to address signal variability and annotation costs.

DetailsMotivation: SSVEP-based BCIs face challenges with signal variability across subjects and expensive user-specific annotation, limiting recognition performance and practical deployment.

Method: Proposes a cross-subject self-training framework with Filter-Bank Euclidean Alignment (FBEA), Pre-Training with Adversarial Learning (PTAL) for distribution alignment, Dual-Ensemble Self-Training (DEST) for pseudo-label refinement, and Time-Frequency Augmented Contrastive Learning (TFA-CL) for enhanced feature discriminability.

Result: Achieves state-of-the-art performance on Benchmark and BETA datasets across varying signal lengths, demonstrating superiority over existing methods.

Conclusion: The proposed cross-subject domain adaptation framework effectively addresses signal variability and annotation limitations in SSVEP-based BCIs, enabling improved performance without extensive user-specific training.

Abstract: Steady-state visually evoked potentials (SSVEP)-based brain-computer interfaces (BCIs) are widely used due to their high signal-to-noise ratio and user-friendliness. Accurate decoding of SSVEP signals is crucial for interpreting user intentions in BCI applications. However, signal variability across subjects and the costly user-specific annotation limit recognition performance. Therefore, we propose a novel cross-subject domain adaptation method built upon the self-training paradigm. Specifically, a Filter-Bank Euclidean Alignment (FBEA) strategy is designed to exploit frequency information from SSVEP filter banks. Then, we propose a Cross-Subject Self-Training (CSST) framework consisting of two stages: Pre-Training with Adversarial Learning (PTAL), which aligns the source and target distributions, and Dual-Ensemble Self-Training (DEST), which refines pseudo-label quality. Moreover, we introduce a Time-Frequency Augmented Contrastive Learning (TFA-CL) module to enhance feature discriminability across multiple augmented views. Extensive experiments on the Benchmark and BETA datasets demonstrate that our approach achieves state-of-the-art performance across varying signal lengths, highlighting its superiority.
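
Euclidean alignment in its common form whitens each subject's trials by the inverse square root of that subject's mean trial covariance, so that every subject's average covariance becomes the identity before adaptation; FBEA presumably applies this per filter-bank sub-band, which is omitted below along with the rest of the pipeline. The data are random placeholders.

```python
import numpy as np

def euclidean_align(trials):
    """Whiten EEG trials (n_trials, n_channels, n_samples) by the subject's mean covariance."""
    covs = np.stack([t @ t.T / t.shape[1] for t in trials])   # per-trial channel covariances
    R = covs.mean(axis=0)                                     # reference (mean) covariance
    eigvals, eigvecs = np.linalg.eigh(R)
    R_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return np.stack([R_inv_sqrt @ t for t in trials])

rng = np.random.default_rng(0)
trials = rng.normal(size=(40, 9, 250))            # toy subject: 40 trials, 9 channels, 1 s at 250 Hz
aligned = euclidean_align(trials)

# After alignment the mean covariance is the identity for every subject,
# removing a large part of the cross-subject distribution shift.
mean_cov = np.mean([t @ t.T / t.shape[1] for t in aligned], axis=0)
print(np.allclose(mean_cov, np.eye(9), atol=1e-6))
```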

[490] A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

Chuan-Shen Hu

Main category: cs.LG

TL;DR: A cellular sheaf theoretic framework for analyzing feature consistency and harmonicity in graph-based deep learning architectures, with multiscale extensions for hierarchical feature interactions.

DetailsMotivation: The distribution and diffusion behavior of features in geometric and topological deep learning (GDL/TDL) architectures remains underexplored, motivating a framework to analyze local consistency and harmonicity of node features and edge weights.

Method: Introduces cellular sheaf theory to model local feature alignments and agreements in graph-based architectures, with multiscale extensions inspired by topological data analysis (TDA) to capture hierarchical feature interactions.

Result: Provides a topological perspective on feature diffusion and aggregation, enabling joint characterization of GDL/TDL architectures based on their geometric/topological structures and learned signals.

Conclusion: The framework offers insights for future studies on conventional tasks like node classification, substructure detection, and community detection through topological analysis of feature behavior.

Abstract: Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.

[491] Temporal Context and Architecture: A Benchmark for Naturalistic EEG Decoding

Mehmet Ergezer

Main category: cs.LG

TL;DR: S5 model achieves 98.7% EEG decoding accuracy with 20x fewer parameters than CNN, but EEGXF shows better robustness to cross-frequency shifts and OOD tasks despite being less calibrated in-distribution.

DetailsMotivation: To understand how model architecture and temporal context interact in naturalistic EEG decoding, and to benchmark different architectures on real-world robustness metrics including cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization.

Method: Used HBN movie-watching dataset to benchmark five architectures (CNN, LSTM, EEGXF, S4, S5) on a 4-class task across segment lengths from 8s to 128s. Evaluated zero-shot cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization.

Result: Accuracy improves with longer context: at 64s, S5 reaches 98.7%±0.6 and CNN 98.3%±0.3, with S5 using ~20x fewer parameters. S5 achieves stronger cross-subject accuracy but makes over-confident errors on OOD tasks; EEGXF is more conservative and stable under frequency shifts.

Conclusion: Reveals a practical efficiency-robustness trade-off: S5 for parameter-efficient peak accuracy; EEGXF when robustness and conservative uncertainty are critical in EEG decoding applications.

Abstract: We study how model architecture and temporal context interact in naturalistic EEG decoding. Using the HBN movie-watching dataset, we benchmark five architectures, CNN, LSTM, a stabilized Transformer (EEGXF), S4, and S5, on a 4-class task across segment lengths from 8s to 128s. Accuracy improves with longer context: at 64s, S5 reaches 98.7%+/-0.6 and CNN 98.3%+/-0.3, while S5 uses ~20x fewer parameters than CNN. To probe real-world robustness, we evaluate zero-shot cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization. S5 achieves stronger cross-subject accuracy but makes over-confident errors on OOD tasks; EEGXF is more conservative and stable under frequency shifts, though less calibrated in-distribution. These results reveal a practical efficiency-robustness trade-off: S5 for parameter-efficient peak accuracy; EEGXF when robustness and conservative uncertainty are critical.

[492] Soft Quantization: Model Compression Via Weight Coupling

Daniel T. Bernstein, Luca Di Carlo, David Schwab

Main category: cs.LG

TL;DR: Soft quantization method using short-range attractive couplings between neural network weights during training enables mixed-precision discretization with minimal hyperparameters

DetailsMotivation: To develop a novel model quantization approach that discretizes weight distributions during training rather than post-training, providing more flexible compression and better understanding of compression-generalization trade-offs

Method: Introduces short-range attractive couplings between neural network weights during training that rapidly induce discretization of weight distribution in a mixed-precision manner using only two additional hyperparameters

Result: Outperforms histogram-equalized post-training quantization on ResNet-20/CIFAR-10 within appropriate hyperparameter ranges

Conclusion: Soft quantization provides a new pipeline for flexible model compression and a tool for investigating compression-generalization trade-offs in high-dimensional loss landscapes

Abstract: We show that introducing short-range attractive couplings between the weights of a neural network during training provides a novel avenue for model quantization. These couplings rapidly induce the discretization of a model’s weight distribution, and they do so in a mixed-precision manner despite only relying on two additional hyperparameters. We demonstrate that, within an appropriate range of hyperparameters, our “soft quantization” scheme outperforms histogram-equalized post-training quantization on ResNet-20/CIFAR-10. Soft quantization provides both a new pipeline for the flexible compression of machine learning models and a new tool for investigating the trade-off between compression and generalization in high-dimensional loss landscapes.
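
The abstract describes the mechanism only at a high level (pairwise short-range attraction between weights, two extra hyperparameters), and the exact coupling form is not given here. Below is a minimal, hypothetical sketch of one plausible form: a Gaussian-shaped attractive energy between weight pairs added to the ordinary task loss, with `strength` and `interaction_range` standing in for the two hyperparameters.

```python
import torch

def short_range_coupling_penalty(weights: torch.Tensor,
                                 strength: float = 1e-3,
                                 interaction_range: float = 0.05) -> torch.Tensor:
    """Hypothetical pairwise attractive coupling between weights.

    Pairs of weights closer than roughly `interaction_range` feel an attractive
    (negative) energy, encouraging them to cluster into a small set of shared
    values as training proceeds.  `strength` and `interaction_range` play the
    role of the two extra hyperparameters mentioned in the abstract; the exact
    functional form used in the paper may differ.
    """
    w = weights.flatten()
    diffs = w.unsqueeze(0) - w.unsqueeze(1)              # pairwise differences
    energy = -torch.exp(-(diffs / interaction_range) ** 2).sum()
    return strength * energy

# Toy usage: add the penalty to an ordinary task loss during training.
layer = torch.nn.Linear(16, 4)
x, y = torch.randn(32, 16), torch.randn(32, 4)
task_loss = torch.nn.functional.mse_loss(layer(x), y)
loss = task_loss + short_range_coupling_penalty(layer.weight)
loss.backward()
```

Because nearby weights lower the coupling energy by merging, minimizing this term pushes the weight distribution toward a small number of clusters, which is the discretization effect the abstract describes.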

[493] PHDME: Physics-Informed Diffusion Models without Explicit Governing Equations

Kaiyuan Tan, Kendra Givens, Peilun Li, Thomas Beckers

Main category: cs.LG

TL;DR: PHDME: A port-Hamiltonian diffusion framework for forecasting dynamical systems with sparse observations and incomplete physics knowledge, combining Gaussian processes, diffusion models, and physics-informed learning.

DetailsMotivation: Diffusion models are good priors for forecasting dynamical systems but unreliable with sparse data. Physics-informed ML helps but requires explicit governing equations, which are often only partially known for complex nonlinear dynamics.

Method: Train a Gaussian process distributed Port-Hamiltonian system (GP-dPHS) on limited observations to capture energy-based dynamics representation. Use GP-dPHS to generate physically consistent artificial dataset for diffusion training, and inform diffusion model with structured physics residual loss. Apply split conformal calibration for uncertainty quantification.

Result: Experiments on PDE benchmarks and real-world spring system show improved accuracy and physical consistency under data scarcity compared to baseline methods.

Conclusion: PHDME provides an effective framework for forecasting dynamical systems with sparse observations and incomplete physics knowledge by combining port-Hamiltonian structural priors, diffusion models, and physics-informed learning with uncertainty quantification.

Abstract: Diffusion models provide expressive priors for forecasting trajectories of dynamical systems, but are typically unreliable in the sparse data regime. Physics-informed machine learning (PIML) improves reliability in such settings; however, most methods require explicit governing equations during training, which are often only partially known due to complex and nonlinear dynamics. We introduce PHDME, a port-Hamiltonian diffusion framework designed for sparse observations and incomplete physics. PHDME leverages a port-Hamiltonian structural prior but does not require full knowledge of the closed-form governing equations. Our approach first trains a Gaussian process distributed Port-Hamiltonian system (GP-dPHS) on limited observations to capture an energy-based representation of the dynamics. The GP-dPHS is then used to generate a physically consistent artificial dataset for diffusion training, and to inform the diffusion model with a structured physics residual loss. After training, the diffusion model acts as an amortized sampler and forecaster for fast trajectory generation. Finally, we apply split conformal calibration to provide uncertainty statements for the generated predictions. Experiments on PDE benchmarks and a real-world spring system show improved accuracy and physical consistency under data scarcity.
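
Of the pipeline above, the final calibration step is a standard, well-specified procedure. The sketch below shows generic split conformal prediction intervals around point forecasts, with the PHDME diffusion forecaster replaced by a stand-in; function and variable names are illustrative, not from the paper.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_targets, test_preds, alpha=0.1):
    """Split conformal calibration on a held-out calibration set.

    Returns symmetric prediction intervals with roughly (1 - alpha) coverage,
    assuming exchangeability between calibration and test points.
    """
    residuals = np.abs(cal_targets - cal_preds)
    n = len(residuals)
    # Finite-sample-corrected quantile of the absolute residuals.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q = np.quantile(residuals, min(q_level, 1.0), method="higher")
    return test_preds - q, test_preds + q

# Toy usage with a stand-in forecaster (the diffusion sampler in PHDME).
rng = np.random.default_rng(0)
cal_targets = rng.normal(size=500)
cal_preds = cal_targets + rng.normal(scale=0.2, size=500)   # forecaster output
test_preds = rng.normal(size=10)
lower, upper = split_conformal_interval(cal_preds, cal_targets, test_preds)
```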

[494] Understanding Diffusion Models via Ratio-Based Function Approximation with SignReLU Networks

Luwei Sun, Dongrui Shen, Jianfe Li, Yulong Zhao, Han Feng

Main category: cs.LG

TL;DR: Theoretical framework for approximating ratio-type functionals (f1/f2) using SignReLU neural networks, with applications to diffusion models and generalization guarantees.

DetailsMotivation: Address challenges in conditional generative modeling where target conditional densities take ratio form f1/f2, which is central to diffusion-based generative models like DDPMs.

Method: Develop theoretical framework using deep neural networks with SignReLU activation function to approximate ratio-type functionals. Provide L^p approximation bounds and convergence rates under regularity assumptions. Construct SignReLU-based neural estimator for DDPM reverse process.

Result: Establish approximation bounds for ratio-type functionals, derive bounds on excess KL risk between generated and true data distributions, decompose risk into approximation and estimation errors, providing generalization guarantees for diffusion models.

Conclusion: Theoretical framework provides generalization guarantees for finite-sample training of diffusion-based generative models through neural approximation of ratio-type functionals.

Abstract: Motivated by challenges in conditional generative modeling, where the target conditional density takes the form of a ratio f1 over f2, this paper develops a theoretical framework for approximating such ratio-type functionals. Here, f1 and f2 are kernel-based marginal densities that capture structured interactions, a setting central to diffusion-based generative models. We provide a concise proof for approximating these ratio-type functionals using deep neural networks with the SignReLU activation function, leveraging the activation’s piecewise structure. Under standard regularity assumptions, we establish $L^p(\Omega)$ approximation bounds and convergence rates. Specializing to Denoising Diffusion Probabilistic Models (DDPMs), we construct a SignReLU-based neural estimator for the reverse process and derive bounds on the excess Kullback-Leibler (KL) risk between the generated and true data distributions. Our analysis decomposes this excess risk into approximation and estimation error components. These results provide generalization guarantees for finite-sample training of diffusion-based generative models.
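
The exact SignReLU construction is not reproduced in this summary, so the toy sketch below only illustrates the ratio-type target itself: a network with separate numerator and denominator branches (a plain ReLU as a stand-in activation) regressed onto a known ratio f1/f2. It is an illustration of the setting, not the paper's estimator.

```python
import torch
import torch.nn as nn

class RatioNet(nn.Module):
    """Toy model for a ratio-type target f1/f2.

    Two small sub-networks model the numerator and a strictly positive
    denominator.  The paper analyses approximation with the SignReLU
    activation, whose definition is not reproduced here, so ReLU is used
    purely as a stand-in.
    """
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f2 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        num = self.f1(x)
        den = nn.functional.softplus(self.f2(x)) + 1e-6   # keep denominator positive
        return num / den

# Toy usage: regress onto a known ratio-form target.
x = torch.randn(256, 3)
target = torch.sin(x[:, :1]) / (1.5 + torch.cos(x[:, 1:2]))
model = RatioNet(3)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
```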

[495] Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin

Main category: cs.LG

TL;DR: LENS framework improves RLVR for LLM reasoning by identifying and removing interference tokens from prompts, then transferring successful rollouts to supervise policy optimization on noisy prompts.

DetailsMotivation: Current RLVR methods suffer from inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. Many failures come from interference tokens in prompts rather than problem difficulty itself.

Method: Proposes Less Noise Sampling Framework (LENS): 1) Identifies and removes interference tokens from prompts, 2) Uses purified prompts for exploration, 3) Transfers successful rollouts to supervise policy optimization on original noisy prompts, enabling models to learn to ignore interference.

Result: LENS significantly outperforms GRPO with 3.88% average performance gain and over 1.6× speedup in convergence, demonstrating improved rollout efficiency and training stability.

Conclusion: Pruning interference tokens is critical for improving rollout efficiency in RLVR, offering a new perspective for RLVR research by addressing prompt noise rather than just task difficulty.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.

[496] Conditional Generative Framework with Peak-Aware Attention for Robust Chemical Detection under Interferences

Namkyung Yoon, Sanghong Kim, Hwangnam Kim

Main category: cs.LG

TL;DR: AI framework using peak-aware conditional GAN to generate synthetic GC-MS data for improved chemical substance discrimination under interference conditions

DetailsMotivation: GC-MS reliability deteriorates with interfering substances causing nonspecific peaks, time shifts, and noise, leading to reduced sensitivity and false alarms. Need to improve discrimination accuracy without conducting extensive real experiments.

Method: Proposes peak-aware conditional generative adversarial network (CGAN) framework with novel peak-aware mechanism highlighting characteristic GC-MS peaks. Encodes chemical and solvent information in latent vectors to generate synthetic GC-MS signals consistent with experimental conditions.

Result: Achieves cosine similarity and Pearson correlation coefficient values above 0.9, preserves peak number diversity, and reduces false alarms in discrimination models. Generated synthetic data validated through quantitative and qualitative evaluations.

Conclusion: The peak-aware conditional generative model effectively generates realistic GC-MS data under interference conditions, improving AI-based discrimination model performance without requiring extensive real experimental data collection.

Abstract: Gas chromatography-mass spectrometry (GC-MS) is a widely used analytical method for chemical substance detection, but measurement reliability tends to deteriorate in the presence of interfering substances. In particular, interfering substances cause nonspecific peaks, retention time shifts, and increased background noise, resulting in reduced sensitivity and false alarms. To overcome these challenges, we propose an artificial intelligence discrimination framework based on a peak-aware conditional generative model to improve the reliability of GC-MS measurements under interference conditions. The framework is trained with a novel peak-aware mechanism that highlights the characteristic peaks of GC-MS data, allowing it to generate important spectral features more faithfully. In addition, chemical and solvent information is encoded in the latent vector, allowing a conditional generative adversarial network (CGAN) to generate synthetic GC-MS signals consistent with the experimental conditions. This makes it possible to build datasets for interference scenarios in which real measurements are difficult to acquire, without conducting additional experiments. These data are used to train AI-based GC-MS discrimination models for accurate chemical substance discrimination. We conduct quantitative and qualitative evaluations of the generated data to verify the validity of the proposed framework, and we also verify how the generative model improves the performance of the AI discrimination framework. Representatively, the proposed method consistently achieves cosine similarity and Pearson correlation coefficient values above 0.9 while preserving peak number diversity and reducing false alarms in the discrimination model.

[497] Model-Free Neural State Estimation in Nonlinear Dynamical Systems: A Comparative Study of Neural Architectures and Classical Filters

Zhuochen Liu, Hans Walker, Rahul Jain

Main category: cs.LG

TL;DR: Neural network models can approach performance of strong nonlinear Kalman filters for state estimation without requiring system dynamics knowledge, with state-space models showing particular promise.

DetailsMotivation: There is a need to understand how neural network models behave as principled filters in nonlinear dynamical systems compared to classical filtering methods that rely on explicit system knowledge.

Method: Systematic empirical comparison between model-free neural network models (Transformer-based models, state-space neural networks, recurrent architectures) and classical filtering methods (particle filters, nonlinear Kalman filters) across multiple nonlinear scenarios.

Result: Neural models, particularly state-space models (SSMs), achieve state estimation performance approaching strong nonlinear Kalman filters in nonlinear scenarios and outperform weaker classical baselines despite lacking access to system models, while also attaining substantially higher inference throughput.

Conclusion: Neural network models can serve as effective state estimators without requiring explicit system dynamics knowledge, with state-space neural networks showing competitive performance to classical filtering methods while offering computational advantages.

Abstract: Neural network models are increasingly used for state estimation in control and decision-making problems, yet it remains unclear to what extent they behave as principled filters in nonlinear dynamical systems. Unlike classical filters, which rely on explicit knowledge of system dynamics and noise models, neural estimators can be trained purely from data without access to the underlying system equations. In this work, we present a systematic empirical comparison between such model-free neural network models and classical filtering methods across multiple nonlinear scenarios. Our study evaluates Transformer-based models, state-space neural networks, and recurrent architectures alongside particle filters and nonlinear Kalman filters. The results show that neural models (in particular, state-space models (SSMs)) achieve state estimation performance that approaches strong nonlinear Kalman filters in nonlinear scenarios and outperform weaker classical baselines despite lacking access to system models, while also attaining substantially higher inference throughput.

[498] EGAM: Extended Graph Attention Model for Solving Routing Problems

Licheng Wang, Yuzi Yan, Mingtao Huang, Yuan Shen

Main category: cs.LG

TL;DR: EGAM extends graph attention models for neural combinatorial optimization, using multi-head attention on both nodes and edges to improve routing problem solving.

DetailsMotivation: Existing graph attention models (GAM) for neural combinatorial optimization only consider node features, limiting their ability to handle complex routing problems with edge constraints. The authors aim to generalize the attention mechanism to incorporate both node and edge embeddings for better performance on constrained problems.

Method: Proposes Extended Graph Attention Model (EGAM) with multi-head dot-product attention that updates both node and edge embeddings. Uses autoregressive encoder-decoder architecture trained with policy gradient algorithms incorporating a specially designed baseline.

Result: EGAM matches or outperforms existing methods across various routing problems, demonstrating exceptional performance on highly constrained problems and showing efficiency in handling complex graph structures.

Conclusion: Extending graph attention to include edge embeddings significantly improves neural combinatorial optimization solvers, particularly for constrained routing problems, offering a more comprehensive approach to graph-based optimization.

Abstract: Neural combinatorial optimization (NCO) solvers, implemented with graph neural networks (GNNs), have introduced new approaches for solving routing problems. Trained with reinforcement learning (RL), the state-of-the-art graph attention model (GAM) achieves near-optimal solutions without requiring expert knowledge or labeled data. In this work, we generalize the existing graph attention mechanism and propose the extended graph attention model (EGAM). Our model utilizes multi-head dot-product attention to update both node and edge embeddings, addressing the limitations of the conventional GAM, which considers only node features. We employ an autoregressive encoder-decoder architecture and train it with policy gradient algorithms that incorporate a specially designed baseline. Experiments show that EGAM matches or outperforms existing methods across various routing problems. Notably, the proposed model demonstrates exceptional performance on highly constrained problems, highlighting its efficiency in handling complex graph structures.
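
As a rough illustration of attention that maintains both node and edge embeddings, the sketch below biases node-to-node attention logits with edge features and refreshes the edge embeddings from the pre-softmax scores. The real EGAM layer (multi-head structure, encoder-decoder, RL training with the designed baseline) is not reproduced, and all module names are hypothetical.

```python
import torch
import torch.nn as nn

class NodeEdgeAttention(nn.Module):
    """Minimal sketch of attention that updates node *and* edge embeddings.

    Edge features bias the node-to-node attention logits and are refreshed
    from the pre-softmax scores.  Illustrative only; not the EGAM layer.
    """
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.edge_bias = nn.Linear(d, 1)     # edge embedding -> attention bias
        self.edge_update = nn.Linear(1, d)   # attention score -> edge refresh

    def forward(self, nodes, edges):
        # nodes: (n, d), edges: (n, n, d)
        scores = self.q(nodes) @ self.k(nodes).T / nodes.shape[-1] ** 0.5
        scores = scores + self.edge_bias(edges).squeeze(-1)
        attn = torch.softmax(scores, dim=-1)
        new_nodes = attn @ self.v(nodes)
        new_edges = edges + self.edge_update(scores.unsqueeze(-1))
        return new_nodes, new_edges

nodes, edges = torch.randn(10, 32), torch.randn(10, 10, 32)
layer = NodeEdgeAttention(32)
nodes, edges = layer(nodes, edges)
```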

[499] DUET: Distilled LLM Unlearning from an Efficiently Contextualized Teacher

Yisheng Zhong, Zhengbang Yang, Zhuangdi Zhu

Main category: cs.LG

TL;DR: DUET is a distillation-based LLM unlearning method that uses a prompt-steered teacher model to efficiently remove undesirable knowledge while preserving general domain knowledge, achieving better forgetting and utility preservation with higher data efficiency.

DetailsMotivation: Current LLM unlearning methods have limitations: tuning-based approaches are computationally heavy and cause catastrophic forgetting, while in-contextualized methods are lightweight but vulnerable to prompt removal or reverse engineering attacks.

Method: DUET uses a distillation-based approach where a student model learns to imitate a prompt-steered teacher model. The teacher effectively refuses undesirable knowledge generation while preserving general domain knowledge, combining the strengths of both tuning-based and in-contextualized unlearning.

Result: Extensive evaluations show DUET achieves higher performance in both forgetting undesirable knowledge and preserving utility, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.

Conclusion: DUET provides an effective and efficient solution for LLM unlearning that addresses the limitations of existing methods, making it a promising approach for trustworthy AI development.

Abstract: LLM unlearning is a technique to remove the impacts of undesirable knowledge from the model without retraining from scratch, which is indispensable towards trustworthy AI. Existing unlearning methods face significant limitations: conventional tuning-based unlearning is computationally heavy and prone to catastrophic forgetting. In contrast, in-contextualized unlearning is lightweight for precise unlearning but vulnerable to prompt removal or reverse engineering attacks. In response, we propose Distilled Unlearning from an Efficient Teacher (DUET), a novel distillation-based unlearning method that combines the merits of these two lines of work. It learns a student model to imitate the behavior of a prompt-steered teacher that effectively refuses undesirable knowledge generation while preserving general domain knowledge. Extensive evaluations on existing benchmarks with our enriched evaluation protocols demonstrate that DUET achieves higher performance in both forgetting and utility preservation, while being orders of magnitude more data-efficient than state-of-the-art unlearning methods.
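
The core training signal described above is a distillation loss between a student run on the plain prompt and a teacher, the same base model run with a steering prefix that refuses forget-set content. Here is a minimal sketch with stand-in logits; the actual prompt, temperature, and loss weighting are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_unlearning_loss(student_logits, teacher_logits, temperature=1.0):
    """KL distillation of a student onto a prompt-steered teacher.

    In DUET-style unlearning the teacher is the same base model run with a
    steering prompt that refuses the forget-set content; the student learns
    to match that behaviour without the prompt.  Prompt, temperature, and
    weighting here are illustrative assumptions.
    """
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

# Toy usage with stand-in logits; in practice these come from forward passes
# of the student (plain prompt) and teacher (refusal-prefixed prompt) over
# the same forget-set queries.
vocab = 1000
student_logits = torch.randn(8, vocab, requires_grad=True)
teacher_logits = torch.randn(8, vocab)
loss = distillation_unlearning_loss(student_logits, teacher_logits)
loss.backward()
```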

[500] PILD: Physics-Informed Learning via Diffusion

Tianyi Zeng, Tianyi Wang, Jiaru Zhang, Zimo Zeng, Feiyang Zhang, Yiming Xu, Sikai Chen, Yajie Zou, Yangyang Wang, Junfeng Jiao, Christian Claudel, Xinbo Chen

Main category: cs.LG

TL;DR: PILD integrates physical constraints into diffusion models using virtual residual observations and conditional embedding modules for improved accuracy in scientific/engineering applications.

DetailsMotivation: Standard diffusion models are purely data-driven and lack physical constraints, limiting their applicability in scientific/engineering problems where physical laws must be followed.

Method: Introduces Physics-Informed Learning via Diffusion (PILD) with: 1) virtual residual observations from Laplace distribution to supervise generation, and 2) conditional embedding module to inject physical information into denoising network at multiple layers.

Result: PILD substantially improves accuracy, stability, and generalization over existing physics-informed and diffusion-based baselines across engineering/scientific tasks including vehicle trajectories, tire forces, Darcy flow, and plasma dynamics.

Conclusion: PILD provides a concise, modular framework that unifies diffusion modeling with physical constraints, making diffusion models more applicable to practical scientific and engineering problems.

Abstract: Diffusion models have emerged as powerful generative tools for modeling complex data distributions, yet their purely data-driven nature limits applicability in practical engineering and scientific problems where physical laws need to be followed. This paper proposes Physics-Informed Learning via Diffusion (PILD), a framework that unifies diffusion modeling and first-principles physical constraints by introducing a virtual residual observation sampled from a Laplace distribution to supervise generation during training. To further integrate physical laws, a conditional embedding module is incorporated to inject physical information into the denoising network at multiple layers, ensuring consistent guidance throughout the diffusion process. The proposed PILD framework is concise, modular, and broadly applicable to problems governed by ordinary differential equations, partial differential equations, as well as algebraic equations or inequality constraints. Extensive experiments across engineering and scientific tasks including estimating vehicle trajectories, tire forces, Darcy flow and plasma dynamics, demonstrate that our PILD substantially improves accuracy, stability, and generalization over existing physics-informed and diffusion-based baselines.
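
To make the "physics residual as an extra training term" idea concrete, here is a hedged toy sketch: a stand-in denoiser is trained with the usual noise-prediction loss plus the residual of a simple known ODE (x'' + omega^2 x = 0) evaluated on the predicted clean sample. PILD's actual constraints (port-Hamiltonian structure and the conditional embedding module) are richer and are not reproduced here.

```python
import torch

def physics_residual(x0_hat, dt=0.1, omega=1.0):
    """Hypothetical residual of a known ODE, x'' + omega^2 x = 0, evaluated
    on a predicted trajectory x0_hat of shape (batch, time) via finite
    differences.  Only illustrates the 'residual as extra loss' idea."""
    x = x0_hat
    x_tt = (x[:, 2:] - 2 * x[:, 1:-1] + x[:, :-2]) / dt ** 2
    return (x_tt + omega ** 2 * x[:, 1:-1]).pow(2).mean()

# Toy training step: standard denoising loss plus a weighted physics term.
batch, T = 16, 64
x0 = torch.sin(torch.linspace(0, 6.28, T)).repeat(batch, 1)
noise = torch.randn_like(x0)
x_t = 0.9 * x0 + 0.44 * noise                  # stand-in forward diffusion
denoiser = torch.nn.Sequential(torch.nn.Linear(T, 128), torch.nn.ReLU(),
                               torch.nn.Linear(128, T))
eps_hat = denoiser(x_t)
x0_hat = (x_t - 0.44 * eps_hat) / 0.9          # recover predicted clean sample
loss = torch.nn.functional.mse_loss(eps_hat, noise) + 0.1 * physics_residual(x0_hat)
loss.backward()
```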

[501] Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation

Ruifeng Zhang, Zexi Huang, Zikai Wang, Ke Sun, Bohang Zheng, Zhen Ouyang, Huimin Xie, Phil Shen, Junlin Zhang, Wentao Guo, Qinglei Wang

Main category: cs.LG

TL;DR: Zenith is a scalable ranking architecture for recommender systems that efficiently learns complex feature interactions with minimal runtime overhead, achieving significant performance gains in real-world deployment on TikTok Live.

DetailsMotivation: While capturing feature interactions is crucial for recommender systems, existing approaches struggle with efficient feature handling and scaling model capacity without excessive inference latency. There's a need for architectures that can learn complex interactions while maintaining practical runtime performance.

Method: Zenith uses a novel architecture with Prime Tokens, Token Fusion, and Token Boost modules to handle high-dimensional features efficiently. It focuses on improved token heterogeneity and exhibits superior scaling laws compared to other ranking methods.

Result: Deployed on TikTok Live, Zenith achieved +1.05%/-1.10% improvements in online CTR AUC and Logloss, with +9.93% gains in Quality Watch Session/User and +8.11% in Quality Watch Duration/User.

Conclusion: Zenith demonstrates that scalable and efficient ranking architectures can significantly improve recommendation performance in real-world applications while maintaining practical inference latency.

Abstract: Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, which exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.

[502] TimeSliver : Symbolic-Linear Decomposition for Explainable Time Series Classification

Akash Pandey, Payal Mohapatra, Wei Chen, Qi Zhu, Sinan Keten

Main category: cs.LG

TL;DR: TimeSliver: An explainable deep learning framework for time-series classification that jointly uses raw data and symbolic abstraction to create interpretable temporal attributions.

DetailsMotivation: Existing explainable methods for time-series classification have limitations: gradient-based methods suffer from reference state sensitivity and treat time points independently, while attention mechanisms often fail to provide faithful temporal importance measures. There's a need for better interpretable temporal attribution methods.

Method: TimeSliver jointly utilizes raw time-series data and its symbolic abstraction to construct a representation that maintains original temporal structure. Each element in this representation linearly encodes the contribution of each temporal segment to the final prediction, allowing assignment of meaningful importance scores to every time point.

Result: Outperforms other temporal attribution methods by 11% on 7 distinct synthetic and real-world multivariate time-series datasets. Achieves predictive performance within 2% of state-of-the-art baselines across 26 UEA benchmark datasets.

Conclusion: TimeSliver provides a strong and explainable framework for general time-series classification that offers faithful temporal importance attribution while maintaining competitive predictive performance.

Abstract: Identifying the extent to which every temporal segment influences a model’s predictions is essential for explaining model decisions and increasing transparency. While post-hoc explainable methods based on gradients and feature-based attributions have been popular, they suffer from reference state sensitivity and struggle to generalize across time-series datasets, as they treat time points independently and ignore sequential dependencies. Another perspective on explainable time-series classification is through interpretable components of the model, for instance, leveraging self-attention mechanisms to estimate temporal attribution; however, recent findings indicate that these attention weights often fail to provide faithful measures of temporal importance. In this work, we advance this perspective and present a novel explainability-driven deep learning framework, TimeSliver, which jointly utilizes raw time-series data and its symbolic abstraction to construct a representation that maintains the original temporal structure. Each element in this representation linearly encodes the contribution of each temporal segment to the final prediction, allowing us to assign a meaningful importance score to every time point. For time-series classification, TimeSliver outperforms other temporal attribution methods by 11% on 7 distinct synthetic and real-world multivariate time-series datasets. TimeSliver also achieves predictive performance within 2% of state-of-the-art baselines across 26 UEA benchmark datasets, positioning it as a strong and explainable framework for general time-series classification.

[503] Physics-Guided Tiny-Mamba Transformer for Reliability-Aware Early Fault Warning

Changyu Li, Dingcheng Huang, Kexuan Yao, Xiaoya Ni, Lijuan Shen, Fei Luo

Main category: cs.LG

TL;DR: PG-TMT: A compact tri-branch encoder combining depthwise-separable convolutions, Tiny-Mamba state-space modeling, and lightweight transformers for reliable rotating machinery prognostics with physics-guided explanations and EVT-calibrated decision rules.

DetailsMotivation: Need for reliable early warning signals in rotating machinery that work under nonstationary conditions, domain shifts, and class imbalance while maintaining low false-alarm rates for practical deployment in prognostics and health management.

Method: Physics-Guided Tiny-Mamba Transformer (PG-TMT) with three branches: 1) depthwise-separable convolutional stem for micro-transients, 2) Tiny-Mamba state-space branch for long-range dynamics, 3) lightweight local Transformer for cross-channel resonances. Uses analytic temporal-to-spectral mapping to align attention with bearing fault-order bands, and extreme-value theory (EVT) with dual-threshold hysteresis for reliable decision-making.

Result: Achieves higher precision-recall AUC (primary under imbalance), competitive or better ROC AUC, shorter mean time-to-detect at matched false-alarm intensity, and strong cross-domain transfer on CWRU, Paderborn, XJTU-SY, and industrial datasets.

Conclusion: PG-TMT provides calibrated, interpretable, and deployment-ready early warnings by coupling physics-aligned representations with EVT-calibrated decision rules for reliability-centric prognostics and health management.

Abstract: Reliability-centered prognostics for rotating machinery requires early warning signals that remain accurate under nonstationary operating conditions, domain shifts across speed/load/sensors, and severe class imbalance, while keeping the false-alarm rate small and predictable. We propose the Physics-Guided Tiny-Mamba Transformer (PG-TMT), a compact tri-branch encoder tailored for online condition monitoring. A depthwise-separable convolutional stem captures micro-transients, a Tiny-Mamba state-space branch models near-linear long-range dynamics, and a lightweight local Transformer encodes cross-channel resonances. We derive an analytic temporal-to-spectral mapping that ties the model’s attention spectrum to classical bearing fault-order bands, yielding a band-alignment score that quantifies physical plausibility and provides physics-grounded explanations. To ensure decision reliability, healthy-score exceedances are modeled with extreme-value theory (EVT), which yields an on-threshold achieving a target false-alarm intensity (events/hour); a dual-threshold hysteresis with a minimum hold time further suppresses chatter. Under a leakage-free streaming protocol with right-censoring of missed detections on CWRU, Paderborn, XJTU-SY, and an industrial pilot, PG-TMT attains higher precision-recall AUC (primary under imbalance), competitive or better ROC AUC, and shorter mean time-to-detect at matched false-alarm intensity, together with strong cross-domain transfer. By coupling physics-aligned representations with EVT-calibrated decision rules, PG-TMT delivers calibrated, interpretable, and deployment-ready early warnings for reliability-centric prognostics and health management.
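
The EVT calibration step corresponds to the classical peaks-over-threshold recipe, sketched below with SciPy: exceedances of healthy scores over a high quantile are fit with a generalized Pareto distribution, and the alarm threshold is then chosen to meet a target false-alarm intensity in events per hour. Parameter names and the tail fraction are illustrative assumptions, not PG-TMT's exact estimator.

```python
import numpy as np
from scipy.stats import genpareto

def evt_alarm_threshold(healthy_scores, target_rate_per_hour, samples_per_hour,
                        tail_fraction=0.05):
    """Calibrate an alarm threshold from healthy-condition scores with EVT.

    Exceedances over a high empirical quantile are fit with a generalized
    Pareto distribution (peaks-over-threshold); the alarm threshold is chosen
    so that the expected false-alarm intensity matches the target.  Generic
    POT recipe, not necessarily the paper's exact estimator.
    """
    u = np.quantile(healthy_scores, 1.0 - tail_fraction)     # POT base threshold
    excess = healthy_scores[healthy_scores > u] - u
    shape, loc, scale = genpareto.fit(excess, floc=0.0)
    # Per-sample exceedance probability implied by the target events/hour rate.
    p_target = target_rate_per_hour / samples_per_hour
    q = genpareto.ppf(1.0 - p_target / tail_fraction, shape, loc=0.0, scale=scale)
    return u + q

rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=50_000)             # stand-in healthy scores
thr = evt_alarm_threshold(scores, target_rate_per_hour=0.1, samples_per_hour=3600)
```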

[504] Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning

Anders Gjølbye, Ida Kargaard, Emma Kargaard, Lars Kai Hansen

Main category: cs.LG

TL;DR: PLS-SVD analysis under missing data shows phase transition behavior where informative singular vectors emerge only above critical signal-to-noise threshold attenuated by missingness.

DetailsMotivation: Multimodal datasets often have missing entries in both views, but PLS-SVD analysis typically assumes complete data. Need to understand how missing data affects PLS-SVD performance and when it can recover meaningful shared structure.

Method: Theoretical analysis of PLS-SVD under independent entry-wise missing-completely-at-random masking in proportional high-dimensional spiked model. Study masked cross-covariance as spiked rectangular random matrix with effective signal strength attenuated by √ρ (joint entry retention probability).

Result: PLS-SVD exhibits sharp BBP-type phase transition: below critical signal-to-noise threshold, leading singular vectors are asymptotically uninformative; above threshold, they achieve nontrivial alignment with latent shared directions with closed-form asymptotic overlap formulas.

Conclusion: Missing data significantly impacts PLS-SVD performance, with phase transition behavior determining when meaningful shared structure can be recovered. Results validated through simulations and semi-synthetic multimodal experiments.

Abstract: Partial Least Squares (PLS) learns shared structure from paired data via the top singular vectors of the empirical cross-covariance (PLS-SVD), but multimodal datasets often have missing entries in both views. We study PLS-SVD under independent entry-wise missing-completely-at-random masking in a proportional high-dimensional spiked model. After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by $\sqrt{\rho}$, where $\rho$ is the joint entry retention probability. As a result, PLS-SVD exhibits a sharp BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors are asymptotically uninformative, while above it they achieve nontrivial alignment with the latent shared directions, with closed-form asymptotic overlap formulas. Simulations and semi-synthetic multimodal experiments corroborate the predicted phase diagram and recovery curves across aspect ratios, signal strengths, and missingness levels.
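
A small simulation makes the attenuation mechanism tangible: mask both views entry-wise, rescale the masked cross-covariance by the joint retention probability, and check how well its leading singular vector aligns with the planted direction. This is an illustrative toy under assumed parameters, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q, rho, signal = 2000, 200, 150, 0.6, 4.0

# Rank-one shared structure between the two views plus noise.
u = rng.normal(size=p); u /= np.linalg.norm(u)
v = rng.normal(size=q); v /= np.linalg.norm(v)
z = rng.normal(size=(n, 1))
X = signal * z @ u[None, :] + rng.normal(size=(n, p))
Y = signal * z @ v[None, :] + rng.normal(size=(n, q))

# Independent entry-wise MCAR masks; per-view retention sqrt(rho) so that a
# matched (X, Y) pair survives jointly with probability rho.
Mx = rng.random((n, p)) < np.sqrt(rho)
My = rng.random((n, q)) < np.sqrt(rho)
Xm, Ym = X * Mx, Y * My

# Masked cross-covariance, rescaled by the joint retention probability.
C = (Xm.T @ Ym) / (n * rho)
u_hat = np.linalg.svd(C, full_matrices=False)[0][:, 0]
print("overlap |<u_hat, u>| =", abs(u_hat @ u))   # informative only above the threshold
```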

[505] Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, Linfeng Zhang

Main category: cs.LG

TL;DR: InfoUtil: A theoretical framework for dataset distillation that balances informativeness and utility using game theory and gradient analysis

DetailsMotivation: Current dataset distillation methods rely on heuristics without exploring the fundamental relationship between original and synthetic data, lacking solid theoretical foundations

Method: Introduces InfoUtil framework with two components: 1) game-theoretic informativeness maximization using Shapley Value to extract key information from samples, and 2) principled utility maximization by selecting globally influential samples based on Gradient Norm

Result: Achieves 6.1% performance improvement over previous state-of-the-art on ImageNet-1K dataset using ResNet-18

Conclusion: Provides a solid theoretical foundation for dataset distillation that balances informativeness and utility, leading to improved performance

Abstract: Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on ImageNet-1K dataset using ResNet-18.
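
The utility-maximization component rests on ranking samples by the norm of their loss gradient. A hedged toy sketch of that criterion follows; the Shapley-value informativeness attribution and the full InfoUtil pipeline are not reproduced, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn

def per_sample_grad_norms(model, X, y, loss_fn):
    """Per-sample loss-gradient norms with respect to model parameters.

    Samples with the largest norms are treated as the most 'influential';
    this only illustrates the gradient-norm utility criterion at a high level.
    """
    norms = []
    for xi, yi in zip(X, y):
        model.zero_grad()
        loss_fn(model(xi.unsqueeze(0)), yi.unsqueeze(0)).backward()
        g2 = sum(p.grad.pow(2).sum() for p in model.parameters() if p.grad is not None)
        norms.append(g2.sqrt().item())
    return torch.tensor(norms)

# Toy usage: keep the top-k highest-utility samples.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
X, y = torch.randn(64, 10), torch.randint(0, 3, (64,))
norms = per_sample_grad_norms(model, X, y, nn.CrossEntropyLoss())
topk = norms.topk(16).indices        # indices of the most influential samples
```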

[506] Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle

Zijun Chen, Zaiwei Chen, Nian Si, Shengbo Wang

Main category: cs.LG

TL;DR: The paper establishes optimal sample complexity bounds for synchronous and asynchronous Q-learning in average-reward MDPs using lazy dynamics transformation.

DetailsMotivation: Existing non-asymptotic results for average-reward MDPs either require strong assumptions to enforce contraction or rely on approximations that lead to suboptimal sample complexity. The absence of contraction in average-reward settings poses a fundamental challenge.

Method: Proposes a simple variant of Q-learning that samples from lazified dynamics (system remains in current state with fixed probability). Constructs an instance-dependent seminorm and shows that after lazy transformation, the Bellman operator becomes one-step contractive under this seminorm.

Result: Establishes optimal Õ(ε⁻²) sample complexity guarantees (up to logarithmic factors) for both synchronous and asynchronous Q-learning under a reachability assumption.

Conclusion: The lazy dynamics transformation enables achieving optimal sample complexity for average-reward MDPs by creating a contractive structure that overcomes the fundamental challenge of contraction absence in these settings.

Abstract: We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
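
A toy sketch of the central algorithmic ingredient, sampling from lazified dynamics in which the chain stays put with some fixed probability, is shown below using a relative-value-iteration style Q-learning update with a fixed reference state. The paper's step sizes, reward handling under lazification, and seminorm-based analysis are not reproduced here.

```python
import numpy as np

def lazy_qlearning(P, R, lam=0.5, steps=2_000, alpha=0.05, seed=0):
    """Synchronous average-reward Q-learning on lazified dynamics.

    With probability `lam` the system stays in its current state, otherwise
    it transitions according to P.  The relative-value-iteration style update
    with reference state 0 is an illustrative choice, not the paper's rule.
    """
    rng = np.random.default_rng(seed)
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(steps):
        for s in range(nS):
            for a in range(nA):
                if rng.random() < lam:
                    s_next = s                                   # lazy self-loop
                else:
                    s_next = rng.choice(nS, p=P[s, a])
                target = R[s, a] + Q[s_next].max() - Q[0].max()  # reference state 0
                Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy 3-state, 2-action MDP with random dynamics and rewards.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(3), size=(3, 2))
R = rng.random((3, 2))
Q = lazy_qlearning(P, R)
```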

[507] The Surprising Difficulty of Search in Model-Based Reinforcement Learning

Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto

Main category: cs.LG

TL;DR: Search in model-based RL is not a simple plug-and-play replacement for learned policies, and can harm performance even with accurate models; distribution shift mitigation is more critical than improving model or value function accuracy.

DetailsMotivation: The paper challenges conventional wisdom in model-based RL that long-term predictions and compounding errors are the main obstacles, and investigates why search often fails to improve performance despite accurate models.

Method: The authors systematically analyze search in model-based RL, identifying that distribution shift is the key issue rather than model accuracy. They develop techniques to mitigate distribution shift and enable effective search.

Result: The paper shows that search can harm performance even with highly accurate models, and that their distribution shift mitigation techniques enable effective search, achieving state-of-the-art performance across multiple benchmark domains.

Conclusion: Distribution shift is a more critical challenge than model accuracy for enabling effective search in model-based RL, and addressing it leads to significant performance improvements.

Abstract: This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a plug-and-play replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating distribution shift matters more than improving model or value function accuracy. Building on this insight, we identify key techniques for enabling effective search, achieving state-of-the-art performance across multiple popular benchmark domains.

[508] Transferable Graph Condensation from the Causal Perspective

Huaming Du, Yijie Huang, Su Yao, Yiying Wang, Yueyang Zhou, Jingwen Yang, Jinshi Zhang, Han Ji, Yu Zhao, Guisong Liu, Hegui Zhang, Carl Yang, Gang Kou

Main category: cs.LG

TL;DR: TGCC is a causal-invariance-based graph dataset condensation method that creates transferable condensed datasets for cross-task and cross-domain scenarios by preserving domain-invariant causal features.

DetailsMotivation: Existing graph dataset condensation methods require downstream applications to match the original dataset and task, failing in cross-task and cross-domain scenarios. There's a need for transferable condensed datasets that preserve essential causal information across different tasks and domains.

Method: TGCC extracts domain causal-invariant features from spatial domain using causal interventions, performs enhanced condensation operations to capture structural and feature information, and injects causal-invariant features into condensed graphs through spectral-domain enhanced contrastive learning.

Result: TGCC achieves up to 13.41% improvement in cross-task and cross-domain scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in single dataset/task scenarios.

Conclusion: TGCC provides an effective and transferable graph dataset condensation method that preserves causal information and works well in complex cross-task and cross-domain scenarios.

Abstract: The increasing scale of graph datasets has significantly improved the performance of graph representation learning methods, but it has also introduced substantial training challenges. Graph dataset condensation techniques have emerged to compress large datasets into smaller yet information-rich datasets, while maintaining similar test performance. However, these methods strictly require downstream applications to match the original dataset and task, which often fails in cross-task and cross-domain scenarios. To address these challenges, we propose a novel causal-invariance-based and transferable graph dataset condensation method, named TGCC, providing effective and transferable condensed datasets. Specifically, to preserve domain-invariant knowledge, we first extract domain causal-invariant features from the spatial domain of the graph using causal interventions. Then, to fully capture the structural and feature information of the original graph, we perform enhanced condensation operations. Finally, through spectral-domain enhanced contrastive learning, we inject the causal-invariant features into the condensed graph, ensuring that the compressed graph retains the causal information of the original graph. Experimental results on five public datasets and our novel FinReport dataset demonstrate that TGCC achieves up to a 13.41% improvement in cross-task and cross-domain complex scenarios compared to existing methods, and achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.

[509] Few-Shot Learning for Dynamic Operations of Automated Electric Taxi Fleets under Evolving Charging Infrastructure: A Meta-Deep Reinforcement Learning Approach

Xiaozhuang Li, Xindi Tang, Fang He

Main category: cs.LG

TL;DR: GAT-PEARL: A meta-reinforcement learning framework using graph attention networks and probabilistic embeddings for adaptive management of autonomous electric taxi fleets in dynamic charging environments.

DetailsMotivation: Existing research assumes static charging networks for autonomous electric taxi fleets, creating a gap between theoretical models and real-world operations with dynamic, uncertain charging availability.

Method: Proposes GAT-PEARL framework integrating graph attention networks (GAT) to extract spatial representations and model spatiotemporal relationships, combined with probabilistic embeddings for actor-critic reinforcement learning (PEARL) for rapid adaptation to charging network changes.

Result: Outperforms conventional reinforcement learning baselines in simulations on real-world Chengdu data, showing superior generalization to unseen infrastructure layouts and higher operational efficiency in dynamic settings.

Conclusion: GAT-PEARL effectively bridges the gap between theoretical models and real-world operations by enabling adaptive fleet management in dynamic charging environments without retraining.

Abstract: With the rapid expansion of electric vehicles (EVs) and charging infrastructure, the effective management of Autonomous Electric Taxi (AET) fleets faces a critical challenge in environments with dynamic and uncertain charging availability. While most existing research assumes a static charging network, this simplification creates a significant gap between theoretical models and real-world operations. To bridge this gap, we propose GAT-PEARL, a novel meta-reinforcement learning framework that learns an adaptive operational policy. Our approach integrates a graph attention network (GAT) to effectively extract robust spatial representations under infrastructure layouts and model the complex spatiotemporal relationships of the urban environment, and employs probabilistic embeddings for actor-critic reinforcement learning (PEARL) to enable rapid, inference-based adaptation to changes in charging network layouts without retraining. Through extensive simulations on real-world data in Chengdu, China, we demonstrate that GAT-PEARL significantly outperforms conventional reinforcement learning baselines, showing superior generalization to unseen infrastructure layouts and achieving higher overall operational efficiency in dynamic settings.

[510] Distributionally Robust Classification for Multi-source Unsupervised Domain Adaptation

Seonghwi Kim, Sung Ho Jo, Wooseok Ha, Minwoo Chae

Main category: cs.LG

TL;DR: A distributionally robust learning framework for unsupervised domain adaptation that handles uncertainty in both covariate and conditional label distributions, particularly effective when target data is scarce.

DetailsMotivation: Existing UDA methods struggle with limited target domain data and spurious correlations in source domains, motivating a more robust approach to handle distributional uncertainty.

Method: Proposes a distributionally robust learning framework modeling uncertainty in both covariate and conditional label distributions, with an efficient algorithm compatible with existing UDA methods.

Result: Extensive experiments show consistent outperformance of strong baselines, especially when target data is extremely scarce across various distribution shift scenarios.

Conclusion: The proposed framework effectively addresses UDA challenges with limited target data and spurious correlations, offering a versatile solution applicable to both single-source and multi-source scenarios.

Abstract: Unsupervised domain adaptation (UDA) is a statistical learning problem when the distribution of training (source) data is different from that of test (target) data. In this setting, one has access to labeled data only from the source domain and unlabeled data from the target domain. The central objective is to leverage the source data and the unlabeled target data to build models that generalize to the target domain. Despite its potential, existing UDA approaches often struggle in practice, particularly in scenarios where the target domain offers only limited unlabeled data or spurious correlations dominate the source domain. To address these challenges, we propose a novel distributionally robust learning framework that models uncertainty in both the covariate distribution and the conditional label distribution. Our approach is motivated by the multi-source domain adaptation setting but is also directly applicable to the single-source scenario, making it versatile in practice. We develop an efficient learning algorithm that can be seamlessly integrated with existing UDA methods. Extensive experiments under various distribution shift scenarios show that our method consistently outperforms strong baselines, especially when target data are extremely scarce.

[511] Heterogeneous Vertiport Selection Optimization for On-Demand Air Taxi Services: A Deep Reinforcement Learning Approach

Aoyu Pang, Maonan Wang, Zifan Sha, Wenwei Yue, Changle Li, Chung Shue Chen, Man-On Pun

Main category: cs.LG

TL;DR: Proposes a unified air-ground mobility coordination framework using deep reinforcement learning and V2X communication to optimize vertiport selection and air taxi routing, reducing travel time by 34% compared to conventional methods.

DetailsMotivation: Urban Air Mobility (UAM) can alleviate urban congestion but requires integration with ground transportation for efficient door-to-door travel. Current research lacks systematic exploration of optimal integrated routing strategies for air-ground mobility systems.

Method: Proposes a unified optimization model integrating air and ground transportation strategy selection, capturing dynamic multimodal transport networks with real-time traffic conditions and passenger decision-making. Develops Unified Air-Ground Mobility Coordination (UAGMC) framework using deep reinforcement learning and V2X communication to optimize vertiport selection and dynamically plan air taxi routes.

Result: UAGMC achieves 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency in multimodal transportation systems.

Conclusion: The work lays a foundation for advancing intelligent urban mobility solutions through air-ground transportation coordination, demonstrating significant efficiency improvements through integrated optimization.

Abstract: Urban Air Mobility (UAM) has emerged as a transformative solution to alleviate urban congestion by utilizing low-altitude airspace, thereby reducing pressure on ground transportation networks. To enable truly efficient and seamless door-to-door travel experiences, UAM requires close integration with existing ground transportation infrastructure. However, current research on optimal integrated routing strategies for passengers in air-ground mobility systems remains limited, with a lack of systematic exploration. To address this gap, we first propose a unified optimization model that integrates strategy selection for both air and ground transportation. This model captures the dynamic characteristics of multimodal transport networks and incorporates real-time traffic conditions alongside passenger decision-making behavior. Building on this model, we propose a Unified Air-Ground Mobility Coordination (UAGMC) framework, which leverages deep reinforcement learning (RL) and Vehicle-to-Everything (V2X) communication to optimize vertiport selection and dynamically plan air taxi routes. Experimental results demonstrate that UAGMC achieves a 34% reduction in average travel time compared to conventional proportional allocation methods, enhancing overall travel efficiency and providing novel insights into the integration and optimization of multimodal transportation systems. This work lays a solid foundation for advancing intelligent urban mobility solutions through the coordination of air and ground transportation modes. The related code can be found at https://github.com/Traffic-Alpha/UAGMC.

[512] Adversarial Vulnerability Transcends Computational Paradigms: Feature Engineering Provides No Defense Against Neural Adversarial Transfer

Achraf Hsain, Ahmed Abdelkader, Emmanuel Baldwin Mbaya, Hamoud Aljamaan

Main category: cs.LG

TL;DR: Adversarial examples transfer from DNNs to classical ML classifiers using HOG features, refuting the hypothesis that feature engineering protects against attacks.

DetailsMotivation: To investigate whether classical ML pipelines with handcrafted features (like HOG) are vulnerable to adversarial examples generated from neural network surrogates, testing the hypothesis that feature engineering creates information bottlenecks that filter adversarial signals.

Method: Used VGG16 as surrogate to generate FGSM and PGD adversarial examples on CIFAR-10, then tested transfer to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network across eight HOG configurations.

Result: All classifiers suffered 16.6%-59.1% relative accuracy drops, comparable to neural-to-neural transfer. Surprisingly, FGSM caused greater degradation than PGD in 100% of classical ML cases (attack hierarchy reversal), and block normalization provided only partial mitigation.

Conclusion: Adversarial vulnerability is not an artifact of end-to-end differentiability but a fundamental property of image classification systems, with implications for security-critical deployments across computational paradigms.

Abstract: Deep neural networks are vulnerable to adversarial examples–inputs with imperceptible perturbations causing misclassification. While adversarial transfer within neural networks is well-documented, whether classical ML pipelines using handcrafted features inherit this vulnerability when attacked via neural surrogates remains unexplored. Feature engineering creates information bottlenecks through gradient quantization and spatial binning, potentially filtering high-frequency adversarial signals. We evaluate this hypothesis through the first comprehensive study of adversarial transfer from DNNs to HOG-based classifiers. Using VGG16 as a surrogate, we generate FGSM and PGD adversarial examples and test transfer to four classical classifiers (KNN, Decision Tree, Linear SVM, Kernel SVM) and a shallow neural network across eight HOG configurations on CIFAR-10. Our results strongly refute the protective hypothesis: all classifiers suffer 16.6%-59.1% relative accuracy drops, comparable to neural-to-neural transfer. More surprisingly, we discover attack hierarchy reversal–contrary to patterns where iterative PGD dominates FGSM within neural networks, FGSM causes greater degradation than PGD in 100% of classical ML cases, suggesting iterative attacks overfit to surrogate-specific features that don’t survive feature extraction. Block normalization provides partial but insufficient mitigation. These findings demonstrate that adversarial vulnerability is not an artifact of end-to-end differentiability but a fundamental property of image classification systems, with implications for security-critical deployments across computational paradigms.
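
A hedged end-to-end toy of the transfer pipeline: craft FGSM examples on a surrogate network, extract HOG features, and measure how a linear SVM's accuracy drops. The paper uses VGG16 on CIFAR-10 with eight HOG configurations; the tiny surrogate, random data, and single HOG setting below are stand-ins for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog
from sklearn.svm import LinearSVC

# Tiny stand-in surrogate and random "images"; the paper uses VGG16 on CIFAR-10.
torch.manual_seed(0)
surrogate = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                          nn.Flatten(), nn.Linear(8 * 32 * 32, 10))
images = torch.rand(64, 3, 32, 32)
labels = torch.randint(0, 10, (64,))

# FGSM on the surrogate: one signed-gradient step of size epsilon = 8/255.
images.requires_grad_(True)
loss = nn.functional.cross_entropy(surrogate(images), labels)
loss.backward()
adv = (images + 8 / 255 * images.grad.sign()).clamp(0, 1).detach()

def hog_features(batch):
    """HOG features per image (grayscale obtained by channel averaging)."""
    gray = batch.mean(dim=1).numpy()
    return np.stack([hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for img in gray])

clf = LinearSVC().fit(hog_features(images.detach()), labels.numpy())
clean_acc = clf.score(hog_features(images.detach()), labels.numpy())
adv_acc = clf.score(hog_features(adv), labels.numpy())   # transfer-attack accuracy
```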

[513] An introductory Generalization of the standard SVMs loss and its applications to Shallow and Deep Neural Networks

Filippo Portera

Main category: cs.LG

TL;DR: Proposes new convex loss functions for SVMs that incorporate pattern correlations to improve generalization, with preliminary experiments showing promising results compared to standard losses.

DetailsMotivation: To develop improved SVM loss functions that incorporate pattern correlations within the loss formulation to enhance generalization performance beyond standard SVM losses.

Method: Mathematical derivation of dual problems for new convex loss functions for both binary classification and regression SVMs, followed by experimental validation on small datasets due to SVM scalability limitations.

Result: Generalization measures are never worse than standard losses and often better, demonstrating the potential of incorporating pattern correlations in loss functions.

Conclusion: The proposed loss shows promise and warrants further study, particularly when coupled with shallow and deep neural networks, where some novel results have already been obtained.

Abstract: We propose a new convex loss for SVMs, both for the binary classification and for the regression models. Therefore, we show the mathematical derivation of the dual problems and we experiment them with several small data-sets. The minimal dimension of those data-sets is due to the difficult scalability of the SVM method to bigger instances. This preliminary study should prove that using pattern correlations inside the loss function could enhance the generalisation performances. Coherently, results show that generalisation measures are never worse than the standard losses and several times they are better. In our opinion, it should be considered a careful study of this loss, coupled with shallow and deep neural networks. In fact, we present some novel results obtained with those architectures.

[514] Memorization Control in Diffusion Models from Denoising-centric Perspective

Thuy Phuong Vu, Mai Viet Hoang Do, Minhhuy Le, Dinh-Cuong Hoang, Phan Xuan Tan

Main category: cs.LG

TL;DR: Proposes a timestep sampling strategy for diffusion models that controls memorization by shifting learning emphasis toward later denoising steps, improving distributional alignment.

DetailsMotivation: Existing approaches to controlling memorization in diffusion models focus on data-centric or model-centric modifications, treating the model as an isolated predictor. The paper aims to study memorization from a denoising-centric perspective, showing that uniform timestep sampling leads to unequal learning contributions across steps due to SNR differences, biasing training toward memorization.

Method: Proposes a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, the method provides direct control over the memorization-generalization trade-off. The approach shifts learning emphasis toward later denoising steps to reduce memorization.
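A minimal sketch of the general idea, assuming a DDPM-style epsilon-prediction objective: the only change from standard training is that timesteps are drawn from a width-controlled sub-interval of the trajectory rather than uniformly. The interval parameters below stand in for the paper's confidence-interval width.

```python
# Minimal sketch (not the authors' implementation): a DDPM-style training step where
# the timestep is drawn from a biased, width-controlled interval instead of uniformly,
# concentrating the learning signal on a chosen part of the denoising trajectory.
# The interval bounds (lo_frac, hi_frac) are assumptions.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def sample_timesteps(batch_size, lo_frac=0.5, hi_frac=1.0):
    """Draw timesteps only from the sub-interval [lo_frac*T, hi_frac*T)."""
    lo, hi = int(lo_frac * T), int(hi_frac * T)
    return torch.randint(lo, hi, (batch_size,))

def training_step(model, x0, opt):
    """x0: (batch, dim) flattened data; model(x_t, t) predicts the added noise."""
    t = sample_timesteps(x0.shape[0])                     # biased instead of uniform
    a_bar = alphas_bar[t].view(-1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward noising
    loss = nn.functional.mse_loss(model(x_t, t), noise)   # standard eps-prediction loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```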

Result: Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of the approach.

Conclusion: The denoising-centric perspective provides a principled way to control memorization in diffusion models through timestep sampling, offering direct control over the memorization-generalization trade-off without requiring data or model modifications.

Abstract: Controlling memorization in diffusion models is critical for applications that require generated data to closely match the training distribution. Existing approaches mainly focus on data centric or model centric modifications, treating the diffusion model as an isolated predictor. In this paper, we study memorization in diffusion models from a denoising centric perspective. We show that uniform timestep sampling leads to unequal learning contributions across denoising steps due to differences in signal to noise ratio, which biases training toward memorization. To address this, we propose a timestep sampling strategy that explicitly controls where learning occurs along the denoising trajectory. By adjusting the width of the confidence interval, our method provides direct control over the memorization generalization trade off. Experiments on image and 1D signal generation tasks demonstrate that shifting learning emphasis toward later denoising steps consistently reduces memorization and improves distributional alignment with training data, validating the generality and effectiveness of our approach.

[515] L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

Minghao Yang, Ren Togo, Guang Li, Takahiro Ogawa, Miki Haseyama

Main category: cs.LG

TL;DR: L2R is a novel routing framework for Mixture-of-Experts models that improves routing stability and expert specialization through low-rank latent routing space and Lipschitz-controlled scoring.

DetailsMotivation: Current MoE systems use linear routers in high-dimensional spaces, which suffer from representation mismatch, angular concentration, and scale-sensitive scoring, undermining routing discriminability and expert specialization stability.

Method: Proposes Low-rank & Lipschitz-controlled Routing (L2R) with: 1) Shared low-rank latent routing space for expert assignment, 2) Saturated Inner-Product Scoring (SIPS) to control Lipschitz behavior, 3) Parameter-efficient multi-anchor routing mechanism for enhanced expert expressiveness.
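A hedged sketch of what such a router could look like: hidden states are projected into a shared low-rank routing space and scored against expert anchors through a bounded (here tanh-saturated) inner product, which caps the score's sensitivity to input scale. The saturation function, rank, and anchor count are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a low-rank, saturation-bounded MoE router in the spirit of L2R.
# The tanh saturation is one plausible reading of "Saturated Inner-Product Scoring".
import torch
import torch.nn as nn

class LowRankSaturatedRouter(nn.Module):
    def __init__(self, d_model, n_experts, rank=32, top_k=2, tau=1.0):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)   # shared low-rank routing space
        self.anchors = nn.Parameter(torch.randn(n_experts, rank) / rank ** 0.5)
        self.top_k, self.tau = top_k, tau

    def forward(self, h):                                  # h: (tokens, d_model)
        z = self.down(h)                                   # (tokens, rank)
        scores = torch.tanh(z @ self.anchors.t() / self.tau)  # bounded, Lipschitz-controlled
        gate = scores.softmax(dim=-1)
        top_w, top_idx = gate.topk(self.top_k, dim=-1)
        return top_w / top_w.sum(-1, keepdim=True), top_idx

router = LowRankSaturatedRouter(d_model=1024, n_experts=16)
weights, experts = router(torch.randn(8, 1024))            # per-token expert assignments
```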

Result: Extensive experiments on large-scale language MoE models and vision MoE on ImageNet show consistent improvements in routing stability, expert specialization, and overall model performance.

Conclusion: L2R provides a unified routing framework that addresses fundamental limitations of current MoE routing systems, leading to more stable and effective expert specialization across both language and vision domains.

Abstract: Mixture-of-Experts (MoE) models scale neural networks by conditionally activating a small subset of experts, where the router plays a central role in determining expert specialization and overall model performance. However, many modern MoE systems still adopt linear routers in raw high-dimensional representation spaces, where representation mismatch, angular concentration, and scale-sensitive scoring can jointly undermine routing discriminability and stable expert specialization. In this work, we propose Low-rank & Lipschitz-controlled Routing (L2R), a unified routing framework that reshapes both the routing space and scoring geometry. L2R performs expert assignment in a shared low-rank latent routing space and introduces Saturated Inner-Product Scoring (SIPS) to explicitly control the Lipschitz behavior of routing functions, yielding smoother and more stable routing geometry. In addition, L2R incorporates a parameter-efficient multi-anchor routing mechanism to enhance expert expressiveness. Extensive experiments on a large-scale language MoE model and a vision MoE setting on ImageNet demonstrate that L2R consistently improves routing stability, expert specialization, and overall model performance.

[516] Factored Causal Representation Learning for Robust Reward Modeling in RLHF

Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Fan Feng, Biwei Huang, Shikui Tu, Lei Xu

Main category: cs.LG

TL;DR: A causal approach to reward modeling that decomposes embeddings into causal and non-causal factors to prevent reward hacking in RLHF.

DetailsMotivation: Standard reward models for RLHF are vulnerable to spurious correlations with non-causal features (like text length or sycophantic bias), leading to reward hacking where high predicted rewards don't correspond to better behavior.

Method: Proposes a factored representation learning framework that decomposes contextual embeddings into causal factors (sufficient for reward prediction) and non-causal factors (capturing reward-irrelevant attributes). The reward head is constrained to use only causal factors, while an adversarial head tries to predict reward from non-causal factors with gradient reversal to discourage encoding reward-relevant information.
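The factored architecture is easy to sketch with a gradient-reversal layer; the snippet below is an assumed minimal version, not the authors' implementation. The reward head reads only the causal slice of a (trainably projected) embedding, while the adversarial head reads the non-causal slice through gradient reversal, so learning to predict reward there pushes the encoder to strip that information out.

```python
# Assumed minimal sketch of the factored reward model with gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None          # flip the gradient flowing upstream

class FactoredRewardModel(nn.Module):
    def __init__(self, d_model, d_causal):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)      # stands in for trainable backbone layers
        self.d_causal = d_causal
        self.reward_head = nn.Linear(d_causal, 1)            # causal factors only
        self.adv_head = nn.Linear(d_model - d_causal, 1)     # non-causal factors

    def forward(self, emb, lam=1.0):                         # emb: (batch, d_model)
        z = self.proj(emb)
        z_c, z_nc = z[:, :self.d_causal], z[:, self.d_causal:]
        r = self.reward_head(z_c).squeeze(-1)
        r_adv = self.adv_head(GradReverse.apply(z_nc, lam)).squeeze(-1)
        return r, r_adv

def loss_fn(model, emb_chosen, emb_rejected, lam=1.0):
    """Pairwise preference loss on the causal reward plus an adversarial term that the
    reversal turns into a penalty on reward-relevant non-causal features."""
    r_c, adv_c = model(emb_chosen, lam)
    r_r, adv_r = model(emb_rejected, lam)
    pref = -nn.functional.logsigmoid(r_c - r_r).mean()
    adv = -nn.functional.logsigmoid(adv_c - adv_r).mean()
    return pref + adv
```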

Result: Experiments on mathematical and dialogue tasks show the method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses confirm effectiveness in mitigating reward hacking behaviors related to length and sycophantic bias.

Conclusion: The causal perspective and factored representation approach effectively address reward hacking in RLHF by separating causal from non-causal factors, leading to more reliable reward models and better alignment with human preferences.

Abstract: A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model’s contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying gradient reversal to discourage them from encoding reward-relevant information. Experiments on both mathematical and dialogue tasks demonstrate that our method learns more robust reward models and consistently improves downstream RLHF performance over state-of-the-art baselines. Analyses on length and sycophantic bias further validate the effectiveness of our method in mitigating reward hacking behaviors.

[517] Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving

Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou

Main category: cs.LG

TL;DR: AFD architecture separates Attention and FFN computation for LLM decoding, requiring optimal resource allocation to minimize idle time and maximize throughput.

DetailsMotivation: Attention-FFN disaggregation enables independent scaling of memory and compute resources for LLM decoding, but performance is highly sensitive to the Attention/FFN provisioning ratio, with mis-sizing causing blocking and device idle time.

Method: Developed a tractable analytical framework for sizing AFD bundles in rA-1F topology using a probabilistic workload model to derive closed-form rules for optimal A/F ratio that maximizes average throughput.

Result: The theoretical optimal A/F ratio matches simulation-optimal within 10% across workloads and consistently reduces idle time in trace-calibrated AFD simulations.

Conclusion: The analytical framework provides effective closed-form rules for optimal resource allocation in AFD architectures, enabling better throughput and reduced idle time in LLM decoding systems.

Abstract: Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop a tractable analytical framework for sizing AFD bundles in an $r$A-$1$F topology, where the key difficulty is that Attention-side work is nonstationary-token context grows and requests are continuously replenished with random lengths-while FFN work is stable given the aggregated batch. Using a probabilistic workload model, we derive closed-form rules for the optimal A/F ratio that maximize average throughput per instance across the system. A trace-calibrated AFD simulator validates the theory: across workloads, the theoretical optimal A/F ratio matches the simulation-optimal within 10%, and consistently reduces idle time.

[518] Expected Improvement via Gradient Norms

Joshua Hang Sai Ip, Georgios Makrygiorgos, Ali Mesbah

Main category: cs.LG

TL;DR: Proposes EI-GN, a novel Bayesian Optimization acquisition function that uses gradient information to balance exploration and exploitation by promoting sampling in regions approaching stationarity.

DetailsMotivation: Standard Expected Improvement (EI) in Bayesian Optimization is known to be overly exploitative and can converge to suboptimal stationary points. There's a need for acquisition functions that better balance exploration and exploitation using gradient information.

Method: EI-GN applies the improvement principle to a gradient-aware auxiliary objective that considers both function values and gradient norms. It uses gradient-enhanced surrogate models to enable principled gradient inference from function evaluations, with a tractable closed-form expression for efficient optimization.
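For reference, the improvement principle that EI-GN builds on has the standard closed form below; the gradient-aware auxiliary objective shown next to it (value minus a weighted posterior-mean gradient norm) is only an assumed stand-in for the paper's construction, which derives its own closed-form acquisition.

```python
# Standard EI closed form plus an assumed gradient-aware auxiliary objective.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization under a Gaussian posterior (mu, sigma)."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

def gradient_aware_objective(mu, grad_mu, lam=1.0):
    """Auxiliary objective: high posterior value and small gradient norm (near-stationarity).
    The exact combination used by EI-GN follows the paper; this form is an assumption."""
    return mu - lam * np.linalg.norm(grad_mu, axis=-1)

# With a gradient-enhanced GP surrogate, candidates would be scored by applying the
# improvement principle to gradient_aware_objective rather than to f directly.
```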

Result: Empirical evaluations on standard BO benchmarks show EI-GN yields consistent improvements against standard baselines. The method is also demonstrated to be applicable to control policy learning problems.

Conclusion: EI-GN provides a novel gradient-aware acquisition function that improves upon standard EI by better balancing exploration and exploitation through consideration of gradient information, with practical applications in optimization and control problems.

Abstract: Bayesian Optimization (BO) is a principled approach for optimizing expensive black-box functions, with Expected Improvement (EI) being one of the most widely used acquisition functions. Despite its empirical success, EI is known to be overly exploitative and can converge to suboptimal stationary points. We propose Expected Improvement via Gradient Norms (EI-GN), a novel acquisition function that applies the improvement principle to a gradient-aware auxiliary objective, thereby promoting sampling in regions that are both high-performing and approaching first-order stationarity. EI-GN relies on gradient observations used to learn gradient-enhanced surrogate models that enable principled gradient inference from function evaluations. We derive a tractable closed-form expression for EI-GN that allows efficient optimization and show that the proposed acquisition is consistent with the improvement-based acquisition framework. Empirical evaluations on standard BO benchmarks demonstrate that EI-GN yields consistent improvements against standard baselines. We further demonstrate applicability of EI-GN to control policy learning problems.

[519] Graph-Free Root Cause Analysis

Luan Pham

Main category: cs.LG

TL;DR: PRISM is a framework for Root Cause Analysis in complex systems without dependency graphs, addressing fault propagation issues where small root cause delays accumulate into larger downstream anomalies.

DetailsMotivation: Existing RCA methods without dependency graphs incorrectly assume the root cause has the highest anomaly score, failing when faults propagate and small delays at the root accumulate into larger anomalies downstream.

Method: PRISM formulates a class of component-based systems and performs RCA with theoretical guarantees, operating without dependency graphs while addressing fault propagation issues.

Result: On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8ms per diagnosis.

Conclusion: PRISM provides a simple and efficient framework for RCA without dependency graphs, effectively handling fault propagation scenarios with strong theoretical guarantees and practical performance.

Abstract: Failures in complex systems demand rapid Root Cause Analysis (RCA) to prevent cascading damage. Existing RCA methods that operate without dependency graph typically assume that the root cause having the highest anomaly score. This assumption fails when faults propagate, as a small delay at the root cause can accumulate into a much larger anomaly downstream. In this paper, we propose PRISM, a simple and efficient framework for RCA when the dependency graph is absent. We formulate a class of component-based systems under which PRISM performs RCA with theoretical guarantees. On 735 failures across 9 real-world datasets, PRISM achieves 68% Top-1 accuracy, a 258% improvement over the best baseline, while requiring only 8ms per diagnosis.

[520] Perceptrons and localization of attention’s mean-field landscape

Antonio Álvarez-López, Borjan Geshkovski, Domènec Ruiz-Balet

Main category: cs.LG

TL;DR: The paper analyzes Transformer forward passes as interacting particle systems on the unit sphere, showing that perceptron blocks lead to atomic, localized critical points on subsets of the sphere.

DetailsMotivation: To understand the mathematical structure of Transformers by viewing them as interacting particle systems on the unit sphere, and to analyze how perceptron blocks affect the system's dynamics and critical points.

Method: Theoretical analysis using mathematical tools from interacting particle systems, gradient flows, and mean-field theory. The approach treats token embeddings as particles on the unit sphere, layers as time steps, and analyzes the system’s dynamics through Wasserstein gradient flows.

Result: Shows that critical points of the system with perceptron blocks are generically atomic (discrete rather than continuous) and localized on subsets of the sphere, providing insight into the geometric structure of Transformer representations.

Conclusion: The perceptron block in Transformers induces specific geometric properties in the representation space, leading to discrete, localized critical points that may have implications for understanding Transformer dynamics and optimization.

Abstract: The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

[521] From Consistency to Complementarity: Aligned and Disentangled Multi-modal Learning for Time Series Understanding and Reasoning

Hang Ni, Weijia Zhang, Fei Wang, Zezhi Shao, Hao Liu

Main category: cs.LG

TL;DR: MADI is a multimodal LLM for time series understanding that addresses cross-modal integration challenges through patch-level alignment, discrete disentangled interaction, and critical-token highlighting.

DetailsMotivation: Existing multimodal LLMs for time series struggle with fine-grained temporal misalignment across modalities and entanglement between shared and modality-specific semantics, hindering localized interpretation and complementary reasoning.

Method: Three key components: (1) Patch-level Alignment for fine-grained correspondence across modalities, (2) Discrete Disentangled Interaction to separate common semantics into discrete latents and synergize unique information, (3) Critical-token Highlighting to emphasize query-relevant signals.

Result: MADI consistently outperforms both general-purpose LLMs and time-series-specialized MLLMs on synthetic and real-world benchmarks.

Conclusion: The proposed approach effectively addresses cross-modal integration challenges in time series understanding, enabling better multimodal reasoning through fine-grained alignment and disentangled interaction.

Abstract: Advances in multi-modal large language models (MLLMs) have inspired time series understanding and reasoning tasks, that enable natural language querying over time series, producing textual analyses of complex temporal dynamics. Recent attempts hybridize numerical time series with their visualized plots, facilitating precise value reasoning and visual structure comprehension for comprehensive time series understanding of MLLMs. However, effective cross-modal integration remains challenging due to fine-grained temporal misalignment across modalities and severe entanglement between shared and modality-specific semantics, which hinder localized interpretation and complementary reasoning. To address these issues, we propose MADI, a multi-modal LLM enhanced with fine-grained alignment and disentangled interaction, featuring (1) Patch-level Alignment, which enforces physically grounded fine-grained correspondence across heterogeneous modalities, (2) Discrete Disentangled Interaction, which separates modality-common semantics into compact discrete latents and adaptively synergizes the purified modality-unique information, and (3) Critical-token Highlighting, which emphasizes informative, query-relevant signals for robust reasoning. Experiments on synthetic and real-world benchmarks show that MADI consistently outperforms general-purpose LLMs and time-series-specialized MLLMs.

[522] Rethinking Federated Graph Foundation Models: A Graph-Language Alignment-based Approach

Yinlin Zhu, Di Wu, Xianzhi Zhang, Yuming Ai, Xunkai Li, Miao Hu, Guocong Quan

Main category: cs.LG

TL;DR: FedGALA is a federated learning framework that aligns graph neural networks with pre-trained language models in continuous embedding space to handle distributed, privacy-restricted graph data while avoiding quantization losses.

DetailsMotivation: Existing federated graph foundational models suffer from irreversible knowledge loss during vector quantization processes and struggle with data heterogeneity and communication constraints in distributed environments. There's a need to reconcile semantic-structural orthogonality between PLMs and GNNs while maintaining privacy.

Method: Uses unsupervised contrastive learning to align GNNs with frozen pre-trained language models in continuous embedding space, then employs communication-efficient prompt tuning to adapt to downstream tasks without full-parameter fine-tuning.
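The alignment stage can be pictured as a CLIP-style symmetric contrastive loss between trainable GNN embeddings and frozen PLM embeddings of the same nodes, sketched below under assumed projection dimensions and temperature; the actual objective in the paper may differ in detail.

```python
# Hedged sketch of graph-language alignment in a continuous embedding space.
import torch
import torch.nn.functional as F

def alignment_loss(gnn_emb, plm_emb, temperature=0.07):
    """gnn_emb, plm_emb: (batch, dim); row i of each describes the same node.
    The GNN side is trainable, the PLM side comes from a frozen encoder."""
    g = F.normalize(gnn_emb, dim=-1)
    t = F.normalize(plm_emb, dim=-1)
    logits = g @ t.t() / temperature
    targets = torch.arange(g.shape[0], device=g.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Each client computes this loss locally on its own graph; only GNN/projection
# parameters (and, later, prompts) are communicated, while the PLM stays frozen.
```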

Result: Outperforms all competitive baselines across multi-domain datasets on multiple tasks with up to 14.37% performance improvement.

Conclusion: FedGALA effectively addresses knowledge loss in federated graph learning by aligning GNNs and PLMs in continuous space and using efficient prompt tuning, demonstrating superior performance in distributed settings.

Abstract: Recent studies of federated graph foundational models (FedGFMs) break the idealized and untenable assumption of having centralized data storage to train graph foundation models, and accommodate the reality of distributed, privacy-restricted data silos. Despite their simplicity and intuition, existing studies that project aligned generalizable knowledge onto a discrete token space via vector-quantized backbones suffer from irreversible knowledge loss during the quantization process. In this context, we argue that reconciling the semantic-structural orthogonality and integrity between pre-trained language models (PLMs) and graph neural networks (GNNs) is paramount for developing effective FedGFMs while simultaneously mitigating the severe data heterogeneity and communication constraints inherent in distributed, resource-limited environments. To address these issues, we propose FedGALA (Federated Graph And Language Alignment), a framework that resolves graph-based semantic-structural orthogonality and integrity in federated settings by employing unsupervised contrastive learning to align GNNs and frozen PLMs within a continuous embedding space, thereby capturing robust, transferable general knowledge. Subsequently, FedGALA leverages a communication-efficient prompt tuning mechanism to steer these pre-aligned encoders and frozen PLMs, facilitating effective adaptation to diverse downstream tasks while circumventing the prohibitive overhead of full-parameter fine-tuning. The comprehensive experiments validate that FedGALA outperforms all competitive baselines across multi-domain datasets on multiple tasks with up to 14.37% performance improvement.

[523] DA-SPS: A Dual-stage Network based on Singular Spectrum Analysis, Patching-strategy and Spearman-correlation for Multivariate Time-series Prediction

Tianhao Zhang, Shusen Ma, Yu Kang, Yun-Bo Zhao

Main category: cs.LG

TL;DR: DA-SPS model for multivariate time-series forecasting that separates target variable processing (using SSA, LSTM, P-Conv-LSTM) from extraneous variable processing (using correlation analysis and L-Attention), achieving state-of-the-art results on public datasets and a private laptop motherboard dataset.

DetailsMotivation: Existing multivariate time-series forecasting methods fail to effectively consider the impact of extraneous variables on target variable prediction and don't fully extract complex sequence information based on various time patterns.

Method: Two-stage approach: 1) Target Variable Processing Stage (TVPS) uses Singular Spectrum Analysis (SSA) to decompose target variable, then LSTM for trend and P-Conv-LSTM with patching strategy for seasonality; 2) Extraneous Variables Processing Stage (EVPS) filters variables using Spearman correlation analysis and analyzes them with L-Attention module (LSTM + attention). Results combined via weighted summation and linear mapping.
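The EVPS filtering step reduces to a simple Spearman screen over candidate covariates, as in the sketch below; the correlation threshold is an assumption, and the SSA/LSTM components of TVPS are not shown.

```python
# Minimal sketch of the EVPS variable filter: keep only extraneous series whose
# Spearman correlation with the target exceeds a threshold (threshold is an assumption).
import numpy as np
from scipy.stats import spearmanr

def filter_extraneous(target, extraneous, threshold=0.5):
    """target: (T,), extraneous: (T, n_vars) -> indices of strongly correlated variables."""
    keep = []
    for j in range(extraneous.shape[1]):
        rho, _ = spearmanr(target, extraneous[:, j])
        if abs(rho) >= threshold:
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
y = rng.normal(size=200).cumsum()
X = np.stack([y + rng.normal(size=200), rng.normal(size=200)], axis=1)
print(filter_extraneous(y, X))   # typically keeps only the correlated first column
```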

Result: Outperforms existing state-of-the-art methods on four public datasets and shows strong performance on a private laptop motherboard test items dataset.

Conclusion: DA-SPS effectively handles multivariate time-series forecasting by separately processing target and extraneous variables with specialized modules, demonstrating superior performance across multiple datasets.

Abstract: Multivariate time-series forecasting, as a typical problem in the field of time series prediction, has a wide range of applications in weather forecasting, traffic flow prediction, and other scenarios. However, existing works do not effectively consider the impact of extraneous variables on the prediction of the target variable. On the other hand, they fail to fully extract complex sequence information based on various time patterns of the sequences. To address these drawbacks, we propose a DA-SPS model, which adopts different modules for feature extraction based on the information characteristics of different variables. DA-SPS mainly consists of two stages: the target variable processing stage (TVPS) and the extraneous variables processing stage (EVPS). In TVPS, the model first uses Singular Spectrum Analysis (SSA) to process the target variable sequence and then uses Long Short-Term Memory (LSTM) and P-Conv-LSTM which deploys a patching strategy to extract features from trend and seasonality components, respectively. In EVPS, the model filters extraneous variables that have a strong correlation with the target variate by using Spearman correlation analysis and further analyses them using the L-Attention module which consists of LSTM and attention mechanism. Finally, the results obtained by TVPS and EVPS are combined through weighted summation and linear mapping to produce the final prediction. The results on four public datasets demonstrate that the DA-SPS model outperforms existing state-of-the-art methods. Additionally, its performance in real-world scenarios is further validated using a private dataset collected by ourselves, which contains the test items’ information on laptop motherboards.

[524] Shaping capabilities with token-level data filtering

Neil Rathi, Alec Radford

Main category: cs.LG

TL;DR: Filtering specific tokens during pretraining is more effective than document filtering for removing undesired capabilities like medical knowledge, with effectiveness increasing at larger model scales.

DetailsMotivation: Current post-hoc methods for removing undesired capabilities in language models are vulnerable to adversarial bypassing. The paper explores shaping capabilities during pretraining itself as a more robust alternative.

Method: Uses token-level filtering of pretraining data (rather than document-level) to remove specific capabilities. Introduces methodology using sparse autoencoders to label tokens and distill cheap, high-quality classifiers for filtering.
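Mechanically, token-level filtering can be as simple as masking flagged positions out of the pretraining loss rather than discarding whole documents. The sketch below illustrates that idea under the assumption that a cheap classifier has already produced a per-token forget mask; it is not the paper's pipeline.

```python
# Sketch of loss-level token filtering (assumed mechanics, not the paper's pipeline).
import torch
import torch.nn.functional as F

def token_filtered_lm_loss(logits, targets, forget_mask):
    """
    logits: (batch, seq, vocab), targets: (batch, seq),
    forget_mask: (batch, seq) bool, True where the token classifier flags the token.
    Flagged positions contribute nothing to the loss; the rest of the document is kept.
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none")   # (batch, seq)
    keep = (~forget_mask).float()
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```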

Result: Token filtering is highly effective, robust, and inexpensive at scale. For the largest models, token filtering leads to a 7000x compute slowdown on the forget domain. Filtering effectiveness increases with model scale, and models can still be aligned on forget domains after filtering.

Conclusion: Filtering specific tokens during pretraining is a powerful approach for controlling model capabilities that becomes more effective with scale and can be robust to noisy labels with sufficient compute.

Abstract: Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.

[525] Sim-MSTNet: sim2real based Multi-task SpatioTemporal Network Traffic Forecasting

Hui Ma, Qingzhong Li, Jin Wang, Jie Wu, Shaoyu Dou, Li Feng, Xinjun Pei

Main category: cs.LG

TL;DR: Sim-MSTNet: A multi-task spatiotemporal network traffic forecasting model using sim2real approach with domain randomization and attention mechanisms to address data scarcity and task imbalance.

DetailsMotivation: Existing network traffic forecasting techniques perform poorly with limited data, and multi-task learning methods struggle with task imbalance and negative transfer when modeling various service types.

Method: Proposes Sim-MSTNet using sim2real approach: leverages simulator to generate synthetic data, employs domain randomization with bi-level optimization to reduce synthetic-real distribution gap, incorporates attention mechanisms for selective knowledge sharing between tasks, and applies dynamic loss weighting for task balancing.

Result: Extensive experiments on two open-source datasets show Sim-MSTNet consistently outperforms state-of-the-art baselines, achieving enhanced accuracy and generalization.

Conclusion: Sim-MSTNet effectively addresses data scarcity and task imbalance in network traffic forecasting through sim2real approach with domain randomization and attention-based multi-task learning.

Abstract: Network traffic forecasting plays a crucial role in intelligent network operations, but existing techniques often perform poorly when faced with limited data. Additionally, multi-task learning methods struggle with task imbalance and negative transfer, especially when modeling various service types. To overcome these challenges, we propose Sim-MSTNet, a multi-task spatiotemporal network traffic forecasting model based on the sim2real approach. Our method leverages a simulator to generate synthetic data, effectively addressing the issue of poor generalization caused by data scarcity. By employing a domain randomization technique, we reduce the distributional gap between synthetic and real data through bi-level optimization of both sample weighting and model training. Moreover, Sim-MSTNet incorporates attention-based mechanisms to selectively share knowledge between tasks and applies dynamic loss weighting to balance task objectives. Extensive experiments on two open-source datasets show that Sim-MSTNet consistently outperforms state-of-the-art baselines, achieving enhanced accuracy and generalization.

[526] Learning to Optimize Job Shop Scheduling Under Structural Uncertainty

Rui Zhang, Jianwei Niu, Xuefeng Liu, Shaojie Tang, Jing Yuan

Main category: cs.LG

TL;DR: UP-AAC: A novel actor-critic method with asymmetric architecture for Job-Shop Scheduling Problem with structural uncertainty, using deterministic hindsight states for critic and attention-based uncertainty perception for actor.

DetailsMotivation: Address structural uncertainty in JSSP where jobs may follow different routing paths based on situational factors unknown in advance, which existing methods struggle with due to incorrect credit assignment.

Method: Proposes UP-AAC with asymmetric architecture: actor receives stochastic states while critic gets deterministic states reconstructed in hindsight. Also includes attention-based Uncertainty Perception Model (UPM) to enhance actor’s scheduling decisions.
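A minimal sketch of the asymmetric update, with assumed network sizes and a plain advantage estimator: the actor conditions on the stochastic observation, while the critic is trained on a hindsight state in which the realized routing path has been filled in.

```python
# Hedged sketch of an asymmetric actor-critic update in the spirit of UP-AAC.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class HindsightCritic(nn.Module):
    def __init__(self, hindsight_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hindsight_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, hindsight_state):        # includes the realized routing path
        return self.net(hindsight_state).squeeze(-1)

def update(actor, critic, opt_a, opt_c, obs, hindsight, action, ret):
    value = critic(hindsight)                  # critic sees the deterministic hindsight state
    advantage = (ret - value).detach()         # lower-variance credit assignment for the actor
    critic_loss = (ret - value).pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -(actor(obs).log_prob(action) * advantage).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```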

Result: Extensive experiments show the method outperforms existing approaches in reducing makespan on benchmark instances.

Conclusion: UP-AAC effectively addresses structural uncertainty in JSSP through asymmetric actor-critic architecture and uncertainty perception, leading to more stable learning and better scheduling performance.

Abstract: The Job-Shop Scheduling Problem (JSSP), under various forms of manufacturing uncertainty, has recently attracted considerable research attention. Most existing studies focus on parameter uncertainty, such as variable processing times, and typically adopt the actor-critic framework. In this paper, we explore a different but prevalent form of uncertainty in JSSP: structural uncertainty. Structural uncertainty arises when a job may follow one of several routing paths, and the selection is determined not by policy, but by situational factors (e.g., the quality of intermediate products) that cannot be known in advance. Existing methods struggle to address this challenge due to incorrect credit assignment: a high-quality action may be unfairly penalized if it is followed by a time-consuming path. To address this problem, we propose a novel method named UP-AAC. In contrast to conventional actor-critic methods, UP-AAC employs an asymmetric architecture. While its actor receives a standard stochastic state, the critic is crucially provided with a deterministic state reconstructed in hindsight. This design allows the critic to learn a more accurate value function, which in turn provides a lower-variance policy gradient to the actor, leading to more stable learning. In addition, we design an attention-based Uncertainty Perception Model (UPM) to enhance the actor’s scheduling decisions. Extensive experiments demonstrate that our method outperforms existing approaches in reducing makespan on benchmark instances.

[527] Breaking the Overscaling Curse: Thinking Parallelism Before Parallel Thinking

Yiming Wang, Zhuosheng Zhang, Rui Wang

Main category: cs.LG

TL;DR: T2: A method to optimize parallel thinking in LLMs by dynamically allocating parallelism levels per sample to avoid overscaling and reduce computational costs while maintaining performance.

DetailsMotivation: Current parallel thinking approaches allocate a fixed, large parallelism level N to all samples, but due to sample heterogeneity, many samples could achieve comparable performance with smaller N', leading to computational budget redundancy - the "overscaling curse".

Method: Proposes T2, which uses latent representations to estimate the optimal parallelism level for each sample before decoding, enabling dynamic allocation of computational resources based on sample difficulty.

Result: Experiments show T2 significantly reduces computational costs while maintaining comparable performance to fixed high-parallelism baselines, enabling more efficient parallel thinking.

Conclusion: The overscaling curse is a universal and severe problem in parallel thinking, and T2 provides a lightweight solution that breaks this curse by enabling sample-adaptive parallelism allocation.

Abstract: Parallel thinking enhances LLM reasoning by multi-path sampling and aggregation. In system-level evaluations, a global parallelism level N is allocated to all samples, typically set large to maximize overall dataset accuracy. However, due to sample heterogeneity, some samples can achieve comparable performance with a smaller N’< N, causing budget redundancy. This incompatibility between system-level efficacy and sample-level efficiency constitutes the overscaling curse. In this paper, we formalize and quantify the overscaling curse, showing its universality and severity in practice, and analyze its trigger mechanism. We then propose a lightweight method, T2, to break the overscaling curse, which utilizes latent representations to estimate the optimal parallelism level for each sample before decoding. Experiments show that T2 significantly reduces cost while maintaining comparable performance, enabling more efficient parallel thinking.

[528] Intrinsic Reward Policy Optimization for Sparse-Reward Environments

Minjae Cho, Huy Trong Tran

Main category: cs.LG

TL;DR: IRPO is a policy optimization framework that uses multiple intrinsic rewards to guide exploration in sparse-reward RL environments without pretraining subpolicies, improving performance and sample efficiency.

DetailsMotivation: Standard exploration strategies fail in sparse-reward environments, and existing intrinsic reward approaches suffer from unstable credit assignment (when combined with extrinsic rewards) or sample inefficiency/sub-optimality (when used for hierarchical learning).

Method: Proposes Intrinsic Reward Policy Optimization (IRPO) - a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for extrinsic rewards without pretraining subpolicies. Uses a surrogate policy gradient that provides more informative learning signals than true gradients in sparse-reward environments.

Result: IRPO improves performance and sample efficiency relative to baselines in both discrete and continuous environments. Formal analysis of the optimization problem solved by IRPO is provided.

Conclusion: IRPO effectively addresses exploration challenges in sparse-reward RL by using intrinsic rewards in a novel policy optimization framework, overcoming limitations of existing approaches.

Abstract: Exploration is essential in reinforcement learning as an agent relies on trial and error to learn an optimal policy. However, when rewards are sparse, naive exploration strategies, like noise injection, are often insufficient. Intrinsic rewards can also provide principled guidance for exploration by, for example, combining them with extrinsic rewards to optimize a policy or using them to train subpolicies for hierarchical learning. However, the former approach suffers from unstable credit assignment, while the latter exhibits sample inefficiency and sub-optimality. We propose a policy optimization framework that leverages multiple intrinsic rewards to directly optimize a policy for an extrinsic reward without pretraining subpolicies. Our algorithm – intrinsic reward policy optimization (IRPO) – achieves this by using a surrogate policy gradient that provides a more informative learning signal than the true gradient in sparse-reward environments. We demonstrate that IRPO improves performance and sample efficiency relative to baselines in discrete and continuous environments, and formally analyze the optimization problem solved by IRPO. Our code is available at https://github.com/Mgineer117/IRPO.

[529] SWE-Spot: Building Small Repo-Experts with Repository-Centric Learning

Jinjun Peng, Magnus Saebo, Tianjun Zhong, Yi-Jie Cheng, Junfeng Yang, Baishakhi Ray, Simin Chen, Yangruibo Ding

Main category: cs.LG

TL;DR: SWE-Spot-4B: A compact 4B parameter model trained with Repository-Centric Learning (RCL) paradigm that focuses on deep repository understanding rather than broad task exposure, achieving state-of-the-art performance on software engineering tasks while being highly efficient.

DetailsMotivation: Small Language Models (SLMs) for coding struggle with complicated, unfamiliar codebases due to their limited generalization capabilities. Current Task-Centric Learning (TCL) approaches that expose models to many disparate repositories fail to address this fundamental limitation, creating a need for a new paradigm that enables deep understanding of specific software environments.

Method: Proposes Repository-Centric Learning (RCL) paradigm that prioritizes vertical repository depth over horizontal task breadth. Implements a four-unit Repository-Centric Experience that transforms static codebases into interactive learning signals. Trains SWE-Spot-4B family of compact models (4B parameters) as repository-specialized experts.

Result: SWE-Spot-4B outperforms larger open-weight models (including Meta's CWM and Qwen3-Coder-30B) and matches or surpasses efficiency-focused commercial models (GPT-4.1-mini, GPT-5-nano) across multiple software engineering tasks. RCL demonstrates higher training sample efficiency and lower inference costs.

Conclusion: Repository mastery is a distinct and necessary dimension that complements general coding capability for building efficient intelligence. The RCL paradigm enables SLMs to internalize the “physics” of target software environments through parametric knowledge acquisition rather than costly inference-time search.

Abstract: The deployment of coding agents in privacy-sensitive and resource-constrained environments drives the demand for capable open-weight Small Language Models (SLMs). However, they suffer from a fundamental capability gap: unlike frontier large models, they lack the inference-time strong generalization to work with complicated, unfamiliar codebases. We identify that the prevailing Task-Centric Learning (TCL) paradigm, which scales exposure across disparate repositories, fails to address this limitation. In response, we propose Repository-Centric Learning (RCL), a paradigm shift that prioritizes vertical repository depth over horizontal task breadth, suggesting SLMs must internalize the “physics” of a target software environment through parametric knowledge acquisition, rather than attempting to recover it via costly inference-time search. Following this new paradigm, we design a four-unit Repository-Centric Experience, transforming static codebases into interactive learning signals, to train SWE-Spot-4B, a family of highly compact models built as repo-specialized experts that breaks established scaling trends, outperforming open-weight models up to larger (e.g., CWM by Meta, Qwen3-Coder-30B) and surpassing/matching efficiency-focused commercial models (e.g., GPT-4.1-mini, GPT-5-nano) across multiple SWE tasks. Further analysis reveals that RCL yields higher training sample efficiency and lower inference costs, emphasizing that for building efficient intelligence, repository mastery is a distinct and necessary dimension that complements general coding capability.

[530] Mitigating Overthinking in Large Reasoning Models via Difficulty-aware Reinforcement Learning

Qian Wan, Ziao Xu, Luona Wei, Xiaoxuan Shen, Jianwen Sun

Main category: cs.LG

TL;DR: DiPO is a reinforcement learning framework that helps Large Reasoning Models (LRMs) become aware of task difficulty to avoid overthinking on simple tasks, reducing redundant reasoning tokens while maintaining performance.

DetailsMotivation: LRMs often overthink simple tasks, producing unnecessarily long reasoning chains due to generation preferences from post-training reward functions. Existing approaches overlook task difficulty awareness, preventing effective allocation of reasoning resources.

Method: DiPO uses reinforcement learning to encourage LRMs to model task complexity spontaneously. It includes a difficulty modeling method based on model self-reasoning (reducing manual annotation needs) and a difficulty-signal-enhanced reward function that penalizes lengthy reasoning while considering performance and output format.
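As a toy illustration only (the paper's reward is more involved), a difficulty-aware reward can couple the length penalty to the self-estimated difficulty, so short answers are rewarded on easy tasks while long reasoning is tolerated on hard ones:

```python
# Illustrative reward shaping only; the coefficients and functional form are assumptions.
def difficulty_aware_reward(correct, well_formatted, n_tokens,
                            difficulty, max_tokens=4096, alpha=1.0):
    """difficulty in [0, 1], estimated from the model's own reasoning traces."""
    base = (1.0 if correct else 0.0) + (0.1 if well_formatted else 0.0)
    length_penalty = alpha * (1.0 - difficulty) * (n_tokens / max_tokens)
    return base - length_penalty

print(difficulty_aware_reward(True, True, 2000, difficulty=0.1))   # easy task: heavy penalty
print(difficulty_aware_reward(True, True, 2000, difficulty=0.9))   # hard task: mild penalty
```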

Result: DiPO enables models to adjust inference overhead spontaneously, significantly reducing redundant tokens without performance loss due to thought compression.

Conclusion: The framework successfully addresses overthinking in LRMs by integrating task difficulty awareness into reinforcement learning, improving resource efficiency while maintaining reasoning quality.

Abstract: Large Reasoning Models (LRMs) achieve explicit chain-of-thought expansion by imitating deep thinking behaviors of humans, demonstrating excellent performance in complex task scenarios. However, the deep-thinking mode often leads to unnecessarily lengthy reasoning and resource inefficiency when handling simple tasks. This overthinking phenomenon may arise from the generation preference triggered by the reward function during post-training. Existing research attempts to mitigate overthinking from the perspective of prompt design or model training, but generally underestimates the importance of task difficulty awareness, which makes it difficult for LRMs to effectively allocate reasoning resources. In this paper, we propose Difficulty-aware Policy Optimization (DiPO), a reinforcement learning-based LRM training framework. DiPO encourages LRM to spontaneously model task complexity, and integrates them into reinforcement learning framework to adjust the generation preferences introduced by post-training. A difficulty modeling method based on model self-reasoning is proposed, which significantly reduces the dependence on manual annotation and formalize task complexity. We further develop a difficulty-signal-enhanced reward function that incorporates a penalty for lengthy reasoning while considering reasoning performance and output format. Experimental results indicate that DiPO enables the model to spontaneously adjust inference overhead, significantly reducing redundant tokens without losing performance due to thought compression.

[531] Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

Tien Dang, The-Hai Nguyen, Dinh Mai Phuong, Nguyen Minh Phuong, Hoang Thanh-Tung, Le-Minh Nguyen, Naoya Inoue

Main category: cs.LG

TL;DR: This paper explores representation misdirection (RM) for LLM unlearning, showing that manipulating forget-representations not only enables forgetting but also elicits controllable side behaviors and enhanced capabilities related to the target concept.

DetailsMotivation: To investigate the under-explored roles of target vectors in representation misdirection (RM) methods for LLM unlearning, and to understand how manipulating forget-representations affects model behavior beyond just forgetting.

Method: Approaches RM through the lens of linear representation hypothesis, identifying one-dimensional representations corresponding to high-level concepts and performing linear operations on these concept vectors within forget-representation space.
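Under the linear representation hypothesis, the operations involved are simple vector edits of hidden states along a concept direction, as in the sketch below; how the direction is extracted and the steering coefficient are assumptions.

```python
# Minimal sketch of linear concept steering on hidden states (assumed, illustrative only).
import torch

def steer(hidden, concept_dir, coeff):
    """Additive steering along a unit concept direction; negative coeff suppresses the
    concept, positive coeff amplifies it. hidden: (..., d); concept_dir: (d,)."""
    d = concept_dir / concept_dir.norm()
    return hidden + coeff * d

h = torch.randn(4, 4096)            # hidden states at some layer
v = torch.randn(4096)               # hypothetical concept direction (e.g. "refusal")
h_concept_up = steer(h, v, coeff=+5.0)
h_concept_down = steer(h, v, coeff=-5.0)
```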

Result: Empirical validation across various tasks shows that machine unlearning elicits both controllable side behaviors (truth, sentiment, refusal control) and enhanced capabilities (improved in-context learning), revealing this as either a hidden risk or a mechanism for developing models with stronger capabilities.

Conclusion: The phenomenon of representation manipulation in unlearning can be either a risk if misused or a valuable mechanism for creating models with enhanced capabilities and controllable behaviors, highlighting the dual nature of representation-based unlearning approaches.

Abstract: We consider representation misdirection (RM), a class of LLM unlearning methods that achieves forgetting by manipulating the forget-representations, that is, latent representations of forget samples. Despite being important, the roles of target vectors used in RM, however, remain underexplored. Here, we approach and revisit RM through the lens of the linear representation hypothesis. Specifically, if one can somehow identify a one-dimensional representation corresponding to a high-level concept, the linear representation hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning elicits controllable side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models’ truth, sentiment, and refusal) and capability enhancement (e.g., improving unlearned models’ in-context learning capability). Our findings reveal that this fairly attractive phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing models that require stronger capabilities and controllable behaviors.

[532] Revisiting Diffusion Model Predictions Through Dimensionality

Qing Jin, Chaoyang Wang

Main category: cs.LG

TL;DR: Theoretical framework shows optimal prediction target in diffusion models depends on data geometry, with x-prediction superior when ambient dimension exceeds intrinsic dimension; proposes k-Diff to learn optimal target parameter automatically.

DetailsMotivation: Recent diffusion models show preference shifting from noise/velocity to direct data prediction in high-dimensional settings, but lack formal explanation for why optimal target depends on data properties. Need theoretical understanding of relationship between data geometry and optimal prediction target.

Method: Develops generalized prediction formulation accommodating arbitrary output targets (ε-, v-, x-prediction as special cases). Derives analytical relationship between data geometry and optimal prediction target. Proposes k-Diff framework that learns optimal prediction parameter k directly from data without explicit dimension estimation.
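For orientation, the three familiar targets are related through the standard variance-preserving parameterization x_t = alpha_t * x0 + sigma_t * eps; the snippet below records those special cases and the kind of linear family a generalized target lives in. The precise role of the learned parameter k in k-Diff follows the paper.

```python
# Standard target conversions under x_t = alpha_t * x0 + sigma_t * eps.
# Only the three well-known special cases are shown; the k-parameterization is the paper's.
def eps_target(x0, eps, alpha_t, sigma_t):
    return eps                                  # noise prediction

def x_target(x0, eps, alpha_t, sigma_t):
    return x0                                   # direct data prediction

def v_target(x0, eps, alpha_t, sigma_t):
    return alpha_t * eps - sigma_t * x0         # velocity prediction

def generalized_target(x0, eps, a_k, b_k):
    """A linear family y = a_k*x0 + b_k*eps containing the three cases above."""
    return a_k * x0 + b_k * eps
```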

Result: Theory shows x-prediction becomes superior when ambient dimension significantly exceeds data’s intrinsic dimension. k-Diff consistently outperforms fixed-target baselines across varying architectures and data scales in both latent-space and pixel-space image generation.

Conclusion: Provides rigorous justification for target selection in diffusion models based on data dimensionality. k-Diff offers principled, automated approach to enhance generative performance by learning optimal prediction target from data.

Abstract: Recent advances in diffusion and flow matching models have highlighted a shift in the preferred prediction target – moving from noise ($\varepsilon$) and velocity (v) to direct data (x) prediction – particularly in high-dimensional settings. However, a formal explanation of why the optimal target depends on the specific properties of the data remains elusive. In this work, we provide a theoretical framework based on a generalized prediction formulation that accommodates arbitrary output targets, of which $\varepsilon$-, v-, and x-prediction are special cases. We derive the analytical relationship between data’s geometry and the optimal prediction target, offering a rigorous justification for why x-prediction becomes superior when the ambient dimension significantly exceeds the data’s intrinsic dimension. Furthermore, while our theory identifies dimensionality as the governing factor for the optimal prediction target, the intrinsic dimension of manifold-bound data is typically intractable to estimate in practice. To bridge this gap, we propose k-Diff, a framework that employs a data-driven approach to learn the optimal prediction parameter k directly from data, bypassing the need for explicit dimension estimation. Extensive experiments in both latent-space and pixel-space image generation demonstrate that k-Diff consistently outperforms fixed-target baselines across varying architectures and data scales, providing a principled and automated approach to enhancing generative performance.

[533] ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang

Main category: cs.LG

TL;DR: ConceptMoE dynamically merges semantically similar tokens into concept representations to allocate computation adaptively, improving performance and efficiency in language and multimodal tasks.

DetailsMotivation: Current LLMs use uniform computation across all tokens, but some sequences are trivially predictable while others require deep reasoning. This inefficient allocation wastes computational resources.

Method: ConceptMoE uses a learnable chunk module to identify optimal token boundaries based on inter-token similarity, compressing sequences by ratio R. The MoE architecture enables controlled evaluation by reallocating saved computation to match baseline FLOPs and parameters.
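The chunking idea can be illustrated with a greedy, similarity-driven merge that shortens a sequence by a target ratio R; the real chunk module is learned end-to-end, so the rule below is only a rough stand-in.

```python
# Hedged sketch of similarity-driven token-to-concept merging at a target ratio R.
import torch
import torch.nn.functional as F

def merge_to_ratio(h, R=2):
    """h: (seq, d). Repeatedly mean-pool the most similar adjacent pair until len <= seq // R."""
    target_len = max(1, h.shape[0] // R)
    h = h.clone()
    while h.shape[0] > target_len:
        sims = F.cosine_similarity(h[:-1], h[1:], dim=-1)       # adjacent similarities
        i = int(sims.argmax())
        merged = 0.5 * (h[i] + h[i + 1])
        h = torch.cat([h[:i], merged.unsqueeze(0), h[i + 2:]], dim=0)
    return h

tokens = torch.randn(16, 64)
concepts = merge_to_ratio(tokens, R=2)   # (8, 64): attention cost drops roughly by R^2
```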

Result: ConceptMoE outperforms standard MoE: +0.9 points on language pretraining, +2.3 points on long context understanding, +0.6 points on multimodal benchmarks. Gains reach +5.5 points with pretrained MoE conversion. Reduces attention computation by up to R²× and KV cache by R×, with prefill speedups up to 175% and decoding speedups up to 117% at R=2.

Conclusion: Adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models, with minimal architectural modifications enabling straightforward integration into existing MoE systems.

Abstract: Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio $R$ before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to $R^2\times$ and KV cache by $R\times$. At $R=2$, empirical measurements show prefill speedups reaching 175% and decoding speedups up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.

[534] Lossy Common Information in a Learnable Gray-Wyner Network

Anderson de Andrade, Alon Harell, Ivan V. Bajić

Main category: cs.LG

TL;DR: A learnable three-channel codec inspired by Gray-Wyner network theory that disentangles shared and task-specific information across multiple vision tasks to reduce redundancy and improve efficiency.

DetailsMotivation: Computer vision tasks share substantial overlapping information, but conventional codecs ignore this, leading to redundant and inefficient representations. The Gray-Wyner network from information theory offers a principled framework for separating common and task-specific information.

Method: Develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. Characterize limits through lossy common information concept and propose optimization objective balancing tradeoffs. Compare three codec architectures on two-task scenarios across six vision benchmarks.

Result: The approach substantially reduces redundancy and consistently outperforms independent coding across various vision benchmarks, demonstrating practical value of Gray-Wyner theory in modern machine learning.

Conclusion: The work successfully bridges classic information theory with task-driven representation learning, showing that revisiting Gray-Wyner theory has practical value for creating more efficient vision representations.

Abstract: Many computer vision tasks share substantial overlapping information, yet conventional codecs tend to ignore this, leading to redundant and inefficient representations. The Gray-Wyner network, a classical concept from information theory, offers a principled framework for separating common and task-specific information. Inspired by this idea, we develop a learnable three-channel codec that disentangles shared information from task-specific details across multiple vision tasks. We characterize the limits of this approach through the notion of lossy common information, and propose an optimization objective that balances inherent tradeoffs in learning such representations. Through comparisons of three codec architectures on two-task scenarios spanning six vision benchmarks, we demonstrate that our approach substantially reduces redundancy and consistently outperforms independent coding. These results highlight the practical value of revisiting Gray-Wyner theory in modern machine learning contexts, bridging classic information theory with task-driven representation learning.

[535] Accurate Network Traffic Matrix Prediction via LEAD: an LLM-Enhanced Adapter-Based Conditional Diffusion Model

Yu Sun, Yaqiong Liu, Nan Cheng, Jiayuan Li, Zihan Jia, Xialin Du, Mugen Peng

Main category: cs.LG

TL;DR: LEAD: LLM-enhanced diffusion model for network traffic matrix forecasting using traffic-to-image transformation and dual-conditioning strategy

DetailsMotivation: Network operations require predictive adaptation under computation/latency constraints; existing TM forecasting models suffer from over-smoothing and limited uncertainty awareness, especially for extreme bursts

Method: Transforms traffic matrices to RGB images for global dependency modeling; uses frozen LLM with trainable adapter for temporal semantics; employs dual-conditioning strategy to guide diffusion model for complex traffic generation

Result: Outperforms all baselines on Abilene (45.2% RMSE reduction) and GEANT (27.3% RMSE reduction) datasets; maintains stable performance from 1-step to 20-step predictions

Conclusion: LEAD effectively addresses TM forecasting challenges by combining vision backbones, LLM adapters, and diffusion models for accurate, uncertainty-aware predictions

Abstract: Driven by the evolution toward 6G and AI-native edge intelligence, network operations increasingly require predictive and risk-aware adaptation under stringent computation and latency constraints. Network Traffic Matrix (TM), which characterizes flow volumes between nodes, is a fundamental signal for proactive traffic engineering. However, accurate TM forecasting remains challenging due to the stochastic, non-linear, and bursty nature of network dynamics. Existing discriminative models often suffer from over-smoothing and provide limited uncertainty awareness, leading to poor fidelity under extreme bursts. To address these limitations, we propose LEAD, a Large Language Model (LLM)-Enhanced Adapter-based conditional Diffusion model. First, LEAD adopts a “Traffic-to-Image” paradigm to transform traffic matrices into RGB images, enabling global dependency modeling via vision backbones. Then, we design a “Frozen LLM with Trainable Adapter” model, which efficiently captures temporal semantics with limited computational cost. Moreover, we propose a Dual-Conditioning Strategy to precisely guide a diffusion model to generate complex, dynamic network traffic matrices. Experiments on the Abilene and GEANT datasets demonstrate that LEAD outperforms all baselines. On the Abilene dataset, LEAD attains a remarkable 45.2% reduction in RMSE against the best baseline, with the error margin rising only marginally from 0.1098 at one-step to 0.1134 at 20-step predictions. Meanwhile, on the GEANT dataset, LEAD achieves a 0.0258 RMSE at 20-step prediction horizon which is 27.3% lower than the best baseline.
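
Code sketch: one plausible "Traffic-to-Image" mapping, assuming three consecutive traffic matrices are min-max scaled and stacked into the R, G, and B channels; LEAD's exact encoding may differ.

```python
import numpy as np

def traffic_to_image(tm_window: np.ndarray) -> np.ndarray:
    """Map three consecutive NxN traffic matrices to an 8-bit RGB image.

    tm_window: array of shape (3, N, N), e.g. three time steps of the
    Abilene (12x12) or GEANT (23x23) traffic matrix.
    Returns an (N, N, 3) uint8 image; each channel is min-max scaled.
    The channel assignment here is an illustrative assumption.
    """
    assert tm_window.shape[0] == 3
    channels = []
    for tm in tm_window:
        lo, hi = tm.min(), tm.max()
        scaled = (tm - lo) / (hi - lo + 1e-8)
        channels.append((scaled * 255).astype(np.uint8))
    return np.stack(channels, axis=-1)

window = np.random.rand(3, 12, 12) * 1e6   # synthetic Abilene-sized flows
img = traffic_to_image(window)
print(img.shape, img.dtype)                # (12, 12, 3) uint8
```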

[536] Synthetic Pattern Generation and Detection of Financial Activities using Graph Autoencoders

Francesco Zola, Lucia Muñoz, Andrea Venturi, Amaia Gil

Main category: cs.LG

TL;DR: GAEs trained on synthetic transaction patterns can detect illicit financial activities like money laundering using reconstruction error without labeled data.

DetailsMotivation: Detecting money laundering patterns in transaction networks is challenging due to scarce labeled data and privacy constraints. Synthetic data offers a solution.

Method: Two-phase approach: (1) Generate synthetic samples for 7 illicit activity patterns using parametrized generators, (2) Train separate GAEs (GAE-GCN, GAE-SAGE, GAE-GAT) on each pattern using reconstruction error without labels.

Result: GAE-GCN achieves most consistent reconstruction performance across patterns. GAE-SAGE and GAE-GAT show competitive results only in specific patterns.

Conclusion: Graph-based representation learning on synthetic data provides viable path for AI-driven illicit behavior detection, overcoming financial dataset limitations.

Abstract: Illicit financial activities such as money laundering often manifest through recurrent topological patterns in transaction networks. Detecting these patterns automatically remains challenging due to the scarcity of labeled real-world data and strict privacy constraints. To address this, we investigate whether Graph Autoencoders (GAEs) can effectively learn and distinguish topological patterns that mimic money laundering operations when trained on synthetic data. The analysis consists of two phases: (i) data generation, where synthetic samples are created for seven well-known illicit activity patterns using parametrized generators that preserve structural consistency while introducing realistic variability; and (ii) model training and validation, where separate GAEs are trained on each pattern without explicit labels, relying solely on reconstruction error as an indicator of learned structure. We compare three GAE implementations based on three distinct convolutional layers: Graph Convolutional (GAE-GCN), GraphSAGE (GAE-SAGE), and Graph Attention Network (GAE-GAT). Experimental results show that GAE-GCN achieves the most consistent reconstruction performance across patterns, while GAE-SAGE and GAE-GAT exhibit competitive results only in few specific patterns. These findings suggest that graph-based representation learning on synthetic data provides a viable path toward developing AI-driven tools for detecting illicit behaviors, overcoming the limitations of financial datasets.
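
Code sketch: the detection signal is a pattern-specific graph autoencoder's reconstruction error. Below is a minimal sketch with a dense one-layer GCN encoder and inner-product decoder in plain PyTorch (the paper's GCN/GraphSAGE/GAT variants use proper graph learning layers); higher error suggests a subgraph deviates from the learned pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGAE(nn.Module):
    """One-layer dense GCN encoder + inner-product decoder."""

    def __init__(self, in_dim: int, z_dim: int = 16):
        super().__init__()
        self.lin = nn.Linear(in_dim, z_dim)

    def forward(self, x, adj):
        # Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(-1).clamp(min=1e-8).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        z = torch.relu(self.lin(a_norm @ x))
        return torch.sigmoid(z @ z.t())          # reconstructed adjacency

def reconstruction_error(model, x, adj):
    """Higher error => graph deviates from the learned illicit pattern."""
    with torch.no_grad():
        rec = model(x, adj)
    return F.binary_cross_entropy(rec, adj).item()

# Usage: score a candidate transaction subgraph against a pattern-specific GAE.
n = 10
adj = (torch.rand(n, n) < 0.2).float()
adj = ((adj + adj.t()) > 0).float()
x = torch.randn(n, 8)
model = TinyGAE(in_dim=8)
print(reconstruction_error(model, x, adj))
```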

[537] GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

Main category: cs.LG

TL;DR: GeoNorm: A novel normalization method for Transformers using geodesic updates on manifolds, replacing standard Pre-Norm/Post-Norm with optimization-inspired layer-wise update decay.

DetailsMotivation: The placement of normalization layers (Pre-Norm vs Post-Norm) in Transformers remains an open architectural question. The authors aim to rethink normalization through the lens of manifold optimization, interpreting FFN and attention outputs as update directions.

Method: Proposes GeoNorm, which replaces standard normalization with geodesic updates on the manifold. Introduces layer-wise update decay for FFN and attention components, analogous to learning rate schedules. The method can be seamlessly integrated into standard Transformer architectures.

Result: Comprehensive experiments show GeoNorm consistently outperforms existing normalization methods in Transformer models. Achieves performance improvements with negligible additional computational cost.

Conclusion: GeoNorm provides a novel optimization-inspired approach to Transformer normalization that improves performance while maintaining computational efficiency and easy integration into existing architectures.

Abstract: The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the outputs of the Feed-Forward Network (FFN) and attention layers as update directions in optimization. Building on this perspective, we introduce GeoNorm, a novel method that replaces standard normalization with geodesic updates on the manifold. Furthermore, analogous to learning rate schedules, we propose a layer-wise update decay for the FFN and attention components. Comprehensive experiments demonstrate that GeoNorm consistently outperforms existing normalization methods in Transformer models. Crucially, GeoNorm can be seamlessly integrated into standard Transformer architectures, achieving performance improvements with negligible additional computational cost.

[538] SAGE: Sequence-level Adaptive Gradient Evolution for Generative Recommendation

Yu Xie, Xing Kai Ren, Ying Qi, Hu Yao

Main category: cs.LG

TL;DR: SAGE is a unified optimization framework for list-wise generative recommendation that addresses limitations in existing LLM-based recommender systems by enabling reuse of native LLM vocabularies and solving the “Symmetric Conservatism” problem through adaptive gradient evolution.

DetailsMotivation: Current LLM-based recommender systems like OneRec require separate vocabularies, preventing reuse of native LLM architectures and incurring high maintenance costs. Additionally, their GBPO optimization suffers from "Symmetric Conservatism" - static gradient boundaries suppress cold-start item updates and fail to prevent diversity collapse in noisy environments.

Method: Proposes SAGE framework with two key innovations: 1) Sequence-level Signal Decoupling using geometric mean importance ratio with decoupled multi-objective advantages to eliminate token-level variance and solve “Reward Collapse”; 2) Asymmetric Adaptive Dynamics with dynamic gradient manifold that applies “Boost Factor” to cold-start items for super-linear updates and “Entropy Aware Penalty” to break information cocoons.

Result: Theoretical analysis and empirical results show SAGE effectively unblocks cold-start traffic and sustains recommendation diversity while maintaining the numerical stability of GBPO.

Conclusion: SAGE provides a unified optimization framework that enables efficient reuse of open-source LLM architectures without separate vocabularies and addresses key limitations in existing generative recommendation systems through adaptive gradient evolution.

Abstract: While works such as OneRec have validated the scaling laws of Large Language Models (LLMs) in recommender systems, they rely on a cumbersome separate vocabulary. This dependency prevents the model architecture from reusing native LLM vocabularies, resulting in high maintenance costs and poor scalability. In response, we aim to efficiently reuse open-source LLM architectures without constructing a separate tokenization vocabulary. Furthermore, we identify that the optimization strategy of OneRec, Gradient Bounded Policy Optimization (GBPO), suffers from a “Symmetric Conservatism” problem: its static gradient boundaries structurally suppress the update momentum required for cold-start items and fail to prevent diversity collapse in high-noise environments. To address this issue, we propose SAGE (Sequence-level Adaptive Gradient Evolution), a unified optimization framework tailored for list-wise generative recommendation. SAGE introduces two key innovations: (1) Sequence-level Signal Decoupling: By combining a geometric mean importance ratio with decoupled multi-objective advantages, we eliminate token-level variance and resolve the “Reward Collapse” problem. (2) Asymmetric Adaptive Dynamics: We construct a dynamic gradient manifold that applies a “Boost Factor” to high-potential cold-start items to achieve super-linear updates and employs an “Entropy Aware Penalty” to break information cocoons. Theoretical analysis and empirical results demonstrate that SAGE effectively unblocks cold-start traffic and sustains recommendation diversity, all while retaining the numerical stability of GBPO.
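
Code sketch: the sequence-level signal decoupling replaces per-token importance ratios with their geometric mean, giving one scalar ratio per generated list. Below is a minimal sketch of that ratio inside a clipped surrogate, assuming per-token log-probabilities from the current and old policies are available; the decoupled advantages, Boost Factor, and Entropy Aware Penalty are omitted.

```python
import torch

def sequence_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Geometric-mean importance ratio over a generated item list.

    logp_new, logp_old: (batch, seq_len) per-token log-probs.
    The geometric mean of per-token ratios equals exp of the mean
    log-ratio, so a single scalar ratio is produced per sequence.
    """
    return torch.exp((logp_new - logp_old).mean(dim=-1))

def sequence_level_loss(logp_new, logp_old, advantage, clip_eps: float = 0.2):
    # Clipped PPO-style surrogate, but at sequence level. Sketch only:
    # SAGE additionally decouples multi-objective advantages and applies
    # asymmetric boosts/penalties not shown here.
    r = sequence_ratio(logp_new, logp_old)
    unclipped = r * advantage
    clipped = torch.clamp(r, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.minimum(unclipped, clipped).mean()

logp_new = torch.log(torch.rand(4, 12))   # placeholder chosen-token log-probs
logp_old = torch.log(torch.rand(4, 12))
adv = torch.randn(4)
print(sequence_level_loss(logp_new, logp_old, adv))
```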

[539] HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao

Main category: cs.LG

TL;DR: HER framework enables cognitive-level persona simulation in LLMs through dual-layer thinking, combining first-person character thoughts with third-person LLM reasoning, trained with reverse-engineered data and human-aligned rewards.

DetailsMotivation: Current LLM role-playing captures character tones and knowledge but fails to simulate the inner cognitive processes behind behaviors. There's a need for cognitive simulation with high-quality reasoning traces and human-aligned reward signals.

Method: Proposes HER framework with dual-layer thinking (first-person character thinking + third-person LLM thinking). Creates reasoning-augmented role-playing data via reverse engineering, constructs human-aligned principles and reward models, and trains models using supervised and reinforcement learning on Qwen3-32B.

Result: Models significantly outperform Qwen3-32B baseline with 30.26% improvement on CoSER benchmark and 14.97% gain on Minimax Role-Play Bench, demonstrating effective cognitive-level persona simulation.

Conclusion: HER framework successfully enables cognitive simulation in LLM role-playing through dual-layer thinking and human-aligned training, advancing beyond surface-level persona simulation to capture inner cognitive processes.

Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: a lack of data with high-quality reasoning traces, and a lack of reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters’ first-person thinking from LLMs’ third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.

[540] L$^3$: Large Lookup Layers

Albert Tseng, Christopher De Sa

Main category: cs.LG

TL;DR: L³ (Large Lookup Layer) introduces a new sparse architecture using static token-based routing to aggregate learned embeddings per token, offering better hardware efficiency than MoE while maintaining contextual information.

DetailsMotivation: Current sparse models using Mixture-of-Experts (MoE) have drawbacks like poor hardware efficiency, need for auxiliary losses, and dynamic routing complexity. Embedding tables are natively sparse but lack contextual information. The authors aim to create a sparse architecture that combines the benefits of both approaches.

Method: L³ generalizes embedding tables to decoder layers using static token-based routing to aggregate learned embeddings per token in a context-dependent way. It has two components: 1) a systems-friendly architecture for fast training and CPU-offloaded inference, and 2) an information-theoretic embedding allocation algorithm to balance speed and quality.

Result: Transformers with up to 2.6B active parameters trained with L³ strongly outperform both dense models and iso-sparse MoEs in both language modeling and downstream tasks.

Conclusion: L³ provides an effective new axis of sparsity that balances memory and compute by caching information in embeddings, offering better performance than existing sparse architectures while maintaining hardware efficiency.

Abstract: Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense MLP “experts.” However, dynamic hard routing has a number of drawbacks, such as potentially poor hardware efficiency and needing auxiliary losses for stable training. In contrast, the tokenizer embedding table, which is natively sparse, largely avoids these issues by selecting a single embedding per token at the cost of not having contextual information. In this work, we introduce the Large Lookup Layer (L$^3$), which unlocks a new axis of sparsity by generalizing embedding tables to model decoder layers. L$^3$ layers use static token-based routing to aggregate a set of learned embeddings per token in a context-dependent way, allowing the model to efficiently balance memory and compute by caching information in embeddings. L$^3$ has two main components: (1) a systems-friendly architecture that allows for fast training and CPU-offloaded inference with no overhead, and (2) an information-theoretic embedding allocation algorithm that effectively balances speed and quality. We empirically test L$^3$ by training transformers with up to 2.6B active parameters and find that L$^3$ strongly outperforms both dense models and iso-sparse MoEs in both language modeling and downstream tasks.
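
Code sketch: a minimal lookup layer in the spirit of L$^3$, assuming k learned embeddings per vocabulary entry selected by static token identity and mixed by a softmax computed from the current hidden state; the CPU-offloading machinery and the information-theoretic embedding allocation are not shown.

```python
import torch
import torch.nn as nn

class LookupLayer(nn.Module):
    """Static token-routed lookup: k learned embeddings per vocab entry,
    mixed by weights computed from the current hidden state."""

    def __init__(self, vocab_size: int, dim: int, k: int = 4):
        super().__init__()
        self.table = nn.Parameter(torch.randn(vocab_size, k, dim) * 0.02)
        self.mixer = nn.Linear(dim, k)   # context -> mixing logits

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq); hidden: (batch, seq, dim)
        cand = self.table[token_ids]                           # (batch, seq, k, dim) -- static routing
        weights = torch.softmax(self.mixer(hidden), dim=-1)    # (batch, seq, k)
        out = (weights.unsqueeze(-1) * cand).sum(dim=-2)       # context-dependent aggregation
        return hidden + out                                    # residual update

layer = LookupLayer(vocab_size=1000, dim=64, k=4)
ids = torch.randint(0, 1000, (2, 8))
h = torch.randn(2, 8, 64)
print(layer(ids, h).shape)   # torch.Size([2, 8, 64])
```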

[541] Discovering Hidden Gems in Model Repositories

Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen

Main category: cs.LG

TL;DR: The paper identifies “hidden gems” - unpopular fine-tuned models that outperform popular ones, proposes a Multi-Armed Bandit approach to efficiently discover these models, achieving 50x acceleration in discovery.

DetailsMotivation: Despite millions of fine-tuned models in public repositories, community usage concentrates on a small number of foundation checkpoints. The research investigates whether this reflects efficient market selection or if superior models are systematically overlooked.

Method: Extensive evaluation of over 2,000 models to identify “hidden gems.” Formulates model discovery as a Multi-Armed Bandit problem and accelerates Sequential Halving search algorithm using shared query sets and aggressive elimination schedules.

Result: Found rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. The method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.

Conclusion: Superior fine-tuned models are systematically overlooked in public repositories, and efficient discovery methods like the proposed Multi-Armed Bandit approach can significantly accelerate finding these “hidden gems.”

Abstract: Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of “hidden gems”, unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
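
Code sketch: the search treats each checkpoint as an arm and runs Sequential Halving on shared query sets. Below is a minimal sketch with a simulated accuracy oracle; the paper's aggressive elimination schedules are folded into the plain halving loop.

```python
import math
import random

def sequential_halving(models, evaluate, total_queries=2000):
    """models: list of arm identifiers.
    evaluate(model, n): returns mean score of `model` on n shared queries.
    Returns the surviving model after repeated halving rounds."""
    survivors = list(models)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    per_round = total_queries // rounds
    while len(survivors) > 1:
        n = max(1, per_round // len(survivors))   # queries per surviving arm
        scores = {m: evaluate(m, n) for m in survivors}
        survivors.sort(key=lambda m: scores[m], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

# Toy oracle: each "model" has a hidden accuracy; evaluation is a noisy sample mean.
true_acc = {f"ckpt-{i}": random.uniform(0.6, 0.97) for i in range(64)}
def noisy_eval(model, n):
    return sum(random.random() < true_acc[model] for _ in range(n)) / n

best = sequential_halving(list(true_acc), noisy_eval, total_queries=3200)
print(best, round(true_acc[best], 3))
```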

[542] Partial Feedback Online Learning

Shihao Shao, Cong Fang, Zhouchen Lin, Dacheng Tao

Main category: cs.LG

TL;DR: Partial-feedback online learning where learners only see one correct label per round but multiple responses may be valid, with applications to language generation. Complete characterization of minimax regret for deterministic and randomized learners in set-realizable regime.

DetailsMotivation: The paper addresses partial-feedback online learning scenarios where multiple correct responses exist but only one is observed per round, capturing real-world settings like language generation where multiple valid responses exist but training data provides only single references.

Method: Introduces Partial-Feedback Littlestone dimension (PFLdim) for deterministic learners and Partial-Feedback Measure Shattering dimension (PMSdim) for randomized learners. Uses new collection version space viewpoint and auxiliary dimensions. Analyzes conditions for inseparability between deterministic and randomized learnability.

Result: Complete characterization of minimax regret in set-realizable regime. Shows PFLdim precisely governs learnability for deterministic learners. Identifies conditions ensuring inseparability between deterministic and randomized learnability. Demonstrates sharp separation from weaker realistic and agnostic variants.

Conclusion: Partial-feedback online learning requires new complexity measures beyond standard approaches. Outside set realizability, the problem becomes information-theoretically intractable, highlighting need for noise-sensitive complexity measures for meaningful learnability characterization.

Abstract: We study partial-feedback online learning, where each instance admits a set of correct labels, but the learner only observes one correct label per round; any prediction within the correct set is counted as correct. This model captures settings such as language generation, where multiple responses may be valid but data provide only a single reference. We give a near-complete characterization of minimax regret for both deterministic and randomized learners in the set-realizable regime, i.e., in the regime where sublinear regret is generally attainable. For deterministic learners, we introduce the Partial-Feedback Littlestone dimension (PFLdim) and show it precisely governs learnability and minimax regret; technically, PFLdim cannot be defined via the standard version space, requiring a new collection version space viewpoint and an auxiliary dimension used only in the proof. We further develop the Partial-Feedback Measure Shattering dimension (PMSdim) to obtain tight bounds for randomized learners. We identify broad conditions ensuring inseparability between deterministic and randomized learnability (e.g., finite Helly number or nested-inclusion label structure), and extend the argument to set-valued online learning, resolving an open question of Raman et al. [2024b]. Finally, we show a sharp separation from weaker realistic and agnostic variants: outside set realizability, the problem can become information-theoretically intractable, with linear regret possible even for $|H|=2$. This highlights the need for fundamentally new, noise-sensitive complexity measures to meaningfully characterize learnability beyond set realizability.

[543] A block-coordinate descent framework for non-convex composite optimization. Application to sparse precision matrix estimation

Guillaume Lauga

Main category: cs.LG

TL;DR: A new BCD framework for non-convex composite optimization with convergence guarantees, applied to sparse precision matrix estimation (Graphical Lasso).

DetailsMotivation: BCD methods are widely used for large-scale optimization but lack comprehensive theoretical study for non-convex problems. The authors aim to develop a general BCD framework that ensures convergence for non-convex composite optimization.

Method: Proposes a general BCD framework that includes variable metric proximal gradient updates, proximal Newton updates, and alternated minimization updates. The framework unifies three popular Graphical Lasso solvers: graphical ISTA, Primal GLasso, and QUIC.

Result: The framework provides convergence guarantees for non-convex sparse precision matrix estimation and achieves up to 100-fold reduction in iterations needed to reach state-of-the-art estimation quality.

Conclusion: The proposed BCD framework offers a unified approach with strong theoretical guarantees and practical efficiency for non-convex optimization problems, particularly in sparse precision matrix estimation.

Abstract: Block-coordinate descent (BCD) methods are the method of choice for solving numerous large-scale optimization problems; however, their theoretical study for non-convex optimization has received less attention. In this paper, we present a new BCD framework to tackle non-convex composite optimization problems, ensuring decrease of the objective function and convergence to a solution. This framework is general enough to include variable-metric proximal gradient updates, proximal Newton updates, and alternated minimization updates. This generality allows it to encompass three versions of the most widely used solvers for the sparse precision matrix estimation problem, known as the Graphical Lasso: graphical ISTA, Primal GLasso, and QUIC. We demonstrate the value of this new framework on non-convex sparse precision matrix estimation problems, providing convergence guarantees and up to a $100$-fold reduction in the number of iterations required to reach state-of-the-art estimation quality.

[544] PPI-SVRG: Unifying Prediction-Powered Inference and Variance Reduction for Semi-Supervised Optimization

Ruicheng Ao, Hongyu Chen, Haoyang Liu, David Simchi-Levi, Will Wei Sun

Main category: cs.LG

TL;DR: PPI-SVRG combines prediction-powered inference with stochastic variance reduced gradient methods for semi-supervised optimization with scarce labeled data, showing mathematical equivalence and improved convergence with prediction uncertainty bounds.

DetailsMotivation: Addresses the challenge of semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available, aiming to leverage both prediction information and gradient variance reduction techniques.

Method: Shows mathematical equivalence between PPI (prediction-powered inference) and SVRG (stochastic variance reduced gradients), both using control variates. Develops PPI-SVRG that combines both approaches, with convergence analysis decomposing into standard SVRG rate plus error floor from prediction uncertainty.

Result: PPI-SVRG reduces MSE by 43-52% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7-2.9 percentage points on MNIST with only 10% labeled data. When predictions are perfect, recovers SVRG exactly; when predictions degrade, convergence remains stable but reaches larger neighborhood.

Conclusion: PPI-SVRG effectively combines prediction information with variance reduction for semi-supervised optimization, with convergence properties that depend only on loss geometry while predictions affect only neighborhood size, providing stable performance even with imperfect predictions.

Abstract: We study semi-supervised stochastic optimization when labeled data is scarce but predictions from pre-trained models are available. Prediction-powered inference (PPI) and stochastic variance reduced gradients (SVRG) both reduce variance through control variates – PPI uses predictions, SVRG uses reference gradients. We show they are mathematically equivalent and develop PPI-SVRG, which combines both. Our convergence bound decomposes into the standard SVRG rate plus an error floor from prediction uncertainty. The rate depends only on loss geometry; predictions affect only the neighborhood size. When predictions are perfect, we recover SVRG exactly. When predictions degrade, convergence remains stable but reaches a larger neighborhood. Experiments confirm the theory: PPI-SVRG reduces MSE by 43–52% under label scarcity on mean estimation benchmarks and improves test accuracy by 2.7–2.9 percentage points on MNIST with only 10% labeled data.
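
Code sketch: the shared control-variate idea in its simplest form, the classical prediction-powered mean estimator, assuming a small labeled set and a large unlabeled set of model predictions; the SVRG-style variant applies the same correction to gradients rather than means.

```python
import numpy as np

def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """Prediction-powered estimate of E[Y].

    y_labeled:   ground-truth labels on the small labeled set.
    f_labeled:   model predictions on the same labeled points.
    f_unlabeled: model predictions on the large unlabeled set.
    The residual term corrects the bias of the predictions, so the
    estimator stays unbiased even when f is systematically off.
    """
    return np.mean(f_unlabeled) + np.mean(np.asarray(y_labeled) - np.asarray(f_labeled))

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=100)                                 # scarce labels
f_lab = y + 0.3 + rng.normal(0, 0.1, 100)                          # biased but accurate predictions
f_unlab = rng.normal(1.0, 1.0, size=10_000) + 0.3 + rng.normal(0, 0.1, 10_000)
print(ppi_mean(y, f_lab, f_unlab))                                 # close to 1.0 despite the +0.3 bias
```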

[545] Best Arm Identification with LLM Judges and Limited Human

Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi

Main category: cs.LG

TL;DR: Proposes bias-corrected best-arm identification algorithm using cheap but biased proxy scores (like LLM judges) with selective expensive ground-truth auditing, achieving near-oracle efficiency.

DetailsMotivation: In many real-world applications (like content moderation), we have cheap but potentially biased proxy scores (e.g., from LLM judges) for every sample, while expensive ground-truth labels (e.g., human auditing) can only be acquired selectively. Standard multi-fidelity methods fail due to bias, and uniform auditing wastes scarce resources.

Method: Develops an estimator combining proxy scores with inverse-propensity-weighted residuals from ground-truth audits, constructs anytime-valid confidence sequences, and proposes an adaptive algorithm that concentrates audits on unreliable contexts and close arms using a plug-in Neyman rule.

Result: Theoretical guarantees show the algorithm achieves near-oracle audit efficiency and avoids mis-selection probability that wouldn’t vanish without bias correction. Numerical experiments confirm superior empirical performance compared to baselines.

Conclusion: The proposed framework effectively leverages biased proxy scores with selective ground-truth auditing for efficient best-arm identification, with theoretical guarantees and strong empirical performance.

Abstract: We study fixed-confidence best-arm identification (BAI) where a cheap but potentially biased proxy (e.g., LLM judge) is available for every sample, while an expensive ground-truth label can only be acquired selectively when using a human for auditing. Unlike classical multi-fidelity BAI, the proxy is biased (arm- and context-dependent) and ground truth is selectively observed. Consequently, standard multi-fidelity methods can mis-select the best arm, and uniform auditing, though accurate, wastes scarce resources and is inefficient. We prove that without bias correction and propensity adjustment, mis-selection probability may not vanish (even with unlimited proxy data). We then develop an estimator for the mean of each arm that combines proxy scores with inverse-propensity-weighted residuals and form anytime-valid confidence sequences for that estimator. Based on the estimator and confidence sequence, we propose an algorithm that adaptively selects and audits arms. The algorithm concentrates audits on unreliable contexts and close arms and we prove that a plug-in Neyman rule achieves near-oracle audit efficiency. Numerical experiments confirm the theoretical guarantees and demonstrate the superior empirical performance of the proposed algorithm.
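
Code sketch: a minimal version of the arm-mean estimator that combines proxy scores with inverse-propensity-weighted residuals from audited samples, assuming the audit probability (propensity) of each sample is known; the anytime-valid confidence sequences and adaptive audit allocation are not shown.

```python
import numpy as np

def debiased_arm_mean(proxy, audited_mask, truth, propensity):
    """proxy:        proxy scores (e.g. LLM-judge) for every sample of one arm.
    audited_mask: boolean array, True where a human label was acquired.
    truth:        ground-truth labels, only meaningful where audited_mask is True.
    propensity:   P(audit) for each sample (known, possibly context-dependent).
    Returns an unbiased estimate of the arm's true mean."""
    proxy = np.asarray(proxy, dtype=float)
    residual = np.where(audited_mask, (np.asarray(truth) - proxy) / np.asarray(propensity), 0.0)
    return proxy.mean() + residual.mean()

rng = np.random.default_rng(1)
n = 5000
truth = rng.binomial(1, 0.7, n).astype(float)            # true arm quality 0.7
proxy = np.clip(truth + rng.normal(0.1, 0.2, n), 0, 1)   # biased LLM-judge scores
p_audit = np.full(n, 0.05)                                # audit 5% of samples
audited = rng.random(n) < p_audit
print(proxy.mean(), debiased_arm_mean(proxy, audited, truth, p_audit))
```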

[546] ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

Xiuyu Li, Jinkai Zhang, Mingyang Yi, Yu Li, Longqiang Wang, Yue Wang, Ju Fan

Main category: cs.LG

TL;DR: ETS: Training-free inference method that samples from optimal RL policy using energy-guided test-time scaling with Monte Carlo estimation

DetailsMotivation: RL post-training alignment for language models is costly and unstable; need training-free inference method to directly sample from optimal RL policy

Method: Energy-Guided Test-Time Scaling (ETS) estimates energy term via online Monte Carlo with provable convergence, uses acceleration frameworks and importance sampling to reduce latency while preserving quality

Result: ETS consistently improves generation quality across reasoning, coding, and science benchmarks for MLM models including autoregressive and diffusion language models

Conclusion: ETS provides effective training-free inference method for sampling from optimal RL policy with practical efficiency and quality preservation

Abstract: Reinforcement Learning (RL) post-training alignment for language models is effective, but also costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method to sample directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion language models) across reasoning, coding, and science benchmarks show that our ETS consistently improves generation quality, validating its effectiveness and design.

[547] Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning

Chengyi Cai, Zesheng Ye, Peike Li, Bo Han, Jianzhong Qi, Feng Liu

Main category: cs.LG

TL;DR: ViKeR is a visual-guided key-token regularization method for multimodal LLM unlearning that prioritizes important answer tokens using visual cues to prevent privacy leakage while maintaining model performance.

DetailsMotivation: Existing MLLM unlearning methods treat all answer tokens uniformly and ignore visual cues, failing to prioritize key tokens that are crucial for effective privacy protection in multimodal contexts.

Method: Proposes Visual-Guided Key-Token Regularization (ViKeR) that uses irrelevant visual inputs to predict ideal post-unlearning token distributions, regularizes unlearning with these distributions, identifies key tokens via information entropy, and amplifies gradient updates on key tokens.

Result: Experiments on MLLMU and CLEAR benchmarks show ViKeR effectively performs unlearning while mitigating forgetting and maintaining response coherence compared to existing methods.

Conclusion: ViKeR successfully addresses limitations of existing MLLM unlearning by incorporating visual guidance and token-level prioritization, offering a more effective approach for privacy protection in multimodal question answering.

Abstract: Unlearning in Multimodal Large Language Models (MLLMs) prevents the model from revealing private information when queried about target images. Existing MLLM unlearning methods largely adopt approaches developed for LLMs. They treat all answer tokens uniformly, disregarding their varying importance in the unlearning process. Moreover, these methods focus exclusively on the language modality, disregarding visual cues that indicate key tokens in answers. In this paper, after formulating the problem of unlearning in multimodal question answering for MLLMs, we propose Visual-Guided Key-Token Regularization (ViKeR). We leverage irrelevant visual inputs to predict ideal post-unlearning token-level distributions and use these distributions to regularize the unlearning process, thereby prioritizing key tokens. Further, we define key tokens in unlearning via information entropy and discuss ViKeR’s effectiveness through token-level gradient reweighting, which amplifies updates on key tokens. Experiments on MLLMU and CLEAR benchmarks demonstrate that our method effectively performs unlearning while mitigating forgetting and maintaining response coherence.

[548] Task-Awareness Improves LLM Generations and Uncertainty

Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann

Main category: cs.LG

TL;DR: A decision-theoretic framework for LLMs that models outputs in task-dependent latent structures rather than language space, enabling Bayes-optimal response synthesis and improved uncertainty estimation.

DetailsMotivation: LLM responses often have underlying structures (labels, values, graphs) but current decoding and uncertainty methods ignore this structural information, operating only in language space.

Method: Model LLM outputs directly in task-dependent latent structures equipped with dissimilarity measures, compute Bayes-optimal responses by synthesizing new responses in latent space rather than selecting from samples.

Result: Bayes-optimal responses consistently outperform standard decoding methods like beam search across different tasks; uncertainty quantification via Bayesian risk better captures output quality and correctness.

Conclusion: The decision-theoretic framework enables reliable task-aware LLM predictions for any problem admitting latent response structures, improving both response quality and uncertainty estimation.

Abstract: In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
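
Code sketch: once the latent values have been parsed from sampled generations, the Bayes-optimal response minimizes expected dissimilarity and may not coincide with any single sample. Below is a minimal illustration for two standard cases, squared error over numbers (mean) and 0-1 loss over labels (majority vote); the paper's latent structures are richer than these.

```python
from collections import Counter
from statistics import mean

def bayes_optimal_numeric(samples: list[float]) -> float:
    """Squared-error dissimilarity: the risk minimizer is the sample mean,
    a value that may not appear in any individual generation."""
    return mean(samples)

def bayes_optimal_label(samples: list[str]) -> str:
    """0-1 dissimilarity over discrete labels: the risk minimizer is the mode."""
    return Counter(samples).most_common(1)[0][0]

# Parsed latent values from, say, several sampled LLM generations:
print(bayes_optimal_numeric([3.1, 3.3, 2.9, 3.0]))               # 3.075, not in any sample
print(bayes_optimal_label(["cat", "dog", "cat", "cat", "bird"])) # "cat"
```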

[549] Cascaded Transfer: Learning Many Tasks under Budget Constraints

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Mathilde Mougeot, Argyris Kalogeratos

Main category: cs.LG

TL;DR: Cascaded Transfer Learning: A hierarchical many-task learning approach using minimum spanning trees to efficiently transfer knowledge across large task collections under budget constraints.

DetailsMotivation: Addresses the challenge of learning many related tasks when exact task relationships are unknown, aiming to enable more accurate and cost-effective adaptation across large task collections compared to existing approaches.

Method: Proposes cascaded transfer learning organized as a rooted tree where information cascades hierarchically through tasks. Uses a minimum spanning tree structure based on task distance measures to connect tasks and allocates training budget along branches.

Result: Experiments on synthetic and real many-task settings demonstrate that the method enables more accurate and cost-effective adaptation across large task collections compared to alternative approaches.

Conclusion: Cascaded Transfer Learning provides an effective paradigm for many-task learning that respects budget constraints while improving accuracy through hierarchical knowledge transfer.

Abstract: Many-Task Learning refers to the setting where a large number of related tasks need to be learned while the exact relationships between the tasks are not known. We introduce Cascaded Transfer Learning, a novel many-task transfer learning paradigm where information (e.g., model parameters) cascades hierarchically through tasks that are learned by individual models of the same class, while respecting given budget constraints. The cascade is organized as a rooted tree that specifies the order in which tasks are learned and refined. We design a cascaded transfer mechanism deployed over a minimum spanning tree structure that connects the tasks according to a suitable distance measure, and allocates the available training budget along its branches. Experiments on synthetic and real many-task settings show that the resulting method enables more accurate and cost-effective adaptation across large task collections compared to alternative approaches.
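
Code sketch: the ordering step only, assuming scipy's minimum spanning tree over a dense task-distance matrix and a breadth-first traversal to produce (parent, child) transfer edges; how the training budget is allocated along branches is not shown.

```python
import numpy as np
from collections import deque
from scipy.sparse.csgraph import minimum_spanning_tree

def transfer_schedule(dist: np.ndarray, root: int = 0):
    """dist: symmetric (n_tasks, n_tasks) task-distance matrix.
    Returns (parent, child) pairs in the order tasks should be fine-tuned."""
    mst = minimum_spanning_tree(dist).toarray()
    adj = (mst + mst.T) > 0                      # undirected MST adjacency
    order, seen, queue = [], {root}, deque([root])
    while queue:
        parent = queue.popleft()
        for child in np.flatnonzero(adj[parent]):
            child = int(child)
            if child not in seen:
                seen.add(child)
                order.append((parent, child))
                queue.append(child)
    return order

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))                    # toy per-task descriptors
dist = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
for parent, child in transfer_schedule(dist):
    print(f"init task {child} from task {parent}'s parameters, then fine-tune")
```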

[550] A Unified SPD Token Transformer Framework for EEG Classification: Systematic Comparison of Geometric Embeddings

Chi-Sheng Chen, En-Jui Kuo, Guan-Ying Chen, Xinyu Zhang, Fan Zhang

Main category: cs.LG

TL;DR: Theoretical analysis linking SPD manifold embeddings to gradient conditioning and numerical stability, with validation via Transformer framework on EEG data.

DetailsMotivation: To explore the theoretical connection between embedding geometry and optimization dynamics for SPD manifolds in EEG signal processing, which has remained unexplored despite the Riemannian nature of spatial covariance matrices.

Method: Formal analysis establishing three theoretical results: 1) BWSPD’s gradient conditioning advantages, 2) Embedding-Space Batch Normalization approximation, 3) bi-Lipschitz bounds. Validation via unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings across 1,500+ runs on three EEG paradigms.

Result: Log-Euclidean Transformer achieves state-of-the-art performance on all EEG datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines. BWSPD offers competitive accuracy with similar training time.

Conclusion: Theoretical analysis reveals embedding choice impacts gradient conditioning and numerical stability for SPD manifolds, with practical validation showing Log-Euclidean embeddings perform best for EEG classification tasks.

Abstract: Spatial covariance matrices of EEG signals are Symmetric Positive Definite (SPD) and lie on a Riemannian manifold, yet the theoretical connection between embedding geometry and optimization dynamics remains unexplored. We provide a formal analysis linking embedding choice to gradient conditioning and numerical stability for SPD manifolds, establishing three theoretical results: (1) BWSPD’s $\sqrt{\kappa}$ gradient conditioning (vs $\kappa$ for Log-Euclidean) via Daleckii-Kreĭn matrices provides better gradient conditioning on high-dimensional inputs ($d \geq 22$), with this advantage reducing on low-dimensional inputs ($d \leq 8$) where eigendecomposition overhead dominates; (2) Embedding-Space Batch Normalization (BN-Embed) approximates Riemannian normalization up to $O(\varepsilon^2)$ error, yielding $+26\%$ accuracy on 56-channel ERP data but negligible effect on 8-channel SSVEP data, matching the channel-count-dependent prediction; (3) bi-Lipschitz bounds prove BWSPD tokens preserve manifold distances with distortion governed solely by the condition ratio $\kappa$. We validate these predictions via a unified Transformer framework comparing BWSPD, Log-Euclidean, and Euclidean embeddings within identical architecture across 1,500+ runs on three EEG paradigms (motor imagery, ERP, SSVEP; 36 subjects). Our Log-Euclidean Transformer achieves state-of-the-art performance on all datasets, substantially outperforming classical Riemannian classifiers and recent SPD baselines, while BWSPD offers competitive accuracy with similar training time.
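
Code sketch: the Log-Euclidean tokenization of an EEG spatial covariance, using the matrix logarithm via a symmetric eigendecomposition and a norm-preserving vectorization of the upper triangle; the BWSPD embedding and BN-Embed are not reproduced here.

```python
import numpy as np

def log_euclidean_token(cov: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map an SPD covariance (d, d) to its Log-Euclidean vectorization.

    logm(C) = U diag(log(lambda)) U^T via the symmetric eigendecomposition;
    the diagonal plus the upper triangle (off-diagonals scaled by sqrt(2)
    to preserve the Frobenius norm) is returned as a flat token.
    """
    eigval, eigvec = np.linalg.eigh(cov)
    log_cov = eigvec @ np.diag(np.log(np.maximum(eigval, eps))) @ eigvec.T
    iu = np.triu_indices_from(log_cov, k=1)
    return np.concatenate([np.diag(log_cov), np.sqrt(2.0) * log_cov[iu]])

# EEG-style usage: the covariance of a (channels, samples) trial becomes one token.
rng = np.random.default_rng(0)
trial = rng.normal(size=(22, 500))              # 22-channel motor-imagery trial
cov = np.cov(trial)                             # (22, 22) SPD matrix
print(log_euclidean_token(cov).shape)           # (253,) = 22 + 22*21/2
```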

[551] More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg

Main category: cs.LG

TL;DR: ReD (Reset-and-Discard) is a query method for LLMs that improves coverage@cost by reducing diminishing returns in repeated attempts, connecting pass@k metrics to coverage efficiency.

DetailsMotivation: Current LLM evaluation uses pass@k (probability of correct answer in k trials), but at fixed budgets, coverage@cost (unique questions answered per attempt) is more practical. There's a problem of diminishing returns as coverage grows sublinearly with attempts due to power-law behavior in pass@k.

Method: Proposes Reset-and-Discard (ReD) query method that resets the sampling process after each successful answer and discards repeated attempts on already-solved problems. This increases coverage@cost regardless of pass@k form. Also provides mathematical framework to connect pass@k and coverage@cost, and can infer power-law exponents if pass@k is unknown.

Result: Experiments on three LLMs using HumanEval show ReD substantially reduces required attempts, tokens, and USD cost to reach desired coverage. Also provides efficient way to measure inference power-laws.

Conclusion: ReD is an effective method to improve LLM query efficiency for verifiable tasks, addressing diminishing returns in coverage growth and offering practical cost savings while providing insights into model inference behavior.

Abstract: The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method of LLMs that increases coverage@cost for any given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs using HumanEval demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws.
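
Code sketch: a toy simulation contrasting uniform re-querying with a reset-and-discard loop that never spends attempts on already-solved questions, assuming per-question solve probabilities are known only to the simulator; the paper's power-law analysis of pass@k is not reproduced.

```python
import random

def coverage_naive(p, budget):
    """Spread attempts uniformly over all questions, solved or not."""
    solved = set()
    for t in range(budget):
        q = t % len(p)
        if random.random() < p[q]:
            solved.add(q)
    return len(solved)

def coverage_red(p, budget):
    """Reset-and-Discard style: never spend an attempt on an already-solved question."""
    solved = set()
    for _ in range(budget):
        unsolved = [q for q in range(len(p)) if q not in solved]
        if not unsolved:
            break
        q = random.choice(unsolved)
        if random.random() < p[q]:
            solved.add(q)
    return len(solved)

random.seed(0)
probs = [random.uniform(0.02, 0.6) for _ in range(100)]   # per-question pass@1
budget = 400
print("naive coverage:", coverage_naive(probs, budget))
print("discard coverage:", coverage_red(probs, budget))
```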

[552] Explicit Credit Assignment through Local Rewards and Dependence Graphs in Multi-Agent Reinforcement Learning

Bang Giang Le, Viet Cuong Ta

Main category: cs.LG

TL;DR: A multi-agent reinforcement learning method that combines global and local rewards using an interaction graph to better discern individual agent contributions while promoting cooperation.

DetailsMotivation: Global rewards in multi-agent RL are noisy due to combined agent contributions, while local rewards lead to suboptimal myopic behavior. The paper aims to combine the benefits of both approaches.

Method: Proposes using an interaction graph between agents to discern individual contributions more finely than global rewards, while alleviating cooperation problems of local rewards. Also introduces a practical approach for approximating such graphs.

Result: Experiments demonstrate the flexibility of the approach, showing improvements over traditional local and global reward settings in multi-agent reinforcement learning.

Conclusion: The proposed graph-based method effectively combines the merits of both global and local reward approaches in multi-agent RL, enabling better cooperation and credit assignment.

Abstract: To promote cooperation in Multi-Agent Reinforcement Learning, the reward signals of all agents can be aggregated together, forming global rewards in what is commonly known as the fully cooperative setting. However, global rewards are usually noisy because they contain the contributions of all agents, which have to be resolved in the credit assignment process. On the other hand, using local rewards benefits from faster learning due to the separation of agents’ contributions, but can be suboptimal as agents myopically optimize their own reward while disregarding global optimality. In this work, we propose a method that combines the merits of both approaches. By using a graph of interaction between agents, our method discerns individual agent contributions in a more fine-grained manner than a global reward, while alleviating the cooperation problem that arises with purely local rewards. We also introduce a practical approach for approximating such a graph. Our experiments demonstrate the flexibility of the approach, enabling improvements over the traditional local and global reward settings.
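
Code sketch: one simple instantiation of graph-based credit, assuming each agent is trained on its own reward plus its neighbours' rewards under a fixed interaction graph; the paper's graph approximation and learning algorithm are not shown, so this is only an illustration of the interpolation between local and global credit.

```python
import numpy as np

def graph_mixed_rewards(local_rewards: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """local_rewards: (n_agents,) per-agent rewards at one step.
    adjacency: (n_agents, n_agents) 0/1 interaction graph (no self-loops needed).
    Each agent receives its own reward plus its neighbours' rewards,
    interpolating between purely local and fully global credit."""
    mix = np.eye(len(local_rewards)) + adjacency
    return mix @ local_rewards

r = np.array([1.0, 0.0, 0.5, -0.2])
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]], dtype=float)
print(graph_mixed_rewards(r, adj))
# agent 3 is isolated -> keeps its local reward; agent 1 shares credit with agents 0 and 2
```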

[553] Fast and Geometrically Grounded Lorentz Neural Networks

Robert van der Klis, Ricardo Chávez Torres, Max van Spengler, Yuhui Ding, Thomas Hofmann, Pascal Mettes

Main category: cs.LG

TL;DR: Proposes new Lorentz linear layers for hyperbolic neural networks that maintain hyperbolic geometry properties and computational efficiency, addressing scaling issues in previous formulations.

DetailsMotivation: Hyperbolic space shows promise for hierarchical representation learning, but current Lorentz model implementations have scaling issues where hyperbolic norms grow logarithmically with gradient steps, undermining hyperbolic geometry advantages.

Method: Introduces new Lorentz linear layers based on “distance-to-hyperplane” formulation, along with Lorentzian activation functions and caching strategies to maintain hyperbolic geometry while improving computational efficiency.

Result: The new formulation achieves linear scaling of output hyperbolic norms with gradient descent steps, bridging the computation gap with Euclidean neural networks while preserving hyperbolic geometry properties.

Conclusion: Proposed hyperbolic neural network formulation maintains key hyperbolic geometry advantages while achieving computational efficiency comparable to Euclidean networks, enabling practical hyperbolic representation learning.

Abstract: Hyperbolic space is quickly gaining traction as a promising geometry for hierarchical and robust representation learning. A core open challenge is the development of a mathematical formulation of hyperbolic neural networks that is both efficient and captures the key properties of hyperbolic space. The Lorentz model of hyperbolic space has been shown to enable both fast forward and backward propagation. However, we prove that, with the current formulation of Lorentz linear layers, the hyperbolic norms of the outputs scale logarithmically with the number of gradient descent steps, nullifying the key advantage of hyperbolic geometry. We propose a new Lorentz linear layer grounded in the well-known “distance-to-hyperplane” formulation. We prove that our formulation results in the usual linear scaling of output hyperbolic norms with respect to the number of gradient descent steps. Our new formulation, together with further algorithmic efficiencies through Lorentzian activation functions and a new caching strategy, results in neural networks fully abiding by hyperbolic geometry while simultaneously bridging the computation gap to Euclidean neural networks. Code available at: https://github.com/robertdvdk/hyperbolic-fully-connected.

[554] Multi-Modal Time Series Prediction via Mixture of Modulated Experts

Lige Zhang, Ali Maatouk, Jialin Chen, Leandros Tassiulas, Rex Ying

Main category: cs.LG

TL;DR: Expert Modulation: A new paradigm for multi-modal time series prediction that conditions MoE routing and expert computation on textual signals for better cross-modal alignment without token-level fusion.

DetailsMotivation: Existing multi-modal time series forecasting methods rely on token-level fusion that mixes temporal patches with language tokens, which is problematic when high-quality time-text pairs are scarce and when time series exhibit substantial scale/characteristic variations, complicating cross-modal alignment.

Method: Proposes Expert Modulation, a paradigm that conditions both routing and expert computation in Mixture-of-Experts architectures on textual signals, enabling direct and efficient cross-modal control over expert behavior without token-level fusion.

Result: The method demonstrates substantial improvements in multi-modal time series prediction through comprehensive theoretical analysis and experiments.

Conclusion: Expert Modulation provides a more effective approach for multi-modal time series forecasting by enabling direct textual conditioning of expert behavior, addressing limitations of token-level fusion methods.

Abstract: Real-world time series exhibit complex and evolving dynamics, making accurate forecasting extremely challenging. Recent multi-modal forecasting methods leverage textual information such as news reports to improve prediction, but most rely on token-level fusion that mixes temporal patches with language tokens in a shared embedding space. However, such fusion can be ill-suited when high-quality time-text pairs are scarce and when time series exhibit substantial variation in scale and characteristics, thus complicating cross-modal alignment. In parallel, Mixture-of-Experts (MoE) architectures have proven effective for both time series modeling and multi-modal learning, yet many existing MoE-based modality integration methods still depend on token-level fusion. To address this, we propose Expert Modulation, a new paradigm for multi-modal time series prediction that conditions both routing and expert computation on textual signals, enabling direct and efficient cross-modal control over expert behavior. Through comprehensive theoretical analysis and experiments, our proposed method demonstrates substantial improvements in multi-modal time series prediction. The current code is available at https://github.com/BruceZhangReve/MoME
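
Code sketch: a text-conditioned MoE router, assuming a single pooled text embedding per series is concatenated to each temporal patch before computing routing logits; how the experts themselves are modulated in the full method is not shown.

```python
import torch
import torch.nn as nn

class TextConditionedRouter(nn.Module):
    """MoE router whose logits depend on both the time-series token and a
    pooled text embedding (e.g. an encoded news summary)."""

    def __init__(self, ts_dim: int, text_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.proj = nn.Linear(ts_dim + text_dim, n_experts)
        self.top_k = top_k

    def forward(self, ts_tokens: torch.Tensor, text_emb: torch.Tensor):
        # ts_tokens: (batch, n_patches, ts_dim); text_emb: (batch, text_dim)
        text = text_emb.unsqueeze(1).expand(-1, ts_tokens.size(1), -1)
        logits = self.proj(torch.cat([ts_tokens, text], dim=-1))
        weights, idx = torch.topk(torch.softmax(logits, dim=-1), self.top_k, dim=-1)
        return weights, idx            # which experts process each temporal patch

router = TextConditionedRouter(ts_dim=64, text_dim=32)
ts = torch.randn(4, 16, 64)            # 4 series, 16 temporal patches each
txt = torch.randn(4, 32)               # pooled text signal per series
w, idx = router(ts, txt)
print(w.shape, idx.shape)              # (4, 16, 2) each
```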

[555] HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction

Susu Hu, Qinghe Zeng, Nithya Bhasker, Jakob Nicolas Kather, Stefanie Speidel

Main category: cs.LG

TL;DR: HistoPrism is a transformer-based model for pan-cancer prediction of gene expression from H&E histology images, focusing on biologically coherent pathway-level predictions rather than just individual gene variance.

DetailsMotivation: Current methods for predicting spatial gene expression from histology are limited to per-cancer settings and focus on variance-based evaluation, lacking exploration of functional relevance and biological coherence. There's a need for models that generalize across cancer types and capture meaningful biological signals for clinical impact.

Method: HistoPrism uses an efficient transformer-based architecture designed for pan-cancer prediction of gene expression from H&E histology. The approach introduces a pathway-level benchmark to shift evaluation from isolated gene-level variance to coherent functional pathways.

Result: HistoPrism surpasses prior state-of-the-art models on highly variable genes and achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns with strong pan-cancer generalization and improved efficiency.

Conclusion: HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology by focusing on biologically meaningful pathway-level predictions and demonstrating strong generalization across cancer types.

Abstract: Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes but also, more importantly, achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.
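
Code sketch: one plausible pathway-level evaluation, assuming a pathway score is the mean expression of its member genes and quality is the Pearson correlation between predicted and measured pathway scores across spots; the paper's exact aggregation and metric may differ.

```python
import numpy as np

def pathway_correlations(pred, truth, pathways, gene_index):
    """pred, truth: (n_spots, n_genes) predicted / measured expression.
    pathways: dict mapping pathway name -> list of member gene symbols.
    gene_index: dict mapping gene symbol -> column index.
    Returns per-pathway Pearson correlation of mean member expression."""
    scores = {}
    for name, genes in pathways.items():
        cols = [gene_index[g] for g in genes if g in gene_index]
        if not cols:
            continue
        p = pred[:, cols].mean(axis=1)
        t = truth[:, cols].mean(axis=1)
        scores[name] = float(np.corrcoef(p, t)[0, 1])
    return scores

rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 50))                           # 200 spots, 50 genes
pred = truth + rng.normal(scale=0.5, size=truth.shape)       # imperfect predictions
gene_index = {f"G{i}": i for i in range(50)}                 # hypothetical gene symbols
pathways = {"hypoxia": [f"G{i}" for i in range(0, 10)],
            "glycolysis": [f"G{i}" for i in range(10, 25)]}
print(pathway_correlations(pred, truth, pathways, gene_index))
```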

[556] SAL: Selective Adaptive Learning for Backpropagation-Free Training with Sparsification

Fanping Liu, Hua Yang, Jiasi Zou

Main category: cs.LG

TL;DR: SAL is a biologically-inspired training method that uses selective parameter activation and adaptive area partitioning to mitigate gradient interference and weight symmetry constraints in deep learning.

DetailsMotivation: Standard backpropagation suffers from biologically implausible weight symmetry requirements and significant gradient interference in dense representations, which limits training efficiency and scalability.

Method: Proposes Selective Adaptive Learning (SAL) that decomposes parameter space into mutually exclusive, sample-dependent regions, combining selective parameter activation with adaptive area partitioning and refined feedback alignment.

Result: SAL demonstrates competitive convergence rates, improved classification performance across 10 benchmarks, numerical consistency in deep regimes (up to 128 layers), and competitive accuracy in large-scale models (up to 1B parameters).

Conclusion: SAL offers a biologically plausible alternative to backpropagation that addresses key limitations and contributes to scalable neural network training research.

Abstract: Standard deep learning relies on Backpropagation (BP), which is constrained by biologically implausible weight symmetry and suffers from significant gradient interference within dense representations. To mitigate these bottlenecks, we propose Selective Adaptive Learning (SAL), a training method that combines selective parameter activation with adaptive area partitioning. Specifically, SAL decomposes the parameter space into mutually exclusive, sample-dependent regions. This decoupling mitigates gradient interference across divergent semantic patterns and addresses explicit weight symmetry requirements through our refined feedback alignment. Empirically, SAL demonstrates competitive convergence rates, leading to improved classification performance across 10 standard benchmarks. Additionally, SAL achieves numerical consistency and competitive accuracy even in deep regimes (up to 128 layers) and large-scale models (up to 1B parameters). Our approach is loosely inspired by biological learning mechanisms, offering a plausible alternative that contributes to the study of scalable neural network training.

[557] Representation Unlearning: Forgetting through Information Compression

Antonio Almudévar, Alfonso Ortega

Main category: cs.LG

TL;DR: Representation Unlearning: A framework for machine unlearning that operates in representation space rather than modifying model parameters, using information bottleneck to suppress forgotten data information while retaining utility.

DetailsMotivation: Existing machine unlearning approaches modify model parameters directly, which can be unstable, computationally costly, and limited by local approximations. There's a need for more stable and efficient unlearning methods driven by privacy regulations and robustness concerns.

Method: Instead of modifying model parameters, the method learns a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. Uses variational surrogates to make the objective tractable, working in two regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed.

Result: Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

Conclusion: Representation Unlearning provides a more effective framework for machine unlearning by operating in representation space rather than modifying parameters directly, offering improved stability, efficiency, and performance.

Abstract: Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model’s representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.
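
To make the bottleneck objective concrete, here is a minimal sketch (not the authors' implementation) of the retain/forget regime: a learnable transformation over frozen representations is trained with a hypothetical classifier head as the surrogate for retained usable information, while a uniform-prediction term stands in for the suppression of forget-set information. Dimensions, the head, and the `beta` trade-off are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_classes = 256, 10                     # hypothetical representation size / label space

T = nn.Linear(d, d)                          # learnable transformation over frozen representations
head = nn.Linear(d, num_classes)             # surrogate head for retained (usable) information
opt = torch.optim.Adam(list(T.parameters()) + list(head.parameters()), lr=1e-3)

def unlearning_step(z_retain, y_retain, z_forget, beta=1.0):
    """One step: keep label information on the retain set, push forget-set
    predictions toward uniform as a simple stand-in for information suppression."""
    retain_loss = F.cross_entropy(head(T(z_retain)), y_retain)
    forget_logits = head(T(z_forget))
    uniform = torch.full_like(forget_logits, 1.0 / num_classes)
    forget_loss = F.kl_div(F.log_softmax(forget_logits, dim=-1), uniform,
                           reduction="batchmean")
    loss = retain_loss + beta * forget_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

# Toy usage with random frozen embeddings standing in for encoder outputs.
loss = unlearning_step(torch.randn(32, d), torch.randint(0, num_classes, (32,)), torch.randn(8, d))
```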

[558] FlexCausal: Flexible Causal Disentanglement via Structural Flow Priors and Manifold-Aware Interventions

Yutao Jin, Yuang Tao, Junyong Zhai

Main category: cs.LG

TL;DR: FlexCausal is a causal disentangled representation learning framework using block-diagonal covariance VAE with factorized flow-based priors to model complex non-Gaussian exogenous noise, integrating supervised alignment and counterfactual constraints for precise causal structure learning.

DetailsMotivation: Existing disentanglement methods use mean-field approximations with diagonal posterior covariance that decorrelate all latent dimensions, and assume isotropic Gaussian priors that fail to capture complex non-Gaussian statistical properties of real-world causal factors.

Method: Proposes FlexCausal framework based on block-diagonal covariance VAE with Factorized Flow-based Prior to model complex densities of exogenous noise, integrates supervised alignment objectives with counterfactual consistency constraints, and uses manifold-aware relative intervention strategy for high-fidelity generation.

Result: Experimental results on both synthetic and real-world datasets demonstrate that FlexCausal significantly outperforms other methods.

Conclusion: FlexCausal effectively addresses limitations of existing CDRL methods by better modeling complex non-Gaussian noise distributions and ensuring precise structural correspondence between learned latent subspaces and ground-truth causal relations.

Abstract: Causal Disentangled Representation Learning (CDRL) aims to learn and disentangle low-dimensional representations and their underlying causal structure from observations. However, existing disentanglement methods rely on a standard mean-field approximation with a diagonal posterior covariance, which decorrelates all latent dimensions. Additionally, these methods often assume isotropic Gaussian priors for exogenous noise, failing to capture the complex, non-Gaussian statistical properties prevalent in real-world causal factors. Therefore, we propose FlexCausal, a novel CDRL framework based on a block-diagonal covariance VAE. FlexCausal utilizes a Factorized Flow-based Prior to realistically model the complex densities of exogenous noise, effectively decoupling the learning of causal mechanisms from distributional statistics. By integrating supervised alignment objectives with counterfactual consistency constraints, our framework ensures a precise structural correspondence between the learned latent subspaces and the ground-truth causal relations. Finally, we introduce a manifold-aware relative intervention strategy to ensure high-fidelity generation. Experimental results on both synthetic and real-world datasets demonstrate that FlexCausal significantly outperforms other methods.

[559] Bridging Functional and Representational Similarity via Usable Information

Antonio Almudévar, Alfonso Ortega

Main category: cs.LG

TL;DR: A unified framework for quantifying representation similarity through usable information theory, connecting functional similarity (via stitching), representational similarity (via reconstruction metrics), and showing representational similarity is sufficient but not necessary for functional similarity.

DetailsMotivation: To provide a rigorous theoretical foundation for comparing neural representations across different models, addressing limitations of existing similarity metrics and establishing formal connections between functional and representational similarity measures.

Method: Develops a unified information-theoretic framework using usable information theory. Establishes formal links between stitching performance and conditional mutual information, analyzes reconstruction-based metrics as estimators of usable information, and introduces a task-granularity hierarchy to unify concepts.

Result: Shows stitching is inherently asymmetric requiring bidirectional analysis, proves reconstruction metrics estimate usable information under constraints, demonstrates similarity depends on predictive family capacity, and establishes representational similarity as sufficient but not necessary for functional similarity.

Conclusion: Provides a comprehensive theoretical framework for representation similarity analysis that unifies functional and representational perspectives through information theory, offering principled guidance for comparing neural representations across models.

Abstract: We present a unified framework for quantifying the similarity between representations through the lens of \textit{usable information}, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we prove that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.

[560] Signal-Adaptive Trust Regions for Gradient-Free Optimization of Recurrent Spiking Neural Networks

Jinhao Li, Yuhao Sun, Zhiyuan Ma, Hao He, Xinche Zhang, Xing Chen, Jin Li, Sen Song

Main category: cs.LG

TL;DR: SATR introduces signal-adaptive trust regions for training recurrent spiking neural networks in RL, using KL divergence normalized by signal energy to stabilize gradient-free optimization with limited population sizes.

DetailsMotivation: Training recurrent spiking neural networks (RSNNs) for high-dimensional, long-horizon reinforcement learning is challenging due to non-differentiable spike dynamics. Gradient-free optimization with finite populations suffers from high variance in gradient estimates, leading to harmful aggressive updates.

Method: Proposes Signal-Adaptive Trust Regions (SATR), a distributional update rule that constrains relative policy changes by bounding KL divergence normalized by estimated signal energy. Automatically expands trust region under strong signals and contracts it when updates are noise-dominated. Instantiated for Bernoulli connectivity distributions and implemented with bitset optimization for binary spiking/weights.

Result: SATR improves stability under limited populations and achieves competitive returns across high-dimensional continuous-control benchmarks against strong baselines including PPO-LSTM. Bitset implementation substantially reduces wall-clock training time.

Conclusion: SATR provides an effective method for training RSNNs in RL by addressing gradient estimation variance through adaptive trust regions, making RSNN policy search practical at scale.

Abstract: Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, high variance of these estimates can induce harmful and overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose \textbf{Signal-Adaptive Trust Regions (SATR)}, a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches competitive returns against strong baselines including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.
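
A minimal sketch of the trust-region mechanism for a Bernoulli connectivity distribution: the step size of a population-estimated update is shrunk by bisection until the KL divergence between old and new distributions, normalized by an estimated signal energy, stays within a bound. Taking the squared norm of the update estimate as the signal energy is an assumption; the paper's exact estimator and optimizer are not reproduced here.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-8):
    """Sum of KL(Bern(p) || Bern(q)) over all connection probabilities."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return float(np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))))

def satr_update(theta, update_est, kl_bound=0.01, iters=30):
    """Scale a gradient-free update of Bernoulli logits so that KL(old || new),
    normalized by the (assumed) signal energy, stays within the bound."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_old = sigmoid(theta)
    energy = float(update_est @ update_est) + 1e-8     # strong signal -> larger allowed KL
    lo, hi = 0.0, 10.0
    for _ in range(iters):                             # bisection on the step size
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(p_old, sigmoid(theta + mid * update_est)) / energy <= kl_bound:
            lo = mid
        else:
            hi = mid
    return theta + lo * update_est

theta = np.zeros(64)                                   # connection logits of an RSNN policy
theta = satr_update(theta, np.random.randn(64))        # one constrained update
```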

[561] Learning the Mechanism of Catastrophic Forgetting: A Perspective from Gradient Similarity

Mutian Yang, Zisen Zhan, Yutong Chen, Haolin Li, Kaiwen Wang, Kaili Zheng, Yuguang Wang, Qi Wang, Jiandong Gao, Ji Wu

Main category: cs.LG

TL;DR: Theoretical framework identifies gradient similarity as cause of catastrophic forgetting in LLMs, proposes Collaborative Neural Learning (CNL) to freeze conflicting neurons and update only collaborative ones, achieving near-zero forgetting.

DetailsMotivation: Catastrophic forgetting during knowledge injection severely undermines continual learning in LLMs, and existing methods lack foundational theoretical explanations for this phenomenon.

Method: Established gradient-based theoretical framework to explain forgetting, identified strongly negative gradient similarity as fundamental cause, categorized neurons into conflicting (cause forgetting) and collaborative (mitigate forgetting), then proposed CNL method that freezes conflicting neurons and updates only collaborative neurons.

Result: Experiments on five LLMs, four datasets, and four optimizers show CNL achieves zero forgetting in in-set settings and reduces forgetting by 59.1%-81.7% in out-of-set settings.

Conclusion: Theoretical framework explains catastrophic forgetting through gradient similarity, and CNL effectively mitigates forgetting by selectively updating collaborative neurons while freezing conflicting ones.

Abstract: Catastrophic forgetting during knowledge injection severely undermines the continual learning capability of large language models (LLMs). Although existing methods attempt to mitigate this issue, they often lack a foundational theoretical explanation. We establish a gradient-based theoretical framework to explain catastrophic forgetting. We first prove that strongly negative gradient similarity is a fundamental cause of forgetting. We then use gradient similarity to identify two types of neurons: conflicting neurons that induce forgetting and account for 50%-75% of neurons, and collaborative neurons that mitigate forgetting and account for 25%-50%. Based on this analysis, we propose a knowledge injection method, Collaborative Neural Learning (CNL). By freezing conflicting neurons and updating only collaborative neurons, CNL theoretically eliminates catastrophic forgetting under an infinitesimal learning rate $\eta$ and an exactly known mastered set. Experiments on five LLMs, four datasets, and four optimizers show that CNL achieves zero forgetting in in-set settings and reduces forgetting by 59.1%-81.7% in out-of-set settings.
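
A minimal sketch of the neuron-selection rule (illustrative granularity: one similarity score per output neuron, i.e., per weight-matrix row): neurons whose knowledge-injection gradient conflicts with the gradient on already-mastered data are frozen, and only collaborative neurons are updated.

```python
import torch

def cnl_mask(grad_new, grad_old, threshold=0.0, eps=1e-8):
    """Row-wise cosine similarity between the new-knowledge gradient and the
    mastered-set gradient; rows (neurons) below the threshold are frozen."""
    sim = (grad_new * grad_old).sum(-1) / (grad_new.norm(dim=-1) * grad_old.norm(dim=-1) + eps)
    return (sim > threshold).float().unsqueeze(-1)

# Toy example on a single weight matrix with per-neuron (per-row) gradients.
W = torch.randn(128, 64)
grad_new = torch.randn(128, 64)       # gradient of the knowledge-injection loss
grad_old = torch.randn(128, 64)       # gradient of a loss on already-mastered data
W -= 1e-3 * cnl_mask(grad_new, grad_old) * grad_new   # update only collaborative neurons
```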

[562] Evaluating Prediction Uncertainty Estimates from BatchEnsemble

Morten Blørstad, Herman Jangsett Mostein, Nello Blaser, Pekka Parviainen

Main category: cs.LG

TL;DR: BatchEnsemble and GRUBE provide efficient uncertainty estimation for deep learning models, matching deep ensemble performance with fewer parameters and faster training/inference.

DetailsMotivation: Deep learning models have difficulty with uncertainty estimation - existing methods are either computationally expensive or underestimate uncertainty, creating a need for scalable solutions.

Method: Proposes BatchEnsemble as a general uncertainty estimation method, and introduces GRUBE (BatchEnsemble GRU cell) for sequential modeling. Compares against Monte Carlo dropout and deep ensembles.

Result: BatchEnsemble matches deep ensemble uncertainty performance while outperforming Monte Carlo dropout. GRUBE achieves similar or better prediction and uncertainty estimation with fewer parameters and reduced computational costs.

Conclusion: BatchEnsemble and GRUBE offer efficient uncertainty estimation that matches traditional ensemble performance with computational advantages, making uncertainty estimation more practical for real-world applications.

Abstract: Deep learning models struggle with uncertainty estimation. Many approaches are either computationally infeasible or underestimate uncertainty. We investigate \textit{BatchEnsemble} as a general and scalable method for uncertainty estimation across both tabular and time series tasks. To extend BatchEnsemble to sequential modeling, we introduce GRUBE, a novel BatchEnsemble GRU cell. We compare BatchEnsemble to Monte Carlo dropout and deep ensemble models. Our results show that BatchEnsemble matches the uncertainty estimation performance of deep ensembles and clearly outperforms Monte Carlo dropout. GRUBE achieves similar or better performance in both prediction and uncertainty estimation. These findings show that BatchEnsemble and GRUBE achieve similar performance with fewer parameters and reduced training and inference time compared to traditional ensembles.
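
For readers unfamiliar with BatchEnsemble, a minimal sketch of its shared-weight, rank-1 fast-weight linear layer is shown below; GRUBE (not reproduced here) wires layers of this form into a GRU cell's gates, and the spread of the member predictions serves as the uncertainty estimate.

```python
import torch
import torch.nn as nn

class BatchEnsembleLinear(nn.Module):
    """y_k = ((x * r_k) @ W) * s_k + b_k for member k, sharing one slow weight W."""
    def __init__(self, d_in, d_out, n_members):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_in, d_out) * d_in ** -0.5)
        self.r = nn.Parameter(torch.ones(n_members, d_in))     # per-member input fast weights
        self.s = nn.Parameter(torch.ones(n_members, d_out))    # per-member output fast weights
        self.b = nn.Parameter(torch.zeros(n_members, d_out))

    def forward(self, x):                 # x: (n_members, batch, d_in)
        return (x * self.r[:, None, :]) @ self.W * self.s[:, None, :] + self.b[:, None, :]

layer = BatchEnsembleLinear(16, 8, n_members=4)
x = torch.randn(32, 16).expand(4, 32, 16)         # the same inputs shown to every member
preds = layer(x)                                   # (4, 32, 8)
mean, unc = preds.mean(dim=0), preds.std(dim=0)    # predictive mean and uncertainty
```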

[563] CORDS: Continuous Representations of Discrete Structures

Tin Hadži Veljković, Erik Bekkers, Michael Tiemann, Jan-Willem van de Meent

Main category: cs.LG

TL;DR: CORDS: A method for predicting variable-sized sets by mapping discrete objects to continuous fields (density and feature fields) that are invertible, enabling models to operate in continuous space while remaining decodable to discrete sets.

DetailsMotivation: Many learning problems require predicting sets of objects with unknown sizes (object detection, molecular modeling, astrophysical source detection). Existing methods use padded representations or explicit size inference, which pose challenges.

Method: CORDS provides an invertible mapping that transforms sets of spatial objects into continuous fields: a density field encoding object locations and count, and a feature field carrying attributes over the same support. Models operate in field space while remaining exactly decodable to discrete sets.

Result: Evaluated across molecular generation/regression, object detection, simulation-based inference, and mathematical task of recovering local maxima. Demonstrates robust handling of unknown set sizes with competitive accuracy.

Conclusion: CORDS offers a novel strategy for variable-sized set prediction by casting it as a continuous inference problem with invertible field representations, addressing limitations of existing methods.

Abstract: Many learning problems require predicting sets of objects when the number of objects is not known beforehand. Examples include object detection, molecular modeling, and scientific inference tasks such as astrophysical source detection. Existing methods often rely on padded representations or must explicitly infer the set size, which often poses challenges. We present a novel strategy for addressing this challenge by casting prediction of variable-sized sets as a continuous inference problem. Our approach, CORDS (Continuous Representations of Discrete Structures), provides an invertible mapping that transforms a set of spatial objects into continuous fields: a density field that encodes object locations and count, and a feature field that carries their attributes over the same support. Because the mapping is invertible, models operate entirely in field space while remaining exactly decodable to discrete sets. We evaluate CORDS across molecular generation and regression, object detection, simulation-based inference, and a mathematical task involving recovery of local maxima, demonstrating robust handling of unknown set sizes with competitive accuracy.
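
A minimal sketch of the field encoding for one-dimensional point sets, using Gaussian bumps for the density field and kernel-weighted attributes for the feature field; decoding recovers the set from local maxima of the density. The paper's exact kernels, invertibility construction, and higher-dimensional handling are not reproduced here.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 512)
sigma = 0.01

def encode(locations, features):
    """Set of (location, feature) pairs -> (density field, feature field) on the grid."""
    k = np.exp(-0.5 * ((grid[None, :] - locations[:, None]) / sigma) ** 2)
    density = k.sum(axis=0)                                    # encodes positions and object count
    feature = (k * features[:, None]).sum(axis=0) / (density + 1e-8)
    return density, feature

def decode(density, feature, threshold=0.5):
    """Recover the discrete set from the fields via local maxima of the density."""
    mid = density[1:-1]
    idx = np.where((mid > density[:-2]) & (mid > density[2:]) & (mid > threshold))[0] + 1
    return grid[idx], feature[idx]

locs, feats = np.array([0.2, 0.55, 0.9]), np.array([1.0, -2.0, 3.0])
rec_locs, rec_feats = decode(*encode(locs, feats))   # approximately recovers the input set
```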

[564] Heterogeneity-Aware Knowledge Sharing for Graph Federated Learning

Wentao Yu, Sheng Wan, Shuo Chen, Bo Han, Chen Gong

Main category: cs.LG

TL;DR: FedSSA is a graph federated learning method that addresses both node feature and structural heterogeneity across clients through semantic and structural alignment techniques.

DetailsMotivation: Graph Federated Learning (GFL) faces challenges from heterogeneity in both node features and structural topologies across distributed clients, which hinders effective knowledge sharing and model performance.

Method: Proposes FedSSA with two components: 1) For node feature heterogeneity, uses variational models to infer class-wise node distributions, clusters clients based on distributions, and minimizes divergence between local and cluster-level distributions. 2) For structural heterogeneity, employs spectral GNNs with spectral energy measures to characterize structural information, clusters clients based on spectral energy, and aligns spectral characteristics of local and cluster-level GNNs.

Result: Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings show FedSSA consistently outperforms eleven state-of-the-art methods.

Conclusion: FedSSA effectively addresses both semantic and structural heterogeneity in graph federated learning through distribution-based clustering and spectral alignment, achieving superior performance across diverse graph datasets.

Abstract: Graph Federated Learning (GFL) enables distributed graph representation learning while protecting the privacy of graph data. However, GFL suffers from heterogeneity arising from diverse node features and structural topologies across multiple clients. To address both types of heterogeneity, we propose a novel graph Federated learning method via Semantic and Structural Alignment (FedSSA), which shares the knowledge of both node features and structural topologies. For node feature heterogeneity, we propose a novel variational model to infer class-wise node distributions, so that we can cluster clients based on inferred distributions and construct cluster-level representative distributions. We then minimize the divergence between local and cluster-level distributions to facilitate semantic knowledge sharing. For structural heterogeneity, we employ spectral Graph Neural Networks (GNNs) and propose a spectral energy measure to characterize structural information, so that we can cluster clients based on spectral energy and build cluster-level spectral GNNs. We then align the spectral characteristics of local spectral GNNs with those of cluster-level spectral GNNs to enable structural knowledge sharing. Experiments on six homophilic and five heterophilic graph datasets under both non-overlapping and overlapping partitioning settings demonstrate that FedSSA consistently outperforms eleven state-of-the-art methods.

[565] Scalable Power Sampling: Unlocking Efficient, Training-Free Reasoning for LLMs via Distribution Sharpening

Xiaotong Ji, Rasul Tutunov, Matthieu Zimmer, Haitham Bou Ammar

Main category: cs.LG

TL;DR: A training-free method for improving LLM reasoning by approximating power distributions without MCMC sampling, achieving performance comparable to RL post-training with 10x lower latency.

DetailsMotivation: RL post-training improves LLM reasoning but primarily works through distribution sharpening rather than new capabilities. MCMC sampling from power distributions can achieve similar results but is computationally expensive, creating a need for more efficient alternatives.

Method: Derives theoretical formulation showing global power distribution can be approximated by token-level scaled low-temperature distributions. Introduces training-free, verifier-free algorithm that sharpens base model’s generative distribution autoregressively without iterative MCMC.

Result: Method matches or surpasses one-shot GRPO on math, QA, and code tasks across four LLMs without external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.

Conclusion: Provides efficient alternative to RL post-training and MCMC sampling for improving LLM reasoning through distribution sharpening, with practical advantages in computational efficiency.

Abstract: Reinforcement learning (RL) post-training is a dominant approach for improving the reasoning performance of large language models (LLMs), yet growing evidence suggests that its gains arise primarily from distribution sharpening rather than the acquisition of new capabilities. Recent work has shown that sampling from the power distribution of LLMs using Markov chain Monte Carlo (MCMC) can recover performance comparable to RL post-training without relying on external rewards; however, the high computational cost of MCMC makes such approaches impractical for widespread adoption. In this work, we propose a theoretically grounded alternative that eliminates the need for iterative MCMC. We derive a novel formulation showing that the global power distribution can be approximated by a token-level scaled low-temperature one, where the scaling factor captures future trajectory quality. Leveraging this insight, we introduce a training-free and verifier-free algorithm that sharpens the base model’s generative distribution autoregressively. Empirically, we evaluate our method on math, QA, and code tasks across four LLMs, and show that our method matches or surpasses one-shot GRPO without relying on any external rewards, while reducing inference latency by over 10x compared to MCMC-based sampling.
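
A minimal sketch of the token-level sharpening used to approximate the global power distribution, with two simplifications: the base model is a stand-in function, and the per-step scaling factor that captures future-trajectory quality is replaced by a constant exponent.

```python
import torch

def next_token_logits(prefix, vocab=1000):
    """Stand-in for a base LLM's next-token logits (hypothetical)."""
    torch.manual_seed(len(prefix))
    return torch.randn(vocab)

def sharpened_sample(prompt_tokens, alpha=4.0, max_new_tokens=32):
    """Autoregressive sampling from a token-level power-like distribution:
    p_alpha(token) ∝ p(token)^alpha, i.e. logits scaled by alpha (temperature 1/alpha)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = torch.softmax(alpha * next_token_logits(tokens), dim=-1)
        tokens.append(int(torch.multinomial(probs, 1)))
    return tokens

print(sharpened_sample([1, 2, 3]))
```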

[566] Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh

Main category: cs.LG

TL;DR: The paper investigates geometric limits of Linear Propagation Assumption in neural networks, showing that while negation and converse operations can be handled with tensor factorization, composition faces fundamental obstruction due to bilinearity incompatibility with negation.

DetailsMotivation: To understand whether first-order parameter updates in neural networks preserve logical coherence, specifically examining the Linear Propagation Assumption that local updates coherently propagate to logical consequences.

Method: Uses relation algebra to study three core operations on relations: negation, converse, and composition. Proves theoretical results about tensor factorization requirements for negation/converse and fundamental obstruction for composition due to bilinearity incompatibility.

Result: For negation and converse, direction-agnostic first-order propagation requires tensor factorization separating entity-pair context from relation content. For composition, shows it reduces to conjunction and proves any conjunction on linear features must be bilinear, which is incompatible with negation, forcing feature map collapse.

Conclusion: Failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the Linear Propagation Assumption, suggesting fundamental constraints in how neural networks handle logical operations.

Abstract: Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

[567] Beyond Parameter Finetuning: Test-Time Representation Refinement for Node Classification

Jiaxin Zhang, Yiqi Wang, Siwei Wang, Xihong Yang, Yu Shi, Xinwang Liu, En Zhu

Main category: cs.LG

TL;DR: TTReFT: A novel test-time representation fine-tuning framework for graph neural networks that adapts latent representations instead of model parameters to address catastrophic forgetting in out-of-distribution scenarios.

DetailsMotivation: Graph Neural Networks suffer performance degradation in out-of-distribution test scenarios, and existing test-time training approaches using parameter fine-tuning suffer from catastrophic forgetting, limiting real-world applicability.

Method: Three key innovations: (1) uncertainty-guided node selection for targeted interventions, (2) low-rank representation interventions to preserve pre-trained knowledge, and (3) intervention-aware masked autoencoder that dynamically adjusts masking strategy based on node selection.

Result: Extensive experiments across five benchmark datasets demonstrate consistent and superior performance compared to existing approaches, with theoretical guarantees established for out-of-distribution settings.

Conclusion: TTReFT establishes representation fine-tuning as a new paradigm for graph test-time training, offering both theoretical grounding and immediate practical utility for real-world deployment.

Abstract: Graph Neural Networks frequently exhibit significant performance degradation in out-of-distribution test scenarios. While test-time training (TTT) offers a promising solution, the existing Parameter Finetuning (PaFT) paradigm suffers from catastrophic forgetting, hindering its real-world applicability. We propose TTReFT, a novel Test-Time Representation FineTuning framework that transitions the adaptation target from model parameters to latent representations. Specifically, TTReFT achieves this through three key innovations: (1) uncertainty-guided node selection for specific interventions, (2) low-rank representation interventions that preserve pre-trained knowledge, and (3) an intervention-aware masked autoencoder that dynamically adjusts its masking strategy to accommodate the node selection scheme. Theoretically, we establish guarantees for TTReFT in OOD settings. Empirically, extensive experiments across five benchmark datasets demonstrate that TTReFT achieves consistent and superior performance. Our work establishes representation finetuning as a new paradigm for graph TTT, offering both theoretical grounding and immediate practical utility for real-world deployment.
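
A minimal sketch of the representation-level adaptation: the most uncertain nodes (by prediction entropy) receive a low-rank additive intervention on their frozen embeddings, and predictions are read off the refined representations. The encoder, head, and especially the intervention-aware masked-autoencoder objective that would train `U` and `V` at test time are omitted; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

d, rank, n_nodes, n_classes = 64, 4, 500, 7
H = torch.randn(n_nodes, d)                  # frozen node representations from a pre-trained GNN
classifier = nn.Linear(d, n_classes)         # frozen prediction head (hypothetical)

U = nn.Parameter(torch.zeros(d, rank))       # low-rank intervention, a no-op at initialization
V = nn.Parameter(torch.randn(rank, d) * 0.01)

def refine(H, top_frac=0.2):
    """Apply the low-rank intervention only to the most uncertain nodes."""
    with torch.no_grad():
        probs = torch.softmax(classifier(H), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
        selected = entropy.topk(int(top_frac * H.shape[0])).indices
    H_new = H.clone()
    H_new[selected] = H[selected] + H[selected] @ V.t() @ U.t()   # h + U V h, applied row-wise
    return H_new, selected

H_refined, selected = refine(H)
logits = classifier(H_refined)               # predictions from the refined representations
```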

[568] LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models

Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen

Main category: cs.LG

TL;DR: Mixed-precision adaptive recomputation strategy for transformer inference that selectively computes critical components with higher precision while keeping most computations at lower precision, achieving up to 100x accuracy improvements with minimal recomputation overhead.

DetailsMotivation: To enable efficient local deployment of large language models through mixed-precision computations while maintaining accuracy, specifically addressing floating-point computation errors in compositionally-rich functions like those in transformer architectures.

Method: Developed an adaptive strategy based on rounding error analysis of function compositions f(g(x)) that identifies and selectively recomputes a small subset of critical components of g(x) with higher precision, while allowing most computations to use lower precision. Applied this strategy to different compositions within transformer architectures.

Result: Demonstrated on GPT-2 models that very low recomputation rates (small subset of components) can achieve improvements of up to two orders of magnitude (100x) in accuracy while maintaining computational efficiency.

Conclusion: The proposed mixed-precision adaptive recomputation strategy enables significantly more accurate transformer inference with minimal computational overhead, facilitating efficient local deployment of large language models.

Abstract: Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(\mathrm{x}))$, we provide an adaptive strategy that selects a small subset of components of $g(\mathrm{x})$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of magnitude in accuracy.
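
A minimal sketch of the selective recomputation idea for a composition f(g(x)): evaluate g once in low precision, rank its components by a first-order sensitivity proxy, and recompute only the most sensitive ones in high precision. The proxy used here (|∂f/∂z| times |z|) is an assumption; the paper derives its selection rule from a rounding-error analysis of the composition.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((256, 256))
W2 = rng.standard_normal((256, 10))
x = rng.standard_normal(256)

def g(x, dtype=np.float64):                     # inner block, evaluated at a chosen precision
    return np.tanh(x.astype(dtype) @ W1.astype(dtype)).astype(np.float64)

def f(z):                                       # outer block, kept in high precision here
    return z @ W2

z_lo, z_hi = g(x, np.float16), g(x)             # low-precision pass vs. reference

# Sensitivity proxy per component of g(x): |df/dz_j| * |z_j|, from the cheap pass.
sens = np.abs(W2).sum(axis=1) * np.abs(z_lo)
top = np.argsort(sens)[-16:]                    # recompute only the most sensitive components

z_mixed = z_lo.copy()
z_mixed[top] = z_hi[top]                        # selective high-precision recomputation

print(np.abs(f(z_lo) - f(z_hi)).max(),          # error of the all-low-precision pass
      np.abs(f(z_mixed) - f(z_hi)).max())       # usually smaller after recomputing 16 of 256 entries
```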

[569] Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps

Vasileios Sevetlidis, George Pavlidis

Main category: cs.LG

TL;DR: Survey paper analyzing how training history (optimizer states, data ordering, augmentation policies, etc.) affects deep learning outcomes, with tools for causal measurement and reporting protocols.

DetailsMotivation: Deep learning training is not memoryless - many factors like optimizer moments, data ordering, augmentation policies, and auxiliary states influence outcomes, but current practices lack systematic measurement and reporting of these effects.

Method: Organizes training mechanisms by source, lifetime, and visibility; introduces seed-paired function-space causal estimands; develops portable perturbation primitives for momentum/Adam/EMA/BatchNorm; creates reporting checklist with audit artifacts like order hashes and RNG contracts.

Result: Provides a comprehensive framework for understanding training history effects, with practical tools for causal measurement and standardized reporting of training artifacts.

Conclusion: Proposes a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across different models, data, and training regimes.

Abstract: Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs with-replacement, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.

[570] HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

Main category: cs.LG

TL;DR: HeRo-Q is a post-training quantization method that addresses the 'low error, high loss' paradox by applying a learnable rotation-compression matrix to reshape the loss landscape, reducing Hessian eigenvalues to enhance robustness to quantization noise.

DetailsMotivation: Traditional PTQ methods focus on minimizing quantization error but suffer from the 'low error, high loss' paradox due to the sensitivity of high-curvature directions in the loss landscape to perturbations. The Hessian matrix reveals that a few directions are extremely sensitive to quantization noise.

Method: Proposes the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space before quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, enhancing robustness to quantization noise without architectural changes.

Result: HeRo-Q consistently outperforms state-of-the-art methods (GPTQ, AWQ, SpinQuant) on Llama and Qwen models. Achieves superior performance in standard W4A8 settings and excels in challenging W3A16 ultra-low bit regime, boosting GSM8K accuracy on Llama3 8B to 70.15% while avoiding logical collapse seen in aggressive quantization.

Conclusion: HeRo-Q provides an effective solution to the PTQ paradox by addressing Hessian sensitivity, requires no architectural modifications, has negligible computational overhead, and integrates seamlessly into existing PTQ pipelines for robust quantization across various bit-widths.

Abstract: Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
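
A minimal sketch of the rotate-before-quantize mechanism that rotation-based PTQ methods share: an orthogonal matrix is applied to the weights before uniform quantization and folded back afterwards, preserving the layer's function while reshaping how quantization noise lands on sensitive directions. A random orthogonal matrix stands in for the learned rotation-compression matrix, and the Hessian-eigenvalue objective that HeRo-Q optimizes is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits=3):
    """Symmetric per-tensor uniform quantization."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

W = rng.standard_normal((512, 512))
W[0] *= 10.0                                  # an outlier row, mimicking a sensitive direction
x = rng.standard_normal((64, 512))

err_plain = np.linalg.norm(x @ quantize(W) - x @ W)

# Rotate, quantize, rotate back: W ≈ R @ quantize(R.T @ W), with R orthogonal.
R, _ = np.linalg.qr(rng.standard_normal((512, 512)))
err_rot = np.linalg.norm(x @ (R @ quantize(R.T @ W)) - x @ W)

print(err_plain, err_rot)                     # the rotated version typically has lower error
```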

[571] Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Jan Schuchardt, Nikita Kalinin

Main category: cs.LG

TL;DR: The paper develops sampling-free privacy amplification bounds for differentially private matrix factorization under random allocation, using Rényi divergence and conditional composition to overcome limitations of existing Monte Carlo approaches.

DetailsMotivation: Existing approaches for privacy amplification in differentially private model training with matrix factorization under random allocation have limitations: they either provide probabilistic guarantees, require random abstention, or need many samples inversely proportional to δ. The authors aim to develop more reliable sampling-free bounds.

Method: The authors develop two complementary approaches: 1) Rényi divergence-based bounds computed efficiently via dynamic programming, and 2) conditional composition for stronger privacy guarantees with small ε where Rényi divergence leads to over-approximation. The framework works for both banded and non-banded matrices.

Result: The proposed sampling-free bounds outperform existing Monte Carlo approaches, providing stronger privacy guarantees without requiring random abstention or large sample sizes. Numerical comparisons demonstrate efficacy across various matrix mechanisms used in research and practice.

Conclusion: The paper presents a robust framework for privacy amplification in differentially private matrix factorization that overcomes limitations of existing sampling-based methods, offering reliable bounds for both banded and non-banded matrices.

Abstract: We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.

[572] Generative Design of Ship Propellers using Conditional Flow Matching

Patrick Kruger, Rafael Diaz, Simon Hauschulz, Stefan Harries, Hanno Gottschalk

Main category: cs.LG

TL;DR: GenAI for ship propeller design using conditional flow matching to generate multiple valid designs from performance targets, with data augmentation via pseudo-labels from forward surrogate models.

DetailsMotivation: Traditional forward ML models predict performance from designs, but GenAI can generate designs from performance targets, enabling multiple valid solutions for engineering design problems.

Method: Conditional flow matching establishes bidirectional mapping between design parameters and simulated noise conditioned on performance labels; uses vortex lattice method for simulation data; proposes data augmentation with pseudo-labels from forward surrogate models.

Result: Demonstrates generation of distinct propeller geometries with nearly identical performance characteristics; analyzes trade-off between model accuracy and data availability; shows pseudo-label augmentation can improve performance.

Conclusion: GenAI with conditional flow matching shows versatility and potential for engineering design by enabling multiple valid design solutions from performance targets.

Abstract: In this paper, we explore the use of generative artificial intelligence (GenAI) for ship propeller design. While traditional forward machine learning models predict the performance of mechanical components based on given design parameters, GenAI models aim to generate designs that achieve specified performance targets. In particular, we employ conditional flow matching to establish a bidirectional mapping between design parameters and simulated noise that is conditioned on performance labels. This approach enables the generation of multiple valid designs corresponding to the same performance targets by sampling over the noise vector. To support model training, we generate data using a vortex lattice method for numerical simulation and analyze the trade-off between model accuracy and the amount of available data. We further propose data augmentation using pseudo-labels derived from less data-intensive forward surrogate models, which can often improve overall model performance. Finally, we present examples of distinct propeller geometries that exhibit nearly identical performance characteristics, illustrating the versatility and potential of GenAI in engineering design.
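
A minimal sketch of a conditional flow matching training step with a linear probability path, where the velocity network is conditioned on performance labels, plus the corresponding Euler-integration sampler that generates designs for a target label. Dimensions, the network, and the label content (e.g., thrust, torque, efficiency) are placeholders.

```python
import torch
import torch.nn as nn

d_design, d_label = 32, 3                      # placeholder geometry parameters and performance labels

velocity_net = nn.Sequential(                  # v_theta(x_t, t, label)
    nn.Linear(d_design + 1 + d_label, 128), nn.SiLU(), nn.Linear(128, d_design))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def cfm_step(designs, labels):
    """Regress the constant velocity of a linear noise -> design path, conditioned on labels."""
    x1, x0 = designs, torch.randn_like(designs)
    t = torch.rand(designs.shape[0], 1)
    xt = (1 - t) * x0 + t * x1
    loss = ((velocity_net(torch.cat([xt, t, labels], dim=-1)) - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return float(loss)

def generate(labels, steps=50):
    """Sample designs for target performance by integrating dx/dt = v_theta from noise."""
    with torch.no_grad():
        x = torch.randn(labels.shape[0], d_design)
        for i in range(steps):
            t = torch.full((labels.shape[0], 1), i / steps)
            x = x + velocity_net(torch.cat([x, t, labels], dim=-1)) / steps
    return x

cfm_step(torch.randn(64, d_design), torch.randn(64, d_label))
designs = generate(torch.randn(5, d_label))    # several candidate designs for the same targets
```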

[573] Seg-MoE: Multi-Resolution Segment-wise Mixture-of-Experts for Time Series Forecasting Transformers

Evandro S. Ortigossa, Eran Segal

Main category: cs.LG

TL;DR: Seg-MoE: A sparse mixture-of-experts architecture for time-series forecasting that routes contiguous time-step segments rather than individual tokens, better capturing temporal locality and continuity.

DetailsMotivation: Existing MoE approaches for time-series forecasting use token-wise routing which fails to exploit the natural locality and continuity of temporal data. There's a need for architectures that align routing granularity with the inherent structure of time series.

Method: Introduces Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Each expert models intra-segment interactions directly, aligning with inherent temporal patterns. Integrated into a time-series Transformer architecture.

Result: Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons on multiple multivariate long-term forecasting benchmarks, outperforming both dense Transformers and prior token-wise MoE models.

Conclusion: Segment-level routing is the key factor driving performance gains. Aligning MoE routing granularity with the inherent structure of time series provides a powerful inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.

Abstract: Transformer-based models have recently made significant advances in accurate time-series forecasting, but even these architectures struggle to scale efficiently while capturing long-term temporal dynamics. Mixture-of-Experts (MoE) layers are a proven solution to scaling problems in natural language processing. However, existing MoE approaches for time-series forecasting rely on token-wise routing mechanisms, which may fail to exploit the natural locality and continuity of temporal data. In this work, we introduce Seg-MoE, a sparse MoE design that routes and processes contiguous time-step segments rather than making independent expert decisions. Token segments allow each expert to model intra-segment interactions directly, naturally aligning with inherent temporal patterns. We integrate Seg-MoE layers into a time-series Transformer and evaluate it on multiple multivariate long-term forecasting benchmarks. Seg-MoE consistently achieves state-of-the-art forecasting accuracy across almost all prediction horizons, outperforming both dense Transformers and prior token-wise MoE models. Comprehensive ablation studies confirm that segment-level routing is the key factor driving these gains. Our results show that aligning the MoE routing granularity with the inherent structure of time series provides a powerful, yet previously underexplored, inductive bias, opening new avenues for conditionally sparse architectures in sequential data modeling.
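
A minimal sketch of segment-wise routing: the sequence is split into contiguous segments, each segment is routed as a unit from a pooled summary, and the selected expert processes the whole segment so intra-segment interactions stay inside one expert. Top-1 routing, mean-pooled summaries, and MLP experts are simplifications of the paper's design.

```python
import torch
import torch.nn as nn

class SegMoE(nn.Module):
    def __init__(self, d_model, n_experts, seg_len):
        super().__init__()
        self.seg_len = seg_len
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])

    def forward(self, x):                       # x: (batch, seq_len, d_model), seq_len % seg_len == 0
        b, t, d = x.shape
        segs = x.view(b, t // self.seg_len, self.seg_len, d)
        scores = self.router(segs.mean(dim=2))                  # route on a pooled segment summary
        expert_idx = scores.argmax(dim=-1)                      # top-1 expert per segment
        out = torch.zeros_like(segs)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = expert(segs[mask])                  # the expert sees the whole segment
        return out.view(b, t, d)

moe = SegMoE(d_model=64, n_experts=4, seg_len=8)
y = moe(torch.randn(2, 96, 64))                                 # (2, 96, 64)
```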

[574] Identifiable Equivariant Networks are Layerwise Equivariant

Vahid Shahverdi, Giovanni Luca Marchetti, Georg Bökman, Kathlén Kohn

Main category: cs.LG

TL;DR: Theoretical proof that end-to-end equivariant neural networks can be parameterized with layerwise equivariant structures, explaining the emergence of equivariant weight patterns observed in practice.

DetailsMotivation: To mathematically explain why neural networks trained for equivariant tasks develop equivariant structures in their weights, a phenomenon consistently observed in practice but not formally understood.

Method: Develops an abstract, architecture-agnostic formalism to prove that for networks with identifiable parameters, end-to-end equivariance implies existence of parameter choices yielding layerwise equivariance with respect to group actions on latent spaces.

Result: Proves that networks with end-to-end equivariance can be parameterized to have layerwise equivariance, assuming parameter identifiability (established for many network classes, conjectural for others).

Conclusion: Provides mathematical foundation explaining emergence of equivariant structures in neural network weights during training, bridging theory and empirical observations.

Abstract: We investigate the relation between end-to-end equivariance and layerwise equivariance in deep neural networks. We prove the following: For a network whose end-to-end function is equivariant with respect to group actions on the input and output spaces, there is a parameter choice yielding the same end-to-end function such that its layers are equivariant with respect to some group actions on the latent spaces. Our result assumes that the parameters of the model are identifiable in an appropriate sense. This identifiability property has been established in the literature for a large class of networks, to which our results apply immediately, while it is conjectural for others. The theory we develop is grounded in an abstract formalism, and is therefore architecture-agnostic. Overall, our results provide a mathematical explanation for the emergence of equivariant structures in the weights of neural networks during training – a phenomenon that is consistently observed in practice.

[575] Gauge-invariant representation holonomy

Vasileios Sevetlidis, George Pavlidis

Main category: cs.LG

TL;DR: Representation holonomy measures geometric curvature in neural network feature spaces by quantifying path dependence of representations, revealing hidden structure missed by pointwise similarity metrics like CKA.

DetailsMotivation: Existing similarity measures (CKA, SVCCA) only capture pointwise overlap between activation sets but miss how representations change along input paths, failing to detect differences in model robustness to perturbations despite similar pointwise metrics.

Method: Introduces representation holonomy, a gauge-invariant statistic that measures path dependence by quantifying “twist” accumulated when features are parallel-transported around small loops in input space. Uses global whitening for gauge fixing, aligns neighborhoods with shared subspaces and rotation-only Procrustes, then embeds back to full feature space.

Result: Holonomy increases with loop radius, separates models that appear similar under CKA, correlates with adversarial and corruption robustness, and tracks training dynamics as features form and stabilize. Proves invariance to orthogonal/affine transformations and establishes linear null for affine layers.

Conclusion: Representation holonomy serves as a practical, scalable diagnostic for probing geometric structure of learned representations beyond pointwise similarity, revealing hidden curvature that affects model robustness.

Abstract: Deep networks learn internal representations whose geometry–how features bend, rotate, and evolve–affects both generalization and robustness. Existing similarity measures such as CKA or SVCCA capture pointwise overlap between activation sets, but miss how representations change along input paths. Two models may appear nearly identical under these metrics yet respond very differently to perturbations or adversarial stress. We introduce representation holonomy, a gauge-invariant statistic that measures this path dependence. Conceptually, holonomy quantifies the “twist” accumulated when features are parallel-transported around a small loop in input space: flat representations yield zero holonomy, while nonzero values reveal hidden curvature. Our estimator fixes gauge through global whitening, aligns neighborhoods using shared subspaces and rotation-only Procrustes, and embeds the result back to the full feature space. We prove invariance to orthogonal (and affine, post-whitening) transformations, establish a linear null for affine layers, and show that holonomy vanishes at small radii. Empirically, holonomy increases with loop radius, separates models that appear similar under CKA, and correlates with adversarial and corruption robustness. It also tracks training dynamics as features form and stabilize. Together, these results position representation holonomy as a practical and scalable diagnostic for probing the geometric structure of learned representations beyond pointwise similarity.
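
A minimal sketch of the loop construction: take feature matrices on small input neighborhoods along a closed loop, whiten them (gauge fixing), align consecutive neighborhoods with rotation-only Procrustes, and read holonomy as the deviation of the accumulated rotation from the identity. The feature map here is a random two-layer stand-in, and the shared-subspace step of the paper's estimator is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((2, 32))
W2 = rng.standard_normal((32, 16))

def features(points):
    """Stand-in for a network's feature map at a batch of 2-D inputs (hypothetical)."""
    return np.tanh(points @ W1) @ W2                 # (n_points, 16)

def neighborhood(center, radius=0.05, n=64):
    angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return center + radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def whiten(F):
    F = F - F.mean(axis=0)
    U, _, Vt = np.linalg.svd(F, full_matrices=False)
    return U @ Vt                                    # gauge fixing: orthonormalized directions

def procrustes_rotation(A, B):
    """Rotation-only Procrustes: the orthogonal R minimizing ||A R - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def holonomy(loop_centers):
    """Accumulate rotations between consecutive whitened neighborhoods around a closed loop."""
    mats = [whiten(features(neighborhood(c))) for c in loop_centers]
    R = np.eye(mats[0].shape[1])
    for i in range(len(mats)):
        R = R @ procrustes_rotation(mats[i], mats[(i + 1) % len(mats)])
    return np.linalg.norm(R - np.eye(R.shape[0]))    # 0 for a flat representation

loop = neighborhood(np.array([0.0, 0.0]), radius=0.5, n=16)   # loop of centers in input space
print(holonomy(loop))
```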

[576] TabClustPFN: A Prior-Fitted Network for Tabular Data Clustering

Tianqi Zhao, Guanyang Wang, Yan Shuo Tan, Qiong Zhang

Main category: cs.LG

TL;DR: TabClustPFN is a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over cluster assignments and cluster cardinality, trained on synthetic data and applied to unseen datasets without retuning.

DetailsMotivation: Clustering tabular data is challenging due to heterogeneous feature types, diverse data-generating mechanisms, and lack of transferable inductive biases across datasets. Prior-fitted networks have shown strong generalization in supervised tabular learning, but extending this to clustering is nontrivial due to its unsupervised nature, combinatorial output space, and need to infer cluster numbers.

Method: TabClustPFN is a prior-fitted network pretrained on synthetic datasets drawn from a flexible clustering prior. It performs amortized Bayesian inference over both cluster assignments and cluster cardinality, handling heterogeneous numerical and categorical features. The model clusters unseen datasets in a single forward pass without dataset-specific retraining or hyperparameter tuning.

Result: Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings.

Conclusion: TabClustPFN demonstrates effective amortized Bayesian inference for tabular clustering, providing strong generalization across diverse datasets without requiring dataset-specific tuning, making it suitable for exploratory data analysis.

Abstract: Clustering tabular data is a fundamental yet challenging problem due to heterogeneous feature types, diverse data-generating mechanisms, and the absence of transferable inductive biases across datasets. Prior-fitted networks (PFNs) have recently demonstrated strong generalization in supervised tabular learning by amortizing Bayesian inference under a broad synthetic prior. Extending this paradigm to clustering is nontrivial: clustering is unsupervised, admits a combinatorial and permutation-invariant output space, and requires inferring the number of clusters. We introduce TabClustPFN, a prior-fitted network for tabular data clustering that performs amortized Bayesian inference over both cluster assignments and cluster cardinality. Pretrained on synthetic datasets drawn from a flexible clustering prior, TabClustPFN clusters unseen datasets in a single forward pass, without dataset-specific retraining or hyperparameter tuning. The model naturally handles heterogeneous numerical and categorical features and adapts to a wide range of clustering structures. Experiments on synthetic data and curated real-world tabular benchmarks show that TabClustPFN outperforms classical, deep, and amortized clustering baselines, while exhibiting strong robustness in out-of-the-box exploratory settings. Code is available at https://github.com/Tianqi-Zhao/TabClustPFN.

[577] Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh

Main category: cs.LG

TL;DR: REPVLM: A method for quantifying epistemic uncertainty in Vision-Language Models using Riemannian Flow Matching on hyperspherical embeddings

DetailsMotivation: Vision-Language Models lack intrinsic mechanisms to quantify epistemic uncertainty (model's lack of knowledge), which is crucial for reliable deployment in real-world applications.

Method: Proposes REPVLM which uses negative log-density of embeddings as proxy for epistemic uncertainty, computing probability density on hyperspherical manifold of VLM embeddings using Riemannian Flow Matching.

Result: Achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Also provides scalable metric for out-of-distribution detection and automated data curation.

Conclusion: REPVLM effectively quantifies epistemic uncertainty in VLMs, enabling more reliable deployment and additional applications like OOD detection and data curation.

Abstract: Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model’s lack of knowledge or ignorance of its own representations. We theoretically motivate negative log-density of an embedding as a proxy for the epistemic uncertainty, where low-density regions signify model ignorance. The proposed method REPVLM computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

[578] SENDAI: A Hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework

Xingyue Zhang, Yuxuan Bao, Mars Liyao Gao, J. Nathan Kutz

Main category: cs.LG

TL;DR: SENDAI: A hierarchical sparse-measurement framework for reconstructing spatial fields from hyper-sparse sensor observations using simulation priors and learned discrepancy corrections, demonstrated on MODIS vegetation index reconstruction.

DetailsMotivation: Bridging the gap between data-rich training and observation-sparse deployment in spatiotemporal field reconstruction, especially when target domains have distributional shifts, heterogeneous structure, and multi-scale dynamics not present in training data.

Method: Hierarchical framework combining simulation-derived priors with learned discrepancy corrections to reconstruct full spatial states from hyper-sparse sensor observations. Applied to satellite remote sensing for MODIS vegetation index reconstruction across six global sites.

Result: Outperforms established baselines requiring denser observations: 185% SSIM improvement over traditional baselines and 36% improvement over recent high-frequency methods. Particularly effective for landscapes with sharp boundaries and sub-seasonal dynamics, preserving diagnostically relevant structures like field topologies and spatial gradients.

Conclusion: SENDAI provides a lightweight, operationally viable framework for sparse-measurement reconstruction applicable to physically grounded inference, resource-limited deployment, and real-time monitoring and control.

Abstract: Bridging the gap between data-rich training regimes and observation-sparse deployment conditions remains a central challenge in spatiotemporal field reconstruction, particularly when target domains exhibit distributional shifts, heterogeneous structure, and multi-scale dynamics absent from available training data. We present SENDAI, a hierarchical Sparse-measurement, EfficieNt Data AssImilation Framework that reconstructs full spatial states from hyper-sparse sensor observations by combining simulation-derived priors with learned discrepancy corrections. We demonstrate its performance on satellite remote sensing, reconstructing MODIS (Moderate Resolution Imaging Spectroradiometer)-derived vegetation index fields across six globally distributed sites. Using seasonal periods as a proxy for domain shift, the framework consistently outperforms established baselines that require substantially denser observations: SENDAI achieves a maximum SSIM improvement of 185% over traditional baselines and a 36% improvement over recent high-frequency-based methods. These gains are particularly pronounced for landscapes with sharp boundaries and sub-seasonal dynamics; more importantly, the framework effectively preserves diagnostically relevant structures such as field topologies, land cover discontinuities, and spatial gradients. By yielding corrections that are more structurally and spectrally separable, the reconstructed fields are better suited for downstream inference of indirectly observed variables. The results therefore highlight a lightweight and operationally viable framework for sparse-measurement reconstruction that is applicable to physically grounded inference, resource-limited deployment, and real-time monitoring and control.

[579] Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling

Abhijeet Sinha, Sundari Elango, Dianbo Liu

Main category: cs.LG

TL;DR: The paper identifies that outcome-level mode collapse in RL is inherent to the expected-return objective itself, not just exploration issues, and proposes inverse probability scaling to create reward-proportional terminal distributions that prevent collapse in multimodal settings.

DetailsMotivation: Standard RL policies trained with expected return maximization collapse onto small subsets of outcomes even when multiple high-quality solutions exist. The paper challenges the conventional explanation that this is due to insufficient exploration or weak regularization, arguing instead that it's a structural flaw in the objective itself.

Method: The authors show mathematically that under idealized learning dynamics, the log-probability ratio between outcomes evolves linearly in their reward difference, causing exponential divergence and inevitable collapse. They propose inverse probability scaling (IPS) to remove outcome-frequency amplification from the learning signal, instantiated in Group Relative Policy Optimization (GRPO) as IPS-GRPO, a drop-in modification requiring no auxiliary models or architectural changes.

Result: Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance. The method demonstrates that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.

Conclusion: Mode collapse in multimodal RL is a fundamental consequence of the expected-return objective, not just an exploration issue. Inverse probability scaling provides a principled solution that yields reward-proportional terminal distributions and prevents collapse, offering a more reliable approach to multimodal policy optimization.

Abstract: Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across different reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.
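
A minimal sketch of the inverse probability scaling idea, written as a reweighted GRPO-style surrogate loss; the exact estimator, normalization, and how outcome probabilities are obtained are assumptions here, not the paper's precise formulation.

```python
import torch

def ips_grpo_loss(logps, group_rewards, outcome_probs):
    """logps:         (G,) sum of token log-probs for each sampled completion
    group_rewards: (G,) scalar rewards for the group of completions
    outcome_probs: (G,) model probability of each completion's outcome,
                   used (detached) to undo outcome-frequency amplification."""
    advantages = group_rewards - group_rewards.mean()        # group-relative baseline
    scale = 1.0 / outcome_probs.detach().clamp_min(1e-6)     # inverse probability scaling
    # Standard policy-gradient surrogate, reweighted so that frequently sampled
    # outcomes are not further amplified by their own sampling probability.
    return -(scale * advantages * logps).mean()
```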

[580] XFACTORS: Disentangled Information Bottleneck via Contrastive Supervision

Alexandre Myara, Nicolas Bourriez, Thomas Boyer, Thomas Lemercier, Ihab Bendidi, Auguste Genovesio

Main category: cs.LG

TL;DR: XFactors: A weakly-supervised VAE framework for disentangling and controlling specific factors using contrastive supervision and KL regularization without adversarial training.

DetailsMotivation: Current disentangled representation learning approaches have limitations - unsupervised methods fail on real data without strong biases, while supervised methods are unstable and don't scale well due to adversarial objectives or auxiliary classifiers.

Method: Proposes a weakly-supervised VAE framework that decomposes representation into factor-specific subspaces and a residual subspace. Uses contrastive InfoNCE loss to encode target factors in assigned subspaces (pulling together same-factor latents, pushing apart mismatched pairs) and KL regularization to impose Gaussian structure without additional supervision for non-targeted factors.

Result: Achieves state-of-the-art disentanglement scores across multiple datasets with constant hyperparameters, shows consistent qualitative factor alignment, enables controlled factor swapping via latent replacement, scales correctly with increasing latent capacity, and works on real-world CelebA dataset.

Conclusion: XFactors provides an effective weakly-supervised approach for disentangling and controlling specific factors without adversarial training or classifiers, addressing scalability and stability issues of previous methods.

Abstract: Disentangled representation learning aims to map independent factors of variation to independent representation components. On one hand, purely unsupervised approaches have proven successful on fully disentangled synthetic data, but fail to recover semantic factors from real data without strong inductive biases. On the other hand, supervised approaches are unstable and hard to scale to large attribute sets because they rely on adversarial objectives or auxiliary classifiers. We introduce \textsc{XFactors}, a weakly-supervised VAE framework that disentangles and provides explicit control over a chosen set of factors. Building on the Disentangled Information Bottleneck perspective, we decompose the representation into factor-specific subspaces $\mathcal{T}_1,\ldots,\mathcal{T}_K$ and a residual subspace $\mathcal{S}$. Each target factor is encoded in its assigned $\mathcal{T}_i$ through contrastive supervision: an InfoNCE loss pulls together latents sharing the same factor value and pushes apart mismatched pairs. In parallel, KL regularization imposes a Gaussian structure on both $\mathcal{S}$ and the aggregated factor subspaces, organizing the geometry without additional supervision for non-targeted factors and avoiding adversarial training and classifiers. Across multiple datasets, with constant hyperparameters, \textsc{XFactors} achieves state-of-the-art disentanglement scores and yields consistent qualitative factor alignment in the corresponding subspaces, enabling controlled factor swapping via latent replacement. We further demonstrate that our method scales correctly with increasing latent capacity and evaluate it on the real-world dataset CelebA. Our code is available at \href{https://github.com/ICML26-anon/XFactors}{github.com/ICML26-anon/XFactors}.
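
A minimal sketch of the per-factor contrastive term described above, assuming positives and negatives have already been paired up for one factor subspace; the KL term on the residual subspace and the exact batching are omitted.

```python
import torch
import torch.nn.functional as F

def factor_infonce(z, z_pos, z_neg, temperature=0.1):
    """InfoNCE over one factor subspace T_i.
    z:     (B, d)    latents of anchor samples, restricted to T_i
    z_pos: (B, d)    latents of samples sharing the same factor value
    z_neg: (B, K, d) latents of mismatched samples"""
    z, z_pos, z_neg = (F.normalize(t, dim=-1) for t in (z, z_pos, z_neg))
    pos = (z * z_pos).sum(-1, keepdim=True) / temperature        # (B, 1)
    neg = torch.einsum("bd,bkd->bk", z, z_neg) / temperature     # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(z), dtype=torch.long)               # positive sits at index 0
    return F.cross_entropy(logits, labels)
```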

[581] LLM4Fluid: Large Language Models as Generalizable Neural Solvers for Fluid Dynamics

Qisong Xiao, Xinhai Chen, Qinglin Wang, Xiaowei Guo, Binglin Wang, Weifeng Chen, Zhichao Wang, Yunfei Liu, Rui Xia, Hang Zou, Gencheng Liu, Shuai Li, Jie Liu

Main category: cs.LG

TL;DR: LLM4Fluid uses Large Language Models as generalizable neural solvers for fluid dynamics prediction, combining reduced-order modeling with physics-informed disentanglement and modality alignment for stable long-term forecasting.

DetailsMotivation: Existing deep learning approaches for fluid dynamics suffer from limited generalization to unseen flow conditions and typically require retraining for new scenarios, highlighting the need for more generalizable neural solvers.

Method: 1) Compress high-dimensional flow fields into compact latent space via reduced-order modeling with physics-informed disentanglement; 2) Use pretrained LLM as temporal processor with autoregressive prediction; 3) Implement modality alignment strategy to bridge gap between prompts and physical sequences.

Result: LLM4Fluid achieves state-of-the-art accuracy across diverse flow scenarios without retraining, demonstrating robust generalization, powerful zero-shot learning, and in-context learning capabilities.

Conclusion: LLMs can serve as effective generalizable neural solvers for spatio-temporal fluid dynamics prediction when combined with appropriate physics-informed compression and modality alignment techniques.

Abstract: Deep learning has emerged as a promising paradigm for spatio-temporal modeling of fluid dynamics. However, existing approaches often suffer from limited generalization to unseen flow conditions and typically require retraining when applied to new scenarios. In this paper, we present LLM4Fluid, a spatio-temporal prediction framework that leverages Large Language Models (LLMs) as generalizable neural solvers for fluid dynamics. The framework first compresses high-dimensional flow fields into a compact latent space via reduced-order modeling enhanced with a physics-informed disentanglement mechanism, effectively mitigating spatial feature entanglement while preserving essential flow structures. A pretrained LLM then serves as a temporal processor, autoregressively predicting the dynamics of physical sequences with time series prompts. To bridge the modality gap between prompts and physical sequences, which can otherwise degrade prediction accuracy, we propose a dedicated modality alignment strategy that resolves representational mismatch and stabilizes long-term prediction. Extensive experiments across diverse flow scenarios demonstrate that LLM4Fluid functions as a robust and generalizable neural solver without retraining, achieving state-of-the-art accuracy while exhibiting powerful zero-shot and in-context learning capabilities. Code and datasets are publicly available at https://github.com/qisongxiao/LLM4Fluid.

[582] Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Mohamed Elgaar, Hadi Amiri

Main category: cs.LG

TL;DR: Curriculum learning in language models changes data exposure order but not learning trajectory phases; helps smaller models by reducing gradient noise and output-head saturation, with diminishing benefits at larger scales.

DetailsMotivation: To understand whether curriculum learning fundamentally changes the learning trajectory of language models or merely reorders data exposure along a fixed trajectory, and to investigate how different curricula affect optimization stability.

Method: Trained Pythia models (14M-410M parameters) for 300B tokens using three linguistically motivated curricula (Age-of-Acquisition, word frequency, Verb Variation) compared against Random ordering; also compared Random vs VV at 1B parameters. Analyzed training dynamics, gradient noise, output-head spectral saturation, and formalized link between difficulty pacing and optimization stability through gradient-variance control analysis.

Result: Across orderings, training follows a shared sequence of latent phases; curricula mainly change within-phase data exposure. Smaller models (up to 160M) with Random ordering show higher gradient noise and stronger late-training output-head spectral saturation with lower final accuracy; curricula reduce both effects. At larger scales, saturation differences are smaller and curriculum gains shrink.

Conclusion: Curriculum learning helps by stabilizing within-phase optimization rather than by creating new learning phases, with practical benefits more pronounced for smaller models where optimization stability is more critical.

Abstract: Curriculum learning changes the order of pre-training data, but it remains unclear whether it changes the learning trajectory or mainly reorders exposure over a fixed trajectory. We train Pythia models (14M-410M parameters) for 300B tokens under three linguistically motivated curricula: Age-of-Acquisition, word frequency, and Verb Variation (VV). We compare each against Random ordering; at 1B parameters we compare Random and VV. Across orderings, training follows a shared sequence of latent phases, while curricula mainly change within-phase data exposure. In smaller models (up to 160M parameters), Random ordering exhibits higher gradient noise and stronger late-training output-head spectral saturation, alongside lower final accuracy; curricula reduce both effects at matched compute. At larger scales, saturation differences are smaller and curriculum gains shrink. We formalize the link between difficulty pacing and optimization stability in an idealized analysis based on gradient-variance control, and our results point to a practical takeaway: curricula help by stabilizing within-phase optimization rather than by creating new phases.

[583] Can Local Learning Match Self-Supervised Backpropagation?

Wu S. Zihan, Ariane Delrocq, Wulfram Gerstner, Guillaume Bellec

Main category: cs.LG

TL;DR: Local self-supervised learning algorithms can match global backpropagation performance in deep networks through theoretical analysis and novel algorithm variants.

DetailsMotivation: To bridge the gap between global backpropagation-based self-supervised learning (global BP-SSL) and local self-supervised learning (local-SSL) approaches, which have struggled to build functional representations in deep neural networks.

Method: First developed theory for deep linear networks to identify conditions where local-SSL algorithms (Forward-forward, CLAPP) implement exactly the same weight update as global BP-SSL. Then developed novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks.

Result: Variants that improve similarity between local-SSL and global BP-SSL gradient updates show better performance on image datasets (CIFAR-10, STL-10, Tiny ImageNet). Best local-SSL rule with CLAPP loss matches performance of comparable global BP-SSL with InfoNCE or CPC-like loss functions, improving state-of-the-art for local SSL on these benchmarks.

Conclusion: Local self-supervised learning algorithms can be designed to match global backpropagation performance, establishing a theoretical and practical link between these two learning paradigms in deep neural networks.

Abstract: While end-to-end self-supervised learning with backpropagation (global BP-SSL) has become central for training modern AI systems, theories of local self-supervised learning (local-SSL) have struggled to build functional representations in deep neural networks. To establish a link between global and local rules, we first develop a theory for deep linear networks: we identify conditions for local-SSL algorithms (like Forward-forward or CLAPP) to implement exactly the same weight update as a global BP-SSL. Starting from the theoretical insights, we then develop novel variants of local-SSL algorithms to approximate global BP-SSL in deep non-linear convolutional neural networks. Variants that improve the similarity between gradient updates of local-SSL with those of global BP-SSL also show better performance on image datasets (CIFAR-10, STL-10, and Tiny ImageNet). The best local-SSL rule with the CLAPP loss function matches the performance of a comparable global BP-SSL with InfoNCE or CPC-like loss functions, and improves upon state-of-the-art for local SSL on these benchmarks.

[584] Don’t be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello

Main category: cs.LG

TL;DR: StiefAttention: A post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error, enabling flexible layer-wise rank allocation under error budgets.

DetailsMotivation: KV caching becomes a bottleneck in HBM capacity and bandwidth at long contexts. Existing compression methods use SVD-style proxy objectives that may poorly reflect end-to-end reconstruction after full transformer operations.

Method: Learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error (not just KV matrix approximation). Precomputes error-rank profiles for each layer to enable flexible layer-wise rank allocation under user-specified error budgets.

Result: On Llama3-8B, outperforms EigenAttention by 11.9 points on C4 perplexity and 5.4% on 0-shot MMLU accuracy at iso-compression, with lower relative error and higher cosine similarity to original decoder-layer outputs.

Conclusion: StiefAttention provides superior KV-cache compression by directly optimizing for end-to-end decoder output reconstruction rather than proxy objectives, enabling efficient long-context inference with flexible error control.

Abstract: Key–value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
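
A small numpy illustration of the storage mechanics behind rank-r KV compression with an orthonormal basis: only the projected keys are kept in HBM, and the keys are reconstructed at attention time. StiefAttention learns the basis over the Stiefel manifold by minimizing decoder-layer output reconstruction error; here the basis is just a random orthonormal matrix for illustration.

```python
import numpy as np

d_head, rank, seq_len = 128, 32, 1024
K = np.random.randn(seq_len, d_head)                  # cached keys for one attention head

B, _ = np.linalg.qr(np.random.randn(d_head, rank))    # orthonormal columns: B.T @ B = I_r
K_stored = K @ B                                      # (seq_len, rank) kept in HBM
K_approx = K_stored @ B.T                             # (seq_len, d_head) used at attention time

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"compression ratio {rank / d_head:.2f}, relative key error {rel_err:.3f}")
```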

[585] When does predictive inverse dynamics outperform behavior cloning?

Lukas Schäfer, Pallavi Choudhury, Abdelhak Lemkhenter, Chris Lovett, Somjit Nath, Luis França, Matheus Ribeiro Furtado de Mendonça, Alex Lamb, Riashat Islam, Siddhartha Sen, John Langford, Katja Hofmann, Sergio Valcarcel Macua

Main category: cs.LG

TL;DR: PIDM architectures combine future state prediction with inverse dynamics modeling to outperform behavior cloning, especially with limited expert demonstrations, through a bias-variance tradeoff.

DetailsMotivation: Behavior cloning often fails with limited expert demonstrations. While predictive inverse dynamics models (PIDM) outperform BC, the theoretical reasons for their benefits remain unclear. The paper aims to provide a theoretical explanation for PIDM's advantages.

Method: Theoretical analysis of PIDM’s bias-variance tradeoff: predicting future states introduces bias but conditioning inverse dynamics models on predictions reduces variance. Establishes conditions on state predictor bias for PIDM to outperform BC. Empirical validation in 2D navigation tasks and complex 3D video game environments with high-dimensional visual inputs.

Result: BC requires up to 5 times (3 times on average) more demonstrations than PIDM in 2D navigation tasks to reach comparable performance. In complex 3D video game environments with high-dimensional visual inputs and stochastic transitions, BC requires over 66% more samples than PIDM.

Conclusion: PIDM introduces a beneficial bias-variance tradeoff that makes it more sample-efficient than behavior cloning, especially when expert demonstrations are limited. The theoretical framework explains why PIDM outperforms BC and provides conditions for its advantages.

Abstract: Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model (IDM). While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.
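
A minimal sketch of the PIDM architecture discussed above: a future-state predictor feeding an inverse dynamics model, in contrast to behavior cloning's direct state-to-action map. Network sizes and the two-layer MLPs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PIDMPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        # Future-state predictor (introduces bias).
        self.predictor = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, state_dim))
        # Inverse dynamics model conditioned on the prediction (reduces variance).
        self.idm = nn.Sequential(nn.Linear(2 * state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state):
        next_state_hat = self.predictor(state)
        return self.idm(torch.cat([state, next_state_hat], dim=-1))
```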

[586] Understanding Model Merging: A Unified Generalization Framework for Heterogeneous Experts

Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen

Main category: cs.LG

TL;DR: Theoretical analysis of model merging under heterogeneous hyperparameters using L2-Stability theory, providing unified framework and practical recommendations for fine-tuning merge-friendly models.

DetailsMotivation: Model merging effectively combines capabilities from multiple fine-tuned models but lacks theoretical understanding, especially under varying hyperparameters. Practitioners need guidance for creating merge-friendly models without hyperparameter transparency.

Method: Employ L2-Stability theory under heterogeneous hyperparameter environments to analyze generalization of merged models. Provides theoretical framework explaining existing merging algorithms and offers practical recommendations for fine-tuning.

Result: Extensive experiments on ResNet/ViT across 20/8 visual classification tasks with thousands of fine-tuned models confirm theoretical predictions about hyperparameter impact on merged model generalization.

Conclusion: Theoretical framework explains existing merging algorithms and provides actionable guidance for practitioners to strategically fine-tune expert models for better merging outcomes.

Abstract: Model merging efficiently aggregates capabilities from multiple fine-tuned models into a single one, operating purely in parameter space without original data or expensive re-computation. Despite empirical successes, a unified theory for its effectiveness under heterogeneous fine-tuning hyperparameters (e.g., varying learning rates, batch sizes) remains missing. Moreover, the lack of hyperparameter transparency in open-source fine-tuned models makes it difficult to predict merged-model performance, leaving practitioners without guidance on how to fine-tune merge-friendly experts. To address those two challenges, we employ $L_2$-Stability theory under heterogeneous hyperparameter environments to analyze the generalization of the merged model $\boldsymbol{x}_{avg}$. This pioneering analysis yields two key contributions: (i) \textit{A unified theoretical framework} is provided to explain existing merging algorithms, revealing how they optimize specific terms in our bound, thus offering a strong theoretical foundation for empirical observations. (ii) \textit{Actionable recommendations} are proposed for practitioners to strategically fine-tune expert models, enabling the construction of merge-friendly models within the pretraining-to-fine-tuning pipeline. Extensive experiments on the ResNet/ViT family across 20/8 visual classification tasks, involving thousands of fine-tuned models, robustly confirm the impact of different hyperparameters on the generalization of $\boldsymbol{x}_{avg}$ predicted by our theoretical results.
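
For concreteness, the basic parameter-space operation analyzed above is plain (optionally weighted) averaging of fine-tuned expert weights; a minimal sketch, assuming all experts share the pretrained architecture:

```python
import torch

def merge_experts(state_dicts, weights=None):
    """Average the parameters of several fine-tuned experts in parameter space.
    Uniform weights recover simple model averaging; other schemes are assumptions."""
    n = len(state_dicts)
    weights = weights if weights is not None else [1.0 / n] * n
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}
```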

[587] Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle

Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

Main category: cs.LG

TL;DR: Adam optimizer performs better when momentum parameters β₁=β₂ due to gradient scale invariance property

DetailsMotivation: Empirical observations show that Adam optimizer works better when β₁=β₂, but there's no theoretical explanation for why this choice helps training performance and validation scores

Method: Formalize the concept of gradient scale invariance, prove that Adam becomes gradient scale invariant of first order if and only if β₁=β₂, and conduct experiments across vision and language tasks with different architectures

Result: When β₁=β₂, gradient rescaling has a markedly smoother effect on updates, and this balanced regime aligns with design principles of recent scale-robust optimizers

Conclusion: The β₁=β₂ condition provides gradient scale invariance, explaining the empirical success of balanced Adam and offering a principle for future optimizer design

Abstract: Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $\beta_{1}=\beta_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_{1}=\beta_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $\beta_{1}=\beta_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
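
For reference, the standard Adam update the analysis concerns (standard definitions, not the paper's notation); the balanced regime sets the two decay rates equal, $\beta_1 = \beta_2$, so the first and second moments aggregate the gradient history with the same weights:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2},\\
\theta_{t+1} &= \theta_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},
\qquad \hat m_t = \frac{m_t}{1-\beta_1^{t}}, \quad \hat v_t = \frac{v_t}{1-\beta_2^{t}}.
\end{aligned}
$$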

[588] SmartMeterFM: Unifying Smart Meter Data Generative Tasks Using Flow Matching Models

Nan Lin, Yanbo Wang, Jacco Heres, Peter Palensky, Pedro P. Vergara

Main category: cs.LG

TL;DR: A flow matching model for unified generative tasks on smart meter data including imputation, super-resolution, and synthetic generation without task-specific retraining.

DetailsMotivation: Smart meter data faces availability issues due to privacy regulations, corruption from sensor/transmission failures, and insufficient resolution for downstream tasks. Current ML approaches require separate models for each generative task (synthetic generation, imputation, super-resolution), leading to redundancy and inefficiency.

Method: Proposes a flow matching model for conditional generation of high-dimensional time series data (monthly smart meter data at 15 min resolution). Different generative tasks are unified by treating them as partial data observations that are injected into the generation process, allowing a single model to handle multiple tasks without retraining.

Result: The model generates data consistent with given observations while remaining realistic, outperforming interpolation methods and other ML baselines dedicated to specific tasks.

Conclusion: Flow matching models provide a powerful unified approach for diverse smart meter data generative tasks, eliminating the need for task-specific models and retraining while maintaining data consistency and realism.

Abstract: Smart meter data is the foundation for planning and operating the distribution network. Unfortunately, such data are not always available due to privacy regulations. Meanwhile, the collected data may be corrupted due to sensor or transmission failure, or it may not have sufficient resolution for downstream tasks. A wide range of generative tasks is formulated to address these issues, including synthetic data generation, missing data imputation, and super-resolution. Despite the success of machine learning models on these tasks, dedicated models need to be designed and trained for each task, leading to redundancy and inefficiency. In this paper, by recognizing the powerful modeling capability of flow matching models, we propose a new approach to unify diverse smart meter data generative tasks with a single model trained for conditional generation. The proposed flow matching models are trained to generate challenging, high-dimensional time series data, specifically monthly smart meter data at 15-minute resolution. By viewing different generative tasks as distinct forms of partial data observations and injecting them into the generation process, we unify tasks such as imputation and super-resolution with a single model, eliminating the need for re-training. The data generated by our model are not only consistent with the given observations but also remain realistic, outperforming interpolation and other machine learning-based baselines dedicated to these tasks.
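
A minimal sketch of the "partial observation" view described above, using an Euler flow-matching sampler that re-imposes observed meter readings along the linear noise-to-data interpolation path; the paper's exact conditioning mechanism may differ. `velocity_fn` stands in for a trained flow-matching network.

```python
import torch

@torch.no_grad()
def generate_with_observations(velocity_fn, x_obs, obs_mask, steps=100):
    """x_obs:    (B, T) known meter readings (zeros where unknown)
    obs_mask: (B, T) 1 where a reading is observed, 0 otherwise"""
    x = torch.randn_like(x_obs)                              # t = 0: pure noise
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps)
        x = x + velocity_fn(x, t) / steps                    # Euler step along the flow
        # Re-impose observed entries, interpolated to the current time so they
        # stay on the (noise -> data) path used by flow matching.
        s = (i + 1) / steps
        x_known = (1 - s) * torch.randn_like(x_obs) + s * x_obs
        x = torch.where(obs_mask.bool(), x_known, x)
    return x
```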

[589] Temporal Sepsis Modeling: a Fully Interpretable Relational Way

Vincent Lemaire, Nédra Meloulli, Pierre Jaquet

Main category: cs.LG

TL;DR: A relational machine learning framework for sepsis prediction using temporal EMR data, focusing on interpretability through propositionalisation and selective naive Bayesian classification with four types of interpretation.

DetailsMotivation: Sepsis is complex and heterogeneous, but current deep learning models lack interpretability and ignore patient sub-phenotypes. There's a need for interpretable models that can reveal latent patient subgroups while maintaining predictive performance.

Method: 1) Represent temporal EMR data as multivariate patient logs in relational schema, 2) Apply propositionalisation (relational aggregation/selection functions) to create interpretable features and “flatten” data, 3) Use selective naive Bayesian classifier for classification, 4) Provide four types of interpretation: univariate, global, local, and counterfactual.

Result: Experimental validation demonstrates the approach’s relevance and extreme interpretability while maintaining predictive performance comparable to deep learning methods.

Conclusion: The relational framework offers an interpretable alternative to black-box deep learning for sepsis prediction, enabling better understanding of patient sub-phenotypes and clinical decision-making through multiple interpretation perspectives.

Abstract: Sepsis remains one of the most complex and heterogeneous syndromes in intensive care, characterized by diverse physiological trajectories and variable responses to treatment. While deep learning models perform well in the early prediction of sepsis, they often lack interpretability and ignore latent patient sub-phenotypes. In this work, we propose a machine learning framework that opens a new avenue for addressing this issue: a relational approach. Temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema. Then, a propositionalisation technique (based on classic aggregation/selection functions from the field of relational data) is applied to construct interpretable features to “flatten” the data. Finally, the flattened data is classified using a selective naive Bayesian classifier. Experimental validation demonstrates the relevance of the suggested approach as well as its extreme interpretability. The interpretation is fourfold: univariate, global, local, and counterfactual.
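
An illustrative propositionalisation step of the kind described above: relational aggregation functions flatten each patient's multivariate temporal log into a single interpretable row. Column names and aggregates are assumptions, not the paper's exact features.

```python
import pandas as pd

logs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "hour":       [0, 4, 8, 0, 6],
    "lactate":    [1.1, 2.3, 3.0, 0.9, 1.0],
    "heart_rate": [88, 104, 121, 75, 80],
})

flat = logs.sort_values("hour").groupby("patient_id").agg(
    lactate_max=("lactate", "max"),
    lactate_last=("lactate", "last"),
    hr_mean=("heart_rate", "mean"),
    n_measurements=("hour", "count"),
).reset_index()

print(flat)   # one interpretable row per patient, ready for the selective naive Bayes classifier
```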

[590] LoRA and Privacy: When Random Projections Help (and When They Don’t)

Yaxi Hu, Johanna Düngler, Bernhard Schölkopf, Amartya Sanyal

Main category: cs.LG

TL;DR: The Wishart projection mechanism is a randomized privacy mechanism using Wishart-distributed matrices. It provides differential privacy for vector queries without additive noise but fails for matrix queries, requiring noise for privacy. LoRA-style updates are shown to be non-private without noise.

DetailsMotivation: To develop a privacy mechanism that leverages randomness from Wishart distributions for differential privacy, exploring whether structured randomness alone can provide privacy guarantees without additive noise, particularly for machine learning applications like fine-tuning.

Method: Proposes the Wishart projection mechanism: S ↦ M f(S) where M ∼ W_d(1/r I_d, r). Analyzes differential privacy properties for vector-valued and matrix-valued queries. For matrix queries, adds noise and analyzes privacy amplification from randomness and low-rank projection. Shows LoRA updates are an instance of this mechanism.

Result: For vector queries: non-asymptotic DP guarantees without additive noise. For matrix queries: noise-free version is not DP (AUC > 0.99 for membership inference attacks). Noisy variant shows privacy amplification from randomness and low-rank projection. LoRA is not inherently private but low-rank fine-tuning can be more private than full fine-tuning at same noise level.

Conclusion: Wishart randomness alone provides DP for vector queries but not for matrix queries. Noise combined with low-rank projection amplifies privacy. LoRA-style updates require careful privacy accounting as they’re not inherently private despite built-in randomness.

Abstract: We introduce the (Wishart) projection mechanism, a randomized map of the form $S \mapsto M f(S)$ with $M \sim W_d(1/r I_d, r)$, and study its differential privacy properties. For vector-valued queries $f$, we prove non-asymptotic DP guarantees without any additive noise, showing that Wishart randomness alone can suffice. For matrix-valued queries, however, we establish a sharp negative result: in the noise-free setting, the mechanism is not DP, and we demonstrate its vulnerability by implementing a near-perfect membership inference attack (AUC $> 0.99$). We then analyze a noisy variant and prove privacy amplification due to randomness and low-rank projection, in both large- and small-rank regimes, yielding stronger privacy guarantees than additive noise alone. Finally, we show that LoRA-style updates are an instance of the matrix-valued mechanism, implying that LoRA is not inherently private despite its built-in randomness, but that low-rank fine-tuning can be more private than full fine-tuning at the same noise level. Preliminary experiments suggest that tighter accounting enables lower noise and improved accuracy in practice.
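
A minimal sketch of the mechanism as defined above: draw $M \sim W_d(1/r\, I_d, r)$ as a scaled outer product of standard Gaussian vectors and apply it to the query output. This illustrates only the noise-free, vector-valued case.

```python
import numpy as np

def wishart_projection(f_S, rank):
    """Release M @ f_S with M ~ W_d(1/r I_d, r), so that E[M] = I_d."""
    d = f_S.shape[0]
    A = np.random.randn(d, rank)        # r i.i.d. columns ~ N(0, I_d)
    M = (A @ A.T) / rank                # Wishart with scale matrix I_d / r
    return M @ f_S

f_S = np.random.randn(16)               # hypothetical vector-valued query output
released = wishart_projection(f_S, rank=64)
```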

[591] Amortized Spectral Kernel Discovery via Prior-Data Fitted Network

Kaustubh Sharma, Srijan Tiwari, Ojasva Nema, Parikshit Pareek

Main category: cs.LG

TL;DR: Interpretability framework for PFNs that extracts explicit spectral densities and kernels from pre-trained models using attention latents, enabling fast inference with transparent covariance models.

DetailsMotivation: PFNs enable efficient amortized inference but lack transparency in learned priors and kernels, hindering downstream tasks like surrogate-based optimization that require explicit covariance models.

Method: Mechanistic analysis identifies attention latent output as key intermediary linking function data to spectral structure. Propose decoder architectures that map PFN latents to explicit spectral density estimates and stationary kernels via Bochner’s theorem.

Result: Decoders recover complex multi-peak spectral mixtures and produce explicit kernels supporting Gaussian process regression with accuracy comparable to PFNs and optimization baselines, with orders-of-magnitude faster inference.

Conclusion: Framework enables interpretable extraction of spectral information from PFNs, providing transparent covariance models for downstream tasks while maintaining efficiency of amortized inference.

Abstract: Prior-Data Fitted Networks (PFNs) enable efficient amortized inference but lack transparent access to their learned priors and kernels. This opacity hinders their use in downstream tasks, such as surrogate-based optimization, that require explicit covariance models. We introduce an interpretability-driven framework for amortized spectral discovery from pre-trained PFNs with decoupled attention. We perform a mechanistic analysis on a trained PFN that identifies attention latent output as the key intermediary, linking observed function data to spectral structure. Building on this insight, we propose decoder architectures that map PFN latents to explicit spectral density estimates and corresponding stationary kernels via Bochner’s theorem. We study this pipeline in both single-realization and multi-realization regimes, contextualizing theoretical limits on spectral identifiability and proving consistency when multiple function samples are available. Empirically, the proposed decoders recover complex multi-peak spectral mixtures and produce explicit kernels that support Gaussian process regression with accuracy comparable to PFNs and optimization-based baselines, while requiring only a single forward pass. This yields orders-of-magnitude reductions in inference time compared to optimization-based baselines.
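
To make the Bochner step concrete: once a decoder outputs a Gaussian-mixture spectral density, the corresponding stationary kernel has a closed form (the 1-D spectral-mixture kernel). A minimal sketch, with illustrative mixture parameters rather than decoded ones:

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q),
    i.e. the stationary kernel obtained via Bochner's theorem from a symmetric
    Gaussian-mixture spectral density with means mu_q and variances v_q."""
    tau = np.asarray(tau)[..., None]                 # broadcast over mixture components
    k = weights * np.exp(-2 * np.pi**2 * tau**2 * variances) * np.cos(2 * np.pi * tau * means)
    return k.sum(-1)

taus = np.linspace(0.0, 3.0, 200)
k_vals = spectral_mixture_kernel(taus,
                                 weights=np.array([0.6, 0.4]),
                                 means=np.array([0.5, 2.0]),
                                 variances=np.array([0.05, 0.10]))
```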

[592] Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators

Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers

Main category: cs.LG

TL;DR: A reinforcement learning-based mixed-precision training and compilation framework for Computing-in-Memory accelerators to optimize quantization below 8-bit for faster ML inference with minimal accuracy loss.

DetailsMotivation: Current CIM accelerators have limited bit widths (below 8-bit) but most compilers don't support such low-precision quantization, requiring many compute cycles and inefficient weight storage in crossbar cells.

Method: Proposes a mixed-precision training and compilation framework using reinforcement learning to search for optimal quantization configurations that balance latency and accuracy in the massive search space.

Result: Achieves up to 2.48x speedup over state-of-the-art solutions with only 0.086% accuracy loss in the best case.

Conclusion: The RL-based approach effectively optimizes quantization for CIM accelerators, significantly improving performance while maintaining high accuracy.

Abstract: Computing-in-Memory (CIM) accelerators are a promising solution for accelerating Machine Learning (ML) workloads, as they perform Matrix-Vector Multiplications (MVMs) on crossbar arrays directly in memory. Although the bit widths of the crossbar inputs and cells are very limited, most CIM compilers do not support quantization below 8 bits. As a result, a single MVM requires many compute cycles, and weights cannot be efficiently stored in a single crossbar cell. To address this problem, we propose a mixed-precision training and compilation framework for CIM architectures. The biggest challenge is the massive search space, which makes it difficult to find good quantization parameters. This is why we introduce a reinforcement learning-based strategy to find suitable quantization configurations that balance latency and accuracy. In the best case, our approach achieves up to a 2.48x speedup over existing state-of-the-art solutions, with an accuracy loss of only 0.086%.

[593] ECSEL: Explainable Classification via Signomial Equation Learning

Adia Lumadjeng, Ilker Birbil, Erman Acar

Main category: cs.LG

TL;DR: ECSEL is an explainable classification method that learns signomial equations, achieving competitive accuracy while maintaining interpretability and providing insights into dataset biases.

DetailsMotivation: The paper is motivated by the need for interpretable machine learning models that can provide explanations for their decisions. Many symbolic regression benchmarks have compact signomial structure, suggesting signomial equations could serve as both accurate classifiers and interpretable explanations.

Method: ECSEL learns formal expressions in the form of signomial equations. It directly constructs structural, closed-form expressions that serve as both classifiers and explanations. The method is designed to be computationally efficient while recovering target equations from symbolic regression benchmarks.

Result: On symbolic regression benchmarks, ECSEL recovers a larger fraction of target equations than state-of-the-art approaches with substantially less computation. It achieves classification accuracy competitive with established ML models while maintaining interpretability. The learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights in real-world applications like e-commerce and fraud detection.

Conclusion: ECSEL demonstrates that signomial equations can serve as effective, interpretable classifiers that provide meaningful explanations and insights into dataset characteristics, making them valuable for applications requiring both accuracy and transparency.

Abstract: We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies, i.e., e-commerce and fraud detection, demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
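
For concreteness, a signomial classifier evaluates $f(x) = \sum_i c_i \prod_j x_j^{a_{ij}}$ and thresholds the result; a minimal sketch with illustrative (not learned) coefficients and exponents:

```python
import numpy as np

def signomial(X, coeffs, exponents):
    """X: (n, d) positive features; coeffs: (m,); exponents: (m, d).
    Returns f(x) = sum_i coeffs[i] * prod_j x_j ** exponents[i, j] per row."""
    monomials = np.prod(X[:, None, :] ** exponents[None, :, :], axis=-1)   # (n, m)
    return monomials @ coeffs

X = np.abs(np.random.randn(5, 3)) + 0.1                 # signomials assume positive inputs
scores = signomial(X,
                   coeffs=np.array([1.5, -0.8]),
                   exponents=np.array([[1.0, 0.5, 0.0],
                                       [0.0, 2.0, 1.0]]))
labels = (scores > 0).astype(int)                       # the equation doubles as the explanation
```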

[594] FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer

Chenrui Xu, Wenjing Yan, Ying-Jun Angela Zhang

Main category: cs.LG

TL;DR: FISMO optimizer combines isotropic momentum updates with anisotropic Fisher information geometry for better optimization in large neural networks.

DetailsMotivation: Current optimizers like Adam dominate practice but Muon's strict isotropy discards valuable curvature information. Need methods that balance geometric structure with adaptivity to local loss landscape.

Method: Generalizes isotropic updates to incorporate anisotropic curvature through Fisher information geometry. Reformulates optimizer update as trust-region problem constrained by Kronecker-factored Fisher metric for structured preconditioning.

Result: Establishes convergence guarantees with O(1/√T) rate for expected squared gradient norm. Empirical evaluation shows superior training efficiency and final performance on image classification and language modeling benchmarks.

Conclusion: FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability, outperforming established baselines.

Abstract: Training large-scale neural networks requires solving nonconvex optimization where the choice of optimizer fundamentally determines both convergence behavior and computational efficiency. While adaptive methods like Adam have long dominated practice, the recently proposed Muon optimizer achieves superior performance through orthogonalized momentum updates that enforce isotropic geometry with uniform singular values. However, this strict isotropy discards potentially valuable curvature information encoded in gradient spectra, motivating optimization methods that balance geometric structure with adaptivity. We introduce FISMO (Fisher-Structured Momentum-Orthogonalized) optimizer, which generalizes isotropic updates to incorporate anisotropic curvature information through Fisher information geometry. By reformulating the optimizer update as a trust-region problem constrained by a Kronecker-factored Fisher metric, FISMO achieves structured preconditioning that adapts to local loss landscape geometry while maintaining computational tractability. We establish convergence guarantees for FISMO in stochastic nonconvex settings, proving an $\mathcal{O}(1/\sqrt{T})$ rate for the expected squared gradient norm with explicit characterization of variance reduction through mini-batching. Empirical evaluation on image classification and language modeling benchmarks demonstrates that FISMO achieves superior training efficiency and final performance compared to established baselines.

[595] Effective LoRA Adapter Routing using Task Representations

Akash Dhasade, Anne-Marie Kermarrec, Igor Pavlovic, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos

Main category: cs.LG

TL;DR: LORAUTER is a novel routing framework that selects and composes LoRA adapters using task embeddings rather than adapter characteristics, enabling efficient scaling and improved performance across diverse tasks.

DetailsMotivation: As public LoRA adapter pools grow rapidly with diverse tasks, there's a need for effective routing mechanisms to select and compose appropriate adapters for queries without requiring adapter training data.

Method: LORAUTER routes queries via task embeddings derived from small validation sets, operating at the task level rather than mapping queries directly to adapters. This enables efficient routing that scales with the number of tasks rather than adapters.

Result: LORAUTER consistently outperforms baseline routing approaches, matches Oracle performance (101.2%) when task-aligned adapters exist, achieves state-of-the-art results on unseen tasks (+5.2 points), and scales robustly to over 1500 adapters.

Conclusion: LORAUTER provides an effective task-level routing framework for LoRA adapters that scales efficiently and performs well across diverse tasks, including unseen ones, while being robust to large, noisy adapter pools.

Abstract: Low-rank adaptation (LoRA) enables parameter efficient specialization of large language models (LLMs) through modular adapters, resulting in rapidly growing public adapter pools spanning diverse tasks. Effectively using these adapters requires routing: selecting and composing the appropriate adapters for a query. We introduce LORAUTER, a novel routing framework that selects and composes LoRA adapters using task representations rather than adapter characteristics. Unlike existing approaches that map queries directly to adapters, LORAUTER routes queries via task embeddings derived from small validation sets and does not require adapter training data. By operating at the task level, LORAUTER achieves efficient routing that scales with the number of tasks rather than the number of adapters. Experiments across multiple tasks show that LORAUTER consistently outperforms baseline routing approaches, matching Oracle performance (101.2%) when task-aligned adapters exist and achieving state-of-the-art results on unseen tasks (+5.2 points). We further demonstrate the robustness of LORAUTER to very large, noisy adapter pools by scaling it to over 1500 adapters.
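
A minimal sketch of task-level routing as described above: each task gets an embedding (e.g., the centroid of its small validation set), queries are matched to the nearest task by cosine similarity, and the adapters registered for that task are selected. All names below are hypothetical.

```python
import numpy as np

def route(query_emb, task_embs, task_to_adapters, top_k=1):
    sims = task_embs @ query_emb / (
        np.linalg.norm(task_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return [task_to_adapters[t] for t in np.argsort(-sims)[:top_k]]

task_embs = np.random.randn(4, 256)            # one embedding per *task*, not per adapter
task_to_adapters = {0: "lora_math", 1: "lora_code", 2: "lora_qa", 3: "lora_summarize"}
selected = route(np.random.randn(256), task_embs, task_to_adapters)
```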

[596] Differentiable Knapsack and Top-k Operators via Dynamic Programming

Germain Vivier-Ardisson, Michaël E. Sander, Axel Parmentier, Mathieu Blondel

Main category: cs.LG

TL;DR: A unified framework for differentiable relaxations of Knapsack and Top-k operators using dynamic programming and entropy regularization, enabling gradient-based learning with discrete subset selection.

DetailsMotivation: Knapsack and Top-k operators are useful for selecting discrete subsets of variables in neural networks, but they are piecewise constant with zero gradients almost everywhere, making them challenging to integrate into gradient-based learning.

Method: Proposes a unified framework casting these operators as dynamic programs, deriving differentiable relaxations by smoothing the underlying recursions. Develops efficient parallel algorithms for deterministic/stochastic forward passes and vector-Jacobian products for backward passes.

Result: Proves Shannon entropy is the unique regularization choice yielding permutation-equivariant operators, characterizes regularizers inducing sparse selections. Demonstrates framework on decision-focused learning benchmark, constrained dynamic assortment RL problem, and discrete VAEs extension.

Conclusion: Provides a practical framework for integrating discrete subset selection operators into neural networks through differentiable relaxations, with theoretical guarantees and efficient algorithms.

Abstract: Knapsack and Top-k operators are useful for selecting discrete subsets of variables. However, their integration into neural networks is challenging as they are piecewise constant, yielding gradients that are zero almost everywhere. In this paper, we propose a unified framework casting these operators as dynamic programs, and derive differentiable relaxations by smoothing the underlying recursions. On the algorithmic side, we develop efficient parallel algorithms supporting both deterministic and stochastic forward passes, and vector-Jacobian products for the backward pass. On the theoretical side, we prove that Shannon entropy is the unique regularization choice yielding permutation-equivariant operators, and characterize regularizers inducing sparse selections. Finally, on the experimental side, we demonstrate our framework on a decision-focused learning benchmark, a constrained dynamic assortment RL problem, and an extension of discrete VAEs.
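
A minimal sketch of the smoothing idea for Top-k: the hard dynamic program V[i, j] = max(V[i-1, j], V[i-1, j-1] + s_i) is relaxed by replacing max with a temperature-scaled logsumexp, and the gradient of the smoothed optimal value with respect to the scores gives a soft selection vector. This particular parameterization is an assumption, not the paper's exact operator.

```python
import torch

def soft_topk(scores, k, temperature=0.1):
    """Soft Top-k selection indicators via a smoothed DP over the scores."""
    scores = scores.detach().requires_grad_(True)
    V = torch.full((k + 1,), -1e9)
    V[0] = 0.0                                          # selecting 0 items has value 0
    for i in range(scores.shape[0]):
        prev, new = V, [V[0]]
        for j in range(1, k + 1):
            skip = prev[j]                              # item i not selected
            take = prev[j - 1] + scores[i]              # item i selected
            new.append(temperature * torch.logsumexp(
                torch.stack([skip, take]) / temperature, dim=0))
        V = torch.stack(new)
    (selection,) = torch.autograd.grad(V[k], scores)    # soft indicators in (0, 1), summing ~ k
    return selection

sel = soft_topk(torch.tensor([2.0, -1.0, 0.5, 3.0, 0.4]), k=2)
```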

[597] Quantum LEGO Learning: A Modular Design Principle for Hybrid Artificial Intelligence

Jun Qi, Chao-Han Huck Yang, Pin-Yu Chen, Min-Hsiu Hsieh, Hector Zenil, Jesper Tegner

Main category: cs.LG

TL;DR: Quantum LEGO Learning: A modular framework that treats classical neural networks as frozen feature blocks and variational quantum circuits as trainable adaptive modules, enabling efficient hybrid quantum-classical learning with theoretical guarantees.

DetailsMotivation: Existing hybrid quantum-classical models often use tightly coupled architectures or task-specific encoders, limiting conceptual clarity, generality, and transferability across learning settings. There's a need for a more modular, architecture-agnostic framework.

Method: Proposes Quantum LEGO Learning framework where pre-trained classical neural networks serve as frozen feature blocks and variational quantum circuits act as trainable adaptive modules. Develops block-wise generalization theory decomposing learning error into approximation and estimation components. Validates through systematic block-swap experiments across different feature extractors and adaptive heads.

Result: Experiments on quantum dot classification demonstrate stable optimization, reduced sensitivity to qubit count, and robustness to realistic noise. The framework enables efficient learning under constrained quantum resources and provides principled abstraction for analyzing hybrid models.

Conclusion: Quantum LEGO Learning offers a modular, theoretically grounded framework for hybrid quantum-classical learning that separates feature extraction from adaptive processing, enabling better analysis and more efficient use of quantum resources while maintaining performance.

Abstract: Hybrid quantum-classical learning models increasingly integrate neural networks with variational quantum circuits (VQCs) to exploit complementary inductive biases. However, many existing approaches rely on tightly coupled architectures or task-specific encoders, limiting conceptual clarity, generality, and transferability across learning settings. In this work, we introduce Quantum LEGO Learning, a modular and architecture-agnostic learning framework that treats classical and quantum components as reusable, composable learning blocks with well-defined roles. Within this framework, a pre-trained classical neural network serves as a frozen feature block, while a VQC acts as a trainable adaptive module that operates on structured representations rather than raw inputs. This separation enables efficient learning under constrained quantum resources and provides a principled abstraction for analyzing hybrid models. We develop a block-wise generalization theory that decomposes learning error into approximation and estimation components, explicitly characterizing how the complexity and training status of each block influence overall performance. Our analysis generalizes prior tensor-network-specific results and identifies conditions under which quantum modules provide representational advantages over comparably sized classical heads. Empirically, we validate the framework through systematic block-swap experiments across frozen feature extractors and both quantum and classical adaptive heads. Experiments on quantum dot classification demonstrate stable optimization, reduced sensitivity to qubit count, and robustness to realistic noise.

[598] NetMamba+: A Framework of Pre-trained Models for Efficient and Accurate Network Traffic Classification

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Jinzhou Liu, Boyan Huang, Yannan Hu, Youjian Zhao, Yong Cui

Main category: cs.LG

TL;DR: NetMamba+ is an efficient network traffic classification framework using Mamba architecture and multimodal representations to address computational inefficiency, representation bias, and long-tail distribution challenges.

DetailsMotivation: Address three critical challenges in network traffic classification: computational inefficiency of Transformers, inadequate traffic representations that lose byte-level features while retaining biases, and poor handling of long-tail distributions in real-world data.

Method: Three key innovations: (1) efficient architecture using Mamba and Flash Attention mechanisms, (2) multimodal traffic representation preserving essential information while eliminating biases, and (3) label distribution-aware fine-tuning strategy.

Result: Superior classification performance with up to 6.44% F1 score improvement over SOTA baselines, 1.7x higher inference throughput, excellent few-shot learning abilities, and real-world throughput of 261.87 Mb/s in online system.

Conclusion: NetMamba+ is the first framework to adapt Mamba architecture for network traffic classification, enabling efficient and accurate traffic analysis in complex network environments with robust real-world performance.

Abstract: With the rapid growth of encrypted network traffic, effective traffic classification has become essential for network security and quality of service management. Current machine learning and deep learning approaches for traffic classification face three critical challenges: computational inefficiency of Transformer architectures, inadequate traffic representations with loss of crucial byte-level features while retaining detrimental biases, and poor handling of long-tail distributions in real-world data. We propose NetMamba+, a framework that addresses these challenges through three key innovations: (1) an efficient architecture considering Mamba and Flash Attention mechanisms, (2) a multimodal traffic representation scheme that preserves essential traffic information while eliminating biases, and (3) a label distribution-aware fine-tuning strategy. Evaluation experiments on massive datasets encompassing four main classification tasks showcase NetMamba+’s superior classification performance compared to state-of-the-art baselines, with improvements of up to 6.44% in F1 score. Moreover, NetMamba+ demonstrates excellent efficiency, achieving 1.7x higher inference throughput than the best baseline while maintaining comparably low memory usage. Furthermore, NetMamba+ exhibits superior few-shot learning abilities, achieving better classification performance with fewer labeled data. Additionally, we implement an online traffic classification system that demonstrates robust real-world performance with a throughput of 261.87 Mb/s. As the first framework to adapt Mamba architecture for network traffic classification, NetMamba+ opens new possibilities for efficient and accurate traffic analysis in complex network environments.

[599] Knowledge Vector Weakening: Efficient Training-free Unlearning for Large Vision-Language Models

Yejin Kim, Dongjun Hwang, Sungmin Cha, Junsuk Choe

Main category: cs.LG

TL;DR: KVW is a training-free unlearning method for Large Vision-Language Models that directly intervenes in model activations to remove undesirable knowledge without gradient computation.

DetailsMotivation: Large Vision-Language Models raise privacy and safety concerns, but existing unlearning methods rely on computationally expensive gradient-based optimization. There's a need for efficient unlearning approaches for large-scale models.

Method: Knowledge Vector Weakening (KVW) identifies knowledge vectors activated during model output generation on forget sets and progressively weakens their contributions without gradient computation, directly intervening in the full model.
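
A minimal, hypothetical sketch of the kind of training-free intervention described above: a forward hook scales down chosen activation directions, and a decay schedule weakens them progressively without any gradient computation. How the knowledge vectors are actually identified is specific to the paper; the layer, indices, and decay rate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KnowledgeWeakener:
    """Toy sketch: progressively down-weight selected hidden dimensions."""

    def __init__(self, module: nn.Module, dims, decay=0.9):
        self.dims = dims          # hypothetical indices of "knowledge vectors"
        self.scale = 1.0
        self.decay = decay
        self.handle = module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        out = output.clone()
        out[..., self.dims] = out[..., self.dims] * self.scale
        return out                # returned value replaces the module output

    def step(self):
        self.scale *= self.decay  # progressive weakening, no gradients needed

    def remove(self):
        self.handle.remove()

# Usage on a stand-in model: weaken dimensions 3 and 7 of a hidden layer.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
weakener = KnowledgeWeakener(model[0], dims=[3, 7])
x = torch.randn(2, 16)
for _ in range(5):
    _ = model(x)
    weakener.step()
weakener.remove()
```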

Result: KVW achieves stable forget-retain trade-off on MLLMU and CLEAR benchmarks while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.

Conclusion: KVW provides an efficient, training-free solution for unlearning in Large Vision-Language Models, addressing computational bottlenecks while maintaining effective knowledge removal.

Abstract: Large Vision-Language Models (LVLMs) are widely adopted for their strong multimodal capabilities, yet they raise serious concerns such as privacy leakage and harmful content generation. Machine unlearning has emerged as a promising solution for removing the influence of specific data from trained models. However, existing approaches largely rely on gradient-based optimization, incurring substantial computational costs for large-scale LVLMs. To address this limitation, we propose Knowledge Vector Weakening (KVW), a training-free unlearning method that directly intervenes in the full model without gradient computation. KVW identifies knowledge vectors that are activated during the model’s output generation on the forget set and progressively weakens their contributions, thereby preventing the model from exploiting undesirable knowledge. Experiments on the MLLMU and CLEAR benchmarks demonstrate that KVW achieves a stable forget-retain trade-off while significantly improving computational efficiency over gradient-based and LoRA-based unlearning methods.

[600] MoHETS: Long-term Time Series Forecasting with Mixture-of-Heterogeneous-Experts

Evandro S. Ortigossa, Guy Lutsker, Eran Segal

Main category: cs.LG

TL;DR: MoHETS: A Transformer-based model with Mixture-of-Heterogeneous-Experts for multivariate time series forecasting, combining convolutional and Fourier experts with cross-attention for exogenous variables.

DetailsMotivation: Real-world multivariate time series have complex multi-scale structures (global trends, local periodicities, non-stationary regimes) that make long-horizon forecasting challenging. Existing sparse Mixture-of-Experts approaches use homogeneous MLP experts that fail to capture diverse temporal dynamics.

Method: Encoder-only Transformer with sparse Mixture-of-Heterogeneous-Experts layers that route temporal patches to specialized experts: shared depthwise-convolution expert for sequence-level continuity and routed Fourier-based experts for patch-level periodic structures. Incorporates exogenous information via cross-attention and uses lightweight convolutional patch decoder instead of parameter-heavy linear heads.
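
A minimal sketch of what a mixture-of-heterogeneous-experts layer along these lines could look like, assuming inputs of shape (batch, num_patches, d_model). The expert designs, the top-1 router, and the residual combination are illustrative simplifications, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FourierExpert(nn.Module):
    """Routed expert: learnable per-frequency scaling of each patch."""
    def __init__(self, d_model):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model // 2 + 1))

    def forward(self, x):                          # x: (B, P, D)
        spec = torch.fft.rfft(x, dim=-1)           # per-patch frequency view
        return torch.fft.irfft(spec * self.weight, n=x.shape[-1], dim=-1)

class MoHELayer(nn.Module):
    def __init__(self, d_model, num_fourier_experts=4):
        super().__init__()
        # Shared depthwise-convolution expert for sequence-level continuity.
        self.shared_conv = nn.Conv1d(d_model, d_model, kernel_size=3,
                                     padding=1, groups=d_model)
        self.experts = nn.ModuleList([FourierExpert(d_model)
                                      for _ in range(num_fourier_experts)])
        self.router = nn.Linear(d_model, num_fourier_experts)

    def forward(self, x):                          # x: (B, P, D)
        shared = self.shared_conv(x.transpose(1, 2)).transpose(1, 2)
        idx = self.router(x).argmax(dim=-1)        # top-1 routing (simplified)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e).unsqueeze(-1).to(x.dtype)
            routed = routed + mask * expert(x)
        return x + shared + routed

layer = MoHELayer(d_model=64)
print(layer(torch.randn(8, 24, 64)).shape)         # torch.Size([8, 24, 64])
```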

Result: State-of-the-art performance across seven multivariate benchmarks and multiple horizons, reducing average MSE by 12% compared to strong recent baselines. Demonstrates effective heterogeneous specialization for long-term forecasting.

Conclusion: MoHETS effectively addresses the limitations of homogeneous MoE approaches by introducing heterogeneous experts specialized for different temporal patterns, improving both performance and parameter efficiency for multivariate time series forecasting.

Abstract: Real-world multivariate time series can exhibit intricate multi-scale structures, including global trends, local periodicities, and non-stationary regimes, which makes long-horizon forecasting challenging. Although sparse Mixture-of-Experts (MoE) approaches improve scalability and specialization, they typically rely on homogeneous MLP experts that poorly capture the diverse temporal dynamics of time series data. We address these limitations with MoHETS, an encoder-only Transformer that integrates sparse Mixture-of-Heterogeneous-Experts (MoHE) layers. MoHE routes temporal patches to a small subset of expert networks, combining a shared depthwise-convolution expert for sequence-level continuity with routed Fourier-based experts for patch-level periodic structures. MoHETS further improves robustness to non-stationary dynamics by incorporating exogenous information via cross-attention over covariate patch embeddings. Finally, we replace parameter-heavy linear projection heads with a lightweight convolutional patch decoder, improving parameter efficiency, reducing training instability, and allowing a single model to generalize across arbitrary forecast horizons. We validate across seven multivariate benchmarks and multiple horizons, with MoHETS consistently achieving state-of-the-art performance, reducing the average MSE by $12%$ compared to strong recent baselines, demonstrating effective heterogeneous specialization for long-term forecasting.

[601] Nonparametric LLM Evaluation from Preference Data

Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel

Main category: cs.LG

TL;DR: DMLEval: A nonparametric statistical framework using debiased machine learning to compare and rank LLMs from preference data with valid uncertainty quantification and optimal data collection policies.

DetailsMotivation: Existing methods for evaluating LLMs from human preference data either rely on restrictive parametric assumptions or lack valid uncertainty quantification when using flexible machine learning methods, creating a need for more robust statistical frameworks.

Method: Proposes DMLEval framework using debiased machine learning (DML) with generalized average ranking scores (GARS) that generalize common ranking models like Bradley-Terry and PageRank, supporting complex human responses including ties.
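
For orientation, the sketch below fits the classical Bradley-Terry scores that GARS-style ranking scores generalize, using the standard MM updates. This is the simple parametric baseline, not the paper's debiased nonparametric estimator, and the pairwise win counts are made up for illustration.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):                      # standard MM updates
        for i in range(n):
            num = wins[i].sum()
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den
        p /= p.sum()                            # fix the scale
    return p

wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])                    # 3 hypothetical LLMs
scores = bradley_terry(wins)
print(np.argsort(-scores))                      # ranking, best first
```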

Result: Theoretical and empirical validation using synthetic and real-world preference datasets shows DMLEval produces statistically efficient estimates, allows black-box ML methods, integrates with LLM-as-a-judge evaluators, and suggests optimal data collection policies.

Conclusion: DMLEval provides practitioners with powerful, state-of-the-art methods for comparing and ranking LLMs with proper statistical foundations and practical advantages for preference-based evaluation.

Abstract: Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, DMLEval, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model or PageRank/Rank centrality, with complex human responses such as ties. DMLEval comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using both synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs.

[602] DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

Main category: cs.LG

TL;DR: DASH introduces scheduling strategies to improve throughput of deterministic attention backward passes in LLM training by optimizing compute and gradient-reduction scheduling.

DetailsMotivation: Determinism is crucial for reproducibility in LLM training, but current deterministic attention implementations (like FlashAttention-3) suffer significant performance penalties (up to 37.9% throughput reduction) due to suboptimal scheduling of compute and gradient-reduction operations.

Method: Formulates deterministic attention backward pass as a scheduling problem on a Directed Acyclic Graph (DAG) and develops DASH with two strategies: (1) Descending Q-Tile Iteration for causal attention, and (2) Shift Scheduling for both full and causal masks.
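
To make the scheduling view concrete, here is a tiny, self-contained computation of the critical (longest) path of a DAG with per-node costs, which lower-bounds the runtime of any schedule. The example graph and costs are made up; the real DAG models attention-backward tiles and the serialized gradient reductions.

```python
from collections import defaultdict, deque

def critical_path(edges, cost):
    """Longest path length in a DAG with per-node costs."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    indeg = {n: len(pred[n]) for n in cost}
    ready = deque(n for n in cost if indeg[n] == 0)
    finish = {}
    while ready:                                   # topological sweep
        u = ready.popleft()
        finish[u] = cost[u] + max((finish[p] for p in pred[u]), default=0)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return max(finish.values())

# Two compute tiles feeding one serialized gradient reduction (made-up costs).
edges = [("c0", "r"), ("c1", "r")]
cost = {"c0": 3, "c1": 3, "r": 2}
print(critical_path(edges, cost))  # 5
```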

Result: On NVIDIA H800 GPUs, DASH improves attention backward pass throughput by up to 1.28× compared to baseline, significantly narrowing the performance gap of deterministic attention.

Conclusion: DASH advances the efficiency of reproducible LLM training by optimizing scheduling strategies for deterministic attention, making determinism more practical without sacrificing performance.

Abstract: Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.

[603] Robust Multimodal Representation Learning in Healthcare

Xiaoguang Zhu, Linxiao Gong, Lianlong Sun, Yang Liu, Haoyu Wang, Jing Liu

Main category: cs.LG

TL;DR: A dual-stream feature decorrelation framework for medical multimodal representation learning that addresses systematic biases through causal analysis and decorrelation techniques.

DetailsMotivation: Real-world medical datasets contain systematic biases from multiple sources that challenge multimodal representation learning. Existing approaches focus on effective multimodal fusion but neglect inherent biased features that affect generalization ability.

Method: Proposes a Dual-Stream Feature Decorrelation Framework using structural causal analysis with latent confounders. Employs dual-stream neural networks to disentangle causal features from spurious correlations, using generalized cross-entropy loss and mutual information minimization for decorrelation.
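
The generalized cross-entropy loss mentioned above has a simple closed form; q in (0, 1] interpolates between ordinary cross-entropy (q -> 0) and MAE (q = 1). The snippet below is a minimal sketch of that single loss term, not the full dual-stream framework.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """L_q(p, y) = (1 - p_y^q) / q, averaged over the batch."""
    probs = F.softmax(logits, dim=-1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_true.clamp_min(1e-8) ** q) / q).mean()

logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 1])
loss = generalized_cross_entropy(logits, targets)
loss.backward()
print(float(loss))
```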

Result: Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.

Conclusion: The framework is model-agnostic and can be integrated into existing medical multimodal learning methods to handle biases and improve generalization.

Abstract: Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion, neglecting inherent biased features that affect the generalization ability. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that identifies and handles the biases through structural causal analysis introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.

[604] Goal-Driven Adaptive Sampling Strategies for Machine Learning Models Predicting Fields

Jigar Parekh, Philipp Bekemeyer

Main category: cs.LG

TL;DR: Active learning strategy for field predictions using Gaussian processes that reduces epistemic model error and scalar-field prediction differences, applied to NASA common research model uncertainty propagation.

DetailsMotivation: Current active learning strategies are limited to scalar quantities and lack extension to field predictions, which are needed in various applications like computational fluid dynamics. There's a need for model-agnostic active learning approaches that can efficiently handle field predictions with minimal computational cost.

Method: Proposes an active learning strategy combining Gaussian process models for scalar reference values while simultaneously reducing epistemic model error and differences between scalar and field predictions. The approach is model-architecture agnostic and introduces different specific forms of the strategy.
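
A minimal variance-based infill loop on a 1-D toy function is sketched below as a stand-in for the goal-driven criteria described above; the paper additionally folds in field-prediction discrepancies, which this sketch omits. The toy "simulation", kernel, and iteration counts are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expensive_simulation(x):            # stand-in for a CFD solver
    return np.sin(3 * x) + 0.1 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(4, 1))      # small initial design
y = expensive_simulation(X).ravel()
candidates = np.linspace(0, 3, 200).reshape(-1, 1)

for _ in range(5):                       # adaptive sampling iterations
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    x_new = candidates[np.argmax(std)]   # infill where the model is least sure
    X = np.vstack([X, x_new])
    y = np.append(y, expensive_simulation(x_new)[0])

print(X.round(2).ravel())                # sampled locations so far
```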

Result: Applied to NASA common research model for uncertainty propagation task, achieving high accuracy at significantly smaller computational cost compared to approaches without active learning.

Conclusion: The proposed active learning strategy successfully extends active learning to field predictions, providing an efficient, model-agnostic approach that reduces computational costs while maintaining accuracy in complex simulation tasks.

Abstract: Machine learning models are widely regarded as a way forward to tackle multi-query challenges that arise once expensive black-box simulations such as computational fluid dynamics are investigated. However, ensuring the desired level of accuracy for a certain task at minimal computational cost, e.g., with as few black-box samples as possible, remains a challenge. Active learning strategies are used for scalar quantities to overcome this challenge, and different so-called infill criteria exist and are commonly employed in several scenarios. Even though it is needed in various fields, an extension of active learning strategies towards field predictions is still lacking or limited to very specific scenarios and/or model types. In this paper we propose an active learning strategy for machine learning models that are capable of predicting fields, which is agnostic to the model architecture itself. To do so, we combine a well-established Gaussian process model for a scalar reference value and simultaneously aim at reducing the epistemic model error and the difference between scalar and field predictions. Different specific forms of the above-mentioned approach are introduced and compared to each other as well as to purely scalar-valued infill. Results are presented for the NASA common research model for an uncertainty propagation task, showcasing a high level of accuracy at significantly smaller cost compared to an approach without active learning.

[605] Scalable Linearized Laplace Approximation via Surrogate Neural Kernel

Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato

Main category: cs.LG

TL;DR: A scalable method to approximate Linearized Laplace Approximation kernels using a surrogate DNN that learns compact feature representations to replicate Neural Tangent Kernel without computing large Jacobians.

DetailsMotivation: The Linearized Laplace Approximation (LLA) requires computing large Jacobians for uncertainty estimation in deep neural networks, which is computationally expensive and impractical for large-scale pre-trained models. There's a need for scalable methods to approximate LLA kernels efficiently.

Method: Uses a surrogate deep neural network to learn a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). Training relies solely on efficient Jacobian-vector products, avoiding the need to compute large Jacobians directly.
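
To illustrate the "JVPs only" idea, the sketch below estimates empirical NTK entries with Jacobian-vector products alone, via a Hutchinson-style average over random parameter tangents (E_v[(J(x)v)(J(x')v)] = J(x) J(x')^T for a scalar-output model). The model, data, and number of probes are illustrative choices, and the surrogate feature network trained against these targets is omitted.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, jvp

model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 1))
params = dict(model.named_parameters())

def f(p, x):
    # Functional forward pass with explicit parameters, scalar output per sample.
    return functional_call(model, p, (x,)).squeeze(-1)

def ntk_entry(x1, x2, probes=256):
    acc = 0.0
    for _ in range(probes):
        v = {k: torch.randn_like(p) for k, p in params.items()}
        _, jv1 = jvp(lambda p: f(p, x1), (params,), (v,))
        _, jv2 = jvp(lambda p: f(p, x2), (params,), (v,))
        acc += float(jv1) * float(jv2)
    return acc / probes                 # noisy estimate of J(x1) J(x2)^T

x1, x2 = torch.randn(1, 8), torch.randn(1, 8)
print(ntk_entry(x1, x2))
```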

Result: Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Biasing the learned kernel significantly enhances out-of-distribution detection performance.

Conclusion: The proposed method enables efficient uncertainty estimation for large-scale pre-trained DNNs and demonstrates that better kernels than NTK can be found for LLA, particularly improving out-of-distribution detection capabilities.

Abstract: We introduce a scalable method to approximate the kernel of the Linearized Laplace Approximation (LLA). For this, we use a surrogate deep neural network (DNN) that learns a compact feature representation whose inner product replicates the Neural Tangent Kernel (NTK). This avoids the need to compute large Jacobians. Training relies solely on efficient Jacobian-vector products, allowing predictive uncertainty to be computed on large-scale pre-trained DNNs. Experimental results show similar or improved uncertainty estimation and calibration compared to existing LLA approximations. Notably, biasing the learned kernel significantly enhances out-of-distribution detection. This highlights the benefits of the proposed method for finding better kernels than the NTK in the context of LLA to compute prediction uncertainty given a pre-trained DNN.

[606] Constrained Meta Reinforcement Learning with Provable Test-Time Safety

Tingting Ni, Maryam Kamgarpour

Main category: cs.LG

TL;DR: A constrained meta reinforcement learning algorithm with safety guarantees and optimal sample complexity for test tasks

DetailsMotivation: Real-world applications like robotics and healthcare require safety constraints during testing, but existing meta RL approaches lack provable safety guarantees while maintaining sample efficiency for new tasks.

Method: Proposes an algorithm that refines policies learned during training with provable safety and sample complexity guarantees for learning near-optimal policies on test tasks, with matching lower bound showing optimality.

Result: The algorithm achieves provable safety guarantees and optimal sample complexity for learning near-optimal policies on test tasks, with matching lower bound demonstrating tightness of the sample complexity.

Conclusion: The work addresses the critical gap in constrained meta RL by providing an algorithm with provable safety guarantees and optimal sample complexity for real-world applications requiring safety constraints.

Abstract: Meta reinforcement learning (RL) allows agents to leverage experience across a distribution of tasks on which the agent can train at will, enabling faster learning of optimal policies on new test tasks. Despite its success in improving sample complexity on test tasks, many real-world applications, such as robotics and healthcare, impose safety constraints during testing. Constrained meta RL provides a promising framework for integrating safety into meta RL. An open question in constrained meta RL is how to ensure the safety of the policy on the real-world test task, while reducing the sample complexity and thus, enabling faster learning of optimal policies. To address this gap, we propose an algorithm that refines policies learned during training, with provable safety and sample complexity guarantees for learning a near optimal policy on the test tasks. We further derive a matching lower bound, showing that this sample complexity is tight.

[607] RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping

Yuzong Chen, Xilai Dai, Jake Hyun, Chi-Chih Chang, Wonsuk Jang, Yuheng Wu, Thierry Tambe, Jae-sun Seo, Mohamed S. Abdelfattah

Main category: cs.LG

TL;DR: RaZeR improves NVFP4 quantization by eliminating redundant bits in FP4 zero representation and FP8 scaling factors, using them to add more quantization values for better LLM accuracy at same memory cost.

DetailsMotivation: The paper identifies two types of redundancy in NVFP4 format: 1) FP4's sign-magnitude representation creates unused quantization values due to positive/negative zeros, and 2) FP8 block scaling factors have unused sign bits since they're always positive. Additionally, LLM weights tolerate lower-precision scaling factors. These observations motivate creating a more efficient format.

Method: Proposes Redundant Zero Remapping (RaZeR), which leverages redundant bits from block scaling factors to adaptively remap redundant FP4 zeros to additional quantization values. Includes efficient GPU kernel designs for RaZeR-quantized LLM inference and proposes novel hardware for native support.
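
The redundancy RaZeR exploits is easy to see by enumerating the FP4 (E2M1) code points: +0 and -0 occupy two of the 16 codes, so one code can be repurposed as an extra quantization level. The remap target (8.0 below) is purely illustrative; the paper chooses the additional value adaptively using the freed scaling-factor bits.

```python
def fp4_e2m1(code: int) -> float:
    """Decode a 4-bit sign-magnitude E2M1 code to its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    mag = 0.5 * man if exp == 0 else (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * mag

values = [fp4_e2m1(c) for c in range(16)]
print(sorted(values))                 # +0.0 and -0.0 are both present
print(len(set(values)))               # 15 distinct values out of 16 codes

# One possible remap: reuse the negative-zero code for an extra level.
remapped = [8.0 if c == 0b1000 else fp4_e2m1(c) for c in range(16)]
print(len(set(remapped)))             # 16 distinct values
```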

Result: Extensive experiments show RaZeR significantly outperforms native NVFP4. Reduces average perplexity loss by 34.6% for weight-only quantization and 31.2% for weight-activation quantization, demonstrating superior 4-bit LLM quantization performance.

Conclusion: RaZeR successfully pushes the limits of NVFP4 quantization by eliminating redundancies and repurposing bits for better accuracy, offering practical improvements for memory-efficient LLM inference with both software and hardware solutions.

Abstract: The recently introduced NVFP4 format demonstrates remarkable performance and memory benefits for quantized large language model (LLM) inference. However, we observe two types of redundancy in NVFP4 encoding: (1) The FP4 element format naturally exposes an unused quantization value due to its sign-magnitude representation that contains both positive and negative zeros. (2) The FP8 block scaling factor has an unused sign bit because it is always positive. Additionally, we find that LLM weights are more tolerant to a lower-precision block scaling factor. Based on these observations, we propose Redundant Zero Remapping (RaZeR), an enhanced numerical format that pushes the limits of NVFP4 for more accurate LLM quantization under the same memory footprint. RaZeR leverages the redundant bits of the block scaling factor to adaptively remap the redundant FP4 zero to additional quantization values with improved accuracy. To demonstrate the practicality of RaZeR, we design efficient GPU kernels for RaZeR-quantized LLM inference and propose novel hardware to natively support this. Extensive experiments validate RaZeR’s superior performance for 4-bit LLM quantization. For example, relative to native NVFP4, RaZeR reduces the average perplexity loss by 34.6% and 31.2% under weight-only and weight-activation quantization, respectively.

[608] Geometry of Drifting MDPs with Path-Integral Stability Certificates

Zuyuan Zhang, Mahdi Imani, Tian Lan

Main category: cs.LG

TL;DR: HT-RL and HT-MCTS are lightweight wrappers that track nonstationary MDPs by modeling environment changes as differentiable homotopy paths, adapting learning/planning intensity based on length, curvature, and near-tie proximity metrics.

DetailsMotivation: Real-world RL faces nonstationary environments with drifting rewards/dynamics, oscillations, and abrupt switches. Existing theory uses coarse models measuring how much change occurs, not how it changes locally, missing acceleration and near-tie effects that drive tracking error and policy chattering.

Method: Takes geometric view of nonstationary discounted MDPs by modeling environment as differentiable homotopy path and tracking induced motion of optimal Bellman fixed point. Derives length-curvature-kink signature of intrinsic complexity. Proves solver-agnostic path-integral stability bound and gap-safe feasible regions. Introduces HT-RL and HT-MCTS wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online to adapt learning/planning intensity.

Result: Experiments show improved tracking and dynamic regret over matched static baselines, with largest gains in oscillatory and switch-prone regimes.

Conclusion: Geometric modeling of nonstationary MDPs as homotopy paths provides effective framework for tracking environment changes, with lightweight wrappers HT-RL/HT-MCTS demonstrating practical improvements in nonstationary RL settings.

Abstract: Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally – even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.

[609] READY: Reward Discovery for Meta-Black-Box Optimization

Zechuan Huang, Zhiguang Cao, Hongshu Guo, Yue-Jiao Gong, Zeyuan Ma

Main category: cs.LG

TL;DR: LLM-based automated reward discovery for Meta-Black-Box Optimization using evolutionary search and multi-task architecture

DetailsMotivation: Existing MetaBBO approaches rely on human-designed reward functions which introduce design bias and reward hacking risks; need automated reward discovery

Method: Uses LLMs as an automated reward discovery tool, with an evolutionary search paradigm for continuous improvement and a multi-task architecture for parallel reward discovery across diverse MetaBBO approaches

Result: Discovered reward functions boost existing MetaBBO works, demonstrating importance of automated reward design

Conclusion: LLMs can effectively automate reward discovery for MetaBBO, overcoming human design limitations and improving optimization performance

Abstract: Meta-Black-Box Optimization (MetaBBO) is an emerging avenue within the optimization community, where algorithm design policies can be meta-learned by reinforcement learning to enhance optimization performance. So far, the reward functions in existing MetaBBO works have been designed by human experts, introducing certain design biases and risks of reward hacking. In this paper, we use a Large Language Model (LLM) as an automated reward discovery tool for MetaBBO. Specifically, we consider both the effectiveness and the efficiency side. On the effectiveness side, we borrow the idea of evolution of heuristics, introducing a tailored evolution paradigm into the iterative LLM-based program search process, which ensures continuous improvement. On the efficiency side, we additionally introduce a multi-task evolution architecture to support parallel reward discovery for diverse MetaBBO approaches. Such a parallel process also benefits from knowledge sharing across tasks to accelerate convergence. Empirical results demonstrate that the reward functions discovered by our approach can help boost existing MetaBBO works, underscoring the importance of reward design in MetaBBO. We provide READY’s project at https://anonymous.4open.science/r/ICML_READY-747F.

[610] Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models

Sidney Bender, Marco Morik

Main category: cs.LG

TL;DR: DiDAE is a framework that integrates frozen foundation models with disentangled dictionary learning for efficient, gradient-free counterfactual generation to mitigate shortcut learning in vision models.

DetailsMotivation: Foundation models are vulnerable to spurious correlations and 'Clever Hans' strategies, but existing mitigation methods require unavailable group labels or expensive gradient-based adversarial optimization. There's a need for efficient, gradient-free counterfactual generation methods.

Method: Visual Disentangled Diffusion Autoencoders (DiDAE) integrates frozen foundation models with disentangled dictionary learning. It first edits foundation model embeddings in interpretable disentangled directions using the dictionary, then decodes them via a diffusion autoencoder to generate multiple diverse, disentangled counterfactuals for each factual input.
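
A minimal sketch of the embedding-editing step, under stated assumptions: given a frozen foundation-model embedding z and a dictionary of (approximately) disentangled unit directions D, counterfactual embeddings are z shifted along individual directions with a few strengths. The dictionary and strengths here are random placeholders, and the diffusion-autoencoder decoding back to images is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_directions = 512, 16
D = rng.standard_normal((n_directions, dim))
D /= np.linalg.norm(D, axis=1, keepdims=True)        # unit-norm directions

z = rng.standard_normal(dim)                          # frozen FM embedding

def counterfactual_embeddings(z, D, strengths=(-2.0, 2.0)):
    edits = []
    for k in range(D.shape[0]):                        # one edit per direction
        for alpha in strengths:
            edits.append(z + alpha * D[k])
    return np.stack(edits)

cf = counterfactual_embeddings(z, D)
print(cf.shape)          # (32, 512): diverse, per-direction counterfactuals
```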

Result: DiDAE generates counterfactuals much faster than existing baselines (which produce single entangled counterfactuals). When combined with Counterfactual Knowledge Distillation (DiDAE-CFKD), it achieves state-of-the-art performance in mitigating shortcut learning and improves downstream performance on unbalanced datasets.

Conclusion: DiDAE provides an efficient, gradient-free approach to counterfactual generation that effectively mitigates shortcut learning in foundation models, outperforming existing methods while being computationally efficient.

Abstract: Foundation models, despite their robust zero-shot capabilities, remain vulnerable to spurious correlations and ‘Clever Hans’ strategies. Existing mitigation methods often rely on unavailable group labels or computationally expensive gradient-based adversarial optimization. To address these limitations, we propose Visual Disentangled Diffusion Autoencoders (DiDAE), a novel framework integrating frozen foundation models with disentangled dictionary learning for efficient, gradient-free counterfactual generation directly for the foundation model. DiDAE first edits foundation model embeddings in interpretable disentangled directions of the disentangled dictionary and then decodes them via a diffusion autoencoder. This allows the generation of multiple diverse, disentangled counterfactuals for each factual, much faster than existing baselines, which generate single entangled counterfactuals. When paired with Counterfactual Knowledge Distillation, DiDAE-CFKD achieves state-of-the-art performance in mitigating shortcut learning, improving downstream performance on unbalanced datasets.

[611] Plain Transformers Can be Powerful Graph Learners

Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

Main category: cs.LG

TL;DR: Plain Transformer architecture can be effective for graph learning with three simple modifications: simplified L2 attention, adaptive RMS normalization, and MLP-based positional encoding.

DetailsMotivation: Most Graph Transformers deviate significantly from plain Transformers by integrating message-passing or complex attention mechanisms, which hinders adoption of training advances from other domains. The authors want to demonstrate that plain Transformers can be powerful graph learners with minimal modifications.

Method: Propose Powerful Plain Graph Transformers (PPGT) with three simple modifications: 1) simplified L2 attention for measuring token magnitude closeness, 2) adaptive root-mean-square normalization to preserve token magnitude information, and 3) simple MLP-based stem for graph positional encoding.
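
The first modification is easy to sketch: attention logits come from negative squared Euclidean distances between queries and keys instead of dot products, so tokens with similar magnitudes attend to each other. Single head, no masking; the exact parameterization and scaling in the paper may differ.

```python
import torch
import torch.nn.functional as F

def l2_attention(q, k, v):
    # q, k, v: (batch, tokens, dim)
    dist2 = torch.cdist(q, k, p=2) ** 2                 # (B, T, T)
    weights = F.softmax(-dist2 / q.shape[-1] ** 0.5, dim=-1)
    return weights @ v

x = torch.randn(2, 5, 16)
print(l2_attention(x, x, x).shape)   # torch.Size([2, 5, 16])
```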

Result: PPGT shows strong empirical performance on graph expressivity benchmarks, comparing favorably to more complex alternatives like subgraph GNNs and higher-order GNNs. It also performs well across various graph datasets.

Conclusion: Plain Transformer architectures are versatile and have strong potential as a unified backbone for multimodal learning across language, vision, and graph domains.

Abstract: Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers (GTs) have strayed far from plain Transformers, exhibiting major architectural differences either by integrating message-passing or incorporating sophisticated attention mechanisms. These divergences hinder the easy adoption of training advances for Transformers developed in other domains. Contrary to previous GTs, this work demonstrates that the plain Transformer architecture can be a powerful graph learner. To achieve this, we propose to incorporate three simple, minimal, and easy-to-implement modifications to the plain Transformer architecture to construct our Powerful Plain Graph Transformers (PPGT): (1) simplified $L_2$ attention for measuring the magnitude closeness among tokens; (2) adaptive root-mean-square normalization to preserve token magnitude information; and (3) a simple MLP-based stem for graph positional encoding. Consistent with its theoretical expressivity, PPGT demonstrates noteworthy realized expressivity on the empirical graph expressivity benchmark, comparing favorably to more complicated alternatives such as subgraph GNNs and higher-order GNNs. Its empirical performance across various graph datasets also justifies the effectiveness of PPGT. This finding underscores the versatility of plain Transformer architectures and highlights their strong potential as a unified backbone for multimodal learning across language, vision, and graph domains.

[612] Low-Rank Plus Sparse Matrix Transfer Learning under Growing Representations and Ambient Dimensions

Jinhang Chai, Xuyuan Liu, Elynn Chen, Yujun Yan

Main category: cs.LG

TL;DR: Transfer learning framework for structured matrix estimation where target parameters decompose into embedded source components, low-rank innovations, and sparse edits, with applications to Markov transition matrices and covariance estimation.

DetailsMotivation: Learning systems often expand features over time, embedding earlier representations into larger spaces. The paper addresses transfer learning for structured matrix estimation when both ambient dimension and intrinsic representation grow simultaneously.

Method: Proposes a transfer framework where target parameters decompose into: 1) embedded source component, 2) low-dimensional low-rank innovations, and 3) sparse edits. Develops anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications.
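
A toy sketch of the decomposition step is shown below: given an embedded source estimate S, alternately fit a rank-r innovation L (truncated SVD of the residual) and a sparse edit E (soft-thresholding). Dimensions, rank, and the threshold are arbitrary, and the paper's anchored estimator additionally preserves transferred subspaces, which is not reproduced here.

```python
import numpy as np

def svd_truncate(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def soft_threshold(M, lam):
    return np.sign(M) * np.maximum(np.abs(M) - lam, 0.0)

rng = np.random.default_rng(0)
d = 30
S = np.pad(rng.standard_normal((20, 20)), ((0, 10), (0, 10)))   # embedded source
target = S + svd_truncate(rng.standard_normal((d, d)), 2)       # low-rank innovation
target[rng.random((d, d)) < 0.02] += 3.0                        # sparse edits

L = np.zeros((d, d))
E = np.zeros((d, d))
for _ in range(20):                       # alternating projection iterations
    L = svd_truncate(target - S - E, r=2)
    E = soft_threshold(target - S - L, lam=1.0)

print(np.linalg.norm(target - (S + L + E)) / np.linalg.norm(target))
```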

Result: Establishes deterministic error bounds separating target noise, representation growth, and source estimation error, yielding improved rates when rank and sparsity increments are small. Demonstrates framework on Markov transition matrix estimation from single trajectory and structured covariance estimation under enlarged dimensions.

Conclusion: The transfer framework provides theoretical guarantees for structured matrix estimation with growing dimensions, showing consistent transfer gains when innovations are low-dimensional and sparse.

Abstract: Learning systems often expand their ambient features or latent representations over time, embedding earlier representations into larger spaces with limited new latent structure. We study transfer learning for structured matrix estimation under simultaneous growth of the ambient dimension and the intrinsic representation, where a well-estimated source task is embedded as a subspace of a higher-dimensional target task. We propose a general transfer framework in which the target parameter decomposes into an embedded source component, low-dimensional low-rank innovations, and sparse edits, and develop an anchored alternating projection estimator that preserves transferred subspaces while estimating only low-dimensional innovations and sparse modifications. We establish deterministic error bounds that separate target noise, representation growth, and source estimation error, yielding strictly improved rates when rank and sparsity increments are small. We demonstrate the generality of the framework by applying it to two canonical problems. For Markov transition matrix estimation from a single trajectory, we derive end-to-end theoretical guarantees under dependent noise. For structured covariance estimation under enlarged dimensions, we provide complementary theoretical analysis in the appendix and empirically validate consistent transfer gains.

[613] Managing Solution Stability in Decision-Focused Learning with Cost Regularization

Victor Spitzer, Francois Sanson

Main category: cs.LG

TL;DR: Paper proposes regularization for decision-focused learning to address instability from perturbation fluctuations during training, improving robustness in combinatorial optimization coefficient estimation.

DetailsMotivation: Decision-focused learning trains models to improve decision quality rather than just prediction accuracy, but faces challenges in differentiating through combinatorial optimization problems. Perturbation-based approximations can lead to ineffective training due to fluctuations in perturbation intensity during learning.

Method: The authors establish theoretical link between perturbation fluctuations and solution stability in combinatorial optimization, then propose regularization of estimated cost vectors to improve robustness and reliability of the learning process.

Result: Extensive numerical experiments demonstrate that the proposed regularization approach improves the robustness and reliability of decision-focused learning for combinatorial optimization coefficient estimation.

Conclusion: Regularization of estimated cost vectors effectively addresses instability issues caused by perturbation fluctuations in decision-focused learning, leading to more robust and reliable training for combinatorial optimization problems.

Abstract: Decision-focused learning integrates predictive modeling and combinatorial optimization by training models to directly improve decision quality rather than prediction accuracy alone. Differentiating through combinatorial optimization problems represents a central challenge, and recent approaches tackle this difficulty by introducing perturbation-based approximations. In this work, we focus on estimating the objective function coefficients of a combinatorial optimization problem. Our study demonstrates that fluctuations in perturbation intensity occurring during the learning phase can lead to ineffective training, by establishing a theoretical link to the notion of solution stability in combinatorial optimization. We propose addressing this issue by introducing a regularization of the estimated cost vectors which improves the robustness and reliability of the learning process, as demonstrated by extensive numerical experiments.

[614] Guided Perturbation Sensitivity (GPS): Detecting Adversarial Text via Embedding Stability and Word Importance

Bryan E. Tuck, Rakesh M. Verma

Main category: cs.LG

TL;DR: GPS is an attack-agnostic adversarial text detection framework that measures embedding sensitivity when masking important words, achieving over 85% detection accuracy across multiple datasets and attacks without requiring model retraining.

DetailsMotivation: Existing adversarial text defenses are typically attack-specific or require costly model retraining, creating a need for attack-agnostic detection methods that can work across different attacks and models without retraining.

Method: GPS ranks words using importance heuristics, measures embedding sensitivity when masking top-k critical words, and processes resulting patterns with a BiLSTM detector. It uses gradient-based ranking to identify important words and measures how embeddings change when these words are masked.
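
The sensitivity signal GPS builds on can be sketched in a few lines: mask each important word and measure how far the text embedding moves. The encoder below is a trivial bag-of-hashed-words stand-in so the sketch runs without a model; GPS uses the victim transformer's embeddings and gradient-based word importance, and feeds the per-word scores to a BiLSTM detector.

```python
import numpy as np

def encode(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder encoder: normalized bag of hashed words."""
    vec = np.zeros(dim)
    for w in text.lower().split():
        vec[hash(w) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def masking_sensitivity(text: str, important_words, mask_token="[MASK]"):
    base = encode(text)
    scores = []
    for w in important_words:
        masked = " ".join(mask_token if t == w else t for t in text.split())
        scores.append(float(np.linalg.norm(base - encode(masked))))
    return scores

# "wonderfull" stands in for an adversarially perturbed word.
text = "the film was surprisingly wonderfull and moving"
print(masking_sensitivity(text, ["wonderfull", "film"]))
```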

Result: GPS achieves over 85% detection accuracy across three datasets, three attack types, and two victim models. Gradient-based ranking significantly outperforms attention, hybrid, and random selection approaches, with identification quality strongly correlating with detection performance (ρ=0.65).

Conclusion: GPS provides a practical, attack-agnostic solution for adversarial text detection that generalizes to unseen datasets, attacks, and models without requiring retraining, offering competitive performance at lower computational cost.

Abstract: Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining, leaving a gap for attack-agnostic detection. We introduce Guided Perturbation Sensitivity (GPS), a detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. GPS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, GPS achieves over 85% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we demonstrate that gradient-based ranking significantly outperforms attention, hybrid, and random selection approaches, with identification quality strongly correlating with detection performance for word-level attacks ($ρ= 0.65$). GPS generalizes to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.

[615] Latent Adversarial Regularization for Offline Preference Optimization

Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu, Andreas Haupt, Nancy Amato, Sanmi Koyejo

Main category: cs.LG

TL;DR: GANPO introduces latent-space regularization for language model preference optimization using adversarial training to minimize divergence between policy and reference model representations, improving robustness under distributional shift.

DetailsMotivation: Preference optimization for language models is challenging because token-space similarity doesn't guarantee semantic or behavioral similarity. Existing methods rely on token-level regularization which may not capture deeper semantic relationships.

Method: GANPO uses adversarial training inspired by GANs to regularize latent-space divergence between policy and reference models. It integrates this as a regularizer into existing offline preference optimization objectives, penalizing divergence in internal representations rather than token outputs.

Result: Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.

Conclusion: Latent-space regularization via adversarial training offers a promising alternative to token-level regularization for language model preference optimization, providing better semantic alignment and robustness to distribution shifts.

Abstract: Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.

[616] Not All Code Is Equal: A Data-Centric Study of Code Complexity and LLM Reasoning

Lukas Twist, Shu Yang, Hanqi Yan, Jingzhi Gong, Di Wang, Helen Yannakoudakis, Jie M. Zhang

Main category: cs.LG

TL;DR: Code structural complexity (control flow, composition) strongly influences LLM reasoning improvement during fine-tuning, with targeted complexity ranges outperforming diverse code in 83% of experiments.

DetailsMotivation: While code exposure is known to enhance LLM reasoning, it's unclear which specific properties of code contribute to this improvement. The paper investigates whether structural complexity of code (capturing control flow and compositional structure) shapes how models internalize multi-step reasoning during fine-tuning.

Method: Examines two settings: solution-driven complexity (varying complexity across solutions to same problem) and problem-driven complexity (complexity reflecting task variation). Uses cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets. Evaluates open-weight LLMs on diverse reasoning benchmarks.
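
For a feel of the data-curation step, the sketch below computes rough, dependency-free proxies for the two measures above (cyclomatic complexity as 1 + number of branching nodes, logical lines of code as the number of statement nodes) and filters a code sample by a target range. The paper uses standard tooling; these counts and the ranges are only illustrative.

```python
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try,
                ast.BoolOp, ast.ExceptHandler)

def complexity_profile(source: str):
    tree = ast.parse(source)
    cyclomatic = 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))
    lloc = sum(isinstance(n, ast.stmt) for n in ast.walk(tree))
    return cyclomatic, lloc

def in_target_range(source, cc_range=(2, 6), lloc_range=(5, 30)):
    cc, lloc = complexity_profile(source)
    return cc_range[0] <= cc <= cc_range[1] and lloc_range[0] <= lloc <= lloc_range[1]

sample = """
def classify(xs):
    out = []
    for x in xs:
        if x > 0:
            out.append("pos")
        else:
            out.append("neg")
    return out
"""
print(complexity_profile(sample), in_target_range(sample))
```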

Result: Code can improve reasoning, but structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to specific structural complexity ranges outperforms training on structurally diverse code.

Conclusion: Structural complexity of code is a key factor in enhancing LLM reasoning, suggesting a data-centric approach (curating code by complexity) can improve reasoning beyond simply scaling model size or data quantity.

Abstract: Large Language Models (LLMs) increasingly exhibit strong reasoning abilities, often attributed to their capacity to generate chain-of-thought-style intermediate reasoning. Recent work suggests that exposure to code can further enhance these skills, but existing studies largely treat code as a generic training signal, leaving open the question of which properties of code actually contribute to improved reasoning. To address this gap, we study the structural complexity of code, which captures control flow and compositional structure that may shape how models internalise multi-step reasoning during fine-tuning. We examine two complementary settings: solution-driven complexity, where complexity varies across multiple solutions to the same problem, and problem-driven complexity, where complexity reflects variation in the underlying tasks. Using cyclomatic complexity and logical lines of code to construct controlled fine-tuning datasets, we evaluate a range of open-weight LLMs on diverse reasoning benchmarks. Our findings show that although code can improve reasoning, structural properties strongly determine its usefulness. In 83% of experiments, restricting fine-tuning data to a specific structural complexity range outperforms training on structurally diverse code, pointing to a data-centric path for improving reasoning beyond scaling.

[617] A Low-Complexity Plug-and-Play Deep Learning Model for Generalizable Massive MIMO Precoding

Ali Hasanzadeh Karkan, Ahmed Ibrahim, Jean-François Frigon, François Leduc-Primeau

Main category: cs.LG

TL;DR: A plug-and-play deep learning framework for mMIMO precoding that can be reused across sites, power levels, and channel conditions without retraining from scratch, achieving computational efficiency and robustness.

DetailsMotivation: Existing mMIMO precoding solutions face challenges: traditional algorithms like WMMSE are computationally expensive and sensitive to SNR/channel conditions, while DL-based approaches lack robustness and require site-specific retraining.

Method: PaPP uses a teacher-student framework with self-supervised loss balancing teacher imitation and normalized sum-rate, trained via meta-learning domain generalization and transmit-power-aware input normalization for both FDP and HBF architectures.
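
The normalized sum-rate term in the loss builds on the standard downlink sum-rate computation, sketched below for a fully digital precoder. The channel, precoder (a matched filter here), and noise power are random stand-ins, not the paper's setup.

```python
import numpy as np

def sum_rate(H, W, noise_power=1.0):
    """H: (users, antennas) channel rows h_k; W: (antennas, users) precoders w_j."""
    G = H @ W                                  # G[k, j] = h_k @ w_j
    signal = np.abs(np.diag(G)) ** 2
    interference = (np.abs(G) ** 2).sum(axis=1) - signal
    sinr = signal / (interference + noise_power)
    return float(np.log2(1.0 + sinr).sum())

rng = np.random.default_rng(0)
users, antennas, power = 4, 16, 10.0
H = (rng.standard_normal((users, antennas))
     + 1j * rng.standard_normal((users, antennas))) / np.sqrt(2)
W = H.conj().T                                  # matched-filter precoder
W *= np.sqrt(power) / np.linalg.norm(W)         # transmit-power constraint
print(sum_rate(H, W))
```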

Result: PaPP outperforms conventional and DL baselines on unseen sites after minimal fine-tuning, achieves >21× reduction in computation energy, and maintains good performance under channel-estimation errors.

Conclusion: PaPP provides a practical, energy-efficient solution for mMIMO precoding that is reusable across deployment scenarios without extensive retraining.

Abstract: Massive multiple-input multiple-output (mMIMO) downlink precoding offers high spectral efficiency but remains challenging to deploy in practice because near-optimal algorithms such as the weighted minimum mean squared error (WMMSE) are computationally expensive, and sensitive to SNR and channel-estimation quality, while existing deep learning (DL)-based solutions often lack robustness and require retraining for each deployment site. This paper proposes a plug-and-play precoder (PaPP), a DL framework with a backbone that can be trained for either fully digital (FDP) or hybrid beamforming (HBF) precoding and reused across sites, transmit-power levels, and with varying amounts of channel estimation error, avoiding the need to train a new model from scratch at each deployment. PaPP combines a high-capacity teacher and a compact student with a self-supervised loss that balances teacher imitation and normalized sum-rate, trained using meta-learning domain-generalization and transmit-power-aware input normalization. Numerical results on ray-tracing data from three unseen sites show that the PaPP FDP and HBF models both outperform conventional and deep learning baselines, after fine-tuning with a small set of local unlabeled samples. Across both architectures, PaPP achieves more than 21$\times$ reduction in modeled computation energy and maintains good performance under channel-estimation errors, making it a practical solution for energy-efficient mMIMO precoding.

[618] Value-Based Pre-Training with Downstream Feedback

Shuqi Ke, Giulia Fanti

Main category: cs.LG

TL;DR: V-Pretraining: A value-based method for controlled continued pretraining that uses a lightweight task designer to steer self-supervised pretraining toward downstream capabilities using minimal verified goal information.

DetailsMotivation: Standard pretraining optimizes fixed proxy objectives (like next-token prediction) that may misallocate compute away from downstream capabilities of interest. The authors want to steer expensive foundation model pretraining using minimal verified goal information.

Method: V-Pretraining uses a value-based, modality-agnostic approach where a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For SSL with sample augmentation, it selects pretraining tasks (augmentations) where the pretraining loss gradient aligns with gradients computed over downstream tasks. The pretrained model is never updated on downstream labels - they only shape the pretraining task.
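
The alignment signal behind task selection can be sketched directly: score each candidate augmentation by the cosine similarity between the gradient of its pretraining loss and the gradient of a small downstream loss, and keep the best-aligned one. The model, losses, and augmentations below are toy stand-ins for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)

def flat_grad(loss):
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def pretrain_loss(x):                    # denoising-style proxy objective
    return ((model(x + 0.1 * torch.randn_like(x)) - x) ** 2).mean()

def downstream_loss(x, y):               # tiny "verified" feedback batch
    return ((model(x).sum(dim=-1) - y) ** 2).mean()

x = torch.randn(8, 16)
x_down, y_down = torch.randn(4, 16), torch.randn(4)
g_down = flat_grad(downstream_loss(x_down, y_down))

augmentations = {"identity": lambda t: t,
                 "noise": lambda t: t + 0.5 * torch.randn_like(t),
                 "mask": lambda t: t * (torch.rand_like(t) > 0.3)}

scores = {}
for name, aug in augmentations.items():
    g_pre = flat_grad(pretrain_loss(aug(x)))
    scores[name] = float(torch.cosine_similarity(g_pre, g_down, dim=0))

print(max(scores, key=scores.get), scores)   # pick the best-aligned task
```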

Result: For 0.5B-7B language models, V-Pretraining improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, it improves state-of-the-art results on ADE20K by up to 1.07 mIoU, reduces NYUv2 RMSE while improving ImageNet linear accuracy, and shows pilot evidence of improved token efficiency.

Conclusion: V-Pretraining demonstrates that minimal verified goal information can effectively steer expensive self-supervised pretraining of foundation models toward relevant downstream capabilities without directly training on downstream labels.

Abstract: Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B–7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.

[619] Breaking the Regional Barrier: Inductive Semantic Topology Learning for Worldwide Air Quality Forecasting

Zhiqing Cui, Siru Zhong, Ming Jin, Shirui Pan, Qingsong Wen, Yuxuan Liang

Main category: cs.LG

TL;DR: OmniAir is a semantic topology learning framework for global station-level air quality forecasting that addresses spatial heterogeneity and poor generalization by encoding physical environmental attributes into station identities and constructing adaptive sparse topologies.

DetailsMotivation: Existing air quality forecasting models struggle with extreme spatial heterogeneity and poor generalization to unseen regions, particularly in global contexts with unevenly distributed monitoring stations.

Method: Proposes OmniAir framework that encodes invariant physical environmental attributes into generalizable station identities and dynamically constructs adaptive sparse topologies to capture long-range non-Euclidean correlations and physical diffusion patterns across global networks.
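
As a rough illustration of the "attribute-derived identities plus adaptive sparse topology" idea (not the OmniAir architecture; the encoder shape and the top-k rule are assumptions), one might build a sparse station graph like this:

```python
# Illustrative sketch only: derive station identities from invariant physical
# attributes and connect each station to its k most similar peers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTopologyBuilder(nn.Module):
    def __init__(self, attr_dim: int, id_dim: int = 64, k: int = 8):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(attr_dim, id_dim), nn.ReLU(),
                                    nn.Linear(id_dim, id_dim))
        self.k = k

    def forward(self, station_attrs: torch.Tensor) -> torch.Tensor:
        """station_attrs: [N, attr_dim] invariant environmental attributes.
        Returns a row-normalized sparse adjacency [N, N]."""
        ids = F.normalize(self.encode(station_attrs), dim=-1)
        sim = ids @ ids.T                                   # cosine similarity
        topk = sim.topk(self.k, dim=-1)                     # keep k neighbors per row
        adj = torch.zeros_like(sim).scatter_(-1, topk.indices, topk.values)
        return torch.softmax(adj.masked_fill(adj == 0, -1e9), dim=-1)
```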

Result: OmniAir achieves state-of-the-art performance against 18 baselines, maintains high efficiency and scalability with speeds nearly 10 times faster than existing models, and effectively bridges the monitoring gap in data-sparse regions.

Conclusion: The semantic topology learning approach with adaptive sparse topologies and generalizable station identities provides an effective solution for global air quality forecasting with strong generalization capabilities.

Abstract: Global air quality forecasting grapples with extreme spatial heterogeneity and the poor generalization of existing transductive models to unseen regions. To tackle this, we propose OmniAir, a semantic topology learning framework tailored for global station-level prediction. By encoding invariant physical environmental attributes into generalizable station identities and dynamically constructing adaptive sparse topologies, our approach effectively captures long-range non-Euclidean correlations and physical diffusion patterns across unevenly distributed global networks. We further curate WorldAir, a massive dataset covering over 7,800 stations worldwide. Extensive experiments show that OmniAir achieves state-of-the-art performance against 18 baselines, maintaining high efficiency and scalability with speeds nearly 10 times faster than existing models, while effectively bridging the monitoring gap in data-sparse regions.

[620] Hardware-Triggered Backdoors

Jonas Möller, Erik Imgrund, Thorsten Eisenhofer, Konrad Rieck

Main category: cs.LG

TL;DR: Hardware-triggered backdoors exploit numerical variations across computing hardware to create attacks where models produce different predictions on different hardware for the same input.

DetailsMotivation: Machine learning models are deployed across diverse hardware that can produce small numerical variations during inference. The authors investigate whether these hardware differences can be exploited as a novel attack vector for creating backdoors in ML models.

Method: The approach shapes the model’s decision function to yield different predictions for the same input on different hardware. This is achieved by moving the decision boundary close to target inputs and refining numerical deviations to flip predictions on selected hardware. The method is empirically demonstrated across common GPU accelerators.

Result: The research shows that hardware-triggered backdoors can be reliably created across common GPU accelerators, revealing a novel attack vector affecting the use of third-party models.

Conclusion: Hardware differences in computing platforms create a previously unexplored attack surface for ML security. The paper investigates defenses to counter this threat and highlights the risks of using third-party models without considering hardware-specific vulnerabilities.

Abstract: Machine learning models are routinely deployed on a wide range of computing hardware. Although such hardware is typically expected to produce identical results, differences in its design can lead to small numerical variations during inference. In this work, we show that these variations can be exploited to create backdoors in machine learning models. The core idea is to shape the model’s decision function such that it yields different predictions for the same input when executed on different hardware. This effect is achieved by locally moving the decision boundary close to a target input and then refining numerical deviations to flip the prediction on selected hardware. We empirically demonstrate that these hardware-triggered backdoors can be created reliably across common GPU accelerators. Our findings reveal a novel attack vector affecting the use of third-party models, and we investigate different defenses to counter this threat.

[621] Optimistic Transfer under Task Shift via Bellman Alignment

Jinhang Chai, Enpei Zhang, Elynn Chen, Yujun Yan

Main category: cs.LG

TL;DR: A framework for online transfer RL that reuses source-task experience via Bellman alignment, correcting for task mismatch through re-weighted targeting and enabling statistically sound source-data reuse with improved regret bounds.

DetailsMotivation: Online RL algorithms face difficulty when transferring knowledge from source tasks because task similarity is defined in terms of rewards/transitions, but RL algorithms operate on Bellman regression targets, causing naive source data reuse to introduce systematic bias and invalidate regret guarantees.

Method: Proposes re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This yields a two-stage RWT Q-learning framework separating variance reduction from bias correction.
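
A toy rendering of the one-step correction (my reading of the summary, not the paper's algorithm): the continuation value is recomputed under the target task and re-weighted by a transition density ratio.

```python
# Toy sketch of re-weighted targeting for a single source transition
# (illustrative; the paper's two-stage RWT Q-learning adds variance/bias splitting).
def rwt_bellman_target(r_target, s_next, density_ratio, v_target, gamma=0.99):
    """r_target: reward of (s, a) under the *target* task.
    s_next: next state observed in the *source* task.
    density_ratio: p_target(s_next | s, a) / p_source(s_next | s, a).
    v_target: callable giving the target task's continuation value at s_next."""
    return r_target + gamma * density_ratio * v_target(s_next)
```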

Result: Under RKHS function approximation, establishes regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling.

Conclusion: Bellman alignment serves as a model-agnostic transfer principle for online RL, enabling effective knowledge transfer from source tasks while maintaining statistical soundness and improved performance guarantees.

Abstract: We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT $Q$-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.

[622] StepShield: When, Not Whether to Intervene on Rogue Agents

Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli

Main category: cs.LG

TL;DR: StepShield is a benchmark for evaluating when AI agent safety violations are detected (not just whether), introducing temporal metrics like Early Intervention Rate to measure detection timing for real-time intervention.

DetailsMotivation: Current agent safety benchmarks only report binary accuracy, failing to distinguish between early detection (enabling intervention) and late detection (only forensic value). This gap prevents measuring the critical timing aspect of safety violations.

Method: Introduces StepShield benchmark with 9,213 code agent trajectories (1,278 annotated training pairs + 7,935 test trajectories) featuring realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. Proposes three temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved.
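
Concretely, the timing metrics reduce to simple bookkeeping over per-trajectory step indices. The sketch below shows one plausible way to compute EIR and the Intervention Gap; the benchmark's exact definitions may differ.

```python
# Assumed (not official) definitions of the temporal metrics, for illustration.
def temporal_metrics(violation_step, detection_step, early_window=0):
    """Both arguments: dicts mapping trajectory id -> step index, with None when
    no violation occurred / nothing was detected."""
    rogue = [t for t, v in violation_step.items() if v is not None]
    detected = [t for t in rogue if detection_step.get(t) is not None]
    # Early Intervention Rate: share of rogue trajectories flagged within
    # `early_window` steps of the violation's onset.
    early = [t for t in detected
             if detection_step[t] - violation_step[t] <= early_window]
    eir = len(early) / max(len(rogue), 1)
    # Intervention Gap: mean delay (in steps) between violation and detection.
    gaps = [detection_step[t] - violation_step[t] for t in detected]
    gap = sum(gaps) / max(len(gaps), 1)
    return eir, gap
```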

Result: LLM-based judge achieves 59% EIR vs static analyzer’s 26% (2.3x gap invisible to standard accuracy metrics). Cascaded HybridGuard detector reduces monitoring costs by 75% with projected $108M cumulative savings over 5 years at enterprise scale.

Conclusion: StepShield shifts evaluation focus from whether to when violations are detected, providing foundation for safer and more economically viable AI agents. Early detection enables real-time intervention and significant cost savings.

Abstract: Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.

[623] LoRIF: Low-Rank Influence Functions for Scalable Training Data Attribution

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Main category: cs.LG

TL;DR: LoRIF introduces low-rank influence functions to scale gradient-based training data attribution, reducing storage and computation bottlenecks while maintaining attribution quality.

DetailsMotivation: Current gradient-based TDA methods face scalability challenges with large training sets: (i) storage/loading of projected gradients causes I/O bottlenecks, and (ii) forming inverse Hessian approximations requires O(D²) memory. Increasing projection dimension D improves quality but worsens scalability.

Method: LoRIF exploits low-rank structures of gradients: (1) stores rank-c factors of projected per-example gradients instead of full matrices, reducing storage from O(D) to O(c√D); (2) uses truncated SVD with Woodbury identity to approximate Hessian in r-dimensional subspace, reducing memory from O(D²) to O(Dr).
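
The Hessian-side trick is standard enough to sketch. Below is a small, hedged illustration of applying an inverse Gauss-Newton/Hessian approximation through a truncated SVD plus the Woodbury identity; it is not the LoRIF code, and the dense SVD here is only for clarity.

```python
# Illustrative inverse-Hessian-vector product via truncated SVD + Woodbury.
import numpy as np

def low_rank_ihvp(G, v, lam=1e-3, rank=32):
    """G: [n, D] matrix of projected per-example gradients; v: [D] query gradient.
    Returns an approximation of (lam*I + G^T G / n)^{-1} v using only O(D*rank)
    memory instead of forming the D x D matrix."""
    n = G.shape[0]
    # G ~= U_r diag(s_r) V_r^T  =>  G^T G / n ~= V_r diag(s_r^2 / n) V_r^T
    _, s, Vt = np.linalg.svd(G, full_matrices=False)
    Vr = Vt[:rank].T                      # [D, r] top right-singular vectors
    d = (s[:rank] ** 2) / n               # [r] eigenvalue estimates of G^T G / n
    shrink = d / (d + lam)                # Woodbury coefficients per retained direction
    return (v - Vr @ (shrink * (Vr.T @ v))) / lam
```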

Result: On models from 0.1B to 70B parameters trained on million-example datasets, LoRIF achieves up to 20× storage reduction and query-time speedup compared to LoGRA, while matching or exceeding attribution quality.

Conclusion: LoRIF makes gradient-based training data attribution practical at frontier scale by addressing both storage and computational bottlenecks through low-rank approximations.

Abstract: Training data attribution (TDA) identifies which training examples most influenced a model’s prediction. The best-performing TDA methods exploit gradients to define an influence function. To overcome the scalability challenge arising from gradient computation, the most popular strategy is random projection (e.g., TRAK, LoGRA). However, this still faces two bottlenecks when scaling to large training sets and high-quality attribution: \emph{(i)} storing and loading projected per-example gradients for all $N$ training examples, where query latency is dominated by I/O; and \emph{(ii)} forming the $D \times D$ inverse Hessian approximation, which costs $O(D^2)$ memory. Both bottlenecks scale with the projection dimension $D$, yet increasing $D$ is necessary for attribution quality, creating a quality-scalability tradeoff. We introduce \textbf{LoRIF (Low-Rank Influence Functions)}, which exploits low-rank structures of gradients to address both bottlenecks. First, we store rank-$c$ factors of the projected per-example gradients rather than full matrices, reducing storage and query-time I/O from $O(D)$ to $O(c\sqrt{D})$ per layer per sample. Second, we use truncated SVD with the Woodbury identity to approximate the Hessian term in an $r$-dimensional subspace, reducing memory from $O(D^2)$ to $O(Dr)$. On models from 0.1B to 70B parameters trained on datasets with millions of examples, LoRIF achieves up to 20$\times$ storage reduction and query-time speedup compared to LoGRA, while matching or exceeding its attribution quality. LoRIF makes gradient-based TDA practical at frontier scale.

[624] The Illusion of Certainty: Uncertainty Quantification for LLMs Fails under Ambiguity

Tim Tomov, Dominik Fuchsgruber, Tom Wollschläger, Stephan Günnemann

Main category: cs.LG

TL;DR: Current LLM uncertainty quantification methods fail on ambiguous language data, despite performing well on unambiguous tasks, revealing a fundamental limitation in existing approaches.

DetailsMotivation: Real-world language is inherently ambiguous (aleatoric uncertainty), but existing uncertainty quantification methods for LLMs are only benchmarked on unambiguous tasks, creating a gap between research and practical deployment needs.

Method: Introduces MAQA* and AmbigQA* - the first ambiguous question-answering datasets with ground-truth answer distributions estimated from factual co-occurrence. Tests three uncertainty estimation paradigms: predictive distribution, internal representations, and model ensembles.

Result: Current uncertainty estimators degrade to close-to-random performance on ambiguous data, despite working well on unambiguous tasks. This deterioration is consistent across all three estimation paradigms. Theoretical analysis shows predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.

Conclusion: Reveals a critical shortcoming in current LLM uncertainty quantification methods and motivates rethinking of modeling paradigms to handle real-world language ambiguity.

Abstract: Accurate uncertainty quantification (UQ) in Large Language Models (LLMs) is critical for trustworthy deployment. While real-world language is inherently ambiguous, reflecting aleatoric uncertainty, existing UQ methods are typically benchmarked against tasks with no ambiguity. In this work, we demonstrate that while current uncertainty estimators perform well under the restrictive assumption of no ambiguity, they degrade to close-to-random performance on ambiguous data. To this end, we introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions estimated from factual co-occurrence. We find this performance deterioration to be consistent across different estimation paradigms: using the predictive distribution itself, internal representations throughout the model, and an ensemble of models. We show that this phenomenon can be theoretically explained, revealing that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity. Overall, our study reveals a key shortcoming of current UQ methods for LLMs and motivates a rethinking of current modeling paradigms.

[625] PRISM: Distribution-free Adaptive Computation of Matrix Functions for Accelerating Neural Network Training

Shenghao Yang, Zhichao Wang, Oleg Balabanov, N. Benjamin Erichson, Michael W. Mahoney

Main category: cs.LG

TL;DR: PRISM framework accelerates iterative matrix function computations (square roots, inverse roots, orthogonalization) for neural network training by combining adaptive polynomial approximation with randomized sketching, eliminating need for explicit spectral bounds.

DetailsMotivation: Matrix functions like square root and orthogonalization are crucial for preconditioned gradient methods in neural network training. Existing iterative algorithms avoid expensive eigendecompositions but still need improvements in efficiency and adaptability to varying spectral properties.

Method: PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via sketched least-squares problems, adapting automatically to the evolving spectrum without requiring explicit spectral bounds or singular value estimates.
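
The "fit a polynomial to a sketched estimate of the spectrum, then apply it with matrix multiplies" pattern can be pictured with a toy inverse-square-root routine. This is only a loose illustration of the underlying idea, not the PRISM algorithm; the plain polyfit/QR steps stand in for the paper's sketched least-squares machinery.

```python
# Toy sketch: fit a polynomial surrogate to sketched spectral samples, then
# evaluate it on the matrix using only matrix multiplications (Horner's rule).
import numpy as np

def approx_matrix_inv_sqrt(A, degree=4, sketch_size=20, seed=0):
    """A: symmetric positive definite [D, D]. Returns a polynomial approximation
    of A^{-1/2}; accuracy depends on the spectrum and the chosen degree."""
    rng = np.random.default_rng(seed)
    D = A.shape[0]
    Omega = rng.standard_normal((D, sketch_size))
    Q, _ = np.linalg.qr(A @ Omega)                    # randomized range sketch
    ritz = np.clip(np.linalg.eigvalsh(Q.T @ A @ Q), 1e-12, None)
    coeffs = np.polyfit(ritz, 1.0 / np.sqrt(ritz), degree)   # least-squares fit
    P = coeffs[0] * np.eye(D)
    for c in coeffs[1:]:                              # Horner evaluation of p(A)
        P = P @ A + c * np.eye(D)
    return P
```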

Result: PRISM accelerates Newton-Schulz-like iterations for matrix square roots and orthogonalization, and empirically accelerates training when integrated into Shampoo and Muon optimizers for neural networks.

Conclusion: PRISM provides an efficient, adaptive framework for accelerating matrix function computations in neural network training, particularly beneficial for preconditioned gradient methods on modern GPU accelerators.

Abstract: Matrix functions such as square root, inverse roots, and orthogonalization play a central role in preconditioned gradient methods for neural network training. This has motivated the development of iterative algorithms that avoid explicit eigendecompositions and rely primarily on matrix multiplications, making them well suited for modern GPU accelerators. We present PRISM (Polynomial-fitting and Randomized Iterative Sketching for Matrix functions computation), a general framework for accelerating iterative algorithms for computing matrix functions. PRISM combines adaptive polynomial approximation with randomized sketching: at each iteration, it fits a polynomial surrogate to the current spectrum via a sketched least-squares problem, adapting to the instance at hand with minimal overhead. We apply PRISM to accelerate Newton-Schulz-like iterations for matrix square roots and orthogonalization, which are core primitives in machine learning. Unlike prior methods, PRISM requires no explicit spectral bounds or singular value estimates; and it adapts automatically to the evolving spectrum. Empirically, PRISM accelerates training when integrated into Shampoo and Muon optimizers.

[626] Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models

Ahmad Aghapour, Erhan Bayraktar, Ziqing Zhang

Main category: cs.LG

TL;DR: The paper proposes an information-theoretic approach for dimension-free convergence analysis of diffusion models and introduces a Loss-Adaptive Schedule (LAS) for efficient discretization of reverse SDEs.

DetailsMotivation: Existing convergence analyses of diffusion models typically scale linearly with ambient dimension and require restrictive geometric assumptions about target distributions. The authors aim to develop dimension-free convergence bounds without geometric restrictions and create practical adaptive schedules for sampling.

Method: Develops an information-theoretic approach using KL divergence bounds based on Shannon entropy, proposes Loss-Adaptive Schedule (LAS) that adapts discretization based on training loss without heavy post-training computation.
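
One simple way to realize a loss-adaptive discretization (a guess at the mechanism from the summary; the paper derives its schedule from a KL reformulation) is to place step boundaries at equal quantiles of the accumulated training loss over diffusion time:

```python
# Hedged sketch: spend more reverse-SDE steps where the training loss is large.
import numpy as np

def loss_adaptive_schedule(t_grid, per_time_loss, num_steps):
    """t_grid: increasing diffusion times, shape [M]; per_time_loss: estimated
    training loss at each time, shape [M]. Returns num_steps+1 step boundaries
    so that roughly equal loss mass falls between consecutive boundaries."""
    w = np.clip(np.asarray(per_time_loss, dtype=float), 1e-12, None)
    cum = np.cumsum(w)
    cum = cum / cum[-1]
    inner = np.interp(np.linspace(0, 1, num_steps + 1)[1:-1], cum, t_grid)
    return np.concatenate(([t_grid[0]], inner, [t_grid[-1]]))
```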

Result: Achieves dimension-free convergence bound of O(H²/K) where H is Shannon entropy and K is number of sampling steps. LAS empirically improves sampling quality over common heuristic schedules.

Conclusion: The information-theoretic framework provides dimension-free convergence analysis without geometric assumptions, and LAS offers a practical, lightweight approach for adaptive discretization that improves sampling quality.

Abstract: Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score (or denoiser). Existing convergence analyses of diffusion models typically scale at least linearly with the ambient dimension, and sharper rates often depend on intrinsic-dimension assumptions or other geometric restrictions on the target distribution. We develop an alternative, information-theoretic approach to dimension-free convergence that avoids any geometric assumptions. Under mild assumptions on the target distribution, we bound the KL divergence between the target and generated distributions by $O(H^2/K)$ (up to endpoint factors), where $H$ is the Shannon entropy and $K$ is the number of sampling steps. Moreover, using a reformulation of the KL divergence, we propose a Loss-Adaptive Schedule (LAS) for efficient discretization of the reverse SDE, which is lightweight and relies only on the training loss, requiring no heavy post-training computation. Empirically, LAS improves sampling quality over common heuristic schedules.

[627] The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

Main category: cs.LG

TL;DR: Tokenizer transplant in model composition creates a supply-chain vulnerability where a single engineered “breaker token” appears inert in donor models but becomes malicious after transplant, sabotaging base model generation while evading detection.

DetailsMotivation: As open-weight language models increasingly rely on model composition techniques (weight merging, speculative decoding, vocabulary expansion), tokenizer transplant becomes essential for aligning incompatible vocabularies. The authors identify this interoperability step as a critical vulnerability in the AI supply chain that can be exploited.

Method: The attack engineers a single “breaker token” that appears functionally inert in donor models but reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, the attack sabotages base model generation while leaving donor utility statistically indistinguishable from nominal behavior. The authors formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver.

Result: The attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging. The vulnerability highlights hidden risks in the pipeline of modular AI composition.

Conclusion: Tokenizer transplant, a critical prerequisite for model composition techniques, introduces a supply-chain vulnerability that can be exploited through engineered breaker tokens. This attack persists through fine-tuning and weight merging, revealing security risks in modular AI composition pipelines.

Abstract: The open-weight language model ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single breaker token that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack sabotages the base model’s generation while leaving the donor’s utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack using a sparse solver. Empirically, the attack is training-free and evades outlier detection, while demonstrating structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge

[628] Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models

Konstantinos P. Panousis, Diego Marcos

Main category: cs.LG

TL;DR: This paper investigates interpretability in Vision-Language Models, focusing on Concept Bottleneck Models and introducing a clarity measure to evaluate the trade-off between performance and concept representation quality.

DetailsMotivation: The widespread adoption of Vision-Language Models has raised concerns about their black-box nature and lack of interpretability. Despite various interpretability methods, systematic evaluation of learned representations remains limited, especially for sparsity-aware methods that claim to "induce interpretability."

Method: The paper focuses on Concept Bottleneck Models and introduces a clarity measure capturing the interplay between downstream performance, sparsity, and precision of concept representation. It proposes an interpretability assessment framework using datasets with ground truth concept annotations. The study examines both VLM- and attribute predictor-based CBMs with three sparsity-inducing strategies: per-example ℓ₁, ℓ₀, and Bernoulli-based formulations.

Result: Experiments reveal a critical trade-off between flexibility and interpretability, where different methods can exhibit markedly different behaviors even at comparable performance levels. The clarity measure helps quantify this trade-off.

Conclusion: The work provides a systematic framework for evaluating interpretability in Vision-Language Models, particularly for Concept Bottleneck Models, highlighting the importance of balancing performance with interpretability through proper sparsity-inducing strategies.

Abstract: The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black-boxes, with limited or non-existent investigation of their decision-making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to “induce interpretability”. In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between the downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground truth concept annotations. We consider both VLM- and attribute predictor-based CBMs, and three different sparsity-inducing strategies: per-example $\ell_1$, $\ell_0$, and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.

[629] Dependence of Equilibrium Propagation Training Success on Network Architecture

Qingshan Wang, Clara C. Wanjura, Florian Marquardt

Main category: cs.LG

TL;DR: Physics-based training of locally connected lattice networks using equilibrium propagation achieves performance comparable to dense networks, offering energy-efficient neuromorphic computing alternatives.

DetailsMotivation: Addressing unsustainable AI energy consumption by exploring neuromorphic computing and physics-based training alternatives to digital neural networks, focusing on more realistic architectures that account for experimental connectivity constraints.

Method: Uses equilibrium propagation training method on locally connected lattice architectures (XY model), tracking evolution of spatially distributed responses and couplings during training on various benchmark tasks.

Result: Sparse networks with only local connections can achieve performance comparable to dense networks, with architectural choices influencing training dynamics and performance.

Conclusion: Provides guidelines for scaling up equilibrium propagation architectures in realistic settings, demonstrating viability of sparse local connectivity for energy-efficient neuromorphic computing.

Abstract: The rapid rise of artificial intelligence has led to an unsustainable growth in energy consumption. This has motivated progress in neuromorphic computing and physics-based training of learning machines as alternatives to digital neural networks. Many theoretical studies focus on simple architectures like all-to-all or densely connected layered networks. However, these may be challenging to realize experimentally, e.g. due to connectivity constraints. In this work, we investigate the performance of the widespread physics-based training method of equilibrium propagation for more realistic architectural choices, specifically, locally connected lattices. We train an XY model and explore the influence of architecture on various benchmark tasks, tracking the evolution of spatially distributed responses and couplings during training. Our results show that sparse networks with only local connections can achieve performance comparable to dense networks. Our findings provide guidelines for further scaling up architectures based on equilibrium propagation in realistic settings.

[630] Embracing Aleatoric Uncertainty in Medical Multimodal Learning with Missing Modalities

Linxiao Gong, Yang Liu, Lianlong Sun, Yulai Bi, Jing Liu, Xiaoguang Zhu

Main category: cs.LG

TL;DR: AUM framework uses aleatoric uncertainty modeling to handle missing medical modalities by quantifying modality reliability and dynamically emphasizing more certain information through graph-based message passing.

DetailsMotivation: Medical multimodal learning faces challenges with missing modalities in clinical practice. Existing approaches assume equal modality contribution and random missing patterns, neglecting inherent uncertainty in medical data acquisition.

Method: Proposes Aleatoric Uncertainty Modeling (AUM) that models each unimodal representation as multivariate Gaussian distribution to capture aleatoric uncertainty. Uses dynamic message-passing mechanism within bipartite patient-modality graph with uncertainty-aware aggregation to adaptively emphasize reliable information.
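
The uncertainty-aware aggregation step can be pictured as precision-weighted fusion: modalities that report lower aleatoric variance contribute more, and missing modalities simply drop out of the sum. A minimal sketch of that intuition follows (my simplification; the paper uses graph message passing rather than a single weighted average).

```python
# Minimal precision-weighted fusion over available modalities (illustration only).
import torch

def fuse_modalities(mus, logvars):
    """mus / logvars: dicts modality -> tensor [B, d]; omit missing modalities.
    Each modality is treated as a diagonal Gaussian; weights are inverse variances."""
    precisions = {m: torch.exp(-logvars[m]) for m in mus}       # 1 / sigma^2
    total = sum(precisions.values())
    fused_mean = sum(precisions[m] * mus[m] for m in mus) / total
    fused_var = 1.0 / total                                      # fused uncertainty
    return fused_mean, fused_var
```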

Result: AUM achieves 2.26% AUC-ROC improvement on MIMIC-IV mortality prediction and 2.17% gain on eICU, outperforming existing state-of-the-art approaches.

Conclusion: The framework effectively handles missing modalities in medical multimodal learning by explicitly modeling aleatoric uncertainty and dynamically aggregating reliable information through uncertainty-aware mechanisms.

Abstract: Medical multimodal learning faces significant challenges with missing modalities prevalent in clinical practice. Existing approaches assume equal contribution of each modality and random missing patterns, neglecting inherent uncertainty in medical data acquisition. In this regard, we propose Aleatoric Uncertainty Modeling (AUM), which explicitly quantifies unimodal aleatoric uncertainty to address missing modalities. Specifically, AUM models each unimodal representation as a multivariate Gaussian distribution to capture aleatoric uncertainty and enable principled modality reliability quantification. To adaptively aggregate captured information, we develop a dynamic message-passing mechanism within a bipartite patient-modality graph using an uncertainty-aware aggregation mechanism. Through this process, missing modalities are naturally accommodated, while more reliable information from available modalities is dynamically emphasized to guide representation generation. Our AUM framework achieves an improvement of 2.26% AUC-ROC on MIMIC-IV mortality prediction and a 2.17% gain on eICU, outperforming existing state-of-the-art approaches.

[631] Uncertainty-Aware Data-Based Method for Fast and Reliable Shape Optimization

Yunjia Yang, Runze Li, Yufei Zhang, Haixin Chen

Main category: cs.LG

TL;DR: Proposes an uncertainty-aware data-based optimization framework for aerodynamic shape optimization that integrates uncertainty quantification to improve robustness against prediction errors from surrogate models.

DetailsMotivation: Data-based optimization relies on surrogate models that can produce significant prediction errors for samples outside the training distribution, potentially misleading the optimization process. Uncertainty quantification is needed to detect outliers and enhance robustness.

Method: Develops a probabilistic encoder-decoder surrogate model to predict uncertainties associated with its outputs, then integrates these uncertainties into a model-confidence-aware objective function that penalizes samples with large prediction errors during optimization.
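
The confidence-aware objective amounts to penalizing candidates the surrogate is unsure about. A minimal sketch follows; the names and the linear penalty are assumptions, and the paper's exact objective may differ.

```python
# Uncertainty-penalized surrogate objective for shape optimization (sketch).
def confidence_aware_objective(shape_params, surrogate, lam=1.0):
    """surrogate(shape_params) -> (predicted_drag, predicted_std).
    Minimizing this favors low predicted drag *and* low surrogate uncertainty,
    steering the optimizer away from out-of-distribution candidate shapes."""
    drag, std = surrogate(shape_params)
    return drag + lam * std
```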

Result: UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to original DBO. It offers comparable optimization effectiveness to full computational simulations while significantly accelerating optimization speed.

Conclusion: The uncertainty-aware framework improves the robustness and reliability of data-based optimization for aerodynamic shape design by effectively managing surrogate model uncertainties.

Abstract: Data-based optimization (DBO) offers a promising approach for efficiently optimizing shape for better aerodynamic performance by leveraging a pretrained surrogate model for offline evaluations during iterations. However, DBO heavily relies on the quality of the training database. Samples outside the training distribution encountered during optimization can lead to significant prediction errors, potentially misleading the optimization process. Therefore, incorporating uncertainty quantification into optimization is critical for detecting outliers and enhancing robustness. This study proposes an uncertainty-aware data-based optimization (UA-DBO) framework to monitor and minimize surrogate model uncertainty during DBO. A probabilistic encoder-decoder surrogate model is developed to predict uncertainties associated with its outputs, and these uncertainties are integrated into a model-confidence-aware objective function to penalize samples with large prediction errors during data-based optimization process. The UA-DBO framework is evaluated on two multipoint optimization problems aimed at improving airfoil drag divergence and buffet performance. Results demonstrate that UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to original DBO. Moreover, compared to multipoint optimization based on full computational simulations, UA-DBO offers comparable optimization effectiveness while significantly accelerating optimization speed.

[632] From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation

Qianwei Yang, Dong Xu, Zhangfan Yang, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji

Main category: cs.LG

TL;DR: SoftMol is a unified framework for target-aware molecular generation that introduces soft fragments representation and block-diffusion modeling to overcome limitations of existing GPT-based molecular language models.

DetailsMotivation: Existing molecular language models inadequately capture graph-structured nature of molecules and lack explicit mechanisms for target-aware generation, limiting their effectiveness in drug discovery applications.

Method: Introduces soft fragments (rule-free block representation of SMILES), develops SoftBD (first block-diffusion molecular language model combining local bidirectional diffusion with autoregressive generation), trains on ZINC-Curated dataset, and integrates gated Monte Carlo tree search for target-aware fragment assembly.

Result: Achieves 100% chemical validity, improves binding affinity by 9.7%, yields 2-3x increase in molecular diversity, and delivers 6.6x speedup in inference efficiency compared to state-of-the-art models.

Conclusion: SoftMol provides an effective framework for target-aware molecular generation that addresses fundamental limitations of existing approaches through novel representation, modeling, and search strategies.

Abstract: Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLM) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To favor generated molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol

[633] Bridging Graph Structure and Knowledge-Guided Editing for Interpretable Temporal Knowledge Graph Reasoning

Shiqi Fan, Quanming Yao, Hongyi Nie, Wentao Ma, Zhen Wang, Wen Hua

Main category: cs.LG

TL;DR: IGETR is a hybrid framework combining GNNs and LLMs for temporal knowledge graph reasoning, using GNNs for structural-temporal path extraction and LLMs for semantic refinement to improve accuracy and reduce hallucinations.

DetailsMotivation: Existing LLM-based temporal knowledge graph reasoning methods prioritize contextual over structural relations, struggling with dynamic graphs and leading to unstructured, hallucination-prone inferences with temporal inconsistencies.

Method: Three-stage pipeline: 1) Temporal GNN identifies structurally/temporally coherent candidate paths, 2) LLM-guided path editing addresses logical/semantic inconsistencies, 3) Integration of refined paths for interpretable predictions.

Result: State-of-the-art performance on TKG benchmarks with relative improvements up to 5.6% on Hits@1 and 8.1% on Hits@3 on challenging ICEWS datasets; ablation studies confirm effectiveness of each component.

Conclusion: IGETR effectively combines GNNs’ structured temporal modeling with LLMs’ contextual understanding to improve temporal knowledge graph reasoning accuracy while reducing hallucinations through grounded structural evidence.

Abstract: Temporal knowledge graph reasoning (TKGR) aims to predict future events by inferring missing entities with dynamic knowledge structures. Existing LLM-based reasoning methods prioritize contextual over structural relations, struggling to extract relevant subgraphs from dynamic graphs. This limits structural information understanding, leading to unstructured, hallucination-prone inferences, especially with temporal inconsistencies. To address this problem, we propose IGETR (Integration of Graph and Editing-enhanced Temporal Reasoning), a hybrid reasoning framework that combines the structured temporal modeling capabilities of Graph Neural Networks (GNNs) with the contextual understanding of LLMs. IGETR operates through a three-stage pipeline. The first stage aims to ground the reasoning process in the actual data by identifying structurally and temporally coherent candidate paths through a temporal GNN, ensuring that inference starts from reliable graph-based evidence. The second stage introduces LLM-guided path editing to address logical and semantic inconsistencies, leveraging external knowledge to refine and enhance the initial paths. The final stage focuses on integrating the refined reasoning paths to produce predictions that are both accurate and interpretable. Experiments on standard TKG benchmarks show that IGETR achieves state-of-the-art performance, outperforming strong baselines with relative improvements of up to 5.6% on Hits@1 and 8.1% on Hits@3 on the challenging ICEWS datasets. Additionally, ablation studies and further analyses confirm the effectiveness of each component.

[634] Investigation into using stochastic embedding representations for evaluating the trustworthiness of the Fréchet Inception Distance

Ciaran Bench, Vivek Desai, Carlijn Roozemond, Ruben van Engen, Spencer A. Thomas

Main category: cs.LG

TL;DR: Analysis of FID metric reliability for medical images using Monte Carlo dropout to estimate predictive variance and assess out-of-distribution sensitivity.

DetailsMotivation: The Fréchet Inception Distance (FID) is widely used for synthetic image quality assessment but relies on InceptionV3 pretrained on natural images, making it potentially unreliable for medical images. The paper aims to quantify how well FID captures meaningful differences in medical image characteristics.

Method: Uses Monte Carlo dropout to compute predictive variance in FID scores and supplemental variance in feature embeddings. Tests on ImageNet1K validation set with various augmentations and external datasets to measure correlation between predictive variance and out-of-distribution extent.
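
The core procedure is simply FID recomputed under stochastic forward passes. A hedged sketch follows; the feature extractor, dropout placement, and number of passes are placeholders, not the paper's exact setup.

```python
# Sketch: predictive variance of FID via Monte Carlo dropout (illustrative only).
import numpy as np
from scipy import linalg
import torch

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fit to two feature arrays [N, d]."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2 * covmean))

def enable_mc_dropout(model):
    """Put the model in eval mode but keep Dropout layers stochastic."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

def mc_dropout_fid(extractor, real, fake, passes=20):
    """Recompute FID under repeated stochastic passes; return mean and variance."""
    enable_mc_dropout(extractor)
    scores = []
    with torch.no_grad():
        for _ in range(passes):
            fa = extractor(real).cpu().numpy()
            fb = extractor(fake).cpu().numpy()
            scores.append(fid(fa, fb))
    return float(np.mean(scores)), float(np.var(scores))
```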

Result: Predictive variance magnitudes show varying degrees of correlation with how out-of-distribution test inputs are relative to training data, providing insight into FID trustworthiness indicators.

Conclusion: The study provides quantitative evidence about FID reliability limitations for medical images and offers variance-based indicators for assessing metric trustworthiness in out-of-distribution scenarios.

Abstract: Feature embeddings acquired from pretrained models are widely used in medical applications of deep learning to assess the characteristics of datasets; e.g. to determine the quality of synthetic, generated medical images. The Fréchet Inception Distance (FID) is one popular synthetic image quality metric that relies on the assumption that the characteristic features of the data can be detected and encoded by an InceptionV3 model pretrained on ImageNet1K (natural images). While it is widely known that this makes it less effective for applications involving medical images, the extent to which the metric fails to capture meaningful differences in image characteristics is not obviously known. Here, we use Monte Carlo dropout to compute the predictive variance in the FID as well as a supplemental estimate of the predictive variance in the feature embedding model’s latent representations. We show that the magnitudes of the predictive variances considered exhibit varying degrees of correlation with the extent to which test inputs (ImageNet1K validation set augmented at various strengths, and other external datasets) are out-of-distribution relative to its training data, providing some insight into the effectiveness of their use as indicators of the trustworthiness of the FID.

[635] Investigating Batch Inference in a Sequential Monte Carlo Framework for Neural Networks

Andrew Millard, Joshua Murphy, Peter Green, Simon Maskell

Main category: cs.LG

TL;DR: Data annealing in Sequential Monte Carlo for neural network Bayesian inference achieves 6x faster training with minimal accuracy loss on image classification tasks.

DetailsMotivation: Bayesian inference for neural networks is computationally expensive, especially with full-batch sampling methods like SMC. The paper aims to reduce computational cost while maintaining accuracy by gradually introducing mini-batches.

Method: Proposes data annealing techniques for SMC samplers, gradually introducing more mini-batches of data into likelihood and gradient evaluations instead of using full batches.
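
The data-annealing idea is easy to sketch: early SMC iterations evaluate the likelihood on a few mini-batches, later ones on progressively more, until the full data set is used. The ramp below is an assumed linear schedule, not necessarily the one the paper adopts.

```python
# Assumed linear data-annealing schedule for an SMC sampler (illustration only).
import math

def batches_for_iteration(t, total_iters, total_batches, min_batches=1):
    """Number of mini-batches used in the likelihood/gradient at SMC iteration t."""
    frac = (t + 1) / total_iters
    return max(min_batches, math.ceil(frac * total_batches))

# Example: 100 SMC iterations over a dataset split into 50 mini-batches.
schedule = [batches_for_iteration(t, 100, 50) for t in range(100)]
```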

Result: Achieves up to 6x faster training with minimal accuracy loss on benchmark image classification problems using neural networks.

Conclusion: Data annealing in SMC samplers provides significant computational efficiency gains for Bayesian neural network inference while maintaining accuracy.

Abstract: Bayesian inference allows us to define a posterior distribution over the weights of a generic neural network (NN). Exact posteriors are usually intractable, in which case approximations can be employed. One such approximation, variational inference, is computationally efficient when using mini-batch stochastic gradient descent as subsets of the data are used for likelihood and gradient evaluations, though the approach relies on the selection of a variational distribution which sufficiently matches the form of the posterior. Particle-based methods such as Markov chain Monte Carlo and Sequential Monte Carlo (SMC) do not assume a parametric family for the posterior but typically require a higher computational cost. These sampling methods typically use the full batch of data for likelihood and gradient evaluations, which contributes to this computational expense. We explore several methods of gradually introducing more mini-batches of data (data annealing) into likelihood and gradient evaluations of an SMC sampler. We find that we can achieve up to $6\times$ faster training with minimal loss in accuracy on benchmark image classification problems using NNs.

[636] Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Yongwoo Kim, Sungmin Cha, Donghyun Kim

Main category: cs.LG

TL;DR: Comprehensive evaluation of machine unlearning methods using representation-based metrics reveals current approaches either degrade model quality or only modify classifiers, failing to truly remove data from representations.

DetailsMotivation: Existing machine unlearning evaluations focus on logit-based metrics in small-scale scenarios, which may provide a false sense of security. The authors aim to verify whether unlearning approaches truly eliminate targeted data from the model's representation perspective under realistic large-scale conditions.

Method: Conduct comprehensive evaluation using representation-based metrics on large-scale scenarios. Introduce novel evaluation scenario where forgetting classes are semantically similar to downstream task classes, requiring feature representations to diverge significantly from original model.
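
A common representation-level probe in this spirit is linear CKA between features of the original and unlearned models on forget-class inputs; the sketch below is one such metric, not necessarily the one used in the paper.

```python
# Linear CKA between two feature matrices [N, d1] and [N, d2] (illustration).
import numpy as np

def linear_cka(X, Y):
    """High CKA on forget-class inputs suggests the unlearned model's
    representations barely moved, even if its logits changed."""
    X = X - X.mean(0, keepdims=True)
    Y = Y - Y.mean(0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```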

Result: Current state-of-the-art unlearning approaches either completely degrade representational quality or merely modify classifiers, achieving good logit-based performance while maintaining representational similarity to original model.

Conclusion: Proposed benchmark provides standardized protocol for evaluating unlearning algorithms under realistic conditions, revealing limitations of current approaches from representation perspective.

Abstract: Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on the retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches truly eliminate the targeted data from the model’s representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier, thereby achieving superior logit-based performance while maintaining representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation scenario in which the forgetting classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model, thus enabling a more thorough evaluation from a representation perspective. We hope our benchmark will serve as a standardized protocol for evaluating unlearning algorithms under realistic conditions.

[637] PowerGenie: Analytically-Guided Evolutionary Discovery of Superior Reconfigurable Power Converters

Jian Gao, Yiwei Zou, Abhishek Pradhan, Wenhao Huang, Yumin Su, Kaiyuan Yang, Xuan Zhang

Main category: cs.LG

TL;DR: PowerGenie is an AI framework for automated discovery of high-performance reconfigurable power converter topologies using evolutionary finetuning and analytical verification without SPICE simulation.

DetailsMotivation: Discovering superior circuit topologies requires navigating exponentially large design spaces, traditionally done by human experts. Existing AI methods are limited to predefined templates or small-scale generation without rigorous verification, leaving large-scale performance-driven discovery underexplored.

Method: 1) Automated analytical framework to determine converter functionality and theoretical performance limits without component sizing or SPICE simulation. 2) Evolutionary finetuning method that co-evolves a generative model with its training distribution through fitness selection and uniqueness verification.

Result: PowerGenie achieves higher syntax validity, function validity, novelty rate, and figure-of-merit (FoM) than existing methods. It discovers a novel 8-mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode.

Conclusion: PowerGenie enables automated, large-scale discovery of high-performance power converter topologies through a novel combination of analytical verification and evolutionary finetuning, outperforming existing AI methods and discovering novel, superior designs.

Abstract: Discovering superior circuit topologies requires navigating an exponentially large design space, a challenge traditionally reserved for human experts. Existing AI methods either select from predefined templates or generate novel topologies at a limited scale without rigorous verification, leaving large-scale performance-driven discovery underexplored. We present PowerGenie, a framework for automated discovery of higher-performance reconfigurable power converters at scale. PowerGenie introduces: (1) an automated analytical framework that determines converter functionality and theoretical performance limits without component sizing or SPICE simulation, and (2) an evolutionary finetuning method that co-evolves a generative model with its training distribution through fitness selection and uniqueness verification. Unlike existing methods that suffer from mode collapse and overfitting, our approach achieves higher syntax validity, function validity, novelty rate, and figure-of-merit (FoM). PowerGenie discovers a novel 8-mode reconfigurable converter with 23% higher FoM than the best training topology. SPICE simulations confirm average absolute efficiency gains of 10% across 8 modes and up to 17% at a single mode. Code is available at https://github.com/xz-group/PowerGenie.

[638] A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances

Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel

Main category: cs.LG

TL;DR: A comprehensive survey paper that unifies three major approaches to coreset selection (training-free, training-oriented, and label-free) into a single taxonomy, examining subfields like submodular formulations, bilevel optimization, and pseudo-labeling, while comparing methods under computational, robustness, and performance demands.

DetailsMotivation: Existing surveys on coreset selection have been narrow in scope, focusing primarily on either classical geometry-based methods or active learning techniques. There's a need for a more comprehensive view that unifies different approaches and examines overlooked subfields to provide better insights for the research community.

Method: The paper presents a unified taxonomy of coreset selection methods organized into three major categories: training-free approaches (geometry-based methods), training-oriented approaches (active learning and related techniques), and label-free approaches (for unlabeled datasets). It examines subfields including submodular formulations, bilevel optimization, and pseudo-labeling strategies.
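
As a concrete instance of the training-free, geometry-based family the survey covers, k-center greedy selection is the classic example; the sketch below is generic and not tied to any specific paper in the survey.

```python
# k-center greedy coreset selection over an embedding matrix (generic example).
import numpy as np

def k_center_greedy(embeddings, budget, seed=0):
    """Pick `budget` points that greedily minimize the maximum distance from any
    point to its nearest selected center (a classical training-free coreset)."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]
    dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(budget - 1):
        nxt = int(dist.argmax())                      # farthest point so far
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected
```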

Result: The survey provides a comprehensive comparison of coreset selection methods under varying computational, robustness, and performance demands. It offers new insights into how pruning strategies influence generalization and neural scaling laws, which were absent from prior reviews.

Conclusion: The paper highlights open challenges in coreset selection research, including robustness issues, outlier filtering, and the need to adapt coreset selection methods to foundation models. It serves as a comprehensive resource for researchers working on data reduction strategies in machine learning.

Abstract: Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.

[639] Elign: Equivariant Diffusion Model Alignment from Foundational Machine Learning Force Fields

Yunyang Li, Lin Huang, Luojia Xia, Wenhe Zhang, Mark Gerstein

Main category: cs.LG

TL;DR: Elign is a post-training framework that amortizes computational costs in physics-guided 3D molecular conformation generation by using pretrained ML force fields and shifting physical steering to training via reinforcement learning.

DetailsMotivation: Current E(3)-equivariant diffusion models for 3D molecular conformations reproduce biases from training data rather than capturing true equilibrium distributions. Physics-based guidance helps but faces computational bottlenecks from expensive quantum-chemical evaluations (DFT) and repeated queries during sampling.

Method: Elign uses: 1) Pretrained foundational machine-learning force field (MLFF) instead of expensive DFT for physical signals; 2) Formulates reverse diffusion as reinforcement learning with Force-Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune denoising policy, shifting physical steering to training phase.
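
The disentangled, group-normalized reward can be sketched directly: energy and force rewards get their own group baselines and scales before being combined into an advantage. This is my reading of the summary, not the released training code.

```python
# Sketch of group-wise, per-reward normalization in the spirit of FED-GRPO.
import torch

def disentangled_advantages(energy_rewards, force_rewards, eps=1e-8):
    """Both inputs: [G] rewards for G conformations sampled for the same molecule.
    Each reward is normalized within its group independently, then summed."""
    def group_norm(r):
        return (r - r.mean()) / (r.std() + eps)
    return group_norm(energy_rewards) + group_norm(force_rewards)
```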

Result: Generates conformations with lower gold-standard DFT energies and forces while improving stability. Inference remains as fast as unguided sampling since no energy evaluations are required during generation.

Conclusion: Elign successfully amortizes computational costs of physics guidance in molecular conformation generation, producing more physically accurate structures without sacrificing inference speed.

Abstract: Generative models for 3D molecular conformations must respect Euclidean symmetries and concentrate probability mass on thermodynamically favorable, mechanically stable structures. However, E(3)-equivariant diffusion models often reproduce biases from semi-empirical training data rather than capturing the equilibrium distribution of a high-fidelity Hamiltonian. While physics-based guidance can correct this, it faces two computational bottlenecks: expensive quantum-chemical evaluations (e.g., DFT) and the need to repeat such queries at every sampling step. We present Elign, a post-training framework that amortizes both costs. First, we replace expensive DFT evaluations with a faster, pretrained foundational machine-learning force field (MLFF) to provide physical signals. Second, we eliminate repeated run-time queries by shifting physical steering to the training phase. To achieve the second amortization, we formulate reverse diffusion as a reinforcement learning problem and introduce Force–Energy Disentangled Group Relative Policy Optimization (FED-GRPO) to fine-tune the denoising policy. FED-GRPO includes a potential-based energy reward and a force-based stability reward, which are optimized and group-normalized independently. Experiments show that Elign generates conformations with lower gold-standard DFT energies and forces, while improving stability. Crucially, inference remains as fast as unguided sampling, since no energy evaluations are required during generation.

[640] Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models

Rui Cai, Bangzheng Li, Xiaofei Wen, Muhao Chen, Zhe Zhao

Main category: cs.LG

TL;DR: A unified finetuning framework improves multimodal LLM robustness against spurious modality interference through perturbation-based data augmentation and consistency regularization.

DetailsMotivation: Multimodal LLMs show strong benchmark performance but poor robustness when exposed to spurious modality interference (irrelevant text in vision tasks or irrelevant visuals in QA), which distorts model decisions.

Method: Proposes unified finetuning framework combining heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs.

Result: Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks show consistent improvements in unimodal robustness and generalization, while also improving standard multimodal performance.

Conclusion: The framework effectively addresses modality interference issues in MLLMs, enhancing robustness across different architectures and model scales while maintaining or improving standard performance.

Abstract: Multimodal Large Language Models demonstrate strong performance on multimodal benchmarks, yet often exhibit poor robustness when exposed to spurious modality interference, such as irrelevant text in vision understanding, or irrelevant visual content in question answering. At its core, modality interference refers to cases where spurious signals from non-essential modalities distort model decisions, which we systematically analyze through causal, perturbation-based diagnostic experiments. To address this problem, we propose a unified finetuning framework that combines heuristic and adversarial perturbation-based data augmentation with output-level consistency regularization between original and perturbed inputs. Extensive experiments across image-heavy, text-heavy, and multimodal benchmarks, spanning multiple MLLM architectures and model scales, demonstrate consistent improvements in unimodal robustness and generalization, while improving standard multimodal performance.
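As a rough illustration of the output-level consistency term described above, the snippet below penalizes divergence between predictions on an original input and a perturbed copy. The `perturb` function, the KL direction, and the absence of a weighting coefficient are assumptions for illustration; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, batch, perturb):
    """Minimal sketch: keep the predictive distribution stable when a spurious
    modality (e.g. irrelevant text for a vision task) is injected into the input."""
    logits_clean = model(batch)            # original multimodal input
    logits_pert = model(perturb(batch))    # heuristically or adversarially perturbed copy
    return F.kl_div(
        F.log_softmax(logits_pert, dim=-1),
        F.softmax(logits_clean, dim=-1).detach(),  # treat the clean prediction as the target
        reduction="batchmean",
    )
```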

[641] Negatives-Dominant Contrastive Learning for Generalization in Imbalanced Domains

Meng Cao, Jiexi Liu, Songcan Chen

Main category: cs.LG

TL;DR: NDCL proposes a negative-dominant contrastive learning approach for imbalanced domain generalization that enhances discriminability while enforcing posterior consistency across domains to handle both domain and label shifts.

DetailsMotivation: Imbalanced Domain Generalization (IDG) addresses both domain shift and label shift under heterogeneous long-tailed distributions across domains, which remains underexplored due to technical complexity and lack of theoretical foundations.

Method: Negative-Dominant Contrastive Learning (NDCL) emphasizes negatives as primary signals to enhance inter-class decision-boundary separation, uses re-weighted cross-entropy for intra-class compactness, and employs prediction-central alignment for posterior consistency across domains.

Result: Rigorous experiments on benchmarks validate the effectiveness of NDCL for handling imbalanced domain generalization challenges.

Conclusion: NDCL provides a novel approach to IDG by directly steering decision boundaries through negative-dominant contrastive learning, offering theoretical foundations and practical solutions for handling entangled domain and label shifts.

Abstract: Imbalanced Domain Generalization (IDG) focuses on mitigating both domain and label shifts, both of which fundamentally shape the model’s decision boundaries, particularly under heterogeneous long-tailed distributions across domains. Despite its practical significance, it remains underexplored, primarily due to the technical complexity of handling their entanglement and the paucity of theoretical foundations. In this paper, we begin by theoretically establishing the generalization bound for IDG, highlighting the role of posterior discrepancy and decision margin. This bound motivates us to focus on directly steering decision boundaries, marking a clear departure from existing methods. Subsequently, we propose a novel Negative-Dominant Contrastive Learning (NDCL) for IDG to enhance discriminability while enforcing posterior consistency across domains. Specifically, inter-class decision-boundary separation is enhanced by placing greater emphasis on negatives as the primary signal in our contrastive learning, naturally amplifying gradient signals for minority classes to avoid the decision boundary being biased toward majority classes. Meanwhile, intra-class compactness is encouraged through a re-weighted cross-entropy strategy, and posterior consistency across domains is enforced through a prediction-central alignment strategy. Finally, rigorous yet challenging experiments on benchmarks validate the effectiveness of our NDCL. The code is available at https://github.com/Alrash/NDCL.
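A negatives-dominant contrastive term can be sketched as below. This is only one way of putting extra weight on negative pairs; the margin constraints, the exact weighting, and the re-weighted cross-entropy and prediction-central alignment terms of NDCL are not reproduced here.

```python
import torch

def negatives_dominant_contrastive(z: torch.Tensor, labels: torch.Tensor,
                                   tau: float = 0.1, neg_weight: float = 2.0) -> torch.Tensor:
    """Sketch of a supervised contrastive loss whose negatives are up-weighted.

    z: (N, d) L2-normalized features; labels: (N,) integer class labels.
    Up-weighting negative similarities strengthens inter-class repulsion, which is
    the intuition behind letting negatives dominate the signal.
    """
    n = z.size(0)
    sim = z @ z.T / tau
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos_mask = same & ~eye
    logits = torch.where(~same, neg_weight * sim, sim)   # heavier negatives
    # Log-probability of each positive against all non-self pairs.
    exp = torch.exp(logits).masked_fill(eye, 0.0)
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-8)
    return -(log_prob * pos_mask).sum() / pos_mask.sum().clamp(min=1)
```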

[642] Rate-Distortion Optimization for Transformer Inference

Anderson de Andrade, Alon Harell, Ivan V. Bajić

Main category: cs.LG

TL;DR: A rate-distortion framework for compressing transformer intermediate representations to enable efficient multi-device inference, with theoretical bounds and empirical validation.

DetailsMotivation: Transformers have heavy compute/memory requirements during inference, especially when partitioning across multiple devices. This requires compressing intermediate representations, but existing methods lack principled approaches that explicitly trade off compression rate against accuracy.

Method: Proposes a rate-distortion-based framework for lossy compression that learns compact encodings with explicit rate-accuracy tradeoffs. Extends information theory to define the gap between rate and entropy, derives bounds, and develops PAC-style bounds for estimating this gap.

Result: The proposed codec achieves substantial savings with improved accuracy in some cases on language benchmarks, outperforming more complex baseline methods. Empirical analysis shows transformer rates are driven by the theoretical bounds.

Conclusion: Provides a principled framework for transformer representation compression with theoretical foundations, offering a unified lens for understanding performance in representation coding and improving explainability.

Abstract: Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process across multiple devices, which, in turn, requires compressing its intermediate representations. In this work, we introduce a principled rate-distortion-based framework for lossy compression that learns compact encodings that explicitly trade off bitrate against accuracy. Experiments on language benchmarks show that the proposed codec achieves substantial savings with improved accuracy in some cases, outperforming more complex baseline methods. We characterize and analyze the rate-distortion performance of transformers, offering a unified lens for understanding performance in representation coding. This formulation extends information-theoretic concepts to define the gap between rate and entropy, and derive some of its bounds. We further develop probably approximately correct (PAC)-style bounds for estimating this gap. For different architectures and tasks, we empirically demonstrate that their rates are driven by these bounds, adding to the explainability of the formulation.

[643] JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo

Main category: cs.LG

TL;DR: JointDiff is a diffusion framework that jointly generates continuous spatio-temporal data and synchronous discrete events, applied to sports analytics with multi-agent trajectories and possession events.

DetailsMotivation: Current generative models treat continuous data and discrete events separately, creating a gap in modeling complex interactive systems where these elements interact synchronously, such as in sports analytics.

Method: Introduces JointDiff diffusion framework for unified generation, CrossGuid conditioning operation for multi-agent domains, and demonstrates with weak-possessor-guidance and text-guidance control in sports scenarios.

Result: Achieves state-of-the-art performance, validated with non-controllable generation and novel controllable scenarios, using enhanced soccer and football datasets with textual descriptions.

Conclusion: Joint modeling of continuous and discrete processes is crucial for building realistic and controllable generative models for interactive systems like sports analytics.

Abstract: Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems. https://guillem-cf.github.io/JointDiff/

[644] Exploring Diverse Generation Paths via Inference-time Stiefel Activation Steering

Dongxuan Zhu, Ly Tran Ho Khanh, Andy Yat-Ming Cheung, Man-Chung Yue, Viet Anh Nguyen

Main category: cs.LG

TL;DR: STARS is a training-free inference-time method that uses Stiefel manifold optimization to steer activations orthogonally, promoting diverse generation paths in language models without sacrificing quality.

DetailsMotivation: Language models tend to produce homogeneous outputs due to defaulting to high-probability responses, and existing sampling methods struggle to guarantee diversity across concurrent generation runs.

Method: STARS collects hidden activations from concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold to maximize geometric volume of steered activations while maintaining orthogonality through manifold constraints.

Result: STARS outperforms standard sampling methods on test case generation and scientific discovery benchmarks, achieving greater diversity without sacrificing qualitative performance.

Conclusion: STARS provides an effective training-free intervention for promoting diverse reasoning paths in language models through geometric optimization on the Stiefel manifold.

Abstract: Language models often default to a narrow set of high-probability outputs, leaving their generation paths homogeneous and prone to mode collapse. Sampling-based strategies inject randomness but still struggle to guarantee diversity across multiple concurrent generation runs. We address this limitation by introducing STARS ($\textbf{St}$iefel-based $\textbf{A}$ctivation Steering for Diverse $\textbf{R}$ea$\textbf{S}$oning), a training-free, inference-time intervention method that transforms activation steering into an exploration engine. At each token, STARS collects the hidden activations of concurrent generation runs and optimizes multiple additive steering directions jointly on the Stiefel manifold. STARS maximizes the geometric volume of the steered activations, while the Stiefel manifold induces orthogonality of the steering interventions. This formulation explicitly promotes divergent activation vectors of concurrent generation runs, and implicitly promotes divergent generation trajectories. This manifold optimization formulation can be solved using a Riemannian gradient descent algorithm with convergence guarantees, but this algorithm is too time-consuming for real-time inference. To guarantee low latency, we further design a lightweight one-step update with an aggressive, closed-form stepsize. For test case generation and scientific discovery benchmarks, STARS consistently outperforms standard sampling methods, achieving greater diversity without sacrificing qualitative performance.
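The geometric idea behind STARS can be sketched as a single projected-ascent step on the steering directions. The log-det volume objective, the step size, and the QR-based retraction below are illustrative assumptions; the paper instead uses a dedicated closed-form one-step update with an aggressive stepsize.

```python
import torch

def stars_step(hidden: torch.Tensor, alpha: float = 1.0, lr: float = 0.1) -> torch.Tensor:
    """One illustrative steering step for K concurrent generation runs.

    hidden: (K, d) hidden activations at the current token, one row per run.
    Returns steered activations hidden + alpha * D, with D having orthonormal rows.
    """
    hidden = hidden.detach()
    K, d = hidden.shape
    # Random directions retracted onto the Stiefel manifold (orthonormal rows) via QR.
    D = torch.linalg.qr(torch.randn(d, K), mode="reduced").Q.T
    D.requires_grad_(True)

    steered = hidden + alpha * D
    gram = steered @ steered.T
    # Volume objective: log-determinant of the Gram matrix of steered activations.
    volume = torch.logdet(gram + 1e-4 * torch.eye(K))
    volume.backward()

    with torch.no_grad():
        D_new = D + lr * D.grad                                # Euclidean ascent step
        D_new = torch.linalg.qr(D_new.T, mode="reduced").Q.T   # retract back to the Stiefel manifold
    return hidden + alpha * D_new
```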

[645] Putting a Face to Forgetting: Continual Learning meets Mechanistic Interpretability

Sergi Masip, Gido M. van de Ven, Javier Ferrando, Tinne Tuytelaars

Main category: cs.LG

TL;DR: A mechanistic framework for analyzing catastrophic forgetting in continual learning through feature encoding transformations, with geometric interpretations and empirical validation on vision transformers.

DetailsMotivation: Current measures of catastrophic forgetting focus on performance or last-layer representations, overlooking the underlying mechanisms of how features are transformed during continual learning.

Method: Introduces a mechanistic framework with geometric interpretation of forgetting as transformations to feature encodings, analyzes a tractable model to formalize best/worst-case scenarios, and uses Crosscoders to study practical models like Vision Transformers on sequential CIFAR-10.

Result: The framework provides a feature-centric vocabulary for continual learning, identifies how transformations reduce feature capacity and disrupt downstream readout, and demonstrates the detrimental effect of depth through empirical experiments.

Conclusion: Offers a new mechanistic perspective on catastrophic forgetting that focuses on feature encoding transformations rather than just performance metrics, enabling better analysis of continual learning systems.

Abstract: Catastrophic forgetting in continual learning is often measured at the performance or last-layer representation level, overlooking the underlying mechanisms. We introduce a mechanistic framework that offers a geometric interpretation of catastrophic forgetting as the result of transformations to the encoding of individual features. These transformations can lead to forgetting by reducing the allocated capacity of features (worse representation) and disrupting their readout by downstream computations. Analysis of a tractable model formalizes this view, allowing us to identify best- and worst-case scenarios. Through experiments on this model, we empirically test our formal analysis and highlight the detrimental effect of depth. Finally, we demonstrate how our framework can be used in the analysis of practical models through the use of Crosscoders. We present a case study of a Vision Transformer trained on sequential CIFAR-10. Our work provides a new, feature-centric vocabulary for continual learning.

[646] TBDFiltering: Sample-Efficient Tree-Based Data Filtering

Robert Istvan Busa-Fekete, Julian Zimmert, Anne Xiangyi Zheng, Claudio Gentile, Andras Gyorgy

Main category: cs.LG

TL;DR: Proposes a hierarchical clustering method using text embeddings to efficiently select documents for LLM quality evaluation, reducing the number of expensive LLM queries needed for training data curation.

DetailsMotivation: Current methods for selecting high-quality training data for LLMs are either too expensive (querying LLMs for each document) or rely on unreliable classifiers. There's a need for scalable, query-efficient approaches to filter billions of documents.

Method: Uses text embeddings to create hierarchical clusters of documents, then adaptively selects documents from clusters for LLM evaluation. The algorithm queries LLMs only for strategically chosen documents to estimate cluster quality, leveraging the assumption that clusters tend to be homogeneous in quality.

Result: Theoretically proves query efficiency: with high probability, correctly predicts document quality after querying a small number of documents proportional to the size of the smallest subtree with pure leaves. Experimental results show benefits over classifier-based filtering methods.

Conclusion: The hierarchical clustering approach provides a scalable, query-efficient solution for LLM training data curation, significantly reducing the number of expensive LLM queries needed while maintaining quality assessment accuracy.

Abstract: The quality of machine learning models depends heavily on their training data. Selecting high-quality, diverse training sets for large language models (LLMs) is a difficult task, due to the lack of cheap and reliable quality metrics. While querying existing LLMs for document quality is common, this is not scalable to the large number (billions) of documents used in training. Instead, practitioners often use classifiers trained on sparse quality signals. In this paper, we propose a text-embedding-based hierarchical clustering approach that adaptively selects the documents to be evaluated by the LLM to estimate cluster quality. We prove that our method is query efficient: under the assumption that the hierarchical clustering contains a subtree such that each leaf cluster in the tree is pure enough (i.e., it mostly contains either only good or only bad documents), with high probability, the method can correctly predict the quality of each document after querying a small number of documents. The number of such documents is proportional to the size of the smallest subtree with (almost) pure leaves, without the algorithm knowing this subtree in advance. Furthermore, in a comprehensive experimental study, we demonstrate the benefits of our algorithm compared to other classifier-based filtering methods.
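To make the idea concrete, the sketch below replaces the paper's adaptive, tree-based querying rule with a fixed number of LLM probes per flat cluster; the clustering method, probe count, and `llm_score` judge are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def propagate_cluster_quality(embeddings, docs, llm_score,
                              n_clusters=100, probes_per_cluster=5, seed=0):
    """Estimate a quality score for every document while querying the LLM judge
    on only a handful of documents per cluster of the embedding space."""
    rng = np.random.default_rng(seed)
    Z = linkage(embeddings, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    quality = np.zeros(len(docs))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        probes = rng.choice(idx, size=min(probes_per_cluster, len(idx)), replace=False)
        # Expensive LLM call on a few probes only; the estimate is shared by the whole
        # cluster, relying on clusters being mostly homogeneous in quality.
        quality[idx] = np.mean([llm_score(docs[i]) for i in probes])
    return quality
```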

[647] From Logits to Latents: Contrastive Representation Shaping for LLM Unlearning

Haoran Tang, Rajiv Khanna

Main category: cs.LG

TL;DR: CLReg introduces a contrastive representation regularizer for LLM unlearning that reduces entanglement between forgotten and retained knowledge by pushing their features apart in representation space.

DetailsMotivation: Current LLM unlearning methods using alignment-style objectives in prediction space may only suppress forgotten content rather than truly removing it, allowing forgotten concepts to persist in representations and remain entangled with retained knowledge.

Method: CLReg uses contrastive representation regularization to identify forget features and push them away from retain features, explicitly reducing forget-retain interference while minimizing shifts on retained features.

Result: Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement and facilitates mainstream unlearning methods without introducing extra privacy risks.

Conclusion: The work provides theoretical insights linking representation shaping to entanglement reduction and inspires future approaches that reshape representation space to truly remove forget concepts rather than just suppressing them.

Abstract: Most LLM unlearning methods aim to approximate retrain-from-scratch behaviors with minimal distribution shift, often via alignment-style objectives defined in the prediction space. While effective at reducing forgotten content generation, such approaches may act as suppression: forgotten concepts can persist in representations and remain entangled with retained knowledge. We introduce CLReg, a contrastive representation regularizer that identifies forget features while pushing them away from retain features, explicitly reducing forget-retain interference with minimal shifts on retain features. We provide the first theoretical insights that relate representation shaping to entanglement reduction. Across unlearning benchmarks and LLMs of different sizes, CLReg decreases forget-retain representation entanglement, which facilitates mainstream unlearning methods without posing extra privacy risks, inspiring future work that reshapes the representation space to remove forget concepts.
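A regularizer in this spirit is sketched below. It is an illustration only, under assumed inputs: the paper's exact contrastive formulation, feature extraction, and weighting are not reproduced.

```python
import torch
import torch.nn.functional as F

def clreg_style_term(h_forget, h_retain, h_retain_ref, margin=0.5, lam=1.0):
    """Push forget-set representations away from retain-set ones while keeping
    retain representations close to their pre-unlearning reference values."""
    h_forget = F.normalize(h_forget, dim=-1)
    h_retain = F.normalize(h_retain, dim=-1)
    h_retain_ref = F.normalize(h_retain_ref, dim=-1)

    # Separation: penalize forget/retain cosine similarity above a margin.
    separation = F.relu(h_forget @ h_retain.T - margin).mean()
    # Stability: discourage shifts of retain features relative to the original model.
    stability = (1.0 - (h_retain * h_retain_ref).sum(dim=-1)).mean()
    return separation + lam * stability
```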

[648] The Ensemble Inverse Problem: Applications and Methods

Zhengyan Huan, Camila Pazos, Martin Klassen, Vincent Croft, Pierre-Hugues Beauchemin, Shuchin Aeron

Main category: cs.LG

TL;DR: Proposes ensemble inverse generative models for solving the Ensemble Inverse Problem (EIP) - reconstructing true distributions from distorted measurements without iterative forward model use at inference time.

DetailsMotivation: Addresses the Ensemble Inverse Problem (EIP) which arises in high energy physics (unfolding), full waveform inversion, and inverse imaging with unknown priors. Existing methods require iterative use of forward models at inference time, which can be computationally expensive.

Method: Proposes non-iterative inference-time methods using conditional generative models called ensemble inverse generative models. These models use ensemble information from observation sets and are trained across multiple truth-observation pairs consistent with the same forward model but from diverse priors, implicitly encoding the likelihood model.

Result: Demonstrates the method on synthetic and real datasets in inverse imaging, high energy physics, and full waveform inversion. Shows that ensemble information helps posterior inference and enables generalization to unseen priors.

Conclusion: Proposes a novel approach to the Ensemble Inverse Problem using ensemble inverse generative models that avoid explicit iterative forward model use at inference time, enabling efficient posterior sampling and generalization to new priors.

Abstract: We introduce a new multivariate statistical problem that we refer to as the Ensemble Inverse Problem (EIP). The aim of EIP is to invert for an ensemble that is distributed according to the pushforward of a prior under a forward process. In high energy physics (HEP), this is related to a widely known problem called unfolding, which aims to reconstruct the true physics distribution of quantities, such as momentum and angle, from measurements that are distorted by detector effects. In recent applications, the EIP also arises in full waveform inversion (FWI) and inverse imaging with unknown priors. We propose non-iterative inference-time methods that construct posterior samplers based on a new class of conditional generative models, which we call ensemble inverse generative models. For the posterior modeling, these models additionally use the ensemble information contained in the observation set on top of single measurements. Unlike existing methods, our proposed methods avoid explicit and iterative use of the forward model at inference time via training across several sets of truth-observation pairs that are consistent with the same forward model, but originate from a wide range of priors. We demonstrate that this training procedure implicitly encodes the likelihood model. The use of ensemble information helps posterior inference and enables generalization to unseen priors. We benchmark the proposed method on several synthetic and real datasets in inverse imaging, HEP, and FWI. The codes are available at https://github.com/ZhengyanHuan/The-Ensemble-Inverse-Problem--Applications-and-Methods.

[649] Continual GUI Agents

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, Tao Feng

Main category: cs.LG

TL;DR: Proposes GUI-AiF, a reinforcement fine-tuning framework for continual GUI agents that stabilizes learning under shifting domains/resolutions using novel anchoring rewards.

DetailsMotivation: GUI agents trained on static environments deteriorate when faced with changing digital environments (new domains/resolutions over time). Existing methods fail to maintain stable grounding as GUI distributions shift due to diverse UI interaction points and regions.

Method: Introduces GUI-Anchoring in Flux (GUI-AiF), a reinforcement fine-tuning framework with two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide agents to align with shifting interaction points and regions, preventing over-adaptation to static grounding cues.

Result: Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. The framework demonstrates stable continual learning performance as GUI distributions shift over time.

Conclusion: Establishes the first continual learning framework for GUI agents, revealing the potential of reinforcement fine-tuning for continual GUI agents operating in fluxing digital environments.

Abstract: As digital environments (data distribution) are in flux, with new GUI data arriving over time and introducing new domains or resolutions, agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in fluxing scenarios. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI Agents.

[650] Per-parameter Task Arithmetic for Unlearning in Large Language Models

Chengyi Cai, Zesheng Ye, Jiangchao Yao, Jianzhong Qi, Bo Han, Xiaolu Zhang, Feng Liu, Jun Zhou

Main category: cs.LG

TL;DR: PerTA: A per-parameter task arithmetic mechanism for LLM unlearning that rescales task vectors to balance forgetting private information while retaining other knowledge, addressing over-forgetting issues in standard task arithmetic.

DetailsMotivation: Standard task arithmetic for LLM unlearning subtracts task vectors but causes over-forgetting by disrupting parameters essential for retaining other information. The authors observed that each parameter has different importance for forgetting versus retention, motivating a more fine-grained approach.

Method: Proposes PerTA (Per-parameter Task Arithmetic) that rescales task vectors with per-parameter weights quantifying relative importance for forgetting vs retention. Two estimation methods: PerTA-grad (using gradients) and PerTA-fisher (using diagonal Fisher information approximation). Also extends to more general forms and provides theoretical analysis.

Result: Extensive experiments show PerTA consistently improves upon standard task vectors, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility.

Conclusion: PerTA offers a principled and practical framework for LLM unlearning that retains the efficiency of task arithmetic while mitigating over-forgetting, providing better balance between forgetting private information and maintaining model utility.

Abstract: In large language model (LLM) unlearning, private information is required to be removed. Task arithmetic unlearns by subtracting a specific task vector (TV)–defined as the parameter difference between a privacy-information-tuned model and the original model. While efficient, it can cause over-forgetting by disrupting parameters essential for retaining other information. Motivated by the observation that each parameter exhibits different importance for forgetting versus retention, we propose a per-parameter task arithmetic (PerTA) mechanism to rescale the TV, allowing per-parameter adjustment. These weights quantify the relative importance of each parameter for forgetting versus retention, estimated via gradients (i.e., PerTA-grad) or the diagonal Fisher information approximation (i.e., PerTA-fisher). Moreover, we discuss the effectiveness of PerTA, extend it to a more general form, and provide further analysis. Extensive experiments demonstrate that PerTA consistently improves upon standard TV, and in many cases surpasses widely used training-based unlearning methods in both forgetting effectiveness and overall model utility. By retaining the efficiency of task arithmetic while mitigating over-forgetting, PerTA offers a principled and practical framework for LLM unlearning.
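Per-parameter task arithmetic amounts to rescaling the task vector element-wise before subtraction. The sketch below is an illustration under assumptions: the ratio-based weighting and its normalization are not necessarily how PerTA-grad or PerTA-fisher compute their weights.

```python
import torch

def per_parameter_task_arithmetic(theta_orig, theta_tuned, imp_forget, imp_retain, eps=1e-8):
    """All arguments are dicts mapping parameter names to tensors.

    theta_tuned is the privacy-information-tuned model; the task vector is its
    difference from theta_orig. imp_forget / imp_retain score how important each
    parameter is for forgetting vs. retention (e.g. squared gradients or a
    diagonal Fisher estimate).
    """
    theta_new = {}
    for name, w0 in theta_orig.items():
        task_vector = theta_tuned[name] - w0
        # Subtract more where the parameter matters for forgetting, less where it
        # matters for retaining other knowledge.
        weight = imp_forget[name] / (imp_forget[name] + imp_retain[name] + eps)
        theta_new[name] = w0 - weight * task_vector
    return theta_new
```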

[651] Holographic generative flows with AdS/CFT

Ehsan Mirafzali, Sanjit Shashi, Sanya Murdeshwar, Edgar Shaghoulian, Daniele Venturi, Razvan Marinescu

Main category: cs.LG

TL;DR: A framework combining holographic principle from quantum gravity (AdS/CFT correspondence) with deep learning and transport theory for generative modeling, representing data flow using bulk-to-boundary mapping in AdS space.

DetailsMotivation: To develop novel generative modeling paradigms by leveraging the holographic principle from quantum gravity, specifically the AdS/CFT correspondence, to provide physically interpretable versions of flow-matching algorithms and establish the utility of AdS physics in machine learning.

Method: Represent the flow of data from a base distribution to a learned distribution using the bulk-to-boundary mapping of scalar fields in anti-de Sitter (AdS) space. Augment the flow-matching algorithm with AdS physics, creating a hybrid approach that combines deep learning, transport theory, and holographic principles.

Result: The model achieves faster and higher quality convergence than comparable physics-free flow-matching models on both a checkerboard toy dataset and MNIST. The method provides physically interpretable flow matching and demonstrates the utility of AdS physics in generative modeling.

Conclusion: The framework successfully establishes the utility of AdS physics and geometry in developing novel generative modeling paradigms, providing physically interpretable versions of flow matching that outperform traditional physics-free approaches.

Abstract: We present a framework for generative machine learning that leverages the holographic principle of quantum gravity, or to be more precise its manifestation as the anti-de Sitter/conformal field theory (AdS/CFT) correspondence, with techniques for deep learning and transport theory. Our proposal is to represent the flow of data from a base distribution to some learned distribution using the bulk-to-boundary mapping of scalar fields in AdS. In the language of machine learning, we are representing and augmenting the flow-matching algorithm with AdS physics. Using a checkerboard toy dataset and MNIST, we find that our model achieves faster and higher quality convergence than comparable physics-free flow-matching models. Our method provides a physically interpretable version of flow matching. More broadly, it establishes the utility of AdS physics and geometry in the development of novel paradigms in generative modeling.

[652] Cross-Fusion Distance: A Novel Metric for Measuring Fusion and Separability Between Data Groups in Representation Space

Xiaolong Zhang, Jianwei Zhang, Xubo Song

Main category: cs.LG

TL;DR: CFD is a new distance metric for representation learning that isolates fusion-altering geometric factors while remaining invariant to fusion-preserving variations like scaling, providing better alignment with downstream generalization under domain shift.

DetailsMotivation: Existing distributional distance metrics conflate fusion-altering factors (like geometric displacement between representation groups) with fusion-preserving factors (like global scaling), making them uninformative about the true extent of fusion between data groups under domain shift.

Method: Introduces Cross-Fusion Distance (CFD), a principled measure with linear computational complexity that isolates fusion-altering geometry while remaining robust to fusion-preserving variations. The method is theoretically characterized for invariance and sensitivity properties.

Result: CFD shows theoretical invariance/sensitivity properties validated in synthetic experiments. On real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives.

Conclusion: CFD provides a theoretically grounded and interpretable distance measure for representation learning that better captures the true extent of fusion between data groups under domain shift.

Abstract: Quantifying degrees of fusion and separability between data groups in representation space is a fundamental problem in representation learning, particularly under domain shift. A meaningful metric should capture fusion-altering factors like geometric displacement between representation groups, whose variations change the extent of fusion, while remaining invariant to fusion-preserving factors such as global scaling and sampling-induced layout changes, whose variations do not. Existing distributional distance metrics conflate these factors, leading to measures that are not informative of the true extent of fusion between data groups. We introduce Cross-Fusion Distance (CFD), a principled measure that isolates fusion-altering geometry while remaining robust to fusion-preserving variations, with linear computational complexity. We characterize the invariance and sensitivity properties of CFD theoretically and validate them in controlled synthetic experiments. For practical utility on real-world datasets with domain shift, CFD aligns more closely with downstream generalization degradation than commonly used alternatives. Overall, CFD provides a theoretically grounded and interpretable distance measure for representation learning.

[653] Making Foundation Models Probabilistic via Singular Value Ensembles

Mehmet Ozgur Turkoglu, Dominik J. Mühlematter, Alexander Becker, Konrad Schindler, Helge Aasen

Main category: cs.LG

TL;DR: SVE is a parameter-efficient ensemble method for foundation models that trains only singular values while freezing singular vectors to achieve uncertainty quantification with minimal parameter overhead.

DetailsMotivation: Foundation models have calibration issues and standard ensemble methods for uncertainty quantification are computationally prohibitive for large models, requiring a parameter-efficient alternative.

Method: Freezes singular vectors of weight matrices as knowledge directions and trains only per-member singular values to modulate contributions, creating ensemble diversity through stochastic training.

Result: Achieves uncertainty quantification comparable to explicit deep ensembles with less than 1% parameter increase, improving calibration while maintaining predictive accuracy across NLP and vision tasks.

Conclusion: SVE enables practical uncertainty estimation for large foundation models by leveraging the structure of pretrained weights, making principled uncertainty quantification accessible in resource-constrained settings.

Abstract: Foundation models have become a dominant paradigm in machine learning, achieving remarkable performance across diverse tasks through large-scale pretraining. However, these models often yield overconfident, uncalibrated predictions. The standard approach to quantifying epistemic uncertainty, training an ensemble of independent models, incurs prohibitive computational costs that scale linearly with ensemble size, making it impractical for large foundation models. We propose Singular Value Ensemble (SVE), a parameter-efficient implicit ensemble method that builds on a simple, but powerful core assumption: namely, that the singular vectors of the weight matrices constitute meaningful subspaces of the model’s knowledge. Pretrained foundation models encode rich, transferable information in their weight matrices. If the singular vectors are indeed meaningful (orthogonal) “knowledge directions”, then to obtain a model ensemble it suffices to modulate only how strongly each direction contributes to the output. Rather than learning entirely new parameters, we freeze the singular vectors and only train per-member singular values that rescale the contribution of each direction in that shared knowledge basis. Ensemble diversity emerges naturally as stochastic initialization and random sampling of mini-batches during joint training cause different members to converge to different combinations of the same underlying knowledge. SVE achieves uncertainty quantification comparable to explicit deep ensembles while increasing the parameter count of the base model by less than 1%, making principled uncertainty estimation accessible in resource-constrained settings. We validate SVE on NLP and vision tasks with various backbones and show that it improves calibration while maintaining predictive accuracy.
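One way to realize this for a single linear layer is sketched below; the initialization of the per-member singular values and the member-indexing interface are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SVELinearSketch(nn.Module):
    """Frozen singular vectors, trainable per-member singular values."""

    def __init__(self, pretrained_weight: torch.Tensor, num_members: int):
        super().__init__()
        U, S, Vh = torch.linalg.svd(pretrained_weight, full_matrices=False)
        self.register_buffer("U", U)     # frozen "knowledge directions"
        self.register_buffer("Vh", Vh)
        # Only these (num_members x rank) values are trained; everything else is shared.
        self.singular_values = nn.Parameter(S.unsqueeze(0).repeat(num_members, 1).clone())

    def forward(self, x: torch.Tensor, member: int) -> torch.Tensor:
        weight = self.U @ torch.diag(self.singular_values[member]) @ self.Vh
        return x @ weight.T
```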

[654] Where Do the Joules Go? Diagnosing Inference Energy Consumption

Jae-Won Chung, Ruofan Wu, Jeff J. Ma, Mosharaf Chowdhury

Main category: cs.LG

TL;DR: Large-scale measurement study of inference time and energy consumption across 46 generative AI models, revealing order-of-magnitude variations in energy usage across different tasks and configurations.

DetailsMotivation: Energy has become a critical resource in ML computing, and while measuring consumption is valuable, understanding the underlying reasons for energy differences is crucial for optimization in generative AI systems.

Method: Conducted large-scale measurement study with 46 generative AI models, 7 tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs, then developed a framework to analyze underlying mechanisms governing time and energy consumption through latent metrics like memory and utilization.

Result: Found order-of-magnitude variations: LLM task types cause 25× energy differences, video generation consumes over 100× more energy than images, and GPU utilization differences lead to 3-5× energy variations.

Conclusion: Proposed a framework for understanding time and energy consumption mechanisms through latent metrics affected by algorithm, software, and hardware factors, extending to throughput per watt for power-constrained datacenters.

Abstract: Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3–5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.

[655] Boosting CVaR Policy Optimization with Quantile Gradients

Yudong Luo, Erick Delage

Main category: cs.LG

TL;DR: Proposes a more sample-efficient approach to Conditional Value-at-Risk optimization in reinforcement learning by augmenting CVaR with an expected quantile term that leverages all sampled data through dynamic programming.

DetailsMotivation: CVaR optimization using policy gradient (CVaR-PG) suffers from sample inefficiency because it focuses only on tail-end performance and ignores many sampled trajectories, making it impractical for real-world applications.

Method: Augments CVaR with an expected quantile term, enabling dynamic programming formulation that can leverage all sampled data rather than just tail-end trajectories, improving sample efficiency while preserving the CVaR objective.

Result: Empirical results in domains with verifiable risk-averse behavior show the algorithm substantially improves upon CVaR-PG and consistently outperforms other existing methods within the Markovian policy class.

Conclusion: The proposed approach successfully addresses sample inefficiency in CVaR optimization by incorporating quantile optimization with dynamic programming, making risk-averse reinforcement learning more practical.

Abstract: Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.

[656] Prior-Informed Flow Matching for Graph Reconstruction

Harvey Chen, Nicolas Zilberstein, Santiago Segarra

Main category: cs.LG

TL;DR: PIFM combines graph embedding priors with flow matching for improved graph reconstruction from partial observations.

DetailsMotivation: Graph reconstruction from partial observations is challenging: classical embedding methods lack global consistency, while modern generative models struggle to incorporate structural priors. There's a need to bridge this gap.

Method: PIFM integrates embedding-based priors (graphons, GraphSAGE, node2vec) with continuous-time flow matching. It uses priors to form initial adjacency estimates from local info, then applies rectified flow matching to refine toward true graph distribution.

Result: PIFM consistently enhances classical embeddings and outperforms both them and state-of-the-art generative baselines in reconstruction accuracy across different datasets.

Conclusion: PIFM successfully bridges the gap between classical embeddings and modern generative models for graph reconstruction by effectively incorporating structural priors into flow matching.

Abstract: We introduce Prior-Informed Flow Matching (PIFM), a conditional flow model for graph reconstruction. Reconstructing graphs from partial observations remains a key challenge; classical embedding methods often lack global consistency, while modern generative models struggle to incorporate structural priors. PIFM bridges this gap by integrating embedding-based priors with continuous-time flow matching. Grounded in a permutation equivariant version of the distortion-perception theory, our method first uses a prior, such as graphons or GraphSAGE/node2vec, to form an informed initial estimate of the adjacency matrix based on local information. It then applies rectified flow matching to refine this estimate, transporting it toward the true distribution of clean graphs and learning a global coupling. Experiments on different datasets demonstrate that PIFM consistently enhances classical embeddings, outperforming them and state-of-the-art generative baselines in reconstruction accuracy.
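The refinement stage builds on standard rectified flow matching between the prior-based adjacency estimate and the clean graph; a generic training step for that part is sketched below. The velocity-network interface and the absence of the permutation-equivariant machinery are simplifications, so treat this as a sketch of rectified flow matching rather than of PIFM itself.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(velocity_net, A_prior, A_clean, optimizer):
    """One training step coupling a prior-based adjacency estimate to a clean graph.

    A_prior: (B, N, N) initial adjacency estimates from an embedding prior
             (e.g. graphon or node2vec based, per the summary above).
    A_clean: (B, N, N) ground-truth adjacency matrices.
    velocity_net(x, t) is assumed to predict the flow velocity at interpolation time t.
    """
    B = A_prior.size(0)
    t = torch.rand(B, 1, 1)                       # one interpolation time per graph
    x_t = (1 - t) * A_prior + t * A_clean         # straight-line interpolation
    target_velocity = A_clean - A_prior           # rectified-flow regression target
    loss = F.mse_loss(velocity_net(x_t, t.squeeze()), target_velocity)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```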

[657] Physics Informed Reconstruction of Four-Dimensional Atmospheric Wind Fields Using Multi-UAS Swarm Observations in a Synthetic Turbulent Environment

Abdullah Tasim, Wei Sun

Main category: cs.LG

TL;DR: A framework for 4D atmospheric wind field reconstruction using coordinated UAS swarms with Bi-LSTM for local wind estimation and PINN for physics-informed continuous field reconstruction.

DetailsMotivation: Conventional atmospheric wind measurement instruments leave spatio-temporal gaps in the lower atmospheric boundary layer, while individual UAS platforms only sample along flight trajectories, limiting full wind-field recovery for applications like weather forecasting and hazard prediction.

Method: Uses coordinated UAS swarm measurements with a bidirectional LSTM network to estimate local wind components from UAS dynamics, then assimilates these into a physics-informed neural network (PINN) to reconstruct continuous 4D wind fields in space and time.

Result: Bi-LSTM achieves RMSE of 0.064-0.273 m/s for horizontal components depending on wind conditions, with vertical component errors of 0.029-0.091 m/s. PINN reconstruction recovers dominant spatial/temporal structure up to 1000m altitude, achieving overall RMSE of 0.118-0.154 m/s with five-UAS swarm performing best.

Conclusion: Coordinated UAS swarms enable accurate, scalable 4D wind-field reconstruction without dedicated wind sensors or fixed infrastructure, demonstrating potential for atmospheric monitoring applications.

Abstract: Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.

[658] Learning Hamiltonian Flow Maps: Mean Flow Consistency for Large-Timestep Molecular Dynamics

Winfried Ripken, Michael Plainer, Gregor Lied, Thorben Frank, Oliver T. Unke, Stefan Chmiela, Frank Noé, Klaus Robert Müller

Main category: cs.LG

TL;DR: Learning Hamiltonian Flow Maps to enable stable large-timestep evolution of Hamiltonian systems by predicting mean phase-space evolution over chosen time intervals

DetailsMotivation: Overcome limitations of small timesteps required for stable numerical integration in Hamiltonian systems, particularly for molecular dynamics simulations using machine-learned force fields

Method: Introduce framework to learn Hamiltonian Flow Maps by predicting mean phase-space evolution over chosen time span Δt, imposing Mean Flow consistency condition for time-averaged Hamiltonian dynamics, enabling training on independent phase-space samples without trajectory generation

Result: Method enables significantly larger integration timesteps beyond classical stability limits while maintaining comparable training/inference costs, validated across diverse Hamiltonian systems including molecular dynamics with MLFFs

Conclusion: Framework successfully overcomes timestep limitations in Hamiltonian system evolution, particularly beneficial for molecular dynamics simulations using widely-available trajectory-free MLFF datasets

Abstract: Simulating the long-time evolution of Hamiltonian systems is limited by the small timesteps required for stable numerical integration. To overcome this constraint, we introduce a framework to learn Hamiltonian Flow Maps by predicting the mean phase-space evolution over a chosen time span $Δt$, enabling stable large-timestep updates far beyond the stability limits of classical integrators. To this end, we impose a Mean Flow consistency condition for time-averaged Hamiltonian dynamics. Unlike prior approaches, this allows training on independent phase-space samples without access to future states, avoiding expensive trajectory generation. Validated across diverse Hamiltonian systems, our method in particular improves upon molecular dynamics simulations using machine-learned force fields (MLFF). Our models maintain comparable training and inference cost, but support significantly larger integration timesteps while trained directly on widely-available trajectory-free MLFF datasets.

[659] SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization

Leonard Papenmeier, Petru Tighineanu

Main category: cs.LG

TL;DR: SMOG: A scalable meta-learning model using multi-output Gaussian processes for multi-objective Bayesian optimization that learns correlations between objectives and propagates metadata uncertainty.

DetailsMotivation: There's a need to combine meta-learning with multi-objective Bayesian optimization to leverage historical data from related optimization tasks, accelerating optimization when only black-box access and limited measurements are available.

Method: SMOG uses a multi-output Gaussian process that explicitly learns correlations between objectives, builds a structured joint Gaussian process prior across meta- and target tasks, and yields a closed-form target-task prior with a flexible residual multi-output kernel after conditioning on metadata.

Result: The method achieves linear scaling with the number of meta-tasks through hierarchical, parallel training where meta-task Gaussian processes are fit once and cached, and integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions.

Conclusion: SMOG provides a principled approach for meta-learned priors in multi-objective Bayesian optimization that efficiently propagates metadata uncertainty and scales well with meta-task data.

Abstract: Multi-objective optimization aims to solve problems with competing objectives, often with only black-box access to a problem and a limited budget of measurements. In many applications, historical data from related optimization tasks is available, creating an opportunity for meta-learning to accelerate the optimization. Bayesian optimization, as a promising technique for black-box optimization, has been extended to meta-learning and multi-objective optimization independently, but methods that simultaneously address both settings - meta-learned priors for multi-objective Bayesian optimization - remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model based on a multi-output Gaussian process that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form target-task prior augmented by a flexible residual multi-output kernel. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training: meta-task Gaussian processes are fit once and then cached, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions.

[660] Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Ziming Dong, Hardik Sharma, Evan O’Toole, Jaya Prakash Champati, Kui Wu

Main category: cs.LG

TL;DR: LLM Shepherding reduces inference costs by having LLMs generate only short prefixes (hints) that guide smaller language models, achieving significant cost savings while maintaining accuracy on math and coding tasks.

DetailsMotivation: LLMs have high inference costs that limit deployment, while SLMs are cheaper but less accurate. Existing approaches treat LLMs as all-or-nothing resources, missing opportunities for more efficient collaboration.

Method: Proposes LLM Shepherding framework where LLM generates only a short prefix (hint) that guides SLM generation. Includes two-stage predictor to decide if hint is needed and how many tokens to request.

Result: Reduces costs by 42-94% vs LLM-only inference on GSM8K, CNK12, HumanEval, MBPP benchmarks. Achieves up to 2.8x cost reduction vs routing/cascading baselines while matching accuracy.

Conclusion: LLM Shepherding enables efficient SLM-LLM collaboration through token-level budget control, significantly reducing inference costs while maintaining performance on reasoning tasks.

Abstract: Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
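The mechanism itself is simple enough to show in a few lines. The `generate` interface and the two-stage `predictor` signature below are hypothetical; they only illustrate the prefix-as-hint idea.

```python
def shepherd_answer(query, slm, llm, predictor, max_hint_tokens=64, max_answer_tokens=512):
    """Hypothetical sketch: buy a short LLM prefix (the hint) only when the
    predictor says the SLM needs help, then let the SLM finish the answer."""
    need_hint, hint_tokens = predictor(query)   # stage 1: hint or not; stage 2: how many tokens
    hint = ""
    if need_hint:
        hint = llm.generate(query, max_new_tokens=min(hint_tokens, max_hint_tokens))
    # The SLM conditions on the query plus the (possibly empty) hint and completes the answer.
    return hint + slm.generate(query + hint, max_new_tokens=max_answer_tokens)
```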

[661] Late Breaking Results: Conversion of Neural Networks into Logic Flows for Edge Computing

Daniel Stein, Shaoyi Huang, Rolf Drechsler, Bing Li, Grace Li Zhang

Main category: cs.LG

TL;DR: Proposes converting neural networks into logic flows (if-else structures) for efficient execution on CPUs by first converting to decision trees and then compressing decision paths, reducing latency by up to 14.9% on RISC-V CPUs.

DetailsMotivation: Neural networks are increasingly deployed on resource-constrained edge devices with CPUs (not GPUs), but CPUs are inefficient at executing large numbers of MAC operations. CPUs are better suited for control flow logic, so converting neural networks to logic flows could improve computation efficiency on CPU-based edge devices.

Method: 1) Convert neural networks into equivalent decision trees, 2) Select decision paths with constant leaves, 3) Compress these paths into logic flows consisting of if-else structures with reduced MAC operations. The approach transforms mathematical operations into control flow logic better suited for CPU execution.

Result: Experimental results show latency reduction of up to 14.9% on a simulated RISC-V CPU without any accuracy degradation. The code is open-sourced at https://github.com/TUDa-HWAI/NN2Logic.

Conclusion: Converting neural networks to logic flows is an effective approach to improve computation efficiency on CPU-based edge devices by leveraging CPUs’ strengths in executing control flow logic rather than mathematical operations.

Abstract: Neural networks have been successfully applied in various resource-constrained edge devices, where usually central processing units (CPUs) instead of graphics processing units exist due to limited power availability. State-of-the-art research still focuses on efficiently executing enormous numbers of multiply-accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are more suited to execute control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9 % on a simulated RISC-V CPU without any accuracy degradation. The code is open source at https://github.com/TUDa-HWAI/NN2Logic

[662] Low-redundancy Distillation for Continual Learning

RuiQi Liu, Boyu Diao, Libo Huang, Zijia An, Hangda Liu, Zhulin An, Yongjun Xu

Main category: cs.LG

TL;DR: LoRD is a novel continual learning method that reduces redundancy in student/teacher models and rehearsal samples to improve training efficiency while maintaining accuracy, inspired by the brain’s contextual gating mechanism.

DetailsMotivation: Current continual learning methods focus too much on accuracy while neglecting training efficiency, limiting practical applications. The paper draws inspiration from the brain's contextual gating mechanism that selectively filters neural information and updates memories.

Method: LoRD eliminates redundancy in three aspects: 1) compressing learnable parameters of student model, 2) pruning teacher model to retain prior knowledge, and 3) optimizing rehearsal sample selection and frequency. It uses distillation and rehearsal strategies to balance efficiency and precision.

Result: Extensive experiments across various benchmark datasets and environments show LoRD achieves the highest accuracy with the lowest training FLOPs, demonstrating superiority over other methods.

Conclusion: LoRD effectively balances training efficiency and model precision in continual learning by reducing redundancy in multiple dimensions, making continual learning more practical for real-world applications.

Abstract: Continual learning (CL) aims to learn new tasks without erasing previous knowledge. However, current CL methods primarily emphasize improving accuracy while often neglecting training efficiency, which consequently restricts their practical application. Drawing inspiration from the brain’s contextual gating mechanism, which selectively filters neural information and continuously updates past memories, we propose Low-redundancy Distillation (LoRD), a novel CL method that enhances model performance while maintaining training efficiency. This is achieved by eliminating redundancy in three aspects of CL: student model redundancy, teacher model redundancy, and rehearsal sample redundancy. By compressing the learnable parameters of the student model and pruning the teacher model, LoRD facilitates the retention and optimization of prior knowledge, effectively decoupling task-specific knowledge without manually assigning isolated parameters for each task. Furthermore, we optimize the selection of rehearsal samples and refine rehearsal frequency to improve training efficiency. Through a meticulous design of distillation and rehearsal strategies, LoRD effectively balances training efficiency and model precision. Extensive experimentation across various benchmark datasets and environments demonstrates LoRD’s superiority, achieving the highest accuracy with the lowest training FLOPs.

[663] One Model, Any Conjunctive Query: Graph Neural Networks for Answering Queries over Incomplete Knowledge Graphs

Krzysztof Olejniczak, Xingyue Huang, Mikhail Galkin, İsmail İlkan Ceylan

Main category: cs.LG

TL;DR: AnyCQ is a graph neural network model using reinforcement learning to answer conjunctive queries on incomplete knowledge graphs, generalizing from small training instances to handle large, complex queries.

DetailsMotivation: Knowledge graphs are often incomplete, creating a need for query answering systems that can predict answers not explicitly present in the graph but present in its theoretical completion.

Method: Proposes AnyCQ, a graph neural network trained with reinforcement learning to answer Boolean queries. The model is trained on small, simple instances but generalizes to large queries of arbitrary structure.

Result: AnyCQ outperforms existing approaches on challenging benchmarks, reliably classifying and retrieving answers to complex queries that other methods fail to handle. It also shows effective transfer to novel knowledge graphs when combined with link prediction models.

Conclusion: AnyCQ provides an effective solution for querying incomplete knowledge graphs, demonstrating strong generalization capabilities and transfer learning potential across different knowledge graphs.

Abstract: Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this paper, we formally introduce and study two query answering problems, namely, query answer classification and query answer retrieval. To solve these problems, we propose AnyCQ, a model that can classify answers to any conjunctive query on any knowledge graph. At the core of our framework lies a graph neural network trained using a reinforcement learning objective to answer Boolean queries. Trained only on simple, small instances, AnyCQ generalizes to large queries of arbitrary structure, reliably classifying and retrieving answers to queries that existing approaches fail to handle. This is empirically validated through our newly proposed, challenging benchmarks. Finally, we empirically show that AnyCQ can effectively transfer to completely novel knowledge graphs when equipped with an appropriate link prediction model, highlighting its potential for querying incomplete data.

[664] Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining

Jie Cheng, Ruixi Qiao, Yingwei Ma, Binhua Li, Gang Xiong, Qinghai Miao, Yongbin Li, Yisheng Lv

Main category: cs.LG

TL;DR: JOWA: Jointly-Optimized World-Action model for offline RL that uses world models and joint optimization to achieve strong generalization across Atari games with efficient transfer learning.

DetailsMotivation: To develop a generalist RL agent that can scale effectively from large heterogeneous datasets without heavy reliance on expert trajectories, addressing limitations of prior offline RL approaches in generalization to diverse unseen tasks.

Method: Jointly optimizes world-action model through shared transformer backbone on 6B tokens of Atari game data; uses provably efficient parallelizable planning algorithm to compensate for Q-value estimation error; enables stable temporal difference learning with large models.

Result: 150M parameter model achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming SOTA by 31.6%; transfers to novel games with only 5k offline fine-tuning data (~4 trajectories) per game.

Conclusion: JOWA demonstrates superior generalization and scaling properties for offline RL, showing world models can effectively scale decision-making from large datasets and transfer efficiently to new tasks with minimal data.

Abstract: A significant aspiration of offline reinforcement learning (RL) is to develop a generalist agent with high capabilities from large and heterogeneous datasets. However, prior approaches that scale offline RL either rely heavily on expert trajectories or struggle to generalize to diverse unseen tasks. Inspired by the excellent generalization of world models in conditional video generation, we explore the potential of image observation-based world model for scaling offline RL and enhancing generalization on novel tasks. In this paper, we introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on multiple Atari games with 6 billion tokens of data to learn general-purpose representation and decision-making ability. Our method jointly optimizes a world-action model through a shared transformer backbone, which stabilizes temporal difference learning with large models during pretraining. Moreover, we propose a provably efficient and parallelizable planning algorithm to compensate for the Q-value estimation error and thus search out better policies. Experimental results indicate that our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average. Furthermore, JOWA scales favorably with model capacity and can sample-efficiently transfer to novel games using only 5k offline fine-tuning data (approximately 4 trajectories) per game, demonstrating superior generalization. We will release codes and model weights at https://github.com/CJReinforce/JOWA

[665] WL Tests Are Far from All We Need: Revisiting WL-Test Hardness and GNN Expressive Power from a Distributed Computation Perspective

Guanyu Cui, Yuhe Guo, Zhewei Wei, Hsin-Hao Su

Main category: cs.LG

TL;DR: This paper examines the limitations of using Weisfeiler-Lehman tests for understanding GNN expressivity and proposes an alternative framework based on extended CONGEST models.

DetailsMotivation: The paper identifies two gaps in current approaches to studying GNN expressivity: (1) uncertainty about whether WL tests are sufficiently primitive for understanding GNN expressivity, and (2) misalignment between WL-induced equivalence and characterizing the function classes that GNNs can approximate or compute.

Method: The authors first strengthen hardness results for vanilla WL tests, showing they’re not primitive enough for constant-depth GNNs. Then they propose an alternative framework based on an extended CONGEST model with explicit preprocessing, identifying implicit shortcuts in prior analyses and establishing results for WL tests with virtual nodes/edges.

Result: The paper demonstrates limitations of WL tests for understanding GNN expressivity and provides a new framework that better captures the computational capabilities of GNNs, particularly in settings with virtual nodes and edges.

Conclusion: Current WL-based approaches have fundamental limitations for understanding GNN expressivity, and the proposed CONGEST-based framework offers a more appropriate alternative for characterizing what functions GNNs can compute.

Abstract: The expressive power of graph neural networks (GNNs) is often studied through their relationship to the Weisfeiler-Lehman (WL) tests. Despite its influence, this perspective leaves two gaps: (i) it is unclear whether WL tests are sufficiently primitive for understanding GNN expressivity, and (ii) WL-induced equivalence does not align well with characterizing the function classes that GNNs can approximate or compute. We attempt to address both gaps. First, we strengthen hardness results for the vanilla WL test, showing that in many settings it is not primitive enough to be implemented by constant-depth GNNs. Second, we propose an alternative framework for studying GNN expressivity based on an extended CONGEST model with an explicit preprocessing phase. Within this framework, we identify implicit shortcuts introduced in prior analyses and establish further results for WL tests in settings where graphs are augmented with virtual nodes and virtual edges.

[666] SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling

Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

Main category: cs.LG

TL;DR: LLM agents can use off-policy RL with hindsight relabeling to improve learning in sequential decision-making tasks, outperforming on-policy methods and enabling autonomous goal-directed behavior.

DetailsMotivation: Current LLM agents primarily use on-policy RL methods, limiting their ability to use important techniques like experience replay and hindsight relabeling that could enable more efficient learning and autonomous goal-directed behavior.

Method: Adapts Soft Actor-Critic (SAC) and hindsight relabeling techniques to LLM agents, enabling off-policy learning with experience replay for more efficient exploration and exploitation.
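
The hindsight-relabeling half of the recipe can be sketched generically (HER-style relabeling over a list-based replay buffer; the transition fields below are illustrative assumptions, not the paper's implementation):

```python
import copy
import random

def store_with_hindsight(buffer, episode, achieved_goals, k=4):
    """Store an episode plus k relabeled copies in which the goal is replaced
    by an outcome the agent actually reached (sketch with assumed fields)."""
    buffer.extend(episode)                            # original transitions
    for _ in range(k):
        new_goal = random.choice(achieved_goals)      # pretend this was the goal
        for t in episode:
            t2 = copy.deepcopy(t)
            t2["goal"] = new_goal
            t2["reward"] = 1.0 if t2["achieved"] == new_goal else 0.0
            buffer.append(t2)
```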

Result: The method outperforms on-policy approaches in multi-goal RL environments and provides a foundation for developing autotelic LLM agents that can learn autonomously online.

Conclusion: Off-policy RL with hindsight relabeling is a promising direction for LLM agents, enabling more efficient learning and paving the way for autonomous goal-directed agents.

Abstract: The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e. autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents. Our method not only paves the path towards autotelic LLM agents that learn online but can also outperform on-policy methods in more classic multi-goal RL environments.

[667] Do graph neural network states contain graph properties?

Tom Pelletreau-Duris, Ruud van Bakel, Michael Cochez

Main category: cs.LG

TL;DR: A model-agnostic explainability pipeline for Graph Neural Networks using diagnostic classifiers to probe learned representations for graph-theoretic properties.

DetailsMotivation: Graph Neural Networks (GNNs) have complex internal representations due to their non-Euclidean nature, making existing explainability methods difficult to apply. While instance-based explanation methods exist for GNNs, there's a lack of model-based methods that probe GNN embeddings for structural graph properties.

Method: The paper presents a model-agnostic explainability pipeline for GNNs using diagnostic classifiers. It proposes using graph-theoretic properties as features to study the emergence of representations in GNNs, allowing probing and interpretation of learned representations across various architectures and datasets.
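
A minimal diagnostic-classifier sketch, assuming frozen node embeddings have already been extracted from a GNN layer and that node degree is the structural property being probed (random placeholder data stands in for both):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 64))           # placeholder for frozen GNN node embeddings
degree = rng.integers(1, 10, size=500)   # placeholder graph-theoretic target

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, degree, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(Z_tr, y_tr)  # linear probe on top of frozen features
print("probe R^2 on held-out nodes:", probe.score(Z_te, y_te))
```

A high held-out score would suggest the property is linearly decodable from that layer; near-zero (as with the random placeholders here) suggests it is not.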

Result: The pipeline enables systematic probing of GNN representations for structural graph properties, providing insights into what information is captured in different layers and how representations evolve during training.

Conclusion: This approach refines understanding and trust in GNN models by providing interpretable insights into their learned representations through graph-theoretic property analysis.

Abstract: Deep neural networks (DNNs) achieve state-of-the-art performance on many tasks, but this often requires increasingly larger model sizes, which in turn leads to more complex internal representations. Explainability techniques (XAI) have made remarkable progress in the interpretability of ML models. However, the non-Euclidean nature of Graph Neural Networks (GNNs) makes it difficult to reuse already existing XAI methods. While other works have focused on instance-based explanation methods for GNNs, very few have investigated model-based methods and, to our knowledge, none have tried to probe the embedding of the GNNs for structural graph properties. In this paper we present a model-agnostic explainability pipeline for Graph Neural Networks (GNNs) employing diagnostic classifiers. We propose to consider graph-theoretic properties as the features of choice for studying the emergence of representations in GNNs. This pipeline aims to probe and interpret the learned representations in GNNs across various architectures and datasets, refining our understanding and trust in these models.

[668] Tight Lower Bounds and Improved Convergence in Performative Prediction

Pedram Khorsandi, Rushil Gupta, Mehrnaz Mofakhami, Simon Lacoste-Julien, Gauthier Gidel

Main category: cs.LG

TL;DR: Extends Repeated Risk Minimization (RRM) framework using historical datasets to create Affine Risk Minimizers, enabling convergence to performatively stable points for broader problem classes with improved bounds and faster convergence.

DetailsMotivation: Performative prediction addresses distribution shifts caused by model deployment, requiring stable solutions in evolving environments. Current RRM methods have limitations in convergence speed and applicability to certain problem classes.

Method: Extends RRM by utilizing historical datasets from previous retraining snapshots to create Affine Risk Minimizers. Introduces new upper bounds for methods using only final iteration data and analyzes tightness of bounds.

Result: Proves tightness of new and existing bounds, shows historical datasets can surpass lower bounds for last iterate RRM, demonstrates faster convergence on benchmarks, and provides first lower bound analysis for RRM within Affine Risk Minimizers class.

Conclusion: Historical dataset utilization improves convergence to performatively stable points, expands applicable problem classes, and provides theoretical foundations for better performative prediction algorithms with quantifiable improvements.

Abstract: Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

[669] ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization

Han Fang, Paul Weng, Yutong Ban

Main category: cs.LG

TL;DR: ASAP framework improves generalization of deep reinforcement learning for combinatorial optimization by decomposing decision-making into proposal and selection phases, enabling rapid online adaptation to distribution shifts.

DetailsMotivation: Deep reinforcement learning for combinatorial optimization problems suffers from brittleness when facing distribution shifts, limiting practical deployment in real-world scenarios where problem distributions may change.

Method: Proposes Adaptive Selection After Proposal (ASAP) framework with two-phase decision-making: proposal policy as robust filter and selection policy as adaptable decision maker. Uses Model-Agnostic Meta-Learning (MAML) to prime models for fast adaptation to new distributions.

Result: ASAP improves generalization capability of state-of-the-art baselines on 3D-BPP, TSP, and CVRP problems, achieving superior online adaptation on out-of-distribution instances.

Conclusion: The Satisficing Generalization Edge principle and ASAP framework provide an effective approach to address distribution shift challenges in neural combinatorial optimization solvers.

Abstract: Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D Bin Packing Problem (3D-BPP), Traveling Salesman Problem (TSP), or Vehicle Routing Problem (VRP), but these neural solvers often exhibit brittleness when facing distribution shifts. To address this issue, we uncover the Satisficing Generalization Edge, which we validate both theoretically and experimentally: identifying a set of promising actions is inherently more generalizable than selecting the single optimal action. To exploit this property, we propose Adaptive Selection After Proposal (ASAP), a generic framework that decomposes the decision-making process into two distinct phases: a proposal policy that acts as a robust filter, and a selection policy as an adaptable decision maker. This architecture enables a highly effective online adaptation strategy where the selection policy can be rapidly fine-tuned on a new distribution. Concretely, we introduce a two-phase training framework enhanced by Model-Agnostic Meta-Learning (MAML) to prime the model for fast adaptation. Extensive experiments on 3D-BPP, TSP, and CVRP demonstrate that ASAP improves the generalization capability of state-of-the-art baselines and achieves superior online adaptation on out-of-distribution instances.

[670] Neural Force Field: Few-shot Learning of Generalized Physical Reasoning

Shiqian Li, Ruihong Shen, Yaoyu Tao, Chi Zhang, Yixin Zhu

Main category: cs.LG

TL;DR: Neural Force Field (NFF) extends Neural ODEs to learn physical dynamics through continuous force field representations, enabling efficient learning from few examples and strong generalization to unseen scenarios.

DetailsMotivation: Current AI models struggle with physical reasoning generalization, especially in OOD settings, due to inability to abstract core physical principles. The goal is to develop representations that can efficiently learn and generalize physical dynamics from minimal data.

Method: NFF extends Neural ODEs to learn complex object interactions through force field representations, which are integrated via ODE solvers to predict trajectories. Unlike discrete latent space approaches, NFF captures fundamental physics concepts (gravity, support, collision) in continuous explicit force fields.
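
A toy sketch of the force-field idea under simplifying assumptions (a small network predicts per-object forces and a plain Euler integrator stands in for the ODE solver; this is not the NFF architecture):

```python
import torch
import torch.nn as nn

class ForceField(nn.Module):
    def __init__(self, dim=4):                      # toy state: (x, y, vx, vy)
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 2))

    def forward(self, state):
        return self.net(state)                      # predicted force on the object

def rollout(field, state, dt=0.01, steps=100):
    traj = [state]
    for _ in range(steps):
        pos, vel = state[..., :2], state[..., 2:]
        acc = field(state)                          # force treated as mass-normalized accel.
        state = torch.cat([pos + dt * vel, vel + dt * acc], dim=-1)
        traj.append(state)
    return torch.stack(traj)

traj = rollout(ForceField(), torch.tensor([[0.0, 1.0, 0.5, 0.0]]))
print(traj.shape)                                   # (steps + 1, 1, 4)
```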

Result: Experiments on three challenging physical reasoning tasks show NFF achieves strong generalization to unseen scenarios with only few training examples. The physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement.

Conclusion: Incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.

Abstract: Physical reasoning is a remarkable human ability that enables rapid learning and generalization from limited experience. Current AI models, despite extensive training, still struggle to achieve similar generalization, especially in Out-of-distribution (OOD) settings. This limitation stems from their inability to abstract core physical principles from observations. A key challenge is developing representations that can efficiently learn and generalize physical dynamics from minimal data. Here we present Neural Force Field (NFF), a framework extending Neural Ordinary Differential Equation (NODE) to learn complex object interactions through force field representations, which can be efficiently integrated through an Ordinary Differential Equation (ODE) solver to predict object trajectories. Unlike existing approaches that rely on discrete latent spaces, NFF captures fundamental physical concepts such as gravity, support, and collision in continuous explicit force fields. Experiments on three challenging physical reasoning tasks demonstrate that NFF, trained with only a few examples, achieves strong generalization to unseen scenarios. This physics-grounded representation enables efficient forward-backward planning and rapid adaptation through interactive refinement. Our work suggests that incorporating physics-inspired representations into learning systems can help bridge the gap between artificial and human physical reasoning capabilities.

[671] Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs

Xiao Zhang, Yaoyao Ding, Bolin Sun, Yang Hu, Tatiana Shpeisman, Gennady Pekhimenko

Main category: cs.LG

TL;DR: Hexcute is a compiler framework that automates tensor layout synthesis for GPU programming while giving explicit control over dataflow and pipelining, reducing programming effort while matching performance of existing libraries.

DetailsMotivation: Current GPU programming approaches have trade-offs: low-level frameworks like CUTLASS require manual layout specification (high programming effort), while high-level languages like Triton use heuristics that don't generalize well to complex operators. There's a need to balance expressiveness and programmability.

Method: Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. It automates layout synthesis while providing explicit programmer control over dataflow and pipelining strategies.

Result: Hexcute matches cuBLAS and FlashAttention performance on GEMM and Attention operators while reducing code by 1.27-7.94× vs CUTLASS. For MoE operators, it achieves 6.46× speedup over Triton. In end-to-end vLLM evaluations, it delivers up to 2.60× speedup on DeepSeek-R1-AWQ and 2.04× on Mamba-based models.

Conclusion: Hexcute successfully balances expressiveness and programmability by automating layout synthesis while maintaining control over dataflow and pipelining, enabling efficient GPU programming for deep learning applications with reduced development effort.

Abstract: Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping functions describing tensors on GPUs are known as tensor layouts. Low-level programming frameworks, such as CUTLASS and Hidet, provide expressive layout abstractions but often require considerable programming effort to manually specify optimal layouts. High-level GPU programming languages, such as Triton, rely on compiler heuristics to generate dataflow, layouts, and pipelining strategies in GPU programs. However, the heuristics for dataflow and pipelining strategies are not generalizable to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. This approach enables systematic exploration of optimal layouts and instructions. Our evaluation shows that Hexcute matches the performance of libraries like cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing the amount of code by 1.27$\times$-7.94$\times$ compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46$\times$ over Triton. In the end-to-end evaluations of vLLM, Hexcute delivers up to 2.60$\times$ speedup on DeepSeek-R1-AWQ and 2.04$\times$ on a Mamba-based model.

[672] Redefining Neural Operators in $d+1$ Dimensions for Embedding Evolution

Haoze Song, Zhihao Li, Xiaobo Zhang, Zecheng Gan, Zhilu Lai, Wei Wang

Main category: cs.LG

TL;DR: SKNO introduces an auxiliary dimension to model embedding evolution in neural operators, improving performance on challenging PDE benchmarks while maintaining resolution invariance.

DetailsMotivation: Current neural operator architectures focus on kernel parameterization over original physical domains, leading to brute-force embedding lengthening for better approximation at high computational cost. Embedding evolution remains largely unexplored.

Method: Introduces auxiliary dimension to model embedding evolution in operator form, redefining neural operators in d+1 dimensions. Develops Schrödingerised Kernel Neural Operator (SKNO) using Fourier-based operators to model d+1 dimensional evolution.

Result: SKNO consistently outperforms other baselines across more than ten challenging benchmarks from 1D heat equation to highly nonlinear 3D Rayleigh-Taylor instability. Shows resolution invariance under mixed-resolution training and super-resolution inference, with zero-shot generalization to unseen temporal regimes.

Conclusion: The auxiliary dimension approach effectively models embedding evolution, improving neural operator performance while maintaining computational efficiency. Provides broader design choices for lifting and recovery operators.

Abstract: Neural Operators (NOs) have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely used in universally approximating architectures. Following the original formulation, most advancements focus on designing better parameterizations for the kernel over the original physical domain (with $d$ spatial dimensions, $d \in \{1, 2, 3, \ldots\}$). In contrast, embedding evolution remains largely unexplored, which often drives models toward brute-force embedding lengthening to improve approximation, but at the cost of substantially increased computation. In this paper, we introduce an auxiliary dimension that explicitly models embedding evolution in operator form, thereby redefining the NO framework in $d+1$ dimensions (the original $d$ dimensions plus one auxiliary dimension). Under this formulation, we develop a Schrödingerised Kernel Neural Operator (SKNO), which leverages Fourier-based operators to model the $d+1$ dimensional evolution. Across more than ten increasingly challenging benchmarks, ranging from the 1D heat equation to the highly nonlinear 3D Rayleigh-Taylor instability, SKNO consistently outperforms other baselines. We further validate its resolution invariance under mixed-resolution training and super-resolution inference, and evaluate zero-shot generalization to unseen temporal regimes. In addition, we present a broader set of design choices for the lifting and recovery operators, demonstrating their impact on SKNO’s predictive performance.

[673] Predictability-Aware Compression and Decompression Framework for Multichannel Time Series Data with Latent Seasonality

Ziqi Liu, Pei Zeng, Yi Ding

Main category: cs.LG

TL;DR: A predictability-aware compression-decompression framework for multichannel time series prediction that reduces runtime and communication costs while maintaining accuracy across diverse predictors.

DetailsMotivation: Real-world multichannel time series prediction needs efficiency across edge and cloud environments, requiring channel compression to reduce runtime and communication costs while preserving prediction accuracy.

Method: Uses a circular seasonal key matrix with orthogonality to capture time series predictability during compression and mitigate reconstruction errors during decompression with more realistic data assumptions.

Result: The framework is theoretically time-efficient and accuracy-preserving under many channels, with experiments on six datasets showing superior overall performance considering both accuracy and runtime.

Conclusion: The proposed method achieves efficient multichannel time series prediction with strong compatibility across diverse predictors while balancing accuracy and runtime.

Abstract: Real-world multichannel time series prediction faces growing demands for efficiency across edge and cloud environments, making channel compression a timely and essential problem. Motivated by the success of Multiple-Input Multiple-Output (MIMO) methods in signal processing, we propose a predictability-aware compression-decompression framework to reduce runtime, decrease communication cost, and maintain prediction accuracy across diverse predictors. The core idea involves using a circular seasonal key matrix with orthogonality to capture underlying time series predictability during compression and to mitigate reconstruction errors during decompression by introducing more realistic data assumptions. Theoretical analyses show that the proposed framework is both time-efficient and accuracy-preserving under a large number of channels. Extensive experiments on six datasets across various predictors demonstrate that the proposed method achieves superior overall performance by jointly considering prediction accuracy and runtime, while maintaining strong compatibility with diverse predictors.

[674] scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics

Davide D’Ascenzo, Sebastiano Cultrera di Montesano

Main category: cs.LG

TL;DR: scDataset: A PyTorch data loader for efficient training on massive single-cell datasets that combines block sampling and batched fetching to achieve quasi-random sampling with high I/O efficiency while maintaining minibatch diversity.

DetailsMotivation: Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk due to memory constraints. Random sampling provides data diversity but is prohibitively slow due to random access overhead, while sequential streaming achieves high throughput but introduces biases that degrade model performance.

Method: scDataset combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. It works directly with AnnData files and provides seamless integration across diverse storage formats.
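
A sketch of the block-sampling and batched-fetching idea at the index level (hypothetical parameter names, not the scDataset API): block order is randomized, a few blocks are fetched together with one contiguous read each, and the fetched buffer is shuffled before being cut into minibatches.

```python
import numpy as np

def quasi_random_batches(n_rows, block_size=1024, blocks_per_fetch=8,
                         batch_size=64, seed=0):
    """Yield row-index minibatches mixing random block order with contiguous
    within-block reads (sketch, not the scDataset API)."""
    rng = np.random.default_rng(seed)
    starts = rng.permutation(np.arange(0, n_rows, block_size))   # random block order
    for i in range(0, len(starts), blocks_per_fetch):
        chosen = starts[i:i + blocks_per_fetch]                  # one fetch = a few blocks
        buffer = np.concatenate([np.arange(s, min(s + block_size, n_rows))
                                 for s in chosen])
        buffer = rng.permutation(buffer)                         # shuffle rows within the fetch
        for j in range(0, len(buffer), batch_size):
            yield buffer[j:j + batch_size]

for batch in quasi_random_batches(10_000):
    pass   # e.g. X = adata.X[np.sort(batch)] for an on-disk AnnData object
```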

Result: On Tahoe-100M (100 million cells), scDataset achieves more than two orders of magnitude speedup compared to true random sampling. The method provides theoretical bounds on minibatch diversity and empirically matches the performance of true random sampling across multiple classification tasks.

Conclusion: scDataset enables efficient training from on-disk data for massive single-cell datasets while maintaining the performance benefits of random sampling through its quasi-random sampling approach.

Abstract: Training deep learning models on single-cell datasets with hundreds of millions of cells requires loading data from disk, as these datasets exceed available memory. While random sampling provides the data diversity needed for effective training, it is prohibitively slow due to the random access pattern overhead, whereas sequential streaming achieves high throughput but introduces biases that degrade model performance. We present scDataset, a PyTorch data loader that enables efficient training from on-disk data with seamless integration across diverse storage formats. Our approach combines block sampling and batched fetching to achieve quasi-random sampling that balances I/O efficiency with minibatch diversity. On Tahoe-100M, a dataset of 100 million cells, scDataset achieves more than two orders of magnitude speedup compared to true random sampling while working directly with AnnData files. We provide theoretical bounds on minibatch diversity and empirically show that scDataset matches the performance of true random sampling across multiple classification tasks.

[675] Near-Optimal Online Deployment and Routing for Streaming LLMs

Shaoang Li, Jian Li

Main category: cs.LG

TL;DR: StageRoute is an online algorithm for managing streaming inventory of LLMs under concurrency caps and cost budgets, combining stage-wise deployment with per-query routing.

DetailsMotivation: The rapid evolution of LLMs creates challenges for providers who must manage streaming inventory under strict concurrency caps and per-query cost budgets, requiring efficient online decision-making.

Method: StageRoute uses a hierarchical approach: (i) optimistically selects up to M_max models for next stage using reward upper-confidence and cost lower-confidence bounds, (ii) routes each query by solving budget- and throughput-constrained bandit subproblem over deployed models.
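
A heavily simplified sketch of the stage-wise deployment step (the confidence bonus and bookkeeping are illustrative assumptions, and the per-query routing bandit is omitted): candidates are ranked by a reward upper-confidence bound and kept only if their cost lower-confidence bound still fits the budget.

```python
import math

def select_models(stats, t, m_max, cost_budget):
    """stats: dict model -> (n_pulls, mean_reward, mean_cost); returns the
    optimistically chosen deployment set (simplified sketch)."""
    scored = []
    for model, (n, r, c) in stats.items():
        bonus = math.sqrt(2 * math.log(max(t, 2)) / max(n, 1))
        r_ucb, c_lcb = r + bonus, max(c - bonus, 0.0)
        if c_lcb <= cost_budget:                 # optimistic feasibility check
            scored.append((r_ucb, model))
    scored.sort(reverse=True)
    return [m for _, m in scored[:m_max]]

stats = {"gpt-large": (40, 0.82, 1.0), "mid": (60, 0.74, 0.3), "small": (90, 0.55, 0.05)}
print(select_models(stats, t=200, m_max=2, cost_budget=0.8))
```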

Result: Proves regret bound of Õ(T^{2/3}) with matching lower bound (near-optimal), and empirical validation shows StageRoute tracks a strong oracle under tight budgets across diverse workloads.

Conclusion: StageRoute provides an effective solution for managing streaming LLM inventory with theoretical guarantees and practical performance.

Abstract: The rapid pace at which new large language models (LLMs) appear, and older ones become obsolete, forces providers to manage a streaming inventory under a strict concurrency cap and per-query cost budgets. We cast this as an online decision problem that couples stage-wise deployment (at fixed maintenance windows) with per-query routing among live models. We introduce StageRoute, a hierarchical algorithm that (i) optimistically selects up to $M_{\max}$ models for the next stage using reward upper-confidence and cost lower-confidence bounds, and (ii) routes each incoming query by solving a budget- and throughput-constrained bandit subproblem over the deployed set. We prove a regret of $\tilde{\mathcal{O}}(T^{2/3})$ with a matching lower bound, establishing near-optimality, and validate the theory empirically: StageRoute tracks a strong oracle under tight budgets across diverse workloads.

[676] Dealing with Uncertainty in Contextual Anomaly Detection

Luca Bindini, Lorenzo Perini, Stefano Nistri, Jesse Davis, Paolo Frasconi

Main category: cs.LG

TL;DR: Proposes Normalcy Score (NS), a contextual anomaly detection framework using heteroscedastic Gaussian process regression to model aleatoric and epistemic uncertainties, providing confidence intervals for anomaly assessments.

DetailsMotivation: Contextual anomaly detection needs to identify anomalies in target variables conditioned on contextual variables that influence normalcy but aren't themselves anomalous. Existing methods lack proper uncertainty quantification which is crucial for reliable decision-making, especially in sensitive domains like healthcare.

Method: Uses heteroscedastic Gaussian process regression to model both aleatoric (data noise) and epistemic (model uncertainty) uncertainties. Treats Z-score as a random variable to provide confidence intervals that reflect reliability of anomaly assessments.
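
One way to read this construction, sketched under the assumption that the heteroscedastic GP supplies a predictive mean, an epistemic variance, and an aleatoric noise variance for the target given its context (the interval rule below is an interpretation, not the authors' formula):

```python
import numpy as np

def normalcy_interval(y, mu, var_f, var_n, z=1.96):
    """Return (z_low, z_mid, z_high) for observation y: the central Z-score uses
    the aleatoric scale, and the interval propagates epistemic uncertainty."""
    sd_n = np.sqrt(var_n)                 # aleatoric (noise) scale
    z_mid = (y - mu) / sd_n               # central Z-score
    shift = z * np.sqrt(var_f) / sd_n     # epistemic wiggle in the predicted mean
    return z_mid - shift, z_mid, z_mid + shift

print(normalcy_interval(y=3.0, mu=1.0, var_f=0.25, var_n=1.0))
```

A wide interval then signals that the anomaly verdict itself is unreliable, which is what enables the uncertainty-driven decision rule.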

Result: Outperforms state-of-the-art CAD methods on benchmark datasets and real-world cardiology applications in both detection accuracy and interpretability. Confidence intervals enable adaptive, uncertainty-driven decision-making.

Conclusion: The NS framework provides a principled approach to contextual anomaly detection with uncertainty quantification, offering improved accuracy and interpretability, particularly valuable in domains requiring reliable decision-making like healthcare.

Abstract: Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.

[677] SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

Jeyoung Lee, Hochul Kang

Main category: cs.LG

TL;DR: SDSC is a structure-aware metric for time series SSL that improves semantic alignment by quantifying structural agreement between signals, outperforming MSE in forecasting and classification tasks.

DetailsMotivation: Current SSL methods for signals use distance-based objectives like MSE that are sensitive to amplitude, invariant to waveform polarity, and unbounded, which hinders semantic alignment and reduces interpretability.

Method: Proposes Signal Dice Similarity Coefficient (SDSC) based on Dice Similarity Coefficient, quantifying structural agreement via intersection of signed amplitudes. Uses differentiable approximation of Heaviside function for gradient optimization, with hybrid loss combining SDSC and MSE.
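
One plausible reading of the metric (the exact definition is in the paper): per time step, signed amplitudes intersect only where the two signals share a sign, the overlap is the smaller magnitude, and a sigmoid approximates the Heaviside step so that 1 - SDSC stays differentiable.

```python
import torch

def sdsc(x, y, eps=1e-8, tau=50.0):
    same_sign = torch.sigmoid(tau * x * y)             # smooth stand-in for Heaviside(x*y)
    overlap = same_sign * torch.minimum(x.abs(), y.abs())
    return 2.0 * overlap.sum() / (x.abs().sum() + y.abs().sum() + eps)

t = torch.linspace(0, 2 * torch.pi, 200)
clean, noisy = torch.sin(t), torch.sin(t) + 0.1 * torch.randn(200)
loss = 1.0 - sdsc(clean, noisy)                         # usable as a training loss
print(float(loss))
```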

Result: SDSC-based pre-training achieves comparable or improved performance over MSE, especially in in-domain and low-resource scenarios. Structural fidelity enhances semantic representation quality.

Conclusion: Structure-aware metrics like SDSC are viable alternatives to conventional distance-based methods for signal representation learning, improving semantic alignment and interpretability.

Abstract: We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC). Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.

[678] Stackelberg Coupling of Online Representation Learning and Reinforcement Learning

Fernando Martinez, Tao Li, Yingdong Lu, Juntao Chen

Main category: cs.LG

TL;DR: SCORER introduces a Stackelberg game framework for RL where Q-function (leader) updates less frequently and perception network (follower) adapts more frequently to reduce Bellman error variance and stabilize representation learning.

DetailsMotivation: Deep Q-learning suffers from instability due to co-adaptation between representation and value learning, where shifting representations must adapt to non-stationary value targets and value estimates depend on these changing representations, compounded by high variance in bootstrapped targets causing bias.

Method: SCORER frames representation and Q-learning as hierarchical game agents: Q-function as leader updates less frequently, perception network as follower adapts more frequently to minimize Bellman error variance given leader’s strategy, solved via bi-level optimization approximated by two-timescale algorithm.
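
The asymmetric update schedule can be sketched as follows, assuming PyTorch-style modules and optimizers and a hypothetical `bellman_error` helper that returns per-sample TD errors (a skeleton of the idea, not the authors' code):

```python
LEADER_PERIOD = 4  # hypothetical choice: the Q-head (leader) updates less often

def scorer_step(step, batch, encoder, q_head, enc_opt, q_opt, bellman_error):
    """One asymmetric update with assumed components (sketch)."""
    # Follower: the encoder adapts every step, minimizing Bellman-error variance.
    td = bellman_error(q_head(encoder(batch.obs)), batch)
    enc_opt.zero_grad()
    td.var().backward()
    enc_opt.step()

    # Leader: the Q-head commits to its strategy, minimizing the mean squared
    # Bellman error and updating only every LEADER_PERIOD steps.
    if step % LEADER_PERIOD == 0:
        td = bellman_error(q_head(encoder(batch.obs).detach()), batch)
        q_opt.zero_grad()
        (td ** 2).mean().backward()
        q_opt.step()
```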

Result: Extensive experiments on DQN and variants show performance gains from algorithmic insight rather than model complexity, demonstrating stable co-adaptation and reduced bias through asymmetric learning dynamics.

Conclusion: The Stackelberg game framework enables stable co-adaptation between representation and value learning in RL by creating asymmetric update dynamics that reduce Bellman error variance and improve learning stability.

Abstract: Deep Q-learning jointly learns representations and values within monolithic networks, promising beneficial co-adaptation between features and value estimates. Although this architecture has attained substantial success, the coupling between representation and value learning creates instability as representations must constantly adapt to non-stationary value targets, while value estimates depend on these shifting representations. This is compounded by high variance in bootstrapped targets, which causes bias in value estimation in off-policy methods. We introduce Stackelberg Coupled Representation and Reinforcement Learning (SCORER), a framework for value-based RL that views representation and Q-learning as two strategic agents in a hierarchical game. SCORER models the Q-function as the leader, which commits to its strategy by updating less frequently, while the perception network (encoder) acts as the follower, adapting more frequently to learn representations that minimize Bellman error variance given the leader’s committed strategy. Through this division of labor, the Q-function minimizes MSBE while perception minimizes its variance, thereby reducing bias accordingly, with asymmetric updates allowing stable co-adaptation, unlike simultaneous parameter updates in monolithic solutions. Our proposed SCORER framework leads to a bi-level optimization problem whose solution is approximated by a two-timescale algorithm that creates an asymmetric learning dynamic between the two players. Extensive experiments on DQN and its variants demonstrate that gains stem from algorithmic insight rather than model complexity.

[679] Residual Reservoir Memory Networks

Matteo Pinna, Andrea Ceni, Claudio Gallicchio

Main category: cs.LG

TL;DR: Residual Reservoir Memory Networks (ResRMNs) combine linear and non-linear reservoirs with residual orthogonal connections for enhanced long-term temporal propagation in time-series and 1-D classification tasks.

DetailsMotivation: To improve long-term memory and temporal information propagation in Reservoir Computing models by introducing residual connections along the temporal dimension, addressing limitations of conventional RC models in capturing long-range dependencies.

Method: Proposes ResRMNs that combine a linear memory reservoir with a non-linear reservoir using residual orthogonal connections along temporal dimension. Uses linear stability analysis to study reservoir state dynamics and investigates diverse configurations for temporal residual connections.
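
One plausible form of a residual orthogonal temporal connection in an untrained reservoir (the mixing weights `alpha`/`beta` and the exact coupling are illustrative assumptions; the paper studies several configurations):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_res = 3, 100
W_in = rng.uniform(-0.1, 0.1, size=(d_res, d_in))     # untrained input weights
W = rng.normal(size=(d_res, d_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))              # spectral radius below 1
O, _ = np.linalg.qr(rng.normal(size=(d_res, d_res)))   # random orthogonal matrix
alpha, beta = 0.9, 0.1                                  # hypothetical mixing weights

def step(h, u):
    # Orthogonal residual branch carries the past state losslessly;
    # the tanh reservoir adds a non-linear correction.
    return alpha * (O @ h) + beta * np.tanh(W_in @ u + W @ h)

h = np.zeros(d_res)
for u in rng.normal(size=(50, d_in)):                   # drive with a random sequence
    h = step(h, u)
```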

Result: Experimental results on time-series and pixel-level 1-D classification tasks show advantages over conventional RC models, demonstrating improved performance through enhanced long-term propagation of input information.

Conclusion: ResRMNs with residual orthogonal connections effectively enhance long-term temporal information propagation in reservoir computing, outperforming conventional RC models on time-series and 1-D classification tasks.

Abstract: We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Reservoir Memory Networks (ResRMNs). ResRMN combines a linear memory reservoir with a non-linear reservoir, where the latter is based on residual orthogonal connections along the temporal dimension for enhanced long-term propagation of the input. The resulting reservoir state dynamics are studied through the lens of linear stability analysis, and we investigate diverse configurations for the temporal residual connections. The proposed approach is empirically assessed on time-series and pixel-level 1-D classification tasks. Our experimental results highlight the advantages of the proposed approach over other conventional RC models.

[680] Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks

Matteo Pinna, Andrea Ceni, Claudio Gallicchio

Main category: cs.LG

TL;DR: Deep Residual Echo State Networks (DeepResESNs) improve long-term information processing in untrained RNNs through temporal residual connections and hierarchical layers, enhancing memory capacity and temporal modeling.

DetailsMotivation: Traditional Echo State Networks (ESNs) within the Reservoir Computing framework are efficient but struggle with long-term information processing, motivating the development of deeper architectures with better temporal modeling capabilities.

Method: Introduces Deep Residual Echo State Networks with temporal residual connections in a hierarchical architecture, exploring different orthogonal configurations (randomly generated and fixed-structure) and providing mathematical analysis for stable dynamics.

Result: The proposed DeepResESNs significantly boost memory capacity and long-term temporal modeling, outperforming traditional shallow and deep Reservoir Computing approaches on various time series tasks.

Conclusion: DeepResESNs with temporal residual connections provide an effective solution for improving long-term information processing in untrained recurrent neural networks while maintaining computational efficiency.

Abstract: Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.

[681] Robust Filter Attention: Self-Attention as a Parallel State Estimator

Peter Racioppo

Main category: cs.LG

TL;DR: RFA reformulates self-attention as parallel robust filtering under an SDE prior, using uncertainty-aware weighting to improve temporal consistency and extrapolation across context lengths while maintaining standard attention complexity.

DetailsMotivation: To create an attention mechanism that combines the benefits of existing positional encodings (RoPE-style rotational structure) with long-context stability through explicit modeling of dissipation and diffusion, while maintaining computational efficiency.

Method: Reformulates self-attention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. Uses isotropic constraints on dynamics and noise to match standard attention complexity.

Result: Uncertainty-aware weighting induces specialization into distinct filtering regimes across heads, improving temporal consistency and extrapolation across varying context lengths while maintaining O(N²d) time and O(N² + Nd) memory complexity.

Conclusion: RFA provides a principled attention mechanism that integrates advantages of existing positional encodings, offers improved temporal consistency, and maintains computational efficiency comparable to standard attention.

Abstract: We introduce Robust Filter Attention (RFA), an attention mechanism that reformulates self-attention as parallel robust filtering under a latent stochastic differential equation (SDE) prior, where analytically propagated uncertainty defines a time-dependent precision prior over attention weights. This formulation integrates key advantages of existing positional encodings: it preserves RoPE-style rotational structure while achieving long-context stability through explicit modeling of dissipation and diffusion. By imposing isotropic constraints on the dynamics and noise, RFA matches the $O(N^2 d)$ time and $O(N^2 + Nd)$ memory complexity of standard attention. Empirically, we find that uncertainty-aware weighting induces specialization into distinct filtering regimes across heads, improving temporal consistency and extrapolation across varying context lengths.

[682] Bridging Performance Gaps for ECG Foundation Models: A Post-Training Strategy

Ya Zhou, Yujie Yang, Xiaohan Fan, Wei Zhao

Main category: cs.LG

TL;DR: A post-training strategy improves ECG foundation models, outperforming fine-tuning and achieving state-of-the-art results on ECG classification tasks with better data efficiency.

DetailsMotivation: ECG foundation models show promise but underperform compared to task-specific models even after pre-training and fine-tuning, likely due to lack of effective post-training strategies.

Method: Proposes a simple yet effective post-training approach for ECG foundation models, evaluated on a Transformer-based foundation model, incorporating techniques like stochastic depth and preview linear probing.

Result: Consistently outperforms baseline fine-tuning across multiple ECG tasks, improving macro AUROC by 0.7%-8.9% and macro AUPRC by 23.3%-77.9% on PTB-XL benchmarks, also beating recent state-of-the-art approaches.

Conclusion: Post-training strategies significantly enhance ECG foundation models, demonstrating improved training dynamics and data efficiency, with potential to advance foundation model development in the ECG domain.

Abstract: ECG foundation models are increasingly popular due to their adaptability across various tasks. However, their clinical applicability is often limited by performance gaps compared to task-specific models, even after pre-training on large ECG datasets and fine-tuning on target data. This limitation is likely due to the lack of an effective post-training strategy. In this paper, we propose a simple yet effective post-training approach to enhance ECG foundation models. We evaluate it on a publicly available Transformer-based foundation model. Experiments across multiple ECG tasks show that our method consistently outperforms baseline fine-tuning. On the PTB-XL benchmarks, it improves macro AUROC by 0.7%-8.9% and macro AUPRC by 23.3%-77.9%, also outperforming several recent state-of-the-art approaches, including task-specific and advanced architectures. Further analyses demonstrate improved training dynamics and data efficiency, with only 30% of the training data outperforming the baseline trained on the full dataset. Ablation studies highlight the importance of stochastic depth and preview linear probing. These findings underscore the potential of post-training strategies to improve ECG foundation models, and we hope this work will contribute to the continued development of foundation models in the ECG domain.

[683] Understanding Post-Training Structural Changes in Large Language Models

Xinyu He, Xianghui Cao

Main category: cs.LG

TL;DR: SVD analysis reveals post-training (instruction tuning & Long-CoT distillation) causes uniform scaling of singular values and consistent orthogonal transformations of singular vectors in LLMs, showing coordinated parameter dynamics.

DetailsMotivation: Post-training significantly changes LLM behavior, but its impact on internal parameter space is poorly understood. The authors aim to systematically analyze how post-training methods affect the structural properties of LLM parameters.

Method: Conducted systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on instruction tuning and long-chain-of-thought (Long-CoT) distillation. Analyzed singular values and vectors to identify structural changes.
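
A minimal sketch of the kind of comparison described, with random stand-in matrices in place of real pre- and post-trained weights of the same layer: a near-constant singular-value ratio would indicate uniform scaling, and subspace overlap tracks how the singular vectors rotate.

```python
import numpy as np

rng = np.random.default_rng(0)
W_pre = rng.normal(size=(256, 256))                          # stand-in: pretrained layer
W_post = 1.1 * W_pre + 0.01 * rng.normal(size=(256, 256))    # stand-in: post-trained layer

U0, s0, V0t = np.linalg.svd(W_pre)
U1, s1, V1t = np.linalg.svd(W_post)

ratios = s1 / s0
print("singular-value scaling: mean %.3f, std %.3f" % (ratios.mean(), ratios.std()))

# Overlap of the top-k left singular subspaces (1.0 = unchanged, 0 = rotated away).
k = 32
alignment = np.linalg.norm(U1[:, :k].T @ U0[:, :k], ord="fro") ** 2 / k
print("top-%d left singular subspace overlap: %.3f" % (k, alignment))
```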

Result: Found two robust structural changes: 1) near-uniform geometric scaling of singular values across layers, and 2) highly consistent orthogonal transformations applied to left and right singular vectors. These reveal coordinated parameter dynamics.

Conclusion: Post-training relies on foundational pre-training capabilities, with singular value scaling enabling temperature-controlled regulation and vector rotations encoding semantic alignment. Challenges black-box view of parameter space, revealing first clear regularities in parameter evolution.

Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two unexpected and robust structural changes: (1) a near-uniform geometric scaling of singular values across layers; and (2) highly consistent orthogonal transformations applied to the left and right singular vectors of each matrix. Based on these findings, we propose a simple yet effective framework to describe the coordinated dynamics of parameters in LLMs, which elucidates why post-training inherently relies on the foundational capabilities developed during pre-training. Further experiments demonstrate that singular value scaling underpins the temperature-controlled regulatory mechanisms of post-training, while the coordinated rotation of singular vectors encodes the essential semantic alignment. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.
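
To make the described analysis concrete, here is a minimal sketch of the kind of SVD comparison between a pre- and post-training weight matrix. The real analysis targets the principal linear layers of pretrained LLMs; the toy matrices and the Procrustes check below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def compare_svd(W_pre, W_post):
    """Compare singular values and singular-vector alignment of two weight matrices."""
    U0, s0, V0t = np.linalg.svd(W_pre, full_matrices=False)
    U1, s1, V1t = np.linalg.svd(W_post, full_matrices=False)

    # (1) If post-training applies a near-uniform geometric scaling,
    #     the ratio of singular values should be roughly constant.
    ratio = s1 / s0
    print("singular-value ratio: mean=%.3f, std=%.3f" % (ratio.mean(), ratio.std()))

    # (2) If the left singular vectors are related by a consistent orthogonal
    #     transformation R (i.e. U1 ~ U0 @ R), the best-fitting R comes from the
    #     SVD of U0.T @ U1 (orthogonal Procrustes); a small residual supports this.
    A, _, Bt = np.linalg.svd(U0.T @ U1)
    R = A @ Bt
    residual = np.linalg.norm(U1 - U0 @ R) / np.linalg.norm(U1)
    print("relative Procrustes residual (left singular vectors): %.3e" % residual)

# Toy example: "post-training" = uniform scaling plus a rotation of the column space.
rng = np.random.default_rng(0)
W_pre = rng.standard_normal((256, 256))
Q, _ = np.linalg.qr(np.eye(256) + 0.1 * rng.standard_normal((256, 256)))
W_post = 1.3 * Q @ W_pre
compare_svd(W_pre, W_post)   # expect ratio ~1.3 everywhere and residual ~0
```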

[684] Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

Alex Schutz, Victor-Alexandru Darvariu, Efimia Panagiotaki, Bruno Lacerda, Nick Hawes

Main category: cs.LG

TL;DR: GNARL reframes neural algorithmic reasoning as a Markov Decision Process using RL to overcome limitations of supervised learning approaches, achieving strong results on graph problems including NP-hard ones.

DetailsMotivation: Neural Algorithmic Reasoning (NAR) has limitations: can't construct valid solutions without post-processing, can't handle multiple correct solutions, performs poorly on NP-hard problems, and can't work without known expert algorithms.

Method: Reframes algorithm learning as Markov Decision Process, uses imitation and reinforcement learning, proposes GNARL framework for graph-based problems, translates NAR formulations to RL.

Result: Achieves high graph accuracy on CLRS-30 problems, matches or exceeds narrower NAR approaches for NP-hard problems, works even without expert algorithms.

Conclusion: RL-based approach overcomes key NAR limitations, enabling better solution construction, handling of multiple solutions, and applicability to problems without known algorithms.

Abstract: Neural Algorithmic Reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov Decision Process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy on several CLRS-30 problems, performance that matches or exceeds much narrower NAR approaches on NP-hard problems, and, remarkably, applicability even when no expert algorithm is known.

[685] Pushing Toward the Simplex Vertices: A Simple Remedy for Code Collapse in Smoothed Vector Quantization

Takashi Morita

Main category: cs.LG

TL;DR: Proposes a novel regularization method for smoothed vector quantization that simultaneously ensures tight approximation to one-hot vectors and prevents code collapse by minimizing distances between simplex vertices and their K-nearest smoothed quantizers.

DetailsMotivation: Vector quantization is widely used but faces the fundamental challenge of non-differentiable quantization blocking gradient backpropagation. Smoothed vector quantization addresses this but requires balancing two key properties: (1) maintaining closeness to one-hot vectors for tight approximation, and (2) ensuring all codebook entries are utilized to prevent code collapse. Existing methods address these separately, creating a need for a unified approach.

Method: Introduces a simple regularization technique that promotes both desired properties simultaneously by minimizing the distance between each simplex vertex and its K-nearest smoothed quantizers. This encourages smoothed quantizers to remain close to one-hot vectors while ensuring all codebook entries are utilized effectively.

Result: Experiments on discrete image autoencoding and contrastive speech representation learning benchmarks demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.

Conclusion: The proposed regularization method provides an effective solution for smoothed vector quantization that simultaneously addresses the dual challenges of maintaining approximation quality and preventing code collapse, outperforming existing methods that treat these objectives separately.

Abstract: Vector quantization, which discretizes a continuous vector space into a finite set of representative vectors (a codebook), has been widely adopted in modern machine learning. Despite its effectiveness, vector quantization poses a fundamental challenge: the non-differentiable quantization step blocks gradient backpropagation. Smoothed vector quantization addresses this issue by relaxing the hard assignment of a codebook vector into a weighted combination of codebook entries, represented as the matrix product of a simplex vector and the codebook. Effective smoothing requires two properties: (1) smoothed quantizers should remain close to a one-hot vector, ensuring tight approximation, and (2) all codebook entries should be utilized, preventing code collapse. Existing methods typically address these desiderata separately. By contrast, the present study introduces a simple and intuitive regularization that promotes both simultaneously by minimizing the distance between each simplex vertex and its $K$-nearest smoothed quantizers. Experiments on representative benchmarks, including discrete image autoencoding and contrastive speech representation learning, demonstrate that the proposed method achieves more reliable codebook utilization and improves performance compared to prior approaches.
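
A minimal sketch of a regularizer of the kind described: for each simplex vertex (one-hot code), find its K nearest smoothed quantizers in a batch and penalize their distance. Treating the smoothed quantizers as softmax rows and using Euclidean distance are assumptions of this sketch, not necessarily the paper's exact formulation.

```python
import torch

def vertex_pull_regularizer(simplex_weights, K=4):
    """
    simplex_weights: (B, C) rows on the probability simplex, one per smoothed quantizer
    (e.g. a softmax over codebook logits). For every one-hot vertex e_c, find its K
    nearest rows and penalize their squared distance to e_c: this pulls some quantizers
    toward each vertex (tight approximation) while touching every codebook entry
    (discouraging collapse).
    """
    B, C = simplex_weights.shape
    vertices = torch.eye(C, device=simplex_weights.device)     # (C, C) one-hot vertices
    d2 = torch.cdist(vertices, simplex_weights).pow(2)          # (C, B) pairwise distances
    k = min(K, B)
    nearest, _ = torch.topk(d2, k, dim=1, largest=False)        # K smallest per vertex
    return nearest.mean()

# Toy usage: logits over an 8-entry codebook for a batch of 32 inputs.
logits = torch.randn(32, 8, requires_grad=True)
loss = vertex_pull_regularizer(torch.softmax(logits, dim=-1), K=4)
loss.backward()
print(float(loss))
```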

[686] Rotary Position Encodings for Graphs

Isaac Reid, Arijit Sehanobish, Cederik Höfs, Bruno Mlodozeniec, Leonhard Vulpius, Federico Barbero, Adrian Weller, Krzysztof Choromanski, Richard E. Turner, Petar Veličković

Main category: cs.LG

TL;DR: WIRE extends rotary position encodings (RoPE) to graph data by rotating tokens based on graph Laplacian spectrum, injecting structural information into attention mechanisms.

DetailsMotivation: To adapt the successful rotary position encoding (RoPE) technique from language and vision transformers to graph-structured data, enabling structural information injection into attention mechanisms for graph learning tasks.

Method: Proposes Wave-Induced Rotary Encodings (WIRE) that rotates tokens depending on the spectrum of the graph Laplacian, making the encoding dependent on graph structure rather than grid positions.

Result: WIRE boosts performance in synthetic and real-world graph learning tasks, recovers regular RoPE on grids, depends asymptotically on graph effective resistance, and is compatible with linear attention unlike bias-based encodings.

Conclusion: RoPE can be successfully extended to graph-structured data through spectral-based rotations, providing an effective way to inject structural information into transformer architectures for graph learning.

Abstract: We study the extent to which rotary position encodings (RoPE), a recent transformer position encoding algorithm broadly adopted in large language models (LLMs) and vision transformers (ViTs), can be applied to graph-structured data. We find that rotating tokens depending on the spectrum of the graph Laplacian efficiently injects structural information into the attention mechanism, boosting performance in synthetic and real-world graph learning tasks. This approach, coined Wave-Induced Rotary Encodings (WIRE), enjoys intriguing theoretical properties: it recovers regular RoPE on grids, and depends asymptotically on the graph effective resistance. Unlike bias-based relative position encodings, WIRE is compatible with linear attention.
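
A rough, speculative sketch of the general construction: treat graph-Laplacian eigenvectors as per-node "coordinates" and rotate query/key features RoPE-style by angles derived from them, then take dot products as usual. The choice of eigenvectors, the angle scaling, and the pairing of dimensions below are guesses for illustration, not the paper's definition of WIRE.

```python
import numpy as np

def laplacian_eigvecs(A, m):
    """First m nontrivial eigenvectors of the combinatorial graph Laplacian L = D - A."""
    L = np.diag(A.sum(1)) - A
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:m + 1]               # skip the constant eigenvector

def spectral_rotate(X, coords):
    """
    RoPE-style rotation driven by spectral coordinates instead of sequence positions.
    X: (N, d) node features with d = 2*m; coords: (N, m) spectral coordinates.
    Dimension pair (2k, 2k+1) is rotated by an angle proportional to coords[:, k].
    """
    out = X.copy()
    for k in range(coords.shape[1]):
        theta = np.pi * coords[:, k]       # assumed angle scale
        c, s = np.cos(theta), np.sin(theta)
        x, y = X[:, 2 * k], X[:, 2 * k + 1]
        out[:, 2 * k] = c * x - s * y
        out[:, 2 * k + 1] = s * x + c * y
    return out

# Toy graph: a 6-node path; rotate queries and keys, then form attention logits.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1.0
coords = laplacian_eigvecs(A, m=2)
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((6, 4)), rng.standard_normal((6, 4))
scores = spectral_rotate(Q, coords) @ spectral_rotate(K, coords).T
print(scores.shape)                        # structure-aware attention logits, (6, 6)
```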

[687] On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

Rongguang Ye, Ming Tang, Edith C. H. Ngai

Main category: cs.LG

TL;DR: CoA-LoRA enables dynamic adaptation of LoRA adapters to arbitrary quantization configurations without repeated fine-tuning, using a configuration-aware model trained on Pareto-optimized configuration sets.

DetailsMotivation: Deploying large pre-trained models on edge devices requires compression via quantization, but edge devices have heterogeneous capabilities requiring different quantization configurations. Fine-tuning separate LoRA adapters for each configuration is computationally prohibitive.

Method: Proposes CoA-LoRA with a configuration-aware model that maps quantization configurations to low-rank adjustments. Uses Pareto-based configuration search to iteratively optimize training configuration sets for better coverage of bit-width budgets.

Result: CoA-LoRA achieves comparable or superior performance to methods requiring separate fine-tuning for each configuration, with no additional time cost for deployment.

Conclusion: CoA-LoRA provides an efficient solution for deploying compressed models on heterogeneous edge devices by enabling dynamic adaptation to arbitrary quantization configurations without repeated fine-tuning.

Abstract: As increasingly large pre-trained models are released, deploying them on edge devices for privacy-preserving applications requires effective compression. Recent works combine quantization with the fine-tuning of high-precision LoRA adapters, which can substantially reduce model size while mitigating the accuracy loss from quantization. However, edge devices have inherently heterogeneous capabilities, while performing configuration-wise fine-tuning for every quantization setting is computationally prohibitive. In this paper, we propose CoA-LoRA, a method that dynamically adjusts the LoRA adapter to arbitrary quantization configurations (i.e., the per-layer bit-width choices of a pre-trained model) without requiring repeated fine-tuning. This is accomplished via a configuration-aware model that maps each configuration to its low-rank adjustments. The effectiveness of this model critically depends on the training configuration set, a collection of configurations chosen to cover different total bit-width budgets. However, constructing a high-quality configuration set is non-trivial. We therefore design a Pareto-based configuration search that iteratively optimizes the training configuration set, yielding more precise low-rank adjustments. Our experiments demonstrate that, unlike the state-of-the-art methods that require fine-tuning a separate LoRA adapter for each configuration, CoA-LoRA incurs no additional time cost while achieving comparable or even superior performance to those methods.

[688] IBNorm: Information-Bottleneck Inspired Normalization for Representation Learning

Xiandong Zou, Jia Li, Xiaotong Yuan, Pan Zhou

Main category: cs.LG

TL;DR: IBNorm is a new normalization method based on Information Bottleneck principle that outperforms variance-centric methods like BatchNorm, LayerNorm, and RMSNorm across both language and vision models.

DetailsMotivation: Existing normalization methods (BatchNorm, LayerNorm, RMSNorm) focus on variance control but don't explicitly optimize for preserving task-relevant information while suppressing nuisance variability. The authors aim to create normalization that better captures predictive information.

Method: Proposes IB-Inspired Normalization (IBNorm) based on Information Bottleneck principle. Introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, maintaining compatibility with standard normalization approaches.

Result: IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT). Theoretical analysis shows higher IB value and tighter generalization bounds. Mutual information analysis confirms superior information bottleneck behavior.

Conclusion: IBNorm provides a more principled normalization approach grounded in information theory that yields more informative representations while maintaining training stability and compatibility with existing architectures.

Abstract: Normalization is fundamental to deep learning, but existing approaches such as BatchNorm, LayerNorm, and RMSNorm are variance-centric by enforcing zero mean and unit variance, stabilizing training without controlling how representations capture task-relevant information. We propose IB-Inspired Normalization (IBNorm), a simple yet powerful family of methods grounded in the Information Bottleneck principle. IBNorm introduces bounded compression operations that encourage embeddings to preserve predictive information while suppressing nuisance variability, yielding more informative representations while retaining the stability and compatibility of standard normalization. Theoretically, we prove that IBNorm achieves a higher IB value and tighter generalization bounds than variance-centric methods. Empirically, IBNorm consistently outperforms BatchNorm, LayerNorm, and RMSNorm across large-scale language models (LLaMA, GPT-2) and vision models (ResNet, ViT), with mutual information analysis confirming superior information bottleneck behavior. Code will be released publicly.

[689] Physics-Informed Neural Networks for Real-Time Gas Crossover Prediction in PEM Electrolyzers: First Application with Multi-Membrane Validation

Yong-Woon Kim, Chulung Kang, Yung-Cheol Byun

Main category: cs.LG

TL;DR: Physics-informed neural network (PINN) model for real-time hydrogen crossover prediction in PEM water electrolysis, combining physical laws with machine learning for safety monitoring.

DetailsMotivation: Hydrogen crossover in PEM water electrolysis poses safety risks and reduces efficiency, but existing physics-based models are computationally expensive while purely data-driven approaches lack extrapolation capability for dynamic operation.

Method: Developed a physics-informed neural network (PINN) that embeds physical constraints (Henry’s law, Fick’s diffusion, Faraday’s law) in the loss function, trained on 184 published measurements augmented to 1,114 points, with compact architecture of 17,793 parameters.

Result: Achieved exceptional accuracy (R² = 99.84% ± 0.15%, RMSE = 0.0932% ± 0.0438%) across six membranes under industrial conditions, with sub-millisecond inference enabling real-time control and maintaining R² > 86% when predicting at pressures 2.5x beyond training range.

Conclusion: The PINN approach bridges physical rigor and computational efficiency, enabling real-time electrolyzer monitoring for safe green hydrogen production, with hardware-agnostic deployment from desktop CPUs to edge devices.

Abstract: Green hydrogen production via polymer electrolyte membrane (PEM) water electrolysis is pivotal for energy transition, yet hydrogen crossover through membranes threatens safety and economic viability, approaching explosive limits (4 mol% H$_2$ in O$_2$) while reducing Faradaic efficiency and accelerating membrane degradation. Current physics-based models require extensive calibration and computational resources that preclude real-time implementation, while purely data-driven approaches fail to extrapolate beyond training conditions, which is critical for dynamic electrolyzer operation. Here we present the first application of physics-informed neural networks (PINNs) for hydrogen crossover prediction, trained on 184 published measurements augmented to 1,114 points and constrained by a constitutive physics model (Henry’s law, Fick’s diffusion, and Faraday-based gas production) embedded in the loss function. Our compact architecture (17,793 parameters), validated across six membranes under industrially relevant conditions (0.05-5.0 A/cm$^2$, 1-200 bar, 25-85°C), achieves exceptional accuracy (R$^2$ = 99.84% $\pm$ 0.15%, RMSE = 0.0932% $\pm$ 0.0438%) based on five-fold cross-validation, with sub-millisecond inference enabling real-time control. Remarkably, the model maintains R$^2$ > 86% when predicting crossover at pressures 2.5x beyond the training range, substantially outperforming pure neural networks (R$^2$ = 43.4%). The hardware-agnostic deployment, from desktop CPUs to edge devices (Raspberry Pi 4), enables distributed safety monitoring essential for gigawatt-scale installations. By bridging physical rigor and computational efficiency, this work establishes a new paradigm for real-time electrolyzer monitoring, accelerating deployment of safe, efficient green hydrogen infrastructure crucial for net-zero emissions targets.
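
The paper's constitutive model is not reproduced here; the sketch below only illustrates the generic PINN pattern it describes: a data-fitting term plus a physics-residual term tying predicted crossover to a Henry/Fick-style permeation flux and a Faraday-based oxygen production rate. All constants, units, and functional forms below are placeholders.

```python
import torch
import torch.nn as nn

F_CONST = 96485.0   # Faraday constant, C/mol

class CrossoverNet(nn.Module):
    """Small MLP mapping operating conditions to H2-in-O2 fraction (%)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 1))

    def forward(self, x):   # x = [current density A/m^2, cathode pressure Pa, temperature C]
        x_norm = x / torch.tensor([5e4, 2e7, 100.0])   # crude input scaling (placeholder)
        return self.net(x_norm)

def physics_residual(pred_pct, x, perm=1e-14, thickness=1e-4):
    """
    Placeholder constitutive relation: crossover flux ~ permeability * pressure / thickness
    (Henry + Fick), O2 production ~ i / (4F) (Faraday). Their ratio gives a reference
    H2-in-O2 fraction that the network prediction is encouraged to be consistent with.
    """
    i, p = x[:, 0:1], x[:, 1:2]
    flux_h2 = perm * p / thickness             # mol m^-2 s^-1 (illustrative)
    n_o2 = i / (4.0 * F_CONST)                 # mol m^-2 s^-1
    ref_pct = 100.0 * flux_h2 / (flux_h2 + n_o2)
    return (pred_pct - ref_pct).pow(2).mean()

model = CrossoverNet()
x = torch.rand(64, 3) * torch.tensor([5e4, 2e7, 85.0]) + torch.tensor([1e3, 1e5, 25.0])
y = torch.rand(64, 1) * 2.0                    # synthetic "measured" crossover (%)
pred = model(x)
loss = nn.functional.mse_loss(pred, y) + 0.1 * physics_residual(pred, x)
loss.backward()
print(float(loss))
```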

[690] A Comparative Study on How Data Normalization Affects Zero-Shot Generalization in Time Series Foundation Models

Ihab Ahmed, Denis Krompaß, Cheng Feng, Volker Tresp

Main category: cs.LG

TL;DR: REVIN normalization method proves most effective for Time-Series Foundation Models, achieving 89% zero-shot MASE reduction and best accuracy-efficiency trade-off across diverse TSFM architectures.

DetailsMotivation: Normalization is crucial for Time-Series Foundation Models due to significant scale variations across domains/channels and non-stationarity, which can undermine performance despite architectural complexity. Current normalization methods are well-studied for dataset-specific models but overlooked for TSFMs where generalization is critical.

Method: Systematic evaluation of normalization methods across four architecturally diverse TSFMs, with empirical comparison of REVIN against other normalization approaches and un-normalized baselines.

Result: REVIN reduced zero-shot MASE by 89% relative to un-normalized baseline and by 44% versus other normalization methods, while matching best in-domain accuracy (0.84 MASE) without dataset-level preprocessing, yielding highest accuracy-efficiency trade-off.

Conclusion: REVIN is the most efficient normalization approach for TSFMs, but its effectiveness depends on architectural design choices and optimization objectives, particularly regarding training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).

Abstract: We investigate input normalization methods for Time-Series Foundation Models (TSFMs). While normalization is well-studied in dataset-specific time-series models, it remains overlooked in TSFMs, where generalization is critical. Time-series data, unlike text or images, exhibits significant scale variation across domains and channels, coupled with non-stationarity, which can undermine TSFM performance regardless of architectural complexity. Through systematic evaluation across four architecturally diverse TSFMs, we empirically establish REVIN as the most efficient approach, reducing zero-shot MASE by 89% relative to an un-normalized baseline and by 44% versus other normalization methods, while matching the best in-domain accuracy (0.84 MASE) without any dataset-level preprocessing, yielding the highest accuracy-efficiency trade-off. Yet its effective utilization depends on architectural design choices and the optimization objective, particularly with respect to training loss scale sensitivity and model type (probabilistic, point-forecast, or LLM-based models).
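
The REVIN scheme referenced here is essentially reversible instance normalization; a compact sketch of the usual formulation follows, removing per-sample, per-channel statistics at the input and restoring them at the output. Exact affine and placement choices may differ across TSFMs.

```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Reversible instance normalization for time series (per-sample, per-channel stats)."""
    def __init__(self, num_channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))

    def normalize(self, x):                       # x: (batch, time, channels)
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = x.std(dim=1, keepdim=True, unbiased=False) + self.eps
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y):                     # y: model output in normalized space
        return (y - self.beta) / self.gamma * self.std + self.mean

# Usage around any forecaster: normalize inputs, forecast, then restore the original scale.
revin = RevIN(num_channels=7)
x = torch.randn(8, 96, 7) * 50 + 500              # series with large, varying scale
x_norm = revin.normalize(x)
forecast_norm = x_norm[:, -24:, :]                # stand-in for a forecasting model
forecast = revin.denormalize(forecast_norm)
print(forecast.shape)
```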

[691] Parametrized Power-Iteration Clustering for Directed Graphs

Gwendal Debaussart-Joniec, Harry Sevi, Matthieu Jonckheere, Argyris Kalogeratos

Main category: cs.LG

TL;DR: ParPIC is a scalable clustering method for directed graphs using parametrized reversible random walks with automatic diffusion time tuning and embedding truncation.

DetailsMotivation: Vertex-level clustering for directed graphs is challenging because edge directionality breaks assumptions of spectral methods, which also require expensive eigen-decompositions.

Method: Parametrized Power-Iteration Clustering (ParPIC) uses parametrized reversible random walk operators, automatically tunes diffusion time, and efficiently truncates the final embedding to produce low-dimensional representations.

Result: ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods on synthetic and real-world graphs.

Conclusion: ParPIC provides an effective and scalable alternative to spectral methods for directed graph clustering by leveraging random walks with parameter tuning and embedding optimization.

Abstract: Vertex-level clustering for directed graphs (digraphs) remains challenging as edge directionality breaks the key assumptions underlying popular spectral methods, which also incur the overhead of eigen-decomposition. This paper proposes Parametrized Power-Iteration Clustering (ParPIC), a random-walk-based clustering method for weakly connected digraphs. This builds over the Power-Iteration Clustering paradigm, which uses the rows of the iterated diffusion operator as a data embedding. ParPIC has three important features: the use of parametrized reversible random walk operators, the automatic tuning of the diffusion time, and the efficient truncation of the final embedding, which produces low-dimensional data representations and reduces complexity. Empirical results on synthetic and real-world graphs demonstrate that ParPIC achieves competitive clustering accuracy with improved scalability relative to spectral and teleportation-based methods.
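
A bare-bones power-iteration-clustering skeleton in the spirit described: diffuse a random-walk operator for a fixed number of steps and cluster the rows of the diffused matrix. The "reversibilization" here is a naive forward/backward averaging, and the diffusion time is fixed by hand; ParPIC's parametrized operators, automatic tuning, and embedding truncation are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

def row_normalize(A):
    d = A.sum(1, keepdims=True)
    d[d == 0] = 1.0
    return A / d

def power_iteration_clustering(A, n_clusters, t=10):
    """
    Power-iteration clustering skeleton for a directed adjacency matrix A:
    build a random-walk operator, make it (roughly) direction-insensitive,
    diffuse for t steps, and cluster the rows of the diffused operator.
    """
    P_fwd = row_normalize(A.astype(float))
    P_bwd = row_normalize(A.T.astype(float))
    P = 0.5 * (P_fwd + P_bwd)            # naive stand-in for a reversible operator
    M = np.linalg.matrix_power(P, t)     # rows of the iterated operator = node embeddings
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(M)

# Toy digraph: two directed 4-cycles joined by a single edge.
A = np.zeros((8, 8))
for i in range(4):
    A[i, (i + 1) % 4] = 1
    A[4 + i, 4 + (i + 1) % 4] = 1
A[3, 4] = 1
print(power_iteration_clustering(A, n_clusters=2))
```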

[692] Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in ${\pm 1, \pm i}$

Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang

Main category: cs.LG

TL;DR: Fairy2i converts pre-trained real-valued LLMs to complex-valued form for extremely low-bit quantization (down to 2 bits) while maintaining performance, enabling efficient inference on commodity hardware.

DetailsMotivation: LLMs have massive memory/computational demands requiring aggressive quantization. Complex-valued LLMs offer better low-bit representation but require training from scratch, preventing use of pre-trained real-valued models. Need a way to leverage complex arithmetic efficiency while reusing existing pre-trained models.

Method: 1) Proves lossless mathematical equivalence between real and widely-linear complex maps to convert standard Transformers to complex domain. 2) Uses phase-aware quantization with efficient codebook of fourth roots of unity. 3) Introduces recursive residual quantization to iteratively minimize quantization error. 4) Enables multiplication-free accumulation for inference.

Result: Fairy2i restores LLaMA-2 7B performance at effective 2-bit precision to near full-precision baseline levels, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods.

Conclusion: Bridges gap between complex-valued arithmetic representational efficiency and practical utility of pre-trained models, enabling efficient inference on commodity hardware through extreme quantization while reusing existing checkpoints.

Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware. We open-source the Fairy2i model and code at https://huggingface.co/PKU-DS-LAB/Fairy2i-W2 and https://github.com/PKULab1806/Fairy2i-W2.
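
A toy illustration of the phase-aware step only: snapping each complex weight to a scaled fourth root of unity, plus one residual round. The widely-linear conversion and the paper's actual scale selection are omitted, and the per-level scale heuristic below is an assumption.

```python
import numpy as np

ROOTS = np.array([1, 1j, -1, -1j])            # codebook: fourth roots of unity

def quantize_phase(W, scale):
    """Snap each complex weight to scale * (nearest fourth root of unity)."""
    idx = np.argmin(np.abs(W[..., None] - scale * ROOTS), axis=-1)
    return scale * ROOTS[idx]

def recursive_residual_quantize(W, levels=2):
    """Quantize W, then quantize the remaining residual, etc. (assumed per-level scale)."""
    residual, terms = W, []
    for _ in range(levels):
        scale = np.mean(np.abs(residual))     # simple per-tensor scale heuristic
        q = quantize_phase(residual, scale)
        terms.append(q)
        residual = residual - q
    return sum(terms), residual

# Toy complex weight matrix (e.g. obtained from a real layer via a widely-linear rewrite).
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
W_hat, residual = recursive_residual_quantize(W, levels=2)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```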

[693] Divergence Results and Convergence of a Variance Reduced Version of ADAM

Ruiqi Wang, Diego Klabjan

Main category: cs.LG

TL;DR: The paper analyzes ADAM’s divergence issues, identifies gradient variance as the cause, proposes a variance-reduced ADAM variant with convergence guarantees, and shows comparable performance to original ADAM.

DetailsMotivation: Despite ADAM's outstanding performance in training deep neural networks, it has been proven divergent for some specific problems. The authors aim to understand the root cause of this divergence and propose a fix while maintaining ADAM's good performance.

Method: The authors revisit ADAM’s divergence question, provide divergent examples under stronger conditions, identify gradient variance as the cause under variance reduction assumptions, and propose a variance-reduced version of ADAM with convergence analysis.

Result: The paper shows that gradient variance causes ADAM’s divergence, proposes a convergent variance-reduced ADAM algorithm, and demonstrates through numerical experiments that the proposed algorithm performs as well as original ADAM.

Conclusion: The work identifies gradient variance as the key factor in ADAM’s divergence and proposes a variance-reduced ADAM variant that converges while maintaining performance, suggesting a new direction for fixing convergence issues in adaptive optimization methods.

Abstract: Stochastic optimization algorithms using exponential moving averages of the past gradients, such as ADAM, RMSProp and AdaGrad, have achieved great success in many applications, especially in training deep neural networks. ADAM in particular stands out as efficient and robust. Despite its outstanding performance, ADAM has been proved to be divergent for some specific problems. We revisit the divergence question and provide divergent examples under stronger conditions, such as divergence in expectation or with high probability. Under a variance reduction assumption, we show that an ADAM-type algorithm converges, which means that it is the variance of gradients that causes the divergence of original ADAM. To this end, we propose a variance reduced version of ADAM and provide a convergence analysis of the algorithm. Numerical experiments show that the proposed algorithm performs as well as ADAM. Our work suggests a new direction for fixing the convergence issues.
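
A generic sketch of the idea of feeding an SVRG-style variance-reduced gradient into Adam-type moment estimates; the paper's precise algorithm and step-size conditions are not reproduced, and all hyperparameters below are placeholders.

```python
import numpy as np

def vr_adam(grad_fn, full_grad_fn, w, n, epochs=20, inner=50,
            lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """
    SVRG-style variance reduction plugged into Adam moments (illustrative sketch).
    grad_fn(w, i): stochastic gradient on example i; full_grad_fn(w): full-batch gradient.
    """
    m = np.zeros_like(w); v = np.zeros_like(w); t = 0
    for _ in range(epochs):
        w_ref = w.copy()
        g_ref = full_grad_fn(w_ref)                # anchor gradient at the snapshot
        for _ in range(inner):
            i = np.random.randint(n)
            # Variance-reduced gradient: correct the stochastic gradient with the anchor.
            g = grad_fn(w, i) - grad_fn(w_ref, i) + g_ref
            t += 1
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g * g
            m_hat = m / (1 - b1 ** t)
            v_hat = v / (1 - b2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy least-squares problem: minimize mean_i (a_i . w - b_i)^2 / 2.
rng = np.random.default_rng(0)
A, x_true = rng.standard_normal((200, 5)), rng.standard_normal(5)
b = A @ x_true
grad_fn = lambda w, i: (A[i] @ w - b[i]) * A[i]
full_grad_fn = lambda w: A.T @ (A @ w - b) / len(b)
w = vr_adam(grad_fn, full_grad_fn, np.zeros(5), n=len(b))
print("distance to solution:", np.linalg.norm(w - x_true))
```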

[694] Fair Graph Machine Learning under Adversarial Missingness Processes

Debolina Halder Lina, Arlei Silva

Main category: cs.LG

TL;DR: BFtS is a fair missing data imputation method for sensitive attributes in GNNs that assumes adversarial missingness and optimizes for worst-case fairness scenarios.

DetailsMotivation: Existing fair GNN approaches assume sensitive attributes are either fully observed or missing completely at random, but adversarial missingness processes can disguise unfair models through imputation, leading to overestimated fairness.

Method: Proposes Better Fair than Sorry (BFtS) using a 3-player adversarial scheme: two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias by approximating worst-case fairness scenarios in imputation.

Result: Experiments on synthetic and real datasets show BFtS achieves better fairness-accuracy trade-off than existing alternatives under adversarial missingness processes.

Conclusion: BFtS addresses the challenge of adversarial missingness in sensitive attributes for fair GNNs by adopting a conservative approach that prepares for worst-case fairness scenarios during imputation.

Abstract: Graph Neural Networks (GNNs) have achieved state-of-the-art results in many relevant tasks where decisions might disproportionately impact specific communities. However, existing work on fair GNNs often assumes that either sensitive attributes are fully observed or they are missing completely at random. We show that an adversarial missingness process can inadvertently disguise an unfair model through the imputation, leading the model to overestimate the fairness of its predictions. We address this challenge by proposing Better Fair than Sorry (BFtS), a fair missing data imputation model for sensitive attributes. The key principle behind BFtS is that imputations should approximate the worst-case scenario for fairness – i.e. when optimizing fairness is the hardest. We implement this idea using a 3-player adversarial scheme where two adversaries collaborate against a GNN classifier, and the classifier minimizes the maximum bias. Experiments using synthetic and real datasets show that BFtS often achieves a better fairness-accuracy trade-off than existing alternatives under an adversarial missingness process.

[695] A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Qianxiao Li, Mengnan Du, Dianbo Liu

Main category: cs.LG

TL;DR: The paper develops a unified theoretical framework for sparse dictionary learning methods used in mechanistic interpretability, explaining why these methods produce polysemantic features, feature absorption, and dead neurons, and proposes a novel feature anchoring technique to improve feature recovery.

DetailsMotivation: Understanding what representations AI models learn and how they encode concepts is crucial for scientific progress and trustworthy deployment. Current sparse dictionary learning methods used in mechanistic interpretability consistently produce problematic phenomena like polysemantic features, feature absorption, and dead neurons, with limited theoretical understanding of why these occur.

Method: Develops a unified theoretical framework that casts all major sparse dictionary learning variants (sparse autoencoders, transcoders, crosscoders) as a single piecewise biconvex optimization problem. Characterizes global solution set, non-identifiability, and spurious optima. Introduces Linear Representation Bench for evaluation and proposes feature anchoring technique to restore identifiability.

Result: The theoretical analysis yields principled explanations for feature absorption and dead neurons. Feature anchoring substantially improves feature recovery across synthetic benchmarks and real neural representations.

Conclusion: Provides the first unified theoretical framework for sparse dictionary learning methods in mechanistic interpretability, explaining key pathologies and offering a practical solution (feature anchoring) that improves feature recovery in both synthetic and real-world settings.

Abstract: As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they encode concepts has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have widely reported that neural networks represent meaningful concepts as linear directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, are utilized to address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods are the backbone of modern mechanistic interpretability, yet in practice they consistently produce polysemantic features, feature absorption, and dead neurons, with very limited theoretical understanding of why these phenomena occur. Existing theoretical work is limited to tied-weight sparse autoencoders, leaving the broader family of SDL methods without formal grounding. We develop the first unified theoretical framework that casts all major SDL variants as a single piecewise biconvex optimization problem, and characterize its global solution set, non-identifiability, and spurious optima. This analysis yields principled explanations for feature absorption and dead neurons. To expose these pathologies under full ground-truth access, we introduce the Linear Representation Bench. Guided by our theory, we propose feature anchoring, a novel technique that restores SDL identifiability, substantially improving feature recovery across synthetic benchmarks and real neural representations.

[696] Universal approximation property of Banach space-valued random feature models including random neural networks

Ariel Neufeld, Philipp Schmocker

Main category: cs.LG

TL;DR: Random feature learning extended to Banach spaces with theoretical guarantees for universal approximation and polynomial training costs.

DetailsMotivation: To develop a theoretical framework for random feature learning in Banach spaces, enabling efficient kernel approximation with reduced computational complexity compared to deterministic methods.

Method: Extends random feature models to Banach space-valued random variables, proves universal approximation in Bochner spaces, derives approximation rates, and provides explicit learning algorithms. Includes random trigonometric/Fourier regression and random neural networks with only linear readout training.

Result: Proves universal approximation property for random neural networks in non-compact domains (weighted spaces, L^p-spaces, Sobolev spaces), shows polynomial training cost growth, and demonstrates empirical advantages over deterministic counterparts.

Conclusion: Random feature learning provides theoretically sound and computationally efficient framework for function approximation in Banach spaces, with practical advantages over deterministic methods.

Abstract: We introduce a Banach space-valued extension of random feature learning, a data-driven supervised machine learning technique for large-scale kernel approximation. By randomly initializing the feature maps, only the linear readout needs to be trained, which reduces the computational complexity substantially. Viewing random feature models as Banach space-valued random variables, we prove a universal approximation result in the corresponding Bochner space. Moreover, we derive approximation rates and an explicit algorithm to learn an element of the given Banach space by such models. The framework of this paper includes random trigonometric/Fourier regression and in particular random neural networks which are single-hidden-layer feedforward neural networks whose weights and biases are randomly initialized, whence only the linear readout needs to be trained. For the latter, we can then lift the universal approximation property of deterministic neural networks to random neural networks, even within function spaces over non-compact domains, e.g., weighted spaces, $L^p$-spaces, and (weighted) Sobolev spaces, where the latter includes the approximation of the (weak) derivatives. In addition, we analyze when the training costs for approximating a given function grow polynomially in both the input/output dimension and the reciprocal of a pre-specified tolerated approximation error. Furthermore, we demonstrate in a numerical example the empirical advantages of random feature models over their deterministic counterparts.
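
A minimal finite-dimensional instance of the recipe (scalar-valued rather than Banach-space-valued): random Fourier features are drawn once and frozen, and only the linear readout is fit, here by closed-form ridge regression.

```python
import numpy as np

def random_fourier_features(X, n_features, sigma=1.0, seed=0):
    """Random Fourier features: these random weights are drawn once and never trained."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def fit_linear_readout(Phi, y, ridge=1e-3):
    """Train only the linear readout, by ridge regression in closed form."""
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# Approximate a scalar target function with a randomly featurized model.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])
Phi = random_fourier_features(X, n_features=300)
w = fit_linear_readout(Phi, y)

X_test = rng.uniform(-3, 3, size=(200, 2))
y_hat = random_fourier_features(X_test, n_features=300) @ w    # same frozen features (same seed)
y_true = np.sin(X_test[:, 0]) * np.cos(X_test[:, 1])
print("test RMSE:", np.sqrt(np.mean((y_hat - y_true) ** 2)))
```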

[697] NEAT: Neighborhood-Guided, Efficient, Autoregressive Set Transformer for 3D Molecular Generation

Daniel Rose, Roxane Axel Jacob, Johannes Kirchmair, Thierry Langer

Main category: cs.LG

TL;DR: NEAT is a permutation-invariant autoregressive transformer for 3D molecular generation that treats molecules as sets rather than sequences.

DetailsMotivation: Standard transformers require sequential token ordering which doesn't exist for molecular atoms, and canonical orderings don't ensure permutation invariance needed for tasks like prefix completion.

Method: NEAT treats molecular graphs as sets of atoms and learns an order-agnostic distribution over admissible tokens at the graph boundary using a neighborhood-guided approach.

Result: NEAT achieves state-of-the-art performance in autoregressive 3D molecular generation while ensuring atom-level permutation invariance by design.

Conclusion: NEAT provides a promising transformer-based approach for 3D molecular generation that overcomes the sequential ordering limitation of standard transformers while maintaining permutation invariance.

Abstract: Transformer-based autoregressive models offer a promising alternative to diffusion- and flow-matching approaches for generating 3D molecular structures. However, standard transformer architectures require a sequential ordering of tokens, which is not uniquely defined for the atoms in a molecule. Prior work has addressed this by using canonical atom orderings, but these do not ensure permutation invariance of atoms, which is essential for tasks like prefix completion. We introduce NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer that treats molecular graphs as sets of atoms and learns an order-agnostic distribution over admissible tokens at the graph boundary. NEAT achieves state-of-the-art performance in autoregressive 3D molecular generation whilst ensuring atom-level permutation invariance by design.

[698] Towards Identifiable Latent Additive Noise Models

Yuhang Liu, Zhen Zhang, Dong Gong, Erdun Gao, Biwei Huang, Mingming Gong, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi

Main category: cs.LG

TL;DR: A framework for causal representation learning with weaker identifiability conditions using changes in causal influences and function class constraints

DetailsMotivation: Current causal representation learning methods have limited practical applicability due to strong identifiability assumptions and restrictive model classes (linear/polynomial), needing more flexible approaches for real-world settings

Method: Proposes a general framework leveraging changes in causal influences among latent variables with constraints on function classes, establishes partial identifiability under weaker conditions, extends to latent post-nonlinear models, and develops a flexible learning method

Result: Demonstrates effectiveness on synthetic and semi-synthetic datasets, and showcases applicability in human motion analysis case study

Conclusion: The approach broadens practical reach of identifiable causal representation learning models by relaxing assumptions and handling more complex real-world domains

Abstract: Causal representation learning (CRL) offers the promise of uncovering the underlying causal model by which observed data was generated, but the practical applicability of existing methods remains limited by the strong assumptions required for identifiability and by challenges in applying them to real-world settings. Most current approaches are applicable only to relatively restrictive model classes, such as linear or polynomial models, which limits their flexibility and robustness in practice. One promising approach to this problem seeks to address these issues by leveraging changes in causal influences among latent variables. In this vein we propose a more general and relaxed framework than typically applied, formulated by imposing constraints on the function classes applied. Within this framework, we establish partial identifiability results under weaker conditions, including scenarios where only a subset of causal influences change. We then extend our analysis to a broader class of latent post-nonlinear models. Building on these theoretical insights, we develop a flexible method for learning latent causal representations. We demonstrate the effectiveness of our approach on synthetic and semi-synthetic datasets, and further showcase its applicability in a case study on human motion analysis, a complex real-world domain that also highlights the potential to broaden the practical reach of identifiable CRL models.

[699] Federated Learning for Heterogeneous Electronic Health Record Systems with Cost Effective Participant Selection

Jiyoun Kim, Junu Kim, Kyunghoon Hur, Edward Choi

Main category: cs.LG

TL;DR: EHRFL: A federated learning framework for building cost-effective, institution-specific predictive models from EHR data using text-based modeling and privacy-preserving participant selection.

DetailsMotivation: Healthcare institutions need to build predictive models using federated learning while addressing challenges of EHR system heterogeneity and computational cost constraints.

Method: Two-component framework: (1) text-based EHR modeling for cross-institution compatibility without data standardization, (2) differentially private participant selection based on averaged patient embedding similarity to reduce participants while maintaining performance.

Result: Experiments on multiple open-source EHR datasets demonstrate effectiveness of both components in reducing costs and time while maintaining performance.

Conclusion: EHRFL enables healthcare institutions to build institution-specific predictive models under budgetary constraints with reduced costs and time while preserving patient privacy.

Abstract: The increasing volume of electronic health records (EHRs) presents the opportunity to improve the accuracy and robustness of models in clinical prediction tasks. Unlike traditional centralized approaches, federated learning enables training on data from multiple institutions while preserving patient privacy and complying with regulatory constraints. In practice, healthcare institutions (i.e., hosts) often need to build predictive models tailored to their specific needs (e.g., creatinine-level prediction, N-day readmission prediction) using federated learning. When building a federated learning model for a single healthcare institution, two key challenges arise: (1) ensuring compatibility across heterogeneous EHR systems, and (2) managing federated learning costs within budget constraints. Specifically, heterogeneity in EHR systems across institutions hinders compatible modeling, while the computational costs of federated learning can exceed practical budget limits for healthcare institutions. To address these challenges, we propose EHRFL, a federated learning framework designed for building a cost-effective, host-specific predictive model using patient EHR data. EHRFL consists of two components: (1) text-based EHR modeling, which facilitates cross-institution compatibility without costly data standardization, and (2) a participant selection strategy based on averaged patient embedding similarity to reduce the number of participants without degrading performance. Our participant selection strategy sharing averaged patient embeddings is differentially private, ensuring patient privacy. Experiments on multiple open-source EHR datasets demonstrate the effectiveness of both components. With our framework, healthcare institutions can build institution-specific predictive models under budgetary constraints with reduced costs and time.
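
A sketch of the participant-selection component as described: each candidate shares a clipped, noised average of its patient embeddings, and the host keeps the most similar candidates. The noise scale below is a placeholder rather than a calibrated (epsilon, delta) guarantee, and the embedding model itself is assumed given.

```python
import numpy as np

def dp_average_embedding(patient_embeddings, clip=1.0, noise_std=0.1, seed=0):
    """Average per-patient embeddings after norm clipping, then add Gaussian noise.
    A real deployment would calibrate noise_std to a target privacy budget via the
    Gaussian mechanism; here it is a fixed placeholder."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(patient_embeddings, axis=1, keepdims=True)
    clipped = patient_embeddings * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    avg = clipped.mean(axis=0)
    return avg + rng.normal(0.0, noise_std, size=avg.shape)

def select_participants(host_avg, candidate_avgs, k):
    """Keep the k candidates whose averaged embeddings are most similar to the host's."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    sims = np.array([cos(host_avg, c) for c in candidate_avgs])
    return np.argsort(-sims)[:k]

# Toy setup: one host and five candidate institutions with random patient embeddings.
rng = np.random.default_rng(1)
host = dp_average_embedding(rng.standard_normal((300, 64)))
candidates = [dp_average_embedding(rng.standard_normal((200, 64)), seed=s) for s in range(5)]
print("selected institutions:", select_participants(host, candidates, k=2))
```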

[700] Harmonizing Safety and Speed: A Human-Algorithm Approach to Enhance the FDA’s Medical Device Clearance Policy

Mohammad Zhalechian, Soroush Saghafian, Omar Robles

Main category: cs.LG

TL;DR: A human-algorithm approach using machine learning to predict recall risk of FDA 510(k) medical devices, with a data-driven clearance policy that reduces recall rates by 32.9% and workload by 40.5%.

DetailsMotivation: The FDA's 510(k) pathway for medical device approval has inherent ambiguity leading to high recall rates, raising safety concerns. There's a need to improve this regulatory process by reducing recall risk while managing regulatory workload.

Method: Developed machine learning methods to estimate recall risk of 510(k) devices based on submission information. Proposed a data-driven clearance policy with three actions: acceptance, rejection, or deferral to FDA committees for deeper evaluation. Used a unique dataset of over 31,000 submissions from FDA and CMS sources.

Result: Compared to FDA’s current practice (10.3% recall rate, 100% normalized workload), the policy shows 32.9% improvement in recall rate and 40.5% reduction in workload. Estimated annual cost savings of $1.7 billion for healthcare system (1.1% of US annual medical device expenditure).

Conclusion: A holistic, data-driven approach can significantly improve the FDA’s 510(k) pathway by reducing recall risks and regulatory workload while generating substantial cost savings for the healthcare system.

Abstract: The United States Food and Drug Administration’s (FDA’s) 510(k) pathway allows manufacturers to gain medical device approval by demonstrating substantial equivalence to a legally marketed device. However, the inherent ambiguity of this regulatory procedure has been associated with high recall among many devices cleared through this pathway, raising significant safety concerns. In this paper, we develop a combined human-algorithm approach to assist the FDA in improving its 510(k) medical device clearance process by reducing recall risk and regulatory workload. We first develop machine learning methods to estimate the risk of recall of 510(k) medical devices based on the information available at the time of submission. We then propose a data-driven clearance policy that recommends acceptance, rejection, or deferral to FDA’s committees for in-depth evaluation. We conduct an empirical study using a unique dataset of over 31,000 submissions that we assembled based on data sources from the FDA and Centers for Medicare and Medicaid Service (CMS). Compared to the FDA’s current practice, which has a recall rate of 10.3% and a normalized workload measure of 100%, a conservative evaluation of our policy shows a 32.9% improvement in the recall rate and a 40.5% reduction in the workload. Our analyses further suggest annual cost savings of approximately $1.7 billion for the healthcare system driven by avoided replacement costs, which is equivalent to 1.1% of the entire United States annual medical device expenditure. Our findings highlight the value of a holistic and data-driven approach to improve the FDA’s current 510(k) pathway.

[701] d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation

Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang

Main category: cs.LG

TL;DR: d3LLM is a pseudo-distilled diffusion large language model that balances accuracy and parallelism through training-time pseudo-trajectory distillation and inference-time entropy-based multi-block decoding with KV-cache refresh.

DetailsMotivation: Diffusion LLMs offer parallel decoding and random-order generation advantages over autoregressive LLMs, but face an inherent accuracy-parallelism trade-off. Existing methods focus on either efficiency or performance, not both.

Method: Two-stage approach: (1) Training: pseudo-trajectory distillation teaches model which tokens can be decoded confidently at early steps; (2) Inference: entropy-based multi-block decoding with KV-cache refresh mechanism for high parallelism while maintaining accuracy.

Result: Achieves up to 10× speedup over vanilla LLaDA/Dream and 5× speedup over AR models without significant accuracy drop. Introduces AUP metric for joint evaluation of accuracy and parallelism.

Conclusion: d3LLM effectively balances accuracy and parallelism in diffusion LLMs through novel training and inference techniques, demonstrating practical benefits over both vanilla diffusion models and autoregressive alternatives.

Abstract: Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one-side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10$\times$ speedup over vanilla LLaDA/Dream and 5$\times$ speedup over AR models without much accuracy drop. Our code is available at https://github.com/hao-ai-lab/d3LLM.
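
A toy rendering of one entropy-thresholded parallel decoding step, the general mechanism named above; the threshold value, the block structure, and the KV-cache refresh logic of d3LLM are not modeled here.

```python
import torch

def entropy_based_decode_step(probs, decoded_mask, threshold=1.0):
    """
    One illustrative parallel-decoding step: among still-masked positions, commit every
    position whose predictive entropy is below `threshold`; if none qualifies, commit only
    the single most confident one so decoding always makes progress.
    probs: (seq_len, vocab) predicted distributions; decoded_mask: (seq_len,) bool.
    """
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    entropy = entropy.masked_fill(decoded_mask, float("inf"))   # ignore finished positions
    commit = entropy < threshold
    if not commit.any():
        commit = torch.zeros_like(decoded_mask)
        commit[entropy.argmin()] = True
    tokens = probs.argmax(dim=-1)
    return tokens, decoded_mask | commit

# Toy step over 6 positions and a 50-token vocabulary.
torch.manual_seed(0)
probs = torch.softmax(torch.randn(6, 50) * 3.0, dim=-1)
tokens, decoded = entropy_based_decode_step(probs, torch.zeros(6, dtype=torch.bool))
print(tokens, decoded)
```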

[702] Understanding Large Language Models in Your Pockets: Performance Study on COTS Mobile Devices

Jie Xiao, Qianyi Huang, Xu Chen, Chen Tian

Main category: cs.LG

TL;DR: A comprehensive measurement study evaluating lightweight LLMs (like Gemini Nano, LLAMA2 7B) on commercial mobile devices, assessing both user experience metrics and developer-critical factors across different mobile SoCs.

DetailsMotivation: Growing privacy concerns are pushing LLMs toward local deployment on mobile devices. The paper aims to understand the current landscape of LLM deployment on mobile platforms by evaluating performance on commercial-off-the-shelf devices, addressing both user experience and developer implementation concerns.

Method: Conducted a comprehensive measurement study on mobile devices, evaluating both user-centric metrics (token throughput, latency, response quality) and developer-critical factors (resource utilization, OS strategies, battery consumption, launch time). Also performed comparisons across mobile SoCs from major vendors.

Result: The study provides insights into performance differences across mobile SoCs in handling LLM workloads, helping developers identify and address bottlenecks for mobile LLM applications. Results highlight trade-offs between user experience metrics and resource utilization.

Conclusion: The measurement study provides valuable insights for both on-device LLM development and future mobile system architecture design, helping bridge the gap between user experience concerns and developer implementation challenges in mobile LLM deployment.

Abstract: As large language models (LLMs) increasingly integrate into every aspect of our work and daily lives, there are growing concerns about user privacy, which push the trend toward local deployment of these models. There are a number of lightweight LLMs (e.g., Gemini Nano, LLAMA2 7B) that can run locally on smartphones, providing users with greater control over their personal data. As this is a rapidly emerging application, we are concerned about their performance on commercial-off-the-shelf mobile devices. To fully understand the current landscape of LLM deployment on mobile platforms, we conduct a comprehensive measurement study on mobile devices. While user experience is the primary concern for end-users, developers focus more on the underlying implementations. Therefore, we evaluate both user-centric metrics, such as token throughput, latency, and response quality, and developer-critical factors, including resource utilization, OS strategies, battery consumption, and launch time. We also provide comprehensive comparisons across the mobile system-on-chips (SoCs) from major vendors, highlighting their performance differences in handling LLM workloads, which may help developers identify and address bottlenecks for mobile LLM applications. We hope that this study can provide insights for both the development of on-device LLMs and the design of future mobile system architecture.

[703] Understanding Self-Supervised Learning via Gaussian Mixture Models

Parikshit Bansal, Ali Kavis, Sujay Sanghavi

Main category: cs.LG

TL;DR: Theoretical analysis shows self-supervised learning (contrastive and non-contrastive) can find optimal subspaces in Gaussian Mixture Models, outperforming spectral methods, with extensions to multi-modal learning like CLIP.

DetailsMotivation: Self-supervised learning works well empirically but lacks theoretical understanding. The paper aims to provide theoretical analysis of why self-supervised learning works, specifically in the context of dimensionality reduction in Gaussian Mixture Models.

Method: Theoretical analysis of self-supervised learning algorithms (InfoNCE contrastive loss and SimSiam non-contrastive loss) in Gaussian Mixture Models. Define augmentation as independent draws from same mixture component. Extend analysis to multi-modal contrastive learning (CLIP). Validate with synthetic data experiments.

Result: Contrastive learning finds the optimal lower-dimensional subspace even for non-isotropic Gaussians where spectral methods fail. Non-contrastive learning achieves similar results. Multi-modal contrastive learning learns a subset of the Fisher-optimal subspace, filtering out noise. Synthetic experiments confirm the theoretical findings.

Conclusion: Theoretical analysis explains why self-supervised learning works well, showing it can find optimal representations in Gaussian Mixture Models and multi-modal settings, providing mathematical foundation for empirical success.

Abstract: Self-supervised learning attempts to learn representations from unlabeled data; it does so via a loss function that encourages the embedding of a point to be close to that of its augmentations. This simple idea performs remarkably well, yet why it works is not precisely understood theoretically. In this paper we analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as being another independent draw from the same underlying mixture component. We show that vanilla contrastive learning (specifically, the InfoNCE loss) is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic – something that vanilla spectral techniques cannot do. We also prove a similar result for “non-contrastive” self-supervised learning (i.e., the SimSiam loss). We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns a subset of the Fisher-optimal subspace, effectively filtering out all the noise from the learnt representations. Finally, we corroborate our theoretical findings through synthetic data experiments.
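
A tiny instantiation of the analyzed setup: augmentations are two independent draws from the same mixture component, and a linear encoder is trained with InfoNCE. Dimensions, optimizer, and component placement below are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def sample_gmm_pairs(batch, means, cov_scale=0.5):
    """Each 'augmented pair' is two independent draws from the same mixture component."""
    comps = torch.randint(len(means), (batch,))
    mu = means[comps]
    x1 = mu + cov_scale * torch.randn_like(mu)
    x2 = mu + cov_scale * torch.randn_like(mu)
    return x1, x2

def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE: matching pairs are positives; other pairs in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau
    targets = torch.arange(len(z1))
    return F.cross_entropy(logits, targets)

# Linear encoder mapping 10-d GMM samples to a 2-d representation.
torch.manual_seed(0)
means = torch.randn(4, 10) * 3.0                      # 4 mixture components
encoder = torch.nn.Linear(10, 2, bias=False)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-2)
for step in range(500):
    x1, x2 = sample_gmm_pairs(256, means)
    loss = info_nce(encoder(x1), encoder(x2))
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final InfoNCE loss:", float(loss))
```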

[704] PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates

Nilin Abrahamsen

Main category: cs.LG

TL;DR: PROMA is a proximal policy optimization method that projects gradients during microbatch accumulation to maintain orthogonality, enabling stable training without entropy collapse.

DetailsMotivation: The paper aims to address limitations in proximal policy optimization methods, particularly issues with entropy collapse and poor local KL control in gradient accumulation approaches like GRPO.

Method: PROMA modifies gradient accumulation across microbatches by projecting partially accumulated gradients to be orthogonal to sequence-wise gradients of the current microbatch. This projection is applied layer-wise during backward pass, with Intra-PROMA variant acting independently across microbatches.
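
A minimal sketch of the projection step as described, assuming plain NumPy, a single flattened layer gradient, and one row of G per sequence in the current microbatch; this illustrates the stated idea rather than reproducing the reference implementation.

```python
# Sketch of PROMA-style accumulation for one layer's flattened gradient.
# g_acc: partially accumulated gradient; rows of G: sequence-wise gradients of
# the current microbatch. Before adding the microbatch, g_acc is projected
# onto the orthogonal complement of span(G). Shapes are illustrative.
import numpy as np

def proma_accumulate(g_acc, G):
    Q, _ = np.linalg.qr(G.T)               # orthonormal basis of span(rows of G)
    g_orth = g_acc - Q @ (Q.T @ g_acc)     # remove components along span(G)
    return g_orth, g_orth + G.mean(axis=0) # then accumulate the microbatch

rng = np.random.default_rng(0)
g_acc = rng.normal(size=1000)              # gradient accumulated so far
for _ in range(4):                         # four microbatches of 8 sequences
    G = rng.normal(size=(8, 1000))         # hypothetical per-sequence gradients
    g_orth, g_acc = proma_accumulate(g_acc, G)
    print("max overlap with current microbatch:", np.abs(G @ g_orth).max())
```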

Result: Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.

Conclusion: PROMA offers an effective alternative to likelihood ratio-based methods for proximal policy optimization, with better stability and control properties.

Abstract: This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. A within-microbatch variant (Intra-PROMA) acts independently across microbatches. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.

[705] Non-equilibrium active noise enhances generative memory in diffusion models

Agnish Kumar Behera, Alexandra Lamtyugina, Aditya Nandy, Daiki Goto, Carlos Floyd, Suriyanarayanan Vaikuntanathan

Main category: cs.LG

TL;DR: Active noise in diffusion models creates memory effects that preserve semantic information and improve multi-scale structure generation

DetailsMotivation: Traditional diffusion models use white Gaussian noise which destroys information uniformly. The paper explores whether using active, temporally correlated noise sources can fundamentally alter information thermodynamics and improve generative processes.

Method: The authors couple data to an active non-Markovian bath to create memory effects, use Fisher information analysis to quantify information decay rates, and demonstrate how active noise facilitates earlier symmetry breaking during reverse generative processes.

Result: Active noise mechanisms significantly retard information decay compared to passive Brownian motion, preserve high-level semantic information in temporal correlations, and enable better resolution of multi-scale structures like molecular metastable states.

Conclusion: Non-equilibrium protocols inspired by active matter physics offer thermodynamically distinct and potentially advantageous pathways for recovering high-dimensional energy landscapes in generative diffusion models.

Abstract: Generative diffusion models have emerged as powerful tools for sampling high-dimensional distributions, yet they typically rely on white Gaussian noise and noise schedules to destroy and reconstruct information. Here, we demonstrate that driving the generative process out of equilibrium using active, temporally correlated noise sources fundamentally alters the information thermodynamics of the system. We show that coupling the data to an active non-Markovian bath creates a “memory effect” where high-level semantic information (such as class identity or molecular metastability) is stored in the temporal correlations of auxiliary degrees of freedom. Using Fisher information analysis, we prove that this active mechanism significantly retards the rate of information decay compared to passive Brownian motion. Crucially, this memory effect facilitates an earlier and more robust symmetry breaking (speciation) during the reverse generative process, allowing the system to resolve multi-scale structures, reminiscent of metastable states in molecular configurations that are washed out in the typical noising processes. Our results suggest that non-equilibrium protocols, inspired by active matter physics, offer a thermodynamically distinct and potentially advantageous pathway for recovering high-dimensional energy landscapes using generative diffusion.

[706] Neural Window Decoder for SC-LDPC Codes

Dae-Young Yun, Hee-Youl Kwak, Yongjune Kim, Sang-Hyo Kim, Jong-Seon No

Main category: cs.LG

TL;DR: Neural window decoder for SC-LDPC codes with trainable weights, novel training strategies, non-uniform update schedules, and adaptive error propagation mitigation.

DetailsMotivation: To improve decoding performance of spatially coupled LDPC codes by incorporating neural networks into window decoding while maintaining efficiency and addressing error propagation issues.

Method: Proposes neural window decoder with trainable weights, introduces targeted loss function for variable nodes, uses active learning with normalized loss, develops non-uniform update schedules based on trainable damping factors, and implements adaptive decoding with complementary weight sets for error detection.

Result: Achieves 41% reduction in check node updates without performance degradation compared to conventional window decoder, effectively mitigates error propagation through adaptive decoding strategy.

Conclusion: The neural window decoder successfully enhances SC-LDPC code decoding efficiency while maintaining performance, with adaptive mechanisms to handle error propagation without structural changes.

Abstract: In this paper, we propose a neural window decoder (NWD) for spatially coupled low-density parity-check (SC-LDPC) codes. The proposed NWD retains the conventional window decoder (WD) process but incorporates trainable neural weights. To train the weights of NWD, we introduce two novel training strategies. First, we restrict the loss function to target variable nodes (VNs) of the window, which prunes the neural network and accordingly enhances training efficiency. Second, we employ the active learning technique with a normalized loss term to prevent the training process from biasing toward specific training regions. Next, we develop a systematic method to derive non-uniform schedules for the NWD based on the training results. We introduce trainable damping factors that reflect the relative importance of check node (CN) updates. By skipping updates with less importance, we can omit $\mathbf{41\%}$ of CN updates without performance degradation compared to the conventional WD. Lastly, we address the error propagation problem inherent in SC-LDPC codes by deploying a complementary weight set, which is activated when an error is detected in the previous window. This adaptive decoding strategy effectively mitigates error propagation without requiring modifications to the code and decoder structures.

[707] SDFLoRA: Selective Decoupled Federated LoRA for Privacy-preserving Fine-tuning with Heterogeneous Clients

Zhikang Shen, Jianrong Lu, Haiyuan Wan, Jianhai Chen

Main category: cs.LG

TL;DR: SDFLoRA is a federated learning framework for LLMs that decouples client updates into shared and private components to handle rank/data heterogeneity and improve DP compatibility.

DetailsMotivation: Practical federated learning for LLMs faces challenges with rank and data heterogeneity across clients, where existing methods either enforce unified ranks or mix transferable/client-specific directions, undermining personalization and causing unnecessary utility degradation under differential privacy.

Method: Proposes Selective Decoupled Federated LoRA (SDFLoRA) that decouples each client’s LoRA update into a shared component for aggregation and a private component that remains local. Only the shared component participates in subspace alignment, while private components preserve client-specific semantics without communication.
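
The decoupling can be pictured with a toy sketch like the one below: each client's LoRA factors are split into a shared block (aggregated across clients, with Gaussian noise as a stand-in for a DP mechanism) and a private block that never leaves the client. Shapes, ranks, the noise scale, and the plain averaging rule are all illustrative assumptions.

```python
# Toy sketch of selective decoupling: each client's LoRA factors are split into
# a shared block (aggregated, with Gaussian noise as a DP stand-in) and a
# private block that never leaves the client. All shapes/ranks are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r_shared, n_clients = 64, 64, 4, 5

class ClientLoRA:
    def __init__(self, r_private):
        self.A_sh = rng.normal(size=(r_shared, d_in)) * 0.01   # shared factors
        self.B_sh = np.zeros((d_out, r_shared))
        self.A_pr = rng.normal(size=(r_private, d_in)) * 0.01  # private factors
        self.B_pr = np.zeros((d_out, r_private))               # (stay local)

    def delta_w(self):
        return self.B_sh @ self.A_sh + self.B_pr @ self.A_pr

clients = [ClientLoRA(r_private=int(rng.integers(2, 6))) for _ in range(n_clients)]

for c in clients:                          # stand-in for local training updates
    c.B_sh += rng.normal(size=c.B_sh.shape) * 0.1
    c.B_pr += rng.normal(size=c.B_pr.shape) * 0.1

# Server aggregates ONLY the shared component and perturbs only that part.
sigma = 0.01
B_sh_avg = np.mean([c.B_sh for c in clients], axis=0)
A_sh_avg = np.mean([c.A_sh for c in clients], axis=0)
B_sh_avg += rng.normal(size=B_sh_avg.shape) * sigma     # noise on shared only
A_sh_avg += rng.normal(size=A_sh_avg.shape) * sigma

for c in clients:                          # broadcast shared part; private untouched
    c.B_sh, c.A_sh = B_sh_avg.copy(), A_sh_avg.copy()
print("client 0 effective update norm:", round(np.linalg.norm(clients[0].delta_w()), 3))
```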

Result: Experiments on multiple benchmarks show SDFLoRA outperforms federated LoRA baselines and achieves a strong utility-privacy trade-off by avoiding noise injection into local directions.

Conclusion: SDFLoRA effectively addresses rank/data heterogeneity in federated LLM training while improving DP compatibility through selective decoupling of shared and private update components.

Abstract: Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a privacy-preserving approach for adapting models over distributed data, where parameter-efficient methods such as Low-Rank Adaptation (LoRA) are widely adopted to reduce communication and memory costs. However, practical deployments often exhibit rank and data heterogeneity: clients operate under different low-rank budgets and data distributions, making direct aggregation of LoRA updates biased and unstable. Existing approaches either enforce a unified rank or align heterogeneous updates into a single shared subspace, which tends to mix transferable and client-specific directions and consequently undermines personalization. Moreover, under differential privacy (DP), perturbing such structurally mixed updates injects noise into directions that should remain purely local, leading to unnecessary utility degradation. To address these issues, we propose Selective Decoupled Federated LoRA (SDFLoRA), a structure-aware LoRA framework that decouples each client update into a shared component for aggregation and a private component that preserves client-specific semantics. Only the shared component participates in subspace alignment, while the private component remains local and uncommunicated, making the training DP-compatible and stabilizing aggregation under rank heterogeneity. By injecting noise only into the aggregated shareable update, this approach avoids perturbations to local directions and improves the utility-privacy trade-off. Experiments on multiple benchmarks demonstrate that SDFLoRA outperforms federated LoRA baselines and achieves a strong utility-privacy trade-off.

[708] How Many Ratings per Item are Necessary for Reliable Significance Testing?

Christopher Homan, Flip Korn, Deepak Pandita, Chris Welty

Main category: cs.LG

TL;DR: Paper analyzes reliability of AI evaluation datasets, showing that collecting only 5-10 responses per item is insufficient for reliable statistical testing, with implications for multimodal model evaluation.

DetailsMotivation: Current AI evaluation assumes model and human responses are reliable "gold standards," but generative AI's stochastic nature and human unreliability challenge this. There's a need to determine if datasets have enough responses per item for reliable statistical testing.

Method: Adapts a previously used method for evaluating metric reliability to determine if datasets have sufficient responses per item for reliable null hypothesis statistical testing. Analyzes existing gold standard test sets with multiple disaggregated responses per item.
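
The question being asked can be illustrated with a generic resampling simulation (not the paper's estimator): with only k responses per item, how often does a paired significance test between two systems change its verdict across independent re-collections of ratings? The noise levels, effect size, and use of a paired t-test below are assumptions.

```python
# Generic illustration of the underlying question (not the paper's estimator):
# with only k responses per item, how often does a paired t-test between two
# systems come out "significant" across independent re-collections of ratings?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_items, k, true_gap = 200, 5, 0.05        # k responses per item (assumption)

def collect_scores(gap):
    latent = rng.normal(0.0, 0.3, size=n_items)            # per-item difficulty
    a = latent[:, None] + rng.normal(0.0, 1.0, size=(n_items, k))
    b = latent[:, None] + gap + rng.normal(0.0, 1.0, size=(n_items, k))
    return a.mean(axis=1), b.mean(axis=1)                   # per-item means

verdicts = []
for _ in range(500):                        # 500 hypothetical data collections
    a, b = collect_scores(true_gap)
    _, p = stats.ttest_rel(a, b)
    verdicts.append(p < 0.05)
print(f"k={k}: verdict is 'significant' in {np.mean(verdicts):.0%} of collections")
```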

Result: Shows that collecting even 5-10 responses per item (from each model and human evaluators) is insufficient for many common metrics. Even existing datasets with multiple responses per item lack enough responses for reliable evaluation.

Conclusion: AI researchers need better methods for data collection in AI evaluation. The paper provides tools to help make informed decisions about response collection to ensure reliable statistical testing in model evaluation.

Abstract: A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, “gold standard” data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI – along with strong evidence that humans are unreliable judges – estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.

[709] Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

Main category: cs.LG

TL;DR: Transformers trained with RL and sparse rewards can spontaneously develop Chain-of-Thought reasoning, and this paper analyzes the mechanism through policy gradient dynamics on a synthetic graph traversal task.

DetailsMotivation: To understand how sparse rewards in RL training drive Transformers to discover systematic intermediate reasoning steps (Chain-of-Thought), which remains poorly understood despite empirical observations.

Method: Theoretical analysis of policy gradient dynamics in single-layer Transformers on a synthetic graph traversal task that requires Chain-of-Thought reasoning. The task admits an iterative solution but cannot be solved without intermediate steps. The study characterizes distributional properties needed for emergence of reasoning.

Result: Proves that despite training only on final-answer correctness, policy gradient drives Transformers to converge to structured, interpretable algorithms that iteratively traverse graphs vertex-by-vertex. Identifies critical role of “simple examples” (instances requiring fewer reasoning steps) - when training distribution has sufficient mass on these, Transformers learn generalizable strategies that extrapolate to longer chains; otherwise learning becomes infeasible.

Conclusion: The emergence of Chain-of-Thought reasoning in RL-trained Transformers can be theoretically explained through policy gradient dynamics, with simple examples playing a crucial role in enabling systematic reasoning to develop from sparse rewards.

Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of “simple examples”: instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

[710] An Algorithm for Fixed Budget Best Arm Identification with Combinatorial Exploration

Siddhartha Parupudi, Gourab Ghatak

Main category: cs.LG

TL;DR: Multi-arm bandit algorithm for best arm identification using group sampling instead of single arm sampling, with log₂K groups and likelihood ratio testing

DetailsMotivation: Traditional best arm identification in bandit problems samples one arm at a time, but sampling subsets of arms simultaneously could provide better trade-offs between exploration efficiency and reward information quality

Method: Constructs log₂K groups of arms and performs likelihood ratio tests to detect the presence of the best arm in each group, then uses Hamming decoding to identify the unique best arm
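
A toy illustration of the grouping-and-decoding idea is sketched below: arms are indexed in binary, each bit position defines one group, and deciding bit by bit whether a group contains the best arm recovers the best arm's index. A simple comparison of group sample means stands in for the paper's likelihood ratio test, and all parameters are assumed.

```python
# Toy illustration: arms indexed in binary, one group per bit position, and a
# bit-by-bit decision on "does this group contain the best arm?" decodes the
# best arm's index. A simple group-mean comparison replaces the paper's
# likelihood ratio test; all parameters are assumed.
import numpy as np

rng = np.random.default_rng(0)
K, bits = 16, 4                            # K arms, log2(K) groups
mu = rng.normal(0.0, 0.05, size=K)
best = int(rng.integers(K))
mu[best] += 1.0                            # one clearly best arm

pulls = 2000
def group_reward(mask):
    """Playing a subset returns the sample average of its arms' rewards."""
    arms = np.flatnonzero(mask)
    return rng.normal(mu[arms], 1.0, size=(pulls, arms.size)).mean()

decoded = 0
for b in range(bits):
    in_group = np.array([(a >> b) & 1 for a in range(K)], dtype=bool)
    bit = int(group_reward(in_group) > group_reward(~in_group))
    decoded |= bit << b                    # Hamming-style bitwise decoding

print("true best arm:", best, "| decoded:", decoded)
```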

Result: Derives upper bound for error probability based on new hardness parameter H₄, and demonstrates cases where it outperforms state-of-the-art single-play algorithms

Conclusion: Group sampling approach to best arm identification can be more efficient than single-arm sampling in certain scenarios, with theoretical guarantees on performance

Abstract: We consider the best arm identification (BAI) problem in the $K$-armed bandit framework with a modification - the agent is allowed to play a subset of arms at each time slot instead of one arm. Consequently, the agent observes the sample average of the rewards of the arms that constitute the probed subset. Several trade-offs arise here - e.g., sampling a larger number of arms together results in a wider view of the environment, while sampling fewer arms enhances the information about individual reward distributions. Furthermore, although grouping a large number of suboptimal arms together reduces the variance of the group's reward, it may raise the group mean close to that of the group containing the optimal arm. To solve this problem, we propose an algorithm that constructs $\log_2 K$ groups and performs a likelihood ratio test to detect the presence of the best arm in each of these groups. Then a Hamming decoding procedure determines the unique best arm. We derive an upper bound for the error probability of the proposed algorithm based on a new hardness parameter $H_4$. Finally, we demonstrate cases under which it outperforms the state-of-the-art algorithms for the single play case.

[711] CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer, Kathan Pandya

Main category: cs.LG

TL;DR: CleanSurvival is a reinforcement learning framework for automated data preprocessing optimization in survival analysis, using Q-learning to select optimal combinations of imputation, outlier detection, and feature extraction techniques.

DetailsMotivation: Data preprocessing is critical but often neglected in machine learning, especially for specialized tasks like survival analysis. While automated ML pipelines exist for classification/regression, there's a lack of tailored automated preprocessing solutions for survival/time-to-event models.

Method: Uses Q-learning reinforcement learning to optimize preprocessing pipelines for survival analysis. Handles continuous and categorical variables, selecting optimal combinations of data imputation, outlier detection, and feature extraction techniques for Cox, random forest, neural network, or user-supplied time-to-event models.
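
A simplified, bandit-style sketch of the idea (not the CleanSurvival implementation) is shown below: tabular Q-values over a three-stage pipeline are updated from a terminal reward, with a made-up scoring function standing in for cross-validated survival-model performance. Option names, the factored Q-update, and all hyperparameters are illustrative assumptions.

```python
# Simplified, bandit-style sketch of Q-learning over a three-stage pipeline
# (imputation -> outlier handling -> feature extraction). The reward is a
# made-up stand-in for cross-validated survival-model performance; option
# names and all hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
stages = [["mean_impute", "knn_impute", "drop_rows"],
          ["no_outlier_step", "iqr_clip", "isolation_forest"],
          ["none", "pca", "select_k_best"]]

def evaluate_pipeline(choice):             # placeholder for real CV evaluation
    secret_best = (1, 1, 2)                # pretend this combination works best
    match = sum(c == b for c, b in zip(choice, secret_best))
    return 0.6 + 0.1 * match + rng.normal(0, 0.02)

Q = [np.zeros(len(opts)) for opts in stages]
alpha, eps = 0.2, 0.2
for episode in range(300):
    choice = [int(rng.integers(len(opts))) if rng.random() < eps
              else int(np.argmax(Q[i])) for i, opts in enumerate(stages)]
    reward = evaluate_pipeline(tuple(choice))
    for i, a in enumerate(choice):         # terminal reward credited per stage
        Q[i][a] += alpha * (reward - Q[i][a])

print("selected pipeline:",
      [opts[int(np.argmax(Q[i]))] for i, opts in enumerate(stages)])
```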

Result: Q-learning-based data preprocessing achieves superior predictive performance compared to standard approaches, finding optimal models up to 10 times faster than undirected random grid search. Simulation studies demonstrate effectiveness across different types and levels of missingness and noise in data.

Conclusion: CleanSurvival successfully addresses the gap in automated preprocessing for survival analysis, providing an efficient reinforcement learning framework that improves model performance and accelerates preprocessing optimization.

Abstract: Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.

[712] One-Shot Federated Learning with Classifier-Free Diffusion Models

Obaidullah Zaland, Shutong Jin, Florian T. Pokorny, Monowar Bhuyan

Main category: cs.LG

TL;DR: OSCAR introduces a one-shot federated learning approach using classifier-free diffusion models to generate server-side data without auxiliary models, reducing communication by 99% while outperforming SOTA.

DetailsMotivation: Federated learning has high communication costs from multiple rounds. One-shot FL reduces this but existing methods using diffusion models require auxiliary classifier training at clients, adding computation overhead.

Method: Uses foundation models to create category-specific data representations at clients, integrates these into a classifier-free diffusion model pipeline for server-side data generation, eliminating need for auxiliary models.

Result: Outperforms state-of-the-art on four benchmark datasets while reducing communication load by at least 99%.

Conclusion: OSCAR provides efficient one-shot federated learning by leveraging classifier-free diffusion models and foundation models, achieving better performance with dramatically reduced communication.

Abstract: Federated learning (FL) enables collaborative learning without data centralization but introduces significant communication costs due to multiple communication rounds between clients and the server. One-shot federated learning (OSFL) addresses this by forming a global model with a single communication round, often relying on the server’s model distillation or auxiliary dataset generation - mostly through pre-trained diffusion models (DMs). Existing DM-assisted OSFL methods, however, typically employ classifier-guided DMs, which require training auxiliary classifier models at each client, introducing additional computation overhead. This work introduces OSCAR (One-Shot Federated Learning with Classifier-Free Diffusion Models), a novel OSFL approach that eliminates the need for auxiliary models. OSCAR uses foundation models to devise category-specific data representations at each client which are integrated into a classifier-free diffusion model pipeline for server-side data generation. In our experiments, OSCAR outperforms the state-of-the-art on four benchmark datasets while reducing the communication load by at least 99%.

[713] How simple can you go? An off-the-shelf transformer approach to molecular dynamics

Max Eissler, Tim Korjakow, Stefan Ganscha, Oliver T. Unke, Klaus-Robert Müller, Stefan Gugler

Main category: cs.LG

TL;DR: Edge Transformer adapted for molecular dynamics achieves state-of-the-art results without built-in equivariance or energy conservation, showing general-purpose architectures can work for MD with proper pre-training.

DetailsMotivation: To challenge the necessity of specialized architectural features (like rotational equivariance and energy conservation) in molecular dynamics models by testing if a minimally modified general-purpose transformer can perform competitively.

Method: MD-ET uses an Edge Transformer architecture with minimal MD-specific modifications, trained via supervised pre-training on ~30 million molecular structures from QCML database, then fine-tuned on benchmarks.
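
One simple way to probe non-equivariance (in the spirit of the measurement the paper proposes, though not necessarily its exact procedure) is to compare predictions on a rotated structure against the rotated predictions on the original structure, as sketched below with a placeholder force model.

```python
# Sketch of one way to probe (non-)equivariance: compare predictions on a
# rotated structure with the rotated predictions on the original structure.
# `toy_force_model` is a placeholder so the snippet runs end to end.
import numpy as np

rng = np.random.default_rng(0)
positions = rng.normal(size=(20, 3))           # toy molecular structure

def toy_force_model(pos):                      # placeholder force predictor
    return pos - pos.mean(axis=0)              # (this stand-in IS equivariant)

M = rng.normal(size=(3, 3))
R, _ = np.linalg.qr(M)
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                              # make it a proper rotation

err = np.linalg.norm(toy_force_model(positions @ R.T)
                     - toy_force_model(positions) @ R.T)
print("equivariance error:", err)              # ~0 for an equivariant model
```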

Result: State-of-the-art results on several benchmarks after minimal fine-tuning; approximately energy-conserving NVE simulations for small structures but energy increases for larger ones; novel method for distinguishing non-equivariance errors from other inaccuracies.

Conclusion: General-purpose transformer architectures can achieve competitive MD performance without built-in physical inductive biases, though energy conservation remains challenging for larger systems.

Abstract: Most current neural networks for molecular dynamics (MD) include physical inductive biases, resulting in specialized and complex architectures. This is in contrast to most other machine learning domains, where specialist approaches are increasingly replaced by general-purpose architectures trained on vast datasets. In line with this trend, several recent studies have questioned the necessity of architectural features commonly found in MD models, such as built-in rotational equivariance or energy conservation. In this work, we contribute to the ongoing discussion by evaluating the performance of an MD model with as few specialized architectural features as possible. We present a recipe for MD using an Edge Transformer, an “off-the-shelf” transformer architecture that has been minimally modified for the MD domain, termed MD-ET. Our model implements neither built-in equivariance nor energy conservation. We use a simple supervised pre-training scheme on $\sim$30 million molecular structures from the QCML database. Using this “off-the-shelf” approach, we show state-of-the-art results on several benchmarks after fine-tuning for a small number of steps. Additionally, we examine the effects of being only approximately equivariant and energy conserving for MD simulations, proposing a novel method for distinguishing the errors resulting from non-equivariance from other sources of inaccuracies like numerical rounding errors. While our model exhibits runaway energy increases on larger structures, we show approximately energy-conserving NVE simulations for a range of small structures.

[714] When Does Adaptation Win? Scaling Laws for Meta-Learning in Quantum Control

Nima Leclerc, Chris Miller, Nicholas Brawand

Main category: cs.LG

TL;DR: Meta-learning scaling laws show adaptation benefits saturate exponentially with gradient steps and scale linearly with task variance, validated on quantum gate calibration and classical control.

DetailsMotivation: Quantum hardware suffers from device heterogeneity and environmental drift, forcing choices between suboptimal non-adaptive controllers or costly per-device recalibration. Need quantitative framework to determine when adaptation justifies its overhead.

Method: Derived scaling law lower bound for meta-learning showing adaptation gain saturates exponentially with gradient steps and scales linearly with task variance. Validated on quantum gate calibration tasks and classical linear-quadratic control problems.
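
A hedged illustration of the stated scaling behaviour follows: adaptation gain that grows linearly in task variance and saturates exponentially in the number of gradient steps. The functional form, constants, and step counts below are assumptions used to visualize the claim, not the paper's derived bound.

```python
# Illustration of the stated scaling behaviour: adaptation gain growing
# linearly in task variance and saturating exponentially in gradient steps.
# The functional form and constants are assumptions used to visualize the
# claim, not the paper's derived bound.
import numpy as np

def adaptation_gain(steps, task_var, C=1.0, kappa=3.0):
    return C * task_var * (1.0 - np.exp(-steps / kappa))

for var in (0.01, 0.1, 1.0):
    gains = [adaptation_gain(k, var) for k in (1, 5, 20)]
    print(f"task variance {var:>4}: gain at 1/5/20 steps = "
          + ", ".join(f"{g:.3f}" for g in gains))
```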

Result: Negligible benefits for low-variance tasks but >40% fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10× training noise). Classical control validation confirms laws emerge from general optimization geometry rather than quantum-specific physics.

Conclusion: Provides transferable framework for decision-making in adaptive control, with implications for reducing per-device calibration time on cloud quantum processors.

Abstract: Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but $>40\%$ fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.

[715] Towards Anomaly-Aware Pre-Training and Fine-Tuning for Graph Anomaly Detection

Yunhui Liu, Jiashun Cheng, Yiqing Lin, Qizhuo Xie, Jia Li, Fugee Tsung, Hongzhi Yin, Tao Zheng, Jianhua Zhao, Tieke He

Main category: cs.LG

TL;DR: APF is a graph anomaly detection framework with anomaly-aware pre-training using spectral polynomial filters and Rayleigh Quotient-based subgraph selection, followed by gated fusion fine-tuning with anomaly-aware regularization.

DetailsMotivation: Address two key challenges in graph anomaly detection: (1) label scarcity due to high annotation costs, and (2) homophily disparity at node and class levels.

Method: Two-stage framework: Pre-training uses Rayleigh Quotient anomaly metric to select node-specific subgraphs and two spectral polynomial filters for dual representations. Fine-tuning uses gated fusion to integrate representations and anomaly-aware regularization loss.
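
The Rayleigh Quotient ingredient can be sketched in a few lines: for a graph Laplacian L and node signal x, the quotient x^T L x / x^T x is larger for signals that vary sharply across edges (anomaly-like) than for smooth ones. The synthetic graph and signals below are illustrative, and APF's subgraph selection and spectral filters are not reproduced.

```python
# Sketch of the Rayleigh Quotient as a label-free anomaly signal: for a graph
# Laplacian L and node signal x, x^T L x / x^T x is larger when the signal
# varies sharply across edges (anomaly-like) than when it is smooth. The graph
# and signals are synthetic; APF's subgraph selection is not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n = 30
A = (rng.random((n, n)) < 0.15).astype(float)
A = np.triu(A, 1); A = A + A.T                    # undirected adjacency
L = np.diag(A.sum(axis=1)) - A                    # combinatorial Laplacian

def rayleigh_quotient(x):
    return float(x @ L @ x) / float(x @ x)

smooth = np.ones(n) + 0.01 * rng.normal(size=n)   # normal, smooth signal
spiky = smooth.copy(); spiky[0] += 5.0            # one anomalous node
print("smooth signal RQ:", round(rayleigh_quotient(smooth), 4))
print("spiky  signal RQ:", round(rayleigh_quotient(spiky), 4))
```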

Result: Superior performance validated on 10 benchmark datasets compared to state-of-the-art baselines, with theoretical analysis showing APF tends to achieve linear separability under mild conditions.

Conclusion: APF effectively addresses label scarcity and homophily disparity in graph anomaly detection through anomaly-aware pre-training and fine-tuning, demonstrating strong empirical performance and theoretical guarantees.

Abstract: Graph anomaly detection (GAD) has garnered increasing attention in recent years, yet remains challenging due to two key factors: (1) label scarcity stemming from the high cost of annotations and (2) homophily disparity at node and class levels. In this paper, we introduce Anomaly-Aware Pre-Training and Fine-Tuning (APF), a targeted and effective framework to mitigate the above challenges in GAD. In the pre-training stage, APF incorporates node-specific subgraphs selected via the Rayleigh Quotient, a label-free anomaly metric, into the learning objective to enhance anomaly awareness. It further introduces two learnable spectral polynomial filters to jointly learn dual representations that capture both general semantics and subtle anomaly cues. During fine-tuning, a gated fusion mechanism adaptively integrates pre-trained representations across nodes and dimensions, while an anomaly-aware regularization loss encourages abnormal nodes to preserve more anomaly-relevant information. Furthermore, we theoretically show that APF tends to achieve linear separability under mild conditions. Comprehensive experiments on 10 benchmark datasets validate the superior performance of APF in comparison to state-of-the-art baselines.

[716] Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks

Andrew Millard, Joshua Murphy, Simon Maskell, Zheng Zhao

Main category: cs.LG

TL;DR: New SMC-based training method for partial Bayesian neural networks using guided proposals and gradient-based Markov kernels improves scalability and performance

DetailsMotivation: Partial Bayesian neural networks (pBNNs) perform competitively with full Bayesian neural networks but only require a subset of parameters to be stochastic. While SMC samplers have shown improved performance over parametric methods for pBNNs, there's a need for better scalability on high-dimensional problems.

Method: Introduces a new SMC-based training method for pBNNs that utilizes guided proposals and incorporates gradient-based Markov kernels to improve scalability on high-dimensional problems.

Result: The new method outperforms state-of-the-art in terms of predictive performance and optimal loss. Also shows that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.

Conclusion: The proposed SMC-based training method with guided proposals and gradient-based Markov kernels provides an effective approach for training partial Bayesian neural networks with improved scalability and performance on high-dimensional problems.

Abstract: Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels, which gives us better scalability on high dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.

[717] Metric Graph Kernels via the Tropical Torelli Map

Yueqi Cao, Anthea Monod

Main category: cs.LG

TL;DR: First graph kernels for metric graphs using tropical algebraic geometry, focusing on geometry/topology rather than combinatorics, with edge subdivision invariance and efficient algorithms.

DetailsMotivation: Existing graph kernels focus on combinatorial properties (nodes, edges, subgraphs) but overlook geometric and topological information of metric graphs. There's a need for kernels that capture the underlying metric space structure, especially for applications like road networks where geometry matters.

Method: Develop graph kernels based on tropical algebraic geometry that operate on metric graphs. Key properties include invariance under edge subdivision. Design efficient algorithms whose complexity depends on graph genus rather than size. Apply to synthetic and real-world datasets including urban road networks.

Result: The kernels capture complementary geometric/topological information missed by standard combinatorial approaches, especially in label-free settings. They demonstrate practical utility in urban road network classification tasks. Algorithms are efficient with complexity tied to genus rather than graph size.

Conclusion: The proposed metric graph kernels provide a novel geometric/topological perspective for graph comparison, complementing existing combinatorial approaches, with practical applications in domains where metric space structure is important.

Abstract: We introduce the first graph kernels for metric graphs via tropical algebraic geometry. In contrast to conventional graph kernels based on graph combinatorics such as nodes, edges, and subgraphs, our metric graph kernels are purely based on the geometry and topology of the underlying metric space. A key characterizing property of our construction is its invariance under edge subdivision, making the kernels intrinsically well-suited for comparing graphs representing different underlying metric spaces. We develop efficient algorithms to compute our kernels and analyze their complexity, which depends primarily on the genus of the input graphs rather than their size. Through experiments on synthetic data and selected real-world datasets, we demonstrate that our kernels capture complementary geometric and topological information overlooked by standard combinatorial approaches, particularly in label-free settings. We further showcase their practical utility with an urban road network classification task.

[718] Representative Action Selection for Large Action Space Bandit Families

Quan Zhou, Mark Kozdoba, Shie Mannor

Main category: cs.LG

TL;DR: Bandit algorithm that reduces large action spaces by exploiting reward correlations without prior knowledge of correlation structure

DetailsMotivation: Many bandit problems have large action spaces with correlated rewards, making full exploration inefficient. Need methods to reduce action space size while maintaining near-optimal performance.

Method: Proposes an algorithm that identifies and exploits correlations between action rewards to select a representative subset of actions, without requiring prior knowledge of the correlation structure.

Result: Theoretical guarantees show performance nearly matches using full action space. Empirical comparisons demonstrate effectiveness against Thompson Sampling and UCB methods.

Conclusion: The algorithm successfully reduces action space size in correlated bandit settings, improving efficiency while maintaining strong performance.

Abstract: We study the problem of selecting a subset from a large action space shared by a family of bandits, with the goal of achieving performance nearly matching that of using the full action space. Indeed, in many natural situations, while the nominal set of actions may be large, there also exist significant correlations between the rewards of different actions. In this paper we propose an algorithm that can significantly reduce the action space when such correlations are present, without needing to know the correlation structure a priori. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Thompson Sampling and Upper Confidence Bound methods.

[719] Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

Jinlin Liu, Wei Chen, Xiaojin Zhang

Main category: cs.LG

TL;DR: PIL is an efficient method for creating unlearnable examples using linear surrogate models instead of deep neural networks, achieving comparable protection with dramatically reduced computational cost.

DetailsMotivation: To address concerns about unauthorized data usage in deep learning, unlearnable examples add perturbations to prevent effective model learning, but existing methods are computationally expensive due to reliance on deep neural network surrogate models.

Method: Proposes Perturbation-Induced Linearization (PIL), which generates perturbations using only linear surrogate models instead of deep neural networks, making the process much more computationally efficient while maintaining effectiveness.

Result: PIL achieves comparable or better performance than existing surrogate-based methods while dramatically reducing computational time. The paper reveals that unlearnable examples work by inducing linearization in deep models.

Conclusion: PIL provides a practical, efficient approach for data protection through unlearnable examples and offers insights into why these examples are effective by revealing the linearization mechanism.

Abstract: Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization in deep models, which explains why PIL can achieve competitive results in a very short time. Beyond this, we provide an analysis of the properties of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.

[720] Learning for Dynamic Combinatorial Optimization without Training Data

Yiqiao Liao, Farinaz Koushanfar, Parinaz Naghizadeh

Main category: cs.LG

TL;DR: DyCO-GNN is an unsupervised framework for dynamic combinatorial optimization that learns from problem instances without training data, using graph neural networks to accelerate optimization across time-evolving graph snapshots.

DetailsMotivation: The paper addresses the challenge of solving dynamic combinatorial optimization problems where the problem structure evolves over time, requiring efficient methods that can adapt to changes without extensive training data or retraining.

Method: DyCO-GNN leverages structural similarities across time-evolving graph snapshots using graph neural networks. It operates in an unsupervised manner, requiring no training data beyond the problem instance itself, and accelerates optimization by transferring knowledge across temporal snapshots.

Result: DyCO-GNN achieves superior performance on dynamic maximum cut, maximum independent set, and traveling salesman problems across diverse datasets, outperforming baseline methods and achieving high-quality solutions 3-60x faster under tight time budgets.

Conclusion: DyCO-GNN demonstrates practical effectiveness for rapidly evolving resource-constrained settings by efficiently solving dynamic combinatorial optimization problems without requiring training data, making it suitable for real-world applications with time-varying constraints.

Abstract: We introduce DyCO-GNN, a novel unsupervised learning framework for Dynamic Combinatorial Optimization that requires no training data beyond the problem instance itself. DyCO-GNN leverages structural similarities across time-evolving graph snapshots to accelerate optimization while maintaining solution quality. We evaluate DyCO-GNN on dynamic maximum cut, maximum independent set, and the traveling salesman problem across diverse datasets of varying sizes, demonstrating its superior performance under tight and moderate time budgets. DyCO-GNN consistently outperforms the baseline methods, achieving high-quality solutions up to 3-60x faster, highlighting its practical effectiveness in rapidly evolving resource-constrained settings.

[721] NeuraLSP: An Efficient and Rigorous Neural Left Singular Subspace Preconditioner for Conjugate Gradient Methods

Alexander Benanti, Xi Han, Hong Qin

Main category: cs.LG

TL;DR: NeuraLSP: A neural preconditioner using left singular subspace of near-nullspace vectors to accelerate PDE solving with theoretical guarantees and up to 53% speedup.

DetailsMotivation: Existing neural preconditioners for PDEs using GNNs suffer from rank inflation and suboptimal convergence rates when aggregating discretized system matrices into graphs.

Method: Proposes NeuraLSP, a neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix’s near-nullspace vectors, compressing spectral information into a fixed low-rank operator.
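
The flavour of a near-nullspace, low-rank preconditioner can be conveyed with a hand-built (non-neural) stand-in: deflate the worst-conditioned directions of an SPD system inside conjugate gradients using the subspace spanned by its smallest eigenvectors. The matrix construction and the deflation operator below are illustrative assumptions, not NeuraLSP itself.

```python
# Hand-built (non-neural) stand-in for a near-nullspace, low-rank
# preconditioner: deflate the worst-conditioned directions of an SPD system
# inside conjugate gradients using its smallest eigenvectors. The matrix and
# the deflation operator below are illustrative assumptions, not NeuraLSP.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, r = 200, 5
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
eigs = np.concatenate([np.full(r, 1e-3), np.linspace(1.0, 10.0, n - r)])
A = Q @ np.diag(eigs) @ Q.T                       # ill-conditioned SPD matrix
b = rng.normal(size=n)

U, lam = Q[:, :r], eigs[:r]                       # "near-nullspace" subspace
def apply_Minv(x):                                # M^{-1} = I + U (1/lam - 1) U^T
    return x + U @ ((1.0 / lam - 1.0) * (U.T @ x))

counts = {"plain": 0, "precond": 0}
def counter(key):
    def cb(xk): counts[key] += 1                  # count CG iterations
    return cb

cg(A, b, maxiter=2000, callback=counter("plain"))
cg(A, b, maxiter=2000, callback=counter("precond"),
   M=LinearOperator((n, n), matvec=apply_Minv))
print("CG iterations plain / preconditioned:", counts["plain"], "/", counts["precond"])
```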

Result: Method exhibits theoretical guarantees and empirical robustness to rank inflation, achieving up to 53% speedup across diverse families of PDEs.

Conclusion: NeuraLSP provides a theoretically sound neural preconditioning approach that addresses limitations of existing methods and demonstrates significant acceleration for PDE solving.

Abstract: Numerical techniques for solving partial differential equations (PDEs) are integral for many fields across science and engineering. Such techniques usually involve solving large, sparse linear systems, where preconditioning methods are critical. In recent years, neural methods, particularly graph neural networks (GNNs), have demonstrated their potential through accelerated convergence. Nonetheless, to extract connective structures, existing techniques aggregate discretized system matrices into graphs, and suffer from rank inflation and a suboptimal convergence rate. In this paper, we articulate NeuraLSP, a novel neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix’s near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, our method exhibits both theoretical guarantees and empirical robustness to rank inflation, affording up to a 53% speedup. Besides the theoretical guarantees for our newly-formulated loss function, our comprehensive experimental results across diverse families of PDEs also substantiate the aforementioned theoretical advances.

[722] There Was Never a Bottleneck in Concept Bottleneck Models

Antonio Almudévar, José Miguel Hernández-Lobato, Alfonso Ortega

Main category: cs.LG

TL;DR: MCBMs improve concept bottleneck models by adding information bottleneck regularization to ensure each representation component encodes only information relevant to its corresponding concept, enhancing interpretability and intervention validity.

DetailsMotivation: Standard Concept Bottleneck Models (CBMs) don't truly enforce concept bottlenecks - components that can predict concepts may still encode unrelated information, compromising interpretability and intervention procedures.

Method: Propose Minimal Concept Bottleneck Models (MCBMs) that incorporate an Information Bottleneck objective via variational regularization to constrain each representation component to retain only information relevant to its corresponding concept.
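
A minimal sketch of the kind of objective described above follows, assuming a per-concept Gaussian posterior, one scalar latent per concept, and a KL term to a standard normal as the variational IB regularizer; the sizes, scalar concept heads, and beta are illustrative assumptions rather than the paper's architecture.

```python
# Minimal sketch of a per-concept variational bottleneck: each latent z_i must
# predict its concept, the task head uses all z, and a KL term to N(0, I)
# penalizes extra information. Sizes, scalar concept heads and beta are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_concepts, x_dim, n_classes, beta = 4, 16, 3, 0.1
enc = torch.nn.Linear(x_dim, 2 * n_concepts)          # mean and log-var per concept
concept_heads = torch.nn.Parameter(torch.randn(n_concepts))  # z_i -> concept_i logit
task_head = torch.nn.Linear(n_concepts, n_classes)

def mcbm_loss(x, concepts, y):
    mu, logvar = enc(x).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterization
    concept_logits = z * concept_heads                        # each z_i predicts c_i
    kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
    return (F.cross_entropy(task_head(z), y)
            + F.binary_cross_entropy_with_logits(concept_logits, concepts)
            + beta * kl)                                      # the IB regularizer

x = torch.randn(8, x_dim)
concepts = torch.randint(0, 2, (8, n_concepts)).float()
y = torch.randint(0, n_classes, (8,))
print("loss:", round(mcbm_loss(x, concepts, y).item(), 3))
```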

Result: MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations compared to standard CBMs.

Conclusion: MCBMs address the fundamental limitation of CBMs by enforcing true concept bottlenecks through information bottleneck regularization, improving model interpretability and intervention reliability.

Abstract: Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.

[723] Causal-Driven Feature Evaluation for Cross-Domain Image Classification

Chen Cheng, Ang Li

Main category: cs.LG

TL;DR: Proposes causal evaluation framework for OOD generalization that measures necessity and sufficiency of representations rather than just domain invariance

DetailsMotivation: Current OOD generalization methods focus on domain-invariant representations, but invariant features aren't necessarily causally effective for prediction. Need better evaluation of representation quality under distribution shift.

Method: Introduces explicit segment-level framework to directly measure causal effectiveness (necessity and sufficiency) of learned representations across domains, moving beyond invariance alone

Result: Experiments on multi-domain benchmarks show consistent improvements in OOD performance, especially under challenging domain shifts

Conclusion: Causal evaluation provides more faithful criterion than invariance alone for robust generalization, highlighting value of causal perspective for OOD classification

Abstract: Out-of-distribution (OOD) generalization remains a fundamental challenge in real-world classification, where test distributions often differ substantially from training data. Most existing approaches pursue domain-invariant representations, implicitly assuming that invariance implies reliability. However, features that are invariant across domains are not necessarily causally effective for prediction. In this work, we revisit OOD classification from a causal perspective and propose to evaluate learned representations based on their necessity and sufficiency under distribution shift. We introduce an explicit segment-level framework that directly measures causal effectiveness across domains, providing a more faithful criterion than invariance alone. Experiments on multi-domain benchmarks demonstrate consistent improvements in OOD performance, particularly under challenging domain shifts, highlighting the value of causal evaluation for robust generalization.

[724] DBellQuant: Breaking the Bell with Double-Bell Transformation for LLMs Post Training Binarization

Zijian Ye, Wei Huang, Yifei Yu, Tianhe Ren, Zhongrui Wang, Xiaojuan Qi

Main category: cs.LG

TL;DR: DBellQuant is a post-training quantization framework that achieves near-1-bit weight compression and 6-bit activation quantization for LLMs with minimal performance degradation using a dual-bell distribution transformation technique.

DetailsMotivation: LLMs face computational and memory challenges limiting practical deployment. While quantization helps, existing methods suffer from quantization errors due to non-quantization-friendly weight distributions and activation outliers.

Method: DBellQuant uses Learnable Transformation for Dual-Bell (LTDB) algorithm that transforms single-bell weight distributions into dual-bell forms to reduce binarization errors, and applies inverse transformations to smooth activations for effective quantization.

Result: Achieves state-of-the-art performance with aggressive quantization: on Wikitext2, DBellQuant reaches a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming BiLLM’s 21.35 without activation quantization.

Conclusion: DBellQuant demonstrates effective LLM compression for real-world applications through innovative dual-bell transformation, enabling near-1-bit weight compression with minimal performance loss.

Abstract: Large language models (LLMs) demonstrate remarkable performance but face substantial computational and memory challenges that limit their practical deployment. Quantization has emerged as a promising solution; however, its effectiveness is often limited by quantization errors arising from weight distributions that are not quantization-friendly and the presence of activation outliers. To address these challenges, we introduce DBellQuant, an innovative post-training quantization (PTQ) framework that achieves nearly 1-bit weight compression and 6-bit activation quantization with minimal performance degradation. DBellQuant uses Learnable Transformation for Dual-Bell (LTDB) algorithm, which transforms single-bell weight distributions into dual-bell forms to reduce binarization errors and applies inverse transformations to smooth activations. DBellQuant sets a new state-of-the-art by preserving superior model performance under aggressive weight and activation quantization. For example, on the Wikitext2 dataset, DBellQuant achieves a perplexity of 14.39 on LLaMA2-13B with 6-bit activation quantization, significantly outperforming BiLLM’s 21.35 without activation quantization, underscoring its potential in compressing LLMs for real-world applications.

[725] CTRLS: Chain-of-Thought Reasoning via Latent State-Transition

Junda Wu, Yuxin Xiong, Xintong Li, Sheldon Yu, Zhengmian Hu, Tong Yu, Rui Wang, Xiang Chen, Jingbo Shang, Julian McAuley

Main category: cs.LG

TL;DR: CTRLS is a framework that formulates chain-of-thought reasoning as a Markov decision process with latent state transitions, using distributional reinforcement learning for principled exploration of reasoning trajectories.

DetailsMotivation: Conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, limiting systematic exploration of diverse and effective reasoning trajectories. There's a need for more principled approaches to model reasoning dynamics.

Method: Formulates CoT reasoning as a Markov decision process with latent state transitions, using distributional reinforcement learning to model reasoning actions as explicit probability distributions in latent space. Incorporates on-policy RL with epsilon-greedy exploration and entropy-based regularization to refine latent state transitions without fine-tuning the underlying LLM.

Result: Theoretical analyses provide evidence lower bounds (ELBO) for transition-aware modeling. Experiments show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.

Conclusion: CTRLS provides a principled framework for modeling reasoning transitions in CoT, enabling systematic exploration of reasoning space through distributional reinforcement learning and latent state modeling.

Abstract: Chain-of-thought (CoT) reasoning enables large language models (LLMs) to break down complex problems into interpretable intermediate steps, significantly enhancing model transparency and performance in reasoning tasks. However, conventional CoT methods rely on heuristic sampling without structured modeling of reasoning transitions, constraining their ability to systematically explore and discover diverse and effective reasoning trajectories. In this work, we introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions, enabling principled and state-aware exploration via distributional reinforcement learning. By modelling reasoning actions as explicit probability distributions in latent space, our approach explicitly models epistemic uncertainty, facilitating robust exploration of the reasoning space. As part of our framework, we introduce an on-policy reinforcement learning strategy incorporating epsilon-greedy exploration and entropy-based regularization to iteratively refine latent state transitions without requiring additional fine-tuning of the underlying LLM. Theoretical analyses provide evidence lower bounds (ELBO), theoretically grounding our transition-aware modeling of latent reasoning dynamics. Further experiments demonstrate improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.

[726] $\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

Zihao Wang, Hang Yin, Lihui Liu, Hanghang Tong, Yangqiu Song, Ginny Wong, Simon See

Main category: cs.LG

TL;DR: The paper studies Minimal Embeddable Dimension (MED) for embedding subset memberships into vector spaces, finding tight bounds that show logarithmic dependency on number of elements, suggesting retrieval limitations are due to learnability not geometric constraints.

DetailsMotivation: To understand the fundamental geometric limitations of embedding-based retrieval systems, specifically whether the difficulties in learning effective embeddings stem from geometric constraints or learnability challenges.

Method: Theoretical derivation of tight bounds for MED for various distance/similarity metrics (ℓ₂, inner product, cosine similarity), plus numerical simulations using centroid-based subset embeddings.
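
The "achievable" centroid setting lends itself to a small simulation, sketched below: element embeddings are random unit vectors, a subset's embedding is the centroid of its members, and we count how often the k members are exactly the top-k items by inner product as the dimension grows. The dimensions, trial counts, and random-embedding choice are assumptions for illustration.

```python
# Small simulation in the spirit of the "achievable" setting: element
# embeddings are random unit vectors, a subset's embedding is the centroid of
# its members, and we count how often the k members are exactly the top-k
# items by inner product as the dimension d grows. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, k, trials = 1000, 4, 500

for d in (8, 16, 32, 64):
    E = rng.normal(size=(m, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    failures = 0
    for _ in range(trials):
        S = rng.choice(m, size=k, replace=False)
        q = E[S].mean(axis=0)                   # subset embedding = centroid
        topk = np.argsort(-(E @ q))[:k]         # k highest inner products
        failures += set(topk) != set(S)
    print(f"d={d:>3}: top-{k} retrieval failures {failures}/{trials}")
```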

Result: Found logarithmic dependency between MED and number of elements, with simulations easily achieving this bound, suggesting geometric constraints are not the primary limitation.

Conclusion: Embedding-based retrieval limitations primarily stem from learnability challenges rather than geometric constraints, providing guidance for future algorithm design.

Abstract: This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of “distances” or “similarities,” including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.

[727] Transforming Datasets to Requested Complexity with Projection-based Many-Objective Genetic Algorithm

Joanna Komorniczak

Main category: cs.LG

TL;DR: Genetic algorithm optimizes classification and regression datasets to achieve target complexity measures, enabling controlled difficulty evaluation of ML methods.

DetailsMotivation: Need for advanced synthetic data generators to evaluate ML methods with diverse problem complexities, addressing limitations in current evaluation datasets.

Method: Genetic algorithm optimizes 10 complexity measures for classification and 4 for regression tasks, using linear feature projections on synthetic datasets to achieve target complexity values.
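
The sketch below illustrates the general idea of steering a complexity measure toward a target value via linear feature projections, using a single Fisher-discriminant-style measure and a naive hill-climbing search instead of the paper's many-objective genetic algorithm; the data, the measure, and all values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_f1(X, y):
    """Maximum Fisher discriminant ratio over features (one classic complexity measure)."""
    classes = np.unique(y)
    ratios = []
    for j in range(X.shape[1]):
        xj = X[:, j]
        means = [xj[y == c].mean() for c in classes]
        within = sum(((xj[y == c] - mu) ** 2).sum() for c, mu in zip(classes, means))
        between = sum((y == c).sum() * (mu - xj.mean()) ** 2 for c, mu in zip(classes, means))
        ratios.append(between / (within + 1e-12))
    return max(ratios)

# Synthetic two-class data with one informative feature, plus a target complexity.
X = rng.normal(size=(200, 5))
X[:100, 0] += 1.5
y = np.array([0] * 100 + [1] * 100)
target = 0.3

# Simple (1+1)-style search over linear feature projections toward the target;
# the paper instead uses a many-objective genetic algorithm over several measures.
P = np.eye(5)
best = abs(fisher_f1(X @ P, y) - target)
for _ in range(300):
    cand = P + 0.1 * rng.normal(size=P.shape)
    err = abs(fisher_f1(X @ cand, y) - target)
    if err < best:
        P, best = cand, err
print("achieved F1:", round(fisher_f1(X @ P, y), 3), "target:", target)
```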

Result: Algorithm successfully generates datasets with varying difficulty levels; experiments show correlation between generated data complexity and recognition quality of state-of-the-art classifiers/regressors.

Conclusion: Proposed method increases availability of diverse complexity datasets for reliable ML evaluation, with complexity-targeted generation showing meaningful correlation with model performance.

Abstract: The research community continues to seek increasingly more advanced synthetic data generators to reliably evaluate the strengths and limitations of machine learning methods. This work aims to increase the availability of datasets encompassing a diverse range of problem complexities by proposing a genetic algorithm that optimizes a set of problem complexity measures for classification and regression tasks towards specific targets. For classification, a set of 10 complexity measures was used, while for regression tasks, 4 measures demonstrating promising optimization capabilities were selected. Experiments confirmed that the proposed genetic algorithm can generate datasets with varying levels of difficulty by transforming synthetically created datasets to achieve target complexity values through linear feature projections. Evaluations involving state-of-the-art classifiers and regressors revealed a correlation between the complexity of the generated data and the recognition quality.

[728] A Comprehensive Evaluation on Quantization Techniques for Large Language Models

Yutong Liu, Cairong Zhao, Guosheng Hu

Main category: cs.LG

TL;DR: Comprehensive review and evaluation of post-training quantization methods for LLMs, analyzing pre-quantization transformations and error mitigation techniques under standardized conditions.

DetailsMotivation: Existing quantization methods are often evaluated under different settings, making fair comparisons difficult. There's a need for systematic analysis of connections between methods and comprehensive evaluation under standardized conditions to understand quantization techniques better.

Method: Decoupled quantization methods into two steps: pre-quantization transformation (to reduce outlier impact) and quantization error mitigation (to offset quantization errors). Conducted extensive experiments evaluating different settings including granularity, symmetry, and new FP4 data formats (MXFP4 and NVFP4).
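
As a rough illustration of the granularity setting only (not the paper's evaluation code), the sketch below quantizes a weight matrix with symmetric INT4 at several group sizes and reports the reconstruction error; the matrix size and injected outlier are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))
W[3, 17] = 12.0          # inject an outlier that inflates a shared scale

def quant_int4_symmetric(x, group):
    """Symmetric INT4 quantization with per-group scales along the last axis."""
    g = x.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0   # INT4 range: [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7)
    return (q * scale).reshape(x.shape)

for group in (256, 64, 16):                              # coarser -> finer granularity
    err = np.mean((W - quant_int4_symmetric(W, group)) ** 2)
    print(f"group size {group:4d}: MSE {err:.5f}")
```

Finer groups isolate the outlier and lower the error, at the cost of storing more scale factors, which mirrors the granularity trade-off the paper evaluates.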

Result: Optimized rotation and scaling yield best pre-quantization performance; combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone. Finer granularity improves performance but increases storage. Scaling-factor format and precision greatly affect FP4 performance, and rotation-based strategies effective for INT4 offer limited gains for MXFP4/NVFP4.

Conclusion: Systematic analysis reveals important insights about quantization methods, highlighting the need for further study of FP4 formats and motivating continued research in efficient model compression techniques for LLMs.

Abstract: For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is rapidly evolving. Though many papers report breakthrough results, they are often evaluated under different settings because a method typically contains multiple components. Analyzing connections among existing methods is important for deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under the same conditions for fair comparison. To our knowledge, such a fair and extensive investigation remains critically underexplored. To better understand connections, first, we decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces outlier impact by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments first demonstrate that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, we find that scaling-factor format and precision greatly affect FP4 performance, and that rotation-based strategies effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.

[729] A Trainable Optimizer

Ruiqi Wang, Diego Klabjan

Main category: cs.LG

TL;DR: Trainable Optimizer (TO) framework that jointly learns gradient estimators and model weights, with pseudo-linear TO matching SGD convergence while reducing variance, plus efficient variants for faster convergence than ADAM.

DetailsMotivation: Current optimization methods like ADAM rely on manually defined gradient estimators. The paper aims to develop trainable optimization strategies that can learn better gradient estimators jointly with model parameters, potentially improving convergence and efficiency.

Method: Proposes a framework for jointly training gradient estimators and model weights. Introduces pseudo-linear TO as a linear approximation of full gradient that matches SGD’s convergence rate while reducing variance. Also presents two simplified variants of pseudo-linear TO for improved computational efficiency.

Result: TO methods converge faster than benchmark algorithms like ADAM in both strongly convex and non-convex settings. The approach also shows effectiveness in fine-tuning an LLM. Pseudo-linear TO incurs negligible computational overhead with only minimal additional tensor multiplications.

Conclusion: Trainable optimizers offer a promising alternative to manually designed optimization algorithms, providing faster convergence with minimal computational overhead across various optimization settings including LLM fine-tuning.

Abstract: The concept of learning to optimize involves utilizing a trainable optimization strategy rather than relying on manually defined full gradient estimations such as ADAM. We present a framework that jointly trains the full gradient estimator and the trainable weights of the model. Specifically, we prove that pseudo-linear TO (Trainable Optimizer), a linear approximation of the full gradient, matches SGD’s convergence rate while effectively reducing variance. Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional tensor multiplications. To further improve computational efficiency, we introduce two simplified variants of Pseudo-linear TO. Experiments demonstrate that TO methods converge faster than benchmark algorithms (e.g., ADAM) in both strongly convex and non-convex settings, and fine tuning of an LLM.

[730] Delayed Momentum Aggregation: Communication-efficient Byzantine-robust Federated Learning with Partial Participation

Kaoru Otsuka, Yuki Takezawa, Makoto Yamada

Main category: cs.LG

TL;DR: DeMoA introduces delayed momentum aggregation for Byzantine-robust federated learning under partial participation, where cached momentum from non-sampled clients prevents Byzantine dominance.

DetailsMotivation: Existing Byzantine-robust federated learning methods assume full client participation, but in partial participation settings, Byzantine clients can dominate the sampled set and break existing methods.

Method: DeMoA uses delayed momentum aggregation where the server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients, ensuring Byzantine clients remain a minority from the server’s perspective.
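
A minimal sketch of the delayed-momentum-aggregation idea follows, with a coordinate-wise median as a stand-in robust aggregator; the client counts, Byzantine behavior, and toy objective are illustrative assumptions rather than the DeMoA algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clients, dim, sample_rate, beta = 20, 5, 0.1, 0.9

# Server-side cache of each client's last reported momentum.
cached = np.zeros((n_clients, dim))

def local_momentum(client, model, prev):
    """One client's momentum update; Byzantine clients send arbitrary vectors."""
    if client < 4:                                   # ~20% Byzantine clients (assumed)
        return rng.normal(scale=10.0, size=dim)
    grad = model + rng.normal(scale=0.1, size=dim)   # honest stochastic gradient of ||model||^2 / 2
    return beta * prev + (1 - beta) * grad

model = np.ones(dim)
for rnd in range(50):
    sampled = rng.choice(n_clients, size=max(1, int(sample_rate * n_clients)), replace=False)
    for c in sampled:                                # fresh momentum only from sampled clients
        cached[c] = local_momentum(c, model, cached[c])
    # Aggregate cached + fresh momenta robustly; non-sampled clients contribute stale entries,
    # so Byzantine clients stay a minority of the aggregated set.
    update = np.median(cached, axis=0)
    model -= 0.1 * update
print(model)
```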

Result: DeMoA achieves the best accuracy with 20% Byzantine ratio and only 10% partial participation rate, outperforming existing methods that fail under these conditions.

Conclusion: Delayed momentum aggregation principle enables Byzantine-robust federated learning under partial participation, with DeMoA demonstrating practical effectiveness where previous methods fail.

Abstract: Partial participation is essential for communication-efficient federated learning at scale, yet existing Byzantine-robust methods typically assume full client participation. In the partial participation setting, a majority of the sampled clients may be Byzantine; once Byzantine clients dominate, existing methods break down immediately. We introduce delayed momentum aggregation, a principle where the central server aggregates cached momentum from non-sampled clients along with fresh momentum from sampled clients. This principle ensures Byzantine clients remain a minority from the server’s perspective even when they dominate the sampled set. We instantiate this principle in our optimizer DeMoA. We analyze the convergence rate of DeMoA, showing that DeMoA is Byzantine-robust under partial participation. Experiments show that, with 20% Byzantine ratio and only 10% partial participation rate, DeMoA achieves the best accuracy even when existing methods fail empirically.

[731] MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model

Samuel Yoon, Jongwon Kim, Juyoung Ha, Young Myoung Ko

Main category: cs.LG

TL;DR: MOMEMTO enhances time series foundation models for anomaly detection with a patch-based memory module to prevent over-generalization, enabling multi-domain training and achieving state-of-the-art performance across 23 datasets.

DetailsMotivation: Reconstruction-based deep models for time series anomaly detection tend to over-generalize and reconstruct anomalies, while existing memory-based approaches have high training costs and aren't effectively integrated with time series foundation models.

Method: Proposes MOMEMTO, an improved TFM variant with patch-based memory module that captures normal patterns from multiple domains. Memory items are initialized from pre-trained encoder representations, organized into patch-level units, and updated via attention mechanism with multi-domain training strategy.
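
The sketch below illustrates an attention-based readout from a fixed set of memory items, the mechanism by which patch embeddings are re-expressed from stored normal prototypes; the dimensions and the anomaly score are illustrative assumptions, not the MOMEMTO implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mem, d = 16, 32                     # number of memory items, embedding dim (assumed)
memory = rng.normal(size=(n_mem, d))  # would be initialized from pre-trained encoder representations

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def memory_readout(patches):
    """Attention over memory items: each patch is reconstructed from stored normal prototypes."""
    attn = softmax(patches @ memory.T / np.sqrt(d))   # (n_patches, n_mem)
    return attn @ memory                              # reconstruction from memory

patches = rng.normal(size=(8, d))                     # patch embeddings from the encoder
recon = memory_readout(patches)
anomaly_score = np.linalg.norm(patches - recon, axis=1)  # large error -> likely anomalous patch
print(anomaly_score.round(2))
```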

Result: Achieves higher AUC and VUS scores across 23 univariate benchmark datasets compared to baselines, enhances backbone TFM performance, and shows particular strength in few-shot learning scenarios.

Conclusion: MOMEMTO effectively addresses over-generalization in time series anomaly detection through memory-augmented foundation models, enabling efficient multi-domain learning and improved performance.

Abstract: Recently reconstruction-based deep models have been widely used for time series anomaly detection, but as their capacity and generalization capability increase, these models tend to over-generalize, often reconstructing unseen anomalies accurately. Prior works have attempted to mitigate this by incorporating a memory architecture that stores prototypes of normal patterns. Nevertheless, these approaches suffer from high training costs and have yet to be effectively integrated with time series foundation models (TFMs). To address these challenges, we propose MOMEMTO, an improved variant of TFM for anomaly detection, enhanced with a patch-based memory module to mitigate over-generalization. The memory module is designed to capture representative normal patterns from multiple domains and enables a single model to be jointly fine-tuned across multiple datasets through a multi-domain training strategy. MOMEMTO initializes memory items with latent representations from a pre-trained encoder, organizes them into patch-level units, and updates them via an attention mechanism. We evaluate our method using 23 univariate benchmark datasets. Experimental results demonstrate that MOMEMTO, as a single model, achieves higher scores on AUC and VUS metrics compared to baseline methods, and further enhances the performance of its backbone TFM, particularly in few-shot learning scenarios.

[732] Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin

Main category: cs.LG

TL;DR: The paper introduces Effective Span Dimension (ESD), a new complexity measure for kernel learning that depends on signal, spectrum, and noise, showing it governs minimax excess risk scaling and can be reduced via gradient flow for improved generalization.

DetailsMotivation: Traditional kernel methods assume fixed kernels, but modern machine learning often learns kernels from data. Existing theories don't adequately capture how kernel learning affects generalization, especially in over-parameterized regimes where feature learning occurs.

Method: Introduces Effective Span Dimension (ESD) as an alignment-sensitive complexity measure. Analyzes spectral algorithms in sequence models, proves minimax excess risk scales as σ²K where K is ESD. Studies over-parameterized gradient flow showing it can reduce ESD. Extends framework to linear models and RKHS regression with numerical validation.

Result: ESD provides a unified framework for analyzing kernel learning. For sequence models with ESD ≤ K, minimax excess risk scales as σ²K. Gradient flow in over-parameterized settings can reduce ESD, connecting feature learning to improved generalization. Framework applies broadly beyond traditional fixed-kernel theories.

Conclusion: ESD offers a novel perspective on generalization in kernel learning, bridging traditional spectral analysis with modern feature learning. It explains how adaptive kernel learning can provably improve generalization through ESD reduction, providing theoretical foundation for empirical success of learned kernels.

Abstract: We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $\sigma^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $\sigma^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.

[733] Deterministic Discrete Denoising

Hideyuki Suzuki, Wataru Kurebayashi, Hiroshi Yamashita

Main category: cs.LG

TL;DR: A deterministic denoising algorithm for discrete-state diffusion models using herding algorithm to replace stochastic reverse processes, improving efficiency and sample quality without retraining.

DetailsMotivation: Current discrete diffusion models rely on stochastic denoising processes. The authors aim to develop a deterministic alternative that can improve efficiency and sample quality while maintaining the same model architecture without requiring retraining.

Method: The method introduces a variant of the herding algorithm to derandomize the generative reverse Markov chain, inducing deterministic state transitions driven by weakly chaotic dynamics. This serves as a direct replacement for the stochastic denoising process without needing continuous state embeddings or model retraining.
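
A minimal, standalone sketch of the herding idea for a single categorical distribution follows; the paper applies a variant of this inside the reverse denoising chain, which is not reproduced here.

```python
import numpy as np

def herded_samples(p, n):
    """Deterministic 'herding' draws from a categorical distribution p:
    empirical frequencies track p without any randomness."""
    w = np.zeros_like(p, dtype=float)
    out = []
    for _ in range(n):
        w += p                  # accumulate probability mass
        s = int(np.argmax(w))   # pick the most 'owed' state
        w[s] -= 1.0             # pay off the selected state
        out.append(s)
    return out

p = np.array([0.5, 0.3, 0.2])
samples = herded_samples(p, 20)
print(samples)
print("empirical freq:", np.bincount(samples, minlength=3) / len(samples))
```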

Result: The algorithm demonstrates consistent improvements in both efficiency and sample quality on text and image generation tasks. It also yields improved solutions for diffusion-based combinatorial optimization problems.

Conclusion: Herding-based denoising is a simple yet promising approach for enhancing the generative process of discrete diffusion models. The results show that deterministic reverse processes, well-established in continuous diffusion, can also be effective in discrete state spaces.

Abstract: We propose a deterministic denoising algorithm for discrete-state diffusion models. The key idea is to derandomize the generative reverse Markov chain by introducing a variant of the herding algorithm, which induces deterministic state transitions driven by weakly chaotic dynamics. It serves as a direct replacement for the stochastic denoising process, without requiring retraining or continuous state embeddings. We demonstrate consistent improvements in both efficiency and sample quality on text and image generation tasks. In addition, the proposed algorithm yields improved solutions for diffusion-based combinatorial optimization. Thus, herding-based denoising is a simple yet promising approach for enhancing the generative process of discrete diffusion models. Furthermore, our results reveal that deterministic reverse processes, well established in continuous diffusion, can also be effective in discrete state spaces.

[734] Machine Learning. The Science of Selection under Uncertainty

Yevgeny Seldin

Main category: cs.LG

TL;DR: Statistical learning theory book covering concentration inequalities, generalization bounds for offline supervised learning (including PAC-Bayesian analysis), and regret bounds for online learning in various environments.

DetailsMotivation: To provide comprehensive statistical tools for analyzing learning as a selection process under uncertainty, addressing both offline and online learning scenarios with theoretical guarantees.

Method: Presents concentration inequalities (Markov, Chebyshev, Hoeffding, Bernstein, etc.), generalization bounds via Occam’s razor, VC analysis, PAC-Bayesian analysis, and regret analysis for online learning in stochastic/adversarial environments with full/bandit feedback.
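
As a small worked example of the kind of tool covered (not taken from the book), the sketch below computes a one-sided Hoeffding confidence bound on the expected loss of a candidate predictor from bounded empirical losses.

```python
import numpy as np

def hoeffding_upper_bound(values, delta, lo=0.0, hi=1.0):
    """One-sided Hoeffding bound: with prob. >= 1 - delta, the true mean is below this value."""
    n = len(values)
    return np.mean(values) + (hi - lo) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

rng = np.random.default_rng(0)
losses = rng.random(500)           # empirical 0-1 losses of a candidate predictor (toy data)
print("empirical mean:", losses.mean())
print("Hoeffding bound (delta=0.05):", hoeffding_upper_bound(losses, 0.05))
```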

Result: A unified statistical framework for deriving theoretical guarantees on learning outcomes, covering both empirical risk minimization in offline settings and regret minimization in online settings.

Conclusion: The book provides essential statistical tools for understanding and bounding the uncertainty inherent in learning processes, applicable across various learning paradigms from classical supervised learning to online learning.

Abstract: Learning, whether natural or artificial, is a process of selection. It starts with a set of candidate options and selects the more successful ones. In the case of machine learning the selection is done based on empirical estimates of prediction accuracy of candidate prediction rules on some data. Due to randomness of data sampling the empirical estimates are inherently noisy, leading to selection under uncertainty. The book provides statistical tools to obtain theoretical guarantees on the outcome of selection under uncertainty. We start with concentration of measure inequalities, which are the main statistical instrument for controlling how much an empirical estimate of expectation of a function deviates from the true expectation. The book covers a broad range of inequalities, including Markov’s, Chebyshev’s, Hoeffding’s, Bernstein’s, Empirical Bernstein’s, Unexpected Bernstein’s, kl, and split-kl. We then study the classical (offline) supervised learning and provide a range of tools for deriving generalization bounds, including Occam’s razor, Vapnik-Chervonenkis analysis, and PAC-Bayesian analysis. The latter is further applied to derive generalization guarantees for weighted majority votes. After covering the offline setting, we turn our attention to online learning. We present the space of online learning problems characterized by environmental feedback, environmental resistance, and structural complexity. A common performance measure in online learning is regret, which compares performance of an algorithm to performance of the best prediction rule in hindsight, out of a restricted set of prediction rules. We present tools for deriving regret bounds in stochastic and adversarial environments, and under full information and bandit feedback.

[735] SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights

Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli

Main category: cs.LG

TL;DR: SINQ introduces a second-axis scale factor and Sinkhorn-Knopp algorithm for post-training quantization of LLMs, improving low-bit quantization by normalizing per-row and per-column variances to better handle outliers.

DetailsMotivation: Current post-training quantization methods suffer from perplexity degradation at ≤4 bits because outliers force coarse scales onto the parameters that share a scale with them, a problem that is especially pronounced for calibration-free uniform quantization methods.

Method: SINQ augments existing post-training quantizers with an additional second-axis scale factor and uses a fast Sinkhorn-Knopp-style algorithm to find scales that normalize per-row and per-column variances, approximating activation-aware quantization by recovering column scales from weight matrix structure.
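
The sketch below gives a rough numerical illustration of dual-axis scaling via alternating row/column variance normalization followed by INT4 quantization; it is not the SINQ algorithm, and the matrix size, iteration count, and injected outlier are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
W[2, 5] = 25.0                        # outlier that would dominate a single shared scale

# Sinkhorn-Knopp-style alternating normalization of row and column standard deviations.
row = np.ones((32, 1))
col = np.ones((1, 64))
for _ in range(10):
    A = W / (row * col)
    row *= A.std(axis=1, keepdims=True)
    A = W / (row * col)
    col *= A.std(axis=0, keepdims=True)

def quant_int4(x):
    s = np.abs(x).max() / 7.0
    return np.clip(np.round(x / s), -8, 7) * s

W_hat_plain = quant_int4(W)
W_hat_dual = quant_int4(W / (row * col)) * (row * col)  # quantize normalized weights, rescale back
print("per-tensor INT4 MSE:", np.mean((W - W_hat_plain) ** 2))
print("dual-scaled INT4 MSE:", np.mean((W - W_hat_dual) ** 2))
```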

Result: SINQ reduces the perplexity gap on WikiText2 and C4 by over 50% against uncalibrated uniform quantization baselines, incurs zero to negligible compute overhead, and can be combined with calibration and non-uniform quantization levels.

Conclusion: SINQ provides an effective, architecture-agnostic solution for low-bit quantization of LLMs that can be trivially applied to any linear layer without inter-layer interactions, significantly improving quantization quality with minimal overhead.

Abstract: Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances. We show that this approximates activation-aware quantization by recovering column scales from the weight matrix structure that are predictive of the typical activation magnitudes the matrix received during training. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layer. We evaluate our method on the Qwen3 model family, among others. SINQ reduces the perplexity gap on WikiText2 and C4 by over 50% against uncalibrated uniform quantization baselines, incurs zero to negligible compute overhead, and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code is available at https://github.com/huawei-csl/SINQ.

[736] Bridging On-Device and Cloud LLMs for Collaborative Reasoning: A Unified Methodology for Local Routing and Post-Training

Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Evan Chen, Christopher Brinton

Main category: cs.LG

TL;DR: On-device LLMs learn to decide when to offload queries to cloud models via reinforcement learning with hierarchical rewards, improving reasoning performance while maintaining efficiency.

DetailsMotivation: Current device-cloud collaboration approaches struggle with determining query difficulty from prompts alone, especially for complex reasoning tasks, leading to inefficient routing decisions.

Method: Reinforcement learning-based post-training with hierarchical rewards that encourage local problem solving and judicious cloud offloading, using group-level policy gradient for stable optimization and adaptive prompt filtering to prevent policy collapse.
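
A toy version of a hierarchical offloading reward is sketched below to illustrate the trade-off being optimized; the specific values, the offload cost, and the reward structure are assumptions, not the paper's reward design.

```python
def hierarchical_reward(solved_locally: bool, offloaded: bool, solved_by_cloud: bool,
                        offload_cost: float = 0.3) -> float:
    """Toy hierarchical reward: prefer correct local answers, then correct cloud answers
    minus an offloading cost, and penalize wrong answers (values are illustrative)."""
    if not offloaded:
        return 1.0 if solved_locally else -1.0
    return (1.0 - offload_cost) if solved_by_cloud else (-1.0 - offload_cost)

# The on-device policy emits either an answer or an explicit "offload" action;
# rewards like the above would then feed group-relative policy-gradient updates.
print(hierarchical_reward(True, False, False))   #  1.0  solved on device
print(hierarchical_reward(False, True, True))    #  0.7  escalated and solved in the cloud
```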

Result: Extensive experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show consistent outperformance over baselines and significant narrowing of the gap to full cloud LLMs.

Conclusion: Enabling on-device LLMs to make internal offloading decisions through reinforcement learning post-training effectively balances efficiency and reasoning capability in device-cloud collaboration settings.

Abstract: Device-cloud collaboration holds promise for deploying large language models (LLMs), leveraging lightweight on-device models for efficiency while relying on powerful cloud models for superior reasoning. A central challenge in this setting is determining, for each incoming query, whether it should be processed locally or offloaded to the cloud. Existing approaches typically rely on external routers, which often struggle to determine difficulty from the prompt itself, especially for tasks involving complex reasoning. Motivated by this limitation, we propose enabling on-device LLMs to decide internally whether to invoke cloud assistance at inference time, with this capability instilled through reinforcement learning based post-training. Casting on-device LLM post-training as a reward maximization problem, we design hierarchical rewards to encourage local problem solving and judicious cloud offloading. To solve the resulting problem, we develop an algorithm featuring a group-level policy gradient that stabilizes optimization, together with adaptive prompt filtering that provides complementary learning signals to mitigate policy collapse (i.e., exclusive local execution or exclusive cloud offloading). Extensive experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show that our method consistently outperforms baselines and significantly narrows the gap to full cloud LLMs.

[737] FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing

Ran Elbaz, Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, Haggai Maron

Main category: cs.LG

TL;DR: FS-KAN introduces permutation-equivariant Kolmogorov-Arnold Networks using function sharing schemes, achieving better data efficiency than standard parameter-sharing layers while maintaining KANs’ interpretability.

DetailsMotivation: While equivariant KANs have been explored for specific data types, there's no principled framework for applying them to data with permutation symmetries in general contexts. The paper aims to create a unified approach for constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups.

Method: FS-KAN generalizes parameter-sharing schemes to the Kolmogorov-Arnold setup, creating equivariant and invariant KA layers for arbitrary permutation symmetry groups. The approach unifies and extends previous work by applying function sharing principles to KAN architectures.

Result: FS-KANs demonstrate superior data efficiency compared to standard parameter-sharing layers, often by wide margins, while preserving KANs’ interpretability and adaptability. Theoretical analysis shows FS-KANs have the same expressive power as networks using standard parameter-sharing layers.

Conclusion: FS-KAN provides a principled framework for building permutation-equivariant KANs, making them an excellent architecture choice for low-data regimes due to their data efficiency, interpretability, and adaptability.

Abstract: Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.

[738] Black-Box Combinatorial Optimization with Order-Invariant Reinforcement Learning

Olivier Goudet, Quentin Suire, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier

Main category: cs.LG

TL;DR: Order-invariant RL framework for black-box combinatorial optimization using multivariate autoregressive models trained with random generation orders to achieve variable order invariance and improve search diversity.

DetailsMotivation: Classical estimation-of-distribution algorithms for combinatorial optimization rely on learning explicit variable dependency graphs, which can be computationally expensive and may fail to capture complex interactions efficiently. There's a need for more flexible and efficient approaches that don't depend on fixed variable orderings.

Method: Parameterizes a multivariate autoregressive generative model trained without fixed variable ordering. Uses random generation orders during training as a form of information-preserving dropout to encourage order invariance. Adapts Group Relative Policy Optimization (GRPO) for stable policy-gradient updates from scale-invariant advantages.

Result: Across a wide range of benchmark algorithms and problem instances of varying sizes, the method frequently achieves the best performance and consistently avoids catastrophic failures.

Conclusion: The order-invariant reinforcement learning framework provides an effective approach for black-box combinatorial optimization that promotes search-space diversity, focuses on relevant variable dependencies, and improves sample efficiency compared to traditional methods.

Abstract: We introduce an order-invariant reinforcement learning framework for black-box combinatorial optimization. Classical estimation-of-distribution algorithms (EDAs) often rely on learning explicit variable dependency graphs, which can be costly and fail to capture complex interactions efficiently. In contrast, we parameterize a multivariate autoregressive generative model trained without a fixed variable ordering. By sampling random generation orders during training, a form of information-preserving dropout, the model is encouraged to be invariant to variable order, promoting search-space diversity, and shaping the model to focus on the most relevant variable dependencies, improving sample efficiency. We adapt Group Relative Policy Optimization (GRPO) to this setting, providing stable policy-gradient updates from scale-invariant advantages. Across a wide range of benchmark algorithms and problem instances of varying sizes, our method frequently achieves the best performance and consistently avoids catastrophic failures.

[739] Fine-Tuning Flow Matching via Maximum Likelihood Estimation of Reconstructions

Zhaoyi Li, Jingtao Ding, Yong Li, Shihua Li

Main category: cs.LG

TL;DR: A method to fine-tune Flow Matching models via Maximum Likelihood Estimation of reconstructions, addressing the train-inference gap and straight flow limitations through theoretical analysis and practical framework with artificial viscosity for enhanced stability.

DetailsMotivation: Flow Matching models have a train-inference gap where model outputs cannot be assessed during training, and the straight flow assumption has inherent limitations that need to be addressed for better generative performance.

Method: Proposes fine-tuning FM via Maximum Likelihood Estimation of reconstructions enabled by FM’s smooth ODE formulation, with theoretical analysis of training loss vs inference error, and incorporates generalized artificial viscosity term for enhanced flow stability and robustness.

Result: Experiments show effectiveness across diverse settings: toy example provides mechanistic insights, while large-scale evaluations on meteorological forecasting and robotic manipulation policies validate reliable performance improvements.

Conclusion: The proposed MLE-based fine-tuning framework with artificial viscosity addresses FM limitations, improves stability and performance, and shows practical effectiveness in real-world applications.

Abstract: Flow Matching (FM) models achieve remarkable results in generative tasks. Building upon diffusion models, FM’s simulation-free training paradigm enables simplicity and efficiency but introduces a train-inference gap: model outputs cannot be assessed during training. Moreover, the straight flow assumption suffers from some inherent limitations. To address this, we propose to fine-tune FM via Maximum Likelihood Estimation (MLE) of reconstructions – enabled by FM’s smooth ODE formulation, unlike the stochastic differential equations (SDEs) in diffusion models. We first theoretically analyze the relationship between training loss and inference error in FM under numerical precision constraints. We then propose an easy-to-implement fine-tuning framework based on MLE of reconstructions, with flexibility for sophisticated extensions. Building on this, we incorporate a generalized artificial viscosity term that enhances flow stability and robustness, accompanied by a direct parameterization method and rigorous theoretical guarantees. Experiments demonstrate our method’s effectiveness across diverse settings: a toy example provides mechanistic insights into the fine-tuning process, while large-scale evaluations on meteorological forecasting and robotic manipulation policies validate reliable performance improvements.

[740] DPMM-CFL: Clustered Federated Learning via Dirichlet Process Mixture Model Nonparametric Clustering

Mariona Jaramillo-Civill, Peng Wu, Pau Closas

Main category: cs.LG

TL;DR: DPMM-CFL: A nonparametric Bayesian approach to clustered federated learning that automatically infers the number of clusters using Dirichlet Process priors, eliminating the need to pre-specify cluster count.

DetailsMotivation: Most clustered federated learning methods require pre-specifying the number of clusters K, which is impractical when the latent client structure is unknown. There's a need for methods that can automatically determine the appropriate number of clusters while handling non-IID data distributions.

Method: Proposes DPMM-CFL which places a Dirichlet Process prior over cluster parameters, enabling nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments. The method couples federated updates with cluster inferences at each round, optimizing per-cluster federated objectives.

Result: Validated on benchmark datasets under Dirichlet and class-split non-IID partitions, showing effective automatic cluster discovery while maintaining federated learning performance.

Conclusion: DPMM-CFL provides a practical solution to clustered federated learning by eliminating the need to pre-specify cluster count, making it more applicable to real-world scenarios with unknown client heterogeneity structures.

Abstract: Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby balancing between a global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments, while optimizing per-cluster federated objectives. This results in a method where, at each round, federated updates and cluster inferences are coupled, as presented in this paper. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.

[741] Geodesic Calculus on Implicitly Defined Latent Manifolds

Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, Benedikt Wirth

Main category: cs.LG

TL;DR: A framework for geometric analysis of autoencoder latent manifolds using implicit representations and discrete Riemannian calculus for computing geodesics and exponential maps.

DetailsMotivation: Autoencoder latent manifolds provide low-dimensional data representations, but practitioners lack principled tools for geometric analysis on these learned manifolds, such as computing geodesics and other Riemannian operations.

Method: Proposes representing latent manifolds as implicit submanifolds, develops discrete Riemannian calculus tools robust to representation inaccuracies, and learns approximate projection via denoising objective independent of underlying autoencoder.

Result: Enables computation of geodesic paths connecting endpoints and shooting geodesics via Riemannian exponential maps on latent manifolds, evaluated on various autoencoders with synthetic and real data.

Conclusion: Provides a general framework for geometric analysis of autoencoder latent spaces supporting different Riemannian geometries and enabling fundamental geometric operations like geodesic computation.

Abstract: Latent manifolds of autoencoders provide low-dimensional representations of data, which can be studied from a geometric perspective. We propose to describe these latent manifolds as implicit submanifolds of some ambient latent space. Based on this, we develop tools for a discrete Riemannian calculus approximating classical geometric operators. These tools are robust against inaccuracies of the implicit representation often occurring in practical examples. To obtain a suitable implicit representation, we propose to learn an approximate projection onto the latent manifold by minimizing a denoising objective. This approach is independent of the underlying autoencoder and supports the use of different Riemannian geometries on the latent manifolds. The framework in particular enables the computation of geodesic paths connecting given end points and shooting geodesics via the Riemannian exponential maps on latent manifolds. We evaluate our approach on various autoencoders trained on synthetic and real data.

[742] Neural Weight Compression for Language Models

Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

Main category: cs.LG

TL;DR: Neural Weight Compression (NWC) learns neural codecs for compressing language model weights, achieving SOTA compression-accuracy tradeoffs without handcrafted transforms.

DetailsMotivation: As language models scale, efficient weight storage/transmission becomes critical. Existing compression methods rely on handcrafted transforms with limited understanding of weights as a data modality, motivating a shift to learning-based compression schemes optimized directly from data.

Method: Formulates weight compression as neural codec learning. Proposes NWC framework with chunk-and-normalize preprocessing to handle tensor shape heterogeneity and importance-aware training objective to align training losses with downstream performance. Uses entropy-constrained quantization and learned transforms.
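
The sketch below illustrates a chunk-and-normalize style preprocessing step that turns weight tensors of heterogeneous shapes into fixed-size, normalized chunks that a single codec could consume; the chunk size, padding, and statistics are assumptions, not the NWC implementation.

```python
import numpy as np

def chunk_and_normalize(weight, chunk_size=256):
    """Flatten a weight tensor of any shape into fixed-size chunks and normalize each chunk."""
    flat = weight.reshape(-1)
    pad = (-flat.size) % chunk_size
    flat = np.concatenate([flat, np.zeros(pad)])
    chunks = flat.reshape(-1, chunk_size)
    mean = chunks.mean(axis=1, keepdims=True)
    std = chunks.std(axis=1, keepdims=True) + 1e-8
    return (chunks - mean) / std, (mean, std, weight.shape, pad)

def dechunk(chunks, meta):
    """Invert the preprocessing so a decoded tensor recovers the original layer shape."""
    mean, std, shape, pad = meta
    flat = (chunks * std + mean).reshape(-1)
    return flat[:flat.size - pad].reshape(shape)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(300, 128))        # a layer with an arbitrary shape
normed, meta = chunk_and_normalize(W)
print(np.allclose(dechunk(normed, meta), W))       # round-trips before any actual coding
```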

Result: Achieves state-of-the-art accuracy-compression tradeoffs, particularly in the 4–6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. Gains extend across diverse architectures, including vision encoders.

Conclusion: Learning-based weight compression paradigm is effective, with entropy-constrained quantization important for high rate regime and learned transforms adapting to downstream tasks. NWC demonstrates superior performance over traditional handcrafted approaches.

Abstract: Efficient storage and transmission of language model weights are increasingly critical as model scale and deployment grow. Yet, most existing compression methods rely on handcrafted transforms and heuristics, reflecting the limited understanding of weights as a data modality. This motivates a shift toward learning-based paradigm, where compression schemes are optimized directly from data rather than manually designed. In this work, we take a step in this direction by formulating weight compression as a neural codec learning. We propose Neural Weight Compression (NWC), a flexible framework for training neural codecs on pretrained weight datasets. NWC addresses challenges intrinsic to weight compression, such as tensor shape heterogeneity and the misalignment between training losses and downstream performance, through components such as chunk-and-normalize preprocessing and an importance-aware training objective. Experiments show that NWC achieves state-of-the-art accuracy-compression tradeoffs, particularly at 4–6 bit regime, without relying on rigid handcrafted components such as the Hadamard transform. These gains extend across diverse architectures, e.g., vision encoders. Our analysis further supports the learning-based perspective, highlighting the roles of entropy-constrained quantization in high rate regime and learned transforms in adapting to downstream tasks.

[743] Policy Learning with Abstention

Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis

Main category: cs.LG

TL;DR: Policy learning with abstention allows deferring to safe defaults when uncertain, with O(1/n) regret guarantees and applications to margin conditions, distributional robustness, and safe policy improvement.

DetailsMotivation: Current policy learning algorithms force decisions even when predictions are uncertain, which is risky in high-stakes settings like personalized medicine and advertising. There's a need for methods that can abstain from decisions when uncertain, deferring to safe defaults or experts instead.

Method: Proposes a two-stage learner: first identifies a set of near-optimal policies, then constructs an abstention rule based on their disagreements. Uses doubly robust (DR) objective for unknown propensities. Abstention is implemented with small additive reward when deferring to random guess.
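
As a rough illustration of the two-stage construction (not the paper's estimator), the sketch below abstains whenever a set of near-optimal policies disagree on the action for a given covariate vector; the policies and the decision rule are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (assumed given): a small set of near-optimal policies, here linear scorers
# that map covariates to a treat / don't-treat decision.
policies = [rng.normal(size=3) for _ in range(5)]

def decide(theta, x):
    return int(x @ theta > 0)

def policy_with_abstention(x):
    """Stage 2: act only when all near-optimal policies agree; otherwise abstain
    and defer to a safe default or expert (illustrative construction)."""
    votes = {decide(theta, x) for theta in policies}
    if len(votes) == 1:
        return votes.pop()        # unanimous -> commit to the action
    return "abstain"              # disagreement -> defer

X = rng.normal(size=(5, 3))
print([policy_with_abstention(x) for x in X])
```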

Result: Establishes fast O(1/n)-type regret guarantees when propensities are known, extends to unknown-propensity case via DR objective. Shows abstention enables improved guarantees under margin conditions without realizability assumption, connects to distributionally robust policy learning, and supports safe policy improvement with high probability guarantees.

Conclusion: Abstention is a versatile tool for policy learning that addresses uncertainty in high-stakes decisions, provides theoretical guarantees, and connects to multiple important problems including margin conditions, distributional robustness, and safe policy improvement.

Abstract: Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.

[744] Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

Noah Oberweis, Semih Cayci

Main category: cs.LG

TL;DR: Non-asymptotic convergence analysis of Stochastic Gradient Langevin Dynamics (SGLD) in lazy training regime, showing exponential convergence to empirical risk minimizer with finite-time/finite-width bounds.

DetailsMotivation: Continuous-time models provide insights into training dynamics of deep learning optimization algorithms. Need to understand convergence properties of SGLD, which is an SDE approximation of SGD, particularly in the lazy training regime.

Method: Theoretical analysis of SGLD with multiplicative and state-dependent noise under regularity conditions on Hessian of loss function. Shows SGLD yields non-degenerate kernel throughout training with high probability and achieves exponential convergence to empirical risk minimizer.
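
For orientation, the sketch below runs a basic Euler–Maruyama discretization of SGLD on a toy least-squares problem with isotropic noise; the paper's analysis concerns multiplicative, state-dependent noise in the lazy training regime, which this simplified example does not capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: minimize ||A theta - b||^2 with SGLD.
A = rng.normal(size=(100, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def stochastic_grad(theta, batch=10):
    idx = rng.choice(len(b), size=batch, replace=False)
    return 2 * A[idx].T @ (A[idx] @ theta - b[idx]) / batch

theta = np.zeros(3)
eta, temperature = 1e-3, 1e-4
for _ in range(5000):
    theta -= eta * stochastic_grad(theta)                          # SGD drift term
    theta += np.sqrt(2 * eta * temperature) * rng.normal(size=3)   # Langevin diffusion term
print(theta)   # close to the empirical risk minimizer for small temperature
```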

Result: Established finite-time and finite-width bounds on optimality gap. Theoretical findings corroborated with numerical examples in regression setting.

Conclusion: SGLD in lazy training regime provides strong convergence guarantees to empirical risk minimizer with explicit bounds, offering theoretical insights into optimization dynamics of deep learning algorithms.

Abstract: Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.

[745] An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

Sukanya Patra, Souhaib Ben Taieb

Main category: cs.LG

TL;DR: EPHAD is a test-time adaptation framework that improves anomaly detection models trained on contaminated datasets by updating their outputs using evidence from multimodal foundation models like CLIP or classical AD methods.

DetailsMotivation: Real-world anomaly detection datasets often contain undetected or mislabeled anomalies, causing performance degradation. Existing solutions require access to training pipelines, data, or prior knowledge of anomaly proportions, limiting real-world applicability.

Method: EPHAD integrates prior knowledge from AD models trained on contaminated datasets with evidence from multimodal foundation models (like CLIP), classical AD methods (like Local Outlier Factor), or domain-specific knowledge through a test-time adaptation framework.

Result: Validated across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Shows effectiveness and robustness across diverse AD models and evidence pairs, with ablation studies analyzing hyperparameter influence and contamination levels.

Conclusion: EPHAD provides a versatile and robust framework for improving anomaly detection in real-world contaminated datasets without requiring access to training pipelines or prior knowledge of anomaly proportions.

Abstract: Unsupervised anomaly detection (AD) methods typically assume clean training data, yet real-world datasets often contain undetected or mislabeled anomalies, leading to significant performance degradation. Existing solutions require access to the training pipelines, data or prior knowledge of the proportions of anomalies in the data, limiting their real-world applicability. To address this challenge, we propose EPHAD, a simple yet effective test-time adaptation framework that updates the outputs of AD models trained on contaminated datasets using evidence gathered at test time. Our approach integrates the prior knowledge captured by the AD model trained on contaminated datasets with evidence derived from multimodal foundation models like Contrastive Language-Image Pre-training (CLIP), classical AD methods like the Local Outlier Factor or domain-specific knowledge. We illustrate the intuition behind EPHAD using a synthetic toy example and validate its effectiveness through comprehensive experiments across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Additionally, we conduct an ablation study to analyse hyperparameter influence and robustness to varying contamination levels, demonstrating the versatility and robustness of EPHAD across diverse AD models and evidence pairs. To ensure reproducibility, our code is publicly available at https://github.com/sukanyapatra1997/EPHAD.

[746] On Uncertainty Calibration for Equivariant Functions

Edward Berman, Jacob Ginesin, Marco Pacini, Robin Walters

Main category: cs.LG

TL;DR: Theoretical analysis of how equivariance affects uncertainty estimation and model calibration in deep learning, with bounds on calibration errors under different equivariance conditions.

DetailsMotivation: Data-sparse domains like robotic manipulation and molecular physics are challenging for deep learning. While equivariant networks can improve modeling and uncertainty estimation can prevent overconfidence, the relationship between equivariance and model calibration hasn't been studied. The paper aims to understand how symmetry properties affect uncertainty estimation.

Method: Develops a theoretical framework proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions. Uses mathematical analysis to relate equivariance to calibration error definitions. Complements theory with numerical experiments on real and simulated datasets to examine symmetry mismatch, group size, and different uncertainty types.
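
A minimal sketch of the standard binned expected calibration error (ECE), one of the quantities bounded in the paper, is given below; the synthetic overconfident model is an illustrative assumption.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted gap between mean confidence and accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.random(1000) < conf * 0.9).astype(float)   # systematically overconfident model
print("ECE:", round(expected_calibration_error(conf, correct), 3))
```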

Result: Theoretical bounds show how equivariance affects generalization limits and how symmetry mismatch leads to miscalibration in both classification and regression. Experimental results clarify relationships between equivariance and uncertainty, revealing trends with symmetry properties and uncertainty types.

Conclusion: Establishes foundational theory connecting equivariance to uncertainty estimation, showing that symmetry properties significantly impact model calibration. Provides insights into when equivariant models are well-calibrated and how symmetry mismatches affect uncertainty reliability.

Abstract: Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationships between equivariance and model confidence, and more generally equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.

[747] $\pi_\texttt{RL}$: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Xiang Li, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu

Main category: cs.LG

TL;DR: π_RL enables reinforcement learning for flow-based Vision-Language-Action models using Flow-Noise and Flow-SDE approaches to overcome intractable action log-likelihoods from flow matching.

DetailsMotivation: Applying RL to large-scale flow-based VLA models is challenging due to intractable action log-likelihoods from flow matching, which hinders automated data collection for scaling supervised fine-tuning.

Method: Two technical approaches: (1) Flow-Noise models denoising as discrete-time MDP with learnable noise network for exact log-likelihood computation; (2) Flow-SDE integrates denoising with agent-environment interaction using ODE-to-SDE conversion for efficient RL exploration.

Result: π_RL demonstrates significant performance improvements across various benchmarks in both in-distribution and out-of-distribution settings.

Conclusion: The proposed methods successfully enable RL training for flow-based VLA models, overcoming previous limitations and improving robotic task performance.

Abstract: Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying RL to large-scale flow-based VLAs (e.g., $\pi_0$, $\pi_{0.5}$) remains challenging due to intractable action log-likelihoods raised from flow matching. We address this challenge with $\pi_{\texttt{RL}}$, featuring two technical approaches: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate $\pi_{\texttt{RL}}$ across various benchmarks, with experiments demonstrating that RL yields significant performance improvements in both in-distribution and out-of-distribution settings.

[748] A Likely Geometry of Generative Models

Frederik Möbius Rygaard, Shen Zhu, Yinzhu Jin, Søren Hauberg, Tom Fletcher

Main category: cs.LG

TL;DR: A geometric framework for analyzing generative models without retraining, constructing shortest paths through high-density regions using a pseudo-metric derived from Newtonian dynamics on Riemannian manifolds.

DetailsMotivation: Most generative models lack a principled geometric framework for interpolation and model inspection without restrictive assumptions. The paper aims to construct a general geometry compatible with different metrics and probability distributions to analyze pre-trained generative models.

Method: Constructs a pseudo-metric based on curves constrained to target high-density regions learned by generative models. Formulates this as a Newtonian system on a Riemannian manifold, showing shortest paths correspond to geodesics under a suitable Riemannian metric. Derives an algorithm to compute shortest paths and generalized Fréchet means.

Result: The proposed metric produces curves that traverse regions of higher density than baselines across various models and datasets. The algorithm efficiently computes shortest paths and generalized Fréchet means.

Conclusion: Provides a principled geometric framework for analyzing generative models without additional training, enabling better interpolation and model inspection through high-density regions.

Abstract: The geometry of generative models serves as the basis for interpolation, model inspection, and more. Unfortunately, most generative models lack a principled notion of geometry without restrictive assumptions on either the model or the data dimension. In this paper, we construct a general geometry compatible with different metrics and probability distributions to analyze generative models that do not require additional training. We consider curves analogous to geodesics constrained to a suitable data distribution aimed at targeting high-density regions learned by generative models. We formulate this as a (pseudo)-metric and prove that this corresponds to a Newtonian system on a Riemannian manifold. We show that shortest paths in our framework can be characterized by a system of ordinary differential equations, which locally corresponds to geodesics under a suitable Riemannian metric. Numerically, we derive a novel algorithm to efficiently compute shortest paths and generalized Fréchet means. Quantitatively, we show that curves using our metric traverse regions of higher density than baselines across a range of models and datasets.
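
To make the idea of curves biased toward high-density regions tangible, here is a toy discretization (not the paper's algorithm) that optimizes the interior points of a curve to balance path length against a density penalty; `log_density` is a placeholder for whatever log-density estimate the generative model exposes.

```python
# Toy sketch (assumption): discretize a curve between two latent points and optimize
# the interior points to trade off path length against the model's learned density.
import torch

def high_density_path(x_start, x_end, log_density, n_points=16, weight=1.0, steps=500):
    # Interior points initialized on the straight line, then optimized.
    ts = torch.linspace(0, 1, n_points)[1:-1, None]
    pts = ((1 - ts) * x_start + ts * x_end).clone().requires_grad_(True)
    opt = torch.optim.Adam([pts], lr=1e-2)
    for _ in range(steps):
        curve = torch.cat([x_start[None], pts, x_end[None]], dim=0)
        length = (curve[1:] - curve[:-1]).pow(2).sum()   # discrete kinetic energy
        potential = -log_density(pts).sum()              # penalize low-density regions
        loss = length + weight * potential
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cat([x_start[None], pts.detach(), x_end[None]], dim=0)
```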

[749] Superpositional Gradient Descent: Harnessing Quantum Principles for Model Training

Ahmet Erdem Pamuk, Emir Kaan Özdemir, Şuayp Talha Kocabay

Main category: cs.LG

TL;DR: Superpositional Gradient Descent (SGD) is a quantum-inspired optimizer that links gradient updates with quantum superposition through quantum circuit perturbations, showing faster convergence and lower loss than AdamW in LLM training.

DetailsMotivation: While classical optimization techniques like AdamW are widely used for LLM training, the mechanisms by which quantum-inspired methods can enhance classical training remain underexplored. The authors aim to bridge this gap by investigating how quantum principles can improve deep learning optimization.

Method: Introduces Superpositional Gradient Descent (SGD), a novel optimizer that connects gradient updates with quantum superposition by injecting quantum circuit perturbations. The paper presents a mathematical framework and implements hybrid quantum-classical circuits using PyTorch and Qiskit.

Result: On synthetic sequence classification tasks and large-scale LLM fine-tuning, SGD demonstrates faster convergence and yields lower final loss compared to AdamW. The method shows promising performance improvements in training efficiency.

Conclusion: The work provides new insights into the intersection of quantum computing and deep learning, suggesting practical pathways for leveraging quantum principles to control and enhance model behavior, though scalability and hardware constraints currently limit adoption.

Abstract: Large language models (LLMs) are increasingly trained with classical optimization techniques like AdamW to improve convergence and generalization. However, the mechanisms by which quantum-inspired methods enhance classical training remain underexplored. We introduce Superpositional Gradient Descent (SGD), a novel optimizer linking gradient updates with quantum superposition by injecting quantum circuit perturbations. We present a mathematical framework and implement hybrid quantum-classical circuits in PyTorch and Qiskit. On synthetic sequence classification and large-scale LLM fine-tuning, SGD converges faster and yields lower final loss than AdamW. Despite promising results, scalability and hardware constraints limit adoption. Overall, this work provides new insights into the intersection of quantum computing and deep learning, suggesting practical pathways for leveraging quantum principles to control and enhance model behavior.

[750] Graph Homomorphism Distortion: A Metric to Distinguish Them All and in the Latent Space Bind Them

Martin Carrasco, Olga Zaghen, Kavir Sumaraj, Erik Bekkers, Bastian Rieck

Main category: cs.LG

TL;DR: A novel pseudo-metric based on graph homomorphisms that measures feature distortion when mapping between graphs, addressing the interplay between structure and features in graph learning.

DetailsMotivation: Existing approaches for analyzing graph neural network expressivity ignore features and focus only on structure, making it difficult to assess similarity between graphs with close features. The paper aims to address this gap by developing a measure that captures both structural and feature-based similarities.

Method: Develops a new pseudo-metric called “graph homomorphism distortion” inspired by concepts from metric geometry. This measure quantifies the minimal worst-case distortion that node features undergo when mapping one graph to another via graph homomorphisms.

Result: The proposed measure can be efficiently calculated under certain assumptions, complements existing expressivity measures like 1-WL, and enables the definition of structural encodings that improve graph neural network predictive capabilities.

Conclusion: The graph homomorphism distortion metric provides a principled way to incorporate both structural and feature information when analyzing graph similarity and neural network expressivity, addressing a key limitation in current graph learning approaches.

Abstract: A large driver of the complexity of graph learning is the interplay between structure and features. When analyzing the expressivity of graph neural networks, however, existing approaches ignore features in favor of structure, making it nigh-impossible to assess to what extent two graphs with close features should be considered similar. We address this by developing a new (pseudo-)metric based on graph homomorphisms. Inspired by concepts from metric geometry, our graph homomorphism distortion measures the minimal worst-case distortion that node features of one graph are subjected to when mapping one graph to another. We demonstrate the utility of our novel measure by showing that (i) it can be efficiently calculated under some additional assumptions, (ii) it complements existing expressivity measures like 1-WL, and (iii) it permits defining structural encodings, which improve the predictive capabilities of graph neural networks.
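
Under one plausible reading of the definition, the quantity can be computed by brute force for tiny graphs: enumerate edge-preserving vertex maps and minimize the worst-case node-feature displacement. The sketch below is purely illustrative and does not reflect the paper's efficient computation.

```python
# Brute-force sketch (assumption about the exact definition): among all edge-preserving
# maps f from G to H, take the one minimizing the worst-case node-feature displacement.
# Only feasible for tiny graphs; shown purely to make the quantity concrete.
import itertools
import numpy as np

def homomorphism_distortion(edges_g, feats_g, edges_h, feats_h):
    n_g, n_h = len(feats_g), len(feats_h)
    edge_set_h = {(u, v) for u, v in edges_h} | {(v, u) for u, v in edges_h}
    best = np.inf
    for f in itertools.product(range(n_h), repeat=n_g):          # all vertex maps G -> H
        if all((f[u], f[v]) in edge_set_h for u, v in edges_g):   # homomorphism check
            worst = max(np.linalg.norm(feats_g[u] - feats_h[f[u]]) for u in range(n_g))
            best = min(best, worst)
    return best   # np.inf if no homomorphism exists

# Example: a 4-cycle mapped into a single edge, with 2-d node features.
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
edge = [(0, 1)]
print(homomorphism_distortion(square, np.eye(4, 2), edge, np.eye(2)))   # -> 1.0
```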

[751] SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar

Main category: cs.LG

TL;DR: SiDGen is a protein-conditioned diffusion framework for generating chemically valid ligands that are structurally compatible with protein binding pockets, using masked SMILES generation with lightweight folding-derived features for efficient pocket-aware molecular design.

DetailsMotivation: Existing approaches for ligand design either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability in computational drug discovery.

Method: SiDGen integrates masked SMILES generation with lightweight folding-derived features, offering two conditioning pathways: streamlined mode (pools coarse structural signals) and full mode (injects localized pairwise biases). Uses coarse-stride folding with nearest-neighbor upsampling to reduce quadratic memory costs, plus in-loop chemical validity checks and invalidity penalty for learning stability.

Result: Generates ligands with high validity, uniqueness, and novelty, achieving competitive performance in docking-based evaluations while maintaining reasonable molecular properties.

Conclusion: SiDGen provides scalable, pocket-aware molecular design for high-throughput drug discovery through efficient conditional generation.

Abstract: Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
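
The in-loop validity check is easy to picture: decoded SMILES strings that fail to parse contribute to an invalidity penalty. A minimal sketch, assuming RDKit is available and that the penalty is simply the invalid fraction of the batch (the paper's exact weighting may differ):

```python
# Sketch (assumption): an in-loop validity check that converts decoded SMILES strings
# into a scalar invalidity penalty, using RDKit parsing as the validity test.
from rdkit import Chem

def invalidity_penalty(smiles_batch, weight=1.0):
    # MolFromSmiles returns None for SMILES that cannot be parsed into a molecule.
    invalid = sum(1 for s in smiles_batch if Chem.MolFromSmiles(s) is None)
    return weight * invalid / max(len(smiles_batch), 1)
```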

[752] QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold Networks

Kazi Ahmed Asif Fuad, Lizhong Chen

Main category: cs.LG

TL;DR: QuantKAN: A unified framework for quantizing Kolmogorov Arnold Networks (KANs) that extends existing quantization methods to spline-based architectures, enabling efficient deployment in resource-constrained environments.

DetailsMotivation: KANs offer strong expressivity and interpretability but their heterogeneous spline and base branch parameters hinder efficient quantization, which remains unexamined compared to CNNs and Transformers. There's a need for systematic quantization methods for these novel architectures.

Method: QuantKAN extends modern quantization algorithms (LSQ, LSQ+, PACT, DoReFa, QIL, GPTQ, BRECQ, AdaRound, AWQ, HAWQ-V2) to spline-based layers with branch-specific quantizers for base, spline, and activation components. It supports both quantization aware training (QAT) and post-training quantization (PTQ) regimes.

Result: Established first systematic benchmarks for low-bit spline networks. KANs are compatible with low-bit quantization but exhibit strong method-architecture interactions. LSQ, LSQ+, and PACT preserve near full precision accuracy at 4-bit for shallow KAN models, while DoReFa provides stable behavior for deeper KAGN variants. For PTQ, GPTQ and Uniform consistently deliver strongest overall performance.

Conclusion: QuantKAN unifies spline learning and quantization, providing practical tools and guidelines for efficiently deploying KANs in real-world, resource-constrained environments. The framework enables systematic quantization of these novel neural architectures.

Abstract: Kolmogorov Arnold Networks (KANs) represent a new class of neural architectures that replace conventional linear transformations and node-based nonlinearities with spline-based function approximations distributed along network edges. Although KANs offer strong expressivity and interpretability, their heterogeneous spline and base branch parameters hinder efficient quantization, which remains unexamined compared to CNNs and Transformers. In this paper, we present QuantKAN, a unified framework for quantizing KANs across both quantization aware training (QAT) and post-training quantization (PTQ) regimes. QuantKAN extends modern quantization algorithms, such as LSQ, LSQ+, PACT, DoReFa, QIL, GPTQ, BRECQ, AdaRound, AWQ, and HAWQ-V2, to spline-based layers with branch-specific quantizers for base, spline, and activation components. Through extensive experiments on MNIST, CIFAR-10, and CIFAR-100 across multiple KAN variants (EfficientKAN, FastKAN, PyKAN, and KAGN), we establish the first systematic benchmarks for low-bit spline networks. Our results show that KANs, particularly deeper KAGN variants, are compatible with low-bit quantization but exhibit strong method-architecture interactions: LSQ, LSQ+, and PACT preserve near-full-precision accuracy at 4-bit for shallow KAN MLP and ConvNet models, while DoReFa provides the most stable behavior for deeper KAGN under aggressive low-bit settings. For PTQ, GPTQ and Uniform consistently deliver the strongest overall performance across datasets, with BRECQ highly competitive on simpler regimes such as MNIST. Our proposed QuantKAN framework thus unifies spline learning and quantization, and provides practical tools and guidelines for efficiently deploying KANs in real-world, resource-constrained environments.
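
The branch-specific idea can be illustrated with a small sketch: symmetric uniform fake quantization applied separately, with separate bit-widths, to the base and spline parameter branches of a KAN-style layer. This is an assumption-level toy, not the QuantKAN framework itself, which wraps many QAT/PTQ algorithms.

```python
# Minimal sketch (assumption): symmetric uniform fake quantization applied separately
# to the "base" and "spline" parameter branches of a KAN-style layer.
import torch

def fake_quantize(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale   # dequantized weights used in the forward pass

class QuantizedKANBranches(torch.nn.Module):
    def __init__(self, base_weight, spline_weight, base_bits=8, spline_bits=4):
        super().__init__()
        self.base_weight = torch.nn.Parameter(base_weight)
        self.spline_weight = torch.nn.Parameter(spline_weight)
        self.base_bits, self.spline_bits = base_bits, spline_bits

    def quantized_weights(self):
        # Each branch gets its own quantizer and bit-width.
        return (fake_quantize(self.base_weight, self.base_bits),
                fake_quantize(self.spline_weight, self.spline_bits))
```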

[753] Geometry-Aware Deep Congruence Networks for Manifold Learning in Cross-Subject Motor Imagery

Sanjeev Manivannan, Chandrashekar Lakshminarayan

Main category: cs.LG

TL;DR: Proposes geometry-aware congruence transforms (DCT, DLDCT, DDCT-UNet) for cross-subject EEG motor-imagery decoding, addressing dispersion scaling and orientation alignment to mitigate inter-subject variability.

DetailsMotivation: Cross-subject motor-imagery decoding in EEG-based BCIs faces strong inter-subject variability. Existing manifold-based approaches on covariance representations don't adequately address dispersion scaling and orientation alignment issues.

Method: Introduces three geometry-aware models using congruence transforms: Discriminative Congruence Transform (DCT), Deep Linear DCT (DLDCT), and Deep DCT-UNet (DDCT-UNet). These can be used as pre-alignment modules or end-to-end discriminative systems trained via cross-entropy backpropagation with custom logistic-regression head.

Result: The proposed framework improves transductive cross-subject accuracy by 2-3% across challenging motor-imagery benchmarks, demonstrating the value of geometry-aware congruence learning.

Conclusion: Geometry-aware congruence transforms effectively address dispersion scaling and orientation alignment in cross-subject EEG motor-imagery decoding, leading to improved performance in brain-computer interfaces.

Abstract: Cross-subject motor-imagery decoding remains a major challenge in EEG-based brain-computer interfaces. To mitigate strong inter-subject variability, recent work has emphasized manifold-based approaches operating on covariance representations. Yet dispersion scaling and orientation alignment remain largely unaddressed in existing methods. In this paper, we address both issues through congruence transforms and introduce three complementary geometry-aware models: (i) Discriminative Congruence Transform (DCT), (ii) Deep Linear DCT (DLDCT), and (iii) Deep DCT-UNet (DDCT-UNet). These models are evaluated both as pre-alignment modules for downstream classifiers and as end-to-end discriminative systems trained via cross-entropy backpropagation with a custom logistic-regression head. Across challenging motor-imagery benchmarks, the proposed framework improves transductive cross-subject accuracy by 2-3%, demonstrating the value of geometry-aware congruence learning.
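
A minimal sketch of the underlying congruence operation, assuming it is a shared learnable matrix W applied to each trial covariance as W^T C W (the DCT/DLDCT/DDCT-UNet variants stack and train such transforms in more elaborate ways):

```python
# Sketch (assumption): a congruence transform re-expresses each trial covariance as
# W^T C W with a shared learnable matrix W, which can rescale dispersion and rotate
# orientation across subjects.
import torch

class CongruenceTransform(torch.nn.Module):
    def __init__(self, n_channels: int):
        super().__init__()
        # Initialize near the identity so training starts close to "no alignment".
        self.W = torch.nn.Parameter(torch.eye(n_channels) + 0.01 * torch.randn(n_channels, n_channels))

    def forward(self, covs: torch.Tensor) -> torch.Tensor:
        # covs: (batch, C, C) symmetric positive definite trial covariances.
        # Congruence preserves symmetry and (for invertible W) positive definiteness.
        return self.W.T @ covs @ self.W
```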

[754] Pushing the Limits of Distillation-Based Class-Incremental Learning via Lightweight Plugins

Zhiming Xu, Baile Xu, Jian Zhao, Furao Shen, Suorong Yang

Main category: cs.LG

TL;DR: DLC is a plug-in extension paradigm using LoRA components to enhance existing class-incremental learning methods by injecting task-specific residuals and aggregating them with a lightweight weighting unit.

DetailsMotivation: Existing replay and distillation-based class-incremental learning methods are constrained by the stability-plasticity dilemma. While they encode rich representations as pre-trained bases, they need enhancement to better handle incremental tasks without catastrophic forgetting.

Method: Proposes Deployment of LoRA Components (DLC) - a plug-in extension paradigm that uses Low-Rank Adaptation (LoRA) to inject task-specific residuals into base model’s deep layers. During inference, representations with task-specific residuals are aggregated, with a lightweight weighting unit to mitigate interference from non-target LoRA plugins.

Result: On ImageNet-100, DLC achieves 8% accuracy improvement with only 4% of ResNet-18 parameters. Under fixed memory budget, methods equipped with DLC surpass state-of-the-art expansion-based methods.

Conclusion: DLC serves as an efficient plug-and-play enhancement that extends base class-incremental learning methods, demonstrating exceptional parameter efficiency and performance improvements on large-scale datasets.

Abstract: Existing replay and distillation-based class-incremental learning (CIL) methods are effective at retaining past knowledge but are still constrained by the stability-plasticity dilemma. Since their resulting models are learned over a sequence of incremental tasks, they encode rich representations and can be regarded as pre-trained bases. Building on this view, we propose a plug-in extension paradigm termed Deployment of LoRA Components (DLC) to enhance them. For each task, we use Low-Rank Adaptation (LoRA) to inject task-specific residuals into the base model’s deep layers. During inference, representations with task-specific residuals are aggregated to produce classification predictions. To mitigate interference from non-target LoRA plugins, we introduce a lightweight weighting unit. This unit learns to assign importance scores to different LoRA-tuned representations. Like downloadable content in software, DLC serves as a plug-and-play enhancement that efficiently extends the base methods. Remarkably, on the large-scale ImageNet-100, with merely 4% of the parameters of a standard ResNet-18, our DLC model achieves a significant 8% improvement in accuracy, demonstrating exceptional efficiency. Under a fixed memory budget, methods equipped with DLC surpass state-of-the-art expansion-based methods.
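
A rough sketch of the plug-in mechanism, under several assumptions (linear layers, softmax scores from the weighting unit, per-sample aggregation of LoRA residuals); names such as `weighting_unit` are illustrative, not the paper's:

```python
# Sketch (assumptions throughout): a frozen base linear layer carries per-task LoRA
# residuals; a lightweight weighting unit scores each plugin per sample and the
# residual outputs are aggregated with those scores.
import torch
import torch.nn as nn

class LoRAPluginLinear(nn.Module):
    def __init__(self, base: nn.Linear, n_tasks: int, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # the base model stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.ParameterList([nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(n_tasks)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(d_out, rank)) for _ in range(n_tasks)])
        self.weighting_unit = nn.Linear(d_in, n_tasks)   # scores each LoRA plugin per sample

    def forward(self, x):
        out = self.base(x)
        weights = torch.softmax(self.weighting_unit(x), dim=-1)   # (batch, n_tasks)
        for t in range(len(self.A)):
            residual = x @ self.A[t].T @ self.B[t].T               # task-specific low-rank residual
            out = out + weights[:, t:t + 1] * residual
        return out
```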

[755] Learning to Orchestrate Agents in Natural Language with the Conductor

Stefan Nielsen, Edoardo Cetin, Peter Schwendeman, Qi Sun, Jinglue Xu, Yujin Tang

Main category: cs.LG

TL;DR: A Conductor model uses reinforcement learning to coordinate multiple LLMs, learning communication topologies and prompt engineering to achieve state-of-the-art performance on reasoning benchmarks.

DetailsMotivation: Different LLMs have been trained to specialize in various domains, but coordinating them effectively for complex tasks requires sophisticated strategies. The paper aims to discover optimal coordination strategies among LLMs through reinforcement learning.

Method: Train a 7B Conductor model with reinforcement learning to design communication topologies and engineer prompts for worker LLMs. The conductor learns to coordinate pools of LLMs, adapts to arbitrary agent sets, and can recursively include itself as a worker for dynamic test-time scaling.

Result: The Conductor achieves significant performance gains beyond any individual worker LLM, attaining state-of-the-art results on challenging reasoning benchmarks like LiveCodeBench and GPQA. It effectively adapts to arbitrary sets of open- and closed-source agents.

Conclusion: Language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally through end-to-end reward maximization. The approach enables effective multi-agent collaboration and dynamic scaling.

Abstract: Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.

[756] Let the Optimizers Optimize Themselves

Jaerin Lee, Kyoung Mu Lee

Main category: cs.LG

TL;DR: Theoretical foundation for automating optimizer design in gradient-based learning by formulating optimizer design as maximizing instantaneous loss decrease, treating optimizers as functions mapping gradients to parameter updates.

DetailsMotivation: To establish a systematic, theoretical approach for automating the design of optimizers and their hyperparameters in gradient-based learning, moving beyond heuristic or manual tuning methods.

Method: Formulates optimizer design as maximizing instantaneous loss decrease using greedy principle. Treats optimizers as functions translating loss gradients to parameter motions, reducing to convex optimization problems over optimizer space. Solves under various constraints to recover popular optimizers as closed-form solutions.

Result: Recovers wide range of popular optimizers as closed-form solutions, produces optimal hyperparameters for given problems, enables systematic optimizer design and dynamic hyperparameter tuning based on gradient statistics collected during training.

Conclusion: Provides theoretical foundation for automating optimizer design, enabling systematic creation of optimizers and dynamic hyperparameter tuning, with optimization of optimization possible during training process.

Abstract: We lay the theoretical foundation for automating optimizer design in gradient-based learning. Based on the greedy principle, we formulate the problem of designing optimizers and their hyperparameters as maximizing the instantaneous decrease in loss. By treating an optimizer as a function that translates loss gradient signals into parameter motions, the problem reduces to a family of convex optimization problems over the space of optimizers. Solving these problems under various constraints not only recovers a wide range of popular optimizers as closed-form solutions, but also produces the optimal hyperparameters of these optimizers with respect to the problems at hand. This enables a systematic approach to design optimizers and tune their hyperparameters dynamically according to the gradient statistics that are collected during the training process. Furthermore, this optimization of optimization can be performed dynamically during training.
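
As a worked illustration (not taken from the paper) of how the greedy principle recovers familiar optimizers: the first-order loss decrease along an update d is determined by g^T d, and maximizing it under different step-norm constraints yields closed-form update rules.

```latex
% Illustrative derivation: greedily maximizing the instantaneous loss decrease
% under a step-size constraint (my notation, not the paper's).
\[
\begin{aligned}
\text{first-order change:}\quad & L(\theta + d) - L(\theta) \approx g^\top d,
  \qquad g = \nabla_\theta L(\theta),\\
\max_{\|d\|_2 \le \eta} \; -g^\top d
  \;\Rightarrow\;& d^\star = -\eta \,\frac{g}{\|g\|_2}
  \quad\text{(normalized gradient descent)},\\
\max_{\|d\|_\infty \le \eta} \; -g^\top d
  \;\Rightarrow\;& d^\star = -\eta \,\operatorname{sign}(g)
  \quad\text{(sign descent, the direction underlying Adam-type updates)}.
\end{aligned}
\]
```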

[757] Physics-Aware Heterogeneous GNN Architecture for Real-Time BESS Optimization in Unbalanced Distribution Systems

Aoxiang Ma, Salah Ghamizi, Jun Cao, Pedro Rodriguez

Main category: cs.LG

TL;DR: Graph neural networks with three-phase grid embeddings and physics-informed loss functions for accurate battery energy storage dispatch in unbalanced distribution grids

DetailsMotivation: Existing deep learning approaches for battery energy storage systems in three-phase unbalanced distribution grids lack explicit three-phase representation, making it difficult to accurately model phase-specific dynamics and enforce operational constraints, leading to infeasible dispatch solutions.

Method: Embed detailed three-phase grid information (phase voltages, unbalanced loads, BESS states) into heterogeneous graph nodes, then use diverse GNN architectures (GCN, GAT, GraphSAGE, GPS) to predict network state variables. Incorporate physics-informed loss function with soft penalties for critical battery constraints (SoC and C-rate limits) during training.

Result: Experimental validation on CIGRE 18-bus distribution system shows low prediction errors: bus voltage MSEs of 6.92e-07 (GCN), 1.21e-06 (GAT), 3.29e-05 (GPS), and 9.04e-07 (SAGE). Physics-informed method ensures nearly zero SoC and C-rate constraint violations.

Conclusion: The embedding-loss approach with heterogeneous graph representation and physics-informed constraints enables accurate, constraint-compliant dispatch for battery energy storage systems in unbalanced distribution grids.

Abstract: Battery energy storage systems (BESS) have become increasingly vital in three-phase unbalanced distribution grids for maintaining voltage stability and enabling optimal dispatch. However, existing deep learning approaches often lack explicit three-phase representation, making it difficult to accurately model phase-specific dynamics and enforce operational constraints, leading to infeasible dispatch solutions. This paper demonstrates that by embedding detailed three-phase grid information (including phase voltages, unbalanced loads, and BESS states) into heterogeneous graph nodes, diverse GNN architectures (GCN, GAT, GraphSAGE, GPS) can jointly predict network state variables with high accuracy. Moreover, a physics-informed loss function incorporates critical battery constraints (SoC and C-rate limits) via soft penalties during training. Experimental validation on the CIGRE 18-bus distribution system shows that this embedding-loss approach achieves low prediction errors, with bus voltage MSEs of 6.92e-07 (GCN), 1.21e-06 (GAT), 3.29e-05 (GPS), and 9.04e-07 (SAGE). Importantly, the physics-informed method ensures nearly zero SoC and C-rate constraint violations, confirming its effectiveness for reliable, constraint-compliant dispatch.
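
The physics-informed term is straightforward to sketch: a prediction loss plus soft penalties whenever the predicted state of charge or C-rate leaves its limits. The snippet below is an assumption-level example with made-up limit values, not the paper's exact loss.

```python
# Sketch (assumption): prediction loss plus soft penalties for state-of-charge and
# C-rate limit violations, the kind of physics-informed term described above.
import torch

def bess_loss(pred_v, true_v, soc, c_rate,
              soc_min=0.1, soc_max=0.9, c_rate_max=0.5, penalty_weight=10.0):
    mse = torch.mean((pred_v - true_v) ** 2)                     # state-prediction error
    soc_violation = torch.relu(soc - soc_max) + torch.relu(soc_min - soc)
    c_rate_violation = torch.relu(c_rate.abs() - c_rate_max)
    penalty = torch.mean(soc_violation ** 2) + torch.mean(c_rate_violation ** 2)
    return mse + penalty_weight * penalty
```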

[758] Corrective Diffusion Language Models

Shuibai Zhang, Fred Zhangzhi Peng, Yiheng Zhang, Jin Pan, Grigorios G. Chrysos

Main category: cs.LG

TL;DR: This paper addresses the limitation of Diffusion Language Models (DLMs) in iterative refinement by proposing a correction-oriented training approach that enables discriminative confidence estimation and targeted error correction, particularly for code revision tasks.

DetailsMotivation: While DLMs are theoretically suited for iterative refinement due to their non-causal structure, they often fail to reliably revise incorrect tokens in practice because they cannot distinguish between correct and erroneous tokens in visible sequences. Standard masked diffusion training focuses only on unmasking, which undermines effective refinement guided by confidence.

Method: The authors propose a post-training principle oriented by correction that explicitly supervises visible incorrect tokens, enabling discriminative confidence and targeted refinement. They introduce the Code Revision Benchmark for evaluating error localization and in-place correction, and demonstrate their approach on code revision tasks and parallel decoding scenarios.

Result: Models trained with the proposed correction-oriented approach substantially outperform standard masked diffusion language models (MDLMs), with gains most pronounced when parallel decoding introduces substantial uncertainty and iterative refinement becomes essential.

Conclusion: The paper shows that corrective behavior in DLMs is not induced by conventional masked diffusion objectives and requires explicit supervision of visible incorrect tokens. The proposed approach enables effective error localization and targeted refinement, particularly valuable for code revision tasks where iterative refinement is crucial.

Abstract: While Diffusion Language Models (DLMs) are theoretically well-suited for iterative refinement due to their non-causal structure, they often fail to reliably revise incorrect tokens in practice. The key challenge lies in the model’s inability to distinguish between correct and erroneous tokens in a visible sequence. Standard masked diffusion language model (MDLM) training is restricted to the objective of unmasking, undermining the effectiveness of refinement guided by confidence. Based on this observation, we study corrective behavior in DLMs, defined as the ability to assign lower confidence to incorrect tokens and iteratively refine them while preserving correct content. We show that this capability is not induced by conventional masked diffusion objectives and propose a post-training principle oriented by correction that explicitly supervises visible incorrect tokens, enabling discriminative confidence and targeted refinement. To evaluate corrective behavior, we introduce the Code Revision Benchmark, a controllable and executable benchmark for assessing error localization and in-place correction. Experiments on code revision tasks and parallel decoding scenarios demonstrate that models trained with our approach substantially outperform standard MDLMs, with gains that are most pronounced when parallel decoding introduces substantial uncertainty and iterative refinement becomes essential. Our code is publicly available at https://github.com/zhangshuibai/CDLM.
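
A minimal sketch of the correction-oriented objective as described: alongside the usual cross-entropy at masked positions, visible-but-corrupted positions also receive supervision toward the ground truth, so the model learns to give them low confidence and fix them in place. The exact weighting and corruption scheme here are assumptions.

```python
# Sketch (assumption): add an explicit cross-entropy term at visible-but-corrupted
# positions on top of the standard masked-position loss.
import torch
import torch.nn.functional as F

def corrective_loss(logits, targets, mask_positions, corrupted_positions, corr_weight=1.0):
    # logits: (B, T, V); targets: (B, T); the position masks are boolean (B, T).
    ce = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")        # (B, T)
    mask_f, corr_f = mask_positions.float(), corrupted_positions.float()
    unmask_loss = (ce * mask_f).sum() / mask_f.sum().clamp(min=1.0)                # standard objective
    correct_loss = (ce * corr_f).sum() / corr_f.sum().clamp(min=1.0)               # corrective supervision
    return unmask_loss + corr_weight * correct_loss
```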

[759] Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet

Main category: cs.LG

TL;DR: The paper establishes a theoretical equivalence between autoregressive models (ARMs) and energy-based models (EBMs) for language modeling, showing they are bijective in function space and connected to reinforcement learning principles.

DetailsMotivation: ARMs dominate current LLMs while EBMs naturally characterize optimal policies in alignment. The paper aims to unify these model classes to understand their theoretical connections and implications for planning capabilities in next-token prediction models.

Method: Starting from the chain rule of probability, the authors establish an explicit bijection between ARMs and EBMs in function space, showing it corresponds to a special case of the soft Bellman equation in maximum entropy RL. They derive equivalence between supervised learning of both model types and analyze EBM-to-ARM distillation with theoretical error bounds.

Result: The paper provides theoretical foundations showing ARMs and EBMs are mathematically equivalent through a bijective mapping, explains how this connects to reinforcement learning principles, and offers insights into why ARMs can exhibit planning behavior despite next-token prediction.

Conclusion: The unified view bridges ARM and EBM paradigms, revealing deep connections between language modeling and reinforcement learning, and explains how ARMs can perform planning tasks despite their autoregressive nature.

Abstract: Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
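
One way to write down the correspondence (in illustrative notation, not necessarily the paper's): the log-partition of the remaining suffix acts as a soft value function, its recursion is the soft Bellman equation, and its increments are the autoregressive conditionals.

```latex
% Illustrative form of the ARM <-> EBM correspondence: the suffix log-partition
% plays the role of a soft value function.
\[
\begin{aligned}
&\text{EBM over full sequences:}\quad p(x_{1:T}) \propto \exp\!\big(-E(x_{1:T})\big),\\
&V(x_{<t}) \;=\; \log \sum_{x_{t:T}} \exp\!\big(-E(x_{<t}, x_{t:T})\big)
 \qquad\text{(soft value / log-partition of the suffix)},\\
&V(x_{<t}) \;=\; \log \sum_{x_t} \exp\!\big(V(x_{\le t})\big)
 \qquad\text{(soft Bellman recursion)},\\
&p(x_t \mid x_{<t}) \;=\; \exp\!\big(V(x_{\le t}) - V(x_{<t})\big)
 \qquad\text{(the induced autoregressive conditionals)}.
\end{aligned}
\]
```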

[760] Generative Modeling through Koopman Spectral Analysis: An Operator-Theoretic Perspective

Yuanchao Xu, Fengyi Li, Masahiro Fujisawa, Xiaoyuan Cheng, Youssef Marzouk, Isao Ishikawa

Main category: cs.LG

TL;DR: KSWGD is a particle-based generative modeling framework that learns Langevin generators via Koopman theory and Wasserstein gradient descent, achieving linear convergence without vanishing gradients.

DetailsMotivation: Existing kernel-based particle methods suffer from vanishing-gradient phenomena and slow convergence. The authors aim to develop a method that can learn the spectral structure of distributions directly from trajectory data without needing explicit knowledge of the target potential.

Method: Combines Koopman operator theory with Wasserstein gradient descent. Estimates spectral structure of underlying distribution from trajectory data via Koopman operator, eliminating need for explicit target potential knowledge. Maintains approximately constant dissipation rate to ensure linear convergence.

Result: KSWGD consistently outperforms baselines in convergence speed and sample quality across experiments on compact manifolds, metastable multi-well systems, and high-dimensional stochastic partial differential equations.

Conclusion: KSWGD provides an effective framework for particle-based generative modeling that overcomes vanishing-gradient issues and achieves linear convergence by leveraging Koopman spectral analysis and Wasserstein gradient descent.

Abstract: We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a particle-based generative modeling framework that learns the Langevin generator via Koopman theory and integrates it with Wasserstein gradient descent. Our key insight is that this spectral structure of the underlying distribution can be directly estimated from trajectory data via the Koopman operator, eliminating the need for explicit knowledge of the target potential. Additionally, we prove that KSWGD maintains an approximately constant dissipation rate, thereby establishing linear convergence and overcoming the vanishing-gradient phenomenon that hinders existing kernel-based particle methods. We further provide a Feynman–Kac interpretation that clarifies the method’s probabilistic foundation. Experiments on compact manifolds, metastable multi-well systems, and high-dimensional stochastic partial differential equations demonstrate that KSWGD consistently outperforms baselines in both convergence speed and sample quality.

[761] GLUE: Gradient-free Learning to Unify Experts

Jong-Ik Park, Shreyas Chaudhari, Srinivasa Pranav, Carlee Joe-Wong, José M. F. Moura

Main category: cs.LG

TL;DR: GLUE: Gradient-free Learning to Unify Experts - a method for domain expansion by learning convex combinations of pretrained expert models using gradient-free optimization, requiring only two forward passes per step.

DetailsMotivation: Existing methods for domain expansion (combining multiple specialist models for new target domains) use heuristic blending based on data size or proxy metrics, which often yields lower accuracy. Learning optimal coefficients via backpropagation is computationally expensive.

Method: GLUE initializes target model as convex combination of fixed expert models and learns mixture coefficients via gradient-free two-point SPSA (simultaneous perturbation stochastic approximation) updates, requiring only two forward passes per step instead of full backpropagation.

Result: Across three datasets and three network architectures, GLUE produces model parameter priors that outperform baselines when fine-tuned. Improves test accuracy by up to 8.5% over data-size weighting and 9.1% over proxy-metric selection. Matches or outperforms backpropagation-based mixing within 1.4%.

Conclusion: GLUE provides an efficient gradient-free method for learning optimal combinations of expert models for domain expansion, achieving strong performance with minimal computational overhead compared to heuristic or backpropagation-based approaches.

Abstract: In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet, new target domains often require domain expansion: a generalized model that performs well beyond any single specialist’s domain. Given a new target domain, existing methods obtain a single strong initialization prior for the model parameters by blending expert models to initialize a target model. However, heuristic blending – using mixing coefficients based on data size or proxy metrics – often yields lower target-domain test accuracy, and learning these coefficients on the target domain’s loss function typically requires computationally-expensive full backpropagation through a neural network. We propose GLUE, Gradient-free Learning to Unify Experts, which initializes the target model as a convex combination of fixed experts and learns the mixture coefficients of this combination via gradient-free two-point SPSA (simultaneous perturbation stochastic approximation) updates, requiring only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces model parameter priors that can be fine-tuned to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection. GLUE either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.
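
The core loop is compact enough to sketch directly: mix expert checkpoints with softmax weights, and update the mixing logits with a two-point SPSA estimate that needs only two evaluations per step. `eval_loss` is a hypothetical callable that scores a mixed state dict on target-domain data; the softmax parameterization is an assumption.

```python
# Sketch (assumptions: softmax-parameterized mixing weights, hypothetical `eval_loss`):
# GLUE-style two-point SPSA update of the coefficients that convex-combine experts.
import torch

def mix_experts(expert_state_dicts, logits):
    """Convex combination of expert parameters with weights softmax(logits)."""
    weights = torch.softmax(logits, dim=0)
    return {name: sum(w * sd[name] for w, sd in zip(weights, expert_state_dicts))
            for name in expert_state_dicts[0]}

def spsa_step(logits, expert_state_dicts, eval_loss, lr=0.1, c=0.05):
    delta = torch.randint(0, 2, logits.shape).float() * 2 - 1        # Rademacher +/-1
    loss_plus = eval_loss(mix_experts(expert_state_dicts, logits + c * delta))
    loss_minus = eval_loss(mix_experts(expert_state_dicts, logits - c * delta))
    # For Rademacher perturbations, dividing by delta equals multiplying by delta.
    grad_est = (loss_plus - loss_minus) / (2 * c) * delta            # two forward passes only
    return logits - lr * grad_est
```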

[762] EEG-based Graph-guided Domain Adaptation for Robust Cross-Session Emotion Recognition

Maryam Mirzaei, Farzaneh Shayegh, Hamed Narimani

Main category: cs.LG

TL;DR: EGDA framework for EEG-based emotion recognition reduces cross-session discrepancies through joint global and class-specific distribution alignment with graph regularization, achieving robust performance on SEED-IV dataset.

DetailsMotivation: Accurate emotion recognition is crucial for human-machine interaction, and EEG provides reliable neural data but suffers from cross-session variability that hinders model generalization.

Method: Proposes EGDA framework that jointly aligns global (marginal) and class-specific (conditional) distributions while preserving EEG data structure through graph regularization to reduce cross-session discrepancies.

Result: Achieves accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks on SEED-IV dataset, surpassing baseline methods. Identifies Gamma band as most discriminative and central-parietal/prefrontal regions as critical.

Conclusion: EGDA effectively addresses cross-session generalization in EEG-based emotion recognition through distribution alignment and structural preservation, providing insights into discriminative frequency bands and brain regions.

Abstract: Accurate recognition of human emotional states is critical for effective human-machine interaction. Electroencephalography (EEG) offers a reliable source for emotion recognition due to its high temporal resolution and its direct reflection of neural activity. Nevertheless, variations across recording sessions present a major challenge for model generalization. To address this issue, we propose EGDA, a framework that reduces cross-session discrepancies by jointly aligning the global (marginal) and class-specific (conditional) distributions, while preserving the intrinsic structure of EEG data through graph regularization. Experimental results on the SEED-IV dataset demonstrate that EGDA achieves robust cross-session performance, obtaining accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks, and surpassing several baseline methods. Furthermore, the analysis highlights the Gamma frequency band as the most discriminative and identifies the central-parietal and prefrontal brain regions as critical for reliable emotion recognition.

[763] Expanding the Chaos: Neural Operator for Stochastic (Partial) Differential Equations

Dai Shi, Lequan Lin, Andi Han, Luke Thompson, José Miguel Hernández-Lobato, Zhiyong Wang, Junbin Gao

Main category: cs.LG

TL;DR: Neural operators based on Wiener-chaos expansions for learning SDE/SPDE solution operators, enabling single-pass trajectory reconstruction from noise with applications across domains.

DetailsMotivation: SDEs and SPDEs are fundamental for modeling stochastic dynamics, but learning their solution operators with deep learning models promises faster solvers and new perspectives on classical learning tasks.

Method: Build on Wiener-chaos expansions to design neural operator architectures: project driving noise paths onto orthonormal Wick-Hermite features, use neural operators to parameterize chaos coefficients, enabling reconstruction of full trajectories from noise in a single forward pass.

Result: Achieved competitive accuracy across several tasks including SPDE benchmarks, SDE-based diffusion one-step image sampling, topological graph interpolation, financial extrapolation, parameter estimation, and manifold SDE flood forecasting.

Conclusion: WCE-based neural operators are a practical and scalable approach to learning SDE/SPDE solution operators across domains.

Abstract: Stochastic differential equations (SDEs) and stochastic partial differential equations (SPDEs) are fundamental for modeling stochastic dynamics across the natural sciences and modern machine learning. Learning their solution operators with deep learning models promises fast solvers and new perspectives on classical learning tasks. In this work, we build on Wiener-chaos expansions (WCE) to design neural operator (NO) architectures for SDEs and SPDEs: we project driving noise paths onto orthonormal Wick-Hermite features and use NOs to parameterize the resulting chaos coefficients, enabling reconstruction of full trajectories from noise in a single forward pass. We also make the underlying WCE structure explicit for multi-dimensional SDEs and semilinear SPDEs by showing the coupled deterministic ODE/PDE systems governing these coefficients. Empirically, we achieve competitive accuracy across several tasks, including standard SPDE benchmarks and SDE-based diffusion one-step image sampling, topological graph interpolation, financial extrapolation, parameter estimation, and manifold SDE flood forecasting. These results suggest WCE-based neural operators are a practical and scalable approach to learning SDE/SPDE solution operators across domains.

[764] Reward-Preserving Attacks For Robust Reinforcement Learning

Lucas Schott, Elies Gherbi, Hatem Hajri, Sylvain Lamprier

Main category: cs.LG

TL;DR: Adversarial training in RL with adaptive perturbation strength using reward-preserving attacks that maintain α fraction of nominal-to-worst-case return gap

DetailsMotivation: Fixed-strength adversarial attacks in RL are problematic because perturbations compound over time, making attacks either too destructive or too conservative. Need adaptive approach that balances robustness and performance.

Method: Propose reward-preserving attacks that adapt adversarial strength dynamically using learned critic Q((s,a),η) to estimate expected return of α-reward-preserving rollouts. Perturbation magnitudes η are selected to preserve α fraction of nominal-to-worst-case return gap at each state.

Result: For intermediate α values, adaptive training yields policies robust across wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.

Conclusion: Reward-preserving attacks with adaptive perturbation strength provide effective adversarial training for RL, balancing robustness and performance better than fixed-strength approaches.

Abstract: Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $α$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $η$ are selected dynamically, using a learned critic $Q((s,a),η)$ that estimates the expected return of $α$-reward-preserving rollouts. For intermediate values of $α$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming fixed-radius and uniformly sampled-radius adversarial training.
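
A sketch of how the adaptive radius might be selected at each state, assuming a grid of candidate magnitudes and a critic trained to predict the return under attacks of a given magnitude (the selection rule below is an interpretation of the α-preservation condition, not the paper's exact procedure):

```python
# Sketch (assumption about the selection rule): pick the largest perturbation radius
# from a grid such that the critic still predicts at least an alpha fraction of the
# nominal-to-worst-case return gap above the worst case.
import torch

def select_radius(critic, state, action, radii, alpha):
    # radii: candidate perturbation magnitudes, sorted ascending;
    # critic(state, action, eta) estimates the return under attacks of magnitude eta.
    q_nominal = critic(state, action, torch.tensor(0.0))
    q_worst = critic(state, action, torch.tensor(float(max(radii))))
    threshold = q_worst + alpha * (q_nominal - q_worst)   # keep alpha of the gap achievable
    chosen = 0.0
    for eta in radii:
        if critic(state, action, torch.tensor(float(eta))) >= threshold:
            chosen = float(eta)                            # strongest attack still above threshold
    return chosen
```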

[765] Kernel Alignment-based Multi-view Unsupervised Feature Selection with Sample-level Adaptive Graph Learning

Yalan Tan, Yanyong Huang, Zongxin Shen, Dongjie Wang, Fengmao Lv, Tianrui Li

Main category: cs.LG

TL;DR: KAFUSE: Kernel Alignment-based multi-view unsupervised feature selection with sample-level adaptive graph learning for handling nonlinear dependencies and local structure differences in multi-view data.

DetailsMotivation: Existing multi-view unsupervised feature selection methods have limitations: they focus on linear correlations but overlook complex nonlinear dependencies, and they use sample-invariant weights for graph fusion that fail to account for differences in local neighborhood clarity among samples within each view.

Method: Proposes KAFUSE with two key components: 1) kernel alignment with orthogonal constraint to reduce feature redundancy in both linear and nonlinear relationships, and 2) cross-view consistent similarity graph learning using sample-level fusion on tensor-stacked similarity graphs, which automatically adjusts view weights for each sample during fusion.

Result: Extensive experiments on real multi-view datasets demonstrate the superiority of KAFUSE over state-of-the-art methods.

Conclusion: KAFUSE effectively addresses limitations of existing methods by handling nonlinear dependencies and sample-level view weight adaptation, leading to improved feature selection performance for multi-view data.

Abstract: Although multi-view unsupervised feature selection (MUFS) has demonstrated success in dimensionality reduction for unlabeled multi-view data, most existing methods reduce feature redundancy by focusing on linear correlations among features but often overlook complex nonlinear dependencies. This limits the effectiveness of feature selection. In addition, existing methods fuse similarity graphs from multiple views by employing sample-invariant weights to preserve local structure. However, this process fails to account for differences in local neighborhood clarity among samples within each view, thereby hindering accurate characterization of the intrinsic local structure of the data. In this paper, we propose a Kernel Alignment-based multi-view unsupervised FeatUre selection with Sample-level adaptive graph lEarning method (KAFUSE) to address these issues. Specifically, we first employ kernel alignment with an orthogonal constraint to reduce feature redundancy in both linear and nonlinear relationships. Then, a cross-view consistent similarity graph is learned by applying sample-level fusion to each slice of a tensor formed by stacking similarity graphs from different views, which automatically adjusts the view weights for each sample during fusion. These two steps are integrated into a unified model for feature selection, enabling mutual enhancement between them. Extensive experiments on real multi-view datasets demonstrate the superiority of KAFUSE over state-of-the-art methods.

[766] Low-Rank Key Value Attention

James O’Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

Main category: cs.LG

TL;DR: LRKV attention reduces KV cache memory by using shared full-rank KV projections with low-rank head-specific residuals, achieving better performance with less memory than standard attention variants.

DetailsMotivation: The key-value (KV) cache is a major memory bottleneck in Transformers, and existing solutions like MQA/GQA or MLA have limitations in balancing memory efficiency with model quality.

Method: Proposes Low-Rank Key-Value (LRKV) attention where each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, creating a continuous trade-off between complete sharing and full independence.

Result: LRKV achieves lowest test loss among MHA, MQA/GQA, and MLA while using only 45-53% of MHA’s KV cache, reaches equivalent baseline quality 18-25% faster, and achieves highest downstream task performance across multiple benchmarks.

Conclusion: LRKV provides an effective approach to reduce KV cache memory while maintaining or improving model quality, offering better efficiency-quality trade-offs than existing attention variants.

Abstract: The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models of size 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45-53% of MHA’s KV cache. LRKV reaches equivalent baseline quality 18-25% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance across ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.
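
A sketch of the shared-plus-low-rank parameterization, under assumptions about the exact shapes (for instance, sharing a single low-rank bottleneck across heads); only the shared projection's output would need to be cached at full width.

```python
# Sketch (assumption about the exact parameterization): keys/values come from one
# shared full-rank projection plus a low-rank, head-specific residual.
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_head: int, rank: int = 16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.shared_kv = nn.Linear(d_model, 2 * d_head, bias=False)       # shared across heads
        self.down = nn.Linear(d_model, rank, bias=False)                  # shared low-rank bottleneck
        self.up = nn.Parameter(torch.zeros(n_heads, rank, 2 * d_head))    # head-specific residuals

    def forward(self, x: torch.Tensor):
        # x: (batch, seq, d_model)
        shared = self.shared_kv(x)                                        # (B, S, 2*d_head)
        residual = torch.einsum("bsr,hrd->bhsd", self.down(x), self.up)   # (B, H, S, 2*d_head)
        kv = shared.unsqueeze(1) + residual                               # broadcast the shared part
        k, v = kv.split(self.d_head, dim=-1)
        return k, v                                                       # (B, H, S, d_head) each
```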

[767] Trainability-Oriented Hybrid Quantum Regression via Geometric Preconditioning and Curriculum Optimization

Qingyu Meng, Yangshuai Wang

Main category: cs.LG

TL;DR: Hybrid quantum-classical regression framework with geometric preconditioning and curriculum optimization improves trainability and stability of quantum neural networks for regression tasks.

DetailsMotivation: Quantum neural networks (QNNs) for regression suffer from limited trainability under noisy gradients and ill-conditioned optimization, requiring methods to improve their stability and convergence.

Method: Proposes a hybrid quantum-classical framework with: 1) lightweight classical embedding as learnable geometric preconditioner to reshape input representation, 2) curriculum optimization protocol that progressively increases circuit depth and transitions from SPSA-based stochastic exploration to Adam-based gradient fine-tuning.

Result: Framework consistently improves over pure QNN baselines, yields more stable convergence in data-limited regimes, and reduces structured errors correlated with oscillatory components on scientific benchmarks.

Conclusion: Geometric preconditioning combined with curriculum training is a practical approach for stabilizing quantum regression, addressing key bottlenecks in QNN trainability.

Abstract: Quantum neural networks (QNNs) have attracted growing interest for scientific machine learning, yet in regression settings they often suffer from limited trainability under noisy gradients and ill-conditioned optimization. We propose a hybrid quantum-classical regression framework designed to mitigate these bottlenecks. Our model prepends a lightweight classical embedding that acts as a learnable geometric preconditioner, reshaping the input representation to better condition a downstream variational quantum circuit. Building on this architecture, we introduce a curriculum optimization protocol that progressively increases circuit depth and transitions from SPSA-based stochastic exploration to Adam-based gradient fine-tuning. We evaluate the approach on PDE-informed regression benchmarks and standard regression datasets under a fixed training budget in a simulator setting. Empirically, the proposed framework consistently improves over pure QNN baselines and yields more stable convergence in data-limited regimes. We further observe reduced structured errors that are visually correlated with oscillatory components on several scientific benchmarks, suggesting that geometric preconditioning combined with curriculum training is a practical approach for stabilizing quantum regression.

[768] Factorizable joint shift revisited

Dirk Tasche

Main category: cs.LG

TL;DR: The paper proposes a framework for analyzing distribution shift with general label spaces, extending factorizable joint shift (FJS) beyond categorical labels to cover both classification and regression models.

DetailsMotivation: Existing research on factorizable joint shift (FJS) has been limited to categorical labels, but real-world applications often involve continuous or general label spaces. There's a need to extend distribution shift analysis beyond classification to include regression models and general label spaces.

Method: The authors develop a theoretical framework for analyzing distribution shift in general label spaces. They generalize existing FJS results to these spaces and propose an extension of the expectation maximization (EM) algorithm for class prior probabilities. They also re-examine generalized label shift (GLS) in the context of general label spaces.

Result: The framework successfully extends distribution shift analysis to general label spaces, covering both classification and regression. The generalized EM algorithm and theoretical results provide tools for handling FJS in broader contexts beyond categorical labels.

Conclusion: The proposed framework enables more comprehensive analysis of distribution shifts across diverse machine learning tasks, bridging the gap between categorical and continuous label spaces in distribution shift research.

Abstract: Factorizable joint shift (FJS) represents a type of distribution shift (or dataset shift) that comprises both covariate and label shift. Recently, it has been observed that FJS actually arises from consecutive label and covariate (or vice versa) shifts. Research into FJS so far has been confined to the case of categorical labels. We propose a framework for analysing distribution shift in the case of a general label space, thus covering both classification and regression models. Based on the framework, we generalise existing results on FJS to general label spaces and present and analyse a related extension of the expectation maximisation (EM) algorithm for class prior probabilities. We also take a fresh look at generalized label shift (GLS) in the case of a general label space.
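
For the familiar categorical special case, the EM re-estimation of class priors from classifier posteriors looks like the sketch below (the paper's contribution is to extend this style of procedure to general label spaces, which this toy version does not attempt):

```python
# Sketch of the classical EM re-estimation of class priors from classifier posteriors
# (categorical-label special case only).
import numpy as np

def em_class_priors(posteriors, train_priors, n_iter=100, tol=1e-8):
    # posteriors: (n_samples, n_classes) predictions on the target domain,
    # produced by a classifier trained under priors `train_priors`.
    priors = train_priors.copy()
    for _ in range(n_iter):
        # E-step: reweight posteriors by the ratio of current to training priors.
        reweighted = posteriors * (priors / train_priors)
        reweighted /= reweighted.sum(axis=1, keepdims=True)
        # M-step: new priors are the average reweighted responsibilities.
        new_priors = reweighted.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:
            break
        priors = new_priors
    return priors
```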

[769] VAE with Hyperspherical Coordinates: Improving Anomaly Detection from Hypervolume-Compressed Latent Space

Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado

Main category: cs.LG

TL;DR: VAE with hyperspherical latent coordinates improves anomaly detection by addressing high-dimensional latent space issues where standard VAE latents concentrate on hypersphere equators, enabling better OOD detection on complex datasets.

DetailsMotivation: Standard VAEs struggle with anomaly detection in high-dimensional latent spaces due to exponential hypervolume growth and latent vectors concentrating on hypersphere equators, which challenges identifying out-of-distribution samples.

Method: Proposes formulating VAE latent variables using hyperspherical coordinates to compress latent vectors toward specific directions on the hypersphere, creating a more expressive approximate posterior distribution.

Result: Achieves best performance on considered datasets for both fully unsupervised and OOD anomaly detection, outperforming existing methods on Mars Rover camera landscapes, galaxy imagery, and standard benchmarks like Cifar10/ImageNet subsets.

Conclusion: Hyperspherical coordinate formulation for VAE latent variables effectively addresses high-dimensional statistical challenges and improves anomaly detection capabilities on complex real-world datasets.

Abstract: Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, one can hope to detect out-of-distribution (abnormal) latent vectors, but several issues arise when the latent space is high dimensional. This includes an exponential growth of the hypervolume with the dimension, which severely affects the generative capacity of the VAE. In this paper, we draw insights from high dimensional statistics: in these regimes, the latent vectors of a standard VAE are distributed on the “equators” of a hypersphere, challenging the detection of anomalies. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards a given direction on the hypersphere, thereby allowing for a more expressive approximate posterior. We show that this improves both the fully unsupervised and OOD anomaly detection ability of the VAE, achieving the best performance on the datasets we considered, outperforming existing methods. For the unsupervised and OOD modalities, respectively, these are: i) detecting unusual landscape from the Mars Rover camera and unusual Galaxies from ground based imagery (complex, real world datasets); ii) standard benchmarks like Cifar10 and subsets of ImageNet as the in-distribution (ID) class.

[770] Speed is Confidence

Joshua V. Dillon

Main category: cs.LG

TL;DR: Winner-take-all ensemble selection improves reasoning efficiency by acting on the first confident signal, achieving 97% accuracy on Sudoku-Extreme with 10x fewer steps than probability averaging.

DetailsMotivation: Biological neural systems must be fast but energy-constrained, using first-spike coding to express confidence. The paper aims to apply this principle to improve efficiency and accuracy in neural reasoning models.

Method: Uses ensembles of Tiny Recursive Models with halt-first selection (winner-take-all) and internalizes this as training-only cost by maintaining K=4 parallel latent states but backpropagating only through the lowest-loss “winner.”
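
A minimal PyTorch sketch of the winner-take-all training rule described above: losses are computed for K parallel branches, but gradients flow only through the per-sample minimum. The surrounding model and the way the parallel latent states are produced are placeholders, not the TRM setup from the paper.

```python
import torch

def winner_take_all_loss(losses):
    """Given per-branch losses for K parallel latent states (shape [batch, K]),
    backprop only through the lowest-loss 'winner' of each sample."""
    winner = losses.argmin(dim=-1, keepdim=True)
    return losses.gather(-1, winner).mean()

# toy usage with K=4 branches sharing one set of parameters (assumed setup)
K, d = 4, 16
model = torch.nn.Linear(d, 1)
x = torch.randn(8, d)
target = torch.randn(8, 1)
latents = [x + 0.1 * torch.randn_like(x) for _ in range(K)]   # K parallel states
losses = torch.stack(
    [torch.nn.functional.mse_loss(model(z), target, reduction="none").mean(-1)
     for z in latents], dim=-1)                                # [batch, K]
winner_take_all_loss(losses).backward()                        # only winners get gradient
```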

Result: Achieves 97% accuracy on Sudoku-Extreme vs 91% for probability averaging with 10x fewer reasoning steps. Internalized training achieves 96.9% accuracy matching ensemble performance at 1x inference cost with less than half the variance of baseline.

Conclusion: Winner-take-all selection inspired by biological first-spike coding significantly improves reasoning efficiency and accuracy; most baseline failures are selection problems, revealing a 99% accuracy ceiling.

Abstract: Biological neural systems must be fast but are energy-constrained. Evolution’s solution: act on the first signal. Winner-take-all circuits and time-to-first-spike coding implicitly treat when a neuron fires as an expression of confidence. We apply this principle to ensembles of Tiny Recursive Models (TRM) [Jolicoeur-Martineau et al., 2025]. On Sudoku-Extreme, halt-first selection achieves 97% accuracy vs. 91% for probability averaging – while requiring 10x fewer reasoning steps. A single baseline model achieves 85.5% +/- 1.3%. Can we internalize this as a training-only cost? Yes: by maintaining K=4 parallel latent states but backpropping only through the lowest-loss “winner,” we achieve 96.9% +/- 0.6% accuracy – matching ensemble performance at 1x inference cost, with less than half the variance of the baseline. A key diagnostic: 89% of baseline failures are selection problems, revealing a 99% accuracy ceiling. As in nature, this work was also resource constrained: all experiments used a single RTX 5090. A modified SwiGLU [Shazeer, 2020] made Muon [Jordan et al., 2024] and high LR viable, enabling baseline training in 48 minutes and full WTA (K=4) in 6 hours on consumer hardware.

[771] TinyTorch: Building Machine Learning Systems from First Principles

Vijay Janapa Reddi

Main category: cs.LG

TL;DR: TinyTorch is a 20-module curriculum that teaches ML systems engineering by having students implement PyTorch’s core components in pure Python, bridging the algorithm-systems divide in ML education.

DetailsMotivation: Addresses the fundamental gap in ML education where students learn algorithms without understanding the underlying systems that execute them, creating practitioners who can train models but cannot debug memory failures, optimize inference latency, or reason about deployment trade-offs.

Method: Uses “implementation-based systems pedagogy” where students construct PyTorch’s core components (tensors, autograd, optimizers, CNNs, transformers) in pure Python. Employs three design patterns: progressive disclosure of complexity, systems-first integration of profiling from the first module, and build-to-validate milestones recreating 67 years of ML breakthroughs.
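
To give a flavor of what "building the framework yourself" entails, here is a micrograd-style scalar autograd sketch of the kind of component such a curriculum asks students to implement; it is illustrative only and not taken from the TinyTorch codebase.

```python
class Value:
    """A minimal scalar with reverse-mode autodiff (addition and multiplication only)."""
    def __init__(self, data, parents=(), _backward=lambda: None):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, _backward

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order of the graph, then chain rule from the output backwards
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, w = Value(3.0), Value(-2.0)
y = x * w + x          # y = -6 + 3 = -3
y.backward()
print(x.grad, w.grad)  # dy/dx = w + 1 = -1.0, dy/dw = x = 3.0
```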

Result: Developed a complete curriculum requiring only 4GB RAM and no GPU, demonstrating that deep ML systems understanding is achievable without specialized hardware. The curriculum is available open-source and covers from Perceptron (1958) through Transformers (2017) to MLPerf-style benchmarking.

Conclusion: TinyTorch successfully bridges the algorithm-systems divide in ML education through hands-on implementation of core ML framework components, making systems-level understanding accessible without specialized hardware requirements.

Abstract: Machine learning education faces a fundamental gap: students learn algorithms without understanding the systems that execute them. They study gradient descent without measuring memory, attention mechanisms without analyzing O(N^2) scaling, optimizer theory without knowing why Adam requires 3x the memory of SGD. This “algorithm-systems divide” produces practitioners who can train models but cannot debug memory failures, optimize inference latency, or reason about deployment trade-offs–the very skills industry demands as “ML systems engineering.” We present TinyTorch, a 20-module curriculum that closes this gap through “implementation-based systems pedagogy”: students construct PyTorch’s core components (tensors, autograd, optimizers, CNNs, transformers) in pure Python, building a complete framework where every operation they invoke is code they wrote. The design employs three patterns: “progressive disclosure” of complexity, “systems-first integration” of profiling from the first module, and “build-to-validate milestones” recreating 67 years of ML breakthroughs–from Perceptron (1958) through Transformers (2017) to MLPerf-style benchmarking. Requiring only 4GB RAM and no GPU, TinyTorch demonstrates that deep ML systems understanding is achievable without specialized hardware. The curriculum is available open-source at mlsysbook.ai/tinytorch.

[772] Native LLM and MLLM Inference at Scale on Apple Silicon

Wayner Barrios

Main category: cs.LG

TL;DR: vllm-mlx is a framework for efficient LLM and multimodal model inference on Apple Silicon using MLX, featuring content-based prefix caching for vision encoding optimization.

DetailsMotivation: Existing tools for Apple Silicon either lack native optimization (PyTorch MPS) or focus only on text models, leaving multimodal workloads underserved despite growing adoption of Apple Silicon for ML development.

Method: Built natively on MLX for Apple Silicon, featuring continuous batching for text models and content-based prefix caching for multimodal models that identifies identical images through content hashing regardless of input format.
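
A minimal sketch of the content-hash idea behind the multimodal prefix cache: identical image bytes map to the same key, so the expensive vision encoding runs only once. The `encode_image` callable and the choice of hashing raw bytes (a real system might hash decoded pixel content instead) are illustrative assumptions, not the vllm-mlx API.

```python
import hashlib

class VisionPrefixCache:
    """Content-based cache for vision-encoder outputs: the same image payload
    hits the same entry regardless of file name or container."""
    def __init__(self, encode_image):
        self.encode_image = encode_image   # stands in for the model's vision tower
        self._cache = {}

    def get_vision_tokens(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()   # content hash
        if key not in self._cache:
            self._cache[key] = self.encode_image(image_bytes)  # expensive step
        return self._cache[key]

# toy usage: the second call with the same bytes is served from the cache
cache = VisionPrefixCache(encode_image=lambda b: f"tokens({len(b)} bytes)")
img = b"\x89PNG...fake image payload..."
print(cache.get_vision_tokens(img))
print(cache.get_vision_tokens(img))   # no re-encoding
```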

Result: Achieves 21-87% higher throughput than llama-cpp on text models, 4.3x aggregate throughput at 16 concurrent requests, 28x speedup on repeated image queries (reducing latency from 21.7s to <1s), and 24.7x cache speedup on video analysis with 64 frames.

Conclusion: vllm-mlx provides efficient LLM and MLLM inference on Apple Silicon with significant performance improvements, particularly for multimodal workloads through innovative caching techniques.

Abstract: The growing adoption of Apple Silicon for machine learning development has created demand for efficient inference solutions that leverage its unique unified memory architecture. However, existing tools either lack native optimization (PyTorch MPS) or focus solely on text models, leaving multimodal workloads underserved. We present vllm-mlx, a framework for efficient LLM and MLLM inference on Apple Silicon built natively on MLX. For text models, we achieve 21% to 87% higher throughput than llama-cpp across models ranging from Qwen3-0.6B to Nemotron-30B, while providing continuous batching that scales to 4.3x aggregate throughput at 16 concurrent requests. For multimodal models, we introduce content-based prefix caching that eliminates redundant vision encoding by identifying identical images through content hashing, regardless of input format. Our evaluation on Apple M4 Max demonstrates throughput of up to 525 tokens per second on text models and 28x speedup on repeated image queries, reducing multimodal latency from 21.7 seconds to under 1 second. Video analysis with up to 64 frames achieves 24.7x cache speedup. We release our implementation as open source to support efficient inference on consumer Apple hardware.

[773] Real-Time Pulsatile Flow Prediction for Realistic, Diverse Intracranial Aneurysm Morphologies using a Graph Transformer and Steady-Flow Data Augmentation

Yiying Sheng, Wenhao Ding, Dylan Roi, Leonard Leong Litt Yeo, Hwa Liang Leo, Choon Hwai Yap

Main category: cs.LG

TL;DR: Graph Transformer model predicts cardiac cycle Wall Shear Stress from aneurysm geometry using CFD-supervised learning, enabling real-time biomechanical marker computation without CFD simulations.

DetailsMotivation: CFD-derived fluid mechanical markers for intracranial aneurysms show clinical promise but are impractical due to computational expense and expertise requirements, creating need for real-time deep learning alternatives.

Method: Graph Transformer model incorporating temporal information, supervised by large CFD data, predicts WSS across cardiac cycle from IA surface meshes; uses steady-state CFD data augmentation when pulsatile data is limited.

Result: Model achieves SSIM up to 0.981 and maximum-based relative L2 error of 2.8%, effectively capturing temporal WSS variations; steady-state data augmentation substantially improves performance with small pulsatile sample sizes.

Conclusion: Deep learning can compute temporal cardiovascular fluid mechanical parameters in real-time from geometric meshes, even with limited pulsatile CFD data, with potential applicability to other cardiovascular scenarios.

Abstract: Extensive studies have suggested that fluid mechanical markers of intracranial aneurysms (IAs) derived from Computational Fluid Dynamics (CFD) can indicate disease progression risks, but to date this has not been translated clinically. This is because CFD requires specialized expertise and is time-consuming and low throughput, making it difficult to support clinical trials. A deep learning model that maps IA morphology to biomechanical markers can address this, enabling physicians to obtain these markers in real time without performing CFD. Here, we show that a Graph Transformer model that incorporates temporal information, which is supervised by large CFD data, can accurately predict Wall Shear Stress (WSS) across the cardiac cycle from IA surface meshes. The model effectively captures the temporal variations of the WSS pattern, achieving a Structural Similarity Index (SSIM) of up to 0.981 and a maximum-based relative L2 error of 2.8%. Ablation studies and SOTA comparison confirmed its optimality. Further, as pulsatile CFD data is computationally expensive to generate and sample sizes are limited, we engaged a strategy of injecting a large amount of steady-state CFD data, which are extremely low-cost to generate, as augmentation. This approach enhances network performance substantially when pulsatile CFD data sample size is small. Our study provides a proof of concept that temporal sequences of cardiovascular fluid mechanical parameters can be computed in real time using a deep learning model from the geometric mesh, and this is achievable even with small pulsatile CFD sample size. Our approach is likely applicable to other cardiovascular scenarios.

[774] Modeling Cascaded Delay Feedback for Online Net Conversion Rate Prediction: Benchmark, Insights and Solutions

Mingxuan Luo, Guipeng Xv, Sishuo Chen, Xinyu Li, Li Zhang, Zhangming Chan, Xiang-Rong Sheng, Han Zhu, Jian Xu, Bo Zheng, Chen Lin

Main category: cs.LG

TL;DR: A new dataset and framework for predicting net conversion rate (NetCVR) that accounts for both purchases and refunds in recommender systems, addressing cascaded delayed feedback challenges.

DetailsMotivation: Traditional conversion rate (CVR) prediction ignores refund behavior, failing to capture true user satisfaction and business value. NetCVR (purchase without refund) involves complex multi-stage cascaded delayed feedback with opposite effects, requiring new modeling approaches.

Method: Introduces CASCADE dataset from Taobao for online continuous NetCVR prediction, and TESLA framework with CVR-refund-rate cascaded architecture, stage-wise debiasing, and delay-time-aware ranking loss.
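
A minimal PyTorch sketch of the cascaded decomposition NetCVR = p(convert | click) x (1 - p(refund | convert)); the shared backbone and head sizes are placeholders, and the debiasing and delay-time-aware ranking components of TESLA are not shown.

```python
import torch
import torch.nn as nn

class CascadedNetCVR(nn.Module):
    """Cascade of a CVR head and a refund-rate head; NetCVR is their product."""
    def __init__(self, d_in, d_hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.cvr_head = nn.Linear(d_hidden, 1)      # p(convert | click)
        self.refund_head = nn.Linear(d_hidden, 1)   # p(refund | convert)

    def forward(self, x):
        h = self.backbone(x)
        p_cvr = torch.sigmoid(self.cvr_head(h))
        p_refund = torch.sigmoid(self.refund_head(h))
        net_cvr = p_cvr * (1.0 - p_refund)           # purchased and not refunded
        return net_cvr, p_cvr, p_refund

model = CascadedNetCVR(d_in=32)
net_cvr, p_cvr, p_refund = model(torch.randn(4, 32))
print(net_cvr.shape)   # torch.Size([4, 1])
```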

Result: TESLA outperforms state-of-the-art methods on CASCADE dataset, achieving 12.41% improvement in RI-AUC and 14.94% improvement in RI-PRAUC for NetCVR prediction.

Conclusion: NetCVR prediction requires specialized modeling for cascaded delayed feedback, with temporal dynamics and delay time being crucial features. The proposed framework effectively addresses these challenges.

Abstract: In industrial recommender systems, conversion rate (CVR) is widely used for traffic allocation, but it fails to fully reflect recommendation effectiveness because it ignores refund behavior. To better capture true user satisfaction and business value, net conversion rate (NetCVR), defined as the probability that a clicked item is purchased and not refunded, has been proposed. Unlike CVR, NetCVR prediction involves a more complex multi-stage cascaded delayed feedback process. The two cascaded delays from click to conversion and from conversion to refund have opposite effects, making traditional CVR modeling methods inapplicable. Moreover, the lack of open-source datasets and online continuous training schemes further hinders progress in this area. To address these challenges, we introduce CASCADE (Cascaded Sequences of Conversion and Delayed Refund), the first large-scale open dataset derived from the Taobao app for online continuous NetCVR prediction. Through an in-depth analysis of CASCADE, we identify three key insights: (1) NetCVR exhibits strong temporal dynamics, necessitating online continuous modeling; (2) cascaded modeling of CVR and refund rate outperforms direct NetCVR modeling; and (3) delay time, which correlates with both CVR and refund rate, is an important feature for NetCVR prediction. Based on these insights, we propose TESLA, a continuous NetCVR modeling framework featuring a CVR-refund-rate cascaded architecture, stage-wise debiasing, and a delay-time-aware ranking loss. Extensive experiments demonstrate that TESLA consistently outperforms state-of-the-art methods on CASCADE, achieving absolute improvements of 12.41 percent in RI-AUC and 14.94 percent in RI-PRAUC on NetCVR prediction. The code and dataset are publicly available at https://github.com/alimama-tech/NetCVR.

[775] C2: Cross learning module enhanced decision transformer with Constraint-aware loss for auto-bidding

Jinren Ding, Xuejian Xu, Shen Jiang, Zhitong Hao, Jinhui Yang, Peng Jiang

Main category: cs.LG

TL;DR: C2 enhances Decision Transformer for auto-bidding with cross-attention for inter-sequence correlation and constraint-aware loss for selective optimal trajectory learning.

DetailsMotivation: Decision Transformer has limitations in cross-correlation modeling among state, action, and return-to-go sequences, and indiscriminate learning of both optimal and suboptimal behaviors in auto-bidding applications.

Method: Proposes C2 framework with two innovations: (1) Cross Learning Block using cross-attention to strengthen inter-sequence correlation modeling, (2) Constraint-aware Loss incorporating budget and CPA constraints for selective learning of optimal trajectories.

Result: Extensive offline evaluations on AuctionNet dataset show consistent performance gains (up to 3.2% over state-of-the-art) across diverse budget settings; ablation studies confirm complementary synergy of the two components.

Conclusion: C2 demonstrates superiority in auto-bidding by addressing Decision Transformer’s limitations through enhanced cross-correlation modeling and selective learning of optimal behaviors.

Abstract: Decision Transformer (DT) shows promise for generative auto-bidding by capturing temporal dependencies, but suffers from two critical limitations: insufficient cross-correlation modeling among state, action, and return-to-go (RTG) sequences, and indiscriminate learning of optimal/suboptimal behaviors. To address these, we propose C2, a novel framework enhancing DT with two core innovations: (1) a Cross Learning Block (CLB) via cross-attention to strengthen inter-sequence correlation modeling; (2) a Constraint-aware Loss (CL) incorporating budget and Cost-Per-Acquisition (CPA) constraints for selective learning of optimal trajectories. Extensive offline evaluations on the AuctionNet dataset demonstrate consistent performance gains (up to 3.2% over state-of-the-art method) across diverse budget settings; ablation studies verify the complementary synergy of CLB and CL, confirming C2’s superiority in auto-bidding. The code for reproducing our results is available at: https://github.com/Dingjinren/C2.

[776] Concept Component Analysis: A Principled Approach for Concept Extraction in LLMs

Yuhang Liu, Erdun Gao, Dong Gong, Anton van den Hengel, Javen Qinfeng Shi

Main category: cs.LG

TL;DR: ConCA is a theory-backed framework for extracting interpretable concepts from LLM representations using linear unmixing of log-posteriors, addressing theoretical ambiguities in existing sparse autoencoder methods.

DetailsMotivation: Current sparse autoencoder (SAE) methods for LLM interpretability lack theoretical grounding, creating ambiguity in concept extraction and evaluation. There's a need for principled approaches that establish clear correspondence between LLM representations and human-interpretable concepts.

Method: Concept Component Analysis (ConCA) treats concepts as latent variables and approximates LLM representations as linear mixtures of log-posteriors over concepts. Sparse ConCA variants incorporate sparsity priors to solve the ill-posed unmixing problem, enabling unsupervised extraction of meaningful concepts from LLM activations.
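
A generic dictionary-learning sketch of the unsupervised sparse linear unmixing step, using scikit-learn on toy data; it illustrates the kind of decomposition ConCA performs but is not one of the paper's 12 variants.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Treat each hidden state as a sparse linear combination of learned "concept"
# directions: H is approximated by codes @ unmixer.components_, with sparse
# rows of `codes` acting as per-token concept activations.
rng = np.random.default_rng(0)
H = rng.standard_normal((200, 32))      # 200 toy "LLM hidden states", dim 32

unmixer = DictionaryLearning(n_components=64, alpha=1.0, max_iter=50,
                             transform_algorithm="lasso_lars", random_state=0)
codes = unmixer.fit_transform(H)        # sparse per-sample concept activations
concepts = unmixer.components_          # one direction per candidate concept

print(codes.shape, concepts.shape)      # (200, 64) (64, 32)
```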

Result: 12 sparse ConCA variants were implemented and demonstrated effective concept extraction across multiple LLMs, showing theory-backed advantages over traditional SAEs in extracting meaningful interpretable concepts.

Conclusion: ConCA provides a principled theoretical framework for LLM interpretability that addresses fundamental ambiguities in existing methods, offering a more grounded approach to extracting human-understandable concepts from model representations.

Abstract: Developing human-understandable interpretation of large language models (LLMs) becomes increasingly critical for their deployment in essential domains. Mechanistic interpretability seeks to mitigate these issues by extracting human-interpretable processes and concepts from LLMs’ activations. Sparse autoencoders (SAEs) have emerged as a popular approach for extracting interpretable and monosemantic concepts by decomposing the LLM internal representations into a dictionary. Despite their empirical progress, SAEs suffer from a fundamental theoretical ambiguity: the well-defined correspondence between LLM representations and human-interpretable concepts remains unclear. This lack of theoretical grounding gives rise to several methodological challenges, including difficulties in principled method design and evaluation criteria. In this work, we show that, under mild assumptions, LLM representations can be approximated as a linear mixture of the log-posteriors over concepts given the input context, through the lens of a latent variable model where concepts are treated as latent variables. This motivates a principled framework for concept extraction, namely Concept Component Analysis (ConCA), which aims to recover the log-posterior of each concept from LLM representations through an unsupervised linear unmixing process. We explore a specific variant, termed sparse ConCA, which leverages a sparsity prior to address the inherent ill-posedness of the unmixing problem. We implement 12 sparse ConCA variants and demonstrate their ability to extract meaningful concepts across multiple LLMs, offering theory-backed advantages over SAEs.

cs.MA

[777] AI-Augmented Density-Driven Optimal Control (D2OC) for Decentralized Environmental Mapping

Kooktae Lee, Julian Martinez

Main category: cs.MA

TL;DR: AI-augmented decentralized framework for multi-robot environmental mapping using optimal transport and dual MLPs to refine density estimates under limited sensing/communication.

DetailsMotivation: Conventional multi-agent coverage formulations perform poorly when accurate reference maps are unavailable or priors are uncertain/biased, requiring adaptive mechanisms for robust environmental mapping.

Method: Optimal transport-based framework with adaptive self-correcting mechanism; dual MLP module infers local mean-variance statistics and regulates virtual uncertainty for long-unvisited regions to prevent local minima stagnation.

Result: Theoretical convergence proven under Wasserstein metric; simulations show proposed method achieves robust alignment with ground-truth density and substantially higher-fidelity reconstruction of complex multi-modal spatial distributions compared to decentralized baselines.

Conclusion: AI-augmented Density-Driven Optimal Control framework enables consistent, scalable multi-agent environmental mapping with theoretical guarantees and practical performance improvements.

Abstract: This paper presents an AI-augmented decentralized framework for multi-agent (multi-robot) environmental mapping under limited sensing and communication. While conventional coverage formulations achieve effective spatial allocation when an accurate reference map is available, their performance deteriorates under uncertain or biased priors. The proposed method introduces an adaptive and self-correcting mechanism that enables agents to iteratively refine local density estimates within an optimal transport-based framework, ensuring theoretical consistency and scalability. A dual multilayer perceptron (MLP) module enhances adaptivity by inferring local mean-variance statistics and regulating virtual uncertainty for long-unvisited regions, mitigating stagnation around local minima. Theoretical analysis rigorously proves convergence under the Wasserstein metric, while simulation results demonstrate that the proposed AI-augmented Density-Driven Optimal Control consistently achieves robust and precise alignment with the ground-truth density, yielding substantially higher-fidelity reconstruction of complex multi-modal spatial distributions compared with conventional decentralized baselines.

[778] Mean-Field Control on Sparse Graphs: From Local Limits to GNNs via Neighborhood Distributions

Tobias Schmidt, Kai Cui

Main category: cs.MA

TL;DR: A framework for mean-field control on sparse graphs using decorated rooted neighborhoods and graph neural networks, enabling scalable reinforcement learning with proven locality properties.

DetailsMotivation: Traditional mean-field control assumes dense, all-to-all interactions (exchangeability), which is restrictive for real-world network structures that are often sparse. There's a need to bridge this gap to handle complex sparse topologies while maintaining scalability.

Method: Proposes a framework for MFC on large sparse graphs by redefining system state as probability measure over decorated rooted neighborhoods. Introduces horizon-dependent locality principle: optimal policy depends on (T-t)-hop neighborhood. Develops Dynamic Programming Principle on lifted space of neighborhood distributions and uses Graph Neural Networks for actor-critic algorithms.

Result: Proves horizon-dependent locality property, making infinite-dimensional control problem tractable. Theoretically justifies GNNs for actor-critic algorithms in this context. Framework recovers classical MFC as degenerate case while enabling efficient control on complex sparse topologies.

Conclusion: Provides rigorous theoretical foundation for scalable reinforcement learning on sparse graphs, bridging the gap between traditional MFC and real-world network structures through neighborhood-based representations and GNN architectures.

Abstract: Mean-field control (MFC) offers a scalable solution to the curse of dimensionality in multi-agent systems but traditionally hinges on the restrictive assumption of exchangeability via dense, all-to-all interactions. In this work, we bridge the gap to real-world network structures by proposing a rigorous framework for MFC on large sparse graphs. We redefine the system state as a probability measure over decorated rooted neighborhoods, effectively capturing local heterogeneity. Our central contribution is a theoretical foundation for scalable reinforcement learning in this setting. We prove horizon-dependent locality: for finite-horizon problems, an agent’s optimal policy at time t depends strictly on its (T-t)-hop neighborhood. This result renders the infinite-dimensional control problem tractable and underpins a novel Dynamic Programming Principle (DPP) on the lifted space of neighborhood distributions. Furthermore, we formally and experimentally justify the use of Graph Neural Networks (GNNs) for actor-critic algorithms in this context. Our framework naturally recovers classical MFC as a degenerate case while enabling efficient, theoretically grounded control on complex sparse topologies.

[779] Learning to Communicate Across Modalities: Perceptual Heterogeneity in Multi-Agent Systems

Naomi Pitzer, Daniela Mihai

Main category: cs.MA

TL;DR: Emergent communication among heterogeneous multimodal agents shows that they develop class-consistent messages despite perceptual misalignment; unimodal systems communicate more efficiently, while multimodal systems require more information exchange and exhibit distributional rather than compositional encoding.

DetailsMotivation: Most emergent communication research assumes homogeneous modalities or aligned representational spaces, overlooking real-world perceptual heterogeneity. The paper aims to study how agents with different modalities and lack of perceptual grounding develop shared structured representations.

Method: The authors use a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. They analyze convergence patterns, communication efficiency, bit perturbation experiments to study encoding mechanisms, and interoperability analyses across different perceptual worlds.

Result: Multimodal systems converge to class-consistent messages grounded in perceptual input despite perceptual misalignment. Unimodal systems communicate more efficiently (fewer bits, lower classification entropy), while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation shows meaning is encoded distributionally rather than compositionally. Cross-system communication requires fine-tuning.

Conclusion: Emergent communication serves as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, with implications for understanding multimodal representation learning and cross-modal transfer in AI systems.

Abstract: Emergent communication offers insight into how agents develop shared structured representations, yet most research assumes homogeneous modalities or aligned representational spaces, overlooking the perceptual heterogeneity of real-world settings. We study a heterogeneous multi-step binary communication game where agents differ in modality and lack perceptual grounding. Despite perceptual misalignment, multimodal systems converge to class-consistent messages grounded in perceptual input. Unimodal systems communicate more efficiently, using fewer bits and achieving lower classification entropy, while multimodal agents require greater information exchange and exhibit higher uncertainty. Bit perturbation experiments provide strong evidence that meaning is encoded in a distributional rather than compositional manner, as each bit’s contribution depends on its surrounding pattern. Finally, interoperability analyses show that systems trained in different perceptual worlds fail to directly communicate, but limited fine-tuning enables successful cross-system communication. This work positions emergent communication as a framework for studying how agents adapt and transfer representations across heterogeneous modalities, opening new directions for both theory and experimentation.

[780] Empowering Scientific Workflows with Federated Agents

Alok Kamatar, J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Mansi Sakarvadia, Daniel Babnigg, Kyle Chard, Ian Foster

Main category: cs.MA

TL;DR: Academy is a middleware framework for deploying autonomous agents across federated scientific cyberinfrastructure, supporting heterogeneous resources and high-throughput workflows for scientific applications.

DetailsMotivation: Existing agentic frameworks are too narrow, centralized, and focused on conversational/cloud applications, while scientific applications require agents deployed across diverse cyberinfrastructure like HPC systems, experimental facilities, and data repositories.

Method: Academy is a modular, extensible middleware that provides abstractions for stateful agents, manages inter-agent coordination, supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability across federated research ecosystems.

Result: Microbenchmarks demonstrate high performance and scalability in HPC environments. Case studies show successful applications in materials discovery, astronomy, decentralized learning, and information extraction with agents deployed across diverse HPC systems.

Conclusion: Academy enables agentic workflow designs for scientific computing by providing a framework that meets the unique demands of deploying autonomous agents across heterogeneous, federated research infrastructure.

Abstract: Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, existing agentic frameworks take a relatively narrow view of agents, apply a centralized model, and target conversational, cloud-native applications (e.g., LLM-based AI chatbots). In contrast, scientific applications require myriad agents be deployed and managed across diverse cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To explore the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, astronomy, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.

[781] Agents Trusting Agents? Restoring Lost Capabilities with Inclusive Healthcare

Alba Aguilera, Georgina Curto, Nardine Osman, Ahmed Al-Awah

Main category: cs.MA

TL;DR: This paper develops an agent-based simulation using reinforcement learning and Bayesian inverse reinforcement learning to model healthcare policies for people experiencing homelessness in Barcelona, focusing on building trust between social workers and homeless individuals.

DetailsMotivation: The paper addresses the need to evaluate social policies for improving healthcare equity for homeless populations before real-world implementation. Non-profit and governmental organizations requested evaluation of policies to reduce health inequity for people experiencing homelessness in Barcelona.

Method: The authors integrate the capability approach framework to model human well-being, create a reinforcement learning environment where agents aim to restore their central human capabilities, and use Bayesian inverse reinforcement learning to calibrate behavioral parameters related to trust and engagement between homeless individuals and social workers.

Result: The results demonstrate a path to mitigate health inequity by building relationships of trust between social service workers and people experiencing homelessness through the agent-based simulation framework.

Conclusion: Agent-based simulations with reinforcement learning and inverse reinforcement learning can effectively evaluate social policies for vulnerable populations before real-world implementation, particularly for building trust-based interventions in healthcare equity.

Abstract: Agent-based simulations have an untapped potential to inform social policies on urgent human development challenges in a non-invasive way, before these are implemented in real-world populations. This paper responds to the request from non-profit and governmental organizations to evaluate policies under discussion to improve equity in health care services for people experiencing homelessness (PEH) in the city of Barcelona. With this goal, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to model and evaluate the behaviour of agents who represent PEH and social workers. We define a reinforcement learning environment where agents aim to restore their central human capabilities, under existing environmental and legal constraints. We use Bayesian inverse reinforcement learning (IRL) to calibrate profile-dependent behavioural parameters in PEH agents, modeling the degree of trust and engagement with social workers, which is reportedly a key element for the success of the policies in scope. Our results open a path to mitigate health inequity by building relationships of trust between social service workers and PEH.

cs.MM

[782] HADUA: Hierarchical Attention and Dynamic Uniform Alignment for Robust Cross-Subject Emotion Recognition

Jiahao Tang, Youjun Li, Yangxuan Zheng, Xiangting Fan, Siyuan Lu, Nuo Zhang, Zi-Gang Huang

Main category: cs.MM

TL;DR: HADUA framework for robust cross-subject emotion recognition using multimodal physiological signals (EEG and eye movement) with hierarchical attention and domain adaptation techniques.

DetailsMotivation: Cross-subject emotion recognition from multimodal physiological signals faces challenges due to modality heterogeneity (different signal types) and inter-subject distribution shift (different data distributions across individuals). Existing methods struggle with these issues, requiring a more robust approach.

Method: Proposes HADUA framework with three key components: 1) Hierarchical attention module for intra-modal temporal dynamics and inter-modal semantic interactions between EEG and eye movement signals, 2) Confidence-aware Gaussian weighting to handle noisy pseudo-labels during adaptation, 3) Uniform alignment loss to regularize pseudo-label distribution across classes and mitigate imbalance.
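
A hedged PyTorch sketch of the confidence-aware weighting idea: each target sample's pseudo-label loss is scaled by a Gaussian of its prediction uncertainty, so low-confidence samples contribute less. The exact form of HADUA's weighting and its interaction with the alignment losses are not reproduced here.

```python
import torch
import torch.nn.functional as F

def gaussian_weighted_pseudo_loss(logits, pseudo_labels, sigma=0.2):
    """Down-weight uncertain target-domain samples with a Gaussian of (1 - confidence)."""
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values                     # in (0, 1]
    weights = torch.exp(-((1.0 - confidence) ** 2) / (2 * sigma ** 2))
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    return (weights.detach() * per_sample).mean()

logits = torch.randn(16, 4)          # toy target-domain predictions, 4 emotion classes
pseudo = logits.argmax(dim=-1)       # pseudo-labels from the current model
print(gaussian_weighted_pseudo_loss(logits, pseudo))
```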

Result: Extensive experiments on multiple cross-subject emotion recognition benchmarks show HADUA consistently surpasses existing state-of-the-art methods in both accuracy and robustness, effectively handling modality gaps, noisy pseudo-labels, and class imbalance.

Conclusion: HADUA offers a practical and generalizable solution for building robust cross-subject affective computing systems by unifying multimodal representation learning with domain adaptation techniques.

Abstract: Robust cross-subject emotion recognition from multimodal physiological signals remains a challenging problem, primarily due to modality heterogeneity and inter-subject distribution shift. To tackle these challenges, we propose a novel adaptive learning framework named Hierarchical Attention and Dynamic Uniform Alignment (HADUA). Our approach unifies the learning of multimodal representations with domain adaptation. First, we design a hierarchical attention module that explicitly models intra-modal temporal dynamics and inter-modal semantic interactions (e.g., between electroencephalogram (EEG) and eye movement (EM)), yielding discriminative and semantically coherent fused features. Second, to overcome the noise inherent in pseudo-labels during adaptation, we introduce a confidence-aware Gaussian weighting scheme that smooths the supervision from target-domain samples by down-weighting uncertain instances. Third, a uniform alignment loss is employed to regularize the distribution of pseudo-labels across classes, thereby mitigating imbalance and stabilizing conditional distribution matching. Extensive experiments on multiple cross-subject emotion recognition benchmarks show that HADUA consistently surpasses existing state-of-the-art methods in both accuracy and robustness, validating its effectiveness in handling modality gaps, noisy pseudo-labels, and class imbalance. Taken together, these contributions offer a practical and generalizable solution for building robust cross-subject affective computing systems.

[783] Rethinking Fusion: Disentangled Learning of Shared and Modality-Specific Information for Stance Detection

Zhiyu Xie, Fuqiang Niu, Genan Dai, Qianlong Wang, Li Dong, Bowen Zhang, Hu Huang

Main category: cs.MM

TL;DR: DiME: A disentangled multi-modal architecture for stance detection that separates textual-dominant, visual-dominant, and cross-modal shared components using specialized expert modules and adaptive fusion.

DetailsMotivation: Current multi-modal stance detection methods fail to distinguish between modality-specific signals and cross-modal evidence, leading to suboptimal performance. There's a need for explicit separation of stance information across different modalities.

Method: DiME uses target-aware Chain-of-Thought prompts for reasoning-guided textual input, dual encoders for modality features, three expert modules (textual-dominant, visual-dominant, cross-modal shared) with specialized loss functions (contrastive learning for modality-specific, cosine alignment for shared representation), and a gating network for adaptive fusion.
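
A minimal PyTorch sketch of the adaptive gating over the three expert representations; dimensions, the gating form, and the classifier head are illustrative assumptions rather than the DiME implementation.

```python
import torch
import torch.nn as nn

class GatedExpertFusion(nn.Module):
    """Softmax gate over textual-dominant, visual-dominant, and shared expert features."""
    def __init__(self, d, n_classes):
        super().__init__()
        self.gate = nn.Linear(3 * d, 3)
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, h_text, h_visual, h_shared):
        experts = torch.stack([h_text, h_visual, h_shared], dim=1)          # [B, 3, d]
        gate_logits = self.gate(torch.cat([h_text, h_visual, h_shared], dim=-1))
        weights = torch.softmax(gate_logits, dim=-1).unsqueeze(-1)          # [B, 3, 1]
        fused = (weights * experts).sum(dim=1)                              # [B, d]
        return self.classifier(fused)

model = GatedExpertFusion(d=256, n_classes=3)   # e.g., favor / against / neutral
out = model(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
print(out.shape)   # torch.Size([4, 3])
```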

Result: Experiments on four benchmark datasets show DiME consistently outperforms strong unimodal and multi-modal baselines under both in-target and zero-shot settings.

Conclusion: Explicitly disentangling modality-specific and cross-modal stance information through specialized expert modules with adaptive fusion significantly improves multi-modal stance detection performance.

Abstract: Multi-modal stance detection (MSD) aims to determine an author’s stance toward a given target using both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish between modality-specific signals and cross-modal evidence, leading to suboptimal performance. We propose DiME (Disentangled Multi-modal Experts), a novel architecture that explicitly separates stance information into textual-dominant, visual-dominant, and cross-modal shared components. DiME first uses a target-aware Chain-of-Thought prompt to generate reasoning-guided textual input. Then, dual encoders extract modality features, which are processed by three expert modules with specialized loss functions: contrastive learning for modality-specific experts and cosine alignment for shared representation learning. A gating network adaptively fuses expert outputs for final prediction. Experiments on four benchmark datasets show that DiME consistently outperforms strong unimodal and multi-modal baselines under both in-target and zero-shot settings.

[784] MIDI-LLaMA: An Instruction-Following Multimodal LLM for Symbolic Music Understanding

Meng Yang, Jon McCormack, Maria Teresa Llano, Wanchao Su, Chao Lei

Main category: cs.MM

TL;DR: MIDI-LLaMA is the first instruction-following multimodal LLM for symbolic music understanding, aligning MusicBERT MIDI encoder with Llama-3-8B through feature alignment and instruction tuning on a newly annotated MIDI-text dataset.

DetailsMotivation: While multimodal LLMs have advanced in audio music understanding, symbolic music (fundamental representation of musical structure) remains unexplored. The paper aims to bridge this gap by developing the first MLLM specifically for symbolic music understanding.

Method: Two-stage pipeline: 1) Feature alignment between MIDI encoder (MusicBERT) and Llama-3-8B, 2) Instruction tuning. Created scalable annotation pipeline to annotate GiantMIDI-Piano with fine-grained metadata, resulting in MIDI-text dataset for training.

Result: MIDI-LLaMA substantially outperforms baseline (MIDI-to-ABC notation) in captioning and semantic alignment in QA. Human evaluation confirms advantages in music understanding, emotion recognition, creativity, and overall preference.

Conclusion: Incorporating symbolic music into large language models enhances their capacity for musical understanding, demonstrating the value of symbolic representations for MLLMs in music domains.

Abstract: Recent advances in multimodal large language models (MLLM) for audio music have demonstrated strong capabilities in music understanding, yet symbolic music, a fundamental representation of musical structure, remains unexplored. In this work, we introduce MIDI-LLaMA, the first instruction-following MLLM for symbolic music understanding. Our approach aligns the MIDI encoder MusicBERT and Llama-3-8B via a two-stage pipeline comprising feature alignment and instruction tuning. To support training, we design a scalable annotation pipeline that annotates GiantMIDI-Piano with fine-grained metadata, resulting in a MIDI-text dataset. Compared with the baseline trained on converting MIDI into ABC notation under the same instruction-tuning procedure, MIDI-LLaMA substantially outperforms in captioning and semantic alignment in question answering. Human evaluation further confirms the advantages of MIDI-LLaMA in music understanding, emotion recognition, creativity, and overall preference. These findings demonstrate that incorporating symbolic music into large language models enhances their capacity for musical understanding.

eess.AS

[785] Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection

Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Srikanth Madikeri, Andrés Carofilis, Pradeep Rangappa, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke

Main category: eess.AS

TL;DR: LLM-based ASR systems use fixed prompts, but prompt choice significantly impacts performance. The paper proposes a prompt projector module that learns to optimize prompt embeddings, improving ASR performance and reducing variability across datasets.

DetailsMotivation: Current LLM-based ASR systems use fixed, manually defined prompts during training and inference. While this enables practical applicability, the impact of prompt design is underexplored, and prompt choice significantly affects ASR performance and introduces instability across different datasets.

Method: The paper proposes a prompt projector module - a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space. This module is added to existing LLM-based ASR architectures without modifying the underlying model, working alongside the speech-to-LLM projector.
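
A minimal PyTorch sketch of a learnable prompt projector applied to the embedded prompt tokens before they are concatenated with the speech features; the residual connection and layer sizes here are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class PromptProjector(nn.Module):
    """Small MLP that re-maps fixed prompt embeddings into a more effective
    region of the frozen LLM's input space."""
    def __init__(self, d_model, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))

    def forward(self, prompt_embeds):                  # [B, T_prompt, d_model]
        # residual keeps the projected prompt close to the original embedding
        return prompt_embeds + self.net(prompt_embeds)

proj = PromptProjector(d_model=4096)
prompt = torch.randn(2, 12, 4096)    # embedded fixed prompt tokens
speech = torch.randn(2, 80, 4096)    # output of the speech-to-LLM projector
llm_inputs = torch.cat([proj(prompt), speech], dim=1)
print(llm_inputs.shape)              # torch.Size([2, 92, 4096])
```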

Result: Experiments on four datasets show that adding a prompt projector consistently improves ASR performance, reduces variability across different prompts, and outperforms the best manually selected prompts.

Conclusion: The prompt projector module effectively addresses prompt sensitivity in LLM-based ASR systems, providing a simple yet powerful way to optimize prompt embeddings and improve model performance across diverse datasets.

Abstract: LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup not only enables applicability across a range of practical scenarios, but also helps maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.

[786] Unseen but not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models

Jaden Pieper, Stephen D. Voran

Main category: eess.AS

TL;DR: Dataset Concealment (DSC) is a new evaluation framework for speech quality estimation models that quantifies the gap between research results and real-world performance, with an Aligner component that improves generalization across datasets.

DetailsMotivation: Current speech quality estimation models often show performance gaps between research evaluations and real-world applications. There's a need for better evaluation procedures that can quantify these gaps, provide interpretable insights into model behavior, and address dataset-specific biases (corpus effects) that limit generalization.

Method: Proposes Dataset Concealment (DSC) procedure that: 1) Quantifies and decomposes performance gaps between research and real-world scenarios, 2) Uses a dataset Aligner from AlignNet to address corpus effects when training with multiple datasets, 3) Evaluates on nine training datasets and nine unseen datasets using three models (MOSNet, NISQA, Wav2Vec2.0-based model).
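
A hedged sketch of a tiny dataset-conditioned aligner (on the order of a thousand parameters) that absorbs corpus-specific rating offsets so the large backbone does not have to; this illustrates the idea behind the AlignNet Aligner rather than its exact architecture.

```python
import torch
import torch.nn as nn

class DatasetAligner(nn.Module):
    """Per-dataset correction applied to the backbone's raw quality estimate."""
    def __init__(self, n_datasets, d_hidden=16):
        super().__init__()
        self.dataset_embed = nn.Embedding(n_datasets, d_hidden)
        self.correct = nn.Sequential(nn.Linear(d_hidden + 1, d_hidden),
                                     nn.Tanh(), nn.Linear(d_hidden, 1))

    def forward(self, raw_score, dataset_id):
        e = self.dataset_embed(dataset_id)                     # [B, d_hidden]
        inp = torch.cat([raw_score.unsqueeze(-1), e], dim=-1)  # score + corpus identity
        return raw_score + self.correct(inp).squeeze(-1)       # aligned score

aligner = DatasetAligner(n_datasets=9)
scores = torch.rand(8) * 4 + 1                # raw MOS-like estimates in [1, 5]
ids = torch.randint(0, 9, (8,))               # which training corpus each clip came from
print(aligner(scores, ids).shape)             # torch.Size([8])
```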

Result: DSC provides interpretable views of model generalization capabilities and limitations. Adding a small 1000-parameter dataset Aligner to the 94M-parameter Wav2Vec model significantly improves its ability to estimate speech quality for unseen data. The framework allows using all available data during training while improving generalization.

Conclusion: DSC offers a rigorous evaluation framework for speech quality estimation models that bridges the gap between research and real-world applications, with practical improvements through dataset alignment techniques that enhance model generalization across diverse datasets.

Abstract: We introduce Dataset Concealment (DSC), a rigorous new procedure for evaluating and interpreting objective speech quality estimation models. DSC quantifies and decomposes the performance gap between research results and real-world application requirements, while offering context and additional insights into model behavior and dataset characteristics. We also show the benefits of addressing the corpus effect by using the dataset Aligner from AlignNet when training models with multiple datasets. We demonstrate DSC and the improvements from the Aligner using nine training datasets and nine unseen datasets with three well-studied models: MOSNet, NISQA, and a Wav2Vec2.0-based model. DSC provides interpretable views of the generalization capabilities and limitations of models, while allowing all available data to be used at training. An additional result is that adding the 1000 parameter dataset Aligner to the 94 million parameter Wav2Vec model during training does significantly improve the resulting model’s ability to estimate speech quality for unseen data.

[787] DNN-Based Online Source Counting Based on Spatial Generalized Magnitude Squared Coherence

Henri Gode, Simon Doclo

Main category: eess.AS

TL;DR: A neural network-based method for online sound source counting using spatial coherence features and change detection in binaural hearing aid scenarios.

DetailsMotivation: Accurate source counting is crucial for many audio processing tasks like source localization, separation, and speech enhancement, but existing methods often lack real-time capability or robustness in complex acoustic environments.

Method: The method uses spatial coherence features (generalized magnitude-squared coherence) after spatial whitening to detect changes in active source count. A compact neural network is trained to perform frame-wise change detection by identifying when spatial coherence patterns shift due to source count changes.
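
A small SciPy example of the magnitude-squared coherence the feature is built on: two microphones observing a common source give high coherence, while independent noise gives low coherence. The paper's generalized MSC across all microphones after spatial whitening is not reproduced here.

```python
import numpy as np
from scipy.signal import coherence

fs = 16000
source = np.random.randn(fs)                              # broadband toy source
mic1 = source + 0.05 * np.random.randn(fs)
mic2 = np.roll(source, 8) + 0.05 * np.random.randn(fs)    # delayed copy + sensor noise

f, msc = coherence(mic1, mic2, fs=fs, nperseg=512)        # MSC in [0, 1] per frequency
print(f"mean MSC with an active coherent source: {msc.mean():.2f}")   # close to 1

noise1, noise2 = np.random.randn(fs), np.random.randn(fs)
_, msc_noise = coherence(noise1, noise2, fs=fs, nperseg=512)
print(f"mean MSC with spatially white noise only: {msc_noise.mean():.2f}")  # close to 0
```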

Result: Simulation results with binaural hearing aids in reverberant scenes with up to 4 speakers and background noise demonstrate effective online source counting performance.

Conclusion: The proposed spatial coherence-based change detection approach provides an effective solution for real-time source counting in complex acoustic environments, particularly suitable for hearing aid applications.

Abstract: The number of active sound sources is a key parameter in many acoustic signal processing tasks, such as source localization, source separation, and multi-microphone speech enhancement. This paper proposes a novel method for online source counting by detecting changes in the number of active sources based on spatial coherence. The proposed method exploits the fact that a single coherent source in spatially white background noise yields high spatial coherence, whereas only noise results in low spatial coherence. By applying a spatial whitening operation, the source counting problem is reformulated as a change detection task, aiming to identify the time frames when the number of active sources changes. The method leverages the generalized magnitude-squared coherence as a measure to quantify spatial coherence, providing features for a compact neural network trained to detect source count changes framewise. Simulation results with binaural hearing aids in reverberant acoustic scenes with up to 4 speakers and background noise demonstrate the effectiveness of the proposed method for online source counting.

[788] Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER

Xiuwen Zheng, Sixun Dong, Bornali Phukon, Mark Hasegawa-Johnson, Chang D. Yoo

Main category: eess.AS

TL;DR: LLM-based agent for post-ASR correction of dysarthric speech that improves both WER and semantic fidelity through a Judge-Editor approach over top-k ASR hypotheses.

DetailsMotivation: Traditional ASR benchmarking with WER doesn't capture semantic fidelity, which is crucial for real-world applications, especially for dysarthric speech where articulatory issues cause severe semantic distortions.

Method: Introduces an LLM-based agent with Judge-Editor architecture that operates on top-k ASR hypotheses, keeping high-confidence spans and rewriting uncertain segments, working in both zero-shot and fine-tuned modes.

Result: Achieves 14.51% WER reduction, +7.59 pp improvement in MENLI, and +7.66 pp in Slot Micro F1 on challenging dysarthric speech samples. Also releases SAP-Hypo5 benchmark for reproducibility.

Conclusion: LLM-based post-ASR correction significantly improves both transcription accuracy and semantic fidelity for dysarthric speech, with semantic metrics correlating better with downstream task performance than WER.

Abstract: While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge-Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. In parallel, we release SAP-Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi-perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.

[789] SemanticAudio: Audio Generation and Editing in Semantic Space

Zheqi Dai, Guangyan Zhang, Haolin He, Xiquan Li, Jingyu Li, Chunyat Wu, Yiwen Guo, Qiuqiang Kong

Main category: eess.AS

TL;DR: SemanticAudio introduces a two-stage Flow Matching framework for text-to-audio generation that operates in a high-level semantic space rather than acoustic latent space, enabling better text-audio alignment and training-free text-guided editing.

DetailsMotivation: Existing text-to-audio models operate directly in acoustic latent space of VAEs, leading to suboptimal alignment between generated audio and textual descriptions. There's a need for better semantic understanding and control in audio generation.

Method: Two-stage Flow Matching architecture: 1) Semantic Planner generates compact semantic features capturing global identity and temporal sequence of sound events, 2) Acoustic Synthesizer produces high-fidelity acoustic latents conditioned on the semantic plan. Also introduces training-free text-guided editing by steering semantic generation trajectory via velocity field differences.
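
A hedged sketch of the training-free editing idea: during Euler integration of the semantic flow, the velocity under the source prompt is shifted by the difference between target- and source-prompt velocities. The velocity function, its signature, and the guidance rule below are illustrative assumptions, not the SemanticAudio implementation.

```python
import torch

def edit_semantic_trajectory(velocity_fn, x0, src_prompt, tgt_prompt,
                             guidance=1.0, n_steps=50):
    """Euler integration of a flow-matching model, steered by the velocity-field
    difference between target and source prompts."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        v_src = velocity_fn(x, t, src_prompt)
        v_tgt = velocity_fn(x, t, tgt_prompt)
        v = v_src + guidance * (v_tgt - v_src)   # steer toward the target prompt
        x = x + dt * v
    return x

# toy velocity field standing in for the Semantic Planner (hypothetical)
toy_velocity = lambda x, t, prompt: -x + 0.1 * float(hash(prompt) % 7)
x_edit = edit_semantic_trajectory(toy_velocity, torch.randn(2, 16),
                                  "dog barking", "dog barking in the rain")
print(x_edit.shape)   # torch.Size([2, 16])
```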

Result: Extensive experiments show SemanticAudio surpasses existing mainstream approaches in semantic alignment. The framework enables precise attribute-level modifications without retraining.

Conclusion: Operating in semantic space rather than acoustic space improves text-audio alignment and enables flexible editing capabilities. The decoupled design offers better control over audio generation.

Abstract: In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/

[790] Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Bing Han, Chushu Zhou, Yifan Yang, Wei Wang, Chenda Li, Wangyou Zhang, Yanmin Qian

Main category: eess.AS

TL;DR: CAT is a bootstrap-based SSL framework for audio that captures hierarchical features via multi-resolution blocks and improves training efficiency using representation regularization with frozen pre-trained encoders.

DetailsMotivation: Existing SSL methods for audio operate at single granularity, limiting ability to model diverse temporal/spectral structures, and bootstrapping from scratch is computationally expensive.

Method: Proposes Convolutional Audio Transformer (CAT) with: 1) Multi-resolution Block for hierarchical feature capture, 2) Representation Regularization objective that aligns student predictions with frozen pre-trained encoder representations.

Result: Significantly outperforms baselines on audio understanding benchmarks, achieves competitive performance on AudioSet 20k with 5× faster convergence than existing methods.

Conclusion: CAT provides an efficient unified framework for audio SSL that captures hierarchical features and accelerates training through representation regularization.

Abstract: Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
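
A minimal sketch of what a representation-regularization term can look like: align the student's predictions with features from a frozen, pre-trained external encoder. The cosine form and the auxiliary weighting below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def representation_regularization(student_pred, frozen_feats):
    """Cosine alignment between student predictions and frozen-encoder features.

    student_pred, frozen_feats: (batch, time, dim); the frozen branch carries
    no gradient, so it only guides the bootstrap student.
    """
    s = F.normalize(student_pred, dim=-1)
    t = F.normalize(frozen_feats.detach(), dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

# total_loss = ssl_bootstrap_loss + reg_weight * representation_regularization(pred, ext_feats)
```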

[791] Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts

Michael Kuhlmann, Alexander Werning, Thilo von Neumann, Reinhold Haeb-Umbach

Main category: eess.AS

TL;DR: Frame-level speech quality assessment with consistency regularization for better interpretability and applications in spoof detection and TTS artifact analysis

DetailsMotivation: Existing speech quality assessment methods at utterance or system level lack interpretability - they can't explain why specific scores were assigned. Frame-level scores offer better interpretability but are hard to train due to lack of strong targets and regularization challenges.

Method: Proposes regularizing utterance-level speech quality predictors with a segment-based consistency constraint to reduce frame-level stochasticity. This enables more reliable frame-level scoring.

Result: The consistency constraint notably reduces frame-level stochasticity. Applied to partial spoof detection and synthesis artifact detection in TTS systems. Listening tests confirm listeners rate segments with low frame-level scores as poor quality more often than random control sets.

Conclusion: Segment-based consistency regularization enables more interpretable frame-level speech quality assessment, with practical applications in spoof detection and TTS system evaluation.

Abstract: A large number of works view the automatic assessment of speech from an utterance- or system-level perspective. While such approaches are good at judging overall quality, they cannot adequately explain why a certain score was assigned to an utterance. Frame-level scores can provide better interpretability, but models predicting them are harder to tune and regularize since no strong targets are available during training. In this work, we show that utterance-level speech quality predictors can be regularized with a segment-based consistency constraint, which notably reduces frame-level stochasticity. We then demonstrate two applications involving frame-level scores: the partial spoof scenario and the detection of synthesis artefacts in two state-of-the-art text-to-speech systems. For the latter, we perform listening tests and confirm that listeners rate segments to be of poor quality more often in the set defined by low frame-level scores than in a random control set.
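
One plausible form of a segment-based consistency constraint is a penalty on the variance of frame-level scores within short segments, added to the utterance-level objective. This is a hedged sketch; the segment length and the exact penalty are assumptions rather than the paper's formulation.

```python
import torch

def segment_consistency_penalty(frame_scores, segment_len=25):
    """Penalize frame-level score variance within fixed-length segments.

    frame_scores: (batch, num_frames) per-frame quality scores.
    """
    b, n = frame_scores.shape
    n_trim = (n // segment_len) * segment_len
    segs = frame_scores[:, :n_trim].reshape(b, -1, segment_len)
    return segs.var(dim=-1, unbiased=False).mean()

# loss = utterance_mos_loss + cons_weight * segment_consistency_penalty(frame_scores)
```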

[792] DisContSE: Single-Step Diffusion Speech Enhancement Based on Joint Discrete and Continuous Embeddings

Yihui Fu, Tim Fingscheidt

Main category: eess.AS

TL;DR: DisContSE: Efficient diffusion-based speech enhancement using joint discrete codec tokens and continuous embeddings with single-step inference

DetailsMotivation: Current diffusion speech enhancement models on discrete audio codec features have high computational complexity due to multiple reverse iterations and struggle with intrusive metrics like phonetic accuracy, despite good non-intrusive metric performance.

Method: Proposes DisContSE with three modules: 1) discrete enhancement on codec tokens, 2) continuous enhancement on embeddings, 3) semantic enhancement for phonetic accuracy. Uses novel quantization error mask initialization for single-step reverse process inference.

Result: Outperforms top diffusion baselines in PESQ, POLQA, UTMOS, and subjective ITU-T P.808 listening tests on URGENT 2024 Speech Enhancement Challenge data, achieving overall top rank.

Conclusion: DisContSE successfully combines discrete and continuous representations for efficient single-step diffusion speech enhancement with improved fidelity, intelligibility, and phonetic accuracy.

Abstract: Diffusion speech enhancement on discrete audio codec features has gained immense attention due to its improved speech-component reconstruction capability. However, such methods usually suffer from high inference computational complexity due to multiple reverse-process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but show poor performance on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is further adopted to achieve optimal phonetic accuracy. Third, we achieve a single-step efficient reverse process in inference with a novel quantization error mask initialization strategy, which, to our knowledge, is the first successful single-step diffusion speech enhancement based on an audio codec. Trained and evaluated on URGENT 2024 Speech Enhancement Challenge data splits, the proposed DisContSE surpasses top-reported time- and frequency-domain diffusion baseline methods in PESQ, POLQA, UTMOS, and in a subjective ITU-T P.808 listening test, clearly achieving an overall top rank.

[793] TidyVoice 2026 Challenge Evaluation Plan

Aref Farhadipour, Jan Marquenie, Srikanth Madikeri, Teodora Vukovic, Volker Dellwo, Kathy Reid, Francis M. Tyers, Ingo Siegert, Eleanor Chodroff

Main category: eess.AS

TL;DR: A challenge for cross-lingual speaker verification using the TidyVoiceX dataset to address language mismatch issues in speaker recognition systems.

DetailsMotivation: Speaker verification systems degrade significantly under language mismatch, exacerbated by reliance on English-centric data. Need for more inclusive, language-independent speaker recognition technologies.

Method: Proposes the TidyVoice Challenge using the TidyVoiceX dataset from Mozilla Common Voice, curated to isolate language switching effects across ~40 languages. Participants build systems robust to language mismatch, evaluated using Equal Error Rate on cross-language trials.

Result: Provides standardized data, open-source baselines, and rigorous evaluation protocol to drive research toward fairer, more inclusive speaker recognition technologies.

Conclusion: The challenge aims to advance cross-lingual speaker verification research, directly aligning with Interspeech 2026’s “Speaking Together” theme for language-independent technologies.

Abstract: The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field’s reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale, multilingual corpus derived from Mozilla Common Voice, and specifically curated to isolate the effect of language switching across approximately 40 languages. Participants will be tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate on cross-language trials. By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies, directly aligning with the Interspeech 2026 theme, “Speaking Together.”
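
For reference, the Equal Error Rate used as the primary metric can be computed from trial scores in a few lines. This is the standard definition, not the challenge's official scoring script.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER from verification scores (higher = same speaker) and 0/1 labels."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # threshold where FAR ~ FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# equal_error_rate(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))  # -> 0.0
```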

[794] Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation

Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, Kwanghoon Sohn

Main category: eess.AS

TL;DR: Audio-Centric Query Generation and Sound-Aware Ordinal Counting loss address visual bias in audiovisual instance segmentation by enabling query specialization to sound sources and explicit supervision of sounding object counts.

DetailsMotivation: Existing audiovisual instance segmentation methods suffer from visual bias due to uniform additive fusion (preventing query specialization to different sound sources) and visual-only training objectives (allowing queries to converge to arbitrary salient objects).

Method: Proposes Audio-Centric Query Generation using cross-attention to enable each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Also introduces Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints.

Result: Experiments on AVISeg benchmark show consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.

Conclusion: The proposed audio-centric query generation and sound-aware ordinal counting loss effectively address visual bias in audiovisual instance segmentation, leading to significant performance improvements on benchmark metrics.

Abstract: Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises sounding object numbers through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.
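
The Sound-Aware Ordinal Counting idea can be sketched with a standard cumulative-target ordinal regression plus a simple monotonicity penalty; the exact loss in the paper may differ, and `lambda_mono` is an assumed weight.

```python
import torch
import torch.nn.functional as F

def ordinal_counting_loss(logits, counts, lambda_mono=1.0):
    """Ordinal-regression counting loss over K cumulative logits.

    logits: (batch, K), where sigmoid(logits[:, k]) ~ P(count > k).
    counts: (batch,) integer number of sounding objects.
    Cumulative targets give the ordinal structure; the hinge term is one way
    to encourage monotonically non-increasing probabilities.
    """
    K = logits.shape[1]
    ks = torch.arange(K, device=logits.device)
    targets = (counts.unsqueeze(1) > ks.unsqueeze(0)).float()
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    mono = F.relu(probs[:, 1:] - probs[:, :-1]).mean()   # p_{k+1} should not exceed p_k
    return bce + lambda_mono * mono
```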

[795] Quantitative Measures for Passive Sonar Texture Analysis

Jarin Ritu, Alexandra Van Dine, Joshua Peeples

Main category: eess.AS

TL;DR: CNNs struggle with statistical variations in passive sonar signals; proposed texture metrics show CNNs underperform on statistically textured signals but improve with explicit statistical texture modeling.

DetailsMotivation: Passive sonar signals have complex characteristics from environmental noise, vessel machinery, and propagation effects. While CNNs perform well on classification tasks, they struggle with statistical variations in the data, which needs investigation.

Method: Generated synthetic underwater acoustic datasets focusing on amplitude and period variations. Proposed two metrics to quantify and validate statistical and structural texture characteristics. Applied these measures to real-world passive sonar datasets to assess texture information and correlate model performances.

Result: Results show that CNNs underperform on statistically textured signals, but incorporating explicit statistical texture modeling yields consistent improvements in classification performance.

Conclusion: Quantifying texture information is important for passive sonar classification, and explicit statistical texture modeling can address CNN limitations with statistical variations in acoustic signals.

Abstract: Passive sonar signals contain complex characteristics often arising from environmental noise, vessel machinery, and propagation effects. While convolutional neural networks (CNNs) perform well on passive sonar classification tasks, they can struggle with statistical variations that occur in the data. To investigate this limitation, synthetic underwater acoustic datasets centered on amplitude and period variations are generated. Two metrics are proposed to quantify and validate these characteristics in the context of statistical and structural texture for passive sonar. These measures are applied to real-world passive sonar datasets to assess texture information in the signals and correlate it with model performance. Results show that CNNs underperform on statistically textured signals, but incorporating explicit statistical texture modeling yields consistent improvements. These findings highlight the importance of quantifying texture information for passive sonar classification.

[796] Do We Need EMA for Diffusion-Based Speech Enhancement? Toward a Magnitude-Preserving Network Architecture

Julius Richter, Danilo de Oliveira, Timo Gerkmann

Main category: eess.AS

TL;DR: Schrodinger bridge diffusion framework for speech enhancement with novel skip-connection designs and EMA analysis showing different behavior than image generation

DetailsMotivation: To improve diffusion-based speech enhancement by applying Schrodinger bridge formulation and exploring architectural innovations like skip-connection variants and magnitude-preserving designs, while investigating EMA behavior specific to speech tasks

Method: Extends EDM2 framework to speech enhancement using Schrodinger bridge formulation with time-dependent preconditioning, explores two skip-connection configurations (predicting environmental noise or clean speech), adopts magnitude-preserving architecture, and analyzes EMA impact through post-training approximations

Result: Competitive performance on VoiceBank-DEMAND and EARS-WHAM datasets with complementary strengths from different skip-connection variants, finding that short or absent EMA yields better speech enhancement performance unlike in image generation

Conclusion: Provides new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement, demonstrating the effectiveness of Schrodinger bridge formulation with architectural innovations

Abstract: We study diffusion-based speech enhancement using a Schrodinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training and explore two skip-connection configurations that allow the network to predict either environmental noise or clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post training, finding that, unlike in image generation, short or absent EMA consistently yields better speech enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
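
For context, EMA parameter smoothing is a running average of the weights; the paper's contribution is analyzing how long that average should be by approximating different profiles after training. The basic update, shown here only for illustration, looks like this:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """One EMA step: a smaller decay corresponds to shorter effective averaging."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# ema_model = copy.deepcopy(model)        # initialize from the current weights
# after every optimizer step: ema_update(ema_model, model, decay=0.999)
```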

[797] Listen, Look, Drive: Coupling Audio Instructions for User-aware VLA-based Autonomous Driving

Ziang Guo, Feng Yang, Xuefeng Zhang, Jiaqi Guo, Kun Zhao, Yixiao Zhou, Peng Lu, Sifa Zheng, Zufeng Zhang

Main category: eess.AS

TL;DR: EchoVLA: A user-aware Vision-Language-Action model for autonomous driving that incorporates real-time audio instructions with emotional context to enable more responsive and adaptive driving behavior.

DetailsMotivation: Current VLA models treat language as static prior, requiring inference of shifting objectives from pixels alone, leading to delayed or conservative maneuvers. Need for online user influence channel with specific intentions.

Method: Augment nuScenes dataset with temporally aligned, intent-specific synthetic speech commands. Create multimodal Chain-of-Thought (CoT) with emotional speech-trajectory pairs. Fine-tune Qwen2.5-Omni MLM to interpret both semantic content and emotional context of audio commands.

Result: Reduces average L2 error by 59.4% and collision rate by 74.4% compared to vision-only baseline. Validates ability to steer trajectory through audio instructions and modulate driving behavior based on detected emotions in user’s speech.

Conclusion: EchoVLA demonstrates effective integration of audio instructions with emotional context for more nuanced and adaptive autonomous driving behavior, addressing limitations of static language priors in current VLA models.

Abstract: Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously shifting objectives from pixels alone, yielding delayed or overly conservative maneuvers. We argue that effective VLAs for autonomous driving need an online channel in which users can influence driving with specific intentions. To this end, we present EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. We augment the nuScenes dataset with temporally aligned, intent-specific speech commands generated by converting ego-motion descriptions into synthetic audio. Further, we compose emotional speech-trajectory pairs into a multimodal Chain-of-Thought (CoT) for fine-tuning a Multimodal Large Model (MLM) based on Qwen2.5-Omni. Specifically, we synthesize the audio-augmented dataset with different emotion types paired with corresponding driving behaviors, leveraging the emotional cues embedded in tone, pitch, and speech tempo to reflect varying user states, such as urgent or hesitant intentions, thus enabling our EchoVLA to interpret not only the semantic content but also the emotional context of audio commands for more nuanced and emotionally adaptive driving behavior. In open-loop benchmarks, our approach reduces the average L2 error by 59.4% and the collision rate by 74.4% compared to the baseline of vision-only perception. Further experiments on the nuScenes dataset validate that EchoVLA not only steers the trajectory through audio instructions, but also modulates driving behavior in response to the emotions detected in the user's speech.

[798] End-to-end audio-visual learning for cochlear implant sound coding simulations in noisy environments

Meng-Ping Lin, Enoch Hsin-Ho Huang, Shao-Yi Chien, Yu Tsao

Main category: eess.AS

TL;DR: Audio-visual speech enhancement integrated with cochlear implant sound coding improves speech intelligibility in noise by leveraging visual cues.

DetailsMotivation: Cochlear implants struggle with listening in noisy environments, and recent deep learning advances show promise for improving CI sound coding by incorporating visual information to enhance speech perception.

Method: Integrated an audio-visual speech enhancement (AVSE) module with the ElectrodeNet-CS (ECS) model to create an end-to-end CI system called AVSE-ECS, using joint training to optimize performance.

Result: The AVSE-ECS system achieved high objective speech intelligibility and improved signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy.

Conclusion: Audio-visual speech enhancement shows significant potential for improving cochlear implant sound coding, particularly for challenging listening conditions in noise.

Abstract: The cochlear implant (CI) is a successful biomedical device that enables individuals with severe-to-profound hearing loss to perceive sound through electrical stimulation, yet listening in noise remains challenging. Recent deep learning advances offer promising potential for CI sound coding by integrating visual cues. In this study, an audio-visual speech enhancement (AVSE) module is integrated with the ElectrodeNet-CS (ECS) model to form the end-to-end CI system, AVSE-ECS. Simulations show that the AVSE-ECS system with joint training achieves high objective speech intelligibility and improves the signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy. These findings underscore the potential of AVSE-based CI sound coding.

[799] Mitigating data replication in text-to-audio generative diffusion models through anti-memorization guidance

Francisco Messina, Francesca Ronchini, Luca Comanducci, Paolo Bestagini, Fabio Antonacci

Main category: eess.AS

TL;DR: AMG (Anti-Memorization Guidance) reduces data replication in text-to-audio diffusion models by modifying sampling to discourage memorization while maintaining generation quality.

DetailsMotivation: Addresses the persistent challenge of data replication in generative audio models where models unintentionally generate parts of training data during inference, which is a critical issue for text-to-audio diffusion models.

Method: Adopts Anti-Memorization Guidance (AMG) that modifies the sampling process of pre-trained diffusion models. Explores three types of guidance within AMG designed to reduce replication while preserving generation quality. Uses Stable Audio Open as backbone for its fully open-source architecture and training dataset.

Result: Comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.

Conclusion: AMG provides an effective solution to the data replication problem in text-to-audio diffusion models, enabling safer and more original audio generation while maintaining quality.

Abstract: A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open-source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.

[800] No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS

Seungyoun Shin, Dongha Ahn, Jiwoo Kim, Sungwook Jeon

Main category: eess.AS

TL;DR: Iterative DPO with minimal human preference pairs improves prosodic naturalness in TTS while maintaining competitive transcription accuracy, outperforming GRPO and commercial baselines on Korean call center data.

DetailsMotivation: GRPO optimization for TTS improves transcription metrics but collapses prosody into monotone speech, and adding speaker similarity destabilizes training. There's a need for methods that optimize prosodic naturalness when automatic rewards are unavailable.

Method: Iterative Direct Preference Optimization (DPO) scheme using only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model.

Result: Achieved highest human preference (ELO) with competitive CER on KoCC-TTS dataset, outperforming GRPO and strong commercial baselines.

Conclusion: Human preference optimization offers a practical and data-efficient path to natural and robust TTS when prosody cannot be rewarded automatically.

Abstract: Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker-similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at https://tts.ch.dev.
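
The DPO objective referred to here is the standard preference loss over (preferred, rejected) synthesis pairs; a minimal sketch, where `beta` and the log-probability interfaces are assumptions:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization on preferred (w) vs. rejected (l) samples.

    logp_*     : sequence log-probs under the model being trained
    ref_logp_* : sequence log-probs under the frozen reference model
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```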

[801] SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung

Main category: eess.AS

TL;DR: SPADE is a framework for efficient LLM-based text-to-speech that combines structured pruning and adaptive distillation to reduce model size while maintaining quality.

DetailsMotivation: Recent LLM-TTS systems achieve strong controllability and zero-shot generalization but have large parameter counts and high latency that limit real-world deployment. There's a need for more efficient models that maintain quality while being practical for real-time applications.

Method: SPADE combines: (1) structured pruning guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, and (2) multi-level knowledge distillation to restore autoregressive coherence after pruning.

Result: SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, achieving up to 1.7x faster real-time factor, with less than 5% of original training data.

Conclusion: Compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation, demonstrating the effectiveness of structured pruning and distillation for efficient audio generation.

Abstract: The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
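
A rough sketch of a WER-based layer importance index: ablate each Transformer layer, measure the WER increase on a held-out set, and treat the layers whose removal hurts least as pruning candidates. The `skip` flag and `eval_wer` hook are hypothetical; the paper's index may be computed differently.

```python
def layer_importance_by_wer(model, layers, eval_wer):
    """Rank layers by the WER degradation caused by skipping them."""
    base = eval_wer(model)                    # assumed: returns WER on a dev set
    importance = {}
    for i, layer in enumerate(layers):
        layer.skip = True                     # assumed flag that bypasses the layer
        importance[i] = eval_wer(model) - base
        layer.skip = False
    # smallest WER increase = least important = first candidates for pruning
    return sorted(importance, key=importance.get)
```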

[802] Position: Towards Responsible Evaluation for Text-to-Speech

Yifan Yang, Hui Wang, Bing Han, Shujie Liu, Jinyu Li, Yong Qin, Xie Chen

Main category: eess.AS

TL;DR: Position paper advocating for Responsible Evaluation in TTS systems through three levels: accurate capability assessment, standardized benchmarks, and ethical risk mitigation.

DetailsMotivation: Current TTS evaluation practices are inadequate for capturing modern systems' full capabilities, limitations, and societal impacts, necessitating a more comprehensive evaluation framework.

Method: Proposes the concept of Responsible Evaluation structured through three progressive levels: 1) accurate capability assessment with robust scoring methodologies, 2) standardized benchmarks and transparent reporting, and 3) ethical risk assessment and mitigation.

Result: Critical examination of current evaluation practices, identification of systemic shortcomings, and actionable recommendations for more reliable and trustworthy TTS technology.

Conclusion: Responsible Evaluation is essential for the next phase of TTS development to ensure ethically sound and societally beneficial applications.

Abstract: Recent advances in text-to-speech (TTS) technology have enabled systems to generate speech that is often indistinguishable from human speech, bringing benefits to accessibility, content creation, and human-computer interaction. However, current evaluation practices are increasingly inadequate for capturing the full range of capabilities, limitations, and societal impacts of modern TTS systems. This position paper introduces the concept of Responsible Evaluation and argues that it is essential and urgent for the next phase of TTS development, structured through three progressive levels: (1) ensuring the faithful and accurate reflection of a model’s true capabilities and limitations, with more robust, discriminative, and comprehensive objective and subjective scoring methodologies; (2) enabling comparability, standardization, and transferability through standardized benchmarks, transparent reporting, and transferable evaluation metrics; and (3) assessing and mitigating ethical risks associated with forgery, misuse, privacy violations, and security vulnerabilities. Through this concept, we critically examine current evaluation practices, identify systemic shortcomings, and propose actionable recommendations. We hope this concept will not only foster more reliable and trustworthy TTS technology but also guide its development toward ethically sound and societally beneficial applications.

[803] CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu

Main category: eess.AS

TL;DR: CASTELLA is a new large-scale human-annotated audio benchmark for audio moment retrieval (AMR) that addresses limitations of previous synthetic datasets and small evaluation sets.

DetailsMotivation: Current AMR research lacks reliable benchmarks with real-world data, relying on synthetic datasets and small evaluation sets (under 100 samples), leading to unreliable performance metrics and limiting practical applications.

Method: Created CASTELLA, a manually annotated AMR dataset with 1009 training, 213 validation, and 640 test audio recordings (24× larger than previous datasets). Established baseline models and compared fine-tuning on CASTELLA after synthetic pre-training vs. synthetic-only training.

Result: Models fine-tuned on CASTELLA after synthetic pre-training outperformed synthetic-only trained models by 10.4 points in Recall1@0.7 metric, demonstrating the value of real-world annotated data.

Conclusion: CASTELLA provides a substantial, reliable benchmark for AMR research with real-world data, enabling more accurate evaluation and development of practical audio retrieval systems.

Abstract: We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various useful potential applications, there is still no established benchmark with real-world data. The initial study of AMR trained the models solely on synthetic datasets. Moreover, the evaluation is based on an annotated dataset of fewer than 100 samples. This resulted in less reliable reported performance. To ensure performance for applications in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1009, 213, and 640 audio recordings for training, validation, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on the synthetic data outperformed a model trained solely on the synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available in https://h-munakata.github.io/CASTELLA-demo/.

[804] MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization

Nikhil Raghav, Avisek Gupta, Swagatam Das, Md Sahidullah

Main category: eess.AS

TL;DR: MK-SGC-SC: A fully unsupervised speaker diarization method using multiple kernel similarities and sparse graph construction for spectral clustering, achieving SOTA performance without pretraining or supervision.

DetailsMotivation: Speaker diarization typically requires pretraining or weak supervision, but unsupervised approaches are challenging. The authors aim to develop a fully unsupervised method that can identify speaker regions without any supervision by leveraging improved clustering techniques on speaker embeddings.

Method: The method uses multiple kernel similarities (four polynomial kernels and a degree one arccosine kernel) to measure similarities in speaker embeddings. These similarities are used to construct sparse graphs in a principled manner that emphasize local similarities. The sparse graphs are then used for spectral clustering.

Result: The approach achieves state-of-the-art performance in unsupervised speaker diarization across challenging environments in DIHARD-III, AMI, and VoxConverse corpora, demonstrating effectiveness without any pretraining or supervision.

Conclusion: Measuring multiple kernel similarities of speaker embeddings and constructing sparse graphs for spectral clustering is sufficient for achieving SOTA unsupervised speaker diarization, providing a promising direction for fully unsupervised audio analysis.

Abstract: Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings and thereafter crafting a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performance in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree-one arccosine kernel to measure similarities in speaker embeddings, from which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels in unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at https://github.com/nikhilraghav29/MK-SGC-SC.
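
A compact sketch of the pipeline: compute several kernel similarities over speaker embeddings, combine them, keep only the strongest local edges, and run spectral clustering on the resulting sparse graph. The kernel normalization, simple averaging, clipping, and kNN sparsification below are illustrative choices, not the paper's exact construction.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def arccosine_kernel_deg1(X):
    """Degree-1 arc-cosine kernel between rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip((X @ X.T) / (norms @ norms.T + 1e-12), -1.0, 1.0)
    theta = np.arccos(cos)
    return (norms @ norms.T) * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def knn_sparsify(A, k=10):
    """Keep the k largest similarities per row, then symmetrize."""
    S = np.zeros_like(A)
    idx = np.argsort(-A, axis=1)[:, :k]
    rows = np.arange(A.shape[0])[:, None]
    S[rows, idx] = A[rows, idx]
    return np.maximum(S, S.T)

def cluster_speakers(X, n_speakers, k=10):
    """Combine polynomial (degrees 1-4) and arc-cosine similarities, then cluster."""
    gram = X @ X.T
    kernels = [gram ** d for d in (1, 2, 3, 4)] + [arccosine_kernel_deg1(X)]
    A = np.mean([K / (np.abs(K).max() + 1e-12) for K in kernels], axis=0)
    A = np.clip(A, 0.0, None)                 # keep affinities nonnegative
    np.fill_diagonal(A, 0.0)
    W = knn_sparsify(A, k=k)
    return SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                              assign_labels="kmeans").fit_predict(W)
```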

[805] Decoding Speech Envelopes from Electroencephalogram with a Contrastive Pearson Correlation Coefficient Loss

Yayun Liang, Yuanming Zhang, Fei Chen, Jing Lu, Zhibin Lin

Main category: eess.AS

TL;DR: Proposes contrastive Pearson correlation coefficient loss for EEG-based auditory attention decoding to improve envelope separability between attended and unattended speech signals.

DetailsMotivation: Existing EEG-based auditory attention decoding methods focus on maximizing correlation with attended speech envelope, but neglect the importance of maximizing the difference between attended and unattended envelope correlations for better attention decoding.

Method: Introduces contrastive PCC loss that explicitly maximizes the difference between attended PCC and unattended PCC, evaluated on three public EEG AAD datasets using four DNN architectures.

Result: The proposed objective improves envelope separability and AAD accuracy across many settings, while revealing dataset- and architecture-dependent failure cases.

Conclusion: Contrastive PCC loss enhances auditory attention decoding performance by better separating attended and unattended speech envelope reconstructions from EEG signals.

Abstract: Recent advances in reconstructing speech envelopes from Electroencephalogram (EEG) signals have enabled continuous auditory attention decoding (AAD) in multi-speaker environments. Most Deep Neural Network (DNN)-based envelope reconstruction models are trained to maximize the Pearson correlation coefficients (PCC) between the attended envelope and the reconstructed envelope (attended PCC). While the difference between the attended PCC and the unattended PCC plays an essential role in auditory attention decoding, existing methods often focus on maximizing the attended PCC. We therefore propose a contrastive PCC loss which represents the difference between the attended PCC and the unattended PCC. The proposed approach is evaluated on three public EEG AAD datasets using four DNN architectures. Across many settings, the proposed objective improves envelope separability and AAD accuracy, while also revealing dataset- and architecture-dependent failure cases.
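
The proposed objective is easy to state directly: maximize the Pearson correlation with the attended envelope minus the correlation with the unattended one. A minimal sketch:

```python
import torch

def pearsonr(a, b, eps=1e-8):
    """Row-wise Pearson correlation for (batch, time) tensors."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)

def contrastive_pcc_loss(recon, attended, unattended):
    """Negative of (attended PCC - unattended PCC), averaged over the batch."""
    return -(pearsonr(recon, attended) - pearsonr(recon, unattended)).mean()
```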

eess.IV

[806] ECGFlowCMR: Pretraining with ECG-Generated Cine CMR Improves Cardiac Disease Classification and Phenotype Prediction

Xiaocheng Fang, Zhengyao Ding, Jieyi Cai, Yujie Xiao, Bo Liu, Jiarui Jin, Haoyu Wang, Guangkun Nie, Shun Huang, Ting Chen, Hongyan Li, Shenda Hong

Main category: eess.IV

TL;DR: ECGFlowCMR: A novel ECG-to-CMR generative framework that uses Phase-Aware Masked Autoencoder and Anatomy-Motion Disentangled Flow to generate cardiac cine MRI sequences from ECG inputs, addressing cross-modal temporal mismatches and anatomical observability gaps.

DetailsMotivation: Cardiac MRI is expensive and requires expert annotations, limiting large-scale labeled datasets. ECGs are inexpensive and widely accessible, offering potential for conditioning generative synthesis of cine CMR to overcome data scarcity.

Method: Proposes ECGFlowCMR with two key components: 1) Phase-Aware Masked Autoencoder (PA-MAE) to handle cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and 2) Anatomy-Motion Disentangled Flow (AMDF) to address anatomical observability gap from limited structural information in ECGs.

Result: Extensive experiments on UK Biobank and proprietary clinical dataset show ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.

Conclusion: ECGFlowCMR provides an effective framework for cross-modal generation from ECG to CMR, addressing key challenges in temporal alignment and anatomical representation, with practical applications in scalable pretraining and improved downstream task performance.

Abstract: Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.

[807] Denoising and Baseline Correction of Low-Scan FTIR Spectra: A Benchmark of Deep Learning Models Against Traditional Signal Processing

Azadeh Mokari, Shravan Raghunathan, Artem Shydliukh, Oleg Ryabchykov, Christoph Krafft, Thomas Bocklitz

Main category: eess.IV

TL;DR: Physics-informed cascade Unet for FTIR imaging acceleration separates denoising and baseline correction using deterministic Physics Bridge, achieving 51.3% RMSE reduction and 32x faster imaging.

DetailsMotivation: FTIR imaging requires extensive signal averaging to reduce noise and drift, limiting clinical speed. Deep learning can accelerate imaging but standard approaches suffer from spectral hallucinations and poor generalization due to statistical approximations.

Method: Proposed physics-informed cascade Unet with deterministic Physics Bridge that separates denoising and baseline correction tasks. Uses embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations.

Result: Cascade model achieved 51.3% reduction in RMSE compared to raw single-scan inputs, outperforming single Unet (40.2%) and traditional workflow (33.7%). Eliminates spectral hallucinations and preserves peak intensity better than traditional smoothing.

Conclusion: Cascade Unet is a robust solution for diagnostic-grade FTIR imaging, enabling imaging speeds 32 times faster than current methods while maintaining spectral fidelity.

Abstract: High-quality Fourier Transform Infrared (FTIR) imaging usually needs extensive signal averaging to reduce noise and drift, which severely limits clinical speed. Deep learning can accelerate imaging by reconstructing spectra from rapid, single-scan inputs. However, separating noise and baseline drift simultaneously without ground truth is an ill-posed inverse problem. Standard black-box architectures often rely on statistical approximations that introduce spectral hallucinations or fail to generalize to unstable atmospheric conditions. To solve these issues, we propose a physics-informed cascade Unet that separates denoising and baseline correction tasks using a new, deterministic Physics Bridge. This architecture forces the network to separate random noise from chemical signals using an embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations. We benchmarked this approach against a standard single Unet and a traditional Savitzky-Golay/SNIP workflow. We used a dataset of human hypopharyngeal carcinoma cells (FaDu). The cascade model outperformed all other methods, achieving a 51.3% reduction in RMSE compared to raw single-scan inputs, surpassing both the single Unet (40.2%) and the traditional workflow (33.7%). Peak-aware metrics show that the cascade architecture eliminates spectral hallucinations found in standard deep learning approaches. It also preserves peak intensity with much higher fidelity than traditional smoothing. These results show that the cascade Unet is a robust solution for diagnostic-grade FTIR imaging. It enables imaging speeds 32 times faster than current methods.

[808] CompSRT: Quantization and Pruning for Image Super Resolution Transformers

Dorsa Zeinali, Hailing Wang, Yitian Zhang, Raymond Fu

Main category: eess.IV

TL;DR: CompSRT introduces Hadamard-based quantization with scalar decomposition for compressing SwinIR-light image super-resolution transformers, achieving state-of-the-art compression performance with up to 1.53 dB gains.

DetailsMotivation: There's a significant performance gap between compressed and full-precision image super-resolution models, and while Hadamard transforms have shown promise in LLM quantization for reducing outliers, their empirical effects on vision transformers need deeper understanding.

Method: Analyzed weight/activation distributions in SwinIR-light, showing Hadamard transforms reduce ranges and increase values around zero. Introduced CompSRT with Hadamard-based quantization plus scalar decomposition (two trainable parameters) for SwinIR-light compression, compatible with pruning.

Result: Statistically significant improvements over SOTA (up to 1.53 dB gains), better visual quality with reduced blurriness at all bitwidths. At 3-4 bits with 40% pruning, achieved 6.67-15% reduction in bits per parameter while maintaining comparable performance.

Conclusion: Hadamard transforms effectively compress vision transformers by reducing value ranges and increasing zero-proximity values. CompSRT provides superior compression for image super-resolution transformers with practical pruning compatibility.

Abstract: Model compression has become an important tool for making image super-resolution models more efficient. However, the gap between the best compressed models and the full-precision model remains large, and a deeper understanding of how compression behaves on more performant models is still needed. Prior research on quantization of LLMs has shown that Hadamard transformations lead to weights and activations with reduced outliers, which improves performance. We argue that while the Hadamard transform does reduce the effect of outliers, an empirical analysis of how the transform functions is still needed. By studying the distributions of weights and activations of SwinIR-light, we show with statistical analysis that the lower errors are caused by the Hadamard transform's ability to reduce value ranges and increase the proportion of values around 0. Based on these findings, we introduce CompSRT, a more performant way to compress the image super-resolution transformer network SwinIR-light. We perform Hadamard-based quantization, and we also perform scalar decomposition to introduce two additional trainable parameters. Our quantization statistically significantly surpasses the SOTA, with gains as large as 1.53 dB, and visibly improves visual quality by reducing blurriness at all bitwidths. At 3-4 bits, to show that our method is compatible with pruning for increased compression, we also prune 40% of the weights and achieve a 6.67-15% reduction in bits per parameter with performance comparable to the SOTA.
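
The core mechanism (quantizing in a Hadamard-rotated domain so that value ranges shrink and outliers matter less) can be sketched as follows. This is illustrative only: it omits the scalar decomposition, activation handling, and training details, and the symmetric 4-bit quantizer is an assumption.

```python
import numpy as np
from scipy.linalg import hadamard

def quantize_uniform(x, bits=4):
    """Symmetric uniform quantization to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def hadamard_quantize(W, bits=4):
    """Quantize a weight matrix in the Hadamard-rotated domain.

    W: (out, in) with `in` a power of two (padding would be needed otherwise).
    The rotation narrows value ranges before quantization; the inverse
    rotation maps the quantized weights back.
    """
    n = W.shape[1]
    H = hadamard(n) / np.sqrt(n)           # orthonormal Hadamard matrix
    W_rot_q = quantize_uniform(W @ H, bits)
    return W_rot_q @ H.T                   # H is orthogonal, so H.T inverts it
```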

[809] Blind Ultrasound Image Enhancement via Self-Supervised Physics-Guided Degradation Modeling

Shujaat Khan, Syed Muhammad Atif, Jaeyoung Huh, Syed Saad Azhar

Main category: eess.IV

TL;DR: Blind self-supervised ultrasound enhancement framework using physics-guided degradation model with Swin Convolutional U-Net for joint deconvolution and denoising without clean targets.

DetailsMotivation: Ultrasound interpretation suffers from multiplicative speckle, acquisition blur from PSF, and scanner/operator artifacts. Supervised methods need clean targets or known degradations, which are rarely available in practice.

Method: Self-supervised framework with Swin Convolutional U-Net trained using physics-guided degradation model. From training frames, extract rotated/cropped patches, synthesize inputs by convolving with Gaussian PSF surrogate and injecting noise via spatial additive Gaussian or complex Fourier-domain perturbations. Clean-like targets obtained via non-local low-rank denoising for US scans.

Result: Achieves highest PSNR/SSIM across Gaussian and speckle noise levels on UDIAT B, JNU-IFM, XPIE Set-P datasets, with margins widening under stronger corruption. Preserves extra ~1-4dB PSNR and 0.05-0.15 SSIM in heavy Gaussian noise, and ~2-5dB PSNR and 0.05-0.20 SSIM under severe speckle compared to baselines. Shows resolution recovery without edge erosion and boosts segmentation Dice scores.

Conclusion: The approach offers practical, assumption-light path to robust ultrasound enhancement that generalizes across datasets, scanners, and degradation types, serving as effective plug-and-play preprocessor.

Abstract: Ultrasound (US) interpretation is hampered by multiplicative speckle, acquisition blur from the point-spread function (PSF), and scanner- and operator-dependent artifacts. Supervised enhancement methods assume access to clean targets or known degradations, conditions that are rarely met in practice. We present a blind, self-supervised enhancement framework that jointly deconvolves and denoises B-mode images using a Swin Convolutional U-Net trained with a physics-guided degradation model. From each training frame, we extract rotated/cropped patches and synthesize inputs by (i) convolving with a Gaussian PSF surrogate and (ii) injecting noise via either spatial additive Gaussian noise or complex Fourier-domain perturbations that emulate phase/magnitude distortions. For US scans, clean-like targets are obtained via non-local low-rank (NLLR) denoising, removing the need for ground truth; for natural images, the originals serve as targets. Trained and validated on UDIAT B, JNU-IFM, and XPIE Set-P, and evaluated additionally on a 700-image PSFHS test set, the method achieves the highest PSNR/SSIM across Gaussian and speckle noise levels, with margins that widen under stronger corruption. Relative to MSANN, Restormer, and DnCNN, it typically preserves an extra ~1–4 dB PSNR and 0.05–0.15 SSIM in heavy Gaussian noise, and ~2–5 dB PSNR and 0.05–0.20 SSIM under severe speckle. Controlled PSF studies show reduced FWHM and higher peak gradients, evidence of resolution recovery without edge erosion. Used as a plug-and-play preprocessor, it consistently boosts Dice for fetal head and pubic symphysis segmentation. Overall, the approach offers a practical, assumption-light path to robust US enhancement that generalizes across datasets, scanners, and degradation types.
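
The physics-guided degradation used to synthesize training inputs is straightforward to sketch: blur a patch with a Gaussian PSF surrogate and inject either spatial additive Gaussian noise or a complex Fourier-domain perturbation. The parameter values below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_degraded(patch, psf_sigma=1.5, noise_std=0.05, mode="spatial", rng=None):
    """Build a degraded training input from a clean-like B-mode patch."""
    if rng is None:
        rng = np.random.default_rng()
    blurred = gaussian_filter(patch, sigma=psf_sigma)      # Gaussian PSF surrogate
    if mode == "spatial":
        return blurred + rng.normal(0.0, noise_std, blurred.shape)
    # Fourier-domain magnitude/phase perturbation
    spec = np.fft.fft2(blurred)
    spec *= 1.0 + rng.normal(0.0, noise_std, spec.shape)          # magnitude jitter
    spec *= np.exp(1j * rng.normal(0.0, noise_std, spec.shape))   # phase jitter
    return np.real(np.fft.ifft2(spec))
```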

[810] Wrapper-Aware Rate-Distortion Optimization in Feature Coding for Machines

Samuel Fernández-Menduiña, Hyomin Choi, Fabien Racapé, Eduardo Pavez, Antonio Ortega

Main category: eess.IV

TL;DR: FCM compression method improves bit allocation by making traditional codecs wrapper-aware, bridging performance gap between HEVC and VVC codecs with minimal overhead.

DetailsMotivation: Current FCM methods use traditional codecs with wrappers but the codecs' rate-distortion optimization doesn't consider post-processing wrappers, wasting bits on information that gets discarded later.

Method: Proposes wrapper-aware RDO using weighted SSE metric, with temporal weight reuse across groups of pictures and fixed architecture/task-dependent weights trained offline.

Result: HEVC implementation matches VVC-based FCM state-of-the-art under MPEG test conditions, effectively bridging a codec generation gap with minimal runtime overhead.

Conclusion: Wrapper-aware bit allocation significantly improves FCM compression efficiency, making HEVC competitive with next-generation VVC codecs for split-inference applications.

Abstract: Feature coding for machines (FCM) is a lossy compression paradigm for split-inference. The transmitter encodes the outputs of the first part of a neural network before sending them to the receiver for completing the inference. Practical FCM methods "sandwich" a traditional codec between pre- and post-processing neural networks, called wrappers, to make features easier to compress using video codecs. Since traditional codecs are non-differentiable, the wrappers are trained using a proxy codec, which is later replaced by a standard codec after training. These codecs perform rate-distortion optimization (RDO) based on the sum of squared errors (SSE). Because the RDO does not consider the post-processing wrapper, the inner codec can invest bits in preserving information that the post-processing later discards. In this paper, we modify the bit-allocation in the inner codec via a wrapper-aware weighted SSE metric. To make wrapper-aware RDO (WA-RDO) practical for FCM, we propose: 1) temporal reuse of weights across a group of pictures and 2) fixed, architecture- and task-dependent weights trained offline. Under MPEG test conditions, our methods implemented on HEVC match the VVC-based FCM state-of-the-art, effectively bridging a codec generation gap with minimal runtime overhead relative to SSE-RDO HEVC.
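
The bit-allocation change amounts to swapping the plain SSE distortion inside the RDO cost for an importance-weighted SSE; a minimal sketch, where the weights `w` stand in for whatever the offline or temporally reused training produces:

```python
import numpy as np

def weighted_sse(x, x_hat, w):
    """Wrapper-aware distortion: down-weight feature positions the
    post-processing wrapper is insensitive to, so the inner codec stops
    spending bits on them."""
    return float(np.sum(w * (x - x_hat) ** 2))

# RDO cost of a coding choice m:  J(m) = weighted_sse(x, reconstruct(m), w) + lmbda * rate(m)
```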

[811] Scale-Equivariant Imaging: Self-Supervised Learning for Image Super-Resolution and Deblurring

Jérémy Scanvic, Mike Davies, Patrice Abry, Julián Tachella

Main category: eess.IV

TL;DR: Self-supervised scale-equivariant imaging method for inverse problems like super-resolution and deblurring, leveraging scale invariance to recover high-frequency information from low-frequency measurements.

DetailsMotivation: Self-supervised methods work well for many imaging inverse problems but fail in super-resolution and deblurring where translation/rotation invariance is insufficient. Need to leverage scale invariance to recover high-frequency information from low-frequency measurements.

Method: Proposes scale-equivariant imaging approach that exploits approximate scale invariance in image distributions. Uses scale-equivariant neural networks to learn from measurements containing only low-frequency information, enabling recovery of lost high-frequency details.

Result: Outperforms other self-supervised approaches and achieves performance on par with fully supervised learning on real datasets for super-resolution and deblurring tasks.

Conclusion: Scale-equivariant imaging is an effective self-supervised approach for inverse problems requiring high-frequency recovery, overcoming limitations of translation/rotation invariance methods.

Abstract: Self-supervised methods have recently proved to be nearly as effective as supervised ones in various imaging inverse problems, paving the way for learning-based approaches in scientific and medical imaging applications where ground truth data is hard or expensive to obtain. These methods critically rely on invariance to translations and/or rotations of the image distribution to learn from incomplete measurement data alone. However, existing approaches fail to obtain competitive performance in the problems of image super-resolution and deblurring, which play a key role in most imaging systems. In this work, we show that invariance to roto-translations is insufficient to learn from measurements that only contain low-frequency information. Instead, we propose scale-equivariant imaging, a new self-supervised approach that leverages the fact that many image distributions are approximately scale-invariant, enabling the recovery of high-frequency information lost in the measurement process. We demonstrate through a series of experiments on real datasets that the proposed method outperforms other self-supervised approaches, and obtains performance on par with fully supervised learning.
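
In the spirit of the equivariant-imaging family of objectives, the training signal can be sketched as a measurement-consistency term plus an equivariance term under rescaling; the exact loss in the paper may differ, and the operators below (`f`, `A`, the bilinear rescaling, the scale factor) are assumptions.

```python
import torch
import torch.nn.functional as F

def scale_equivariant_loss(f, A, y, scale=0.75):
    """Self-supervised objective using rescaling as the transformation.

    f : reconstruction network, A : known forward operator (e.g. blur + downsample),
    y : measurements; x_hat is assumed to be a 4D (N, C, H, W) tensor.
    """
    x_hat = f(y)
    mc = F.mse_loss(A(x_hat), y)                                   # measurement consistency
    x_s = F.interpolate(x_hat, scale_factor=scale,
                        mode="bilinear", align_corners=False)      # rescaled "virtual" image
    eq = F.mse_loss(f(A(x_s)), x_s)                                # equivariance to scaling
    return mc + eq
```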

[812] MindGrab for BrainChop: Fast and Accurate Skull Stripping for Command Line and Browser

Armina Fani, Mike Doan, Isabelle Le, Alex Fedorov, Malte Hoffmann, Chris Rorden, Sergey Plis

Main category: eess.IV

TL;DR: MindGrab is a lightweight convolutional model for volumetric skull stripping in neuroimaging that achieves state-of-the-art performance with 40x speedups and minimal resource requirements, enabling in-browser execution.

DetailsMotivation: Deployment complexity and specialized hardware requirements hinder adoption of deep learning models in neuroimaging. There's a need for accessible, efficient skull stripping tools that work across imaging modalities without sacrificing accuracy.

Method: Developed a fully convolutional model using spectral interpretation of dilated convolutions for volumetric skull stripping. Designed from first principles to be lightweight with minimal memory demands.

Result: Achieved state-of-the-art performance with a mean Dice score of 95.9 (SD 1.6) across datasets and modalities, with 40-fold speedups and substantially lower memory requirements compared to established methods.

Conclusion: MindGrab removes traditional deployment barriers in neuroimaging analysis, making state-of-the-art skull stripping broadly accessible through both command-line tools and zero-installation web applications.

Abstract: Deployment complexity and specialized hardware requirements hinder the adoption of deep learning models in neuroimaging. We present MindGrab, a lightweight, fully convolutional model for volumetric skull stripping across all imaging modalities. MindGrab’s architecture is designed from first principles using a spectral interpretation of dilated convolutions, and demonstrates state-of-the-art performance (mean Dice score across datasets and modalities: 95.9 with SD 1.6), with up to 40-fold speedups and substantially lower memory demands compared to established methods. Its minimal footprint allows for fast, full-volume processing in resource-constrained environments, including direct in-browser execution. MindGrab is delivered via the BrainChop platform as both a simple command-line tool (pip install brainchop) and a zero-installation web application (brainchop.org). By removing traditional deployment barriers without sacrificing accuracy, MindGrab makes state-of-the-art neuroimaging analysis broadly accessible.
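
To make the architectural ingredient concrete, below is a minimal PyTorch sketch of a dilated 3D convolution stack, which grows the receptive field over a volume without pooling or striding. The channel widths, dilation schedule, and 1x1x1 output head are assumptions for illustration and do not reproduce MindGrab's actual design.

```python
# Minimal sketch of a lightweight dilated 3D conv model for voxel-wise masking
# (illustrative assumptions, not MindGrab's architecture).
import torch
import torch.nn as nn

class DilatedBlock3D(nn.Module):
    """Stacks 3D convs with growing dilation to cover a large receptive field cheaply."""
    def __init__(self, in_ch=1, width=16, dilations=(1, 2, 4, 8)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:
            layers += [nn.Conv3d(ch, width, kernel_size=3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
            ch = width
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv3d(width, 1, kernel_size=1)   # per-voxel brain/non-brain logit

    def forward(self, x):
        return self.head(self.body(x))

# One forward pass on a toy volume; a mask would be sigmoid(logits) > 0.5.
model = DilatedBlock3D()
logits = model(torch.rand(1, 1, 32, 32, 32))
```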

[813] OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models

Ningyong Wu, Jiangbo Zhang, Wenhong Zhao, Jinzhi Wang, Chenzhan Yu, Zhigang Xiu, Duwei Dai, Ziyu Xu, Yongli Yang

Main category: eess.IV

TL;DR: OrthoInsight is a multimodal deep learning framework for automated rib fracture diagnosis and report generation from CT scans, combining YOLOv9 for fracture detection, medical knowledge graphs for clinical context, and fine-tuned LLaVA for report generation.

DetailsMotivation: The growing volume of medical imaging data creates a need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures detected via CT scans. Manual interpretation is time-consuming and error-prone, necessitating AI-assisted solutions.

Method: Multi-modal framework integrating: 1) YOLOv9 model for fracture detection in CT images, 2) medical knowledge graph for retrieving clinical context, and 3) fine-tuned LLaVA language model for generating diagnostic reports. Combines visual features from CT images with expert textual data.

Result: Evaluated on 28,675 annotated CT images and expert reports, OrthoInsight achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3.

Conclusion: Demonstrates potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists through automated diagnostic tools.

Abstract: The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.
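
To show the three-stage structure, the sketch below wires detection, knowledge retrieval, and report generation into one pipeline using hypothetical placeholder functions. None of the bodies reflect the real YOLOv9, knowledge-graph, or LLaVA interfaces; only the data flow between the stages is the point.

```python
# Minimal sketch of the detection -> knowledge retrieval -> report generation flow.
# Every function body is a hypothetical placeholder, not the OrthoInsight code.
from dataclasses import dataclass

@dataclass
class Fracture:
    rib: int
    kind: str
    confidence: float

def detect_fractures(ct_slice) -> list[Fracture]:
    """Placeholder for the YOLOv9 fracture detector."""
    return [Fracture(rib=5, kind="displaced", confidence=0.91)]

def retrieve_context(fractures: list[Fracture]) -> str:
    """Placeholder for a knowledge-graph lookup keyed on the detected findings."""
    return "Displaced rib fractures carry elevated risk of pleural injury."

def generate_report(fractures: list[Fracture], context: str) -> str:
    """Placeholder for the fine-tuned LLaVA report generator."""
    findings = "; ".join(f"rib {f.rib}: {f.kind} ({f.confidence:.0%})" for f in fractures)
    return f"Findings: {findings}. Context: {context}"

fx = detect_fractures(ct_slice=None)            # stand-in input
report = generate_report(fx, retrieve_context(fx))
```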

[814] Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition

Jay Park, Hong Nguyen, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan

Main category: eess.IV

TL;DR: The paper investigates compact representations of vocal tract dynamics from real-time MRI videos for phoneme recognition, comparing raw video, optical flow, and linguistically-relevant ROI features, finding that multi-feature combinations (especially ROI + raw video) achieve best performance.

DetailsMotivation: Real-time MRI provides comprehensive visualization of speech articulation but produces high-dimensional, noisy signals that are difficult to interpret. The research aims to find effective compact representations of spatiotemporal articulatory dynamics for phoneme recognition.

Method: Compare three feature types from midsagittal vocal tract rtMRI videos: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. Evaluate models trained independently on each representation and multi-feature combinations.

Result: Multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video features. Temporal fidelity experiments show reliance on fine-grained articulatory dynamics, and ROI ablation studies reveal strong contributions from tongue and lip movements.

Conclusion: rtMRI-derived features provide both accuracy and interpretability for speech processing, and the study establishes effective strategies for leveraging articulatory data, highlighting the importance of multi-feature approaches combining structural and motion information.

Abstract: Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.
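
A minimal PyTorch sketch of the kind of multi-feature late fusion the results point to: per-frame ROI trajectories and raw-video embeddings are encoded separately and concatenated before a shared phoneme classifier. The encoder types, feature dimensions, and phoneme inventory size are assumptions for illustration, not the paper's model.

```python
# Minimal sketch of ROI + raw-video feature fusion for frame-level phoneme logits
# (illustrative assumptions only).
import torch
import torch.nn as nn

class FusionPhonemeClassifier(nn.Module):
    def __init__(self, roi_dim=6, video_dim=64, hidden=128, n_phonemes=40):
        super().__init__()
        self.roi_enc = nn.GRU(roi_dim, hidden, batch_first=True)    # six ROI signals per frame
        self.vid_enc = nn.GRU(video_dim, hidden, batch_first=True)  # frame-level video embedding
        self.classifier = nn.Linear(2 * hidden, n_phonemes)         # fused features -> logits

    def forward(self, roi_seq, vid_seq):
        roi_out, _ = self.roi_enc(roi_seq)
        vid_out, _ = self.vid_enc(vid_seq)
        fused = torch.cat([roi_out, vid_out], dim=-1)               # per-frame late fusion
        return self.classifier(fused)                               # (batch, time, phonemes)

model = FusionPhonemeClassifier()
logits = model(torch.rand(2, 50, 6), torch.rand(2, 50, 64))
```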

[815] Leveraging Second-Order Curvature for Efficient Learned Image Compression: Theory and Empirical Evidence

Yichi Zhang, Fengqing Zhu

Main category: eess.IV

TL;DR: SOAP, a second-order quasi-Newton optimizer, dramatically improves training efficiency and final performance for learned image compression models by resolving gradient conflicts in the rate-distortion trade-off.

DetailsMotivation: Standard first-order optimizers (SGD, Adam) struggle with gradient conflicts arising from competing rate and distortion objectives in learned image compression, leading to slow convergence and suboptimal performance.

Method: Utilizes a second-order quasi-Newton optimizer (SOAP) as a drop-in replacement for first-order optimizers. Newton preconditioning inherently resolves intra-step and inter-step update conflicts in the rate-distortion objective.

Result: SOAP dramatically improves both training efficiency and final performance across diverse LICs. Second-order trained models exhibit significantly fewer activation and latent outliers, substantially enhancing robustness to post-training quantization.

Conclusion: Second-order optimization is a powerful, practical tool for advancing the efficiency and real-world readiness of learned image compression models, achievable as a seamless drop-in replacement for standard optimizers.

Abstract: Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with gradient conflicts arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that a simple utilization of a second-order quasi-Newton optimizer, SOAP, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond acceleration, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers. This substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement of the imported optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.
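
To illustrate the "drop-in replacement" claim, here is a toy rate-distortion training loop in PyTorch in which the optimizer is the only component one would swap. The tiny autoencoder, rate proxy, and lambda value are placeholder assumptions, and the SOAP optimizer's own API is deliberately not shown; the point is that it would replace `torch.optim.Adam` with no other changes to the loop.

```python
# Minimal sketch of an R-D training loop where only the optimizer is swapped
# (illustrative assumptions; not a real learned image codec).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Stand-in for a learned image codec: encoder -> latent y -> decoder."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 8, 4, stride=2, padding=1)
        self.dec = nn.ConvTranspose2d(8, 3, 4, stride=2, padding=1)

    def forward(self, x):
        y = self.enc(x)
        return self.dec(y), y

model = TinyAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # <- swap in SOAP here
lam = 0.01
for _ in range(3):                                          # toy training steps
    x = torch.rand(4, 3, 64, 64)
    x_hat, y = model(x)
    distortion = nn.functional.mse_loss(x_hat, x)
    rate_proxy = y.abs().mean()         # crude stand-in for a bit-rate estimate
    loss = rate_proxy + lam * distortion  # the competing R and D terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```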

Last updated: 2026-02-13